Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction

Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle soaring amounts of documents from emerging applications, such as finance, legislation, health, etc., where event arguments always scatter across different sentences, and even multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and codes can be found at https://github.com/dolphin-zs/Doc2EDAG.


Introduction
Event extraction (EE), traditionally modeled as detecting trigger words and extracting corresponding arguments from plain text, plays a vital role in natural language processing since it can produce valuable structured information to facilitate a variety of tasks, such as knowledge base construction, question answering, language understanding, etc.
In recent years, with the rising trend of digitalization within various domains, such as finance, legislation, health, etc., EE has become an increasingly important accelerator to the development of business in those domains. Take the financial domain as an example: continuous economic growth has brought exploding volumes of digital financial documents, such as financial announcements in a specific stock market, as Figure 1 shows, specified as Chinese financial announcements (ChFinAnn). While forming a gold mine, such large amounts of announcements call for EE to assist people in extracting valuable structured information to sense emerging risks and find profitable opportunities in a timely manner.
Figure 1: The rapid growth of event-related announcements considered in this paper (x-axis: year; y-axis: number of announcements, ×1000).
* This work was done during the internship of Shun Zheng at Microsoft Research Asia, Beijing, China.
Despite the necessity of applying EE to the financial domain, the specific characteristics of financial documents, as well as those within many other business fields, raise two critical challenges to EE: arguments-scattering and multi-event. Specifically, the first challenge indicates that arguments of one event record may scatter across multiple sentences of the document, while the second reflects that a document is likely to contain multiple such event records. To intuitively illustrate these challenges, we show a typical ChFinAnn document with two Equity Pledge event records in Figure 2. For the first event, the entity "[SHARE1]" is the correct Pledged Shares at the sentence level (ID 5). However, due to the capital stock increment (ID 7),
Although a great number of efforts (Ahn, 2006; Ji and Grishman, 2008; Liao and Grishman, 2010; Hong et al., 2011; Riedel and McCallum, 2011; Li et al., 2013, 2014; Chen et al., 2015; Yang and Mitchell, 2016; Nguyen et al., 2016; Sha et al., 2018; Zhang and Ji, 2018; Nguyen and Nguyen, 2019; Wang et al., 2019) have been put into EE, most of them are based on ACE 2005, an expert-annotated benchmark, which only tagged event arguments within the sentence scope. We refer to such a task as sentence-level EE (SEE), which overlooks the arguments-scattering challenge. In contrast, EE on financial documents, such as ChFinAnn, requires document-level EE (DEE) when facing arguments-scattering, and this challenge gets much harder when coupled with multi-event.
The most recent work, DCFEE, attempted to explore DEE on ChFinAnn by employing distant supervision (DS) (Mintz et al., 2009) to generate EE data and performing a two-stage extraction: 1) a sequence tagging model for SEE, and 2) a key-event-sentence detection model to detect the key-event sentence, coupled with a heuristic strategy that padded missing arguments from surrounding sentences, for DEE. However, the sequence tagging model for SEE cannot handle multi-event sentences elegantly, and even worse, the context-agnostic arguments-completion strategy fails to address the arguments-scattering challenge effectively.
In this paper, we propose a novel end-to-end model, Doc2EDAG, to address the unique challenges of DEE. The key idea of Doc2EDAG is to transform the event table into an entity-based directed acyclic graph (EDAG). The EDAG format can transform the hard table-filling task into several sequential path-expanding sub-tasks that are more tractable. To support the EDAG generation efficiently, Doc2EDAG encodes entities with document-level contexts and designs a memory mechanism for path expanding. Moreover, to ease the DS-based document-level event labeling, we propose a novel DEE formalization that removes the trigger-words labeling and regards DEE as directly filling event tables based on a document. This no-trigger-words design does not rely on any predefined trigger-words set or heuristic to filter multiple trigger candidates, and still perfectly matches the ultimate goal of DEE, mapping a document to underlying event tables.
To evaluate the effectiveness of our proposed Doc2EDAG, we conduct experiments on a real-world dataset consisting of a large number of financial announcements. In contrast to the dataset used by DCFEE, where 97% of documents contained just one event record, our data collection is ten times larger, and about 30% of its documents include multiple event records. Extensive experiments demonstrate that Doc2EDAG can significantly outperform state-of-the-art methods when facing DEE-specific challenges.
In summary, our contributions include:
• We propose a novel model, Doc2EDAG, which can directly generate event tables based on a document, to address unique challenges of DEE effectively.
• We reformalize a DEE task without trigger words to ease the DS-based document-level event labeling.
• We build a large-scale real-world dataset for DEE with the unique challenges of arguments-scattering and multi-event, the extensive experiments on which demonstrate the superiority of Doc2EDAG.
Note that though we focus on ChFinAnn data in this work, we tackle those DEE-specific challenges without any domain-specific assumption. Therefore, our general labeling and modeling strategies can directly benefit many other business domains with similar challenges, such as criminal facts and judgments extraction from legal documents, disease symptoms and doctor instructions identification from medical reports, etc.

Related Work
Recent developments in information extraction have been advancing toward joint models that can extract entities and identify structures (relations or events) among them simultaneously. For instance, Ren et al. (2017), Zheng et al. (2017), Zeng et al. (2018a), and Wang et al. (2018) focused on jointly extracting entities and inter-entity relations. Meanwhile, in line with the focus of this paper, a few studies aimed at designing joint models for entity and event extraction, such as handcrafted-feature-based (Li et al., 2014; Yang and Mitchell, 2016; Judea and Strube, 2016) and neural-network-based (Zhang and Ji, 2018; Nguyen and Nguyen, 2019) models. Nevertheless, these models did not present how to handle argument candidates beyond the sentence scope. Yang and Mitchell (2016) claimed to handle event-argument relations across sentences with the prerequisite of well-defined features, which, unfortunately, is nontrivial.
In addition to the modeling challenge, another big obstacle for democratizing EE is the lack of training data due to the enormous cost of obtaining expert annotations. To address this problem, some studies attempted to adapt distant supervision (DS) to the EE setting, since DS has shown promising results by employing knowledge bases to automatically generate training data for relation extraction (Mintz et al., 2009). However, the vanilla EE required trigger words that were absent from factual knowledge bases. Therefore, prior studies employed either linguistic resources or predefined dictionaries for trigger-words labeling. On the other hand, another recent work (Zeng et al., 2018b) showed that directly labeling event arguments without trigger words was also feasible. However, they only considered the SEE setting, and their methods cannot be directly extended to the DEE setting, which is the main focus of this work.
Traditionally, when applying DS to relation extraction, researchers put huge efforts into alleviating labeling noise (Riedel et al., 2010; Lin et al., 2016; Zheng et al., 2019). In contrast, this work shows that combining DS with some simple constraints can obtain good labeling quality for DEE, for two reasons: 1) both the knowledge base and text documents are from the same domain; 2) an event record usually contains multiple arguments, while a common relational fact only covers two entities.

Preliminaries
We first clarify several key notions: 1) entity mention: an entity mention is a text span that refers to an entity object; 2) event role: an event role corresponds to a predefined field of the event table; 3) event argument: an event argument is an entity that plays a specific event role; 4) event record: an event record corresponds to an entry of the event table and contains several arguments with required roles. For example, Figure 2 shows two event records, where the entity "[PER]" is an event argument with the Pledger role.
To better elaborate and evaluate our proposed approach, we leverage the ChFinAnn data in this paper. ChFinAnn documents contain firsthand official disclosures of listed companies in the Chinese stock market and have hundreds of types, such as annual reports and earnings estimates. In this work, we focus on the event-related ones that are frequent, influential, and mainly expressed in natural language.

Document-level Event Labeling
As a prerequisite to DEE, we first conduct the DS-based event labeling at the document level. More specifically, we map tabular records from an event knowledge base to document text and regard well-matched records as events expressed by that document. Moreover, we adopt a no-trigger-words design and reformalize a novel DEE task accordingly to enable end-to-end model designs.
Event Labeling. To ensure the labeling quality, we set two constraints for matched records: 1) arguments of predefined key event roles must exist (non-key ones can be empty) and 2) the number of matched arguments should be higher than a certain threshold. Configurations of these constraints are event-specific, and in practice, we can tune them to directly ensure the labeling quality at the document level. We regard records that meet these two constraints as the well-matched ones, which serve as distantly supervised ground truths. In addition to labeling event records, we assign roles of arguments to matched tokens as token-level entity tags. Note that we do not label trigger words explicitly. Besides not affecting the DEE functionality, an extra benefit of such no-trigger-words design is a much easier DS-based labeling that does not rely on predefined trigger-words dictionaries or manually curated heuristics to filter multiple potential trigger words.
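The two labeling constraints can be sketched as a simple filter over knowledge-base records. This is only an illustration under assumptions that are not in the paper: the function name `is_well_matched`, exact substring matching as the notion of a "matched" argument, and the example role names and threshold are all hypothetical.

```python
from typing import Dict, List, Optional


def is_well_matched(
    record: Dict[str, Optional[str]],
    document_text: str,
    key_roles: List[str],
    threshold: int,
) -> bool:
    """Check the two distant-supervision constraints for one record.

    Assumption: an argument counts as matched when its surface string
    occurs verbatim in the document text.
    """
    matched = {
        role
        for role, value in record.items()
        if value and value in document_text
    }
    # Constraint 1: arguments of all key roles must exist in the document.
    if any(role not in matched for role in key_roles):
        return False
    # Constraint 2: the number of matched arguments must exceed the threshold.
    return len(matched) > threshold


# Hypothetical Equity Pledge record and document.
doc = "Alice Wang pledged 1,000,000 shares to Bank X on 2018-01-05."
record = {
    "Pledger": "Alice Wang",
    "PledgedShares": "1,000,000 shares",
    "Pledgee": "Bank X",
    "StartDate": "2018-01-05",
    "EndDate": None,  # a non-key role may be empty
}
print(is_well_matched(record, doc, ["Pledger", "PledgedShares", "Pledgee"], 3))
```

Because both the key roles and the threshold are event-specific, they are passed in as parameters rather than fixed in the function.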
DEE Task Without Trigger Words. We reformalize a novel task for DEE as directly filling event tables based on a document, which generally requires three sub-tasks: 1) entity extraction, extracting entity mentions as argument candidates, 2) event detection, judging a document to be triggered or not for each event type, and 3) event table filling, filling arguments into the table of triggered events. This novel DEE task is much different from the vanilla SEE with trigger words but is consistent with the above simplified DS-based event labeling.

Doc2EDAG
The key idea of Doc2EDAG is to transform tabular event records into an EDAG and let the model learn to generate this EDAG based on document-level contexts. Following the example in Figure 2, Figure 3 depicts a typical EDAG generation process and Figure 4 presents the overall workflow of Doc2EDAG, which consists of two key stages: document-level entity encoding (Section 5.1) and EDAG generation (Section 5.2). Before elaborating on each of them in this section, we first describe two prerequisite modules: input representation and entity recognition.
Figure 3: An EDAG generation example that starts from event triggering and expands sequentially following the predefined order of event roles.
Input Representation. In this paper, we denote a document as a sequence of sentences. Formally, after looking up the token embedding table $V \in \mathbb{R}^{d_w \times |V|}$, we denote a document $d$ as a sentence sequence $[s_1; s_2; \cdots; s_{N_s}]$, and each sentence $s_i \in \mathbb{R}^{d_w \times N_w}$ is composed of a sequence of token embeddings $[w_{i,1}, w_{i,2}, \cdots, w_{i,N_w}]$, where $|V|$ is the vocabulary size, $N_s$ and $N_w$ are the maximum lengths of the sentence sequence and the token sequence, respectively, and $w_{i,j} \in \mathbb{R}^{d_w}$ is the embedding of the $j$-th token in the $i$-th sentence with the embedding size $d_w$.
Entity Recognition. Entity recognition is a typical sequence tagging task. We conduct this task at the sentence level and follow a classic method, BI-LSTM-CRF (Huang et al., 2015), that first encodes the token sequence and then adds a conditional random field (CRF) layer to facilitate the sequence tagging. The only difference is that we employ the Transformer (Vaswani et al., 2017) instead of the original encoder, LSTM (Hochreiter and Schmidhuber, 1997). The Transformer encodes a sequence of embeddings by the multi-headed self-attention mechanism to exchange contextual information among them. Due to the superior performance of the Transformer, we employ it as the primary context encoder in this work and name the Transformer module used in this stage Transformer-1. Formally, for each sentence tensor $s_i \in \mathbb{R}^{d_w \times N_w}$, we get the encoded one as $h_i = \text{Transformer-1}(s_i)$, where $h_i \in \mathbb{R}^{d_w \times N_w}$ shares the same embedding size $d_w$ and sequence length $N_w$. During training, we employ roles of matched arguments as entity labels with the classic BIO (Begin, Inside, Other) scheme and wrap $h_i$ with a CRF layer to get the entity-recognition loss $L_{er}$. At inference, we use Viterbi decoding to get the best tagging sequence.
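To make the BIO labeling concrete, the following minimal sketch turns matched arguments into token-level tags for one sentence. It assumes character-level tokens and exact string matching; the function name `bio_tags`, the `B-`/`I-` tag prefixes, and the English example are illustrative choices, not taken from the paper.

```python
from typing import Dict, List


def bio_tags(sentence: str, role_by_mention: Dict[str, str]) -> List[str]:
    """Assign one BIO tag per character of the sentence.

    Assumption: each mention is tagged wherever its surface string occurs;
    if mentions overlap, the one processed later overwrites earlier tags.
    """
    tags = ["O"] * len(sentence)
    for mention, role in role_by_mention.items():
        if not mention:
            continue  # skip empty arguments
        start = sentence.find(mention)
        while start != -1:
            end = start + len(mention)
            tags[start] = "B-" + role
            for i in range(start + 1, end):
                tags[i] = "I-" + role
            start = sentence.find(mention, end)
    return tags


tags = bio_tags(
    "Bob pledged 500 shares",
    {"Bob": "Pledger", "500 shares": "PledgedShares"},
)
print(tags[:4])  # ['B-Pledger', 'I-Pledger', 'I-Pledger', 'O']
```

These tags would serve as the training targets of the CRF layer; the encoder and CRF themselves are omitted here.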

Document-level Entity Encoding
To address the arguments-scattering challenge efficiently, it is indispensable to leverage global contexts to better identify whether an entity plays a specific event role. Consequently, we utilize document-level entity encoding to encode extracted entity mentions with such contexts and produce an embedding of size $d_w$ for each entity mention with a distinct surface name.
Entity & Sentence Embedding. Since an entity mention usually covers multiple tokens with a variable length, we first obtain a fixed-size embedding for each entity mention by conducting a max-pooling operation over its covered token embeddings. For example, given the $l$-th entity mention covering the $j$-th to $k$-th tokens of the $i$-th sentence, we conduct max-pooling over $[h_{i,j}, \cdots, h_{i,k}]$ to get the entity mention embedding $e_l \in \mathbb{R}^{d_w}$. For each sentence $s_i$, we also apply max-pooling over the encoded token sequence $[h_{i,1}, \cdots, h_{i,N_w}]$ to obtain a single sentence embedding $c_i \in \mathbb{R}^{d_w}$. After these operations, both the mention and the sentence embeddings share the same embedding size $d_w$.
Document-level Encoding. Though we get embeddings for all sentences and entity mentions, these embeddings only encode local contexts within the sentence scope. To enable the awareness of document-level contexts, we employ the second Transformer module, Transformer-2, to facilitate the information exchange between all entity mentions and sentences. Before feeding them into Transformer-2, we add sentence position embeddings to them to inform the sentence order. After the Transformer encoding, we utilize the max-pooling operation again to merge multiple mention embeddings with the same entity surface name into a single embedding. Formally, after this stage, we obtain document-level context-aware entity mention and sentence embeddings as $e^d = [e^d_1, \cdots, e^d_{N_e}]$ and $c^d = [c^d_1, \cdots, c^d_{N_s}]$, respectively, where $N_e$ is the number of distinct entity surface names. These aggregated embeddings serve the next stage to fill event tables directly.
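The two pooling steps described in this section can be illustrated with plain Python lists. This is a sketch under assumptions: the Transformer-2 encoding between the two steps is omitted, vectors are ordinary lists of floats, and the helper names (`max_pool`, `mention_embedding`, `merge_by_surface_name`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(dims) for dims in zip(*vectors)]


def mention_embedding(token_vectors: List[Vector], j: int, k: int) -> Vector:
    """Pool the encoded tokens j..k (inclusive) of one sentence into one vector."""
    return max_pool(token_vectors[j:k + 1])


def merge_by_surface_name(mentions: List[Tuple[str, Vector]]) -> Dict[str, Vector]:
    """Merge all mention vectors that share a surface name by max-pooling."""
    groups: Dict[str, List[Vector]] = defaultdict(list)
    for name, vector in mentions:
        groups[name].append(vector)
    return {name: max_pool(vectors) for name, vectors in groups.items()}


# Toy encoded sentence with three tokens and embedding size 2.
h_i = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
print(mention_embedding(h_i, 0, 1))  # [0.4, 0.9]

# Two mentions of entity "A" and one of entity "B".
print(merge_by_surface_name([("A", [0.4, 0.9]), ("A", [0.7, 0.1]), ("B", [0.0, 0.2])]))
```

Max-pooling keeps the output size fixed at the embedding size regardless of how many tokens or mentions are merged, which is why the same operation can be reused at both levels.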

EDAG Generation
After the document-level entity encoding stage, we can obtain the document embedding $t \in \mathbb{R}^{d_w}$ by max-pooling over the sentence tensor $c^d \in \mathbb{R}^{d_w \times N_s}$ and stack a linear classifier over $t$ to conduct the event-triggering classification for each event type. Next, for each triggered event type, we learn to generate an EDAG.
EDAG Building. Before the model training, we need to build the EDAG from tabular event records. For each event type, we first manually define an event role order. Then, we transform each event record into a linked list of arguments following this order, where each argument node is either an entity or a special empty argument NA. Finally, we merge these linked lists into an EDAG by sharing the same prefix path. Since every complete path of the EDAG corresponds to one row of the event table, recovering the table format from a given EDAG is simple.
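The building procedure can be sketched with a nested dictionary that shares common prefixes, which is one simple way to represent the merged linked lists. The representation, the function names `build_edag` and `edag_to_table`, and the placeholder entities are assumptions for illustration; the paper's own data structure may differ.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def build_edag(records: List[Record], role_order: List[str]) -> dict:
    """Merge event records into a nested dict that shares prefix paths.

    Each level corresponds to one role in `role_order`; a missing argument
    becomes the special "NA" node.
    """
    root: dict = {}
    for record in records:
        node = root
        for role in role_order:
            argument = record.get(role) or "NA"
            node = node.setdefault(argument, {})
    return root


def edag_to_table(edag: dict, role_order: List[str]) -> List[Dict[str, str]]:
    """Recover table rows: every complete path is one event record."""
    rows: List[Dict[str, str]] = []

    def walk(node: dict, depth: int, path: List[str]) -> None:
        if depth == len(role_order):
            rows.append(dict(zip(role_order, path)))
            return
        for argument, child in node.items():
            walk(child, depth + 1, path + [argument])

    walk(edag, 0, [])
    return rows


order = ["Pledger", "PledgedShares", "Pledgee"]
records = [
    {"Pledger": "[PER]", "PledgedShares": "[SHARE2]", "Pledgee": "[ORG]"},
    {"Pledger": "[PER]", "PledgedShares": "[SHARE3]", "Pledgee": "[ORG]"},
]
edag = build_edag(records, order)
print(edag)  # the two records share the "[PER]" prefix
print(edag_to_table(edag, order) == records)  # True
```

In this sketch an empty argument is recovered as the string "NA" rather than as a missing value, so the round trip is exact only for records without empty roles.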
Task Decomposition. The EDAG format aims to simplify the hard table-filling task into several tractable path-expanding sub-tasks. A natural question is how the task decomposition works, which can be answered by the following EDAG recovering procedure. Taking the event triggering as the starting node (the initial EDAG), there comes a series of path-expanding sub-tasks following a predefined event role order. When considering a certain role, for every leaf node of the current EDAG, there is a path-expanding sub-task that decides which entities to expand. For each entity to be expanded, we create a new node of that entity for the current role and expand the path by connecting the current leaf node to the new entity node. If no entity is valid for expanding, we create a special NA node. When all sub-tasks for the current role finish, we move to the next role and repeat until the last. In this work, we leverage the above logic to recover the EDAG from path-expanding predictions at inference and to set associated labels for each sub-task when training.
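The recovering procedure above can be sketched as a loop over roles in which a pluggable decision function plays the part of the path-expanding classifier. Everything named here is hypothetical: `generate_paths`, the `expand` callback signature, and `make_oracle_expander`, which derives the decisions from gold records in the way training labels could be set.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]
Expander = Callable[[List[str], str, List[str]], List[str]]


def generate_paths(entities: List[str], role_order: List[str],
                   expand: Expander) -> List[Dict[str, str]]:
    """Expand paths role by role, starting from the event-triggering node."""
    paths: List[List[str]] = [[]]  # the initial EDAG has one empty path
    for role in role_order:
        next_paths = []
        for path in paths:  # one sub-task per leaf of the current EDAG
            chosen = expand(path, role, entities) or ["NA"]
            for entity in chosen:
                next_paths.append(path + [entity])
        paths = next_paths
    return [dict(zip(role_order, path)) for path in paths]


def make_oracle_expander(gold: List[Record], role_order: List[str]) -> Expander:
    """Build an expander that answers each sub-task from gold records."""
    def expand(path: List[str], role: str, entities: List[str]) -> List[str]:
        picks: List[str] = []
        for record in gold:
            prefix = [record.get(r) or "NA" for r in role_order[:len(path)]]
            argument = record.get(role)
            if prefix == path and argument in entities and argument not in picks:
                picks.append(argument)
        return picks
    return expand


order = ["Pledger", "PledgedShares", "Pledgee"]
gold = [
    {"Pledger": "[PER]", "PledgedShares": "[SHARE2]", "Pledgee": "[ORG]"},
    {"Pledger": "[PER]", "PledgedShares": "[SHARE3]", "Pledgee": "[ORG]"},
]
entities = ["[PER]", "[SHARE2]", "[SHARE3]", "[ORG]"]
print(generate_paths(entities, order, make_oracle_expander(gold, order)) == gold)  # True
```

Replacing the oracle with a learned classifier leaves the control flow unchanged, which is the sense in which the table-filling task decomposes into independent per-leaf decisions.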
Memory. To better fulfill each path-expanding sub-task, it is crucial to know the entities already contained in the path. Hence, we design a memory mechanism that initializes a memory tensor $m$ with the sentence tensor $c^d$ at the beginning and updates $m$ when expanding the path by appending either the associated entity embedding or the zero-padded one for the NA argument. With this design, each sub-task can own a distinct memory tensor, corresponding to the unique path history.
Figure 4: The overall workflow of Doc2EDAG, where we follow the example in Figure 2 and the EDAG structure in Figure 3, and use stripes to differentiate different entities (note that the number of input tokens and entity positions are imaginary, which do not match previous ones strictly, and here we only include the first three event roles and associated entities for brevity).
Path Expanding. For each path-expanding sub-task, we formalize it as a collection of multiple binary classification problems, that is, predicting expanding (1) or not (0) for all entities. To enable the awareness of the current path state, history contexts and the current event role, we first concatenate the memory tensor $m$ and the entity tensor $e^d$, then add a trainable event-role-indicator embedding to them, and encode them with the third Transformer module, Transformer-3, to facilitate the context-aware reasoning. Finally, we extract the enriched entity tensor $e^r$ from the outputs of Transformer-3 and stack a linear classifier over $e^r$ to conduct the path-expanding classification.
Optimization. For the event-triggering classification, we calculate the cross-entropy loss $L_{tr}$. During the EDAG generation, we calculate a cross-entropy loss for each path-expanding sub-task, and sum these losses as the final EDAG-generation loss $L_{dag}$. Finally, we combine $L_{tr}$, $L_{dag}$ and the entity-recognition loss $L_{er}$ in a weighted sum as the final loss, $L_{all} = \lambda_1 L_{er} + \lambda_2 L_{tr} + \lambda_3 L_{dag}$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters.
Inference. Given a document, Doc2EDAG first recognizes entity mentions from sentences, then encodes them with document-level contexts, and finally generates an EDAG for each triggered event type by conducting a series of path-expanding sub-tasks.
Practical Tips. During training, we can utilize both ground-truth entity tokens and the given EDAG structure. At inference, however, we need to first identify entities and then expand paths sequentially based on embeddings of those entities to recover the EDAG. This gap between training and inference can cause severe error-propagation problems. To mitigate such problems, we utilize scheduled sampling (Bengio et al., 2015) to gradually switch the inputs of document-level entity encoding from ground-truth entity mentions to model-recognized ones. Moreover, for path-expanding classifications, false positives are more harmful than false negatives, because the former can cause a completely wrong path. Accordingly, we can set $\gamma$ ($> 1$) as the negative class weight of the associated cross-entropy loss.
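The effect of the negative class weight and the weighted final loss can be illustrated numerically. This is a sketch under assumptions: the path-expanding loss is written as a summed binary cross-entropy over entity probabilities, and the function names are hypothetical. The default values follow the hyper-parameter setting reported in the experiments ($\gamma = 3$, $\lambda_1 = 0.05$, $\lambda_2 = \lambda_3 = 0.95$).

```python
import math
from typing import List


def weighted_bce(probs: List[float], labels: List[int], gamma: float = 3.0) -> float:
    """Summed binary cross-entropy with negative examples scaled by gamma.

    A confident "expand" prediction on a label-0 entity (a false positive)
    is therefore penalized gamma times more than with the plain loss.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            total += -math.log(max(p, eps))
        else:
            total += -gamma * math.log(max(1.0 - p, eps))
    return total


def total_loss(l_er: float, l_tr: float, l_dag: float,
               lam1: float = 0.05, lam2: float = 0.95, lam3: float = 0.95) -> float:
    """Weighted sum of entity-recognition, triggering and EDAG losses."""
    return lam1 * l_er + lam2 * l_tr + lam3 * l_dag


# The same wrong prediction costs three times more when it is a false positive.
false_positive = weighted_bce([0.9], [0])
false_negative = weighted_bce([0.1], [1])
print(false_positive / false_negative)
```

In a deep learning framework the same weighting would normally be passed to the library's cross-entropy function as a per-class weight rather than coded by hand.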

Experiments
In this section, we present thorough empirical studies to answer the following questions: 1) to what extent can Doc2EDAG improve over state-of-the-art methods when facing DEE-specific challenges? 2) how do different models behave when facing both arguments-scattering and multi-event challenges? 3) how important are various components of Doc2EDAG?

Experimental Setup
Data Collection with Event Labeling. We utilize ten years (2008-2018) of ChFinAnn documents and human-summarized event knowledge bases to conduct the DS-based event labeling. We focus on five event types: Equity Freeze (EF), Equity Repurchase (ER), Equity Underweight (EU), Equity Overweight (EO) and Equity Pledge (EP), which belong to major events required to be disclosed by the regulator and may have a huge impact on the company value. To ensure the labeling quality, we set constraints for matched document-record pairs as Section 4 describes. Moreover, we directly use character tokenization to avoid error propagation from Chinese word segmentation tools. Finally, we obtain 32,040 documents in total, and this number is ten times larger than the 2,976 of DCFEE and about 53 times larger than the 599 of ACE 2005. We divide these documents into train, development, and test sets with the proportion of 8 : 1 : 1 based on the time order. In Table 1, we show the number of documents and the multi-event ratio (MER) for each event type on this dataset. Note that a few documents may contain multiple event types at the same time.
Data Quality. To verify the quality of DS-based event labeling, we randomly select 100 documents and manually annotate them. By regarding DS-generated event tables as the prediction and human-annotated ones as the ground truth, we evaluate the labeling quality based on the metric introduced below. Table 2 shows this approximate evaluation, and we can observe that DS-generated data are of good quality, achieving high precision and acceptable recall. In later experiments, we directly employ the automatically generated test set for evaluation due to its much broader coverage.
Evaluation Metric. The ultimate goal of DEE is to fill event tables with correct arguments for each role. Therefore, we evaluate DEE by directly comparing the predicted event table with the ground-truth one for each event type. Specifically, for each document and each event type, we pick one predicted record and one most similar ground-truth record (at least one of them is non-empty) from associated event tables without replacement to calculate event-role-specific true positive, false positive and false negative statistics until no records are left. After aggregating these statistics among all evaluated documents, we can calculate role-level precision, recall, and F1 scores (all reported in percentage format). As an event type often includes multiple roles, we calculate micro-averaged role-level scores as the final event-level metric that reflects the ability of end-to-end DEE directly.
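The matching-based metric can be approximated as follows for a single document and event type. The details are assumptions made for illustration: similarity is taken as the number of identical non-empty arguments, a wrong non-empty prediction counts as both a false positive and a false negative, and the names `count_role_stats` and `micro_prf` are hypothetical. The released code should be consulted for the exact procedure.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def count_role_stats(pred: List[Record], gold: List[Record],
                     roles: List[str]) -> Dict[str, Dict[str, int]]:
    """Pair records without replacement and count per-role TP/FP/FN."""
    stats = {role: {"tp": 0, "fp": 0, "fn": 0} for role in roles}
    remaining = list(gold)

    def similarity(p: Record, g: Record) -> int:
        return sum(1 for r in roles if p.get(r) is not None and p.get(r) == g.get(r))

    def update(p: Record, g: Record) -> None:
        for role in roles:
            pv, gv = p.get(role), g.get(role)
            if pv is not None and pv == gv:
                stats[role]["tp"] += 1
            else:
                if pv is not None:
                    stats[role]["fp"] += 1
                if gv is not None:
                    stats[role]["fn"] += 1

    for p in pred:
        if remaining:
            best = max(remaining, key=lambda g: similarity(p, g))
            remaining.remove(best)
        else:
            best = {}  # no gold record left: every predicted argument is wrong
        update(p, best)
    for g in remaining:  # unmatched gold records are entirely missed
        update({}, g)
    return stats


def micro_prf(stats: Dict[str, Dict[str, int]]) -> Tuple[float, float, float]:
    """Micro-average precision, recall and F1 over all roles."""
    tp = sum(s["tp"] for s in stats.values())
    fp = sum(s["fp"] for s in stats.values())
    fn = sum(s["fn"] for s in stats.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


roles = ["Pledger", "PledgedShares", "Pledgee"]
gold = [
    {"Pledger": "A", "PledgedShares": "100", "Pledgee": "B"},
    {"Pledger": "A", "PledgedShares": "200", "Pledgee": "B"},
]
pred = [{"Pledger": "A", "PledgedShares": "200", "Pledgee": "B"}]
print(micro_prf(count_role_stats(pred, gold, roles)))
```

In this toy case the single prediction matches the second gold record exactly, so precision is perfect while recall is halved by the missed first record, which shows how the metric penalizes models that extract only one event from a multi-event document.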
Hyper-parameter Setting. For the input, we set the maximum number of sentences and the maximum sentence length as 64 and 128, respectively. During training, we set $\lambda_1 = 0.05$, $\lambda_2 = \lambda_3 = 0.95$ and $\gamma = 3$. We employ the Adam (Kingma and Ba, 2015) optimizer with the learning rate $10^{-4}$, train for at most 100 epochs and pick the best epoch by the validation score on the development set. Besides, we leverage the decreasing order of the non-empty argument ratio as the event role order required by Doc2EDAG, because more informative entities in the path history can better facilitate later path-expanding classifications.
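The role-ordering heuristic can be sketched as a sort by how often each role is filled in the training records. The function name `role_order_by_fill_ratio` and the toy records are hypothetical, and ties are broken by the input order of `roles`, which is an assumption.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def role_order_by_fill_ratio(records: List[Record], roles: List[str]) -> List[str]:
    """Sort roles by decreasing ratio of records with a non-empty argument."""
    n = max(len(records), 1)  # avoid division by zero on an empty table

    def ratio(role: str) -> float:
        return sum(1 for r in records if r.get(role) is not None) / n

    # sorted() is stable, so roles with equal ratios keep their input order.
    return sorted(roles, key=ratio, reverse=True)


records = [
    {"Pledger": "A", "Pledgee": "B", "EndDate": "2018-06-01"},
    {"Pledger": "A", "Pledgee": "C", "EndDate": None},
    {"Pledger": "D", "Pledgee": None, "EndDate": None},
]
print(role_order_by_fill_ratio(records, ["EndDate", "Pledger", "Pledgee"]))
# ['Pledger', 'Pledgee', 'EndDate']
```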
Note that, due to the space limit, we leave other detailed hyper-parameters, model structures, data preprocessing configurations, event type specifications and pseudocode for EDAG generation to the appendix.

Performance Comparisons
Baselines. As discussed in the related work, the state-of-the-art method applicable to our setting is DCFEE. We follow the implementation described in the original DCFEE paper, which did not illustrate how to handle multi-event sentences with just a sequence tagging model. Thus, we develop two versions, DCFEE-O and DCFEE-M, where DCFEE-O only produces one event record from one key-event sentence, while DCFEE-M tries to get multiple possible argument combinations by the closest relative distance from the key-event sentence. To be fair, the SEE stages of both versions share the same neural architecture as the entity recognition part of Doc2EDAG. Besides, to verify the necessity of end-to-end modeling, we further employ a simple decoding baseline of Doc2EDAG, GreedyDec, that only fills one event table entry greedily by using recognized entity roles.
Main Results. As Table 3 shows, Doc2EDAG achieves significant improvements over all baselines.
Single-Event vs. Multi-Event. We divide the test set into a single-event set, containing documents with just one event record, and a multi-event set, containing others, to show the extreme difficulty when arguments-scattering meets multi-event. Table 4 shows F1 scores for different scenarios. Although Doc2EDAG still maintains the highest extraction performance for all cases, the multi-event set is extremely challenging, as the extraction performance of all models drops significantly. In particular, GreedyDec, with no mechanism for the multi-event challenge, decreases most drastically. DCFEE-O decreases less, but is still far behind Doc2EDAG. On the multi-event set, Doc2EDAG improves over DCFEE-O, the best baseline, by 17.7 F1 points on average.
Ablation Tests. To demonstrate key designs of Doc2EDAG, we conduct ablation tests by evaluating four variants: 1) -PathMem, removing the memory mechanism used during the EDAG generation, 2) -SchSamp, dropping the scheduled sampling strategy during training, 3) -DocEnc, removing the Transformer module used for document-level entity encoding, and 4) -NegCW, keeping the negative class weight as 1 when doing path-expanding classifications. From Table 5, we can observe that 1) the memory mechanism is of prime importance, as removing it results in the most drastic performance declines, over 10 F1 points on four event types, the exception being the ER type, whose MER is very low on the test set; 2) the scheduled sampling strategy, which alleviates the mismatch of entity candidates for event table filling between training and inference, also contributes greatly, improving by 5 F1 points on average; 3) the document-level entity encoding, which enhances global entity representations, contributes 2.1 F1 points on average; 4) the larger negative class weight, which penalizes false-positive path expanding, also makes slight but stable contributions for all event types.
Case Studies. For the example in Figure 2, Doc2EDAG can successfully recover the correct EDAG, while DCFEE inevitably makes many mistakes even with a perfect SEE model, as discussed in the introduction. Due to the space limit, we leave another three fine-grained case studies to the appendix.

Conclusion and Future Work
Towards the end-to-end modeling for DEE, we propose a novel model, Doc2EDAG, associated with a novel task formalization without trigger words to ease DS-based labeling. To validate the effectiveness of the proposed approach, we build a large-scale real-world dataset in the financial domain and conduct extensive empirical studies.
Notably, without any domain-specific assumption, our general labeling and modeling strategies can benefit practitioners in other domains directly. As this work shows promising results for the end-to-end DEE, expanding the inputs of Doc2EDAG from pure text sequences to richly formatted ones (Wu et al., 2018) is appealing, and we leave it as future work to explore.