A Joint Neural Model for Information Extraction with Global Features

Most existing joint neural models for Information Extraction (IE) use local task-specific classifiers to predict labels for individual instances (e.g., trigger, relation) regardless of their interactions. For example, a victim of a DIE event is likely to be a victim of an ATTACK event in the same sentence. In order to capture such cross-subtask and cross-instance inter-dependencies, we propose a joint neural framework, OneIE, that aims to extract the globally optimal IE result as a graph from an input sentence. OneIE performs end-to-end IE in four stages: (1) encoding a given sentence as contextualized word representations; (2) identifying entity mentions and event triggers as nodes; (3) computing label scores for all nodes and their pairwise links using local classifiers; (4) searching for the globally optimal graph with a beam decoder. At the decoding stage, we incorporate global features to capture the cross-subtask and cross-instance interactions. Experiments show that adding global features improves the performance of our model and achieves new state-of-the-art results on all subtasks. In addition, as OneIE does not use any language-specific feature, we show that it can be easily applied to new languages or trained in a multilingual manner.


Introduction
Information Extraction (IE) aims to extract structured information from unstructured texts. It is a complex task comprised of a wide range of subtasks, such as named, nominal, and pronominal mention extraction, entity linking, entity coreference resolution, relation extraction, event extraction, and event coreference resolution. Early efforts typically perform IE in a pipelined fashion, which leads to the error propagation problem and disallows interactions among components in the pipeline. As a solution, some researchers propose joint inference and joint modeling methods to improve local predictions (Roth and Yih, 2004; Sil and Yates, 2013; Li et al., 2014; Durrett and Klein, 2014; Miwa and Sasaki, 2014; Lu and Roth, 2015; Yang and Mitchell, 2016; Kirschnick et al., 2016). Due to the success of deep learning, neural models have been widely applied to various IE subtasks (Collobert et al., 2011; Chiu and Nichols, 2016; Chen et al., 2015; Lin et al., 2016). More recently, some efforts have revisited global inference approaches by designing neural networks with embedding features to jointly model multiple subtasks. However, these methods use separate local task-specific classifiers in the final layer and do not explicitly model the inter-dependencies among tasks and instances. Figure 1 shows a real example where the local argument role classifier predicts a redundant PERSON edge. The model should be able to avoid such mistakes if it is capable of learning and leveraging the fact that it is unusual for an ELECT event to have multiple PERSON arguments.

To address this issue, we propose a joint neural framework, ONEIE (http://blender.cs.illinois.edu/software/oneie), to perform end-to-end IE with global constraints. As Figure 2 shows, instead of predicting separate knowledge elements using local classifiers, ONEIE aims to extract a globally optimal information network for the input sentence. When comparing candidate information networks during the decoding process, we not only consider individual label scores for each knowledge element, but also evaluate cross-subtask and cross-instance interactions in the network. In this example, a graph with the INJURE-VICTIM-ORG structure (the VICTIM of an INJURE event is an ORG entity) is demoted. Experiments show that our framework achieves comparable or better results compared to the state-of-the-art end-to-end architecture.
To the best of our knowledge, ONEIE is the first end-to-end neural IE framework that explicitly models cross-subtask and cross-instance interdependencies and predicts the result as a unified graph instead of isolated knowledge elements. Because ONEIE does not rely on language-specific features, it can be rapidly applied to new languages. Furthermore, global features in our framework are highly explainable and can be explicitly analyzed.

Task
Given a sentence, our ONEIE framework aims to extract an information network representation (Li et al., 2014), where entity mentions and event triggers are represented as nodes, and relations and event-argument links are represented as edges. In other words, we perform entity, relation, and event extraction within a unified framework. In this section, we elaborate on these tasks and the terminology involved.
Entity Extraction aims to identify entity mentions in text and classify them into pre-defined entity types. A mention can be a name, nominal, or pronoun. For example, "Kashmir region" should be recognized as a location (LOC) named entity mention in Figure 2.
Relation Extraction is the task of assigning a relation type to an ordered pair of entity mentions. For example, there is a PART-WHOLE relation between "Kashmir region" and "India".
Event Extraction entails identifying event triggers (the words or phrases that most clearly express event occurrences) and their arguments (the words or phrases for participants in those events) in unstructured texts and classifying these phrases, respectively, for their types and roles. An argument can be an entity, time expression, or value (e.g., MONEY, JOB-TITLE, CRIME). For example, in Figure 2, the word "injured" triggers an INJURE event and "300" is the VICTIM argument.
We formulate the task of extracting information networks as follows. Given an input sentence, our goal is to predict a graph G = (V, E), where V and E are the node and edge sets respectively. Each node v_i = ⟨a_i, b_i, l_i⟩ ∈ V represents an entity mention or event trigger, where a_i and b_i are the start and end word indices and l_i is the node type label. Each edge e_ij = ⟨i, j, l_ij⟩ ∈ E is represented similarly, where i and j denote the indices of the involved nodes. For example, in Figure 2, the trigger "injured" is represented as ⟨7, 7, INJURE⟩, the entity mention "Kashmir region" is represented as ⟨10, 11, LOC⟩, and the event-argument edge between them is ⟨2, 3, PLACE⟩.
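To make this graph representation concrete, the sketch below (our own illustration, not the authors' code) encodes nodes, edges, and graphs as simple Python structures; class and field names are illustrative assumptions, and edge indices here point into the local node list rather than the full Figure 2 graph.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    start: int   # index of the first word of the span (a_i)
    end: int     # index of the last word of the span (b_i)
    label: str   # entity or event type (l_i), e.g., "LOC" or "INJURE"

@dataclass
class Edge:
    head: int    # index of the first involved node (i)
    tail: int    # index of the second involved node (j)
    label: str   # relation type or argument role (l_ij), e.g., "PLACE"

@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)
    edges: List[Edge] = field(default_factory=list)

# The Figure 2 example: an INJURE trigger, a LOC entity, and a PLACE edge between them.
graph = Graph(
    nodes=[Node(7, 7, "INJURE"), Node(10, 11, "LOC")],
    edges=[Edge(0, 1, "PLACE")],
)
```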

Approach
As Figure 2 illustrates, our ONEIE framework extracts the information network from a given sentence in four steps: encoding, identification, classification, and decoding. We encode the input sentence using a pre-trained BERT encoder (Devlin et al., 2019) and identify entity mentions and event triggers in the sentence. After that, we compute the type label scores for all nodes and pairwise edges among them. During decoding, we explore possible information networks for the input sentence using beam search and return the one with the highest global score.

Encoding
Given an input sentence of L words, we obtain the contextualized representation x_i for each word using a pre-trained BERT encoder. If a word is split into multiple word pieces (e.g., Mondrian → Mon, ##dr, ##ian), we use the average of all piece vectors as its word representation. While previous methods typically use the output of the last layer of BERT, our preliminary study shows that enriching word representations with the output of the third-to-last layer of BERT can substantially improve performance on most subtasks.
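As an illustration of this encoding step, here is a hedged sketch using the HuggingFace transformers library; concatenating the last and third-to-last hidden layers is one plausible reading of "enriching" the representations, and the model name is an assumption.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)

words = ["Mondrian", "was", "injured"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# hidden_states: tuple of (num_layers + 1) tensors of shape [1, num_pieces, hidden]
last = out.hidden_states[-1]          # output of the last layer
third_last = out.hidden_states[-3]    # output of the third-to-last layer
pieces = torch.cat([last, third_last], dim=-1)  # one possible way to combine them (assumption)

# Average the vectors of all word pieces belonging to the same word.
word_ids = enc.word_ids(0)            # maps each piece to its word index (None for special tokens)
word_reprs = []
for w in range(len(words)):
    idx = [i for i, wid in enumerate(word_ids) if wid == w]
    word_reprs.append(pieces[0, idx].mean(dim=0))
word_reprs = torch.stack(word_reprs)  # [num_words, 2 * hidden]
```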

Identification
At this stage, we identify entity mentions and event triggers in the sentence, which will act as nodes in the information network. We use a feed-forward network FFN to compute a score vector ŷ_i = FFN(x_i) for each word, where each value in ŷ_i represents the score for a tag in a target tag set. After that, we use a conditional random field (CRF) layer to capture the dependencies between predicted tags (e.g., an I-PER tag should not follow a B-GPE tag). Similar to Chiu and Nichols (2016), we calculate the score of a tag path ẑ = {ẑ_1, ..., ẑ_L} as

s(X, ẑ) = Σ_{i=1}^{L} ŷ_{i,ẑ_i} + Σ_{i=1}^{L+1} A_{ẑ_{i-1},ẑ_i},

where X = {x_1, ..., x_L} is the sequence of contextualized representations of the input, ŷ_{i,ẑ_i} is the ẑ_i-th component of the score vector ŷ_i, and A_{ẑ_{i-1},ẑ_i} is the (ẑ_{i-1}, ẑ_i) entry in matrix A that indicates the transition score from tag ẑ_{i-1} to tag ẑ_i. The weights in A are learned during training. We append two special tags <start> and <end> to the tag path as ẑ_0 and ẑ_{L+1} to denote the start and end of the sequence. At the training stage, we maximize the log-likelihood of the gold-standard tag path z as

log p(z|X) = s(X, z) − log Σ_{z̃∈Z} e^{s(X, z̃)},

where Z is the set of all possible tag paths for a given sentence. Thus, we define the identification loss as L^I = −log p(z|X).
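The path score defined above can be sketched as follows, assuming the emission scores y and the transition matrix A are already available as tensors; the tag indices chosen for <start> and <end> are illustrative.

```python
import torch

def path_score(y: torch.Tensor, A: torch.Tensor, tags: list,
               start_idx: int, end_idx: int) -> torch.Tensor:
    """Score of a tag path: sum of emission scores plus transition scores.

    y:    [L, num_tags] emission scores (FFN output for each word)
    A:    [num_tags, num_tags] transition scores, A[p, q] = score of tag p -> tag q
    tags: list of L tag indices (the path z_1 ... z_L)
    """
    padded = [start_idx] + tags + [end_idx]          # z_0 = <start>, z_{L+1} = <end>
    emission = sum(y[i, t] for i, t in enumerate(tags))
    transition = sum(A[padded[i - 1], padded[i]] for i in range(1, len(padded)))
    return emission + transition

# Toy usage with random scores, 5 regular tags plus 2 special tags.
L, num_tags = 4, 7
y = torch.randn(L, num_tags)
A = torch.randn(num_tags, num_tags)
print(path_score(y, A, tags=[1, 2, 2, 0], start_idx=5, end_idx=6))
```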
In our implementation, we use separate taggers to extract entity mentions and event triggers. Note that we do not use types predicted by the taggers. Instead, we make a joint decision for all knowledge elements at the decoding stage to prevent error propagation and utilize their interactions to improve the prediction of node type.

Classification
We represent each identified node v_i by averaging the representations of its words. After that, we use separate task-specific feed-forward networks to calculate label scores for each node as ŷ_i^t = FFN_t(v_i), where t indicates a specific task. To obtain the label score vector for the edge between the i-th and j-th nodes, we concatenate their span representations and calculate the vector as ŷ_k^t = FFN_t(v_i, v_j). For each task, the training objective is to minimize the following cross-entropy loss

L^t = −(1/N^t) Σ_{i=1}^{N^t} y_i^t · log ŷ_i^t,

where y_i^t is the true label vector and N^t is the number of instances for task t.
If we ignore the inter-dependencies between nodes and edges, we can simply predict the label with the highest score for each knowledge element and thus generate the locally best graph Ĝ. The score of Ĝ can be calculated as

s(Ĝ) = Σ_{t∈T} Σ_{i=1}^{N^t} max ŷ_i^t,

where T is the set of tasks. We refer to s(Ĝ) as the local score of Ĝ.
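A minimal sketch of the task-specific classifiers and the local score, with illustrative hidden size and label-set sizes (the actual networks and type inventories may differ).

```python
import torch
import torch.nn as nn

hidden = 768  # per-node representation size (assumption)

# Separate task-specific scoring heads (e.g., entity types and relation types).
node_ffn = nn.Linear(hidden, 8)        # e.g., 7 entity types + "O"
edge_ffn = nn.Linear(2 * hidden, 7)    # e.g., 6 relation types + "no relation"

# Suppose two identified nodes, each represented by the average of its word vectors.
v1, v2 = torch.randn(hidden), torch.randn(hidden)

node_scores = [node_ffn(v1), node_ffn(v2)]             # label score vectors per node
edge_scores = [edge_ffn(torch.cat([v1, v2], dim=-1))]  # score vector for the (v1, v2) edge

# Local score of the locally best graph: sum of the highest label score
# of every node and every edge, across tasks.
local_score = sum(s.max() for s in node_scores) + sum(s.max() for s in edge_scores)
print(local_score)
```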
Table 1: Global feature templates by category.

Role
1. The number of entities that act as <role_i> and <role_j> arguments at the same time.
3. The number of occurrences of the <event type_i>, <role_j>, and <entity type_k> combination.
4. The number of events that have multiple <role_i> arguments.
5. The number of entities that act as a <role_i> argument of an <event type_j> event and a <role_k> argument of an <event type_l> event at the same time.

Relation
6. The number of occurrences of the <entity type_i>, <entity type_j>, and <relation type_k> combination.
7. The number of occurrences of the <entity type_i> and <relation type_j> combination.
8. The number of occurrences of a <relation type_i> relation between a <role_j> argument and a <role_k> argument of the same event.
9. The number of entities that have a <relation type_i> relation with multiple entities.
10. The number of entities involved in <relation type_i> and <relation type_j> relations simultaneously.

Trigger
11. Whether a graph contains more than one <event type_i> event.
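To illustrate how a template in Table 1 is instantiated into concrete features, the sketch below implements one instantiation of template 11 (whether a graph contains more than one event of a given type); the function name and the simplified graph representation are our own choices.

```python
from typing import List

def more_than_one_event_of_type(event_types: List[str], event_type: str) -> int:
    """Template 11 instantiated for one specific event type:
    returns 1 if the graph contains more than one event of that type, else 0."""
    return int(sum(t == event_type for t in event_types) > 1)

# One instantiation is generated per event type in the ontology; the counting
# templates (e.g., #4 or #9) are filled in the same way for every role / relation type.
event_types_in_graph = ["ATTACK", "DIE", "ATTACK"]
print(more_than_one_event_of_type(event_types_in_graph, "ATTACK"))  # 1
print(more_than_one_event_of_type(event_types_in_graph, "DIE"))     # 0
```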

Global Features
A limitation of local classifiers is that they are incapable of capturing inter-dependencies between knowledge elements in an information network. We consider two types of inter-dependencies in our framework.
The first type of inter-dependency is cross-subtask interaction between entities, relations, and events. Consider the following sentence: "A civilian aid worker from San Francisco was killed in an attack in Afghanistan." A local classifier may predict "San Francisco" as a VICTIM argument because an entity mention preceding "was killed" is usually the victim, despite the fact that a GPE is unlikely to be a VICTIM. To impose such constraints, we design a global feature as shown in Figure 3(a) to evaluate whether the structure DIE-VICTIM-GPE exists in a candidate graph.
Another type of inter-dependency is cross-instance interaction between multiple event and/or relation instances in the sentence. Take the following sentence as an example: "South Carolina boy, 9, dies during hunting trip after his father accidentally shot him on Thanksgiving Day." It can be challenging for a local classifier to predict "boy" as the VICTIM of the ATTACK event triggered by "shot" due to the long distance between these two words. However, as shown in Figure 3(b), if an entity is the VICTIM of a DIE event, it is also likely to be the VICTIM of an ATTACK event in the same sentence.
Motivated by these observations, we design a set of global feature templates (event schemas), listed in Table 1, to capture cross-subtask and cross-instance interactions; the model fills in all possible types to generate features and learns the weight of each feature during training. Given a graph G, we represent its global feature vector as

f(G) = {f_1(G), ..., f_M(G)},

where M is the number of global features and f_i(·) is a function that evaluates a certain feature and returns a scalar. For example, one instantiated feature returns 1 if the graph contains more than one event of the same type and 0 otherwise. Next, ONEIE learns a weight vector u ∈ R^M and calculates the global feature score of G as the dot product of f(G) and u. We define the global score of G as the sum of its local score and global feature score, namely

s_g(G) = s(G) + u · f(G).

We make the assumption that the gold-standard graph for a sentence should achieve the highest global score. Therefore, we minimize the following loss function

L^G = s_g(Ĝ) − s_g(G),

where Ĝ is the graph predicted by local classifiers and G is the gold-standard graph.
Finally, we optimize the following joint objective function during training:

L = L^I + Σ_{t∈T} L^t + L^G.
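The global feature score, global score, and the margin-style loss above can be sketched as follows, with toy local scores and feature vectors; the full objective would additionally include the identification and classification losses as in the joint objective.

```python
import torch

M = 4                                     # number of global features (toy value)
u = torch.zeros(M, requires_grad=True)    # learned global feature weights

def global_score(local_score: torch.Tensor, f_G: torch.Tensor) -> torch.Tensor:
    """Global score of a graph = local score + dot product of its feature vector with u."""
    return local_score + u @ f_G

# Toy feature vectors for the locally best graph (predicted) and the gold graph.
f_pred = torch.tensor([1.0, 0.0, 2.0, 0.0])
f_gold = torch.tensor([0.0, 0.0, 1.0, 0.0])
s_pred = torch.tensor(5.3)   # local score of the predicted graph (from the classifiers)
s_gold = torch.tensor(5.1)   # local score of the gold-standard graph

# The gold graph should achieve the highest global score, so we minimize the
# gap between the predicted graph's global score and the gold graph's global score.
loss_g = global_score(s_pred, f_pred) - global_score(s_gold, f_gold)
loss_g.backward()            # gradients flow into the feature weights u
print(u.grad)                # here: f_pred - f_gold
```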

Decoding
As we have discussed, because local classifiers ignore interactions among elements in an information network, they may predict contradictory results or fail to predict difficult edges that require information from other elements. In order to address these issues, ONEIE makes a joint decision for all nodes and their pairwise edges to obtain the globally optimal graph. The basic idea is to calculate the global score for each candidate graph and select the one with the highest score. However, exhaustive search is infeasible in many cases as the size of the search space grows exponentially with the number of nodes. Therefore, we design a beam search-based decoder as Figure 4 depicts.
Given a set of identified nodes V and the label scores for all nodes and their pairwise links, we perform decoding with an initial beam set B = {K_0}, where K_0 is an order-zero graph. At each step i, we expand each candidate in B in a node step and an edge step as follows.
Node step: We select v_i ∈ V and define its candidate set as

V_i = {⟨a_i, b_i, l_i^(k)⟩ | 1 ≤ k ≤ β_v},

where l_i^(k) denotes the label with the k-th highest local score for v_i, and β_v is a hyper-parameter that controls the number of candidate labels to consider. We update the beam set by

B ← {G + v | G ∈ B, v ∈ V_i},

where G + v denotes the graph obtained by adding node v to G.

Edge step: We iteratively select a previous node v_j ∈ V, j < i, and add possible edges between v_j and v_i. Note that if v_i is a trigger, we skip v_j if it is also a trigger. At each iteration, we construct a candidate edge set as

E_ij = {⟨i, j, l_ij^(k)⟩ | 1 ≤ k ≤ β_e},

where l_ij^(k) is the label with the k-th highest score for e_ij and β_e is a threshold for the number of candidate labels. Next, we update the beam set by

B ← {G + e | G ∈ B, e ∈ E_ij}.

At the end of each edge step, if |B| is larger than the beam width θ, we rank all candidates by global score in descending order and keep only the top θ.
After the last step, we return the graph with the highest global score as the information network for the input sentence.
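A simplified sketch of this beam decoder appears below. For brevity, candidates are ranked here by the sum of local label scores only, whereas the full model ranks them by the global score (local score plus weighted global features), and the trigger-trigger edge skipping is omitted.

```python
from typing import Dict, List, Tuple

def decode(node_scores: List[List[Tuple[str, float]]],
           edge_scores: Dict[Tuple[int, int], List[Tuple[str, float]]],
           beta_v: int = 2, beta_e: int = 2, theta: int = 10):
    """Beam search over candidate graphs.

    node_scores[i]     : (label, score) pairs for node i, sorted by descending score
    edge_scores[(j, i)]: (label, score) pairs for the edge between nodes j and i (j < i)
    Returns the highest-scoring candidate as (node_labels, edge_labels, score).
    """
    beams = [([], {}, 0.0)]                                   # K_0: the order-zero graph
    for i, candidates in enumerate(node_scores):
        # Node step: expand each beam with the top beta_v labels for node i.
        beams = [(nl + [lab], el, sc + s)
                 for nl, el, sc in beams
                 for lab, s in candidates[:beta_v]]
        # Edge step: add possible edges between every previous node j and node i.
        for j in range(i):
            beams = [(nl, {**el, (j, i): lab}, sc + s)
                     for nl, el, sc in beams
                     for lab, s in edge_scores[(j, i)][:beta_e]]
            # Prune to the beam width theta by score.
            beams = sorted(beams, key=lambda b: b[2], reverse=True)[:theta]
    return max(beams, key=lambda b: b[2])

# Toy example with two nodes and one candidate edge.
nodes = [[("PER", 2.0), ("ORG", 1.5)], [("ATTACK", 1.8), ("DIE", 1.7)]]
edges = {(0, 1): [("VICTIM", 0.9), ("NO-EDGE", 0.4)]}
print(decode(nodes, edges))
```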

Data
We perform our experiments on the Automatic Content Extraction (ACE) 2005 dataset, which provides entity, value, time, relation, and event annotations for English, Chinese, and Arabic. Following the pre-processing of Wadden et al. (2019), we conduct experiments on two datasets: ACE05-R, which includes named entity and relation annotations, and ACE05-E, which includes entity, relation, and event annotations. We keep 7 entity types, 6 coarse-grained relation types, 33 event types, and 22 argument roles.
In order to reinstate some important elements absent from ACE05-R and ACE05-E, we create a new dataset, ACE05-E+, by adding back the order of relation arguments, pronouns, and multi-token event triggers, which have been largely ignored in previous work. We also skip lines before the <text> tag (e.g., headline, datetime) as they are not annotated.
In addition to ACE, we derive another dataset, ERE-EN, from the Entities, Relations and Events (ERE) annotation task created under the Deep Exploration and Filtering of Text (DEFT) program, because it covers more recent articles. Specifically, we extract 458 documents and 16,516 sentences from three ERE datasets: LDC2015E29, LDC2015E68, and LDC2015E78. For ERE-EN, we keep 7 entity types, 5 relation types, 38 event types, and 20 argument roles.
To evaluate the portability of our model, we also develop a Chinese dataset from ACE2005 and a Spanish dataset from ERE (LDC2015E107). We refer to these datasets as ACE05-CN and ERE-ES respectively.

Experimental Setup
We optimize our model with BertAdam for 80 epochs with a learning rate of 5e-5 and weight decay of 1e-5 for BERT, and a learning rate of 1e-3 and weight decay of 1e-3 for other parameters. We use the bert-base-multilingual-cased model (https://huggingface.co/transformers/pretrained_models.html). For global features, we set β_v and β_e to 2 and set θ to 10. In our experiments, we use random seeds and report scores averaged across runs. We use the same criteria as Zhang et al. (2019) for evaluation, as follows.
• Entity: An entity mention is correct if its offsets and type match a reference entity.
• Relation: A relation is correct if its relation type is correct and the offsets of the related entity mentions are correct.
• Trigger: A trigger is correctly identified (Trig-I) if its offsets match a reference trigger. It is correctly classified (Trig-C) if its event type also matches the reference trigger.
• Argument: An argument is correctly identified (Arg-I) if its offsets and event type match a reference argument mention. It is correctly classified (Arg-C) if its role label also matches the reference argument mention.
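To make the identification vs. classification criteria concrete, here is a minimal matching sketch (our own illustration); for arguments, the identification check would additionally compare the event type, as stated above.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    start: int
    end: int
    label: str   # entity type, event type, or argument role

def correct_identification(pred: Mention, gold: Mention) -> bool:
    # Identification only checks the span offsets.
    return (pred.start, pred.end) == (gold.start, gold.end)

def correct_classification(pred: Mention, gold: Mention) -> bool:
    # Classification additionally requires the label to match.
    return correct_identification(pred, gold) and pred.label == gold.label

print(correct_classification(Mention(7, 7, "INJURE"), Mention(7, 7, "INJURE")))  # True
print(correct_classification(Mention(7, 7, "ATTACK"), Mention(7, 7, "INJURE")))  # False
```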

Overall Performance
In Table 3, we compare our results with two models: (1) DYGIE++ (Wadden et al., 2019), the state-of-the-art end-to-end IE model that utilizes multi-sentence BERT encodings and span graph propagation; (2) BASELINE, which follows the architecture of ONEIE but only uses the output of the last layer of BERT and local classifiers. We can see that our model consistently outperforms DYGIE++ and BASELINE on ACE05-R and ACE05-E.
Wadden et al. (2019) show that combining triggers predicted by a four-model ensemble optimized for trigger detection can improve the performance of event extraction. While we also report our results using a four-model ensemble in Table 4 for fair comparison, we hold the opinion that the single-model scores in Table 3 better reflect the actual performance of ONEIE and should be used for future comparison. Table 5 shows the performance of ONEIE on two new datasets, ACE05-E+ and ERE-EN.
In Table 6 we list salient global features learned by the model. Take feature #9 as an example: if a candidate graph contains multiple ORG-AFF edges incident to the same node, the model demotes this graph by adding a negative value to its global score. We also observe that the weights of about 9% of the global features are barely updated, which indicates that they rarely occur in either gold-standard or predicted graphs. In Table 8, we perform qualitative analysis on concrete examples.

Porting to Another Language
As Table 7 shows, we evaluate the proposed framework on ACE05-CN and ERE-ES. The results show that ONEIE works well on Chinese and Spanish data without any special design for the new language. We also observe that adding English training data can improve the performance on Chinese and Spanish.

Remaining Challenges
We have analyzed 75 of the remaining errors; Figure 5 presents the distribution of error types, which will require richer features and knowledge acquisition to address in the future. In this section, we discuss the main categories with examples.

Need background knowledge. Most current IE methods ignore external knowledge such as entity attributes and scenario models. For example, our model mislabels the entity type of "Kommersant"; this error can be corrected based on the first sentence of its Wikipedia page: "Kommersant is a nationally distributed daily newspaper published in Russia mostly devoted to politics and business".

Rare words. The second challenge is the famous long-tail problem: many triggers, entity mentions (e.g., "caretaker", "Gazeta.ru"), and contextual phrases in the test data rarely appear in the training data. While most event triggers are verbs or nouns, some adverbs and multi-word expressions can also serve as triggers.

Multiple types per trigger. Some trigger words may indicate both the procedure and the resulting status of an action. For example, "named" may indicate both NOMINATE and START-POSITION events; "killed" and "eliminate" may indicate both ATTACK and DIE events. In these cases the human ground truth usually only annotates the procedure types, whereas our system produces the resultant event types.
Need syntactic structure. Our model may benefit from deeper syntactic analysis. For example, in the following sentence "As well as previously holding senior positions at Barclays Bank, BZW and Kleinwort Benson, McCarthy was formerly a top civil servant at the Department of Trade and Industry", our model misses all of the employers "Barclays Bank", "BZW" and "Kleinwort Benson" for "McCarthy" probably because they appear in a previous sub-sentence.
Uncertain events and metaphors. Our model mistakenly labels some future planned events as specific events because it lacks tense prediction and metaphor recognition. For example, the START-ORG event triggered by "formation" has not happened in the following sentence: "The statement did not give any reason for the move, but said Lahoud would begin consultations Wednesday aimed at the formation of a new government". Our model also mistakenly identifies "camp" as a facility and a DIE event triggered by "dying" in the following sentence: "Russia hints 'peace camp' alliance with Germany and France is dying by Dmitry Zaks".
The IE community lacks newer data sets with end-to-end annotations. Unfortunately, the annotation quality of the ACE data set is not perfect due to some long-term debates on the annotation guideline; e.g., should "government" be tagged as a GPE or an ORG? Should "dead" be both an entity and an event trigger? Should a designator word be considered part of the entity mention or not?

Related Work
Previous work (Roth and Yih, 2004; Li et al., 2011) encodes inter-dependencies among knowledge elements as global constraints in an integer linear programming framework to effectively remove extraction errors. Such integrity verification can be used to find knowledge elements that violate the constraints and to identify possible instances of detector errors or failures. Inspired by these previous efforts, we propose a joint neural framework with global features whose weights are learned during training. Similar to the method of Li et al. (2014), ONEIE uses global features to capture cross-subtask and cross-instance inter-dependencies, but our features are language-independent and do not rely on other NLP tools such as dependency parsers. Our methods also differ in local features, optimization methods, and decoding procedures.
Some recent efforts develop joint neural models that perform two IE subtasks, such as entity and relation extraction (Zheng et al., 2017; Katiyar and Cardie, 2017; Bekoulis et al., 2018; Fu et al., 2019; Sun et al., 2019) and joint event and temporal relation extraction (Han et al., 2019). Wadden et al. (2019) design a joint model to extract entities, relations, and events based on BERT and dynamic span graphs. Our framework extends this line of work by incorporating global features based on cross-subtask and cross-instance constraints. Unlike the span-based mention extraction of Wadden et al. (2019), we adopt a CRF-based tagger in our framework because it can extract mentions of any length and is not restricted by a maximum span width.

Conclusions and Future Work
We propose a joint end-to-end IE framework that incorporates global features to capture the inter-dependencies between knowledge elements. Experiments show that our framework achieves better or comparable performance compared to the state of the art and demonstrate the effectiveness of global features. Our framework is also shown to be language-independent, can be applied to other languages, and benefits from multilingual training.
In the future, we plan to incorporate more comprehensive event schemas that are automatically induced from multilingual multimedia data and external knowledge to further improve the quality of IE. We also plan to extend our framework to more IE subtasks such as document-level entity coreference resolution and event coreference resolution.