Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint Information Extraction

The tasks of rich semantic parsing, such as Abstract Meaning Representation (AMR) parsing, share a similar goal with Information Extraction (IE): converting natural language text into structured semantic representations. To take advantage of this similarity, we propose a novel AMR-guided framework for joint information extraction that discovers entities, relations, and events with the help of a pre-trained AMR parser. Our framework consists of two novel components: 1) an AMR-based semantic graph aggregator that lets candidate entity and event trigger nodes collect neighborhood information from the AMR graph to pass messages among related knowledge elements; 2) an AMR-guided graph decoder that extracts knowledge elements in an order determined by the hierarchical structure of the AMR graph. Experiments on multiple datasets show that the AMR graph encoder and decoder provide significant gains and that our approach achieves new state-of-the-art performance on all IE subtasks.


Introduction
Information extraction (IE) aims to extract structured knowledge as an information network from unstructured natural language texts, while semantic parsing attempts to construct a semantic graph that summarizes the meaning of the input text. Since both focus on extracting the main information from a sentence, the output information networks and semantic graphs have a lot in common in terms of node and edge semantics. In the example shown in Figure 1, many knowledge elements in the information network can be perfectly matched to nodes in the semantic graph with similar semantic meanings. Moreover, these two types of graphs may also be similar with regard to network topology: the nodes that are neighbors or connected via a few hops in the semantic graph are also likely to be close to each other in the corresponding information network. In Figure 1 we can see that "Scott Peterson", which acts as a shared argument for the two event triggers "murdering" and "faces", is also directly linked to the two main predicates murder-01 and face-01 in the semantic graph. From a global perspective, an information network can be approximately considered a subgraph of the semantic parse, where the IE nodes are roughly a subset of the nodes in the semantic graph while maintaining similar inter-connections.

1 The programs are publicly available for research purposes at https://github.com/zhangzx-uiuc/AMR-IE.

Figure 1: Comparison of the AMR graph generated by a pre-trained AMR parser and the information network from IE for the same sentence from ACE05: "Scott Peterson now faces death penalty because of murdering his wife Laci and their unborn son at their house."
To further exploit such similarities for information extraction, we propose an intuitive and effective framework that utilizes information from semantic parsing to jointly extract an information network composed of entities, relations, event triggers, and event arguments. We adopt Abstract Meaning Representation (AMR) (Banarescu et al., 2013), which provides rich semantic structures with fine-grained node and edge types, as our input semantic graphs. Compared with previous IE models, our proposed model consists of the following two novel components.
AMR-Guided Graph Encoding. The AMR graph topology can directly inform the IE model of global inter-dependencies among knowledge elements, even if they are located far apart in the original sentence. This property makes it easier for the IE model to capture non-local, long-distance connections for relation extraction and event argument role labeling. We design a semantic graph aggregator based on Graph Attention Networks (GAT) (Velickovic et al., 2018) that lets the candidate entity and event trigger nodes aggregate neighborhood information from the semantic graph to pass messages among related knowledge elements. The GAT architecture used in our model is specifically designed to allow interactions between node and edge features, making it possible to effectively leverage the rich edge types in AMR.
AMR-Conditioned Graph Decoding. A large number of nodes in these two types of graphs share similar meanings, which makes it possible to obtain a meaningful node alignment between information networks and semantic graphs. Such an alignment opens the opportunity to design a more organized decoding procedure for a joint IE model. Instead of using sequential decoding as in previous models such as OneIE (Lin et al., 2020), where the types of knowledge elements are determined in a left-to-right order according to their positions in the original sentence, we propose a new hierarchical decoding method: we use the AMR parse as a condition to decide the order of decoding knowledge elements, where nodes and edges are determined in a tree-like order based on the semantic graph hierarchy.
Experimental results on multiple datasets show that our proposed model significantly outperforms the state of the art on all IE subtasks.

Problem Formulation
We focus on jointly extracting entities, relations, event triggers, and event arguments from an input sentence to form an information network. Note that the AMR graphs in our model are not required to be ground truth but are generated by a pre-trained AMR parser. Therefore, we do not incorporate additional supervision, and our problem setting is identical to that of typical joint information extraction approaches such as DyGIE++ and OneIE (Lin et al., 2020). Given an input sentence S = {w_1, w_2, ..., w_N}, we formulate our problem of joint information extraction as follows.
Entity Extraction Entity extraction aims to identify word spans as entity mentions and classify them into pre-defined entity types. Given the pre-defined set of entity types, the entity extraction task is to output a collection E of entity mentions

E = {(a_i, b_i, e_i)},

where a_i, b_i ∈ {1, 2, ..., N} denote the starting and ending indices of the extracted entity mention, and e_i represents its entity type. For example, in Figure 1, the entity mention "Scott Peterson" is represented as (0, 1, PER).

Relation Extraction
The task of relation extraction is to assign a relation type to every possible ordered pair of extracted entity mentions. Given the identified entity mentions E and the pre-defined set of relation types, the set of relations is extracted as

{(ε_i, ε_j, r_{i,j})},

where ε_i and ε_j are entity mentions from E, r_{i,j} is the relation type, and i, j ∈ {1, 2, ..., |E|}. An example relation mention is ("their", "son", PER-SOC) in Figure 1.

Event Extraction
The task of event extraction includes extracting event triggers and their arguments. Event trigger extraction is to identify the words or phrases that most clearly indicate the occurrence of an event of a certain type from an event type set T, which can be formulated as extracting a set of event triggers

{(p_i, q_i, t_i)},

where p_i, q_i ∈ {1, 2, ..., N} denote the starting and ending indices of the extracted event trigger, and t_i represents an event type in T. Given the pre-defined set of event argument roles, the task of event argument extraction is to assign each trigger-entity pair an argument role label indicating whether the entity mention plays a certain role in the event, which is formulated as extracting an argument set

{(τ_i, ε_j, l^a_{i,j})},

where τ_i and ε_j are previously extracted event triggers and entity mentions respectively, and l^a_{i,j} denotes the event argument role label.
Information Network Construction All of the extracted knowledge elements form an information network G = (V, E) (an example is shown in Figure 1). Each node v_i ∈ V is an entity mention or event trigger, and each edge e_i ∈ E indicates a relation or event argument role. Thus, our problem can be formulated as generating an information network G given an input sentence S.
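As an illustration, the information network defined above can be held in a small data structure; the sketch below is our own (the field and class names are not from the paper):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    start: int            # starting token index (inclusive)
    end: int              # ending token index (inclusive)
    label: str            # entity type or event type
    is_trigger: bool = False

@dataclass
class InfoNetwork:
    nodes: list = field(default_factory=list)
    # (node_i, node_j) -> relation type or argument role label
    edges: dict = field(default_factory=dict)

    def add_edge(self, u, v, label):
        self.edges[(u, v)] = label

# Example from Figure 1: "Scott Peterson" is the PER entity (0, 1, PER),
# acting as an argument of the trigger "murdering".
g = InfoNetwork()
scott = Node(0, 1, "PER")
murdering = Node(9, 9, "Life:Die", is_trigger=True)
g.nodes += [scott, murdering]
g.add_edge(murdering, scott, "Agent")
```

The token index of "murdering" here is illustrative; only the (start, end, label) triple format mirrors the formulation above.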

Our Approach
Given an input sentence S, we first use a pre-trained transformer-based AMR parser (Fernandez Astudillo et al., 2020) to obtain the AMR graph for S. We then use RoBERTa to encode the sentence and identify entity mentions and event triggers as candidate nodes. After that, we map each candidate node to an AMR node and perform message passing with a GAT-based semantic graph aggregator to capture global inter-dependencies between candidate nodes. All the candidate nodes and their pairwise edges are then passed through task-specific feed-forward neural networks to compute score vectors. During decoding, we use the hierarchical structure of the AMR graph as a condition to decide the beam search order and find the candidate graph with the highest global score.

AMR Parsing
We employ a transformer-based AMR parser (Fernandez Astudillo et al., 2020) pre-trained on AMR 3.0 annotations to generate an AMR graph G^a = (V^a, E^a) with an alignment between AMR nodes and word spans in an input sentence S. Each node v^a_i = (m^a_i, n^a_i) ∈ V^a represents an AMR concept or predicate, and we use m^a_i and n^a_i to denote the starting and ending indices of the node's span in the original sentence. For AMR edges, we use e^a_{i,j} to denote the specific relation type between nodes v^a_i and v^a_j in the AMR annotation.

Embeddings for AMR Relation Clusters
To reduce the risk of over-fitting on hundreds of fine-grained AMR edge types, we only consider the edge types that are most relevant to IE tasks, and manually define M = 12 clusters of AMR edge types, as shown in Table 1. Note that each ARGx relation is treated as an individual cluster, since each ARGx indicates a distinct argument role. For each edge type cluster, we randomly initialize a d_E-dimensional embedding, yielding an embedding matrix E ∈ R^{M × d_E} that is optimized during training.

2 https://catalog.ldc.upenn.edu/LDC2020T02
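A minimal sketch of this clustering-plus-embedding step might look as follows. The cluster assignment below is a toy subset (the paper's full Table 1 is not reproduced here), and the embedding is randomly initialized as described, though in the real model it would be a trainable parameter:

```python
import random

# Toy cluster assignment: each AMR edge label maps to one of M clusters.
# :ARG0-:ARG4 each form their own cluster (distinct argument roles);
# anything unlisted falls into a catch-all "Others" cluster.
CLUSTERS = {":ARG0": 0, ":ARG1": 1, ":ARG2": 2, ":ARG3": 3, ":ARG4": 4,
            ":time": 5, ":location": 6, ":mod": 7}
OTHERS = 8
M, d_E = 9, 4  # the paper uses M = 12 and d_E = 256; smaller here for the demo

random.seed(0)
# Randomly initialized embedding matrix E (M x d_E), optimized during training.
E = [[random.gauss(0.0, 0.02) for _ in range(d_E)] for _ in range(M)]

def edge_embedding(label: str):
    """Look up the cluster embedding for a fine-grained AMR edge label."""
    return E[CLUSTERS.get(label, OTHERS)]

assert len(edge_embedding(":ARG0")) == d_E
assert edge_embedding(":snt1") == E[OTHERS]  # unseen label -> Others cluster
```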

Entity and Event Trigger Identification
We first identify entity mentions and event triggers as candidate nodes in the input sentence. Similar to (Lin et al., 2020), we adopt feed-forward neural networks constrained by conditional random fields (CRFs) to identify the word spans of entity mentions and event triggers.
Contextual Encoder Given an input sentence S = {w_1, w_2, ..., w_N} of length N, we first compute the contextual word representation x_i for each word w_i using a pre-trained RoBERTa encoder. If a word is split into multiple pieces by the RoBERTa tokenizer, we take the average of the representation vectors of all its word pieces as the final word representation.
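The word-piece averaging can be sketched as below; the function name and the plain-list vectors are our own simplification (a real implementation would operate on RoBERTa's output tensors):

```python
def merge_wordpieces(piece_vectors, word_to_pieces):
    """Average the vectors of all word pieces belonging to each word.

    piece_vectors: list of subword vectors (each a list of floats).
    word_to_pieces: for each word, the indices of its word pieces.
    Returns one averaged vector per word.
    """
    merged = []
    dim = len(piece_vectors[0])
    for piece_ids in word_to_pieces:
        avg = [sum(piece_vectors[p][d] for p in piece_ids) / len(piece_ids)
               for d in range(dim)]
        merged.append(avg)
    return merged

# Suppose "Peterson" is split into two pieces; its word vector is their mean.
pieces = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
words = merge_wordpieces(pieces, [[0], [1, 2]])
assert words[1] == [4.0, 5.0]
```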
CRF-based Sequence Tagging After obtaining the contextual word representations, we use a feed-forward neural network FFN to compute a score vector ŷ_i = FFN(x_i) for each word, where each element of ŷ_i is the score for a certain tag in the tag set. The overall score for a tag path ẑ = {ẑ_1, ẑ_2, ..., ẑ_N} is calculated by

s(ẑ) = Σ_{i=1}^{N} ŷ_{i,ẑ_i} + Σ_{i=2}^{N} P_{ẑ_{i−1},ẑ_i},

where ŷ_{i,ẑ_i} is the ẑ_i-th element of the score vector ŷ_i, and P_{ẑ_{i−1},ẑ_i} denotes the transition score from tag ẑ_{i−1} to ẑ_i from a learnable transition matrix P. Similar to (Chiu and Nichols, 2016), the training objective for node identification is to maximize the log-likelihood L_I of the gold tag path z.
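The tag-path score above (emission scores plus pairwise transition scores) can be computed with a few lines of plain Python; this toy sketch uses made-up scores and omits the normalization needed for the log-likelihood:

```python
def tag_path_score(emissions, transitions, path):
    """Score of a tag path: per-token tag scores plus tag-to-tag transitions.

    emissions[i][t]: score of tag t at token i (the FFN output y-hat).
    transitions[s][t]: score of moving from tag s to tag t (matrix P).
    path: list of tag indices, one per token.
    """
    score = sum(emissions[i][t] for i, t in enumerate(path))
    score += sum(transitions[path[i - 1]][path[i]] for i in range(1, len(path)))
    return score

# Two tokens, two tags (0 = O, 1 = B-PER), arbitrary scores.
emissions = [[0.1, 0.9], [0.8, 0.2]]
transitions = [[0.0, -0.5], [0.3, 0.0]]
assert abs(tag_path_score(emissions, transitions, [1, 0]) - (0.9 + 0.8 + 0.3)) < 1e-9
```

Decoding would pick the highest-scoring path with Viterbi rather than enumerating paths, but the scoring function is the same.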
We use separate CRF-based taggers for entity and event trigger extraction. Note that we do not use the specific node types predicted by the CRF taggers as the final classification results for entities and triggers; we only keep the identified entity and trigger spans. The final types of entities and triggers are jointly decided with relation and argument extraction in the subsequent decoding step. Specifically, this step yields the collections of entity spans {(a_i, b_i)} and trigger spans {(p_i, q_i)}, where a_i, b_i, p_i, q_i denote the starting and ending indices of the word spans.

Semantic Graph Aggregator
To make the best use of the shared semantic and topological features from the AMR parse of the input sentence, we design a semantic graph aggregator, which enables the candidate entity and event trigger nodes to aggregate information from their neighbors based on the AMR topology.
Initial Node Representation Each entity node, trigger node, or AMR node is initialized with a vector representation h^0_i by averaging the word representations of all the words in its span. For example, given an entity node (a_i, b_i), its representation vector is calculated by

h^0_i = (1 / (b_i − a_i + 1)) Σ_{k=a_i}^{b_i} x_k,

where x_k is the word representation from the RoBERTa encoder.
Node Alignment We first align each identified entity node and trigger node to one of the AMR nodes before conducting message passing. Take an entity node with span (a_i, b_i) as an example. We consider b_i as the index of the head word of the entity node, and aim to find the AMR node (m^a_{i*}, n^a_{i*}) that covers b_i as the matched AMR node for (a_i, b_i), i.e., a node satisfying m^a_{i*} ≤ b_i ≤ n^a_{i*}. If no node can be matched to (a_i, b_i) in this way, we instead search for the nearest AMR node, i.e., the node (m^a_{i*}, n^a_{i*}) whose span has the shortest distance to the entity span (a_i, b_i). We conduct alignment for event trigger nodes in the same way.
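A sketch of this two-step alignment rule follows. The exact distance measure for the fallback case is not spelled out in the text, so the span-center distance used below is one plausible choice, not necessarily the paper's:

```python
def align_to_amr(span, amr_spans):
    """Align an identified entity/trigger span to an AMR node.

    span: (a, b) token indices of the identified node; b is the head word.
    amr_spans: list of (m, n) spans of AMR nodes.
    Step 1: return an AMR node whose span covers the head word b.
    Step 2 (fallback): return the nearest AMR node by span-center distance
    (an assumption; the paper only says "shortest distance").
    """
    a, b = span
    for idx, (m, n) in enumerate(amr_spans):
        if m <= b <= n:
            return idx
    return min(range(len(amr_spans)),
               key=lambda i: abs((amr_spans[i][0] + amr_spans[i][1]) / 2
                                 - (a + b) / 2))

assert align_to_amr((0, 1), [(0, 1), (3, 3)]) == 0   # head word covered
assert align_to_amr((5, 6), [(0, 1), (3, 3)]) == 1   # fall back to nearest
```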

Updated Representations
Heterogeneous Graph Construction After obtaining the matched or nearest AMR node for each identified entity mention and event trigger, we construct a heterogeneous graph with initialized node and edge features as follows. Given an AMR graph G^a = (V^a, E^a), we consider three cases when initializing the feature vector of each node v^a_i: • Node v^a_i has been matched to an entity mention or event trigger. We take the representation vector of the matched node (instead of that of v^a_i) as the initial feature vector.
• Node v^a_i is not matched to any identified node but is labeled as the nearest node for an entity mention or event trigger, e.g., (a_i, b_i). We add a new node to the AMR topology carrying the representation vector of (a_i, b_i), and link it to v^a_i with the edge type Others defined in Table 1.
• Node v^a_i is neither matched to nor the nearest node for any entity or trigger. We use its own node representation as the initial feature vector.
For each edge e^a_{i,j}, we first map it to an AMR relation cluster according to Table 1 and then look up its representation e_{i,j} in the embedding matrix E. We use h^0_i to denote the initial feature of each node. An illustration of this step is shown in Figure 2.
Attention-Based Message Passing Inspired by Graph Attention Networks (GATs) (Velickovic et al., 2018), we design an L-layer attention-based message passing mechanism on the AMR graph topology to enable the entity and trigger nodes to aggregate neighborhood information. For node i in layer l, we first calculate an attention score for each neighbor j ∈ N_i based on the node features h^l_i, h^l_j and the edge features e^l_{i,j}:

α_{i,j} = softmax_j( σ( f^l([W h^l_i ; W h^l_j ; W_e e^l_{i,j}]) ) ),

where W, W_e are trainable parameters, and f^l and σ(·) are a single-layer feed-forward neural network and the LeakyReLU activation function, respectively. The neighborhood information h*_i is then calculated as the attention-weighted sum of neighbor features:

h*_i = Σ_{j ∈ N_i} α_{i,j} W h^l_j.

The updated node feature is a combination of the original node feature and its neighborhood information,

h^{l+1}_i = (1 − γ) h^l_i + γ W* h*_i,

where γ controls the level of message passing between neighbors, and W* is a trainable linear transformation.
We select the entity and trigger nodes from the graph and take their feature vectors h^L_i from the final layer as representations that have aggregated information from the AMR graph (as Figure 2 illustrates). We use h^e_i and h^t_i to denote the features of each entity and trigger, respectively.
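One message-passing step of this kind can be sketched in plain Python. This is a simplified stand-in, not the paper's implementation: the scoring function is passed in as a black box (standing in for the learned f^l, W, W_e), and the residual combination h' = (1 − γ)h + γ·h* is one plausible reading of the update described above:

```python
import math

def gat_layer(h, neighbors, edge_feats, score_fn, gamma=0.001):
    """One attention-based message-passing step over an AMR-shaped graph.

    h: node features {node_id: list of floats}.
    neighbors: adjacency {node_id: [neighbor ids]}.
    edge_feats: {(i, j): edge feature vector}.
    score_fn(h_i, h_j, e_ij) -> raw attention score (stands in for the
    learned transformation; a toy assumption, not the paper's exact form).
    """
    new_h = {}
    for i, nbrs in neighbors.items():
        if not nbrs:
            new_h[i] = h[i][:]       # isolated node keeps its own feature
            continue
        raw = [score_fn(h[i], h[j], edge_feats[(i, j)]) for j in nbrs]
        mx = max(raw)                # stable softmax over the neighborhood
        exp = [math.exp(r - mx) for r in raw]
        total = sum(exp)
        alpha = [e / total for e in exp]
        # Neighborhood info h*: attention-weighted sum of neighbor features.
        h_star = [sum(a * h[j][d] for a, j in zip(alpha, nbrs))
                  for d in range(len(h[i]))]
        # Residual mix controlled by gamma (kept small in the paper).
        new_h[i] = [(1 - gamma) * h[i][d] + gamma * h_star[d]
                    for d in range(len(h[i]))]
    return new_h

# Uniform scores -> the neighborhood info is just the mean of the neighbors.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0]}
neighbors = {0: [1, 2], 1: [], 2: []}
ef = {(0, 1): [0.0], (0, 2): [0.0]}
out = gat_layer(h, neighbors, ef, lambda hi, hj, e: 0.0, gamma=0.5)
assert out[0] == [1.0, 0.75]
```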

Model Training and Decoding
In this subsection, we describe how we jointly decode the output information network given the identified entity and trigger nodes with their aggregated features h^e_i and h^t_i. We design a hierarchical decoding method that incorporates the AMR hierarchy as a condition to decide a more organized order for decoding knowledge elements.

Maximizing Scores with Global Features
Similar to OneIE (Lin et al., 2020), we use task-specific feed-forward neural networks to map each node or node pair to a score vector. Specifically, we calculate four types of score vectors s^e_i, s^t_i, s^r_{i,j}, and s^a_{i,j} for the entity, trigger, relation, and argument role extraction tasks, respectively, where the dimension of each score vector equals the number of classes in the corresponding task.
Therefore, the total local score c(G) is the sum of the scores of all predicted nodes and edges over the four tasks. We inherit the approach of using global features from OneIE (Lin et al., 2020) to encourage the model to capture more global interactions. The global score g(G) for an information network G is defined as the sum of the local score c(G) and the contribution of the global feature vector f_G:

g(G) = c(G) + u^T f_G,

where u is a trainable parameter vector. The global feature vector f_G consists of binary values indicating whether the output graph exhibits certain inter-dependencies among knowledge elements (e.g., an attacker is likely to also be a person being arrested). We use global feature categories identical to those of (Lin et al., 2020) during training, and the overall training objective is to maximize the identification log-likelihood and the local score c(G) while minimizing the gap in global score between the ground-truth information network G and the predicted network Ĝ.
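The global scoring step itself is just a dot product added to the local score; a minimal sketch (with made-up feature values, and `u` fixed rather than trained):

```python
def global_score(local_score, u, f_G):
    """g(G) = c(G) + u . f_G, where f_G is a binary global-feature vector."""
    return local_score + sum(ui * fi for ui, fi in zip(u, f_G))

# Two toy global features, e.g. "an attacker is also an arrest target" and
# "an entity fills two conflicting roles"; the second is absent here.
u = [0.5, -1.0]
f_G = [1, 0]
assert global_score(3.0, u, f_G) == 3.5
```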
Hierarchical Ordered Decoding Given the output score vectors for all nodes and their pairwise edges, the most straightforward approach is to output the information network G with the highest global score g(G). Due to the use of global features, searching through all possible information networks would incur exponential complexity, so we adopt a beam search procedure similar to that of (Lin et al., 2020). Compared with OneIE (Lin et al., 2020), we incorporate the AMR hierarchy to decide a more organized decoding order instead of a simple left-to-right order based on word positions in the original sentence. Specifically, given the nodes and their alignments with AMR, we sort the nodes according to the positions of their aligned AMR nodes in a top-down manner; that is, the node whose aligned AMR node is nearest to the AMR root is decoded first. We illustrate the decoding order with an example in Figure 3. We use U = {v_1, v_2, ..., v_K} to denote the sorted identified trigger and entity nodes. Similar to (Lin et al., 2020), we add these nodes step by step from v_1 to v_K; in each step, we obtain all possible subgraphs by enumerating the types of the new node and of its pairwise edges with the existing nodes. We keep only the top θ subgraphs in each step as candidate graphs to avoid exponential complexity, and finally select the graph with the highest global score g(G) at step K as the output.

Figure 3: An illustration of ordered decoding, where τ_1 and τ_2 are identified triggers and each ε_{i,j} is an identified entity. In this example, the order of beam search decoding is: τ_1, τ_2, ε_{1,1}, ε_{2,1}, ε_{1,2}, ε_{2,2}, ε_{2,3}.
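The top-down ordering can be computed by ranking each identified node by the depth of its aligned AMR node below the root. The sketch below uses a breadth-first traversal; the tie-breaking by name is our own deterministic choice, since the text does not specify one:

```python
from collections import deque

def decoding_order(root, children, node_to_amr):
    """Order identified nodes by the depth of their aligned AMR node.

    root: AMR root id; children: adjacency {amr_id: [child ids]};
    node_to_amr: {identified node name: aligned AMR id}.
    Nodes aligned nearer the root are decoded first (ties broken by name,
    an arbitrary but deterministic assumption).
    """
    depth, queue = {root: 0}, deque([root])
    while queue:
        cur = queue.popleft()
        for child in children.get(cur, []):
            if child not in depth:
                depth[child] = depth[cur] + 1
                queue.append(child)
    return sorted(node_to_amr, key=lambda n: (depth[node_to_amr[n]], n))

# Toy AMR fragment for the Figure 1 sentence: face-01 is the root predicate.
children = {"face-01": ["murder-01", "penalty"], "murder-01": ["person"]}
aligned = {"faces": "face-01", "murdering": "murder-01",
           "Scott Peterson": "person"}
order = decoding_order("face-01", children, aligned)
assert order == ["faces", "murdering", "Scott Peterson"]
```

The trigger aligned to the root predicate is decoded first, matching the example in Figure 3 where triggers precede their argument entities.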

Datasets

ACE-2005 The Automatic Content Extraction (ACE) 2005 dataset provides fine-grained annotations for entity, relation, and event extraction. We use the same preprocessing and data split as OneIE (Lin et al., 2020) and DyGIE++ to obtain the ACE05-E corpus with 18,927 sentences. Following (Lin et al., 2020), we keep 7 entity types, 6 relation types, 33 event types, and 22 event argument roles.

ERE-EN
We also adopt another dataset, ERE-EN, from the Deep Exploration and Filtering of Text (DEFT) program, which includes more recent news articles and political reviews. We extract 17,108 sentences from the datasets LDC2015E29, LDC2015E68, and LDC2015E78. Following (Lin et al., 2020), we keep 7 entity types, 5 relation types, 38 event types, and 20 argument roles.
GENIA To further demonstrate that our proposed model generalizes to other domains, we also evaluate it on the biomedical event extraction datasets BioNLP Genia 2011 and 2013 (Kim et al., 2011, 2013). We ignore all trigger-trigger links (nested event structures) and merge all repeated event triggers into unified information networks to make them comparable with previous models. Since the test sets are blind and not available for merging annotations, we evaluate model performance on the official development sets instead. Dataset statistics are shown in Table 2.

Experimental Setup
We adopt the recent joint IE models DyGIE++ and OneIE (Lin et al., 2020) as baselines in our experiments, and use the same evaluation metrics as (Zhang et al., 2019b; Lin et al., 2020) to report the F1 score for each IE subtask. Entity: An extracted entity mention is correct only if both the predicted word span (a_i, b_i) and entity type e_i match a reference entity mention.
Event Trigger: An event trigger is correctly identified (Trg-I) if the predicted span (p_i, q_i) matches a reference trigger. It is correctly classified (Trg-C) if the predicted event type t_i also matches the reference trigger.
Event Argument: A predicted event argument (τ_i, ε_j, l^a_{i,j}) is correctly identified (Arg-I) if (τ_i, ε_j) matches a reference event argument. It is correctly classified (Arg-C) if the type l^a_{i,j} also matches the reference argument role.
Relation: A predicted relation is correct only if both of its arguments ε_i and ε_j match a reference relation mention. We train our model with Adam (Kingma and Ba, 2015) on NVIDIA Tesla V100 GPUs for 80 epochs (approximately 10 minutes per epoch), with a learning rate of 1e-5 for the RoBERTa parameters and 5e-3 for all other parameters. We set the level of message passing γ to 0.001, a relatively low value, because we found that too much message passing causes nodes to lose their own features. We use a two-layer semantic graph aggregator, and the feature dimensions are 2048 for nodes and 256 for edges. All other hyper-parameters are kept strictly identical to (Lin et al., 2020) to ensure a fair comparison. Specifically, the FFNs consist of two layers with a dropout rate of 0.4, where the numbers of hidden units are 150 for entity and relation extraction and 600 for event extraction, and the beam size is set to 10.

Overall Performance
We report the performance of our AMR-IE model and compare it with previous methods in Table 3 and Table 4. Overall, our AMR-guided method greatly outperforms the baselines on all IE subtasks, including entity, event, and relation extraction. The improvement is particularly significant on edge classification tasks such as relation extraction and event argument role labeling, because the model can better understand the relations between knowledge elements with the help of external AMR graph structures. To further examine the contribution of each component, we introduce two variants of our model for an ablation study and show the results in Table 3. In AMR-IE w/o Enc, we remove the semantic graph aggregator and only keep the ordered decoding, while in AMR-IE w/o Dec, we keep the semantic graph aggregator but use a flat left-to-right decoding order. The results show that incorporating the graph encoder alone already substantially improves performance on all IE subtasks, because the identified nodes can capture global interactions through message passing on the AMR topology. Moreover, using an AMR-guided decoding order further boosts performance, especially on event argument extraction.

Influence of Message Passing
We also conduct a parameter sensitivity analysis to study the influence of γ defined in Eq. (2), which controls how much information is aggregated from neighboring nodes in the AMR graph. We vary this parameter from 10^-5 to 10^1 and show the performance trends of the IE subtasks on the ACE05-E dataset in Fig. 4. For each subtask, model performance increases as the level of message passing grows stronger. However, once γ increases beyond 10^-2, the performance of all subtasks clearly decreases. This phenomenon matches our intuition: the identified nodes can collect useful information from their AMR neighbors through message passing, but if the nodes focus too much on their neighborhood information, they lose some of their own inherent semantic features, resulting in a performance drop. In addition, compared with entity and trigger extraction, the performance of relation and argument extraction varies more drastically with γ. This is because edge type prediction requires high-quality embeddings for both of the involved nodes, which makes it more sensitive to message passing.

[Table 5 excerpt — columns: Sentence / AMR Parsing / OneIE outputs / AMR-IE outputs. Example sentence: "If the resolution is not passed, Washington would likely want to use the airspace for strikes against Iraq and for airlifting troops to northern Iraq."]

Qualitative Analysis
In order to further understand how our proposed AMR-guided encoding and AMR-conditioned decoding methods help to improve performance, we select typical examples from the output of our AMR-IE model for illustration in Table 5.

Related Work
Some recent efforts have incorporated dependency parse trees into neural networks for event extraction and relation extraction (Miwa and Bansal, 2016). For semantic role labeling (SRL), Stanovsky and Dagan (2016) exploit the similarity between SRL and open-domain IE by creating a mapping between the two tasks. Huang et al. (2016) employ AMR as a more concise input format for their IE models, but they decompose each AMR into triples to capture local contextual information between nodes and edges, so node information is not propagated over a global graph topology. Rao et al. (2017) propose a subgraph matching based method to extract biomedical events from AMR graphs, while another line of work uses an additional GCN-based encoder to obtain better word representations. Graph neural networks are also widely used for event extraction (Liu et al., 2018; Balali et al., 2020; Zhang et al., 2021) and for relation and entity extraction (Zhang et al., 2018; Sun et al., 2020). Graph neural networks have further proven effective for encoding other types of intrinsic structures of a sentence, such as knowledge graphs (Zhang et al., 2019a), document-level relations (Sahu et al., 2019; Lockard et al., 2020), and self-constructed graphs (Kim and Lee, 2012; Zhu et al., 2019; Qian et al., 2019; Sahu et al., 2020). However, all of these approaches focus on single IE tasks and cannot scale to extracting a joint information network with entities, relations, and events. There are also recent efforts on joint neural models that perform multiple IE tasks simultaneously, such as joint entity and relation extraction (Katiyar and Cardie, 2017; Zheng et al., 2017; Bekoulis et al., 2018; Sun et al., 2019) and joint event and entity extraction (Yang and Mitchell, 2016).
DyGIE++ designs a joint model to extract entities, events, and relations based on span graph propagation, while OneIE (Lin et al., 2020) further exploits global features to help the model capture more global interactions. Compared with the flat encoder in OneIE, our proposed framework leverages a semantic graph aggregator to incorporate information from fine-grained AMR semantics and to enforce global interactions in the encoding phase. In addition, instead of a simple left-to-right sequential decoder, we use the AMR hierarchy to decide the decoding order of knowledge elements. Both the AMR-guided graph encoder and decoder prove highly effective compared with their flat counterparts.

Conclusions and Future Work
AMR parsing and IE share the goal of constructing semantic graphs from unstructured text. IE focuses on a target ontology, so its output can be approximately considered a subgraph of the AMR graph. In this paper, we present two intuitive and effective ways to leverage guidance from AMR parsing to improve IE, in both the encoding and decoding phases. In the future, we plan to integrate the AMR graph with entity coreference graphs so that our IE framework can be extended to the document level.