An Incremental Parser for Abstract Meaning Representation

Abstract Meaning Representation (AMR) is a semantic representation for natural language that embeds annotations related to traditional tasks such as named entity recognition, semantic role labeling, word sense disambiguation and co-reference resolution. We describe a transition-based parser for AMR that parses sentences left-to-right, in linear time. We further propose a test-suite that assesses specific subtasks that are helpful in comparing AMR parsers, and show that our parser is competitive with the state of the art on the LDC2015E86 dataset and that it outperforms state-of-the-art parsers for recovering named entities and handling polarity.


Introduction
Semantic parsing aims to solve the problem of canonicalizing language and representing its meaning: given an input sentence, it aims to extract a semantic representation of that sentence. Abstract Meaning Representation (Banarescu et al., 2013), or AMR for short, does this while incorporating most of the shallow-semantic natural language processing (NLP) tasks that are usually addressed separately, such as named entity recognition, semantic role labeling and co-reference resolution. AMR is partially motivated by the need to provide the NLP community with a single dataset that includes basic disambiguation information, instead of having to rely on different datasets for each disambiguation problem. The annotation process is straightforward, enabling the development of large datasets. Alternative semantic representations have been developed and studied, such as CCG (Steedman, 1996; Steedman, 2000) and UCCA (Abend and Rappoport, 2013).
Several parsers for AMR have been recently developed (Flanigan et al., 2014; Wang et al., 2015a; Peng et al., 2015; Pust et al., 2015; Goodman et al., 2016; Rao et al., 2015; Vanderwende et al., 2015; Artzi et al., 2015; Zhou et al., 2016). This line of research is new and current results suggest considerable room for improvement. Greedy transition-based methods (Nivre, 2008) are one of the most popular choices for dependency parsing, because of their good balance between efficiency and accuracy. These methods also seem promising for AMR, due to the similarity between dependency trees and AMR structures: both representations use graphs with nodes that have lexical content and edges that represent linguistic relations.
A transition system is an abstract machine characterized by a set of configurations and transitions between them. The basic components of a configuration are a stack of partially processed words and a buffer of unseen input words. Starting from an initial configuration, the system applies transitions until a terminal configuration is reached. The sentence is scanned left to right, with linear time complexity for dependency parsing. This is made possible by the use of a greedy classifier that chooses the transition to be applied at each step.
In this paper we introduce a parser for AMR that is inspired by the ARCEAGER dependency transition system of Nivre (2004). The main difference between our system and ARCEAGER is that we need to account for the mapping from word tokens to AMR nodes, the non-projectivity of AMR structures and reentrant nodes (nodes with multiple incoming edges). Our AMR parser brings dependency parsing and AMR parsing closer together by showing that dependency parsing algorithms, with some modifications, can be used for AMR. Key properties such as working left-to-right, incrementality and linear complexity further strengthen its relevance.
The AMR parser of Wang et al. (2015a), called CAMR, also defines a transition system. It differs from ours because we process the sentence left-to-right while they first acquire the entire dependency tree and then process it bottom-up. More recently Zhou et al. (2016) presented a non-greedy transition system for AMR parsing, based on ARC-STANDARD (Nivre, 2004). Our transition system is also related to an adaptation of ARCEAGER for directed acyclic graphs (DAGs), introduced by Sagae and Tsujii (2008). This is also the basis for Ribeyre et al. (2015), a transition system used to parse dependency graphs. Similarly, Du et al. (2014) also address dependency graph parsing by means of transition systems. Analogously to dependency trees, dependency graphs have the property that their nodes consist of the word tokens, which is not true for AMR. As such, these transition systems are more closely related to traditional transition systems for dependency parsing.
Our contributions in this paper are as follows:
• In §3 we develop a left-to-right, linear-time transition system for AMR parsing, inspired by the ARCEAGER transition system for dependency tree parsing;
• In §5 we claim that the Smatch score (Cai and Knight, 2013) is not sufficient to evaluate AMR parsers and propose a set of metrics to alleviate this problem and better compare alternative parsers;
• In §6 we show that our algorithm is competitive with publicly available state-of-the-art parsers on several metrics.

Background and Notation
AMR Structures AMRs are rooted and directed graphs with node and edge labels. An annotation example for the sentence I beg you to excuse me is shown in Figure 1, with the AMR graph reported in Figure 2. Concepts are represented as labeled nodes in the graph and can be either English words (e.g. I and you) or Propbank framesets (e.g. beg-01 and excuse-01). Each node in the graph is assigned to a variable in the AMR annotation so that a variable re-used in the annotation corresponds to reentrancies (multiple incoming edges) in the graph. Relations are represented as labeled and directed edges in the graph.
Notation For most sentences in our dataset, the AMR graph is a directed acyclic graph (DAG), with a few specific cases where cycles are permitted. These cases are rare, and for the purpose of this paper, we consider AMRs as DAGs. We denote by [n] the set {1, ..., n}. We define an AMR structure as a tuple (G, x, π), where x = x_1 ⋯ x_n is a sentence, with each x_i, i ∈ [n], a word token, and G is a directed graph G = (V, E) with V and E the set of nodes and edges, respectively.² We assume G comes along with a node labeling function and an edge labeling function. Finally, π : V → [n] is a total alignment function that maps every node of the graph to an index i for the sentence x, with the meaning that node v represents (part of) the concept expressed by the word x_{π(v)}.³ We note that the function π is not invertible, since it is neither injective nor surjective. For each i ∈ [n], we let

π^{-1}(i) = {v | v ∈ V, π(v) = i}

be the pre-image of i under π (this set can be empty for some i), which means that we map a token in the sentence to a set of nodes in the AMR.
In this way we can align each index i for x to the induced subgraph of G. More formally, we define

←π(i) = (π^{-1}(i), E ∩ (π^{-1}(i) × π^{-1}(i))),    (1)

with the node and edge labeling functions of ←π(i) inherited from G. Hence, ←π(i) returns the AMR subgraph aligned with a particular token in the sentence.

² We collapse all multi-word named entities in a single token (e.g., United Kingdom becomes United_Kingdom) both in training and parsing.
³ π is a function because we do not consider co-references, which would otherwise cause a node to map to multiple indices. This is in line with current work on AMR parsing.

Figure 3: AMR's edges for the sentence "I beg you to excuse me." mapped back to the sentence, according to the alignment. • is a special token representing the root.

Transition-Based AMR Parsing
Similarly to dependency parsing, AMR parsing is partially based on the identification of predicate-argument structures. Much of the dependency parsing literature focuses on transition-based dependency parsing, an approach that scans the sentence from left to right in linear time and updates an intermediate structure that eventually becomes a dependency tree.
Because of the similarity of AMR structures to dependency structures, transition systems are also helpful for AMR parsing. Starting from the ARCEAGER system, we develop here a novel transition system, called AMREAGER, that parses sentences into AMR structures. There are three key differences between AMRs and dependency trees that require further adjustments for dependency parsers to be used with AMRs.

Non-Projectivity
A key difference between English dependency trees and AMR structures is projectivity. Dependency trees in English are usually projective, roughly meaning that there are no crossing arcs if the edges are drawn in the semi-plane above the words. While this restriction is empirically motivated in syntactic theories for English, it is no longer motivated for AMR structures.

Non-projective edges                 6%
Non-projective AMRs                 51%
Reentrant edges                     41%
AMRs with at least one reentrancy   93%
Table 1: Statistics for non-projectivity and reentrancies in 200 AMRs manually aligned with the associated sentences.
The notion of projectivity can be generalized to AMR graphs as follows. The intuition is that we can use the alignment π to map AMR edges back to the sentence x, and test whether there exist pairs of crossing edges. Figure 3 shows this mapping for the AMR of Figure 2, where the edge connecting excuse to I crosses another edge. More formally, consider an AMR edge e = (u, ℓ, v). Let π(u) = i and π(v) = j, so that u is aligned with x_i and v is aligned with x_j. The spanning set for e, written S(e), is the set of all nodes w such that π(w) = k and i < k < j if i < j or j < k < i if j < i. We say that e is projective if, for every node w ∈ S(e), all of its parent and child nodes are in S(e) ∪ {u, v}; otherwise, we say that e is non-projective. An AMR is projective if all of its edges are projective, and is non-projective otherwise. This corresponds to the intuitive definition of projectivity for DAGs introduced in Sagae and Tsujii (2008) and is closely related to the definition of non-crossing graphs of Kuhlmann and Jonsson (2015). Table 1 demonstrates that a relatively small percentage of all AMR edges are non-projective. Yet, a large fraction of the sentences contain at least one non-projective edge. Our parser is able to construct non-projective edges, as described in §3.
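The definition of the spanning set and of edge projectivity can be checked mechanically. The following is a minimal Python sketch, not the authors' code: the function names, the edge representation as (parent, label, child) triples, and the toy graph (loosely modeled on the running example, with the root • placed at index 0) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (parent node, label, child node); hypothetical encoding


def spanning_set(edge: Edge, pi: Dict[str, int]) -> Set[str]:
    """S(e): nodes aligned strictly between the two endpoints of the edge."""
    u, _, v = edge
    lo, hi = sorted((pi[u], pi[v]))
    return {w for w, k in pi.items() if lo < k < hi}


def is_projective_edge(edge: Edge, edges: List[Edge], pi: Dict[str, int]) -> bool:
    """e is projective if every parent/child of a node in S(e) lies in S(e) ∪ {u, v}."""
    u, _, v = edge
    span = spanning_set(edge, pi)
    allowed = span | {u, v}
    for w in span:
        children = {c for p, _, c in edges if p == w}
        parents = {p for p, _, c in edges if c == w}
        if not (children | parents) <= allowed:
            return False
    return True


def is_projective(edges: List[Edge], pi: Dict[str, int]) -> bool:
    """An AMR is projective if all of its edges are projective."""
    return all(is_projective_edge(e, edges, pi) for e in edges)


# Toy graph (an assumption, not a reproduction of Figure 2):
# tokens: 1=I 2=beg 3=you 4=to 5=excuse 6=me, with the root at index 0.
PI = {"root": 0, "i": 1, "beg": 2, "you": 3, "excuse": 5}
EDGES = [
    ("root", ":top", "beg"),
    ("beg", ":ARG0", "i"),
    ("beg", ":ARG1", "you"),
    ("beg", ":ARG2", "excuse"),
    ("excuse", ":ARG0", "you"),
    ("excuse", ":ARG1", "i"),
]
```

In this toy graph the edge from excuse to i spans beg, whose parent (the root) lies outside the allowed set, so that edge and hence the whole graph are non-projective under the definition.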
Reentrancy AMRs are graphs rather than trees because they can have nodes with multiple parents, called reentrant nodes, as in the node you for the AMR of Figure 2. There are two phenomena that cause reentrancies in AMR: control, where a reentrant edge appears between siblings of a control verb, and co-reference, where multiple mentions correspond to the same concept. In contrast, dependency trees do not have nodes with multiple parents. Therefore, when creating a new arc, transition systems for dependency parsing check that the dependent does not already have a head node, preventing the node from having additional parents. To handle reentrancy, which is not uncommon in AMR structures as shown in Table 1, we drop this constraint.
Alignment Another main difference with dependency parsing is that in AMR there is no straightforward mapping between a word in the sentence and a node in the graph: words may generate no nodes, one node or multiple nodes. In addition, the labels at the nodes are often not easily determined by the word in the sentence. For instance, expectation translates to expect-01 and teacher translates to the two nodes teach-01 and person, connected through an :ARG0 edge, expressing that a teacher is a person who teaches. A mechanism of concept identification is therefore required to map each token x_i to a subgraph with the correct labels at its nodes and edges: if π is the gold alignment, this should be the subgraph ←π(i) defined in Equation (1). To obtain alignments between the tokens in the sentence and the nodes in the AMR graph of our training data, we run the JAMR aligner.

Transition system for AMR Parsing
A stack σ = σ_n | ⋯ | σ_1 | σ_0 is a list of nodes of the partially constructed AMR graph, with the top element σ_0 at the right. We use the symbol '|' as the concatenation operator. A buffer β = β_0 | β_1 | ⋯ | β_n is a list of indices from x, with the first element β_0 at the left, representing the word tokens from the input still to be processed. A configuration of our parser is a triple (σ, β, A), where A is the set of AMR edges that have been constructed up to this point.
In order to introduce the transition actions of our parser we need some additional notation. We use a function a that maps indices from x to AMR graph fragments. For each i ∈ [n], a(i) is a graph G_a = (V_a, E_a), with single root root(G_a), representing the semantic contribution of word x_i to the AMR for x. As already mentioned, G_a can have a single node representing the concept associated with x_i, or it can have several nodes in case x_i denotes a complex concept, or it can be empty.
The transition Shift is used to decide if and what to push on the stack after consuming a token from the buffer. Intuitively, the graph fragment a(β_0) obtained from the token β_0, if not empty, is "merged" with the graph we have constructed so far. We then push onto the stack the node root(a(β_0)) for further processing. LArc(ℓ) creates an edge with label ℓ between the top-most node and the second top-most node in the stack, and pops the latter. RArc(ℓ) is the symmetric operation, but does not pop any node from the stack.
Finally, Reduce pops the top-most node from the stack, and it also recovers reentrant edges between its sibling nodes, capturing for instance several control verb patterns. To accomplish this, Reduce decides whether to create an additional edge between the node being removed and the previously created sibling in the partial graph. With this operation the transition system is able to capture non-projective patterns, according to the definition given in §2.1, when formed by arcs between nodes that share the same parent. This way of handling control verbs is similar to the REENTRANCE transition of Wang et al. (2015a).
The choice of popping the dependent in the LArc transition is inspired by ARCEAGER, where left-arcs are constructed bottom-up to increase the incrementality of the transition system (Nivre, 2004). This affects our ability to recover some reentrant edges: consider a node u with two parents v and v′, where the arc v → u is a left-arc and v′ → u is any arc. If the first arc to be processed is v → u, we use LArc, which pops u, hence making it impossible to create the second arc v′ → u. Nevertheless, we found that this approach works better than allowing reentrancy without any restriction. The reason is that if we do not remove dependents at all when they are first attached to a node, the stack becomes larger, and nodes which should be connected end up being distant from each other, and as such, are never connected.
The initial configuration of the system has a • node (representing the root) in the stack and the entire sentence in the buffer. The terminal configuration consists of an empty buffer and a stack with only the • node. The transitions required to parse the sentence The boy and the girl are shown in Table 2, where the first line shows the initial configuration and the last line shows the terminal configuration.
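To make the four transitions concrete, the sketch below implements a configuration (σ, β, A) and the transitions in Python. It is an illustrative reading of the description above, not the authors' implementation: the class and function names are hypothetical, each token is assumed to contribute at most one node, and the transition sequence in `parse_example` is our own derivation for "The boy and the girl" rather than a reproduction of Table 2.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set, Tuple

ROOT = "•"
Edge = Tuple[str, str, str]  # (parent, label, child); hypothetical encoding


@dataclass
class Config:
    stack: List[str]                 # top of the stack is the last element
    buffer: List[int]                # token indices still to be processed
    edges: Set[Edge] = field(default_factory=set)

    def is_terminal(self) -> bool:
        return not self.buffer and self.stack == [ROOT]


def initial(n: int) -> Config:
    """Root node on the stack, whole sentence in the buffer."""
    return Config([ROOT], list(range(1, n + 1)))


def shift(c: Config, fragment_root: Optional[str] = None,
          fragment_edges: Iterable[Edge] = ()) -> None:
    """Consume the next token; merge its fragment and push the fragment root, if any."""
    c.buffer.pop(0)
    c.edges.update(fragment_edges)
    if fragment_root is not None:
        c.stack.append(fragment_root)


def larc(c: Config, label: str) -> None:
    """Edge from the top node to the second top-most node, which is popped."""
    c.edges.add((c.stack[-1], label, c.stack[-2]))
    del c.stack[-2]


def rarc(c: Config, label: str) -> None:
    """Edge from the second top-most node to the top node; nothing is popped."""
    c.edges.add((c.stack[-2], label, c.stack[-1]))


def reduce_top(c: Config, reentrant_edge: Optional[Edge] = None) -> None:
    """Pop the top node, optionally adding a reentrant edge to a sibling."""
    c.stack.pop()
    if reentrant_edge is not None:
        c.edges.add(reentrant_edge)


def parse_example() -> Config:
    """One possible derivation for 'The boy and the girl' (tokens 1..5)."""
    c = initial(5)
    shift(c)                 # The: empty fragment
    shift(c, "boy")
    shift(c, "and")
    larc(c, ":op1")          # and -> boy, pops boy
    rarc(c, ":top")          # • -> and
    shift(c)                 # the: empty fragment
    shift(c, "girl")
    rarc(c, ":op2")          # and -> girl
    reduce_top(c)            # pops girl
    reduce_top(c)            # pops and
    return c
```

Representing the stack as a Python list with the top at the end mirrors the convention that σ_0 is the right-most element.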
Similarly to the transitions of ARCEAGER, the above transitions construct edges as soon as the head and the dependent are available in the stack, with the aim of maximizing the incrementality of the parser. We now show that our greedy transition-based AMR parser is linear-time in n, the length of the input sentence x. We first claim that the output graph has size O(n). Each token in x is mapped to a constant number of nodes in the graph by Shift. Thus the number of nodes is O(n). Furthermore, each node can have at most three parent nodes, created by transitions RArc, LArc and Reduce, respectively. Thus the number of edges is also O(n). It is possible to bound the maximum number of transitions required to parse x: the number of Shift is bounded by n, and the number of Reduce, LArc and RArc is bounded by the size of the graph, which is O(n). Since each transition can be carried out in constant time, we conclude that our parser runs in linear time.
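The counting argument can be written out explicitly. The derivation below is a sketch that follows the paragraph's own bounds; the constant c, an upper bound on the number of nodes a single token contributes, is a name introduced here for illustration.

```latex
\begin{align*}
|V| &\le c\,n, \qquad |E| \le 3\,|V| \le 3c\,n,\\
\#\textsc{Shift} &\le n, \qquad
\#\textsc{Reduce} \le |V|, \qquad
\#\textsc{LArc} + \#\textsc{RArc} \le |E|,\\
\#\text{transitions} &\le n + c\,n + 3c\,n = (1 + 4c)\,n = O(n).
\end{align*}
```

The bound on Reduce holds because each Reduce pops a node and a node is pushed at most once; the bound on LArc and RArc holds because each of them creates one edge.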

Training the System
Several components have to be learned: (1) a transition classifier that predicts the next transition given the current configuration, (2) a binary classifier that decides whether or not to create a reentrancy after a Reduce, (3) a concept identification step for each Shift to compute a(β_0), and (4) another classifier to label edges after each LArc or RArc.

Oracle
Training our system from data requires an oracle, that is, an algorithm that, given a gold-standard AMR graph and a sentence, returns transition sequences that maximize the overlap between the gold-standard graph and the graph dictated by the sequence of transitions.
We adopt a shortest stack, static oracle similar to Chen and Manning (2014). Informally, static means that if the actual configuration of the parser has no mistakes, the oracle provides a transition that does not introduce any mistake. Shortest stack means that the oracle prefers transitions where the number of items in the stack is minimized. Given the current configuration (σ, β, A) and the gold-standard graph G = (V_g, A_g), the oracle is defined as follows, where we test the conditions in the given order and apply the action associated with the first match:
1. if A_g contains an edge from σ_0 to σ_1 with some label ℓ, then LArc(ℓ);
2. if A_g contains an edge from σ_1 to σ_0 with some label ℓ, then RArc(ℓ);
3. if A_g contains no edge, in either direction, between σ_0 and any element β_i of the buffer, then Reduce;
4. Shift otherwise.
The oracle first checks whether some gold-standard edge can be constructed from the two elements at the top of the stack (conditions 1 and 2). If LArc or RArc are not possible, the oracle checks whether all possible edges in the gold graph involving σ_0 have already been processed, in which case it chooses Reduce (condition 3). To this end, it suffices to check the buffer, since LArc and RArc have already been excluded and elements in the stack deeper than position two can no longer be accessed by the parser. If Reduce is not possible, Shift is chosen.
Besides deciding on the next transition, the oracle also needs the alignments, which we generate with JAMR, in order to know how to map the next token in the sentence to its AMR subgraph ←π(i) defined in Equation (1).
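The order in which the oracle tests its conditions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' oracle: the names `oracle_step` and `derive` are hypothetical, each token contributes at most one node, edges that have already been built are skipped (our addition, so that RArc is not proposed twice for the same edge), and the root • is never reduced (also our addition).

```python
from typing import Dict, List, Optional, Set, Tuple

ROOT = "•"
Edge = Tuple[str, str, str]  # (parent, label, child); hypothetical encoding


def oracle_step(stack: List[str], buffer: List[int], built: Set[Edge],
                gold: Set[Edge], pi: Dict[str, int]) -> Tuple[str, Optional[str]]:
    """Return the next transition, testing LArc, RArc, Reduce, Shift in that order."""
    todo = sorted(gold - built)  # gold edges not built yet (sorted for determinism)
    if len(stack) >= 2:
        s0, s1 = stack[-1], stack[-2]
        for parent, label, child in todo:
            if (parent, child) == (s0, s1):
                return "LArc", label
        for parent, label, child in todo:
            if (parent, child) == (s1, s0):
                return "RArc", label
    s0 = stack[-1]
    pending = set(buffer)
    # Does s0 still have a gold edge to a node whose token is in the buffer?
    waiting = any(
        (p == s0 and pi.get(c) in pending) or (c == s0 and pi.get(p) in pending)
        for p, _, c in todo
    )
    if s0 != ROOT and not waiting:
        return "Reduce", None
    return "Shift", None


def derive(n: int, node_of: Dict[int, str], gold: Set[Edge],
           pi: Dict[str, int]) -> Tuple[List[str], Set[Edge]]:
    """Run the oracle to completion, applying each transition as it is chosen."""
    stack, buffer, built, actions = [ROOT], list(range(1, n + 1)), set(), []
    while buffer or stack != [ROOT]:
        action, label = oracle_step(stack, buffer, built, gold, pi)
        actions.append(action if label is None else f"{action}({label})")
        if action == "Shift":
            node = node_of.get(buffer.pop(0))  # None when the token has no concept
            if node is not None:
                stack.append(node)
        elif action == "LArc":
            built.add((stack[-1], label, stack[-2]))
            del stack[-2]
        elif action == "RArc":
            built.add((stack[-2], label, stack[-1]))
        else:  # Reduce
            stack.pop()
    return actions, built


# Toy gold graph for "The boy and the girl" (an assumption for illustration).
GOLD = {("•", ":top", "and"), ("and", ":op1", "boy"), ("and", ":op2", "girl")}
PI = {"boy": 2, "and": 3, "girl": 5}
NODE_OF = {2: "boy", 3: "and", 5: "girl"}
```

Calling `derive(5, NODE_OF, GOLD, PI)` yields a transition sequence that rebuilds every gold edge of the toy graph and ends with only • on the stack.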

Transition Classifier
Like all other transition systems of this kind, our transition system has a "controller" that predicts a transition given the current configuration (among Shift, LArc, RArc and Reduce). The examples from which we learn this controller are based on features extracted from the oracle transition sequences, where the oracle is applied on the training data.
As a classifier, we use a feed-forward neural network with two hidden layers of 200 tanh units and learning rate set to 0.1, with linear decay. The input to the network consists of the concatenation of embeddings for words, POS tags and Stanford parser dependencies, one-hot vectors for named entities and additional sparse features, extracted from the current configuration of the transition system; this is reported in more detail in Table 3. The embeddings for words and POS tags were pre-trained on a large unannotated corpus consisting of the first 1 billion characters from Wikipedia. For lexical information, we also extract the leftmost (in the order of the aligned words) child (c), leftmost parent (p) and leftmost grandchild (cc). Leftmost and rightmost items are common features for transition-based parsers (Zhang and Nivre, 2011; Chen and Manning, 2014) but we found only leftmost to be helpful in our case. All POS tags, dependencies and named entities are generated using Stanford CoreNLP. The accuracy of this classifier on the development set is 84%. Similarly, we train a binary classifier for deciding whether or not to create a reentrant edge after a Reduce: in this case we use word and POS embeddings for the two nodes being connected and their parent as well as dependency label embeddings for the arcs between them.
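As a rough picture of the classifier's shape, the sketch below runs a forward pass through two hidden layers of 200 tanh units followed by a layer over the four transitions. Only the hidden-layer sizes and the activation come from the text; the softmax output, the random initialization, the input dimension and all function names are assumptions, and no training is shown.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)
TRANSITIONS = ["Shift", "LArc", "RArc", "Reduce"]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small uniform random weights and zero biases (an arbitrary choice)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    """Affine map: one dot product plus bias per output unit."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def build_network(n_in: int, rng: random.Random, hidden: int = 200) -> List[Layer]:
    """Two hidden layers of `hidden` units and one output layer over the transitions."""
    return [
        init_layer(n_in, hidden, rng),
        init_layer(hidden, hidden, rng),
        init_layer(hidden, len(TRANSITIONS), rng),
    ]


def predict(x: List[float], layers: List[Layer]) -> List[float]:
    """tanh on the hidden layers, softmax on the output layer."""
    h = x
    for layer in layers[:-1]:
        h = [math.tanh(v) for v in dense(h, layer)]
    return softmax(dense(h, layers[-1]))
```

For example, `predict([0.1] * 30, build_network(30, random.Random(0)))` returns one probability per transition; the input size of 30 is a placeholder for the concatenated feature vector.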

Concept Identification
This routine is called every time the transition classifier decides to do a Shift; it is denoted by a(·) in §3. This component could be learned in a supervised manner, but we were not able to improve on a simple heuristic, which works as follows: during training, for each Shift decided by the oracle, we store the pair (β_0, ←π(i)) in a phrase-table. During parsing, the most frequent graph H for the given token is then chosen. In other words, a(i) approximates ←π(i) by means of the graph most frequently seen among all occurrences of token x_i in the training set.
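The phrase-table heuristic amounts to a per-token frequency count. The sketch below is one way to write it in Python; the class name, the method names and the use of strings to stand for graph fragments are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Optional


class PhraseTable:
    """Maps each token to the graph fragment most often aligned to it in training."""

    def __init__(self) -> None:
        self._counts: DefaultDict[str, Counter] = defaultdict(Counter)

    def observe(self, token: str, fragment: str) -> None:
        """Record one (token, fragment) pair seen at a Shift chosen by the oracle."""
        self._counts[token][fragment] += 1

    def lookup(self, token: str) -> Optional[str]:
        """Most frequent fragment for the token, or None if the token is unseen."""
        counts = self._counts.get(token)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


table = PhraseTable()
table.observe("teacher", "(p / person :ARG0-of (t / teach-01))")
table.observe("teacher", "(p / person :ARG0-of (t / teach-01))")
table.observe("teacher", "(t / teacher)")
table.observe("expectation", "(e / expect-01)")
```

Here `table.lookup("teacher")` returns the fragment seen twice, and a token that never occurred in training yields `None`.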
An obvious problem with the phrase-table approach is that it does not generalize to unseen words. In addition, our heuristic relies on the fact that the mappings observed in the data are correct, which is not the case when the JAMR-generated alignments contain a mistake. In order to alleviate this problem we observe that there are classes of words such as named entities and numeric quantities that can be disambiguated in a deterministic manner. We therefore implement a set of "hooks" that are triggered by the named entity tag of the next token in the sentence. These hooks override the normal Shift mechanism and apply a fixed rule instead. For instance, when we see the token New York (the two tokens are collapsed in a single one at preprocessing) we generate the subgraph of Figure 4 and push its root onto the stack. Similar subgraphs are generated for all states, cities, countries and people. We also use hooks for ordinal numbers, percentages, money and dates.

Edge Labeling
Edge labeling determines the labels for the edges being created. Every time the transition classifier decides to take an LArc or RArc operation, the edge labeler needs to decide on a label for it. There are more than 100 possible labels, such as :ARG0.

depth        d(σ_0), d(σ_1)
children     #c(σ_0), #c(σ_1)
parents      #p(σ_0), #p(σ_1)
lexical      w(σ_0), w(σ_1), w(β_0), w(β_1), w(p(σ_0)), w(c(σ_0)), w(cc(σ_0)), w(p(σ_1)), w(c(σ_1)), w(cc(σ_1))
POS          s(σ_0), s(σ_1), s(β_0), s(β_1)
entities     e(σ_0), e(σ_1), e(β_0), e(β_1)
dependency   […]
Table 3: Features used in transition classifier. The function d maps a stack element to the depth of the associated graph fragment. The functions #c and #p count the number of children and parents, respectively, of a stack element. The function w maps a stack/buffer element to the word embedding for the associated word in the sentence. The function p gives the leftmost (according to the alignment) parent of a stack element, the function c the leftmost child and the function cc the leftmost grandchild. The function s maps a stack/buffer element to the part-of-speech embedding for the associated word. The function e maps a stack/buffer element to its entity. Finally, the function ℓ maps a pair of symbols to the dependency label embedding, according to the edge (or lack of) in the dependency tree for the two words these symbols are mapped to.

We use a feed-forward neural network similar to the one we trained for the transition classifier, with features shown in Table 4. The accuracy of this classifier on the development set is 77%. We constrain the labels predicted by the neural network in order to satisfy requirements of AMR. For instance, the label :top can only be applied when the node from which the edge starts is the special • node. Other constraints are used for the :polarity label and for edges attaching to numeric quantities.

Fine-grained Evaluation
Until now, AMR parsers were evaluated using the Smatch score.¹⁰ Given the candidate graphs and the gold graphs in the form of AMR annotations, Smatch first tries to find the best alignments between the variable names for each pair of graphs and it then computes precision, recall and F1 of the concepts and relations. We note that the Smatch score has two flaws: (1) while AMR parsing involves a large number of subtasks, the Smatch score consists of a single number that does not assess the quality of each subtask separately; (2) the Smatch score weighs different types of errors in a way which is not necessarily useful for solving a specific NLP problem. For example, for a specific problem concept detection might be deemed more important than edge detection, or guessing the wrong sense for a concept might be considered less severe than guessing the wrong verb altogether.

¹⁰ Since Smatch is an approximate randomized algorithm, decimal points in the results vary between different runs and are not reported. This approach was also taken by Wang et al. (2015b) and others.

name         feature template
depth        d(σ_0), d(σ_1)
children     #c(σ_0), #c(σ_1)
parents      #p(σ_0), #p(σ_1)
lexical      w(σ_0), w(σ_1), w(p(σ_0)), w(c(σ_0)), w(cc(σ_0)), w(p(σ_1)), w(c(σ_1)), w(cc(σ_1))
POS          s(σ_0), s(σ_1)
entities     e(σ_0), e(σ_1)
dependency   ℓ(σ_0, β_0), ℓ(β_0, σ_0)
Table 4: Features used in the edge labeler.

Consider the two parses for the sentence Silvio Berlusconi gave Lucio Stanca his current role of modernizing Italy's bureaucracy in Figure 5. At the top, we show the output of a parser (Parse 1) that is not able to deal with named entities. At the bottom, we show the output of a parser (Parse 2) which, except for :name, :op and :wiki, always uses the edge label :ARG0. The Smatch scores for the two parses are 56 and 78 respectively. Both parses make obvious mistakes but the three named entity errors in Parse 1 are considered more important than the six wrong labels in Parse 2. However, without further analysis, it is not advisable to conclude that Parse 2 is better than Parse 1. In order to better understand the limitations of the different parsers, find their strengths and gain insight into which downstream tasks they may be helpful for, we compute a set of metrics on the test set.
Unlabeled is the Smatch score computed on […]
No WSD gives a score that does not take into account word sense disambiguation errors. By ignoring the sense specified by the Propbank frame used (e.g., duck-01 vs duck-02) we have a score that does not take into account this additional complexity in the parsing procedure. To compute this score, we simply strip off the suffixes from all Propbank frames and calculate the Smatch score.
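Stripping the sense suffix is a small string operation. The sketch below shows one way to do it in Python before handing the graphs to Smatch; the function name and the exact pattern are assumptions, and the check on the first character is our own guard so that constants such as the polarity marker - are left untouched.

```python
import re

_SENSE_SUFFIX = re.compile(r"-\d+$")  # trailing sense number, e.g. "-01"


def strip_sense(concept: str) -> str:
    """Remove the PropBank sense suffix from a concept label, if there is one."""
    if not concept[:1].isalpha():
        return concept  # constants such as "-" or numbers are left as they are
    return _SENSE_SUFFIX.sub("", concept)
```

With this normalization duck-01 and duck-02 both become duck, so a sense error no longer counts as a concept mismatch.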
Following Sawai et al. (2015), we also evaluate the parsers using the Smatch score on noun phrases only (NP-only), by extracting from the AMR dataset all noun phrases that do not include further NPs.
As we previously discussed, reentrancy is a very important characteristic of AMR graphs and it is not trivial to handle. We therefore implement a test for it (Reentrancy), where we compute the Smatch score only on reentrant edges.
Concept identification is another critical component of the parsing process and we therefore compute the F-score on the list of predicted concepts (Concepts) too. Identifying the correct concepts is fundamental: if a concept is not identified, it will not be possible to retrieve any edge involving that concept, with likely significant consequences on accuracy. This metric is therefore quite important to score highly on.

Table 5: Evaluation of the two parses in Figure 5 with the proposed evaluation suite.
Similarly to our score for concepts, we further compute an F-score on the named entities (Named Ent.) and wiki roles for named entities (Wikification) that consider edges labeled with :name and :wiki respectively. These two metrics are strictly related to the concept score. However, since named entity recognition is the focus of dedicated research, we believe it is important to define a metric that specifically assesses this problem. Negation detection is another task which has received some attention. An F-score for this (Negations) is also defined, where we find all negated concepts by looking for the :polarity role. The reason we can compute a simple F-score instead of using Smatch for these metrics is that there are no variable names involved.
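Because no variable alignment is needed, these metrics reduce to a plain F-score over two lists of items. The sketch below computes precision, recall and F1 in Python; treating the lists as multisets (so that repeated concepts count separately) and the function name are assumptions, since the text does not specify these details.

```python
from collections import Counter
from typing import Iterable, Tuple


def f_score(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 between two multisets of items."""
    pred, ref = Counter(predicted), Counter(gold)
    overlap = sum((pred & ref).values())  # multiset intersection
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

The same function can be applied to concept labels, to named entities, or to the concepts that carry a :polarity role, depending on which list is extracted from the graphs.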
Finally we compute the Smatch score on :ARG edges only, in order to have a score for semantic role labeling (SRL), which is another extremely important subtask of AMR, as it is based on the identification of predicate-argument structures.
Using this evaluation suite we can evaluate AMRs on a wide range of metrics that can help us find strengths and weaknesses of each parser, hence speeding up the research in this area. Table 5 reports the scores for the two parses of Figure 5, where we see that Parse 1 gets a good score for semantic role labeling while Parse 2 is optimal for named entity recognition. Moreover, we can make additional observations such as that Parse 2 is optimal with respect to unlabeled score and that Parse 1 recovers more reentrancies.
[…] labeled case suggests that our parser has difficulty in labeling the arcs. Our score for concept identification, which is on par with the best result from the other parsers, demonstrates that there is a relatively low level of token ambiguity. State-of-the-art results for this problem can be obtained by choosing the most frequent subgraph for a given token based on a phrase-table constructed from JAMR alignments on the training data. The scores for named entities and wikification are heavily dependent on the hooks mentioned in §4.3, which in turn rely on the named entity recognizer to make the correct predictions. In order to alleviate the problem of wrong automatic alignments with respect to polarity and better detect negation, we performed a post-processing step on the aligner output where we align the AMR constant - (minus) with words bearing negative polarity such as not, illegitimate and asymmetry.
Our experiments demonstrate that there is no parser for AMR yet that conclusively does better than all other parsers on all metrics. Advantages of our parser are its worst-case linear complexity and the fact that it makes incremental AMR parsing possible, which is helpful both for real-time applications and for investigating how the meaning of English sentences can be built incrementally left-to-right.

Conclusion
We presented a transition system that builds AMR graphs in linear time by processing the sentences left-to-right. The system is trained with feed-forward neural networks. The parser demonstrates that it is possible to perform AMR parsing using techniques inspired by dependency parsing.
We also noted that it is less informative to evaluate the entire parsing process with Smatch than to use a collection of metrics aimed at evaluating the various subproblems in the parsing process. We further showed that our left-to-right transition system is competitive with publicly available state-of-the-art parsers. Although we do not outperform the best baseline in terms of Smatch score, we show on par or better results for several of the metrics proposed. We hope that moving away from a single-metric evaluation will further speed up progress in AMR parsing.