Core Semantic First: A Top-down Approach for AMR Parsing

We introduce a novel scheme for parsing a piece of text into its Abstract Meaning Representation (AMR): Graph Spanning based Parsing (GSP). One novel characteristic of GSP is that it constructs a parse graph incrementally in a top-down fashion. Starting from the root, at each step a new node and its connections to the existing nodes are jointly predicted. The output graph spans the nodes by their distance to the root, following the intuition of first grasping the main ideas and then digging into details. This core-semantic-first principle emphasizes capturing the main ideas of a sentence, which is of practical interest for downstream applications. We evaluate our model on the latest AMR sembank and achieve state-of-the-art performance among models that adopt no heuristic graph re-categorization. More importantly, the experiments show that our parser is especially good at capturing the core semantics.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a semantic formalism that encodes the meaning of a sentence as a rooted, labeled, directed graph. As illustrated by the example in Figure 1, AMR abstracts away from surface forms in text: the root serves as a rudimentary representation of the overall focus, while the details are elaborated as the depth of the graph increases. AMR has proven useful for many downstream NLP tasks, including text summarization (Liu et al., 2015; Hardy and Vlachos, 2018) and question answering (Mitra and Baral, 2016).
The task of AMR parsing is to map natural language strings to AMR semantic graphs automatically. Compared to constituent parsing (Zhang and Clark, 2009) and dependency parsing (Kübler et al., 2009), AMR parsing is considered more challenging due to the following characteristics: (1) the nodes in AMR have no explicit alignment to text tokens; (2) the graph structure is more complicated because of frequent reentrancies and non-projective arcs; (3) there is a large and sparse vocabulary of possible node types (concepts). Many methods for AMR parsing have been developed in the past years, which can be categorized into three main classes. Graph-based parsing (Flanigan et al., 2014; Lyu and Titov, 2018) uses a pipeline design for concept identification and relation prediction. Transition-based parsing (Wang et al., 2016; Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Guo and Lu, 2018; Liu et al., 2018) processes a sentence from left to right and constructs the graph incrementally. The third class is seq2seq-based parsing (Barzdins and Gosko, 2016; Konstas et al., 2017; van Noord and Bos, 2017), which views parsing as sequence-to-sequence transduction over a linearization (depth-first traversal) of the AMR graph.

* The work described in this paper is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418). The first author is grateful for the discussions with Zhisong Zhang and Zhijiang Guo.
While existing graph-based models cannot sufficiently model the interactions between individual decisions, the autoregressive nature of transition-based and seq2seq-based models makes them suffer from error propagation, where later decisions can easily go awry, especially given the complexity of AMR. Since capturing the core semantics of a sentence is arguably more important and useful in practice, it is desirable for a parser to have a global view and a priority for capturing the main ideas first. In fact, AMR graphs are organized in a hierarchy in which the core semantics stay close to the root, so a top-down parsing scheme can fulfill this desideratum. For example, in Figure 1, the subgraph in the red box already conveys the core meaning "an earthquake suddenly struck at a particular time", and the subgraph in the blue box further informs us that "the earthquake was big" and "the time was one of prosperity and happiness".
We propose a novel framework for AMR parsing known as Graph Spanning based Parsing (GSP). One novel characteristic of GSP is that, to our knowledge, it is the first top-down AMR parser. (Depth-first traversal in seq2seq models does not produce a strictly top-down order due to the reentrancies in AMR.) GSP performs parsing in an incremental, root-to-leaf fashion, but still maintains a global view of the sentence and the previously derived graph. At each step, it generates the connecting arcs between the existing nodes and the coming new node, upon which the type of the new node (concept) is jointly decided. The output graph spans the nodes by their distance to the root, following the intuition of first grasping the main ideas and then digging into details. Compared to previous graph-based methods, our model is capable of capturing more complicated intra-graph interactions, while reducing the number of parsing steps to be linear in the sentence length (since the size of an AMR graph is approximately linear in the length of the sentence). Compared to transition-based methods, our model removes the left-to-right restriction and avoids sophisticated oracle design for handling the complexity of AMR graphs.
Notably, most existing methods, including the state-of-the-art parsers, rely on heavy graph re-categorization for reducing the complexity of the original AMR graphs. In graph re-categorization, specific subgraphs of AMR are grouped together and assigned to a single node with a new compound category (Werling et al., 2015; Foland and Martin, 2017; Lyu and Titov, 2018; Groschwitz et al., 2018; Guo and Lu, 2018). The hand-crafted rules for re-categorization are often non-trivial, requiring exhaustive screening and expert-level manual effort. For instance, in the re-categorization system of Lyu and Titov (2018), a graph fragment rooted at "temporal-quantity" is replaced by one single nested node "rate-entity-3(annual-01)". There are hundreds of such manual heuristic rules. This kind of re-categorization has been shown to have considerable effects on performance (Guo and Lu, 2018). However, one issue is that the precise set of re-categorization rules differs among models, making it difficult to attribute performance improvements to either model optimization or carefully designed rules. In fact, some approaches become totally infeasible when this re-categorization step is removed. For example, the parser of Lyu and Titov (2018) requires tight integration with this step, as it is built on the assumption that an injective alignment exists between sentence tokens and graph nodes.
We evaluate our parser on the latest AMR sembank and achieve results competitive with the state-of-the-art models. The result is remarkable since our parser directly operates on the original AMR graphs and requires no manual effort for graph re-categorization. The contributions of our work are summarized as follows:
• We propose a new method for learning AMR parsing that produces high-quality core semantics.
• Without the help of heuristic graph re-categorization, which requires expensive expert-level manual effort to design re-categorization rules, our method achieves state-of-the-art performance.

Related Work
Currently, most AMR parsers can be categorized into three classes: (1) Graph-based methods (Flanigan et al., 2014, 2016; Werling et al., 2015; Foland and Martin, 2017; Lyu and Titov, 2018; Zhang et al., 2019) adopt a pipeline approach to graph construction: they first map continuous text spans to AMR concepts, then calculate the scores of possible edges and use a maximum spanning connected subgraph algorithm to select the final graph. The major deficiency is that concept identification and relation prediction are strictly performed in order, yet the interactions between them should benefit both sides (Zhou et al., 2016). In addition, for computational efficiency, usually only first-order information is considered for edge scoring.
(2) Transition-based methods (Wang et al., 2016; Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Guo and Lu, 2018; Liu et al., 2018) process a sentence from left to right and construct the graph incrementally through a sequence of transition actions. (3) Seq2seq-based methods (Barzdins and Gosko, 2016; Peng et al., 2017; Konstas et al., 2017; van Noord and Bos, 2017) treat AMR parsing as a sequence-to-sequence problem by linearizing AMR graphs, so that existing seq2seq models (Bahdanau et al., 2014; Luong et al., 2015) can be readily utilized. Despite its simplicity, the performance of current seq2seq models lags behind when training data is limited: seq2seq models are often not as effective on smaller datasets, and the linearized AMRs make it harder to exploit graph structure information.
There are also some notable exceptions. Peng et al. (2015) introduce a synchronous hyperedge replacement grammar solution. Pust et al. (2015) regard the task as a machine translation problem, while Artzi et al. (2015) adapt combinatory categorial grammar. Groschwitz et al. (2018) and Lindemann et al. (2019) view AMR graphs as the structures of the AM algebra.
Most AMR parsers require an explicit alignment between tokens in the sentences and nodes in the AMR graph during training. Since such information is not annotated, a pre-trained aligner (Flanigan et al., 2014;Pourdamghani et al., 2014;Liu et al., 2018) is often required. More recently, Lyu and Titov (2018) demonstrate that the alignments can be treated as latent variables in a joint probabilistic model.

Background of Multi-head Attention
The multi-head attention mechanism introduced by Vaswani et al. (2017) is used as a basic building block in our framework. Multi-head attention consists of H attention heads, each of which learns a distinct attention function. Given a query vector x and a set of vectors {y_1, y_2, . . . , y_m} (in short, y_1:n:m), each attention head projects x and y_1:m into distinct query, key, and value representations q ∈ R^d, K ∈ R^{m×d}, and V ∈ R^{m×d} respectively, where d is the dimension of the vector space. Then we perform scaled dot-product attention (Vaswani et al., 2017):

a = softmax(Kq / √d),    attn = aV,

where a ∈ R^m is the attention vector (a distribution over all inputs y_1:m) and attn is the weighted sum of the value vectors. Finally, the outputs of all attention heads are concatenated and projected back to the original dimension of x. For brevity, we denote the whole attention procedure described above as a function T(x, y_1:m).
Based on the multi-head attention, the Transformer encoder (Vaswani et al., 2017) uses self-attention for context information aggregation when given a set of vectors (e.g., word embeddings in a sentence or node embeddings in a graph).
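The attention function T(x, y_1:m) described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `params` layout (per-head projection matrices plus an output projection `Wo`) is an assumption for exposition.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Y, params):
    """T(x, y_1:m): multi-head scaled dot-product attention of one query
    vector x over a set of vectors Y (m x d_model).
    `params["heads"]` is a list of (Wq, Wk, Wv) per head; `params["Wo"]`
    projects the concatenated head outputs back to d_model (illustrative)."""
    outputs = []
    for Wq, Wk, Wv in params["heads"]:
        q = x @ Wq                                 # query, shape (d,)
        K = Y @ Wk                                 # keys, shape (m, d)
        V = Y @ Wv                                 # values, shape (m, d)
        a = softmax(K @ q / np.sqrt(q.shape[0]))   # distribution over y_1:m
        outputs.append(a @ V)                      # weighted sum of values
    return np.concatenate(outputs) @ params["Wo"]  # back to d_model
```

Each head learns its own projections, so different heads can attend to different parts of y_1:m for the same query.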

Overview
Figure 2 depicts the major neural components in our proposed framework. The Sentence Encoder component and the Graph Encoder component are designed for token-level sentence representation and node-level graph representation, respectively. Given an input sentence w = (w_1, w_2, . . . , w_n), where n is the sentence length, the Sentence Encoder component first reads the whole sentence and encodes each word w_i into a hidden state s_i. The initial graph G_0 always contains a single dummy node d*, and each previously generated concept c_j is encoded into a hidden state v_j by the Graph Encoder component.
At each time step t, the Focus Selection component reads both the sentence representation s_1:n and the graph representation v_0:t−1 of G_t−1 repeatedly and generates the initial parser state h_t. The parser state carries the most useful information and serves as a writable memory during the expansion step. Next, the Relation Identification component decides which specific head nodes to expand by computing multiple attention scores {a^{g_i}_t}_{i=1..k} over the existing nodes. New arcs are generated according to the attention scores. Then the Concept Prediction component updates the parser state h_t with the arc information, computes the attention vector a^s_t over the sentence, and accordingly chooses a specific part of the sentence to generate the new concept c_t. Finally, the Relation Classification component predicts the relation labels between the newly generated concept and its predecessors. Consequently, an updated graph G_t is produced, and G_t will be processed at the next time step. The whole decoding procedure terminates when the newly generated concept is the special stop concept.
Our method expands the graph in a root-to-leaf fashion: nodes with shorter distances to the root are introduced first. This follows a way similar to how humans grasp meaning: first seeking the main concepts, then proceeding to the substructures governed by certain head concepts (Banarescu et al., 2013).
During training, we use breadth-first search to decide the order of nodes. However, for nodes with multiple children, there still exist multiple valid selections. In order to define a deterministic decoding process, we sort sibling nodes by their relations to the head node. We will present more discussions on the choice of sibling order in § 5.3.
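The breadth-first ordering with relation-sorted siblings can be sketched as below. The graph encoding (a child-adjacency dict plus a map from (head, child) pairs to relation labels) and sorting siblings alphabetically by relation are illustrative assumptions, not the paper's exact convention.

```python
from collections import deque

def bfs_order(graph, root, relation):
    """Deterministic breadth-first node order for training: children of
    each head are visited sorted by their relation label to the head."""
    order, seen = [], {root}
    queue = deque([root])
    while queue:
        head = queue.popleft()
        order.append(head)
        # sort siblings by their relation to the head node
        children = sorted(graph.get(head, []),
                          key=lambda c: relation[(head, c)])
        for c in children:
            if c not in seen:  # reentrant nodes are enqueued only once
                seen.add(c)
                queue.append(c)
    return order
```

Any fixed sibling key yields a deterministic decoding order; § 5.3 discusses alternatives.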

Sentence & Graph Representation
The Transformer encoder architecture is employed for both the Sentence Encoder and the Graph Encoder components. For sentence encoding, a special token (£) is prepended to the input word sequence; its final hidden state s_0 is regarded as an aggregated summary of the whole sentence and is used as the initial state during parsing.
The Graph Encoder component takes the previously generated concept sequence (c_0, c_1, . . . , c_t−1) as input, where c_0 is the dummy node d*. For computational efficiency and to reduce error propagation, instead of encoding the edge information explicitly, we use the Transformer encoder to capture the interactions between nodes. Finally, the encoder outputs a sequence of node representations (v_0, v_1, . . . , v_t−1).

Focus Selection
At each time step t, the Focus Selection component reads the sentence and the partially constructed graph repeatedly, gradually locating and collecting the most relevant information for the next expansion. We simulate the repeated reading with multiple levels of attention. Formally, the following recurrence is applied L times:

h̃^l_t = LN(h^{l−1}_t + T(h^{l−1}_t, s_1:n)),
h^l_t = LN(h̃^l_t + T(h̃^l_t, v_0:t−1)),

where T(·, ·) is the multi-head attention function, LN is layer normalization (Ba et al., 2016), and h^0_t is always initialized with s_0. For clarity, we denote the last hidden state h^L_t as h_t, the parser state at time step t. We now proceed to present the details of each decision stage of one parsing step, which is also illustrated in Figure 2.
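The repeated reading over sentence and graph can be sketched as follows. This is a loose reconstruction of the described recurrence under our own assumptions: `attend` is a single-head stand-in for the multi-head function T(·,·), and the alternation order (sentence first, then graph) is illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(x, Y):
    """Single-head, parameter-free stand-in for T(x, y_1:m)."""
    a = softmax(Y @ x / np.sqrt(x.shape[0]))
    return a @ Y

def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / (x.std() + eps)

def focus_selection(s, v, L=2):
    """Repeated reading: alternate attention over the sentence states
    s[1:] and the graph states v, starting from the summary state s[0]."""
    h = s[0]                                  # h^0_t is initialized with s_0
    for _ in range(L):
        h = layer_norm(h + attend(h, s[1:]))  # read the sentence
        h = layer_norm(h + attend(h, v))      # read the partial graph
    return h                                  # parser state h_t
```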

Relation Identification
Our Relation Identification component is inspired by a recent attempt to impose auxiliary supervision on the attention mechanism (Strubell et al., 2018). It can be considered another attention layer over the existing graph, except that the attention weights explicitly indicate the likelihood of the new node being attached to a specific existing node. In other words, it answers the question of where to expand. Since a node can be attached to multiple nodes, playing a different semantic role in each, we utilize multi-head attention and take the maximum over heads as the final arc probabilities.
Formally, through a multi-head attention mechanism taking h_t and v_0:t−1 as input, we obtain a set of attention weights {a^{g_i}_t}_{i=1..k}, where k is the number of attention heads and a^{g_i}_t is the i-th head's probability vector. The probability of an arc between the new node and node v_j is then computed as a^g_{t,j} = max_i(a^{g_i}_{t,j}). Intuitively, each head is in charge of a set of possible relations (though not explicitly specified). If no such relation exists between the new node and any existing node, the probability mass is assigned to the dummy node d*. The maximum pooling reflects that an arc should be built once any single relation is activated. (We also found that there may exist more than one relation between two distinct nodes; however, this rarely happens.) The attention mechanism then passes the arc decisions to later layers by updating the parser state.
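The max-over-heads arc scoring can be sketched as below; node index 0 plays the role of the dummy node d*. The per-head projections `Wq`, `Wk` are illustrative parameters, not the paper's trained weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def arc_probabilities(h, V_nodes, Wq, Wk):
    """Compute one attention distribution over existing nodes per head,
    then take the max over heads: a^g_{t,j} = max_i a^{g_i}_{t,j}."""
    heads = []
    for wq, wk in zip(Wq, Wk):          # one (query, key) projection per head
        q = h @ wq
        K = V_nodes @ wk
        heads.append(softmax(K @ q / np.sqrt(q.shape[0])))
    A = np.stack(heads)                  # shape (k, t): one distribution per head
    return A.max(axis=0)                 # elementwise max over the k heads
```

Because each head is a distribution, an arc probability near 1 under any one head is enough to trigger the arc, matching the "build the arc once one relation is activated" intuition.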

Concept Prediction

Our Concept Prediction component uses a soft alignment between words and the new concept. Concretely, a single-head attention a^s_t is computed based on the parser state h_t and the sentence representation s_1:n, where a^s_{t,i} denotes the attention weight of the word w_i at the current time step. This component then updates the parser state with the alignment information:

ĥ_t = LN(h_t + Σ_i a^s_{t,i} s_i).

The probability of generating a specific concept c from the concept vocabulary V is calculated as

P(c | gen) = exp(x_c^T ĥ_t) / Σ_{c′∈V} exp(x_{c′}^T ĥ_t),

where x_c (for c ∈ V) denotes model parameters. To address the data sparsity issue in concept prediction, we introduce a copy mechanism. Besides generation, our model can either directly copy an input token w_i (e.g., for entity names) or map w_i to one concept m(w_i) according to alignment statistics in the training data (e.g., for "went", it would propose go). Formally, the prediction probability of a concept c is given by:

P(c) = P(gen|h_t) P(c | gen) + P(copy|h_t) Σ_i a^s_{t,i} [[w_i = c]] + P(map|h_t) Σ_i a^s_{t,i} [[m(w_i) = c]],

where [[·]] is the indicator function. P(copy|h_t), P(map|h_t), and P(gen|h_t) are the probabilities of the three prediction modes, computed by a single-layer neural network with softmax activation.
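The three-mode mixture can be sketched directly from its definition. All argument names here are illustrative; `mode_probs` stands for P(gen|h_t), P(copy|h_t), P(map|h_t), and `align` for the sentence attention a^s_t.

```python
def concept_probability(c, mode_probs, gen_probs, align, tokens, mapping):
    """P(c) mixes three modes: generate c from the vocabulary, copy an
    input token verbatim, or map a token to a concept via alignment
    statistics (e.g. "went" -> "go-01")."""
    # generation mode, weighted by P(gen)
    p = mode_probs["gen"] * gen_probs.get(c, 0.0)
    for w, a in zip(tokens, align):
        if w == c:                       # copy mode: token is used as-is
            p += mode_probs["copy"] * a
        if mapping.get(w) == c:          # map mode: lemma-to-concept table
            p += mode_probs["map"] * a
    return p
```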

Relation Classification
Lastly, the Relation Classification component employs a multi-class classifier for labeling the arcs detected by the Relation Identification component. The classifier uses a biaffine function to score each label, given the head concept representation v_i and the child vector h_t as input:

score(l) = v_i^T W_l h_t + U_l v_i + V_l h_t + b_l,

where W, U, V, b are model parameters (one slice per label l). As suggested by Dozat and Manning (2016), we project v_i and h_t to a lower dimension to reduce the computational cost and avoid overfitting. The label probabilities are computed by a softmax over all label scores.
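A minimal biaffine scorer, assuming one bilinear slice plus two linear terms and a bias per label (the low-dimensional projections are omitted for brevity):

```python
import numpy as np

def biaffine_scores(v_i, h_t, W, U, V, b):
    """Score every relation label for the arc (v_i -> new node).
    Shapes: W (L, d, d), U (L, d), V (L, d), b (L,) for L labels."""
    bilinear = np.einsum("d,lde,e->l", v_i, W, h_t)  # v_i^T W_l h_t per label
    return bilinear + U @ v_i + V @ h_t + b

def label_probabilities(v_i, h_t, W, U, V, b):
    """Softmax over all label scores."""
    s = biaffine_scores(v_i, h_t, W, U, V, b)
    e = np.exp(s - s.max())
    return e / e.sum()
```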

Reentrancies
AMR employs reentrancy when a node participates in multiple semantic relations (i.e., has multiple parent nodes); this is why AMRs are graphs rather than trees. Reentrancies are often hard to handle. While previous work either removes them (Guo and Lu, 2018) or relies on rule-based restoration in a postprocessing stage (Lyu and Titov, 2018; van Noord and Bos, 2017), our model provides a new and principled way to deal with reentrancies: when a new node is generated, all its connections to existing nodes are determined by the multi-head attention. For example, for a node with k parent nodes, k different heads will point to those parent nodes, respectively. For a better understanding of our model, pseudocode is presented in Algorithm 1.

Training and Inference
Our model is trained to maximize the log likelihood of the gold AMR graphs given sentences, i.e., log P(G|w), which can be factorized as:

log P(G|w) = Σ_{t=1}^{m} [ log P(c_t | ·) + Σ_{i∈pred(t)} ( log P(arc_it | ·) + log P(rel_arc_it | ·) ) ],

where m is the total number of vertices, pred(t) denotes the set of predecessor nodes of c_t, arc_it denotes the arc between c_i and c_t, and rel_arc_it indicates the arc label (relation type). As mentioned, GSP is an autoregressive model, like seq2seq and transition-based models, but it factors the distribution according to a top-down graph structure rather than a depth-first traversal or a left-to-right chain. Meanwhile, GSP has a clear separation of node, arc, and relation-label probabilities, which interact in a more interpretable and tighter manner.
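The factorization can be sketched as a sum of per-step log probabilities; the `steps` data structure (one dict per generated concept) is purely illustrative.

```python
import math

def graph_log_likelihood(steps):
    """log P(G|w) decomposed over parsing steps: each step t contributes
    the concept probability plus, for every predecessor i of c_t, the arc
    and relation-label probabilities."""
    total = 0.0
    for step in steps:
        total += math.log(step["p_concept"])
        for p_arc, p_rel in step["predecessors"]:
            total += math.log(p_arc) + math.log(p_rel)
    return total
```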

Algorithm 1 Graph Spanning based Parsing
Input: the input sentence w = (w_1, w_2, . . . , w_n)
Output: the AMR graph G corresponding to w
▷ Learning sentence representation
1: w = (w_0 = £, w_1, w_2, . . . , w_n)
2: s_0, s_1, s_2, . . . , s_n = Transformer(w)
▷ Initialization
3: initialize the graph G_0 with the dummy node c_0 = d*
4: initialize time step t = 1
▷ Main spanning loop
5: while True do
6:     compute the parser state h_t (Focus Selection)
7:     compute arc probabilities over existing nodes (Relation Identification)
8:     predict the new concept c_t (Concept Prediction)
9:     if c_t is the stop concept then break
10:    label the arcs of c_t (Relation Classification)
11:    update G_t−1 to G_t; t = t + 1
12: end while
13: return G_t−1

At test time, the prediction for an input w is obtained via Ĝ = arg max_G′ P(G′|w). Rather than iterating over all possible graphs, we adopt beam search to approximate the best graph. Specifically, for each partially constructed graph, we only consider the top-K concepts with the best single-step probability (a product of the corresponding concept, arc, and relation-label probabilities), where K is the beam size. Only the best K graphs at each time step are kept for the next expansion.
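One expansion step of the beam search can be sketched as below. The `expansions` callback, which proposes scored single-step extensions for a partial graph, is an assumed interface for illustration.

```python
import heapq

def beam_step(beams, expansions, K):
    """One step of beam search: each partial graph (with its accumulated
    log probability) proposes candidate extensions; only the best K
    candidates survive to the next time step."""
    candidates = []
    for graph, logp in beams:
        for concept, step_logp in expansions(graph):
            candidates.append((graph + [concept], logp + step_logp))
    return heapq.nlargest(K, candidates, key=lambda c: c[1])
```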

Setup
We focus on the most recent LDC2017T10 dataset, as it is the largest AMR corpus. It consists of 36521, 1368, and 1371 sentences in the training, development, and testing sets respectively.
We use Stanford CoreNLP (Manning et al., 2014) for text preprocessing, including tokenization, lemmatization, part-of-speech tagging, and named-entity tagging. The input to the sentence encoder consists of randomly initialized lemma, part-of-speech tag, and named-entity tag embeddings, as well as the output of a learnable CNN over character embeddings. The graph encoder uses randomly initialized concept embeddings and another character-level CNN. Model hyper-parameters are chosen by experiments on the development set; the details of the hyper-parameter settings are provided in the Appendix. During testing, we use a beam size of 8 for generating graphs.
Conventionally, the quality of AMR parsing results is evaluated using the Smatch tool, which seeks the maximum number of overlapping triples between two AMR annotations after decomposing the AMR graphs into triples. However, the ordinary Smatch metric treats all triples equally, regardless of their roles in composing the whole sentence meaning. We refine the ordinary Smatch metric to take the notion of core semantics into consideration. Specifically, we compute:
• Smatch-weighted: This metric weights triples by their importance in composing the core ideas. The root distance d of a triple is defined as the minimum root distance of its involved nodes; the weight of the triple then decays linearly in d up to a threshold d_thr. If two triples are matched, the minimum importance score of the two is used. In our experiments, d_thr is set to 5.
• Smatch-core: This metric only compares the subgraphs representing the main meaning. Precisely, we cut down AMR graphs by setting a maximum root distance d_max and keep only the nodes and edges within the threshold. d_max is set to 4 in our experiments; the remaining subgraphs still cover a broad portion of the original meaning, as illustrated by the distribution of root distances in Figure 3.
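Both refined metrics rest on per-node root distances. The sketch below illustrates the triple weighting and the core-subgraph pruning under two stated assumptions: graphs are given as a simple adjacency-list mapping (an illustrative representation, not the paper's), and the weight follows the linear-decay form w(d) = max(1 − d/d_thr, 0), which is one consistent reading of the description above.

```python
from collections import deque


def root_distances(root, edges):
    """Shortest directed distance from the root to every reachable node.

    `edges` maps a node to a list of (relation, child) pairs; BFS handles
    reentrancies by keeping the first (shortest) distance found.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _, child in edges.get(node, []):
            if child not in dist:
                dist[child] = dist[node] + 1
                queue.append(child)
    return dist


def triple_weight(node_dists, d_thr=5):
    """Smatch-weighted: a triple's root distance is the minimum root
    distance of its involved nodes; its weight decays linearly until d_thr
    (5 in the experiments)."""
    d = min(node_dists)
    return max(1.0 - d / d_thr, 0.0)


def prune_to_core(root, edges, d_max=4):
    """Smatch-core: keep only nodes (and edges between kept nodes) whose
    root distance is at most d_max (4 in the experiments)."""
    dist = root_distances(root, edges)
    kept = {n for n, d in dist.items() if d <= d_max}
    core_edges = {n: [(r, c) for r, c in edges.get(n, []) if c in kept]
                  for n in kept}
    return kept, core_edges
```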
Besides, we also evaluate the quality by computing the following metrics. (Our code can be found at https://github.com/jcyk/AMR-parser.)
• complete-match (CM): This metric counts the number of parsing results that are completely correct.
• root-accuracy (RA): This metric measures the accuracy of the root concept identification.
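The two simpler metrics can be computed directly from parsed/gold pairs. A sketch under an illustrative representation of a graph as a (root_concept, triples) pair, with CM taken as exact equality and both metrics reported as fractions:

```python
def ra_and_cm(predictions, golds):
    """root-accuracy (RA): fraction of parses whose root concept matches
    the gold root; complete-match (CM): fraction of parses that are
    completely correct (here, exact equality of root and triple set)."""
    assert len(predictions) == len(golds) and golds
    ra = sum(p[0] == g[0] for p, g in zip(predictions, golds))
    cm = sum(p == g for p, g in zip(predictions, golds))
    return ra / len(golds), cm / len(golds)
```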

Main Results and Case Study
The main results are presented in Table 1. We compare our method with the best-performing models in each category discussed in § 2. Concretely, van Noord and Bos (2017) is a character-level seq2seq model that achieves a very competitive result. However, their model is data demanding, as it requires training on an additional 100K sentence-AMR pairs generated by other parsers. Guo and Lu (2018) is a transition-based parser with a refined search space for AMR, where certain concepts and relations (e.g., reentrancies) are removed to reduce the burden during training. Lyu and Titov (2018) is a graph-based method that achieves the best reported result under the ordinary Smatch metric. Their parser uses different LSTMs for concept prediction, relation identification, and root identification sequentially, and the relation identification stage has a time complexity of O(m^2 log m), where m is the number of concepts. Groschwitz et al. (2018) view AMR as terms of the AM algebra (Groschwitz et al., 2017), which allows standard tree-based parsing techniques to be applied; the complexity of their projective decoder is O(m^5). Last but not least, all of these models except that of van Noord and Bos (2017) require hand-crafted heuristics for graph re-categorization.
We consider the Smatch-weighted metric the most suitable for measuring a parser's quality in capturing core semantics. The comparison shows that our method significantly outperforms all other methods. The Smatch-core metric also demonstrates the advantage of our

method in capturing the core ideas. Besides, our model achieves the highest root-accuracy (RA) and complete-match (CM), which further confirms the usefulness of a global view and the core semantic first principle.

[Table 1: Main results. Columns report the use of graph re-categorization (Re-ca.) and Smatch-weighted, Smatch-core, ordinary Smatch, RA, and CM scores (%), for models including Buys and Blunsom (2017) and van Noord and Bos (2017); the full figures are not recoverable from the extracted text.]

[Figure 4: Case study.]
Even under the ordinary Smatch metric, our model yields better results than all previously reported models except Lyu and Titov (2018), which relies on extensive hand-crafted rules for graph re-categorization and adopts a pipeline approach. Note that our parser constructs the AMR graph in an end-to-end fashion with a better (quadratic) time complexity.
We present a case study in Figure 4, comparing with the output of Lyu and Titov (2018)'s parser. As seen, both parsers make some mistakes. Specifically, our method fails to identify the concept generated-01. While Lyu and Titov (2018)'s parser successfully identifies it, it mistakenly treats that concept as the root of the whole AMR, a serious error that causes the sentence meaning to be interpreted in the wrong way. In contrast, our method shows a strong capacity for capturing the main idea, "the solution is about some patterns and a balance". However, under the ordinary Smatch metric, their graph obtains a higher score (68% vs. 66%), which indicates that the ordinary Smatch is not a proper metric for evaluating the quality of capturing core semantics. Under the Smatch-weighted metric, our method achieves the better score (74% vs. 61%).

More Results
To reveal our parser's ability to grasp meanings at different levels of granularity, we plot the Smatch-core scores in Figure 5 with varying maximum root distance d_max, compared with several strong baselines and the state-of-the-art model. The plot demonstrates that our method is better at abstracting the core ideas of a sentence. As discussed in § 3, there can be multiple valid generation orders for the sibling nodes in an AMR graph. We experiment with the following traversal variants: (1) random, which sorts the sibling nodes in a completely random order; (2) relation freq., which sorts the sibling nodes according to their relations to the head node, assigning higher priority to relations that occur more frequently, so that our parser always seeks the most common relation first; (3) combined, which combines the two strategies above by using random and relation freq. with equal chance. As seen in Table 2, the deterministic order strategy (relation freq.) achieves better performance than the random order. Interestingly, the combined strategy significantly boosts the performance. The reason is that the random order potentially produces a larger set of training pairs, since each random permutation can be considered a different training pair, while the deterministic order stabilizes maximum likelihood training. The combined strategy therefore benefits from both worlds.
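The three traversal variants can be sketched as follows. This is an illustrative sketch, not the paper's code: the function name and the (relation, node) pair representation are assumptions, and relation frequencies would be counted over the training set.

```python
import random


def order_siblings(siblings, relation_freq, strategy="combined",
                   rng=random.Random(0)):
    """Order sibling (relation, node) pairs into a generation order.

    - "random": a completely random permutation (each permutation can act
      as a distinct training pair).
    - "relation freq.": most frequent relations first, so the parser
      always seeks the most common relation first.
    - "combined": pick one of the two strategies with equal chance.
    """
    siblings = list(siblings)
    if strategy == "combined":
        strategy = rng.choice(["random", "relation freq."])
    if strategy == "random":
        rng.shuffle(siblings)
        return siblings
    # Deterministic: sort by descending relation frequency.
    return sorted(siblings, key=lambda rc: -relation_freq.get(rc[0], 0))
```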

Conclusion and Future Work
We presented the first top-down AMR parser. Our proposed parser builds an AMR graph incrementally in a root-to-leaf manner. Experiments show that our method has a better capability of capturing the core semantics of a sentence than previous state-of-the-art methods. In addition, we eliminate the need for the graph re-categorization heuristics employed in most previous work, which makes our method more transferable to other semantic representations and languages.
Our method follows the intuition that humans tend to grasp the core meaning of a sentence first. However, some cognitive theories (Langacker, 2008) also suggest that human language understanding is a circular, abductive process (the hermeneutic circle). It would be interesting to explore revision mechanisms for when the initial steps go wrong.