AMR Parsing as Sequence-to-Graph Transduction

We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers, which rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free and can be effectively trained with limited amounts of labeled AMR data. Our parser surpasses all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).


Introduction
Abstract Meaning Representation (AMR, Banarescu et al., 2013) parsing is the task of transducing natural language text into AMR, a graph-based formalism used for capturing sentence-level semantics. Challenges in AMR parsing include: (1) its property of reentrancy (the same concept can participate in multiple relations), which leads to graphs in contrast to trees (Wang et al., 2015); (2) the lack of gold alignments between nodes (concepts) in the graph and words in the text, which limits attempts to rely on explicit alignments to generate training data (Flanigan et al., 2014; Wang et al., 2015; Damonte et al., 2017; Foland and Martin, 2017; Peng et al., 2017b; Groschwitz et al., 2018; Guo and Lu, 2018); and (3) relatively limited amounts of labeled data (Konstas et al., 2017).
In this paper, we introduce a different way to handle reentrancy, and propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. The proposed model, supported by an extended pointer-generator network, is aligner-free and can be effectively trained with limited amounts of labeled AMR data. Experiments on two publicly available AMR benchmarks demonstrate that our parser clearly outperforms the previous best parsers on both benchmarks. It achieves the best reported SMATCH scores: 76.3% F1 on LDC2017T10 and 70.2% F1 on LDC2014T12. We also provide extensive ablative and qualitative studies, quantifying the contributions from each component. Our model implementation is available at https://github.com/sheng-z/stog.
In Figure 1(a), for example, "victim" is both the ARG0 and the ARG1 of "help-01". While efforts have gone into developing graph-based algorithms for AMR parsing (Chiang et al., 2013; Flanigan et al., 2014), it is more challenging to parse a sentence into an AMR graph than into a tree, since efficient off-the-shelf algorithms exist for trees, e.g., Chu and Liu (1965); Edmonds (1968). To leverage these tree-based algorithms as well as other structured prediction paradigms (McDonald et al., 2005), we introduce another view of reentrancy. AMR reentrancy is employed when a node participates in multiple semantic relations. We convert an AMR graph into a tree by duplicating nodes that have reentrant relations; that is, whenever a node has a reentrant relation, we make a copy of that node and use the copy to participate in the relation, thereby resulting in a tree. Next, in order to preserve the reentrancy information, we add an extra layer of annotation by assigning an index to each node. Duplicated nodes are assigned the same index as the original node. Figure 1(b) shows a resultant AMR tree: subscripts of nodes are indices; the two "victim" nodes have the same index as they refer to the same concept. The original AMR graph can be recovered by merging identically indexed nodes and unioning edges from/to these nodes. Similar ideas were used by Artzi et al. (2015), who introduced Skolem IDs to represent anaphoric references in the transformation from CCG to AMR, and van Noord and Bos (2017a), who kept co-indexed AMR variables and converted them to numbers.
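The graph-to-tree conversion and its inverse can be sketched as follows. This is our own minimal illustration (not the released stog code), representing the graph as (head, relation, dependent) triples over concept names:

```python
# Sketch: duplicate reentrant nodes and share one index per concept,
# so the original graph can be recovered by merging identically
# indexed nodes and unioning their edges.

def graph_to_indexed_tree(graph_edges, root):
    """graph_edges: (head, relation, dependent) triples over concept names.
    Returns tree edges over (concept, index) pairs; a concept reached
    through several relations simply appears several times, each copy
    carrying the same index."""
    index_of = {root: 1}
    tree_edges = []
    for head, rel, dep in graph_edges:
        for c in (head, dep):
            index_of.setdefault(c, len(index_of) + 1)
        tree_edges.append(((head, index_of[head]), rel, (dep, index_of[dep])))
    return tree_edges

def recover_graph(tree_edges):
    """Merge identically indexed nodes and union their edges."""
    return sorted({(h, rel, d) for (h, _), rel, (d, _) in tree_edges})
```

For the example graph of Figure 1(a), both "victim" copies receive the same index, and `recover_graph` returns the original edge set.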

Task Formalization
If we consider the AMR tree with indexed nodes as the prediction target, then our approach to parsing is formalized as a two-stage process: node prediction and edge prediction.¹ An example of the parsing process is shown in Figure 2.

Node Prediction  Given an input sentence w = w_1, ..., w_n, where each w_i is a word in the sentence, our approach sequentially decodes a list of nodes u = u_1, ..., u_m and deterministically assigns their indices d = d_1, ..., d_m.
Note that we allow the same node to occur multiple times in the list; multiple occurrences of a node will be assigned the same index. We choose to predict nodes sequentially rather than simultaneously because (1) we believe the current node generation is informative for future node generation, and (2) variants of efficient sequence-to-sequence models (Bahdanau et al., 2014; Vinyals et al., 2015) can be employed to model this process. At training time, we obtain the reference list of nodes and their indices using a pre-order traversal over the reference AMR tree. We also evaluate other traversal strategies, and will discuss their differences in Section 7.2.

Edge Prediction  Given an input sentence w, a node list u, and indices d, we look for the highest-scoring parse tree y in the space Y(u) of valid trees over u under the constraint of d. A parse tree y is a set of directed head-modifier edges y = {(u_i, u_j) | 1 ≤ i, j ≤ m}. In order to make the search tractable, we follow the arc-factored graph-based approach (McDonald et al., 2005; Kiperwasser and Goldberg, 2016), decomposing the score of a tree into the sum of the scores of its head-modifier edges:

score(y) = Σ_{(u_i, u_j) ∈ y} score(u_i, u_j).

Based on the edge scores, the highest-scoring parse tree (i.e., maximum spanning arborescence) can be found efficiently using the Chu-Liu-Edmonds algorithm. We further incorporate indices as constraints in the algorithm, as described in Section 4.4. After obtaining the parse tree, we merge identically indexed nodes to recover the standard AMR graph.

¹ The two-stage process is similar to "concept identification" and "relation identification" in Flanigan et al. (2014); Zhou et al. (2016); Lyu and Titov (2018); inter alia.

Figure 2: A two-stage process of AMR parsing for the example sentence "The victim could help himself." We remove senses (i.e., -01, -02, etc.) as they will be assigned in the post-processing step.

Figure 3: Extended pointer-generator network for node prediction. For each decoding time step, three probabilities p_src, p_tgt, and p_gen are calculated. The source and target attention distributions as well as the vocabulary distribution are weighted by these probabilities, respectively, and then summed to obtain the final distribution, from which we make our prediction. Best viewed in color.
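The index constraint incorporated into the Chu-Liu-Edmonds search (Section 4.4) can be sketched as a mask over the edge-score matrix. This is our own reconstruction of the idea, not the paper's Algorithm 1:

```python
# Sketch: disable edges that would be invalid before running the
# maximum-spanning-arborescence search.

def mask_edge_scores(scores, indices):
    """scores[i][j]: score of edge u_i -> u_j, with u_0 a dummy root
    (indices[0] assumed unique). Edges into the root, self-loops, and
    edges between identically indexed nodes (which would become
    self-loops after merging) are disabled."""
    NEG = float("-inf")
    out = [row[:] for row in scores]
    n = len(indices)
    for i in range(n):
        for j in range(n):
            if j == 0 or i == j or indices[i] == indices[j]:
                out[i][j] = NEG
    return out
```

Any standard Chu-Liu-Edmonds implementation can then be run on the masked matrix unchanged.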

Model
Our model has two main modules: (1) an extended pointer-generator network for node prediction; and (2) a deep biaffine classifier for edge prediction.The two modules correspond to the two-stage process for AMR parsing, and they are jointly learned during training.

Extended Pointer-Generator Network
Inspired by the self-copy mechanism in Zhang et al. (2018), we extend the pointer-generator network (See et al., 2017) for node prediction. The pointer-generator network was proposed for text summarization; it can copy words from the source text via pointing, while retaining the ability to produce novel words through the generator.
The major difference in our extension is that it can copy nodes not only from the source text but also from previously generated nodes on the target side. This target-side pointing is well suited to our task, since the nodes we predict can be copies of other nodes. While there are other pointer/copy networks (Gulcehre et al., 2016; Merity et al., 2016; Gu et al., 2016; Miao and Blunsom, 2016; Nallapati et al., 2016), we found the pointer-generator network very effective at reducing data sparsity in AMR parsing, as will be shown in Section 7.2.
As depicted in Figure 3, the extended pointer-generator network consists of four major components: an encoder embedding layer, an encoder, a decoder embedding layer, and a decoder.

Encoder Embedding Layer  This layer converts words in input sentences into vector representations. Each vector is the concatenation of embeddings of GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018), POS (part-of-speech) tags, and anonymization indicators, and features learned by a character-level convolutional neural network (CharCNN, Kim et al., 2016).
Anonymization indicators are binary indicators that tell the encoder whether a word is an anonymized word. In preprocessing, text spans of named entities in input sentences are replaced by anonymized tokens (e.g., person, country) to reduce sparsity (see the Appendix for details).
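A minimal sketch of this anonymization step, assuming entity spans come from a hypothetical NER preprocessing pass (the span format and type names here are our illustration, not the released preprocessing code):

```python
def anonymize(tokens, entity_spans):
    """tokens: input words; entity_spans: (start, end, type) triples from a
    hypothetical NER step, with coarse types such as 'person' or 'country'.
    Replaces each span with its anonymized token and returns the binary
    indicator vector fed to the encoder (1 = anonymized word)."""
    out = list(tokens)
    for start, end, etype in sorted(entity_spans, reverse=True):
        out[start:end] = [etype]          # right-to-left keeps offsets valid
    types = {t for _, _, t in entity_spans}
    indicators = [1 if w in types else 0 for w in out]
    return out, indicators
```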
Except for BERT, all other embeddings are fetched from their corresponding learned embedding lookup tables. BERT takes subword units as input, which means that one word may correspond to multiple hidden states of BERT. In order to accurately use these hidden states to represent each word, we apply an average pooling function to the outputs of BERT. Figure 4 illustrates the process of generating word-level embeddings from BERT.

Encoder  The encoder is a multi-layer bidirectional RNN (Schuster and Paliwal, 1997):

h^l_i = [→f_l(h^{l−1}_i, h^l_{i−1}); ←f_l(h^{l−1}_i, h^l_{i+1})],

where →f_l and ←f_l are two LSTM cells (Hochreiter and Schmidhuber, 1997); h^l_i is the l-th layer encoder hidden state at time step i; h^0_i is the encoder embedding layer output for word w_i.

Decoder Embedding Layer  Similar to the encoder embedding layer, this layer outputs vector representations for AMR nodes. The difference is that each vector is the concatenation of embeddings of GloVe, POS tags, and indices, and feature vectors from CharCNN.
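The average pooling over BERT subword states (Figure 4) can be sketched as follows; the subword-to-word span mapping is assumed to come from the wordpiece tokenizer:

```python
import numpy as np

def word_level_embeddings(subword_states, word_spans):
    """subword_states: (num_subwords, hidden) last-layer BERT states.
    word_spans: one (start, end) subword index range per original word
    (a hypothetical output of the wordpiece tokenizer).
    Average-pools each range into a single word-level embedding."""
    return np.stack([subword_states[s:e].mean(axis=0) for s, e in word_spans])
```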
POS tags of nodes are inferred at runtime: if a node is a copy from the input sentence, the POS tag of the corresponding word is used; if a node is a copy from the preceding nodes, the POS tag of its antecedent is used; if a node is a new node emitted from the vocabulary, an UNK tag is used.
We do not include BERT embeddings in this layer because AMR nodes, especially their order, are significantly different from natural language text (on which BERT was pre-trained). We tried using "fixed" BERT in this layer, which did not lead to improvement.²

Decoder  At each step t, the decoder (an l-layer unidirectional LSTM) receives hidden state s^{l−1}_t from the last layer and hidden state s^l_{t−1} from the previous time step, and generates hidden state s^l_t:

s^l_t = f_l(s^{l−1}_t, s^l_{t−1}),

where s^0_t is the concatenation (i.e., the input-feeding approach, Luong et al., 2015) of two vectors: the decoder embedding layer output for the previous node u_{t−1} (during training, u_{t−1} is the previous node of the reference node list; at test time it is the previous node emitted by the decoder), and the attentional vector s̃_{t−1} from the previous step (explained later in this section). s^l_0 is the concatenation of the last encoder hidden states from →f_l and ←f_l respectively.

Source attention distribution a^t_src is calculated by additive attention (Bahdanau et al., 2014):

e^t_src = v⊤_src tanh(W_src h^l_{1:n} + U_src s^l_t + b_src),
a^t_src = softmax(e^t_src),

and it is then used to produce a weighted sum of encoder hidden states, i.e., the context vector c_t.
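The additive attention and context-vector computation can be sketched numerically as follows (our own illustration; parameter shapes are assumptions consistent with the formulas above):

```python
import numpy as np

def additive_attention(H, s, W_src, U_src, v_src, b_src):
    """H: (n, dh) encoder states h^l_{1:n}; s: (ds,) decoder state s^l_t.
    Parameter shapes: W_src (dh, da), U_src (ds, da), v_src (da,), b_src (da,).
    Returns the attention distribution a^t_src and the context vector c_t."""
    e = np.tanh(H @ W_src + s @ U_src + b_src) @ v_src   # (n,) scores
    e = np.exp(e - e.max())
    a = e / e.sum()                                      # softmax
    c = a @ H                                            # weighted sum of states
    return a, c
```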
Attentional vector s̃_t combines both source and target side information, and it is calculated by an MLP (shown in Figure 3):

s̃_t = tanh(W_c[c_t; s^l_t] + b_c).

The attentional vector s̃_t has three usages:
(1) it is fed through a linear layer and softmax to produce the vocabulary distribution:

P_vocab = softmax(W_vocab s̃_t + b_vocab);

(2) it is used to calculate the target attention distribution a^t_tgt:

e^t_tgt = v⊤_tgt tanh(W_tgt s̃_{1:t−1} + U_tgt s̃_t + b_tgt),
a^t_tgt = softmax(e^t_tgt);

(3) it is used to calculate the source-side copy probability p_src, the target-side copy probability p_tgt, and the generation probability p_gen via a switch layer:

[p_src, p_tgt, p_gen] = softmax(W_switch s̃_t + b_switch).

Note that p_src + p_tgt + p_gen = 1. They act as a soft switch to choose between copying an existing node from the preceding nodes by sampling from the target attention distribution a^t_tgt, or emitting a new node in two ways: (1) generating a new node from the fixed vocabulary by sampling from P_vocab, or (2) copying a word (as a new node) from the input sentence by sampling from the source attention distribution a^t_src.

The final probability distribution P^(node)(u_t) for node u_t is defined as follows. If u_t is a copy of existing nodes, then:

P^(node)(u_t) = p_tgt Σ_{i: u_i = u_t} a^t_tgt[i];

otherwise:

P^(node)(u_t) = p_gen P_vocab(u_t) + p_src Σ_{i: w_i = u_t} a^t_src[i],

where a^t[i] indexes the i-th element of a^t. Note that a new node may have the same surface form as an existing node. We track their difference using indices. The index d_t for node u_t is assigned deterministically: if u_t is a target-side copy of an existing node u_j, then d_t = d_j; otherwise u_t receives a new index.
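The soft switch and the mixing of the three distributions can be sketched as follows. This is our own illustration: target-side copies are keyed by position here, standing in for the index bookkeeping described above:

```python
import numpy as np

def final_node_distribution(switch_logits, p_vocab, a_src, a_tgt,
                            vocab, src_words):
    """Mix vocabulary generation, source-side copy, and target-side copy
    into one distribution over candidate nodes. Target-side copies are
    kept distinct as ('copy', position) keys."""
    z = np.exp(switch_logits - switch_logits.max())
    p_src, p_tgt, p_gen = z / z.sum()            # p_src + p_tgt + p_gen = 1
    P = {}
    for node, p in zip(vocab, p_vocab):          # generate from vocabulary
        P[node] = P.get(node, 0.0) + p_gen * p
    for word, p in zip(src_words, a_src):        # copy a source word as a node
        P[word] = P.get(word, 0.0) + p_src * p
    for j, p in enumerate(a_tgt):                # copy a previously emitted node
        P[("copy", j)] = P.get(("copy", j), 0.0) + p_tgt * p
    return P
```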

Deep Biaffine Classifier
For the second stage (i.e., edge prediction), we employ a deep biaffine classifier, which was originally proposed for graph-based dependency parsing (Dozat and Manning, 2016) and has recently been applied to semantic parsing (Peng et al., 2017a; Dozat and Manning, 2018). As depicted in Figure 5, the major difference in our usage is that instead of re-encoding AMR nodes, we directly use decoder hidden states from the extended pointer-generator network as the input to the deep biaffine classifier. We find two advantages of using decoder hidden states as input: (1) through the input-feeding approach, decoder hidden states contain contextualized information from both the input sentence and the predicted nodes; (2) because decoder hidden states are used for both node prediction and edge prediction, we can jointly train the two modules in our model.
Given decoder hidden states s 1 , ..., s m and a learnt vector representation s 0 of a dummy root, we follow Dozat and Manning (2016), factorizing edge prediction into two components: one that predicts whether or not a directed edge (u k , u t ) exists between two nodes u k and u t , and another that predicts the best label for each potential edge.
Edge and label scores are calculated as below:

s^(edge-head)_k = MLP^(edge-head)(s_k),   s^(edge-dep)_t = MLP^(edge-dep)(s_t),
s^(label-head)_k = MLP^(label-head)(s_k), s^(label-dep)_t = MLP^(label-dep)(s_t),
score^(edge)_{k,t} = Biaffine(s^(edge-head)_k, s^(edge-dep)_t),
score^(label)_{k,t} = Bilinear(s^(label-head)_k, s^(label-dep)_t),

where MLP, Biaffine, and Bilinear are defined as below:

MLP(x) = ELU(Wx + b),
Biaffine(x_1, x_2) = x⊤_1 U x_2 + W[x_1; x_2] + b,
Bilinear(x_1, x_2) = x⊤_1 U x_2 + b.

Given a node u_t, the probability of u_k being the edge head of u_t is defined as:

P^(head)_t(u_k) = softmax_k(score^(edge)_{k,t}).

The edge label probability for edge (u_k, u_t) is defined as:

P^(label)_{k,t}(l) = softmax_l(score^(label)_{k,t}[l]).
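The biaffine and bilinear scoring functions can be sketched numerically as follows (our own reconstruction from the definitions above; the head-probability helper simply applies the column-wise softmax):

```python
import numpy as np

def biaffine(x1, x2, U, W, b):
    """Biaffine(x1, x2) = x1^T U x2 + W [x1; x2] + b (scalar edge score)."""
    return x1 @ U @ x2 + W @ np.concatenate([x1, x2]) + b

def bilinear(x1, x2, U, b):
    """Bilinear(x1, x2) = x1^T U x2 + b; with U of shape (d1, L, d2) this
    yields one score per edge label."""
    return np.einsum("i,ilj,j->l", x1, U, x2) + b

def head_probabilities(S_head, S_dep, U, W, b):
    """P(u_k is the head of u_t): softmax over k of score_edge[k, t]."""
    m = len(S_dep)
    scores = np.array([[biaffine(S_head[k], S_dep[t], U, W, b)
                        for t in range(m)] for k in range(m)])
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)     # each column sums to 1
```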

Training
The training objective is to jointly minimize the loss of reference nodes and edges, which can be decomposed into the sum of the negative log likelihoods at each time step t for (1) the reference node u_t, (2) the reference edge head u_k of node u_t, and (3) the reference edge label l between u_k and u_t:

loss = Σ_t [− log P^(node)(u_t) − log P^(head)_t(u_k) − log P^(label)_{k,t}(l) + covloss_t],

where covloss_t is a coverage loss to penalize repetitive nodes:

covloss_t = Σ_i min(cov_t[i], a^t_src[i]),

and cov_t is the sum of source attention distributions over all previous decoding time steps: cov_t = Σ_{t′=0}^{t−1} a^{t′}_src. See See et al. (2017) for full details.
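The coverage penalty can be sketched as follows (a minimal illustration of the covloss term, following the See et al. (2017) definition):

```python
import numpy as np

def coverage_loss(src_attention_history):
    """src_attention_history: (T, n) array, row t = a^t_src.
    covloss_t = sum_i min(cov_t[i], a^t_src[i]), where cov_t accumulates
    all earlier source attention distributions. Returns the total loss."""
    cov = np.zeros(src_attention_history.shape[1])
    total = 0.0
    for a in src_attention_history:
        total += np.minimum(cov, a).sum()   # penalize re-attending covered words
        cov += a                            # accumulate coverage
    return total
```

Attending twice to the same source word is penalized, while spreading attention over fresh words incurs no loss.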

Prediction
For node prediction, based on the final probability distribution P (node) (u t ) at each decoding time step, we implement both greedy search and beam search to sequentially decode a node list u and indices d.
For edge prediction, given the predicted node list u, their indices d, and the edge scores S = {score^(edge)_{i,j} | 0 ≤ i, j ≤ m}, we apply the Chu-Liu-Edmonds algorithm with a simple adaptation to find the maximum spanning tree (MST). As described in Algorithm 1, before calling the Chu-Liu-Edmonds algorithm, we first include a dummy root u_0 to ensure that every node has a head, and then exclude edges whose source and destination nodes have the same indices, because these nodes will be merged into a single node to recover the standard AMR graph, where self-loops are invalid.

Related Work

Alignment-based approaches were first explored by JAMR (Flanigan et al., 2014), a pipeline of concept and relation identification with a graph-based algorithm. Zhou et al. (2016) improved on this by jointly learning concept and relation identification with an incremental model. Both approaches rely on features based on alignments. Lyu and Titov (2018) treated alignments as latent variables in a joint probabilistic model, leading to a substantial reported improvement. Our approach requires no explicit alignments, but implicitly learns a source-side copy mechanism using attention.
Grammar-based approaches are represented by Artzi et al. (2015) and Peng et al. (2015), who leveraged external semantic resources and employed CCG-based or SHRG-based grammar induction approaches to convert logical forms into AMRs. Pust et al. (2015) recast AMR parsing as a machine translation problem, while also drawing features from external semantic resources.
Attention-based parsing with Seq2Seq-style models has been considered (Barzdins and Gosko, 2016; Peng et al., 2017b), but is limited by the relatively small amount of labeled AMR data. Konstas et al. (2017) overcame this by making use of millions of unlabeled sentences through self-training, while van Noord and Bos (2017b) showed significant gains via a character-level Seq2Seq model and a large amount of silver-standard AMR training data. In contrast, our approach, supported by the extended pointer-generator network, can be effectively trained on the limited amount of labeled AMR data, with no data augmentation.

AMR Pre-and Post-processing
Anonymization is often used in AMR preprocessing to reduce sparsity (Werling et al., 2015; Peng et al., 2017b; Guo and Lu, 2018, inter alia). Similar to Konstas et al. (2017), we anonymize sub-graphs of named entities and other entities. Like Lyu and Titov (2018), we remove senses, and use Stanford CoreNLP (Manning et al., 2014) to lemmatize input sentences and add POS tags.
In post-processing, we assign the most frequent sense to nodes (-01, if unseen), like Lyu and Titov (2018), and restore wiki links using the DBpedia Spotlight API (Daiber et al., 2013), following Bjerva et al. (2016) and van Noord and Bos (2017b). We add polarity attributes based on rules observed in the training data. More details of pre- and post-processing are provided in the Appendix.

We conduct experiments on two AMR general releases (available to all LDC subscribers): AMR 2.0 (LDC2017T10) and AMR 1.0 (LDC2014T12). Our model is trained using ADAM (Kingma and Ba, 2014) for up to 120 epochs, with early stopping based on the development set. Full model training takes about 19 hours on AMR 2.0 and 7 hours on AMR 1.0, using two GeForce GTX TITAN X GPUs. During training, we fix BERT parameters due to limited GPU memory; we leave fine-tuning BERT for future work.

Experiments
Table 1 lists the hyper-parameters used in our full model. Both the encoder and decoder embedding layers have GloVe and POS tag embeddings as well as CharCNN, but their parameters are not tied. We apply dropout (dropout rate = 0.33) to the outputs of each module.

Main Results
We compare our approach against the previous best approaches and several recent competitors. Table 2 summarizes their SMATCH scores (Cai and Knight, 2013) on the test sets of two AMR general releases. On AMR 2.0, we outperform the latest push from Naseem et al. (2019) by 0.8% F1, and significantly improve on Lyu and Titov (2018)'s results by 1.9% F1. Compared to the previous best attention-based approach (van Noord and Bos, 2017b), our approach shows a substantial gain of 5.3% F1, with no usage of any silver-standard training data. On AMR 1.0, where the training instances number only around 10k, we improve the best reported results by 1.9% F1.

Fine-grained Results
In Table 3, we assess the quality of each subtask using the AMR-evaluation tools (Damonte et al., 2017). We see a notable increase on reentrancies, which we attribute to target-side copy (based on our ablation studies in the next section). Significant increases are also …

Figure 6 shows the frequency of nodes from different sources, and their corresponding precision and recall based on our model prediction. Among all reference nodes, 43.8% are from vocabulary generation, 47.6% from source-side copy, and only 8.6% from target-side copy. On the one hand, the highest frequency of source-side copy helps address sparsity and results in the highest precision and recall. On the other hand, we see room for improvement, especially in the relatively low recall of target-side copy, which is probably due to its low frequency.

Node Linearization  As described in Section 3, we create the reference node list by a pre-order traversal over the gold AMR tree. As for the children of each node, we sort them in alphanumerical order. This linearization strategy has two advantages: (1) pre-order traversal guarantees that a head node (predicate) always comes before its children (arguments); (2) alphanumerical sorting orders children according to role ID (i.e., ARG0 > ARG1 > ... > ARGn), following intuition from research on Thematic Hierarchies (Fillmore, 1968; Levin and Hovav, 2005). In Table 5, we report SMATCH scores of full models trained and tested on data generated via our linearization strategy (Pre-order + Alphanum), compared to two obvious alternates: the first alternate still runs a pre-order traversal, but sorts the children of each node based on their alignments to input words; the second linearizes nodes purely based on alignments. Alignments are created using the tool by Pourdamghani et al. (2014). Clearly, our linearization strategy leads to much better results than the two alternates. We also tried other traversal strategies, such as combining in-order traversal with alphanumerical sorting or alignment-based sorting, but did not get scores even comparable to the two alternates.⁵

Average Pooling vs. Max Pooling  In Figure 4, we apply average pooling to the outputs (last-layer hidden states) of BERT in order to generate word-level embeddings for the input sentence. Table 6 shows scores of models using different pooling functions. Average pooling performs slightly better than max pooling.
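The Pre-order + Alphanum linearization described above can be sketched as follows; the example tree for "The victim could help himself" is our own rendering with hypothetical node ids:

```python
def linearize(children, concept, index, root):
    """Pre-order traversal with children sorted alphanumerically by relation,
    so ARG0 precedes ARG1, etc. children: node id -> list of (relation, child);
    concept/index: node id -> value. Returns the node list u and indices d."""
    u, d = [], []

    def visit(n):
        u.append(concept[n]); d.append(index[n])
        for _, child in sorted(children.get(n, [])):
            visit(child)

    visit(root)
    return u, d
```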

Conclusion
We proposed an attention-based model for AMR parsing where we introduced a series of novel components into a transductive setting that extend beyond what a typical NMT system would do on this task.Our model achieves the best performance on two AMR corpora.For future work, we would like to extend our model to other semantic parsing tasks (Oepen et al., 2014;Abend and Rappoport, 2013).We are also interested in semantic parsing in cross-lingual settings (Zhang et al., 2018;Damonte and Cohen, 2018).

Figure 1 :
Figure 1: Two views of reentrancy in AMR for the example sentence "The victim could help himself." (a) A standard AMR graph. (b) An AMR tree with node indices as an extra layer of annotation, where the corresponding graph can be recovered by merging nodes of the same index and unioning their incoming edges.

Figure 5 :
Figure 5: Deep biaffine classifier for edge prediction. Edge label prediction is not depicted in the figure.

Figure 6 :
Figure 6: Frequency, precision and recall of nodes from different sources, based on the AMR 2.0 test set.

Table 4 :
Ablation studies on components of our model.(Scores are sorted by the delta from the full model.)

Table 5 :
SMATCH scores of full models trained and tested based on different node linearization strategies.

Table 6 :
SMATCH scores based on different pooling functions. The standard deviation is over 3 runs on the test data.