Neural AMR: Sequence-to-Sequence Models for Parsing and Generation

Sequence-to-sequence models have shown strong performance across a broad range of applications. However, their application to parsing and generating text using Abstract Meaning Representation (AMR) has been limited, due to the relatively limited amount of labeled data and the non-sequential nature of the AMR graphs. We present a novel training procedure that can lift this limitation using millions of unlabeled sentences and careful preprocessing of the AMR graphs. For AMR parsing, our model achieves competitive results of 62.1 SMATCH, the current best score reported without significant use of external semantic resources. For AMR generation, our model establishes a new state-of-the-art performance of BLEU 33.8. We present extensive ablative and qualitative analysis including strong evidence that sequence-based AMR models are robust against ordering variations of graph-to-sequence conversions.


Introduction
Abstract Meaning Representation (AMR) is a semantic formalism to encode the meaning of natural language text. As shown in Figure 1, AMR represents the meaning using a directed graph while abstracting away the surface forms in text. AMR has been used as an intermediate meaning representation for several applications including machine translation (MT) (Jones et al., 2012), summarization (Liu et al., 2015), sentence compression (Takase et al., 2016), and event extraction (Huang et al., 2016). While AMR allows for rich semantic representation, annotating training data in AMR is expensive, which in turn limits the use
Figure 1: Obama was elected and his voters celebrated. AMR encodes semantic dependencies between entities mentioned in the sentence, such as "Obama" being the "arg0" of the verb "elected".
In this work, we present the first successful sequence-to-sequence (seq2seq) models that achieve strong results for both text-to-AMR parsing and AMR-to-text generation. Seq2seq models have been broadly successful in many other applications (Wu et al., 2016; Bahdanau et al., 2015; Luong et al., 2015; Vinyals et al., 2015). However, their application to AMR has been limited, in part because effective linearization (encoding graphs as linear sequences) and data sparsity were thought to pose significant challenges. We show that these challenges can be easily overcome, by demonstrating that seq2seq models can be trained using any graph-isomorphic linearization and that unlabeled text can be used to significantly reduce sparsity.
Our approach is two-fold. First, we introduce a novel paired training procedure that enhances both the text-to-AMR parser and AMR-to-text generator. More concretely, first we use self-training to bootstrap a high quality AMR parser from millions of unlabeled Gigaword sentences (Napoles et al., 2012) and then use the automatically parsed AMR graphs to pre-train an AMR generator. This paired training allows both the parser and generator to learn high quality representations of fluent English text from millions of weakly labeled examples, that are then fine-tuned using human annotated AMR data.
Second, we propose a preprocessing procedure for the AMR graphs, which includes anonymizing entities and dates, grouping entity categories, and encoding nesting information in concise ways, as illustrated in Figure 2(d). This preprocessing procedure helps overcome data sparsity while also substantially reducing the complexity of the AMR graphs. Under such a representation, we show that any depth-first traversal of the AMR is an effective linearization, and it is even possible to use a different random order for each example.
Experiments on the LDC2015E86 AMR corpus (SemEval-2016 Task 8) demonstrate the effectiveness of the overall approach. For parsing, we are able to obtain competitive performance of 62.1 SMATCH without using any external annotated examples other than the output of a NER system, an improvement of over 10 points relative to neural models with a comparable setup. For generation, we substantially outperform previous best results, establishing a new state of the art of 33.8 BLEU. We also provide extensive ablative and qualitative analysis, quantifying the contributions that come from preprocessing and the paired training procedure.

Related Work
Alignment-based Parsing Flanigan et al. (2014) (JAMR) pipeline concept and relation identification with a graph-based algorithm.  extend JAMR by performing the concept and relation identification tasks jointly with an incremental model. Both systems rely on features based on a set of alignments produced using bi-lexical cues and hand-written rules. In contrast, our models train directly on parallel corpora, and make only minimal use of alignments to anonymize named entities.
Grammar-based Parsing  (CAMR) perform a series of shift-reduce transformations on the output of an externally-trained dependency parser, similar to Damonte et al. (2017), Brandt et al. (2016), Puzikov et al. (2016), and Goodman et al. (2016). Artzi et al. (2015) use a grammar induction approach with Combinatory Categorial Grammar (CCG), which relies on pretrained CCGBank categories, like Bjerva et al. (2016). Pust et al. (2015) recast parsing as a string-to-tree Machine Translation problem, using unsupervised alignments (Pourdamghani et al., 2014), and employing several external semantic resources. Our neural approach is engineering lean, relying only on a large unannotated corpus of English and algorithms to find and canonicalize named entities.
Neural Parsing Recently there have been a few seq2seq systems for AMR parsing (Barzdins and Gosko, 2016; Peng et al., 2017). Similar to our approach, Peng et al. (2017) deal with sparsity by anonymizing named entities and typing low frequency words, resulting in a very compact vocabulary (2k tokens). However, we avoid reducing our vocabulary by introducing a large set of unlabeled sentences from an external corpus, therefore drastically lowering the out-of-vocabulary rate (see Section 6).
AMR Generation Flanigan et al. (2016) specify a number of tree-to-string transduction rules based on alignments and POS-based features that are used to drive a tree-based SMT system. Pourdamghani et al. (2016) also use an MT decoder; they learn a classifier that linearizes the input AMR graph in an order that follows the output sentence, effectively reducing the number of alignment crossings of the phrase-based decoder. Song et al. (2016) recast generation as a traveling salesman problem, after partitioning the graph into fragments and finding the best linearization order. Our models do not need to rely on a particular linearization of the input, attaining comparable performance even with a per example random traversal of the graph. Finally, all three systems intersect with a large language model trained on Gigaword. We show that our seq2seq model has the capacity to learn the same information as a language model, especially after pretraining on the external corpus.
Data Augmentation Our paired training procedure is largely inspired by Sennrich et al. (2016). They improve neural MT performance for low resource language pairs by using a back-translation MT system for a large monolingual corpus of the target language in order to create synthetic output, and mixing it with the human translations. We instead pre-train on the external corpus first, and then fine-tune on the original dataset.

Methods
In this section, we first provide the formal definition of AMR parsing and generation (section 3.1). Then we describe the sequence-to-sequence models we use (section 3.2), graph-to-sequence conversion (section 3.3), and our paired training procedure (section 3.4).

Tasks
We assume access to a training dataset D where each example pairs a natural language sentence s with an AMR a. The AMR is a rooted directed acyclic graph. It contains nodes whose names correspond to sense-identified verbs, nouns, or AMR specific concepts, for example elect.01, Obama, and person in Figure 1. One of these nodes is a distinguished root, for example, the node and in Figure 1. Furthermore, the graph contains labeled edges, which correspond to PropBank-style (Palmer et al., 2005) semantic roles for verbs or other relations introduced for AMR, for example, arg0 or op1 in Figure 1. The set of node and edge names in an AMR graph is drawn from a set of tokens C, and every word in a sentence is drawn from a vocabulary W.
We study the task of training an AMR parser, i.e., finding a set of parameters θ_P for a model f that predicts an AMR graph â, given a sentence s:
$$\hat{a} = \arg\max_{a} f(a \mid s; \theta_P)$$
We also consider the reverse task, training an AMR generator by finding a set of parameters θ_G for a model f that predicts a sentence ŝ, given an AMR graph a:
$$\hat{s} = \arg\max_{s} f(s \mid a; \theta_G)$$
In both cases, we use the same family of predictors f, sequence-to-sequence models that use global attention, but the models have independent parameters, θ_P and θ_G.

Sequence-to-sequence Model
For both tasks, we use a stacked-LSTM sequence-to-sequence neural architecture employed in neural machine translation (Bahdanau et al., 2015; Wu et al., 2016). Our model uses a global attention decoder and unknown word replacement with small modifications (Luong et al., 2015).
The model uses a stacked bidirectional-LSTM encoder to encode an input sequence and a stacked LSTM to decode from the hidden states produced by the encoder. We make two modifications to the encoder: (1) we concatenate the forward and backward hidden states at every level of the stack instead of at the top of the stack, and (2) introduce dropout in the first layer of the encoder. The decoder predicts an attention vector over the encoder hidden states using previous decoder states. The attention is used to weigh the hidden states of the encoder and then predict a token in the output sequence. The weighted hidden states, the decoded token, and an attention signal from the previous time step (input feeding) are then fed together as input to the next decoder state. The decoder can optionally choose to output an unknown word symbol, in which case the predicted attention is used to copy a token directly from the input sequence into the output sequence.
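The unknown-word replacement step described above can be illustrated with a minimal sketch. It assumes the decoder exposes, for every output position, a list of attention weights over the input tokens. The function name `replace_unknowns`, the `<unk>` symbol, and the toy attention values are hypothetical and are not taken from the authors' implementation.

```python
def replace_unknowns(output_tokens, attention, source_tokens, unk="<unk>"):
    """Copy the most-attended source token wherever the decoder emitted `unk`.

    attention[t][j] is the (assumed) attention weight that output position t
    places on source position j.
    """
    result = []
    for t, token in enumerate(output_tokens):
        if token == unk:
            weights = attention[t]
            best = max(range(len(source_tokens)), key=lambda j: weights[j])
            result.append(source_tokens[best])
        else:
            result.append(token)
    return result


# Toy example: the first output token is unknown and attends mostly to "barack".
source = ["barack", "obama", "was", "elected"]
decoded = ["<unk>", "obama", "was", "elected"]
attention = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
copied = replace_unknowns(decoded, attention, source)
print(copied)
```

Taking the arg-max of the attention row is only one way to pick the copied token; the text does not say how ties or diffuse attention are handled.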

Linearization
Our seq2seq models require that both the input and target be presented as a linear sequence of tokens. We define a linearization order for an AMR graph as any sequence of its nodes and edges. A linearization is defined as (1) a linearization order and (2) a rendering function that generates any number of tokens when applied to an element in the linearization order (see Section 4.2 for implementation details). Furthermore, for parsing, a valid AMR graph must be recoverable from the linearization.

Paired Training
Obtaining a corpus of jointly annotated pairs of sentences and AMR graphs is expensive and current datasets only extend to thousands of examples. Neural sequence-to-sequence models suffer from sparsity with so few training pairs. To reduce the effect of sparsity, we use an external unannotated corpus of sentences S_e, and a procedure which pairs the training of the parser and generator.
Our procedure is described in Algorithm 1. It first trains a parser on the dataset D of pairs of sentences and AMR graphs, and then uses self-training to improve the initial parser. Every iteration of self-training has three phases: (1) parsing samples from a large, unlabeled corpus S_e, (2) creating a new set of parameters by training on the parsed samples from S_e, and (3) fine-tuning those parameters on the original paired data. After each iteration, we increase the size of the sample from S_e by an order of magnitude. After we have the best parser from self-training, we use it to label AMRs for S_e and pre-train the generator. The final step of the procedure fine-tunes the generator on the original dataset D.
Algorithm 1 Paired Training Procedure
Input: Training set of sentences and AMR graphs (s, a) ∈ D, an unannotated external corpus of sentences S_e, a number of self-training iterations, N, and an initial sample size k.
Output: Model parameters for AMR parser θ_P and AMR generator θ_G.
 1: θ_P ← Train parser on D
    ▷ Self-train AMR parser.
 2: S_e^1 ← sample k sentences from S_e
 3: for i = 1 to N do
 4:     A_e^i ← Parse S_e^i using parameters θ_P
        ▷ Pre-train AMR parser.
 5:     θ_P ← Train parser on (A_e^i, S_e^i)
        ▷ Fine-tune AMR parser.
 6:     θ_P ← Train parser on D with initial parameters θ_P
 7:     S_e^{i+1} ← sample k · 10^i new sentences from S_e
 8: end for
 9: S_e^N ← sample k · 10^N new sentences from S_e
    ▷ Pre-train AMR generator.
10: A_e^N ← Parse S_e^N using parameters θ_P
11: θ_G ← Train generator on (A_e^N, S_e^N)
    ▷ Fine-tune AMR generator.
12: θ_G ← Train generator on D using initial parameters θ_G
13: return θ_P, θ_G
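As a rough illustration of the control flow in Algorithm 1, the sketch below follows the numbered steps literally. The callables `train_parser`, `train_generator`, `parse`, and `sample` are hypothetical placeholders supplied by the caller, and the toy stand-ins at the bottom exist only so the sketch runs. Treating step 5 as training from fresh parameters is an assumption based on the phrase "creating a new set of parameters".

```python
def paired_training(D, S_e, N, k, train_parser, train_generator, parse, sample):
    """Sketch of Algorithm 1; comments give the corresponding line numbers."""
    theta_P = train_parser(D, init=None)                        # 1
    S = sample(S_e, k)                                          # 2
    for i in range(1, N + 1):                                   # 3
        A = parse(S, theta_P)                                   # 4
        theta_P = train_parser(list(zip(S, A)), init=None)      # 5: pre-train
        theta_P = train_parser(D, init=theta_P)                 # 6: fine-tune
        S = sample(S_e, k * 10 ** i)                            # 7
    S = sample(S_e, k * 10 ** N)                                # 9
    A = parse(S, theta_P)                                       # 10
    theta_G = train_generator(list(zip(S, A)), init=None)       # 11: pre-train
    theta_G = train_generator(D, init=theta_G)                  # 12: fine-tune
    return theta_P, theta_G                                     # 13


# Toy stand-ins (all hypothetical): a "model" is just a count of chained
# training calls, and sampling records the requested sample size.
sample_sizes = []

def toy_sample(corpus, n):
    sample_sizes.append(n)
    return corpus[: min(n, len(corpus))]

def toy_train(pairs, init=None):
    return (init or 0) + 1

def toy_parse(sentences, theta_P):
    return ["(amr for: %s)" % s for s in sentences]

D = [("obama was elected", "(elect :arg1 obama)")]
S_e = ["sentence %d" % i for i in range(1000)]
theta_P, theta_G = paired_training(
    D, S_e, N=2, k=2,
    train_parser=toy_train, train_generator=toy_train,
    parse=toy_parse, sample=toy_sample,
)
print(sample_sizes)
```

With k = 2 and N = 2 the requested sample sizes grow by a factor of ten per iteration, which mirrors the order-of-magnitude increase described in the text.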

AMR Preprocessing
We use a series of preprocessing steps, including AMR linearization, anonymization, and other modifications we make to sentence-graph pairs. Our methods have two goals: (1) reduce the complexity of the linearized sequences to make learning easier while maintaining enough original information, and (2) address sparsity from certain open class vocabulary entries, such as named entities (NEs) and quantities. Figure 2(d) contains example inputs and outputs with all of our preprocessing techniques.
Graph Simplification In order to reduce the overall length of the linearized graph, we first remove variable names and the instance-of relation ( / ) before every concept. In case of re-entrant nodes we replace the variable mention with its co-referring concept. Even though this replacement incurs loss of information, often the surrounding context helps recover the correct realization, e.g., the possessive role :poss in the example of Figure 1 is strongly correlated with the surface form his. Following Pourdamghani et al. (2016) we also remove senses from all concepts for AMR generation only. Figure 2(a) contains an example output after this stage.

Anonymization of Named Entities
Open-class types including NEs, dates, and numbers account for 9.6% of tokens in the sentences of the training corpus, and 31.2% of vocabulary W. 83.4% of them occur fewer than 5 times in the dataset. In order to reduce sparsity and be able to account for new unseen entities, we perform extensive anonymization.
First, we anonymize sub-graphs headed by one of AMR's over 140 fine-grained entity types that contain a :name role. This captures structures referring to entities such as person, country, miscellaneous entities marked with *-entity, and typed numerical values, *-quantity. We exclude date entities (see the next section). We then replace these sub-graphs with a token indicating fine-grained type and an index, i, indicating it is the ith occurrence of that type. For example, in Figure 2 the sub-graph headed by country gets replaced with country_0.
On the training set, we use alignments obtained using the JAMR aligner (Flanigan et al., 2014) and the unsupervised aligner of Pourdamghani et al. (2014) in order to find mappings of anonymized subgraphs to spans of text and replace mapped text with the anonymized token that we inserted into the AMR graph. We record this mapping for use during testing of generation models. If a generation model predicts an anonymization token, we find the corresponding token in the AMR graph and replace the model's output with the most frequent mapping observed during training for the entity name. If the entity was never observed, we copy its name directly from the AMR graph.
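The indexing scheme and the de-anonymization rule for generation can be sketched as follows. The `type_index` token format follows the `loc_0` token visible in Figure 3. The function names, the input format (a list of entity types and names in graph order), and the `most_frequent_surface` table are hypothetical simplifications, not the authors' code.

```python
from collections import defaultdict


def anonymize_entities(entities):
    """Map (fine_grained_type, name) pairs to tokens such as country_0.

    Returns the tokens in order plus a token -> name mapping for later use.
    """
    counts = defaultdict(int)
    tokens, mapping = [], {}
    for entity_type, name in entities:
        token = "%s_%d" % (entity_type, counts[entity_type])
        counts[entity_type] += 1
        tokens.append(token)
        mapping[token] = name
    return tokens, mapping


def deanonymize(output_tokens, mapping, most_frequent_surface):
    """Replace predicted anonymization tokens in generated text.

    Uses the most frequent surface form seen in training for the entity
    name, and falls back to the name from the AMR graph if it was unseen.
    """
    result = []
    for token in output_tokens:
        if token in mapping:
            name = mapping[token]
            result.append(most_frequent_surface.get(name, name))
        else:
            result.append(token)
    return result


entities = [("country", "United States"), ("person", "Obama"), ("country", "France")]
tokens, mapping = anonymize_entities(entities)
# Hypothetical table collected from training alignments.
most_frequent_surface = {"United States": "US"}
generated = ["person_0", "visited", "country_0", "and", "country_1"]
restored = deanonymize(generated, mapping, most_frequent_surface)
print(tokens, restored)
```

Here "France" has no entry in the training table, so its name is copied directly from the graph, matching the fallback described above.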
Anonymizing Dates For dates in AMR graphs, we use separate anonymization tokens for year, month-number, month-name, day-number and day-name, indicating whether the date is mentioned by word or by number. In AMR generation, we render the corresponding format when predicted. Figure 2(b) contains an example of all preprocessing up to this stage.
Figure 2 (example sentence): US officials held an expert group meeting in January 2002 in New York.
Named Entity Clusters When performing AMR generation, each of the AMR fine-grained entity types is manually mapped to one of the four coarse entity types used in the Stanford NER system (Finkel et al., 2005): person, location, organization and misc. This reduces the sparsity associated with many rarely occurring entity types. Figure 2(c) contains an example with named entity clusters.
NER for Parsing When parsing, we must normalize test sentences to match our anonymized training data. To produce fine-grained named entities, we run the Stanford NER system and first try to replace any identified span with a fine-grained category based on alignments observed during training. If this fails, we anonymize the sentence using the coarse categories predicted by the NER system, which are also categories in AMR. After parsing, we deterministically generate AMR for anonymizations using the corresponding text span.

Linearization
Linearization Order Our linearization order is defined by the order of nodes visited by depth-first search, including backward traversing steps. For example, in Figure 2, starting at meet the order contains meet, :ARG0, person, :ARG1-of, expert, :ARG2-of, group, :ARG2-of, :ARG1-of, :ARG0. The order traverses children in the sequence they are presented in the AMR. We consider alternative orderings of children in Section 7 but always follow the pattern demonstrated above.
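A minimal sketch of this traversal, assuming a simple nested-tuple graph representation `(concept, [(edge, child), ...])` that is hypothetical and chosen only for illustration. The chain meet, person, expert, group is inferred from the sequence quoted in the text, since the full Figure 2 graph is not reproduced here.

```python
def linearization_order(node):
    """Return nodes and edges in depth-first order, repeating each edge
    when the traversal steps back up through it."""
    concept, children = node
    order = [concept]
    for edge, child in children:
        order.append(edge)                        # step down the edge
        order.extend(linearization_order(child))  # visit the subtree
        order.append(edge)                        # backward step up the edge
    return order


# Fragment implied by the example in the text.
group = ("group", [])
expert = ("expert", [(":ARG2-of", group)])
person = ("person", [(":ARG1-of", expert)])
meet = ("meet", [(":ARG0", person)])

order = linearization_order(meet)
print(", ".join(order))
```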
Rendering Function Our rendering function marks scope, and generates tokens following the pre-order traversal of the graph: (1) if the element is a node, it emits the type of the node; (2) if the element is an edge, it emits the type of the edge and then recursively emits a bracketed string for the (concept) node immediately after it. In case the node has only one child we omit the scope markers (denoted with left "(", and right ")" parentheses), thus significantly reducing the number of generated tokens. Figure 2(d) contains an example showing all of the preprocessing techniques and scope markers that we use in our full model.
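The rendering rule can be sketched as below, again over a hypothetical nested-tuple representation `(concept, [(edge, child), ...])`. One assumption is made explicit: scope markers are omitted when the node after an edge has no children of its own, which is the reading that reproduces the first linearized graph in Figure 3. The authors' exact rule may differ.

```python
def render(node):
    """Emit a scope-marked token sequence in pre-order."""
    concept, children = node
    tokens = [concept]
    for edge, child in children:
        tokens.append(edge)
        sub_tokens = render(child)
        if child[1]:
            # The child has its own children: wrap the sub-graph in scope markers.
            tokens += ["("] + sub_tokens + [")"]
        else:
            # Single-node sub-graph: no scope markers needed (assumed reading).
            tokens += sub_tokens
    return tokens


def leaf(concept):
    return (concept, [])


# The first example graph of Figure 3, rebuilt by hand.
graph = ("limit", [
    (":arg0", ("treaty", [(":arg0-of", ("control", [(":arg1", leaf("arms"))]))])),
    (":arg1", ("number", [
        (":arg1", ("weapon", [
            (":mod", leaf("conventional")),
            (":arg1-of", ("deploy", [
                (":arg2", ("relative-pos", [(":op1", leaf("loc_0")), (":dir", leaf("west"))])),
                (":arg1-of", leaf("possible")),
            ])),
        ])),
    ])),
])

linearized = " ".join(render(graph))
print(linearized)
```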

Experimental Setup
We conduct all experiments on the AMR corpus used in SemEval-2016 Task 8. Table 2 summarizes statistics about the original dataset and the extracted portions of Gigaword. We evaluate AMR parsing with SMATCH (Cai and Knight, 2013), and AMR generation using BLEU (Papineni et al., 2002), computed with the multi-BLEU script from the MOSES decoder suite (Koehn et al., 2007). We validated word embedding sizes and RNN hidden representation sizes by maximizing AMR development set performance (Algorithm 1, line 1). We searched over the set {128, 256, 500, 1024} for the best combinations of sizes and set both to 500. Models were trained by optimizing cross-entropy loss with stochastic gradient descent, using a batch size of 100 and dropout rate of 0.5. Across all models, when performance does not improve on the AMR dev set, we decay the learning rate by 0.8.
For the initial parser trained on the AMR corpus (Algorithm 1, line 1), we use a single stack version of our model, set the initial learning rate to 0.5 and train for 60 epochs, taking the best performing model on the development set. All subsequent models benefited from increased depth and we used 2-layer stacked versions, maintaining the same embedding sizes. We set the initial Gigaword sample size to k = 200,000 and executed a maximum of 3 iterations of self-training. For pre-training the parser and generator (Algorithm 1, lines 4 and 9), we used an initial learning rate of 1.0, and ran for 20 epochs. We attempt to fine-tune the parser and generator, respectively, after every epoch of pre-training, setting the initial learning rate to 0.1. We select the best performing model on the development set among all of these fine-tuning attempts. During prediction we perform decoding using beam search and set the beam size to 5 both for parsing and generation.
[Table 2 column headers: Corpus, Examples, OOV@1, ...]
[Table 3, surviving rows (BLEU, two values per system): TSP (Song et al., 2016) 21.1, 22.4; TREETOSTR (Flanigan et al., 2016) 23.0, 23.0]

[...] parser improves. Our final parser outperforms comparable seq2seq and character LSTM models by over 10 points. While much of this improvement comes from self-training, our model without Gigaword data outperforms these approaches by 3.5 points on F1. We attribute this increase in performance to different handling of preprocessing and more careful hyper-parameter tuning. All other models that we compare against use semantic resources, such as WordNet, dependency parsers or CCG parsers (models marked with * were trained with less data, but only evaluate on newswire text; the rest evaluate on the full test set, containing text from blogs). Our full models outperform JAMR, a graph-based model, but still lag behind other parser-dependent systems (CAMR; see footnote 6) and resource-heavy approaches (SBMT).
Generation Results Table 3 summarizes our AMR generation results on the development and test set. We outperform all previous state-of-the-art systems by the first round of self-training and further improve with the next rounds. Our final model trained on GIGA-20M outperforms TSP and TREETOSTR trained on LDC2015E86 by over 9 BLEU points (see footnote 7). Overall, our model incorporates less data than previous approaches, as all reported methods train language models on the whole Gigaword corpus. We leave scaling our models to all of Gigaword for future work.
Sparsity Reduction Even after anonymization of open class vocabulary entries, we still encounter a great deal of sparsity in vocabulary given the small size of the AMR corpus, as shown in Table 2. By incorporating sentences from Gigaword we are able to reduce vocabulary sparsity dramatically, as we increase the size of sampled sentences: the out-of-vocabulary rate with a threshold of 5 reduces almost 5 times for GIGA-20M.
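To make the notion of an out-of-vocabulary rate with a count threshold concrete, here is a small sketch under an assumed definition: a word counts as in-vocabulary if it occurs at least `threshold` times in the training text, and the rate is the fraction of evaluation tokens outside that vocabulary. The text does not spell out the exact definition used for the reported numbers, so the function `oov_rate` and the toy data are illustrative only.

```python
from collections import Counter


def oov_rate(train_tokens, eval_tokens, threshold):
    """Fraction of eval tokens whose training count is below `threshold`
    (assumed definition, see the note above)."""
    counts = Counter(train_tokens)
    vocabulary = {word for word, count in counts.items() if count >= threshold}
    unseen = sum(1 for word in eval_tokens if word not in vocabulary)
    return unseen / len(eval_tokens)


train = "the treaty limits the number of weapons the treaty".split()
dev = "the treaty limits arms".split()

rate_at_1 = oov_rate(train, dev, threshold=1)
rate_at_2 = oov_rate(train, dev, threshold=2)
print(rate_at_1, rate_at_2)
```

Adding more training text can only grow the counts, so under this definition the rate at a fixed threshold never increases as sentences are added.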
Preprocessing Ablation Study We consider the contribution of each main component of our preprocessing stages while keeping our linearization order identical. Figure 2 contains examples for each setting of the ablations we evaluate on. First we evaluate using linearized graphs without parentheses for indicating scope, Figure 2(c), then without named entity clusters, Figure 2(b), and additionally without any anonymization, Figure 2(a).
Table 4 summarizes our evaluation on AMR generation. Each component is required, and scope markers and anonymization contribute the most to overall performance. We suspect that without scope markers our seq2seq models are not as effective at capturing long range semantic relationships between elements of the AMR graph. We also evaluated the contribution of anonymization to AMR parsing (Table 5). Following previous work, we find that seq2seq-based AMR parsing is largely ineffective without anonymization (Peng et al., 2017).
Footnote 6: Since we are currently not using any Wikipedia resources for the prediction of named entities, we compare against the no-wikification version of the CAMR system.
Footnote 7: We also trained our generator on GIGA-2M and fine-tuned on LDC2014T12 in order to have a direct comparison with PBMT, and achieved a BLEU score of 29.7, i.e., 2.8 points of improvement.

Linearization Evaluation
In this section we evaluate three strategies for converting AMR graphs into sequences in the context of AMR generation and show that our models are largely agnostic to linearization orders. Our results argue, unlike SMT-based AMR generation methods (Pourdamghani et al., 2016), that seq2seq models can learn to ignore artifacts of the conversion of graphs to linear sequences.

Linearization Orders
All linearizations we consider use the pattern described in Section 4.2, but differ on the order in which children are visited. Each linearization generates anonymized, scope-marked output (see Section 4), of the form shown in Figure 2(d).
Human The proposal traverses children in the order presented by human authored AMR annotations exactly as shown in Figure 2(d).
Global-Random We construct a random global ordering of all edge types appearing in AMR graphs and re-use it for every example in the dataset. We traverse children based on the position in the global ordering of the edge leading to a child.
Random For each example in the dataset we traverse children following a different random order of edge types.
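The three strategies differ only in how the children of a node are ordered before traversal, which the following sketch tries to make concrete. The helper names, the seed, and the small edge inventory are hypothetical. Children that share an edge type keep their annotation order here because Python's sort is stable, which is a choice of this sketch rather than something the text specifies.

```python
import random


def random_edge_rank(edge_types, rng):
    """Build a random ranking over edge types."""
    shuffled = sorted(edge_types)
    rng.shuffle(shuffled)
    return {edge: position for position, edge in enumerate(shuffled)}


def order_children(children, edge_rank=None):
    """Order (edge, child) pairs.

    edge_rank=None keeps the human annotation order; otherwise children are
    sorted by the rank of the edge that leads to them.
    """
    if edge_rank is None:
        return list(children)
    return sorted(children, key=lambda pair: edge_rank[pair[0]])


EDGE_TYPES = [":arg0", ":arg1", ":time", ":location"]
children = [(":arg0", "person"), (":arg1", "meet"), (":time", "date"), (":location", "city")]
rng = random.Random(13)

# Human: annotation order is kept.
human = order_children(children)

# Global-Random: one ranking, built once and reused for every example.
global_rank = random_edge_rank(EDGE_TYPES, rng)
global_a = order_children(children, global_rank)
global_b = order_children(list(reversed(children)), global_rank)

# Random: a fresh ranking is drawn for each example.
per_example = order_children(children, random_edge_rank(EDGE_TYPES, rng))
print(human, global_a, per_example, sep="\n")
```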

Results
We present AMR generation results for the three proposed linearization orders in Table 6. Random linearization order performs somewhat worse than traversing the graph according to Human linearization order. Surprisingly, a per example random linearization order performs nearly identically to a global random order, arguing seq2seq models can learn to ignore artifacts of the conversion of graphs to linear sequences.

Human-authored AMR leaks information
The small difference between random and global-random linearizations argues that our models are largely agnostic to variation in linearization order. On the other hand, the model that follows the human order performs better, which leads us to suspect it carries extra information not apparent in the graphical structure of the AMR.
To further investigate, we compared the relative ordering of edge pairs under the same parent to the relative position of children nodes derived from those edges in a sentence, as reported by JAMR alignments. We found that the majority of pairs of AMR edges (57.6%) always occurred in the same relative order, therefore revealing no extra generation order information. Of the examples corresponding to edge pairs that showed variation, 70.3% appeared in an order consistent with the order they were realized in the sentence. The relative ordering of some pairs of AMR edges was particularly indicative of generation order. For example, the relative ordering of edges with types location and time was 17% more indicative of the generation order than the majority of generated locations before time. To compare to previous work we still report results using human orderings. However, we note that any practical application requiring a system to generate an AMR representation with the intention to realize it later on, e.g., a dialog agent, will need to be trained either using consistent, or random-derived linearization orders. Arguably, our models are agnostic to this choice.

Qualitative Results
Figure 3 shows example outputs of our full system. The generated text for the first graph is nearly perfect with only a small grammatical error due to anonymization. The second example is more challenging, with a deep right-branching structure, and a coordination of the verbs stabilize and push in the subordinate clause headed by state. The model omits some information from the graph, namely the concepts terrorist and virus. In the third example there are greater parts of the graph that are missing, such as the whole sub-graph headed by expert. Also the model makes wrong attachment decisions in the last two sub-graphs (it is the evidence that is unimpeachable and irrefutable, and not the equipment), mostly due to insufficient annotation (thing) thus making their generation harder.
Finally, Table 7 summarizes the proportions of error types we identified on 50 randomly selected examples from the development set. We found that the generator mostly suffers from coverage issues, an inability to mention all tokens in the input, followed by fluency mistakes, as illustrated above. Attachment errors are less frequent, which supports our claim that the model is robust to graph linearization, and can successfully encode long range dependency information between concepts.

Conclusions
We applied sequence-to-sequence models to the tasks of AMR parsing and AMR generation, by carefully preprocessing the graph representation and scaling our models via pretraining on millions of unlabeled sentences sourced from the Gigaword corpus. Crucially, we avoid relying on resources such as knowledge bases and externally trained parsers. We achieve competitive results for the parsing task (SMATCH 62.1) and state-of-the-art performance for generation (BLEU 33.8).
For future work, we would like to extend our work to different meaning representations such as the Minimal Recursion Semantics (MRS; Copestake et al. (2005)). This formalism tackles certain linguistic phenomena differently from AMR (e.g., negation, and co-reference), contains explicit annotation on concepts for number, tense and case, and finally handles multiple languages (Bender, 2014). Taking a step further, we would like to apply our models on Semantics-Based Machine Translation using MRS as an intermediate representation between pairs of languages, and investigate the added benefit compared to directly translating the surface strings, especially in the case of distant language pairs such as English and Japanese (Siegel, 2000).

limit :arg0 ( treaty :arg0-of ( control :arg1 arms ) ) :arg1 ( number :arg1 ( weapon :mod conventional :arg1-of ( deploy :arg2 ( relative-pos :op1 loc_0 :dir west ) :arg1-of possible ) ) )
SYS: the arms control treaty limits the number of conventional weapons that can be deployed west of Ural Mountains .
REF: the arms control treaty limits the number of conventional weapons that can be deployed west of the Ural Mountains .
REF: a technical committee of Indian missile experts stated that the equipment was unimpeachable and irrefutable evidence of a plan to transfer not just missiles but missile-making capability.
COMMENT: coverage, disfluency, attachment
SYS: the report stated that the Britain government must help stabilize the weak states and push international regulations to stop the use of freely available information to create a form of new biological warfare such as the modified version of the influenza .
Figure 3: Linearized AMR after preprocessing, reference sentence, and output of the generator. We mark with colors common error types: disfluency, coverage (missing information from the input graph), and attachment (implying a semantic relation from the AMR between incorrect entities).