AMR Parsing using Stack-LSTMs

We present a transition-based AMR parser that directly generates AMR parses from plain text. We use Stack-LSTMs to represent our parser state and make decisions greedily. In our experiments, we show that our parser achieves very competitive scores on English using only AMR training data. Adding extra information, such as POS tags and dependency trees, further improves the results.


Introduction
Transition-based algorithms for natural language parsing (Yamada and Matsumoto, 2003; Nivre, 2003, 2004, 2008) are formulated as a series of decisions that read words from a buffer and incrementally combine them to form syntactic structures in a stack. Apart from dependency parsing, these models, also known as shift-reduce algorithms, have been successfully applied to tasks like phrase-structure parsing (Zhang and Clark, 2011), named entity recognition (Lample et al., 2016), CCG parsing (Misra and Artzi, 2016), joint syntactic and semantic parsing (Henderson et al., 2013; Swayamdipta et al., 2016) and even abstract meaning representation (AMR) parsing (Wang et al., 2015b,a; Damonte et al., 2016).
AMR parsing requires solving several natural language processing tasks, mainly named entity recognition, word sense disambiguation, and joint syntactic and semantic role labeling. Given the difficulty of building an end-to-end system, most prior work is based on pipelines or depends heavily on precalculated features (Flanigan et al., 2014; Zhou et al., 2016; Werling et al., 2015; Wang et al., 2015b, inter alia).
Inspired by Wang et al. (2015b,a), Goodman et al. (2016) and Damonte et al. (2016), we present a shift-reduce algorithm that produces AMR graphs directly from plain text. Wang et al. (2015b,a), Zhou et al. (2016) and Goodman et al. (2016) presented transition-based tree-to-graph transducers that traverse a dependency tree and transform it into an AMR graph. The input of Damonte et al. (2016) is a sentence, so their approach is more similar to ours (with a different parsing algorithm), but their parser relies on external tools for dependency parsing, semantic role labeling and named entity recognition.
The input of our parser is a plain text sentence. Through rich word representations, it predicts all the actions (in a single algorithm) needed to generate an AMR graph for that sentence: it handles the detection and annotation of named entities and word sense disambiguation, and it connects the detected nodes to build a predicate-argument structure. Even though the system that runs with just words is very competitive, we further improve the results by incorporating POS tags and dependency trees into our model.
Stack-LSTMs have proven to be useful in tasks related to syntactic and semantic parsing (Swayamdipta et al., 2016) and named entity recognition (Lample et al., 2016). In this paper, we demonstrate that they can be effectively used for AMR parsing as well.

Parsing Algorithm
Our parsing algorithm makes use of a STACK (that stores AMR nodes and/or words) and a BUFFER that contains the words that have yet to be processed. The parsing algorithm is inspired by the semantic actions presented by Henderson et al. (2013), the transition-based NER algorithm by Lample et al. (2016) and the arc-standard algorithm (Nivre, 2004). As in Ballesteros and Nivre (2013), the buffer starts with the root symbol at the end of the sequence. Figure 2 shows a running example. The transition inventory is the following:
• SHIFT: pops the front of the BUFFER and pushes it to the STACK.
• CONFIRM: calls a subroutine that predicts the AMR node corresponding to the top of the STACK. It then pops the word from the STACK and pushes the AMR node to the STACK. An example is the prediction of a PropBank sense: from occurred to occur-01.
• REDUCE: pops the top of the STACK. It occurs when the word/node at the top of the stack is complete (no more actions can be applied to it). Note that it can also be applied to words that do not appear in the final output graph, and thus they are directly discarded.
• MERGE: pops the two nodes at the top of the STACK, merges them, and pushes the resulting node to the top of the STACK. Note that this can be applied recursively. This action serves to build multiword named entities (e.g. New York City).
• ENTITY(label): labels the node at the top of the STACK with an entity label. This action serves to label named entities, such as New York City or Madrid. It is normally run after MERGE for a multi-word named entity, or after SHIFT for a single-word named entity.
• DEPENDENT(label,node): creates a new node in the AMR graph that is dependent on the node at the top of the STACK. An example is the introduction of a negative polarity to a given node: From illegal to (legal, polarity -).
• LA(label) and RA(label): create a left/right arc between the two nodes at the top of the STACK. They keep both the head and the dependent in the STACK to allow reentrancies (multiple incoming edges). The head is now a composition of the head and the dependent, enriched with the AMR label.
• SWAP: pops the two items at the top of the STACK, pushes the second one to the front of the BUFFER, and pushes the first one back onto the STACK. This action allows non-projective arcs as in Nivre (2009), but it also helps to introduce reentrancies. At oracle time, SWAP is produced when the word at the top of the STACK is blocking actions that may happen between the second element of the STACK and any of the words in the BUFFER.
Figure 1 shows the parser actions, their effect on the parser state (the contents of the STACK and BUFFER), and how the graph changes after each action is applied.
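The transition inventory above can be illustrated with a small state machine. The following is a minimal sketch, not the authors' implementation. The class name `ParserState`, the string encoding of merged and labeled nodes, and the head/dependent convention chosen for LA and RA are all assumptions made for illustration. In the parser itself, the node passed to CONFIRM is predicted by the model, and the composition of head and dependent representations is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ParserState:
    """Toy parser state: STACK, BUFFER, and the AMR edges built so far."""
    buffer: list                                 # front of the BUFFER is index 0
    stack: list = field(default_factory=list)    # top of the STACK is index -1
    edges: list = field(default_factory=list)    # (head, label, dependent) triples
    nodes: set = field(default_factory=set)      # stack items that are AMR nodes

    def shift(self):
        if not self.buffer:
            raise ValueError("SHIFT is impossible with an empty BUFFER")
        self.stack.append(self.buffer.pop(0))

    def confirm(self, node):
        # The node would come from the model's node classifier; here it is given.
        self.stack.pop()
        self.stack.append(node)
        self.nodes.add(node)

    def reduce(self):
        self.stack.pop()

    def merge(self):
        top = self.stack.pop()
        second = self.stack.pop()
        self.stack.append(f"{second} {top}")

    def entity(self, label):
        # Assumption: a labeled entity is encoded as "label(name)" and counts as a node.
        name = self.stack.pop()
        node = f"{label}({name})"
        self.stack.append(node)
        self.nodes.add(node)

    def dependent(self, label, node):
        self.edges.append((self.stack[-1], label, node))

    def _arc(self, head, label, dep):
        # Basic constraint: no arcs between items that are not yet AMR nodes.
        if head not in self.nodes or dep not in self.nodes:
            raise ValueError("arcs require both items to be AMR nodes")
        self.edges.append((head, label, dep))    # both items stay on the STACK

    def la(self, label):
        # Assumed convention: the top of the STACK is the head.
        self._arc(self.stack[-1], label, self.stack[-2])

    def ra(self, label):
        # Assumed convention: the second item on the STACK is the head.
        self._arc(self.stack[-2], label, self.stack[-1])

    def swap(self):
        top = self.stack.pop()
        second = self.stack.pop()
        self.buffer.insert(0, second)
        self.stack.append(top)


# Toy derivation for "New York City sleeps" (root symbol omitted for brevity).
state = ParserState(buffer=["New", "York", "City", "sleeps"])
state.shift()
state.shift()
state.merge()              # "New York"
state.shift()
state.merge()              # "New York City"
state.entity("city")
state.shift()
state.confirm("sleep-01")
state.la("ARG0")
print(state.stack)
print(state.edges)
```

Because LA and RA leave both items on the STACK, a later arc can attach the same node to a second head, which is how reentrancies arise in this scheme.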
We implemented an oracle that produces the sequence of actions that leads to the gold (or close to gold) AMR graph. In order to map words in the sentences to nodes in the AMR graph, we need to align them. We use the JAMR aligner provided by Flanigan et al. (2014). It is important to mention that even though the aligner is quite accurate, it is not perfect, producing an F1 score of around 0.90. This means that most sentences have at least one alignment error, which implies that our oracle is not capable of perfectly reproducing all AMR graphs. This has a direct impact on the accuracy of the parser described in the next section, since it is trained on sequences of actions that are not perfect. The oracle achieves a Smatch F1 score of 0.895 when it is run on the development set of LDC2014T12.
The algorithm allows a set of constraints that range from basic ones (disallowing impossible actions, such as SHIFT when the BUFFER is empty, and not generating arcs when the words have not yet been CONFIRMed and thus transformed into nodes) to more complicated ones based on the PropBank candidates and the number of arguments. We choose to constrain the parser only with the basic ones and let it learn the more complicated ones.

Parsing Model
In this section, we revisit Stack-LSTMs, our parsing model and our word representations.

Stack-LSTMs
The Stack-LSTM is an augmented LSTM (Hochreiter and Schmidhuber, 1997; Graves, 2013) that allows adding new inputs in the same way as an LSTM, but it also provides a POP operation that moves a pointer to the previous element. The output vector of the LSTM is read at the stack pointer instead of at the rightmost position of the sequence.
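The pointer mechanism can be sketched independently of the recurrent cell. The following minimal sketch is an illustration under stated assumptions, not the authors' code: the class name `StackRNN` is hypothetical, and `toy_cell` is a deliberately trivial stand-in for an LSTM cell so that the example runs without a neural network library. The point it shows is that POP never erases a state; it only moves the pointer, and the next push continues from the state the pointer refers to.

```python
class StackRNN:
    """Recurrent stack: push computes a new state, pop only moves the pointer."""

    def __init__(self, cell, initial_state):
        self.cell = cell
        self.states = [initial_state]   # every state ever computed
        self.parents = [None]           # index of the state each one was pushed onto
        self.top = 0                    # the stack pointer

    def push(self, x):
        # The new state is computed from the state at the pointer,
        # not from the most recently computed state.
        new_state = self.cell(self.states[self.top], x)
        self.states.append(new_state)
        self.parents.append(self.top)
        self.top = len(self.states) - 1

    def pop(self):
        if self.parents[self.top] is None:
            raise IndexError("pop from an empty stack")
        self.top = self.parents[self.top]

    def summary(self):
        # The output is read at the stack pointer.
        return self.states[self.top]


def toy_cell(state, x):
    # Placeholder for an LSTM cell (an assumption for illustration only).
    return 0.5 * state + x


s = StackRNN(toy_cell, 0.0)
s.push(1.0)   # state: 0.5 * 0.0 + 1.0 = 1.0
s.push(2.0)   # state: 0.5 * 1.0 + 2.0 = 2.5
s.pop()       # pointer moves back to the state 1.0
s.push(4.0)   # state: 0.5 * 1.0 + 4.0 = 4.5
print(s.summary())
```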

Representing the State and Making Parsing Decisions
The state of the algorithm presented in Section 2 is represented by the contents of the STACK, the BUFFER and a list with the history of actions (all of which are encoded as Stack-LSTMs). Together these form the vector $s_t$ that represents the state, which is calculated as follows:
$$s_t = \max\{0, W[st_t; b_t; a_t] + d\}$$
where $W$ is a learned parameter matrix, $d$ is a bias term, and $st_t$, $b_t$, $a_t$ represent the output vectors of the Stack-LSTMs at time $t$.
Predicting the Actions: Our model then uses the vector $s_t$ at each timestep $t$ to compute the probability of the next action as:
$$p(z_t \mid s_t) = \frac{\exp(g_{z_t}^\top s_t + q_{z_t})}{\sum_{z' \in A} \exp(g_{z'}^\top s_t + q_{z'})}$$
where $g_z$ is a column vector representing the (output) embedding of the action $z$, and $q_z$ is a bias term for action $z$. The set $A$ represents the actions listed in Section 2. Note that due to parsing constraints the set of possible actions may vary. The total number of actions (in the LDC2014T12 dataset) is 478; note that they include all possible labels (in the case of LA and RA) and the different dependent nodes for the DEPENDENT action.
Predicting the Nodes: When the model selects the action CONFIRM, it needs to decide the AMR node that corresponds to the word at the top of the STACK, by using $s_t$, as follows:
$$p(e_t \mid s_t) = \frac{\exp(g_{e_t}^\top s_t + q_{e_t})}{\sum_{e' \in N} \exp(g_{e'}^\top s_t + q_{e'})}$$
where $N$ is the set of possible candidate nodes for the word at the top of the STACK, $g_e$ is a column vector representing the (output) embedding of the node $e$, and $q_e$ is a bias term for the node $e$. It is important to mention that this implies finding a PropBank sense or a lemma. For that, we rely entirely on the AMR training set instead of using additional resources. Given that the system runs two softmax operations, one to predict the action to take and a second one to predict the corresponding AMR node, and that both share LSTMs to make predictions, we include an additional layer with a tanh nonlinearity after $s_t$ for each softmax.
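The action prediction step can be illustrated as a softmax restricted to the actions that the parsing constraints currently allow, followed by a greedy choice. This is a minimal sketch under assumptions: the function name `action_distribution`, the dictionary layout of the embeddings and biases, and the toy values are all hypothetical, and the extra tanh layer is omitted. The same function shape would apply to node prediction, with the candidate set for the word at the top of the STACK in place of the valid actions.

```python
import math


def action_distribution(s_t, embeddings, biases, valid):
    """Softmax over the valid actions only: p(z) is proportional to exp(g_z . s_t + q_z)."""
    scores = {
        z: sum(g * s for g, s in zip(embeddings[z], s_t)) + biases[z]
        for z in valid
    }
    # Subtract the maximum score for numerical stability; the result is unchanged.
    m = max(scores.values())
    exps = {z: math.exp(v - m) for z, v in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}


# Toy parameters (illustrative values, not learned).
s_t = [1.0, 0.0]
embeddings = {"SHIFT": [1.0, 0.0], "REDUCE": [0.0, 1.0], "CONFIRM": [1.0, 0.0]}
biases = {"SHIFT": 0.0, "REDUCE": 0.0, "CONFIRM": 0.0}

# With an empty BUFFER, SHIFT is excluded before normalization.
probs = action_distribution(s_t, embeddings, biases, valid=["REDUCE", "CONFIRM"])
best = max(probs, key=probs.get)   # greedy decision
print(best, probs)
```

Excluding invalid actions before normalization means that probability mass is shared only among actions the parser can actually take in the current state.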

Word Representations
We use character-based representations of words computed with bidirectional LSTMs (Ling et al., 2015b). They learn similar representations for words that are orthographically similar. Note that they are updated together with the rest of the model during training. Prior work, including Lample et al. (2016), demonstrated that it is possible to achieve strong results in syntactic parsing and named entity recognition using only character-based word representations, without even POS tags. In some cases, the results with only character-based representations outperform those that used explicit POS tags, since the character-based representations provide similar vectors for words with the same or similar morphosyntactic tags. In this paper we show a similar result for AMR parsing, in which both syntactic parsing and named entity recognition play a central role.
These are concatenated with pretrained word embeddings. We use a variant of the skip n-gram model provided by Ling et al. (2015a), trained on the LDC English Gigaword corpus (version 5). These embeddings encode the syntactic behavior of the words (see Ling et al., 2015a).
More formally, to represent each input token, we concatenate two vectors: a learned character-based representation ($w_C$) and a fixed vector representation from a neural language model ($w_{LM}$). A linear map ($V$) is applied to the resulting vector, and the result is passed through a component-wise ReLU:
$$x = \max\{0, V[w_C; w_{LM}] + b\}$$
where $V$ is a learned parameter matrix, $b$ is a bias term, $w_C$ is the learned character-based representation of the word, and $w_{LM}$ is its pretrained word representation.
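The token representation described above amounts to a concatenation followed by an affine map and a ReLU. The following minimal sketch works on plain Python lists; the function name `token_representation` and the toy values are assumptions for illustration, and in the model $w_C$ would come from the character-level bidirectional LSTM rather than being given directly.

```python
def token_representation(w_c, w_lm, V, b):
    """Compute max{0, V [w_c; w_lm] + b} component-wise."""
    concat = list(w_c) + list(w_lm)
    return [
        max(0.0, sum(v * c for v, c in zip(row, concat)) + bias)
        for row, bias in zip(V, b)
    ]


# Toy values (illustrative only): a 2-dimensional character-based vector
# and a 1-dimensional pretrained vector, mapped to 2 output dimensions.
w_c = [1.0, -1.0]
w_lm = [2.0]
V = [[1.0, 0.0, 1.0],
     [0.0, 1.0, -1.0]]
b = [0.5, 0.0]

x = token_representation(w_c, w_lm, V, b)
print(x)   # first component: 1 + 2 + 0.5 = 3.5; second: -1 - 2 = -3, clipped to 0.0
```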

POS Tagging and Dependency Parsing
We may include preprocessed POS tags or dependency parses to incorporate more information into our model. For the POS tags we use the Stanford tagger (Toutanova et al., 2003), and to get the dependencies we use a Stack-LSTM parser trained on the English CoNLL 2009 dataset (Hajič et al., 2009).
POS tags: The POS tags are preprocessed, and a learned representation (tag) of each tag is concatenated with the word representation.
Dependency Trees: We use them in the same way as POS tags, by concatenating a learned representation (dep) of the label of the dependency to the parent with the word representation. Additionally, we enrich the state representation $s_t$ presented in Section 3.2. If the two words at the top of the STACK have a dependency between them, $s_t$ is enriched with a learned representation that indicates the existence of the arc and its direction; otherwise $s_t$ remains unchanged. In the enriched case, $s_t$ is calculated as follows:
$$s_t = \max\{0, W[st_t; b_t; a_t; dep_t] + d\}$$
where $dep_t$ is the learned vector that represents that there is an arc between the two words at the top of the STACK.
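The enriched state computation can be sketched as follows. This is an illustration under a stated assumption, not the authors' implementation: when there is no arc between the two top words, the sketch feeds a zero vector in place of $dep_t$, so that the corresponding columns of $W$ contribute nothing and the result equals the unenriched state. The source only says that $s_t$ "remains unchanged" in that case, so the zero-vector mechanism, the function names, and the toy values are all hypothetical.

```python
def relu_affine(W, x, d):
    """Compute max{0, W x + d} component-wise."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + di)
        for row, di in zip(W, d)
    ]


def parser_state(st_t, b_t, a_t, W, d, dep_t=None, dep_dim=1):
    # Assumption: no arc between the two top words is encoded as a zero vector.
    if dep_t is None:
        dep_t = [0.0] * dep_dim
    return relu_affine(W, list(st_t) + list(b_t) + list(a_t) + list(dep_t), d)


# Toy values (illustrative only): one dimension per component.
st_t, b_t, a_t = [1.0], [2.0], [-1.0]
W = [[1.0, 1.0, 1.0, 2.0]]
d = [0.0]
dep_vectors = {"left": [1.0], "right": [-1.0]}   # one learned vector per direction

no_arc = parser_state(st_t, b_t, a_t, W, d)
left_arc = parser_state(st_t, b_t, a_t, W, d, dep_vectors["left"])
right_arc = parser_state(st_t, b_t, a_t, W, d, dep_vectors["right"])
print(no_arc, left_arc, right_arc)
```

Using a separate vector per direction lets the state distinguish a left arc from a right arc between the same two words.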

Experiments and Results
We use the LDC2014T12 dataset for our experiments. This dataset is a standard for comparison and has been used for evaluation in recent papers (Goodman et al., 2016; Zhou et al., 2016). We use the standard training/development/test split: 10,312 sentences for training, 1,368 sentences for development and 1,371 sentences held out for testing.
Table 1 shows results, including comparisons with prior work evaluated on the same dataset. The first entry for Damonte et al. is calculated using a pretrained LDC2015 model, available at https://github.com/mdtux89/amr-eager, but evaluated on the LDC2014 dataset. This means that the score is not directly comparable with the rest. The second entry (0.64) for Damonte et al. is calculated by training their parser with the LDC2014 training set, which makes it directly comparable with the rest of the parsers.
In Table 1, one ablation row shows results without pretrained word embeddings, and (NO PRETRAINED-NO CHARS) shows results without character-based representations and without pretrained word embeddings. The rest of our results include both pretrained embeddings and character-based representations.
Our model achieves 0.68 F1 in the newswire section of the test set just by using character-based representations of words and pretrained word embeddings. All prior work uses lemmatizers, POS taggers, dependency parsers, named entity recognizers and semantic role labelers that use additional training data, while we achieve competitive scores without them. Pust et al. (2015) report 0.66 F1 in the full test set by using WordNet for concept identification, but their performance drops to 0.61 without WordNet. It is worth noting that we achieved 0.64 in the same test set without WordNet. Wang et al. (2015b,a) without SRL (via PropBank) achieve only 0.63 in the newswire test set, while we achieved 0.69 without SRL (and 0.68 without dependency trees).
In order to see whether pretrained word embeddings and character-based embeddings are useful, we carried out an ablation study, running our parser with and without character-based representations (replaced by standard lookup-table learned embeddings) and with and without pretrained word embeddings. The results of the parser without character-based embeddings but with pretrained word embeddings show that the character-based representations of words are useful: they add 2 points in the newswire section and 1 point in the full test set. The parser with character-based embeddings but without pretrained word embeddings has more difficulty learning and only achieves 0.61 in the full test set. Finally, the model that uses neither character-based embeddings nor pretrained word embeddings is the worst, achieving only 0.59 in the full test set. Note that this model has no explicit way of getting any syntactic information through the word embeddings, nor a smart way to handle out-of-vocabulary words.
All the systems marked with * require that the input is a dependency tree, which means that they solve a transduction task between a dependency tree and an AMR graph. Even though our parser starts from plain text sentences, we achieve further improvements when we incorporate more information into our model. POS tags provide small improvements (0.6801 without POS tags vs. 0.6822 for the model that runs with POS tags). Dependency trees help a bit more, achieving 0.6920.

Conclusions and Future Work
We present a new transition-based algorithm for AMR parsing and we implement it using Stack-LSTMs and a greedy decoder. We present competitive results without any additional resources or external tools. Just by looking at the words, we achieve 0.68 F1 (and 0.69 by preprocessing dependency trees) in the standard dataset used for evaluation.