AMR Parsing with Action-Pointer Transformer

Abstract Meaning Representation parsing is a sentence-to-graph prediction task where target nodes are not explicitly aligned to sentence tokens. However, since graph nodes are semantically based on one or more sentence tokens, implicit alignments can be derived. Transition-based parsers operate over the sentence from left to right, capturing this inductive bias via alignments at the cost of limited expressiveness. In this work, we propose a transition-based system that combines hard-attention over sentences with a target-side action pointer mechanism to decouple source tokens from node representations and address alignments. We model the transitions as well as the pointer mechanism through straightforward modifications within a single Transformer architecture. Parser state and graph structure information are efficiently encoded using attention heads. We show that our action-pointer approach leads to increased expressiveness and attains large gains (+1.6 points) against the best transition-based AMR parser in very similar conditions. While using no graph re-categorization, our single model yields the second best Smatch score on AMR 2.0 (81.8), which is further improved to 83.4 with silver data and ensemble decoding.


Introduction
Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a sentence-level semantic formalism encoding who does what to whom in the form of a rooted directed acyclic graph. Nodes represent concepts such as entities or predicates, which are not explicitly aligned to words, and edges represent relations such as subject/object (see Figure 1). AMR parsing, the task of generating the graph from a sentence, is nowadays tackled with sequence-to-sequence models parametrized with neural networks. Two broad categories of methods have been highly effective in recent years. Transition-based approaches predict a sequence of actions given the sentence. These actions generate the graph while processing tokens left-to-right through the sentence and store intermediate representations in memories such as a stack and a buffer (Wang et al., 2015; Damonte et al., 2016; Ballesteros and Al-Onaizan, 2017; Vilares and Gómez-Rodríguez, 2018; Naseem et al., 2019; Astudillo et al., 2020; Lee et al., 2020). General graph-based approaches, on the other hand, directly predict nodes and edges in sequential order from graph traversals such as breadth-first or depth-first search (Zhang et al., 2019a,b; Cai and Lam, 2019, 2020). While not modeling the local semantic correspondence between graph nodes and source tokens, these approaches achieve strong results without the restrictions of transition-based approaches, but often require graph re-categorization, a form of graph normalization, for optimal performance.
The strong left-to-right constraint of transition-based parsers provides a form of inductive bias that fits AMR characteristics. AMR nodes are very often normalized versions of sentence tokens, and locality between words and nodes is frequently preserved. The fact that transition-based systems for AMR have alignments at the core of their explanatory model also guarantees that they produce reliable alignments at decoding time, which are useful for applications utilizing AMR parses. Despite these advantages, transition-based systems still suffer in situations where multiple nodes are best explained as aligned to one sentence token or to none. Furthermore, long distance edges in AMR, e.g. re-entrancies, require excessive use of SWAP or equivalent actions, leading to very long action sequences. This in turn affects both a model's ability to learn and its decoding speed.

Figure 2: Source tokens, target actions and AMR graph for the sentence I offer a solution to the problem (partially parsed). The black arrow marks the current token cursor position. The circles contain the action indices (used as ids); black circles indicate node-creating actions. Only these actions are available for edge attachments. Notice that the edge actions (at steps 3, 7 and 9) explicitly refer to past nodes using the id of the action that created the node. The other participant of the edge action is implicitly assumed to be the most recently created graph node.
In this work, we propose the Action-Pointer Transition (APT) system, which combines the advantages of transition-based approaches and more general graph-generation approaches. We focus on predicting an action sequence that can build the graph from a source sentence. The core idea is to put the target action sequence to a dual use: as a mechanism for graph generation as well as a representation of the graph itself. Inspired by recent progress in pointer-based parsers (Ma et al., 2018a; Fernández-González and Gómez-Rodríguez, 2020), we replace the stack and buffer with a cursor that moves from left to right and introduce a pointer network (Vinyals et al., 2015) as the mechanism for edge creation. Unlike previous works, we use the pointer mechanism on the target side, pointing to past node-generating actions to create edges. This eliminates the node generation and attachment restrictions of previous transition-based parsers. It is also more natural for graph generation, essentially resembling the generation process of graph-based approaches, while keeping the graph and source aligned.
We model both the action generation and the pointer prediction with a single Transformer model (Vaswani et al., 2017). We relate target node and source token representations through masking of the cross-attention mechanism, similarly to Astudillo et al. (2020), but simply with a monotonic action-source alignment driven by cursor positions rather than stack and buffer contents. Finally, we also embed the AMR graph structural information in the target decoder by re-purposing edge-creating steps, and propose a novel step-wise incremental graph message passing method (Gilmer et al., 2017) enabled by the decoder self-attention mechanism.
Experiments on the AMR 1.0, AMR 2.0, and AMR 3.0 benchmark datasets show the effectiveness of our APT system. We outperform the best transition-based systems while using substantially shorter action sequences, and achieve better performance than all previous approaches with a similar number of trainable parameters.

AMR Generation with Action-Pointer
Figure 2 shows a partially parsed example of a source sentence, a transition action sequence, and the AMR graph for the proposed transitions. Given a source sentence $x = x_1, x_2, \dots, x_S$, our transition system works by scanning the sentence from left to right using a cursor $c_t \in \{1, 2, \dots, S\}$. Cursor movement is controlled by three actions:
SHIFT moves the cursor one position to the right, such that $c_{t+1} = c_t + 1$.
REDUCE is a special SHIFT indicating that no action was performed at the current cursor position.
MERGE merges tokens $x_{c_t}$ and $x_{c_t+1}$ and SHIFTs. Merged tokens act as a single token under the position of the last token merged.
At cursor position $c_t$ we can generate any subgraph through the following actions:
COPY creates a node by copying the word under $x_{c_t}$. Since AMR nodes are often lemmas or PropBank frames, two versions of this action exist, copying the lemma of $x_{c_t}$ or providing the first sense (frame -01) constructed from the lemma. This covers a large portion of the total AMR nodes and also helps generalize to predictions of unseen nodes. We use an external lemmatizer for this action.
PRED(LABEL) creates a node with name LABEL from the node names seen at train time.
SUBGRAPH(LABEL) produces an entire subgraph indexed by label LABEL. Any future attachments can only be made to the root of the subgraph.
LA(ID,LABEL) creates an arc with label LABEL from the last generated node to a previous node at position ID. Note that we can only point to past node-generating actions in the action history.
RA(ID,LABEL) creates an arc with label LABEL to the last generated node from a previous node at position ID.
Using the above actions, it is easy to derive an oracle action sequence given gold-graph information and initial word-to-node alignments. For the current cursor position, all the nodes aligned to it are generated using SUBGRAPH(), COPY or PRED() actions. Each node prediction action is followed by edge creation actions. Edges connecting to closer nodes are generated before farther ones. When multiple connected nodes are aligned to one token, they are traversed in pre-order for node generation. A detailed description of the oracle algorithm is given in Appendix B.
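To make the transition system concrete, the sketch below (hypothetical helper code, not the released implementation) replays an action sequence into a set of nodes and a list of labeled edges; the token and action lists are an abridged, illustrative version of the Figure 2 example.

```python
# Sketch: replaying an APT action sequence into a graph.
# Action ids are positions in the action history; LA/RA point back to
# node-creating actions by that id.

def replay_actions(tokens, actions):
    cursor = 0          # index into `tokens` (the paper's c_t, 0-indexed here)
    nodes = {}          # action id -> node name
    edges = []          # (source node id, label, target node id)
    last_node = None    # id of the most recently created node

    for i, act in enumerate(actions):
        kind = act[0]
        if kind in ("SHIFT", "REDUCE", "MERGE"):
            cursor += 1                      # MERGE also collapses tokens (omitted here)
        elif kind == "COPY":
            nodes[i] = tokens[cursor]        # node named after the word under the cursor
            last_node = i
        elif kind in ("PRED", "SUBGRAPH"):
            nodes[i] = act[1]                # node (or subgraph root) named by the label
            last_node = i
        elif kind == "LA":                   # arc FROM the last node TO the pointed node
            _, ptr, label = act
            edges.append((last_node, label, ptr))
        elif kind == "RA":                   # arc FROM the pointed node TO the last node
            _, ptr, label = act
            edges.append((ptr, label, last_node))
    return nodes, edges


# Illustrative partial sequence for "I offer a solution to the problem"
tokens = ["I", "offer", "a", "solution", "to", "the", "problem"]
actions = [
    ("COPY",),                # 0: node copied from the first token
    ("SHIFT",),               # 1
    ("PRED", "offer-01"),     # 2: predicate node
    ("LA", 0, ":ARG0"),       # 3: offer-01 --:ARG0--> node created at step 0
]
print(replay_actions(tokens, actions))
```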
The use of a cursor variable $c_t$ decouples node reference from source tokens, allowing the system to produce multiple nodes and edges (see Figure 3), or even the entire AMR graph if necessary, from a single token. This provides more expressiveness and flexibility than previous transition-based AMR parsers, while keeping a strong inductive bias. The only restriction is that all inbound or outbound edges between the current node and all previously produced nodes need to be generated before predicting a new node or shifting the cursor. This does not limit the oracle coverage; however, for trained parsers it leads to a small percentage of disconnected graphs at decoding time. Furthermore, nodes within the SUBGRAPH() action cannot be reached for edge creation. The use of the SUBGRAPH() action, initially introduced in Ballesteros and Al-Onaizan (2017), is reduced in this work to cases where no such edges are expected, which is mainly the case for dates and named entities.
Compared to previous oracles (Ballesteros and Al-Onaizan, 2017; Naseem et al., 2019; Astudillo et al., 2020), the action-pointer oracle does not use a SWAP action. It can establish an edge between the last predicted node and any previous node, since edges are created by pointing to decoder representations.

Figure 3: Step-by-step actions on the sentence your opinion matters. A subgraph is created from a single word (thing :ARG1-of opine-01) and attachment is allowed to all its nodes. The cursor is at the underlined words (post-action).
This oracle is expected to work with generic AMR aligners. For this work, we use the alignment generation method of Astudillo et al. (2020), which produces many-to-many alignments. It is a combination of the Expectation-Maximization-based alignments of Pourdamghani et al. (2014) and the rule-based alignments of Flanigan et al. (2014). Any remaining unaligned nodes are aligned based on their graph proximity to unaligned tokens. For more details, we refer the reader to the work of Astudillo et al. (2020).

Basic Architecture
The backbone of our model is the encoder-decoder Transformer (Vaswani et al., 2017), combined with a pointer network (Vinyals et al., 2015). The probability of an action sequence $y = y_1, y_2, \dots, y_T$ for input tokens $x = x_1, x_2, \dots, x_S$ is given in our model by

$$P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x), \qquad (1)$$

where at each time step $t$ we decompose the target action $y_t$ into the pointer-removed action and the pointer value, $y_t = (a_t, p_t)$. A dummy pointer $p_t = \mathrm{null}$ is fixed for non-edge actions, so that

$$P(y_t \mid y_{<t}, x) = P(a_t \mid y_{<t}, x)\, P(p_t \mid a_t, y_{<t}, x)^{\gamma(a_t)},$$

where $\gamma(a_t)$ is an indicator variable set to 0 if $a_t$ is not an edge action and 1 otherwise.
Given a sequence-to-sequence Transformer model with $N$ encoder layers and $M$ decoder layers, each decoder layer is defined by

$$d^m_{\le t} = \mathrm{FF}_m\big(\mathrm{CA}_m\big(\mathrm{SA}_m\big(d^{m-1}_{\le t}\big),\, e^N\big)\big),$$

where $\mathrm{FF}_m()$, $\mathrm{CA}_m()$ and $\mathrm{SA}_m()$ are feed-forward, multi-head cross-attention and multi-head self-attention components, respectively. $e^N$ is the output of the last encoder layer and $d^{m-1}$ is the output of the previous decoder layer, with $d^0_{\le t}$ initialized to the embeddings of the action history $y_{<t}$ concatenated with a special start symbol.
The distribution over actions is given by

$$P(a_t \mid y_{<t}, x) = \mathrm{softmax}\big(W d^M_t\big),$$

where $W$ are the output vocabulary embeddings, and the edge pointer distribution is given by

$$P(p_t \mid a_t, y_{<t}, x) = \mathrm{softmax}\big(Q^M d^M_t \cdot \big(K^M d^M_{<t}\big)^{\top}\big),$$

where $K^M$, $Q^M$ are the key and query matrices of one head of the last decoder self-attention layer $\mathrm{SA}_M()$.
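As a rough numerical illustration of this factorization (with made-up dimensions and separate projection matrices; in the actual model the pointer scores are read from an existing self-attention head rather than from dedicated parameters), one decoding step could be scored as:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H, V, head_dim, t = 256, 100, 64, 5   # hidden size, action vocab, head size, current step

dec_state = torch.randn(H)            # d^M_t, top-layer decoder state at step t
past_states = torch.randn(t, H)       # d^M_{<t}, states of previous target positions
W_vocab = torch.randn(V, H)           # output action embeddings W
q_proj = torch.randn(head_dim, H)     # query projection of the pointer head
k_proj = torch.randn(head_dim, H)     # key projection of the pointer head

action_logp = F.log_softmax(W_vocab @ dec_state, dim=-1)        # log P(a_t | y_<t, x)
q = q_proj @ dec_state
k = past_states @ k_proj.T
pointer_logp = F.log_softmax(k @ q / head_dim ** 0.5, dim=-1)   # log P(p_t | a_t, y_<t, x)

# Joint score of an edge action `a` pointing at position `p` (gamma(a_t) = 1);
# for non-edge actions only action_logp[a] is used.
a, p = 3, 2
joint = action_logp[a] + pointer_logp[p]
```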
The top-layer self-attention is a natural choice for the pointer network, since it is likely to have high values for the nodes involved in the edge direction and label prediction. Although the edge action and its pointer value are both output at the same step, the specialized pointer head is also part of the overall self-attention mechanism used to compute the model's hidden representations, thus making the action distribution aware of the pointer distribution.
Our transition system moves the cursor $c_t$ over the source from left to right during parsing, essentially maintaining a monotonic alignment between target actions and source tokens. We encode the alignment $c_t$ with hard attention in the cross-attention heads $\mathrm{CA}_m()$, $m = 1 \cdots M$, at every decoder layer. We mask one head of the cross-attention to see only the aligned source token at $c_t$, and augment it with another head masked to see only positions $> c_t$. This is similar to the hard attention in Peng et al. (2018) and the parser state encoding in Astudillo et al. (2020).
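A minimal sketch (hypothetical helper, assuming one mask row per decoding step and 0-indexed positions) of how the two hard cross-attention masks could be derived from the cursor positions:

```python
import torch

def cross_attention_masks(cursor_positions, src_len):
    """Boolean masks (True = may attend) for the two hard-attention heads.

    cursor_positions: source index c_t aligned to each target step."""
    T = len(cursor_positions)
    head_aligned = torch.zeros(T, src_len, dtype=torch.bool)
    head_future = torch.zeros(T, src_len, dtype=torch.bool)
    for t, c in enumerate(cursor_positions):
        head_aligned[t, c] = True       # only the token under the cursor
        head_future[t, c + 1:] = True   # only source positions to the right of it
    # Note: when c_t is the last token, head_future allows no position; a real
    # implementation would need a fallback (e.g. the end-of-sentence token).
    return head_aligned, head_future

# Cursor stays on token 1 for two node-generating steps, then shifts to token 2.
aligned, future = cross_attention_masks([0, 1, 1, 2], src_len=5)
```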
As in prior works, we restrict the output space of our model to only allow valid actions given $x, y_{<t}$. The restriction is not only enforced at inference, but is also internalized by the model during training, so that the model can always focus on relevant action subsets when making predictions.

Incremental Graph Embedding
Incrementally generated graphs are usually modeled via graph neural networks (Li et al., 2018), where a node's representation is updated from the collection of its neighboring nodes' representations by message passing (Gilmer et al., 2017). However, this requires re-computation of all node representations every time the graph is modified, which is expensive and has prohibited its use in previous graph-based AMR parsing works (Cai and Lam, 2020). To better utilize the intermediate topological graph information without losing the efficient parallelization of the Transformer, we propose to use the edge creation actions as updated views of each node that encode the node's neighboring subgraph. This does not change past computations and can be done by altering the hard masking of the self-attention heads of the decoder layers $\mathrm{SA}_m()$. By interpreting the decoder layers as implementing message passing vertically, we can fully encode graphs up to depth $M$.
Given a node-generating action $a_t = v$, it is followed by $k \ge 0$ edge-generating actions $a_{t+1}, a_{t+2}, \dots, a_{t+k}$ that connect the current node with previous nodes, pointed to by positions $p_{t+1}, p_{t+2}, \dots, p_{t+k}$ on the target side. This also defines $k$ graph modifications, expanding the graph neighborhood of the current node. Figure 4 shows an example for the sentence The boy wants to go, with node prediction actions at positions $t = 2, 4, 8$ and $k$ being 0, 1, 2, respectively. We use the steps from $t$ to $t + k$ in the Transformer decoder to encode this expanding neighborhood. In particular, we fix the decoder input to the current node action $v$ for these steps, as illustrated by the input actions in Figure 4. At each intermediate step $\tau \in [t, t + k]$, two decoder self-attention heads of $\mathrm{SA}_m()$ are restricted to only attend to the direct graph neighbors of the current node, represented by previous nodes at positions $p_{t+1}, \dots, p_\tau$, as well as the current position $\tau$. This essentially builds sub-sequences of node representations with richer graph information step by step, and we use the last reference of the same node as the pointing position when generating new edges. Moreover, when propagating this masking pattern along $m$ layers, each node encodes its $m$-hop neighborhood information. This defines a message passing procedure as shown in Figure 4, encoding the compositional relations between nodes. Since edges have directions indicated by LA and RA, we also encode the direction information by separating the two heads, with each only considering one direction.
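The following sketch (a simplified, hypothetical reconstruction that ignores the LA/RA direction split and treats all non-node, non-edge steps uniformly) shows how such a self-attention mask could be grown step by step from the action history:

```python
import torch

def graph_attention_mask(actions, pointers):
    """Boolean decoder self-attention mask (True = may attend) for one
    graph-embedding head.

    actions[i]  : action name at step i
    pointers[i] : target position an LA/RA action points to, else None."""
    T = len(actions)
    mask = torch.zeros(T, T, dtype=torch.bool)
    neighbors = []                               # pointer positions accumulated for the current node
    for i, act in enumerate(actions):
        if act in ("COPY", "PRED", "SUBGRAPH"):  # a new node: start a fresh neighborhood
            neighbors = []
        elif act in ("LA", "RA"):                # an edge: grow the current node's neighborhood
            neighbors.append(pointers[i])
        mask[i, i] = True                        # always attend to the current position
        for j in neighbors:
            mask[i, j] = True                    # attend to the node's direct neighbors so far
    return mask

acts = ["COPY", "SHIFT", "PRED", "LA", "SHIFT", "PRED", "LA", "RA"]
ptrs = [None,   None,    None,   0,    None,    None,   2,    0]
print(graph_attention_mask(acts, ptrs).int())
```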

Training and Inference
Our model is trained by maximizing the log likelihood of Equation (1). The valid action space, the action-source alignment $c_t$, and the graph embedding mask at each step $t$ are pre-computed at training time. For inference, we modify the beam search algorithm to jointly search over actions and edge pointers and combine them to find the action sequence that maximizes Equation (1). We also impose hard constraints during the search, such as valid output actions and valid target pointing values at different steps, to ensure that an AMR graph is recoverable. The structural information extracted from the parsing state, such as $c_t$ and the graph embedding masks, is computed on the fly at each new decoding step based on the current results, and is then used by the model for the next decoding step. We detail our search algorithm in Appendix C.
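For illustration, a single greedy decoding step under such hard constraints might look as follows (a simplification of the joint action/pointer beam search; the masks standing in for the valid-action and valid-pointer sets are assumed to be computed from the current parser state):

```python
import torch

def constrained_step(action_logp, pointer_logp, valid_actions, valid_pointers):
    """action_logp   : [V] log-probabilities over actions
    pointer_logp  : [t] log-probabilities over previous target positions
    valid_actions : boolean [V] mask of actions allowed by the parser state
    valid_pointers: boolean [t] mask of positions holding node-creating actions"""
    action_logp = action_logp.masked_fill(~valid_actions, float("-inf"))
    pointer_logp = pointer_logp.masked_fill(~valid_pointers, float("-inf"))
    best_action = int(action_logp.argmax())
    best_pointer = int(pointer_logp.argmax())   # only used when the action is LA/RA
    return best_action, best_pointer
```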

Experimental Setup
Data and Evaluation We test our approach on two widely used AMR parsing benchmark datasets: AMR 2.0 (LDC2017T10) and AMR 1.0 (LDC2014T12). The AMR graphs are all human annotated. The two datasets have 36521 and 10312 training AMRs, respectively, and share 1368 development AMRs and 1371 test AMRs. We also report results on the latest AMR 3.0 (LDC2020T02) dataset, which is larger but has not yet been fully explored, with 55635 training AMRs and 1722 and 1898 AMRs for the development and test sets. Wiki links are removed in data pre-processing, and we run a wikification approach in post-processing to recover Wikipedia entries in the AMR graphs, as in Naseem et al. (2019).
For evaluation, we use SMATCH (F1) scores and the fine-grained evaluation metrics (Damonte et al., 2016) to assess the model's AMR parsing performance.

Model Configuration
Our base setup has 6 layers and 4 attention heads for both the Transformer encoder and decoder, with model size 256 and feed-forward size 512. We also compare with a small model with 3 layers in the encoder and decoder that is otherwise identical. The pointer network is always tied to one target self-attention head of the top decoder layer. We use the cross-attention of all decoder layers for the action-source alignment. For the graph embedding, we use 2 heads of the bottom 3 layers for the base model and of the bottom 2 layers for the small model. We use contextualized embeddings extracted from the pre-trained RoBERTa large model for the source sentence, averaging all layer states and mapping BPE tokens to words by averaging, as in Lee et al. (2020). The pre-trained embeddings are kept fixed. For the target actions, we train our own embeddings along with the model.
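A sketch of this feature extraction using the Hugging Face transformers library (not the authors' extraction code; whether the embedding layer is included in the layer average is an assumption here):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large").eval()

words = ["I", "offer", "a", "solution", "to", "the", "problem"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True)

# Average over all hidden states (embedding layer + 24 Transformer layers).
layer_avg = torch.stack(out.hidden_states).mean(dim=0)[0]   # [seq_len, 1024]

# Map BPE pieces back to words by averaging their vectors.
word_ids = enc.word_ids()
word_vecs = [layer_avg[[i for i, w in enumerate(word_ids) if w == j]].mean(dim=0)
             for j in range(len(words))]
features = torch.stack(word_vecs)   # [num_words, 1024], kept fixed during training
```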

Implementation Details
We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.98$ for training. Each batch contains a maximum of 3584 tokens, and the learning rate schedule is the same as in Vaswani et al. (2017), with a maximum learning rate of 5e-4 and 4000 warm-up steps. We use a dropout rate of 0.3 and a label smoothing rate of 0.01. We train all models for a maximum of 120 epochs, and average the best 5 epoch checkpoints among the last 40 checkpoints, selected by SMATCH score on the development data with greedy decoding. We use a default beam size of 10 for decoding. We implement our model with the FAIRSEQ toolkit. All models are trained and tested on a single Nvidia Titan RTX GPU. Training takes about 10 hours on AMR 2.0 and 3.5 hours on AMR 1.0.
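The learning rate schedule referenced above (linear warm-up followed by inverse-square-root decay, as in Vaswani et al. (2017)) can be written as the small helper below; the function name is ours.

```python
def learning_rate(step, peak_lr=5e-4, warmup=4000):
    """Inverse-sqrt schedule: linear warm-up to peak_lr, then decay as 1/sqrt(step)."""
    step = max(step, 1)
    return peak_lr * min(step / warmup, (warmup / step) ** 0.5)

assert abs(learning_rate(4000) - 5e-4) < 1e-12   # peak is reached at the end of warm-up
assert learning_rate(16000) == 5e-4 / 2          # then decays as the inverse square root
```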
Results and Analysis

Main Results
Oracle Actions Table 1 compares the oracle SMATCH and average action sequence length on the AMR 2.0 training set among recent transition systems. Our approach yields much shorter action sequences due to the target-side pointing mechanism. It also has the best coverage of the training AMR graphs, owing to the flexibility of our transitions, which capture the majority of graph components. We chose not to tackle a number of small corner cases, such as disconnected subgraphs for a token, which account for the missing oracle performance.
Parsing Performance We compare our action-pointer transition/Transformer (APT) model with existing approaches in Table 2. The use of pre-trained embeddings (from large models) is indicated with B or R, and graph re-categorization with G.

Table 2 (caption, partial): some results are from Lee et al. (2020); ¦ denotes concurrent work based on fine-tuning pre-trained BART large models; we report the best/average score ± standard deviation over 3 seeds; p.e. is partial ensemble decoding with 3 seed models.

Graph re-categorization (Lyu and Titov, 2018; Zhang et al., 2019a; Cai and Lam, 2020; Bevilacqua et al., 2021) removes node senses and groups certain nodes together, such as named entities, in pre-processing. These are reverted in post-processing with the help of a named entity recognizer. We report results over 3 runs for each model with different random seeds. Given that we use fixed pre-trained embeddings, it is computationally cheap to build a partial ensemble that uses the average probability of 3 models trained with different seeds, which we denote p.e. With the exception of the recent BART-based model of Bevilacqua et al. (2021), we outperform all previously published approaches, with both our small and base models. Our best single-model parsing scores are 81.8 on AMR 2.0 and 78.5 on AMR 1.0, improving by 1.6 points over the previous best model trained only on gold data. Our small model trails the base model only by a small margin, and we achieve high performance on the small AMR 1.0 dataset, indicating that our approach benefits from a good inductive bias towards the problem and therefore learns efficiently. More remarkably, we even surpass the scores reported in Lee et al. (2020), which combines various self-learning techniques and utilizes 85K extra sentences for self-annotation (silver data). For the most recent AMR 3.0 dataset, we report our results for future reference.
Additionally, partial ensemble decoding proves to be simple and effective at boosting model performance, consistently bringing gains of more than 1 point on AMR 1.0 and 2.0. It should be noted that the ensemble decoding is only 20% slower than a single model.
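A sketch of the ensemble step (the model objects and their step_probs interface are hypothetical; the key point is that the fixed source features are shared and only the per-step probabilities of the seed models are averaged):

```python
import torch

def ensemble_step_logprobs(models, features, action_history):
    """Average the next-step probabilities of several seed models and
    return log-probabilities for the joint action/pointer search."""
    probs = [m.step_probs(features, action_history) for m in models]   # each [V]
    return torch.log(torch.stack(probs).mean(dim=0))
```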
We thus use this ensemble to annotate the 85K sentence set used in Lee et al. (2020). After removing parses with detached nodes, we obtain 70K model-annotated silver data sentences. Adding these to the regular training data, we achieve our best score of 83.4 with the ensemble on AMR 2.0.

In Table 3, we compare the parameter sizes of recently published models alongside their parsing performance on AMR 2.0. Similar to our approach, most models use large pre-trained models to extract contextualized embeddings as fixed features, with the exception of Xu et al. (2020), which is a seq-to-seq pre-training approach on a large amount of data, and Bevilacqua et al. (2021), which directly fine-tunes a seq-to-seq BART large (Lewis et al., 2019) model. Except for the large BART model, our APT small (3 layers) has the fewest trained parameters yet already surpasses all previous models. This shows that our method learns AMR parsing highly efficiently. Moreover, given the small parameter size, the partial ensemble is an appealing way to improve parsing quality with minor decoding overhead. Although more performant, directly fine-tuning pre-trained seq-to-seq models such as BART would require a prohibitively large number of parameters to perform an ensemble.

Table 4 shows the fine-grained AMR 2.0 evaluation (Damonte et al., 2016) of APT and previous models with comparable trainable parameter sizes. Our model achieves the best scores on all sub-tasks except negations and wikification, which are handled by post-processing in the best performing approach. We obtain large improvements on edge-related sub-tasks, including SRL (ARG arcs) and reentrancies, proving the effectiveness of our target-side pointer mechanism.

Analysis
Ablation of Model Components We evaluate the contribution of different components of our model in Table 5. The top part of the table shows the effect of the two major components that inject parser state information and graph structural information into the Transformer decoder. The baseline is a free Transformer model with pointers (row 1); its performance is greatly increased by including the monotonic action-source alignment via hard attention (row 2) on both the AMR 1.0 and AMR 2.0 corpora, and combining it with the graph embedding (row 3) gives further improvements of 0.3 and 0.2 on AMR 1.0 and AMR 2.0, respectively. This highlights that injecting hard-encoded structural information into the Transformer decoder greatly helps our problem.

The bottom part of Table 5 evaluates the contribution of the output space restriction on the target and the pre-trained input embeddings for the source. Removing the restriction on the target output space, i.e. the valid actions, hurts model performance, as the model may not be able to learn the underlying rules that govern the target sequence restrictions. Switching the RoBERTa large embeddings to RoBERTa base or BERT large also hurts performance (although the score drops are only 0.3 to 0.6), indicating that contextual embeddings from larger and better pre-trained models better equip the parser to capture semantic relations in the source sentence.
Effect of Oracle Setup As our model learns directly from the oracle actions, we study how the upstream transition system affects model performance by varying the transition setup in Table 6. We try three variations of the oracle. In the first setup, we measure the impact of breaking down the SUBGRAPH action into individual node generation and attachment actions, by comparing against an oracle that uses SUBGRAPH for all cases of multi-node alignments. Using SUBGRAPH everywhere degrades both parser performance and oracle SMATCH considerably, with the latter dropping by 1.1 absolute points. This is expected, since the SUBGRAPH action makes the internal nodes of the subgraph unattachable. In the second setup, we vary the order of edge creation actions: we reverse it so that edges connecting farther nodes are built first. Although this does not affect the oracle score, we observe that model performance with this oracle drops by 0.3. The reason might be that the easy close-range edge-building actions become harder when pushed farther away; making easy decisions first is also less prone to error propagation. Finally, we change the order in which the various nodes connected to a token are created. Instead of generating the nodes from the root downwards, we perform a post-order traversal, where leaves are generated before parents. This also does not affect the oracle score, but it gives a minor gain in parser performance.

Figure 5 shows performance for different beam sizes. Ideally, if the model is more certain and accurate in making the right predictions at different steps, decoding performance should be less affected by beam size. The results show that performance improves with beam size, but the gains saturate at beam size 3. This indicates that a smaller beam size can be considered for application scenarios with time constraints.

Related Work
With the exception of Astudillo et al. (2020), other works introducing stack and buffer information into sequence-to-sequence attention parsers (Liu and Zhang, 2017; Buys and Blunsom, 2017) are based on RNNs and do not attain high performance. Liu and Zhang (2017) tackle dependency parsing and propose modified attention mechanisms, while Buys and Blunsom (2017) predict semantic graphs jointly with their alignments and compare stack-based with latent and fixed alignments. Compared to the stack-Transformer (Astudillo et al., 2020), we propose the use of an action-pointer mechanism to decouple word and node representations, remove the need for stack and buffer, and model graph structure on the decoder side. We show that these improvements yield superior performance while exploiting the same inductive biases with little training data or small models. Vilares and Gómez-Rodríguez (2018) proposed AMR-COVINGTON, a system for unrestricted non-projective AMR parsing that compares the current word with all previous words for arc attachment, as we propose. However, their comparison is done with sequential actions, whereas we use an efficient pointer mechanism to parallelize the process.
Regarding the use of pointer mechanisms for arc attachment, Ma et al. (2018b) proposed the stack-pointer network to build partial graph representations, and Fernández-González and Gómez-Rodríguez (2020) adopted pointers along with a left-to-right scan of the sentence, greatly improving efficiency. Compared with these works, we tackle a more general text-to-graph problem, where nodes are only loosely related to words, by utilizing the action-pointer mechanism. Our method is also able to build graph representations of depth up to $M$ with $M$ decoding layers.
While not explicitly stated, graph-based approaches (Zhang et al., 2019a; Cai and Lam, 2020) generate edges with a pointing mechanism, either with a deep biaffine classifier (Dozat and Manning, 2018) or with attention (Vaswani et al., 2017). They also model inductive biases indirectly through graph re-categorization, detailed in Section 6.1, which requires a named entity recognition system at test time. Re-categorization was proposed in Lyu and Titov (2018), which reformulated alignments as a differentiable permutation problem, interpretable as another form of inductive bias.
Finally, augmenting seq-to-seq models with graph structures has been explored in various NLP areas, including machine translation (Hashimoto and Tsuruoka, 2017; Moussallem et al., 2019), text classification (Lu et al., 2020), and AMR-to-text generation (Zhu et al., 2019). Most of these works model graph structure in the encoder, since the complete source sentence and graph are known. We instead embed a dynamic graph in the Transformer decoder during parsing. This is similar to broad graph generation approaches (Li et al., 2018) relying on graph neural networks, but our approach is much more efficient as we do not require heavy re-computation of node representations.

Conclusion
We present an Action-Pointer mechanism that can naturally handle the generation of arbitrary graph constructs, including re-entrancies and multiple nodes per token. Our structural modeling, with incremental encoding of parser and graph states in a single Transformer architecture, proves to be highly effective, obtaining the best results on all AMR corpora among models with similar numbers of learnable parameters. An interesting future direction is combining our system with large pre-trained models such as BART, as directly fine-tuning the latter shows great potential for boosting performance (Bevilacqua et al., 2021). Although we focus on AMR graphs in this work, our system can essentially be adapted to any task generating graphs from text where copy mechanisms or hard attention play a central role.