Transition-based Parsing with Stack-Transformers

Modeling the parser state is key to good performance in transition-based parsing. Recurrent Neural Networks considerably improved the performance of transition-based systems by modeling the global state, e.g. stack-LSTM parsers, or by local state modeling of contextualized features, e.g. Bi-LSTM parsers. Given the success of Transformer architectures in recent parsing systems, this work explores modifications of the sequence-to-sequence Transformer architecture to model either global or local parser states in transition-based parsing. We show that modifications of the cross-attention mechanism of the Transformer considerably strengthen performance on both dependency and Abstract Meaning Representation (AMR) parsing tasks, particularly for smaller models or limited training data.


Introduction
Transition-based parsing transforms the task of predicting a graph from a sentence into predicting an action sequence of a state machine that produces the graph (Nivre, 2003, 2004; Kubler et al., 2009; Henderson et al., 2013). These parsers are attractive for their linear inference time and interpretability. However, their performance hinges on effective modeling of the parser state at every decision step.
Parser states typically comprise two memories, a buffer and a stack, from which tokens can be pushed or popped (Kubler et al., 2009). Traditionally, parser states were modeled using hand-selected local features pertaining only to the words on the top of the stack or buffer (Nivre et al., 2007; Zhang and Nivre, 2011, inter-alia). With the widespread use of neural networks, global models of the parser state such as the stack-LSTM (Dyer et al., 2015) allowed encoding the entire buffer and stack. It was later shown that local features of the stack and buffer extracted from contextual word representations, such as Bi-LSTMs, could outperform global modeling (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016).
With the rise of the Transformer model (Vaswani et al., 2017), various approaches have been proposed that leverage this architecture for parsing (Kondratyuk, 2019; Kulmizev et al., 2019; Mrini et al., 2019; Ahmad et al., 2019; Cai and Lam, 2020). In this work we revisit the local versus global paradigms of state modeling in the context of sequence-to-sequence Transformers applied to action prediction for transition-based parsing. Similarly to previous work on RNN sequence-to-sequence models, we propose a modification of the cross-attention mechanism of the Transformer to provide global parser state modeling. We analyze the role of local versus global parser state modeling, stack and buffer modeling, and the effects of model size, task complexity and amount of training data.
Results show that local and global modeling of the parser state yield more than 2 percentage points absolute improvement over a strong Transformer baseline, both for dependency and Abstract Meaning Representation (AMR) parsing. Gains are particularly large for smaller training sets and smaller model sizes, indicating that parser state modeling can compensate for both. Finally, we improve the AMR transition-based oracle (Ballesteros and Al-Onaizan, 2017a), yielding the best results for a transition-based system and the second best overall.

Global versus Local Parser State
Figure 1: Encoding of buffer and stack for action sequence a = {SHIFT, SHIFT, REDUCE, SHIFT} and sentence w = {a, b, c}. The stack-LSTM is at the top, with hidden state representations of buffer (black) and stack (white) displayed. The stack-Transformer is at the bottom, with masks for cross-attention heads attending buffer (black) and stack (white) displayed. Circles indicate extra cross-attention positions relative to stack and buffer.
Given a pair of sentence w = w_1, w_2, ..., w_N and graph g, transition-based parsers learn an action sequence a = a_1, a_2, ..., a_T that, applied to a state machine, yields the graph g = M(a, w). Actions of the state machine generally move words from a buffer, which initially contains the entire sentence, to a stack. Components of the graph, such as edges or nodes, are created by applying transformations to words in the stack. The correct action sequence is given by an oracle a = O(w, g), which is generally rule-based. In principle, one could learn the sentence-to-action mapping w → a as a sequence-to-sequence problem, similarly to e.g. Machine Translation. In practice, this approach does not accurately represent the parser state and thus shows limited performance. The parser state at step t is defined implicitly by (a_<t, w). This translates to an explicit state at step t where the stack contains some tokens about to be processed, sometimes along with newly composed vector representations, and the buffer contains the remaining tokens of the sentence. Buffer and stack increase (push) or decrease (pop) their size dynamically with each time step, as shown in Fig. 1.
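The push and pop dynamics of buffer and stack can be illustrated with a minimal sketch. It replays the action sequence and sentence of Fig. 1 on a toy state machine. The function name `run_state_machine` is hypothetical, and REDUCE is assumed here to simply pop the top of the stack; a real parser would also create graph components at this point.

```python
from typing import List, Tuple

State = Tuple[List[str], List[str]]  # (buffer, stack)


def run_state_machine(words: List[str], actions: List[str]) -> List[State]:
    """Return the (buffer, stack) contents before the first action and after each action."""
    buffer = list(words)  # the buffer initially contains the entire sentence
    stack: List[str] = []
    states = [(list(buffer), list(stack))]
    for action in actions:
        if action == "SHIFT":
            # move the word at the front of the buffer to the top of the stack
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            # simplifying assumption: REDUCE only pops the top of the stack
            stack.pop()
        else:
            raise ValueError(f"unsupported action: {action}")
        states.append((list(buffer), list(stack)))
    return states


states = run_state_machine(["a", "b", "c"], ["SHIFT", "SHIFT", "REDUCE", "SHIFT"])
for buffer, stack in states:
    print("buffer:", buffer, "stack:", stack)
```

The state at index t of `states` is the one implied by the action history a_<t, i.e. the state available when predicting action t.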
The transition-based formalism relies heavily on the explicit representation of the state, i.e. buffer and stack configurations. Prior to the widespread use of neural networks, local features limited to the top of the stack and buffer already achieved good performance (Nivre et al., 2007; Zhang and Nivre, 2011, inter-alia). The introduction of stack-LSTMs (Dyer et al., 2015) made it possible to model the global state of the parser by separately encoding the action history a_<t and the dynamically changing stack and buffer with LSTMs (Hochreiter and Schmidhuber, 1997). In addition to this, stack-LSTMs used the transition-based formalism to recursively build vector representations of sub-graphs, similarly to a graph neural network.
Another well-known LSTM model is the Bi-LSTM feature parser (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016). In this case, a contextual representation of the sentence is first built with a Bi-LSTM, h = BiLSTM(w). At each time step t, the stack configuration determined by a_<t is used to select the elements from h corresponding to words on the top of the stack and buffer.
Although the features utilize local information of the buffer and stack, the use of a strong contextual representation proved to be sufficient and this remains one of the most widely used forms of parsing today.
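The local feature selection described above can be sketched as follows. This is an illustration under stated assumptions, not the cited parsers' implementation: `h` stands in for the Bi-LSTM output (one vector per word), `stack` and `buffer` hold word indices, and the function name, the number of selected positions and the zero padding for missing positions are all hypothetical choices.

```python
from typing import List


def select_local_features(
    h: List[List[float]],
    stack: List[int],
    buffer: List[int],
    n_stack: int = 1,
    n_buffer: int = 1,
) -> List[float]:
    """Concatenate contextual vectors of the top stack words and the front buffer words."""
    pad = [0.0] * len(h[0])  # assumption: missing positions are padded with zeros
    selected = []
    for k in range(1, n_stack + 1):
        # the top of the stack is the last element of the list
        selected.append(h[stack[-k]] if len(stack) >= k else pad)
    for k in range(n_buffer):
        # the front of the buffer is the first element of the list
        selected.append(h[buffer[k]] if len(buffer) > k else pad)
    return [value for vector in selected for value in vector]


# toy contextual representations for a three-word sentence
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
features = select_local_features(h, stack=[0, 1], buffer=[2])
print(features)
```

Only the selected positions depend on the parser state; the contextual vectors themselves are computed once per sentence.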

From stack-LSTMs to stack-Transformers
In transition-based parsers, at a given time step t, input tokens w may be on the buffer, stack or reduced. As displayed in Fig. 1 (top), to encode this state stack-LSTMs unroll LSTMs over the stack and buffer following their respective word order, which can be different from the sentence's token order. If an element is added to the buffer or stack, it is only necessary to unroll one additional LSTM cell. If an element is removed under a pop operation (e.g. REDUCE), stack-LSTMs move back a pointer to reuse previously computed hidden states. This allows efficient encoding of the dynamically changing stack and buffer. Unlike LSTMs, Transformers (Vaswani et al., 2017) encode sequences through an attention mechanism (Bahdanau et al., 2015) as a weighted sum of tokens plus position embeddings. One can take advantage of this mechanism to replace LSTMs with Transformers for stack and buffer encoding. Since Transformers just sum token representations, under a pop operation elements can be masked out and there is no need for a pointer. Furthermore, since Transformers use multiple heads one can have separate modeling of stack and buffer by specializing two heads of the attention mechanism, see Fig. 1 (bottom), while the other heads remain free.
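The masking idea can be made concrete with a small sketch that derives, for each time step, one {−∞, 0} mask for a head attending the buffer and one for a head attending the stack. The function name `state_masks` is hypothetical, and REDUCE is again assumed to only pop the top of the stack.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def state_masks(num_words: int, actions: List[str]) -> Tuple[List[List[float]], List[List[float]]]:
    """Build per-step masks over word positions: 0 if visible to the head, -inf otherwise."""
    buffer = list(range(num_words))  # word indices still on the buffer
    stack: List[int] = []            # word indices currently on the stack
    buffer_masks, stack_masks = [], []
    for action in actions:
        # masks describe the state implied by the action history before this action
        buffer_masks.append([0.0 if i in buffer else NEG_INF for i in range(num_words)])
        stack_masks.append([0.0 if i in stack else NEG_INF for i in range(num_words)])
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            stack.pop()  # a reduced word is masked out for both heads from now on
        else:
            raise ValueError(f"unsupported action: {action}")
    return buffer_masks, stack_masks


buffer_masks, stack_masks = state_masks(3, ["SHIFT", "SHIFT", "REDUCE", "SHIFT"])
for t, (bm, sm) in enumerate(zip(buffer_masks, stack_masks)):
    print(t, "buffer:", bm, "stack:", sm)
```

Note that the stack mask at the first step is entirely −∞ because the stack is empty; how such fully masked rows are handled is left open in this sketch.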
In practical terms, we modify the cross-attention mechanism of the Transformer decoder. For example, for the head attending the stack, the score function between the action history encoding b_t (query) and the hidden representation of word h_i (key) includes two state-dependent terms: m_ti, a {−∞, 0} mask, and p_ti, the position embeddings for elements in the stack. Here h = f(w) is the output of the Transformer encoder. The attention is then computed from this score function. Both mask and positions change for each word and time step as the parser state changes, but they imply little computational overhead and can be precomputed for training. Henceforth this modification will be referred to as the stack-Transformer.
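A single masked cross-attention head can be sketched as below. This is one plausible reading rather than the exact formulation: the query and keys are assumed to be already projected, the mask m_ti is added to a scaled dot-product score before a softmax, and the optional position embeddings p_ti are assumed to be added to the keys. The function name `masked_attention` is hypothetical.

```python
import math
from typing import List, Optional, Tuple

NEG_INF = float("-inf")


def masked_attention(
    query: List[float],
    keys: List[List[float]],
    values: List[List[float]],
    mask: List[float],
    positions: Optional[List[List[float]]] = None,
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one head at one time step."""
    dim = len(query)
    scores = []
    for i, key in enumerate(keys):
        if positions is not None:
            # assumption: state-dependent position embeddings are added to the key
            key = [k + p for k, p in zip(key, positions[i])]
        dot = sum(q * k for q, k in zip(query, key))
        scores.append(dot / math.sqrt(dim) + mask[i])  # mask is 0 or -inf
    finite = [s for s in scores if s != NEG_INF]
    if not finite:
        # fully masked row (e.g. empty stack): return zeros instead of NaNs
        return [0.0] * len(keys), [0.0] * len(values[0])
    top = max(finite)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, context


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # toy encoder outputs
b_t = [1.0, 0.0]                          # toy action history encoding
stack_mask = [0.0, NEG_INF, 0.0]          # words 0 and 2 are on the stack
weights, context = masked_attention(b_t, h, h, stack_mask)
print(weights, context)
```

Because masked positions receive zero weight, a popped element needs no pointer bookkeeping: changing the mask for the next time step is enough.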

Labeled SHIFT Multi-task
It is common practice for transition-based systems to add an auxiliary Part of Speech (POS) or word prediction task (Bohnet and Nivre, 2012). This is achieved by labeling the SHIFT action, which moves a word from the buffer to the stack, with the word's tag. These decorated actions become part of the action history a_<t, which was expected to give better visibility into stack/buffer content and exploit the Transformer's attentional encoding of history.
In initial experiments, POS tags produced a small improvement while word prediction led to a performance decrease. It was observed, however, that predicting only the 100-300 most frequent words, leaving SHIFT undecorated otherwise, led to large performance increases. This is thus the method reported in the experimental setup as alternative parser state modeling.
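The labeled SHIFT strategy can be sketched as a preprocessing step over oracle action sequences. The function names and the `SHIFT(word)` label format are hypothetical, and the sketch assumes that each SHIFT consumes the next word of the sentence in order, which would not hold for oracles with actions that return words to the buffer.

```python
from collections import Counter
from typing import List, Set


def frequent_words(sentences: List[List[str]], k: int) -> Set[str]:
    """Return the k most frequent words in the training sentences."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(k)}


def decorate_shifts(words: List[str], actions: List[str], vocab: Set[str]) -> List[str]:
    """Label SHIFT with the shifted word if it is frequent; leave it undecorated otherwise."""
    decorated = []
    next_word = 0  # index of the word at the front of the buffer
    for action in actions:
        if action == "SHIFT":
            word = words[next_word]
            next_word += 1
            decorated.append(f"SHIFT({word})" if word in vocab else "SHIFT")
        else:
            decorated.append(action)
    return decorated


vocab = frequent_words([["the", "cat"], ["the", "dog"]], k=1)
decorated = decorate_shifts(["the", "cat", "sat"], ["SHIFT", "SHIFT", "REDUCE", "SHIFT"], vocab)
print(decorated)
```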

Experiments and Results
To test the proposed approach, different parsing tasks were selected. Dependency parsing on the English Penn Treebank (PTB) is well known and well resourced (40K sentences). The AMR2.0 semantic parsing task is more complex, encompassing named entity recognition, word sense disambiguation and co-reference among other sub-tasks, and is also well resourced (36K sentences). AMR1.0 has around 10K sentences and can be considered as AMR with limited training data. The dependency parsing setup followed Dyer et al. (2015), in the setting with no POS tags. This has only SHIFT, LEFT-ARC(label), and RIGHT-ARC(label) base actions, with a total of 82 different actions. Results were measured in terms of (Un)labeled Attachment Scores (UAS/LAS).
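For reference, the attachment scores can be computed as in the following sketch, where each word is represented by a (head index, label) pair. The function name is hypothetical, scores are returned as fractions rather than percentages, and evaluation conventions such as punctuation handling are deliberately left out.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one word


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS): fraction of words with correct head, and with correct head and label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_arcs = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / len(gold), correct_arcs / len(gold)


gold = [(2, "det"), (0, "root"), (2, "obj")]
pred = [(2, "det"), (0, "root"), (2, "nsubj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```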
The AMR setup followed Ballesteros and Al-Onaizan (2017a), which introduced new actions to segment text and derive nodes or entity sub-graphs. In addition, we use the alignments and wikification from Naseem et al. (2019). Unlike previous works, we force-aligned the unaligned nodes to neighbouring words and allowed attachment to the leaf nodes of entity sub-graphs. This increased oracle Smatch from 93.7 to 98.1 and notably improved model performance. We therefore provide results for the Naseem et al. (2019) oracle for comparison. Both previous works predict a node creation action and then a node label, or call a lemmatizer if no label is found. Instead, we directly predicted the label and added COPY actions to construct node names from lemmas or surface words, resulting in a maximum of 9K actions. Node label predictions were limited to those seen during training for the word on the top of the stack. Results were measured in Smatch (Cai and Knight, 2013) using the latest version, 1.0.4.
Regarding model implementation, all models were implemented on the fairseq toolkit and trained with only minor modifications over the MT model hyper-parameters (Ott et al., 2018). This used cross-entropy training with learning rate 5e-4, inverse square root scheduling with min. 1e-9, 4000 warmup updates with learning rate 1e-7, and maximum 3584 tokens per batch. Adam parameters were 0.9 and 0.98, and label smoothing was reduced to 0.01. All models used 6 layers of encoding and decoding with size 256 and 4 attention heads, except the normal Transformers in AMR, which performed better on a 3/8 layer configuration instead of 6/6. To study the effect of model size, small versions of all models using a 2/2 configuration were also tested.
We used RoBERTa-base (Liu et al., 2019) embeddings without fine-tuning as input, averaging wordpieces to obtain word representations. Weight averaging of the best 3 checkpoints (Junczys-Dowmunt et al., 2016) and beam 10 were used in all models. This improves results by at most 0.4/0.8 points for AMR2.0/AMR1.0, with no significant differences across models. Models were trained for a fixed number of epochs, selecting the best model on validation by either LAS or Smatch. A maximum epoch number of 80-120 was set to guarantee a margin of 5 epochs from best model to last epoch. No other hyper-parameters were changed across models or tasks. Training took at most 6h on an NVIDIA Tesla V100 GPU. It should be noted that this is around 10 times faster than our PyTorch stack-LSTM implementation for the same data. The labeled SHIFT strategy used the 100 most frequent words.
Table 1 compares the standard Transformer, with and without multi-task, against the stack-Transformer, its components, and smaller versions of all models. Comparing LAS and Smatch, the stack-Transformer provides around 2 points of improvement over the Transformer on PTB and AMR2.0, and 0.5 points over its multi-task version (a-c). This improvement becomes noticeably larger for the smaller training set AMR1.0, with 5.8 and 1.4 point gains over the Transformer and its multi-task version respectively. Differences are also larger for the 4 layer (2/2) version of the models. Under this setting, the stack-Transformer loses only 0.4 points against a 12 layer (6/6) model in AMR2.0. In this same setting, the Transformer and its multi-task version lose 3.3 and 2.3 points respectively, suggesting that modeling the parser state compensates for less training data or smaller models.

Analysis of Results
Regarding ablation of the stack-Transformer components, the use of stack/buffer positions seems clearly detrimental (d) in all scenarios. This was a consistent pattern across various variants for which we do not report numbers, such as sinusoidal versus learnable positions and reducing the position range to the top three of the stack and buffer. One possible explanation is that positions varying after each time step may be hard to learn, particularly if injected directly in the decoder. It is also worth noting that the combination of multi-task and stack-Transformer produced little improvement or was even detrimental, pointing to their similar role. Results for the weakest of the stack-Transformer variants are provided (h).
Comparing across different attention modifications (e-h), most methods perform similarly, although there seems to be some evidence that global (full buffer, full stack) variants perform better. Modeling of the buffer also seems more important than modeling of the stack. One possible explanation for this is that, since the total number of heads is kept fixed, it may be more useful to gain an additional free head than to model the stack content. Furthermore, without recursive representation building, as in stack-LSTMs, the role of the stack can be expected to be less important.
Overall, the stack-Transformer is competitive against recent works, particularly for AMR, likely due to the higher complexity of the task. Compared to prior AMR systems, it is worth noting the large performance increase against the stack-LSTM (Naseem et al., 2019), while sharing a similar oracle and embeddings and not using reinforcement learning fine-tuning. The stack-Transformer also matches the best reported AMR system (Cai and Lam, 2020) on AMR1.0 without graph recategorization, but using RoBERTa instead of BERT embeddings, and provides the second best reported scores on the higher resourced AMR2.0.

Related Works
While inspired by stack-LSTMs (Dyer et al., 2015), the stack-Transformer lacks their elegant recursive composition, where representations for partial graph components are added to the stack and used in subsequent representations. It does, however, allow modeling the global parser state in a simple way that is easy to parallelize, and shows large performance gains against stack-LSTMs on AMR. The proposed modified attention mechanism could also be interpreted as a form of feature-based parser (Kiperwasser and Goldberg, 2016), where the parser state is used to select encoder representations, integrated into a Transformer sequence-to-sequence model.
The modification of the attention mechanism to reflect the parse state has been applied in the past to RNN sequence-to-sequence models.  propose the use of a boundary to separate stack and buffer attentions. While simple, this precludes the use of SWAP actions needed for AMR parsing and non-projective parsing.  mask out reduced words and add a bias to the attention weights for words in the stack. While being the closest to the proposed technique, this method neither models stack and buffer separately nor retains free attention heads, which we consider a fundamental advantage. We also provide evidence that modeling the parser state still produces gains when using pre-trained Transformer embeddings, and provide a detailed analysis of components. Finally, RNN (Ma et al., 2018) and self-attention (Ahmad et al., 2019) Stack-Pointer networks sum encoder representations based on local graph structure, which can be interpreted as masked uniform attention over 3 words and is related to the previous methods.

Conclusions
We have explored modifications of sequence-to-sequence Transformers to encode the parser state for transition-based parsing, inspired by the stack-LSTM's global modeling of the parser state. While simple, these modifications consistently provide improvements over a normal sequence-to-sequence Transformer in transition-based parsing, both for dependency parsing and AMR parsing tasks. Results also point to the benefits of modeling the parser state as a way to compensate for limited training resources or limited model size.