A Unifying Theory of Transition-based and Sequence Labeling Parsing

We define a mapping from transition-based parsing algorithms that read sentences from left to right to sequence labeling encodings of syntactic trees. This not only establishes a theoretical relation between transition-based parsing and sequence-labeling parsing, but also provides a method to obtain new encodings for fast and simple sequence labeling parsing from the many existing transition-based parsers for different formalisms. Applying it to dependency parsing, we implement sequence labeling versions of four algorithms, showing that they are learnable and obtain comparable performance to existing encodings.

Transition-based parsing algorithms define an abstract state machine where each state (configuration) holds a structured representation, as well as auxiliary data structures (often, but not always, a buffer and a stack of tokens). Shift-reduce actions (transitions) move the system between states until a full parse is found. The transition to take at each state was traditionally predicted by data-driven classifiers based on local decisions and rich feature representations (Zhang and Nivre, 2011). With the adoption of deep learning, which can globally contextualize word representations, the dependency on hand-crafted features has been drastically reduced (Kiperwasser and Goldberg, 2016; Shi et al., 2017); and it has also been shown that alternative ways to attack the problem can be practical.
More particularly, several parsing problems have been cast as machine translation tasks, where a sequence-to-sequence (seq2seq) network maps the sentence into a string of arbitrary length that encodes a linearized graph (Vinyals et al., 2015; Li et al., 2018; Konstas et al., 2017). To a certain extent, the attention mechanism in these seq2seq models can be seen as an abstraction of the stack and buffer in transition-based parsers, where the attention weights mark the relevant words for generating the next component of the output string. Alternatively, some authors have reduced constituent and dependency parsing to sequence labeling, where given an input sentence of length n, the output also has length n, assigning one label to each word (Strzyz et al., 2019). However, these reductions have consisted of defining custom encodings for the output structure, which cannot be automatically derived.
In this context, some studies have linked transition-based parsers to seq2seq architectures, as in Li et al. (2018), but to the best of our knowledge there is no unified framework or theory for transition-based and sequence labeling parsing.
Contributions. (i) Our first contribution is theoretical, connecting the transition-based and sequence labeling parsing paradigms. We give a broad definition of a left-to-right transition system, covering the majority of transition-based parsers, and prove that the transitions produced by such systems for a sentence of length n can be mapped to a sequence of n labels, hence providing a mapping from transition-based parsers to sequence labeling parsers. (ii) The second contribution is empirical, applied to dependency parsing. We implement projective and non-projective transition-based algorithms, cast them in a sequence labeling setup, and show that they are learnable and even outperform some existing custom encodings for parsing as labeling. The source code is available at https://github.com/mstrise/dep2label.

Preliminaries
Let w = w_1 ... w_n be an input sentence. We will denote by P_w the set of possible well-formed parses for w in the relevant parsing formalism (e.g. in projective dependency parsing, P_w is the set of projective dependency trees with n nodes). Following Nivre (2008), with some adaptations to generalize the notions beyond dependency parsing, we can define a transition system as a quadruple S = (C, T, c_s, C_t) where:
• C is a set of configurations, such that each configuration c ∈ C contains at least a partially-built parse P_c (be it a partial syntactic dependency tree, semantic dependency tree, constituent tree, or any other structure that can be built using the input words), in addition to any other data structures needed by each specific transition system,
• T is a set of transitions, i.e. partial functions t : C → C,
• c_s is an initialization function, mapping an input sentence w = w_1 ... w_n to an initial configuration in C,
• C_t ⊆ C is a set of terminal configurations.
A transition system parses an input sentence w by starting in the initial configuration c_s(w) and applying a sequence of transitions until a terminal configuration c_f ∈ C_t is reached. At that point, the parse P_{c_f} ∈ P_w is returned.
Thus, for a transition system S = (C, T, c_s, C_t), we can define a computation of S on w as a sequence of configurations c_0, c_1, ..., c_m such that each c_i is obtained from c_{i-1} by applying a transition t_i. Such a computation is complete for w if c_0 = c_s(w) and c_m ∈ C_t. We denote by C_w^S the set of complete computations of S on w. Note that a computation is uniquely determined by its starting configuration and the sequence of transitions t_1, ..., t_m applied to it; in particular, a complete computation is uniquely determined by its transition sequence alone, as the initial configuration is fixed.
Finally, we define a static oracle for a transition system S = (C, T, c_s, C_t) as a function ω : P_w → C_w^S such that for every P ∈ P_w, the parse associated with the final configuration in ω(P) is P. That is, given a gold parse for the sentence w, a static oracle returns a (canonical) computation of S that produces that gold parse on w. Note that a correct transition system should be able to produce all the possible well-formed parses in P_w, and hence, a static oracle must exist.
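To fix intuitions, the following is a minimal sketch in Python of the quadruple S = (C, T, c_s, C_t) and of computations as just defined, specialized to stack/buffer parsers. All class, field and method names are our own illustrative choices, not taken from the paper's released code.

```python
# A sketch of a transition system S = (C, T, c_s, C_t) for stack/buffer parsers.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Arc = Tuple[int, int]  # (head, dependent), over token indices

@dataclass
class Configuration:
    stack: List[int]
    buffer: List[int]
    arcs: Set[Arc] = field(default_factory=set)   # the partial parse P_c

@dataclass
class TransitionSystem:
    transitions: Dict[str, Callable[[Configuration], Configuration]]  # T
    init: Callable[[int], Configuration]                              # c_s
    is_terminal: Callable[[Configuration], bool]                      # C_t

    def run(self, n: int, sequence: List[str]) -> Configuration:
        # A computation c_0, ..., c_m: apply t_1, ..., t_m starting at c_s(w).
        c = self.init(n)
        for t in sequence:
            c = self.transitions[t](c)
        assert self.is_terminal(c), "not a complete computation"
        return c
```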

Mapping transition-based parsers to sequence labeling parsers
We first formally define our mapping from transition systems to sequence labeling parsers, to then explain it in detail, provide examples, and analyze how it applies to different transition systems.

Definition 1. Let S = (C, T, c_s, C_t) be a transition system. We say that S is a left-to-right transition system if there is a subset of transitions T_s ⊆ T, called the read transitions, which satisfies the following conditions:
1. Every sequence of transitions t_1, ..., t_m corresponding to a complete computation for a sentence w_1 ... w_n has exactly n read transitions, one of which is t_1.
2. There is a constant value k such that, for each 1 ≤ i ≤ n, the partial parse contained in a computation starting at c_s and containing i read transitions in its transition sequence is a partial parse over the substring w_1 ... w_{i+k}.
Such parsers can be mapped into a sequence labeling encoding as follows.

Definition 2. Let S = (C, T, c_s, C_t) be a left-to-right transition system, and let γ = c_0, ..., c_m be a complete computation of S on a sentence w_1 ... w_n. By the first condition of a left-to-right transition system, we know that the transition sequence of γ has the form t_1^r, t_1^1, ..., t_1^{m_1}, t_2^r, t_2^1, ..., t_2^{m_2}, ..., t_n^r, t_n^1, ..., t_n^{m_n}, where each t_i^r is a read transition, and t_i^j is the jth consecutive non-read transition after t_i^r. Then, we define the label sequence associated to γ, denoted L(γ), as the sequence of n labels where the ith label is (t_i^r, t_i^1, ..., t_i^{m_i}).

Thus, informally speaking, the notion of a transition system being left-to-right corresponds to the presence of a transition (or set of transitions) that reads a new word from the input, in left-to-right order. In transition systems where such transitions are present, processing a sentence of length n requires n of them (as each of the n words must be read from the input); and we can use this property to split transition sequences for full trees into n labels (one per word), each led by a read transition, as in the sketch below. The constant k in Definition 1 is an offset between the number of read transitions that have been executed at a given state and the words that can appear in parses at that state, as seen in the examples below.
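The split defined by L(γ) is mechanical. The following is a minimal sketch of it; the function name and the representation of labels as tuples are our own choices.

```python
# A sketch of the mapping L(γ) from Definition 2: split a complete transition
# sequence into n labels, each opened by a read transition. "read" plays the
# role of the set T_s (e.g. {"SH"} for arc-standard).
from typing import List, Sequence, Tuple

def transitions_to_labels(transitions: Sequence[str],
                          read: set) -> List[Tuple[str, ...]]:
    labels: List[List[str]] = []
    for t in transitions:
        if t in read:          # t_i^r opens the label of the i-th word
            labels.append([t])
        else:                  # t_i^j joins the label of the current word
            labels[-1].append(t)
    return [tuple(label) for label in labels]

# Example (arc-standard, hypothetical 3-word sentence):
# transitions_to_labels(["SH", "SH", "LA", "SH", "RA"], {"SH"})
# -> [("SH",), ("SH", "LA"), ("SH", "RA")]
```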
Thus, the above definition provides a mapping from complete computations of a transition system to a label sequence. Since, as explained in Section 2, every correct transition system has at least one static oracle that maps gold parses to complete computations, it is easy to compose both mappings to define a mapping from left-to-right transition systems to sequence labeling parsers, where labels are subsequences of transitions.

Definition 3. Let S = (C, T, c_s, C_t) be a left-to-right transition system, w = w_1 ... w_n a sentence of length n, and ω a static oracle for S. Then, we define the sequence labeling encoding associated to S as the mapping κ : P_w → (T*)^n such that for each P ∈ P_w, κ(P) = L(ω(P)).

Examples
The arc-standard transition system for projective dependency parsing (shown in Table 1) is a left-to-right transition system, where SH is the only read transition. Figure 1 shows how a parse for an example projective sentence is converted into n labels. In this transition system, k = 0 because it needs to place words on the stack (via a read transition) before creating dependencies between them, so after having executed i read transitions, the words available for parsing are restricted to w_1 ... w_i. For example, one needs two SH transitions to be able to link "Kyrie" as a dependent of "ate".

Table 1 (excerpt, as recoverable from the source; subscripts "st" and "c" mark arc-standard and Covington transitions, respectively):
SH:       (σ, b|β, P) ⇒ (σ|b, β, P)
LA_st:    (σ|s1|s0, β, P) ⇒ (σ|s0, β, P ∪ {s0 → s1})
RA_st:    (σ|s1|s0, β, P) ⇒ (σ|s1, β, P ∪ {s1 → s0})
SH_c:     (λ1, λ2, b|β, P) ⇒ (λ1·λ2|b, [], β, P)
NO-ARC_c: (λ1|s, λ2, β, P) ⇒ (λ1, s|λ2, β, P)
LA_c:     (λ1|s, λ2, b|β, P) ⇒ (λ1, s|λ2, b|β, P ∪ {b → s})

The arc-eager transition system for projective dependency parsing (also in Table 1) is a left-to-right transition system, where both SH and RA are read transitions. This shows that, while in most transition systems the notion of a read transition is implemented by a SH or shift transition (which moves a word from the remaining input into a stack or another data structure for words being processed), this need not always be the case: in arc-eager, the right-arc transition also moves a word from the remaining input into the stack, apart from creating a right arc. Figure 1 shows the n labels resulting from the parse for the example sentence. In this transition system, k = 1 because it can create dependencies involving the first word in the buffer, so computations with i read transitions can result in a partial parse involving w_1 ... w_{i+1}. For example, the link between "Kyrie" and "ate" in Figure 1 is created after only one read transition.
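To make the arc-standard example concrete, here is a self-contained sketch composing a static oracle ω with the label split of Definition 2. Figure 1's exact sentence is not recoverable here, so we use a hypothetical three-word sentence "Kyrie ate pizza"; the oracle is a textbook arc-standard static oracle, not the paper's exact implementation.

```python
# A sketch of an arc-standard static oracle: produce a canonical transition
# sequence for a gold tree, then split it into per-word labels (Definition 2).
def arc_standard_oracle(head):
    n = len(head) - 1                 # tokens are 1..n; index 0 is the root
    n_children = [0] * (n + 1)
    for h in head[1:]:
        n_children[h] += 1
    attached = [0] * (n + 1)          # children already attached per head
    stack, buf, out = [], list(range(1, n + 1)), []
    while buf or len(stack) > 1:
        if len(stack) >= 2:
            s0, s1 = stack[-1], stack[-2]
            # LA/RA only once the dependent has collected all its children
            if head[s1] == s0 and attached[s1] == n_children[s1]:
                out.append("LA"); attached[s0] += 1; stack.pop(-2); continue
            if head[s0] == s1 and attached[s0] == n_children[s0]:
                out.append("RA"); attached[s1] += 1; stack.pop(); continue
        out.append("SH"); stack.append(buf.pop(0))
    return out

# "Kyrie ate pizza": Kyrie <- ate -> pizza, with "ate" headed by the root (0).
transitions = arc_standard_oracle([None, 2, 0, 2])
print(transitions)                    # ['SH', 'SH', 'LA', 'SH', 'RA']
# Splitting at SH (the read transition) gives the labels:
# Kyrie -> (SH); ate -> (SH, LA); pizza -> (SH, RA)
```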

Discussion
Informally speaking, the first condition of a left-to-right transition system simply states that when parsing a sentence of length n, the parser should start with a read transition and execute exactly n read transitions in total. This does not impose any limits on the nature of the transitions but only on their number and arrangement in transition sequences, and it is the basis of our mapping, which uses the n read transitions to split the transition sequence into n subsequences that can act as labels.
The second condition means that the algorithm needs to proceed in an incremental left-to-right manner, driven by the read transitions, in the sense that each read transition introduces the possibility of using a new word in the parse, and words are introduced in left-to-right order. This generalizes the classic notion of shift transitions reading words and placing them into a data structure (like a stack) that then allows their manipulation. Note that this is a weak form of incrementality. For example, in an arc-standard dependency parser a chain of right arcs is processed from right to left, but this condition is still met. The constant k is typically 0 or 1 in most parsers in the literature. For example, it is 1 in the classic arc-eager dependency parser and 0 in arc-standard, as explained above.
Moreover, we wish to stress that the notion of a left-to-right transition system defined in this work only implies incrementality at the transition system level. This means that it does not guarantee that any implementation of such a transition system is left-to-right incremental in the traditional psycholinguistic sense, as the definition of a transition system only concerns the transitions available and cannot impose restrictions on what a particular implementation can use to decide between them. For instance, implementations that use BiLSTMs (as those in our own experiments below) are not left-to-right incremental, as they have information about future input encoded in the hidden representations of the input tokens. However, in that case, the absence of left-to-right incrementality is due to the word representations chosen for the implementation (and not due to the transition system); and any transition system that meets our definition can be implemented in a truly incremental way by using word representations that do not depend on future input.
Strictly speaking, the first condition in our definition of a left-to-right transition system is enough to define a mapping from transition sequences to sequences of n labels, thus obtaining a sequence labeling encoding. The definition of the mapping above does not formally use or rely on the second condition. However, we include said condition because it means that the label for each word will encode information about the transitions executed after reading that word. In other words, it means that not only the transition sequences of the parser can be encoded as sequences of n labels, but also that the ith label in the encoding is semantically linked to the ith word. We believe that this is a reasonable common-sense assumption for the resulting sequence labeling model to be learnable.
An example of a transition system that fails the second condition is the easy-first parser of Goldberg and Elhadad (2010). The algorithm runs n transitions in total that attach each input word to a head, so if we ignore the second condition, all its transitions could be considered read transitions, and we would obtain a compact sequence labeling encoding where each label contains a single transition. However, the problem is that the label for a given word is not semantically linked to that word, as the parser can create arcs in any order, so the encoding can hardly be considered practical.

Coverage
Algorithm | L2R? | Read t. | k
Arc-standard (Fraser, 1989; Nivre, 2004) | Yes | SH | 0
Arc-eager (Nivre, 2003) | Yes | SH, RA | 1
Arc-hybrid (Kuhlmann et al., 2011) | Yes | SH | 1
Covington projective (Covington, 2001; Nivre, 2008) | Yes | SH | 0
Covington non-projective (Covington, 2001; Nivre, 2008) | Yes | SH | 0
Easy-first (Goldberg and Elhadad, 2010) | No | — | —
Attardi (Attardi, 2006) | Yes | SH | 0
Planar (Gómez-Rodríguez and Nivre, 2010) | Yes | SH | 1
2-Planar (Gómez-Rodríguez and Nivre, 2010) | Yes | SH | 1
Arc-eager with buffer transitions (Fernández-González and Gómez-Rodríguez, 2012) | Yes | SH | 2,3
Swap (Nivre, 2009) | No | — | —
Swap-hybrid (de Lhoneux et al., 2017b) | No | — | —
Arc-swift (Qi and Manning, 2017) | Yes | SH, RA_k | 1
Spinal arc-eager (Ballesteros and Carreras, 2015) | Yes | SH, RA | 1
Yamada and Matsumoto (2003) | No | — | —
Choi and Palmer (2011) | Yes | SH | 1
Choi and McCallum (2013) | Yes | No-SH, R-SH | 1
Non-monotonic arc-eager (Honnibal et al., 2013) | Yes | SH, RA | 1
Improved non-monotonic arc-eager (Honnibal and Johnson, 2015) | No | — | —
Non-monotonic Covington (Fernández-González and Gómez-Rodríguez, 2017) | Yes | SH | 1
Tree-constrained arc-eager (Nivre and Fernández-González, 2014) | No | — | —
Non-local Covington (Fernández-González and Gómez-Rodríguez, 2018) | Yes | SH | 1
Two-register (Pitler and McDonald, 2015) | No | — | —
Stack-pointer (Ma et al., 2018) | No | — | —
Left-to-right pointer network (Fernández-González and Gómez-Rodríguez, 2019b) | No | — | —

Table 2: Transition-based dependency parsers, whether they are left-to-right (L2R?) or not, their read transitions in case they are, and the value of the constant k. The value of k should be seen as a guide only, as k can vary between variants of each parser. For example, most definitions of arc-standard create arcs between nodes on the stack, so k = 0 (Nivre, 2004), but it has also been defined in an equivalent form where arcs are created between the stack and the buffer, so k = 1 (see Nivre (2008)). The same happens with Attardi and other algorithms.

Table 2 classifies a wide range of transition-based dependency parsers as left-to-right or not, together with the values of the read transitions and of k if applicable. As can be seen in the table, the majority of known transition-based parsers conform to our definition of left-to-right, and hence yield an encoding that can be used to define a sequence labeling parser with the framework defined here. Exceptions are parsers with multiple left-to-right passes over the input (Yamada and Matsumoto, 2003), those that can create arcs between words in arbitrary order (like the aforementioned easy-first parser of Goldberg and Elhadad (2010)), those that use unshift transitions that return a node to the buffer (Nivre and Fernández-González, 2014; Honnibal and Johnson, 2015) or swap transitions, which have the same side effect (Nivre, 2009), and those that can create arcs involving nodes arbitrarily far to the right (Ma et al., 2018; Fernández-González and Gómez-Rodríguez, 2019b). For all of these exceptions, one could still construct sequence labeling encodings by ignoring the second condition of a left-to-right system, as arc-creating transitions always meet the first condition, but this is of dubious practicality in most cases, as explained above. A singular case in this respect is the left-to-right pointer network parser of Fernández-González and Gómez-Rodríguez (2019b).
This parser does not fit the second condition of our definition because it can use nodes arbitrarily far to the right as heads (in spite of being purely left-to-right in terms of the order in which it considers dependents); if we still apply our transformation to it, we obtain an encoding isomorphic to the relative positional encoding used in Li et al. (2018), which has been shown to be useful under strong machine learning models (Vacareanu et al., 2020).

Table 3 shows transition-based constituency parsers with information analogous to Table 2. In this case, all of the listed parsers that do not fall under our definition of left-to-right are discontinuous constituent parsers that use swap transitions (Versley, 2014; Maier, 2015). Every continuous constituent parser that we found is covered, as well as discontinuous parsers that use other devices, like gap transitions or set handling. All of the supported constituent parsers have k = 0, following the traditional shift-reduce paradigm that only operates on nodes after reading them from the input buffer.

Table 4 lists transition-based semantic dependency parsers with information analogous to Tables 2 and 3. Once again, the parsers that do not fall under our definition of left-to-right are mostly those that define a swap transition. Note that for this coverage analysis we exclude semantic formalisms that go beyond dependency graphs, such as AMR (Banarescu et al., 2013), where pure transition-based models (e.g. Damonte et al. (2017), Ballesteros and Al-Onaizan (2017), Vilares and Gómez-Rodríguez (2018)) need to remove tokens and also create (multiple) concepts from words, and therefore include specific transitions to do so. Removing tokens can be seen as a read transition, but creating many concept nodes from a single word breaks the left-to-right condition and does not ensure that the system will have n read transitions, where n is the length of the input sentence. Although outside the scope of this paper, note that a hybrid system that first computed the concepts from the words (e.g. with a seq2seq architecture) and then applied a transition-based graph parser could satisfy the left-to-right condition in the same way as the other formalisms we have considered.
Putting it all together, our mapping is applicable to a wide range of transition systems, spanning a variety of formalisms: projective and non-projective dependency parsing, continuous and discontinuous constituency parsing, and various flavours of semantic parsing. Also, note that the impossibility of mapping buffer-based swap transition-based parsers is not related to any inability of our approach to handle non-projectivity, but to the non-left-to-right nature of those swap models. For instance, in Table 2 we showed how other non-projective transition-based algorithms (Covington, 2001; Attardi, 2006; Gómez-Rodríguez and Nivre, 2010) can be mapped to a sequence labeling encoding.

Experiments
To test the practical applicability of our theoretical contribution, we implement sequence labeling versions, obtained using the mapping in Section 3, of various syntactic dependency parsers. We include three well-known projective parsers: arc-standard (Nivre, 2004), arc-eager (Nivre, 2003) and arc-hybrid (Kuhlmann et al., 2011); and one parser with full coverage of non-projective trees: the Covington non-projective parser (Covington, 2001; Nivre, 2008). The transition systems for all of these parsers are shown in Table 1. For comparison, we also include two of the best encodings to date for dependency parsing as labeling (Strzyz et al., 2019): (i) the rel-PoS encoding (Spoustová and Spousta, 2010) and (ii) the bracketing encoding (Yli-Jyrä and Gómez-Rodríguez, 2017). The former casts the problem as a head-selection task where each token encodes its head using a PoS-tag-based offset. The latter assigns a label to each token encoding a set of incoming/outgoing arcs for that token and its neighbouring tokens.
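For readers unfamiliar with the rel-PoS baseline, the following is a minimal sketch of the head-selection idea as we read it from Strzyz et al. (2019): the label of a word encodes its head as the o-th word with a given PoS tag, counting to the right if o > 0 and to the left if o < 0. Function names, the exact label format and the example are ours, not the cited work's.

```python
# A sketch of the rel-PoS head-selection encoding (dependency relations
# omitted). heads[i] and pos[i] are given for tokens i = 1..n; index 0 is an
# artificial ROOT node with its own PoS tag.
def relpos_encode(heads, pos):
    labels = []
    for i in range(1, len(pos)):
        h = heads[i]
        if h > i:   # head to the right: count words with its PoS in (i, h]
            o = sum(1 for j in range(i + 1, h + 1) if pos[j] == pos[h])
        else:       # head to the left (or ROOT): count within [h, i)
            o = -sum(1 for j in range(h, i) if pos[j] == pos[h])
        labels.append((o, pos[h]))
    return labels

# relpos_encode([None, 2, 0, 2], ["ROOT", "PROPN", "VERB", "NOUN"])
# -> [(1, "VERB"), (-1, "ROOT"), (-1, "VERB")]
```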

Data
Following Anderson and Gómez-Rodríguez (2020), we choose a subset of UDv2.4 treebanks (Nivre and others, 2019) which includes languages with a variety of corpus sizes, language typologies, alphabets and levels of non-projectivity, among other differences. More particularly, these treebanks are: Ancient Greek Perseus, Chinese GSD, English EWT, Finnish TDT, Hebrew HTB, Russian GSD, Tamil TTB, Uyghur UDT and Wolof WTB. Appendix A shows the number of labels that our approach generates for each treebank. We also use UDPipe (Straka, 2018) to obtain data with predicted segmentation, tokenization and PoS tags.

Sequence labeling models
For training, we consider two sequence labeling encoders (see Appendix B for hyper-parameters), which produce n hidden contextualized representations h_i from which the labels are generated:

BiLSTMs (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) We use the NCRF++ framework (Yang and Zhang, 2018) to train the models and consider two different setups (the first is sketched in code below):
1. A setup where the input vectors to the models are a concatenation of the pre-trained word embeddings by Ginter et al. (2017), a second word vector generated by a char-LSTM layer that is trained together with the rest of the network, and PoS tag vectors.
2. An otherwise identical setup that omits the PoS tag vectors.
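The following is a minimal PyTorch sketch (our choice of framework for illustration) of the setup-1 input layer: pretrained word embedding, char-LSTM vector and PoS embedding concatenated, with dimensions taken from Appendix B. Setup 2 simply drops the PoS part. Module and argument names are ours.

```python
import torch
import torch.nn as nn

# A sketch of the per-token input vector: word emb (100) ⊕ char-BiLSTM (50)
# ⊕ PoS emb (25), matching the dimensions reported in Appendix B.
class TokenEncoder(nn.Module):
    def __init__(self, n_words, n_chars, n_pos, use_pos=True):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, 100)  # pretrained in practice
        self.char_emb = nn.Embedding(n_chars, 30)
        self.char_lstm = nn.LSTM(30, 25, bidirectional=True, batch_first=True)
        self.pos_emb = nn.Embedding(n_pos, 25) if use_pos else None

    def forward(self, word_ids, char_ids, pos_ids=None):
        # char_ids: (n_tokens, max_word_len); keep the final BiLSTM states
        _, (h, _) = self.char_lstm(self.char_emb(char_ids))
        char_vec = torch.cat([h[0], h[1]], dim=-1)   # (n_tokens, 50)
        parts = [self.word_emb(word_ids), char_vec]
        if self.pos_emb is not None:                 # setup 2 skips this
            parts.append(self.pos_emb(pos_ids))
        return torch.cat(parts, dim=-1)
```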
The motivation for setup 1 is that the PoS-tag-based encoding (Strzyz et al., 2019) requires PoS tags to decode the labels into a tree. Since PoS tags need to be computed anyway to decode the tree, it is fair to also use them as input features, as done by Strzyz et al. (2019). The motivation for setup 2 is that, in an era where the usefulness of PoS tags has been questioned (de Lhoneux et al., 2017a; Smith et al., 2018), our encodings (which do not require PoS tags to decode the trees) could benefit in terms of speed and simplicity, at a minimal cost to accuracy.
BERT (Devlin et al., 2019) By default we fine-tune multilingual BERT (M-BERT), in particular bert-base-multilingual-cased, except for English, Chinese and Finnish (Virtanen et al., 2019), for which monolingual models are available. BERT splits the input into sub-word pieces (Wu et al., 2016), generating more sub-words than tokens, while we have a fixed number of labels (equal to the number of tokens in the sentence) to assign. To solve this, we take a word's representation to be that of its first sub-word. Contrary to the BiLSTMs, BERT does not use PoS tags as input, and we only use them to decode the trees of the rel-PoS encoding.
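A sketch of the first-sub-word alignment just described, using the HuggingFace transformers tokenizer (our choice of tooling; the paper does not prescribe an implementation):

```python
from transformers import AutoTokenizer

# Keep only the position of each token's first word piece, so that the n
# labels align with the n tokens. The example words are hypothetical.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
words = ["Kyrie", "ate", "pizza"]
enc = tok(words, is_split_into_words=True, return_tensors="pt")

first_subword, seen = [], set()
for pos, wid in enumerate(enc.word_ids()):   # wid is None for [CLS]/[SEP]
    if wid is not None and wid not in seen:
        seen.add(wid)
        first_subword.append(pos)
# hidden_states[first_subword] then feeds the label classifiers, one per token.
```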
To generate the output labels from the hidden vectors, we map each h_i to two output softmax layers (tasks) following a hard-sharing multi-task learning architecture: one that predicts the subsequence of transitions associated to that word, and a second one that predicts the dependency relation between that word and its head. The loss is computed as the sum of the categorical cross-entropy of both tasks. Corrupted sequences of predicted labels are postprocessed as described in Appendix C.
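A minimal sketch of this two-task output layer, assuming the encoder (BiLSTM or BERT) already provides the contextual vectors h_i; class and dimension names are ours:

```python
import torch
import torch.nn as nn

# Hard-sharing multi-task head: one softmax over transition-subsequence
# labels, one over dependency relations; the loss is the sum of both
# cross-entropies, as described above.
class TwoHeadLabeler(nn.Module):
    def __init__(self, hidden_dim, n_transition_labels, n_relations):
        super().__init__()
        self.trans_head = nn.Linear(hidden_dim, n_transition_labels)
        self.rel_head = nn.Linear(hidden_dim, n_relations)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, h, gold_trans=None, gold_rel=None):
        # h: (n_tokens, hidden_dim), the contextual vectors h_i
        trans_logits = self.trans_head(h)
        rel_logits = self.rel_head(h)
        if gold_trans is None:       # prediction mode
            return trans_logits.argmax(-1), rel_logits.argmax(-1)
        return self.ce(trans_logits, gold_trans) + self.ce(rel_logits, gold_rel)
```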

Results
Table 5 shows the results of the encodings trained with the BiLSTM architecture. Overall, we see that the transition-based encodings perform comparably to the preexisting encodings. The performance of arc-standard, arc-eager and arc-hybrid is similar across the board, while Covington suffers more, probably due to its large output vocabulary (see Appendix A): a Covington transition sequence has O(n^2) transitions that we group into n labels, contrary to the other three algorithms, which run O(n) transitions per sentence. Still, for highly non-projective treebanks such as Ancient-Greek-Perseus, the Covington mapping performs best among the transition-based encodings, showing that it is learnable and, in this case, preferable to purely projective transition-based encodings.

With respect to the BiLSTM setups 1 (with PoS tags) and 2 (without PoS tags), we observe that while the transition-based encodings (and the bracketing-based one) suffer little from the absence of PoS tags, the rel-PoS one greatly needs them, losing over 5 LAS points on average when they are not used. This is relevant since the use of PoS tags increases latency, and their usefulness has been questioned in some parsing setups (de Lhoneux et al., 2017a; Smith et al., 2018). In addition, in Appendix D we compute empirical upper bounds for the accuracy of the encodings by running them on the gold datasets (gold segmentation, tokenization and PoS tags) and compare them with the accuracy of the predicted setup with PoS tags. We observe that the accuracy of upstream tasks has a large impact on all models, including the baselines, with the largest gaps between the parsing results on the predicted and gold datasets occurring for treebanks where segmentation, tokenization or tagging is difficult (for instance, the biggest difference is observed in Hebrew, where words and UPoS tags have the lowest prediction accuracy).

Table 6 shows the results for BERT. We excluded Ancient Greek, Uyghur and Wolof, since M-BERT does not support them and we are not aware of monolingual models for them. The tendency is similar to that of BiLSTM setup 2, but with higher scores for the monolingual BERTs (not M-BERT, though) and rel-PoS lagging further behind the bracketing and the projective transition-based encodings (on average, around 4 and 2.5 LAS points, respectively).
It is also worth comparing the results of our sequence labeling implementations of the projective transition-based parsers with the experiments of Shi et al. (2017), who used regular transition-based implementations with different sets of positional features. For this, we run our BiLSTM setup on the same dataset, i.e., the English PTB dev set. The results can be seen in Table 7. Shi et al. (2017) concluded that, with a BiLSTM-based architecture, two positional features (one stack and one buffer feature) were needed to obtain reasonable accuracy with the arc-eager and arc-hybrid parsers, and three (two stack and one buffer feature) in the case of the arc-standard parser. In contrast, and although the exact accuracy numbers do not provide a homogeneous comparison due to hyperparameter differences, Table 7 shows that our labels of sequences of transitions can be learned with only one positional feature: our setup just assigns a sequence of labels beginning with a SH to each word shifted from the input, i.e., it only uses the first buffer word b_0, and has access to no explicit representation of stack elements at all. Thus, our multi-transition labels seem to be learnable with less positional information than individual transitions in transition systems, although the latter (with suitable features) are still ahead in terms of raw accuracy. While similar effects had been observed in seq2seq transition-based parsers (Zhang et al., 2017; Liu and Zhang, 2017a), those use more complex neural architectures than standard transition-based implementations, including attention weights that can focus on words in prominent stack positions (Liu and Zhang, 2017a). Here, we are just using plain BiLSTMs, as in the standard transition-based implementation of Shi et al. (2017).

Table 7: Performance (in UAS%) of our system using a single positional feature b_0, compared with Shi et al. (2017) on the English PTB dev set.

Conclusion
This paper has established a theoretical relationship between transition-based and sequence labeling parsing, valid for a broad definition of left-to-right transition-based algorithms. It also provides a new set of encodings for sequence labeling parsing which are automatically derived, in contrast to existing ones, which were created ad hoc for this purpose. To test their practical utility, we ran experiments on dependency parsing for a diverse set of languages. Interestingly, the mapping is meaningful and learnable, despite not using any representation of stack nodes. While our experiments focused on dependency parsing, as we only aimed to validate the theory, an obvious avenue for future work is to implement and test the sequence labeling encodings derived from the constituent and semantic parsers in Tables 3 and 4. The latter are particularly relevant in the sense that, to our knowledge, there is no previous work on semantic parsing as sequence labeling; our method thus provides the first encodings for this purpose.

A Label sizes
In Table 8 we show the number of distinct labels that several transition-based algorithms generate when mapped to a sequence labeling setup, according to our theory, on the UD treebanks used in our experiments. For comparison purposes, we also include the number of labels of other existing encodings for sequence labeling dependency parsing presented in Strzyz et al. (2019): the rel-PoS encoding (Spoustová and Spousta, 2010) and the bracketing encoding (Yli-Jyrä and Gómez-Rodríguez, 2017).

B Model parameters
BiLSTMs (NCRF++) All models were trained with a learning rate of 0.02 and a learning rate decay of 0.05, for 100 epochs with batch size 8 for training and 128 for testing. The dimension of the pretrained word vectors was 100, and 30 for character embeddings. If PoS tag embeddings were used, their dimension was 25. The word- and character-level hidden vector dimensions were set to 800 and 50, respectively. We used 2 BiLSTM layers, with a momentum of 0.9 for training. Dependency labels were learned in a multi-task learning setup.
BERT All models were fine-tuned with a learning rate of 10^-5 for 45 epochs with batch size 8. The maximum sequence length was set to 400, except for Russian, where it was 510.

C Postprocessing
In case a model outputs a label with an illegal action (due to violating preconditions in the transition system, e.g. a left arc in the arc-eager algorithm when the topmost stack node already has a head), we discard it and move to the next predicted action in the sequence.
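As an illustration, the following is a sketch of this fallback for an arc-eager-style system. The precondition tests are simplified and the function is ours, not the paper's exact released implementation.

```python
# A sketch of postprocessing: apply each predicted transition only if its
# precondition holds, otherwise discard it and move to the next one.
def execute_with_skips(actions, n):
    stack, buf, has_head, arcs = [], list(range(1, n + 1)), set(), []
    def legal(a):
        if a == "SH": return bool(buf)
        if a == "RA": return bool(stack) and bool(buf)
        # e.g. LA is illegal when the topmost stack node already has a head
        if a == "LA": return bool(stack) and bool(buf) and stack[-1] not in has_head
        if a == "RE": return bool(stack) and stack[-1] in has_head
        return False
    for a in actions:
        if not legal(a):
            continue                      # discard the illegal action
        if a == "SH":
            stack.append(buf.pop(0))
        elif a == "RA":                   # arc stack top -> buffer front, then shift
            arcs.append((stack[-1], buf[0])); has_head.add(buf[0])
            stack.append(buf.pop(0))
        elif a == "LA":                   # arc buffer front -> stack top, then pop
            arcs.append((buf[0], stack[-1])); has_head.add(stack.pop())
        elif a == "RE":
            stack.pop()
    return arcs
```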