Discontinuous Constituent Parsing as Sequence Labeling

This paper reduces discontinuous parsing to sequence labeling. It first shows that existing reductions for constituent parsing as labeling do not support discontinuities. Second, it fills this gap and proposes to encode tree discontinuities as nearly ordered permutations of the input sequence. Third, it studies whether such discontinuous representations are learnable. The experiments show that despite the architectural simplicity, under the right representation, the models are fast and accurate.


Introduction
Discontinuous constituent parsing studies how to generate phrase-structure trees of sentences coming from non-configurational languages (Johnson, 1985), where non-consecutive tokens can be part of the same grammatical function (e.g. non-consecutive terms belonging to the same verb phrase). Figure 1 shows a German sentence exhibiting this phenomenon. Discontinuities happen in languages that exhibit free word order, such as German or Guugu Yimidhirr (Haviland, 1979; Johnson, 1985), but also in those with high rigidity, e.g. English, whose grammar allows certain discontinuous expressions, such as wh-movement or extraposition (Evang and Kallmeyer, 2011). This makes discontinuous parsing a core computational linguistics problem that affects a wide spectrum of languages.
Figure 1: An example of a German sentence exhibiting discontinuous structures, extracted from the NEGRA treebank (Skut et al., 1997); an AVP spans non-contiguous tokens, glossed '(Yet) (never) (have) (I) (so) (much) (chosen)'. A valid English translation is: 'Never before have I chosen so much.'

Some discontinuous parsers prioritize accuracy at the cost of speed, while others give up significant performance to achieve an acceptable latency (Maier, 2015). Related to these research aspects, this work explores the feasibility of discontinuous parsing under the sequence labeling paradigm, inspired by Gómez-Rodríguez and Vilares (2018)'s work on fast and simple continuous constituent parsing. We will focus on tackling the limitations of their encoding functions when it comes to analyzing discontinuous structures, and include an empirical comparison against existing parsers.
Contribution (i) The first contribution is theoretical: to reduce constituent parsing of free word order languages to a sequence labeling problem. This is done by encoding the order of the sentence as (nearly ordered) permutations. We present various ways of doing so, which can be naturally combined with the labels produced by existing reductions for continuous constituent parsing. (ii) The second contribution is a practical one: to show how these representations can be learned by neural transducers. We also shed light on whether general-purpose architectures for NLP tasks (Devlin et al., 2019;Sanh et al., 2019) can effectively parse free word order languages, and be used as an alternative to adhoc algorithms and architectures for discontinuous constituent parsing.

Related work
Discontinuous phrase-structure trees can be derived by expressive formalisms such as Multiple Context-Free Grammars (MCFGs) (Seki et al., 1991) or Linear Context-Free Rewriting Systems (LCFRS) (Vijay-Shanker et al., 1987). MCFGs and LCFRS are essentially an extension of Context-Free Grammars (CFGs) such that non-terminals can link to non-consecutive spans. Traditionally, chart-based parsers relying on this paradigm commonly suffer from high complexity (Evang and Kallmeyer, 2011; Maier and Kallmeyer, 2010; Maier, 2010). Let k be the block degree, i.e. the number of non-consecutive spans that can be attached to a single non-terminal; the complexity of applying CYK (after binarizing the grammar) would be O(n^{3k}) (Seki et al., 1991), which can be improved to O(n^{2k+2}) if the parser is restricted to well-nested LCFRS (Gómez-Rodríguez et al., 2010). Maier (2015) discusses how, for a standard discontinuous treebank, k ≈ 3 (in contrast to k = 1 in CFGs). Recently, Corro (2020) presented a chart-based parser for k = 2 that can run in O(n^3), which is equivalent to the running time of a continuous chart parser, while covering 98% of the discontinuities. Also recently, Stanojević and Steedman (2020) presented an LCFRS parser with k = 2 that runs in O(l·n^4 + n^6) worst-case time, where l is the number of unique non-terminal symbols, but they show that its empirical running time is among the best of chart-based parsers.
Differently, it is possible to rely on the idea that discontinuities are inherently related to the location of the token in the sentence. In this sense, it is possible to reorder the tokens while still obtaining a grammatical sentence that can be parsed by a continuous algorithm. This is usually achieved with transition-based parsing algorithms and the swap transition (Nivre, 2009), which switches the two topmost elements in the stack. For instance, Versley (2014) uses this transition to adapt an easy-first strategy (Goldberg and Elhadad, 2010) for dependency parsing to discontinuous constituent parsing. In a similar vein, Maier (2015) builds on top of a fast continuous shift-reduce constituent parser (Zhu et al., 2013), and incorporates both standard and bundled swap transitions in order to analyze discontinuous constituents. Maier's system produces derivations of length up to n^2 − n + 1 given a sentence of length n. More efficiently, Coavoux and Crabbé (2017) present a transition system that replaces swap with a gap transition. The intuition is that a reduction does not always need to be applied locally to the two topmost elements in the stack, and that those two items can be connected, despite the existence of a gap between them, using non-local reductions. Their algorithm ensures an upper bound of n(n−1)/2 transitions. With a different optimization goal, Stanojević and Alhama (2017) replaced the traditional reliance of discontinuous parsers on averaged perceptrons and hand-crafted features with a recursive neural network approach that guides a swap-based system, with the capacity to generate contextualized representations. A more recent transition system replaces the stack used in transition-based systems with a memory set containing the created constituents. This model allows interactions between elements that are not adjacent, without the swap transition, to create a new (discontinuous) constituent. Trained with a 2-stacked BiLSTM transducer, the model is guaranteed to build a tree within 4n − 2 transitions, given a sentence of length n.
A middle ground between explicit constituent parsing algorithms and this paper is the work based on transformations. For instance, Hall and Nivre (2008) convert constituent trees into a non-linguistic dependency representation that is learned by a transition-based dependency parser, to then map its output back to a constituent tree. A similar approach is taken by Fernández-González and Martins (2015), but they propose a more compact representation that leads to a much reduced set of output labels. Other authors, such as Versley (2016), propose a two-step approach that approximates discontinuous structure trees by parsing context-free grammars with generative probabilistic models and transforming the output to discontinuous trees. Corro et al. (2017) cast discontinuous phrase-structure parsing into a framework that jointly performs supertagging and non-projective dependency parsing by a reduction to the Generalized Maximum Spanning Arborescence problem (Myung et al., 1995). The recent work by Fernández-González and Gómez-Rodríguez (2020a) can also be framed within this paradigm. They essentially adapt the work by Fernández-González and Martins (2015) and replace the averaged perceptron classifier with pointer networks (Vinyals et al., 2015), addressing the problem as a sequence-to-sequence task (for dependency parsing) whose output is then mapped back to the constituent tree. Subsequently, Fernández-González and Gómez-Rodríguez (2020b) extended pointer networks with multi-task learning to jointly predict constituent and dependency outputs.
In this context, the closest work to ours is the reduction proposed by Gómez-Rodríguez and Vilares (2018), who cast continuous constituent parsing as sequence labeling.[3] In the next sections we build on top of their work and: (i) analyze why their approach cannot handle discontinuous phrases, (ii) extend it to handle such phenomena, and (iii) train functional sequence labeling discontinuous parsers.

Preliminaries
Let w = [w_0, w_1, ..., w_{|w|−1}] be an input sequence of tokens, and T_{|w|} the set of (continuous) constituent trees for sequences of length |w|. Gómez-Rodríguez and Vilares (2018) define an encoding function Φ: T_{|w|} → L^{|w|} that maps continuous constituent trees into a sequence of labels of the same length as the input. Each label l_i ∈ L is composed of three components, l_i = (n_i, x_i, u_i):

• n_i encodes the number of levels in the tree in common between the words w_i and w_{i+1}. To obtain a manageable output vocabulary space, n_i is actually encoded as the difference n_i − n_{i−1}, with n_{−1} = 0. We denote by abs(n_i) the absolute number of levels represented by n_i, i.e. the total levels in common shared between a word and the next one.

• x_i represents the lowest non-terminal symbol shared between w_i and w_{i+1} at level abs(n_i).

• u_i encodes a leaf unary chain, i.e. non-terminals that belong only to the path from the terminal w_i to the root.[4] Note that Φ cannot encode this information in (n_i, x_i), as these components always represent common information between w_i and w_{i+1}.
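The relative delta scheme for the n_i component can be sketched in a few lines (an illustrative sketch, not the authors' implementation; function names are ours):

```python
def encode_levels(abs_common_levels):
    """Encode the absolute numbers of levels in common between
    consecutive words as deltas n_i - n_{i-1}, with n_{-1} = 0."""
    deltas, prev = [], 0
    for level in abs_common_levels:
        deltas.append(level - prev)
        prev = level
    return deltas

def decode_levels(deltas):
    """Recover the absolute levels by accumulating the deltas."""
    out, prev = [], 0
    for d in deltas:
        prev += d
        out.append(prev)
    return out
```

For instance, absolute levels [2, 1, 2] become the deltas [2, −1, 1], and decoding restores the original values.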
[3] Related to constituent parsing and sequence labeling, there are two related papers that made early efforts (although not a full reduction of the former to the latter) and need to be credited too. Ratnaparkhi (1999) popularized maximum-entropy models for parsing and combined a sequence labeling process that performs PoS tagging and chunking with a set of shift-reduce-like operations to complete the constituent tree. In a related line, Collobert (2011) proposed a multi-step approach consisting of n passes over the input sentence, where each pass tags every word as being part of a constituent or not at one of the n levels of the tree, using an IOBES scheme.

[4] Intermediate unary chains are compressed into a single non-terminal and treated as regular branches.

Figure 2 illustrates the encoding on a continuous example.

Figure 2: An example of a continuous tree encoded according to Gómez-Rodríguez and Vilares (2018).
Incompleteness for discontinuous phrase structures Gómez-Rodríguez and Vilares proved that Φ is complete and injective for continuous trees. However, it is easy to prove, by counterexample, that its validity does not extend to discontinuous trees. Figure 3 shows a minimal discontinuous tree that cannot be correctly decoded. The inability to encode discontinuities lies in the assumption that w_{i+1} will always be attached to a node belonging to the path from the root to w_i (n_i is then used to specify the location of that node in the path). This is always true in continuous trees, but not in discontinuous trees, as can be seen in Figure 3, where c is the child of a constituent that does not lie in the path from S to b.

Encoding nearly ordered permutations
Next, we fill this gap to address discontinuous parsing as sequence labeling. We will extend the encoding Φ to the set of discontinuous constituent trees, which we will call T̄_{|w|}. The key to doing this relies on a well-known property: a discontinuous tree t ∈ T̄_{|w|} can be represented as a continuous one using an in-order traversal that keeps track of the original indexes (e.g. the trees at the left and the right of Figure 4). We will call this tree the (canonical) continuous arrangement of t, ω(t) ∈ T_{|w|}.
Thus, if given an input sentence we can generate the position of every word as a terminal in ω(t), the existing encodings to predict continuous trees as sequence labeling can be applied on ω(t). In essence, this is learning to predict a permutation of w. As introduced in §2, the concept of the location of a token is not a stranger to transition-based discontinuous parsing, where actions such as swap switch the position of two elements in order to create a discontinuous phrase. We instead propose to explore how to handle this problem in an end-to-end sequence labeling fashion, without relying on any parsing structure or set of transitions.
To do so, we first denote by τ: {0, ..., |w|−1} → {0, ..., |w|−1} the permutation that maps the position i of a given w_i in w to its position as a terminal node in ω(t). From this, one can derive π: W^n → W^n, a function that encodes a permutation of w in such a way that its phrase structure does not have crossing branches. For continuous trees, τ and π are identity permutations. Then, we extend the tree encoding function Φ to Φ̄, whose labels have a fourth component, l_i = (n_i, x_i, u_i, p_i), where p_i is a discrete symbol such that the sequence of p_i's encodes the permutation τ (typically each p_i will be an encoding of τ(i), i.e. the position of w_i in the continuous arrangement, although this need not be true in all encodings, as will be seen below).
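Concretely, if we read the original word indexes off the terminals of ω(t) from left to right, τ is just the inverse of that sequence. A minimal sketch (the arrangement below is a hypothetical example, and the function name is ours):

```python
def tau_from_arrangement(arranged_indexes):
    """Given the original word indexes read off the terminals of the
    continuous arrangement (left to right), build tau, which maps each
    original position i to its position in the arrangement. Assumes
    `arranged_indexes` is a permutation of range(n)."""
    tau = [0] * len(arranged_indexes)
    for pos, original_index in enumerate(arranged_indexes):
        tau[original_index] = pos
    return tau
```

For a continuous tree the terminals are read in order [0, 1, ..., n−1], and τ is the identity permutation.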
The crux of defining a viable encoding for discontinuous parsing is then in how we encode τ as a sequence of values p_i, for i = 0 ... |w|−1. While the naive approach would be the identity encoding (p_i = τ(i)), we ideally want an encoding that balances minimizing sparsity (by minimizing infrequently-used values) and maximizing learnability (by being predictable). To do so, we will look for encodings that take advantage of the fact that discontinuities in attested syntactic structures are mild (Maier and Lichte, 2011), i.e., in most cases, τ(i + 1) = τ(i) + 1. In other words, permutations τ corresponding to real syntactic trees tend to be nearly ordered permutations. Based on these principles, we propose below a set of concrete encodings, which are also depicted on an example in Figure 4. All of them handle multiple gaps (a discontinuity inside a discontinuity) and cover 100% of the discontinuities. Even if this has little effect in practice, it is an interesting property compared to algorithms that limit the number of gaps they can address (Corro, 2020).

Absolute-position For every token w_i whose position changes in the continuous arrangement, p_i directly encodes its target position, i.e. p_i = τ(i). Otherwise, we use a special label INV, which represents that the word is a fixed point in the permutation, i.e., it occupies the same place in the sentence and in the continuous arrangement.

Relative-position Instead of the absolute target position, p_i encodes the offset τ(i) − i, again reserving INV for fixed points. Offsets are less sparse than absolute positions, as tokens that are displaced together share the same label.
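A sketch of the absolute-position labels, together with an offset-based relative variant; how exactly offsets are represented is our assumption, included because the experiments later compare the two encodings (function names are ours):

```python
INV = "INV"  # special label for fixed points of the permutation

def encode_absolute(tau):
    """p_i is the target position tau(i), or INV for fixed points."""
    return [INV if t == i else t for i, t in enumerate(tau)]

def encode_relative(tau):
    """p_i is the offset tau(i) - i, or INV for fixed points
    (our assumed representation of the relative encoding)."""
    return [INV if t == i else t - i for i, t in enumerate(tau)]
```

For τ = [0, 1, 5, 6, 2, 3, 4], the absolute labels are [INV, INV, 5, 6, 2, 3, 4], while the relative labels are [INV, INV, 3, 3, −2, −2, −2]: displaced tokens share repeated offset values, illustrating the lower sparsity.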
Lehmer code The Lehmer code represents a permutation of a sequence n as a sequence σ, where σ_i counts the elements that appear after the ith element of the permuted sequence and are smaller than it. In the context of discontinuous parsing and encoding p_i, n can be seen as the input sentence w, where π(w) is encoded by σ. The Lehmer code is particularly suitable for this task in terms of compression, as in most cases we expect (nearly) ordered permutations, which translates into the majority of elements of σ being zero. However, this encoding poses some potential learnability problems. The root of the problem is that σ_i does not necessarily encode τ(i), but τ(j), where j is the index of the word that occupies the ith position in the continuous arrangement (i.e., j = τ^{−1}(i)). In other words, this encoding is expressed following the order of words in the continuous arrangement rather than the input order, causing a non-straightforward mapping between input words and labels. For instance, σ_2 does not encode the location of n_2 but that of n_3.
Figure 4: (1) the original discontinuous tree, (2) its continuous arrangement, and (3) the application of the proposed encodings to encode the sentence permutation π(w) that corresponds to the continuous arrangement, as a sequence of labels p = [p_0, ..., p_{|w|−1}].

Lehmer code of the inverse permutation To ensure that each p_i encodes τ(i), we instead interpret p_i as meaning that w_i should fill the (p_i + 1)th currently remaining blank in a sequence σ that is initialized as a sequence of blanks, i.e. σ = [•, •, ..., •]. For instance, let n = [0, 1, 2, 3, 4] be a sentence whose continuous arrangement is [0, 1, 3, 4, 2]: n_0 and n_1 fill the first available blank (p_0 = p_1 = 0), and n_2 fills the third remaining blank (p_2 = 2), i.e. σ = [0, 1, •, •, 2]. After that, n_3 and n_4 occupy the first available blank (so p_3 = p_4 = 0). Thus, we obtain the desired arrangement σ = [0, 1, 3, 4, 2], and the encoding is [0, 0, 2, 0, 0]. It is easy to check that this produces the Lehmer code of the inverse permutation of τ. Hence, it shares the property that the identity permutation is encoded by a sequence of zeros, but it is more straightforward for our purposes, as each p_i encodes information about τ(i), the target position of w_i in the continuous arrangement. Note that this and the Lehmer code coincide iff τ is a permutation that is its own inverse (see Muir, 1891), of which the identity is a particular case.
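The blank-filling procedure above can be sketched as an encoder/decoder pair (an illustrative sketch; function names are ours):

```python
def encode_lehmer_inverse(tau):
    """For each word i (in input order), p_i counts the still-empty
    slots that precede w_i's target slot tau(i) in the arrangement."""
    filled = [False] * len(tau)
    labels = []
    for i in range(len(tau)):
        target = tau[i]
        labels.append(sum(1 for j in range(target) if not filled[j]))
        filled[target] = True
    return labels

def decode_lehmer_inverse(labels):
    """w_i fills the (p_i + 1)-th currently remaining blank."""
    blanks = list(range(len(labels)))  # positions still empty
    tau = [0] * len(labels)
    for i, p in enumerate(labels):
        tau[i] = blanks.pop(p)
    return tau
```

τ = [0, 1, 4, 2, 3] corresponds to the arrangement [0, 1, 3, 4, 2] from the example above and encodes as [0, 0, 2, 0, 0], while the identity permutation encodes as all zeros.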
Pointer-based encoding When encoding τ(i), the previous encodings generate the position for the target word, but they do not really take into account the left-to-right order in which sentences are naturally read, nor are they linguistically inspired.
In particular, informally speaking, in human linguistic processing (i.e. when a sentence is read from left to right) we could say that a discontinuity is processed when we read a word that continues a phrase other than that of the previously read word. For example, for the running example sentence (Figure 4), from an abstract standpoint we know that there is a discontinuity because τ(2) ≠ τ(1) + 1, i.e., "nie" and "habe" are not contiguous in the continuous arrangement of the tree. However, in a left-to-right processing of the sentence, there is no way to know the final desired position of "habe" (τ(2)) until we read the words "so viel gewählt", which go before it in the continuous arrangement. Thus, the requirement of the previous four encodings to assign a concrete non-default value to the p_i's associated with "habe" and "ich" is not too natural from an incremental reading standpoint, as learning p_i requires information that can only be obtained by looking to the right of w_i. This can be avoided by using a model that just processes "Noch nie habe ich" as if it were a continuous subtree (in fact, if we removed "so viel gewählt" from the sentence, the tree would be continuous). Then, upon reading "so", the model notices that it continues the phrase associated with "nie" and not with "ich", and hence inserts it after "nie" in the continuous arrangement.
This idea of incremental left-to-right processing of discontinuities is abstracted in the form of a pointer o that signals the last terminal in the current continuous arrangement of the constituent that we are currently filling. That said, to generate the labels this approach needs to consider two situations: • If w_i is to be inserted right after w_{i−1} (this situation is characterized by τ_i(i) = τ_i(i−1) + 1). This case is abstracted by a single label, p_i = NEXT, that means to insert at the position currently pointed to by o, and then update o = τ_i(i). The function τ_i(x) can informally be described as a tentative value of τ(x), corresponding to the position of w_x in the part of the continuous arrangement that involves the substring w_0 ... w_i.
• Otherwise, w_i should be inserted after some w_{i−x} with x ≥ 1, which means there is a discontinuity and that the current pointer o is no longer valid and needs to be first updated to point to τ_i(i − x). To generate the label p_i we use a tuple (j, t) that indicates that the predecessor of w_i in ω(t) is the jth preceding word in w with the PoS tag t. After that, we update the pointer to o = τ_i(i). While this encoding could work with PoS-tag-independent relative offsets, or any word property, the PoS-tag-based indexing provides linguistic grounding and is consistent with sequence labeling encodings that have obtained good results in dependency parsing (Strzyz et al., 2019).
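Decoding these labels into the terminal order of the continuous arrangement can be sketched as follows (a simplified sketch under our assumptions: labels are either NEXT or well-formed (j, t) tuples, i.e. already repaired as described in the appendix, and j counts backwards from w_i; the function name is ours):

```python
def decode_pointer(labels, tags):
    """Rebuild the arrangement (original word indexes, in arrangement
    order) from pointer-based labels. "NEXT" inserts w_i right after
    the pointer o; (j, t) inserts w_i after the j-th preceding input
    word carrying PoS tag t (counting backwards from w_i)."""
    arrangement = []
    o = -1  # position in `arrangement` after which we insert
    for i, label in enumerate(labels):
        if label == "NEXT":
            pos = o + 1
        else:
            j, t = label
            seen = 0
            for k in range(i - 1, -1, -1):  # scan preceding words
                if tags[k] == t:
                    seen += 1
                    if seen == j:
                        pos = arrangement.index(k) + 1
                        break
        arrangement.insert(pos, i)
        o = pos
    return arrangement
```

With toy PoS tags loosely modeled on the running German example, the label sequence [NEXT, NEXT, NEXT, NEXT, (1, ADV), NEXT, NEXT] reproduces the arrangement [0, 1, 4, 5, 6, 2, 3].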
Pointer-based encoding (with simplified PoS tags) A pointer-based variant where the PoS tags in (j, t) are simplified (e.g. NNS → NN). The mapping is described in Appendix A.1. Apart from reducing sparsity, the idea is that a discontinuity is not so much influenced by specific information but by the coarse morphological category.
Ill-formed permutations are corrected with postprocessing, following Appendix A.2, to ensure that the derived permutations contain all word indexes.

Limitations
The encodings are complete under the assumption of an infinite label vocabulary. In practice, training sets are finite, and this could cause the presence of unseen labels in the test set, especially for the integer-based label components:[9] the levels in common (n_i) and the label component p_i that encodes τ(i). However, as illustrated in Appendix A.3, an analysis of the corpora used in this work shows that the presence of unseen labels in the test set is virtually zero.

Sequence labeling frameworks
To test whether these encoding functions are learnable by parametrizable functions, we consider different sequence labeling architectures. We denote by ENCODER a generic, contextualized encoder that for every word w_i generates a hidden vector h_i conditioned on the sentence, i.e. ENCODER(w_i|w) = h_i. We use a hard-sharing multi-task learning architecture (Caruana, 1997; Vilares et al., 2019) that maps every h_i to four 1-layered feed-forward networks, followed by softmaxes, that predict each of the components of l_i. Each task's loss is optimized using the categorical cross-entropy L_t = −log(P(l_i|h_i)), and the final loss is computed as L = Σ_{t ∈ Tasks} L_t. We test four ENCODERs, which we briefly review but treat as black boxes. Their number of parameters and the training hyper-parameters are listed in Appendix A.4.
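The shared-encoder, per-task-head loss can be sketched in plain Python (shapes, task names, and the linear heads are illustrative; the real models use trained transducers, not toy parameters):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multitask_loss(h, heads, gold):
    """Hard-sharing multi-task sketch: the same hidden vector per word
    feeds one linear head per label component (n, x, u, p), and the
    per-task categorical cross-entropy losses are summed."""
    total = 0.0
    for task, (W, b) in heads.items():  # one head per label component
        for h_i, gold_id in zip(h, gold[task]):
            logits = [sum(hv * wv for hv, wv in zip(h_i, row)) + bias
                      for row, bias in zip(W, b)]  # W: n_labels x dim
            probs = softmax(logits)
            total += -math.log(probs[gold_id])
    return total
```

The summed loss means the four label components are predicted jointly from one shared representation, which is the hard-sharing setup described above.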
Transducers without pretraining We try (i) a 2-stacked BiLSTM (Hochreiter and Schmidhuber, 1997; Yang and Zhang, 2018), where the generation of h_i is conditioned on the left and right context. (ii) We also explore a Transformer encoder (Vaswani et al., 2017) with 6 layers and 8 heads. The motivation is that we believe that the multi-head attention mechanism, in which a word attends to every other word in the sentence, together with positional embeddings, could be beneficial to detect discontinuities. In practice, we found training these Transformer encoders harder than training BiLSTMs, and obtaining a competitive performance required larger models, smaller learning rates, and more epochs (see also Appendix A.4).
The input to these two transducers is a sequence of vectors composed of: a pre-trained word embedding (Ling et al., 2015) further fine-tuned during training, a PoS tag embedding, and a second word embedding trained with a character LSTM. Additionally, the Transformer uses positional embeddings to be aware of the order of the sentence.

[9] This limitation also applies to continuous parsing as sequence labeling (Gómez-Rodríguez and Vilares, 2018), and could potentially happen with any label component, e.g. predicting the non-terminal symbol. However, it is very unlikely that a non-terminal symbol has not been observed in the training set. Also, chart- and transition-based parsers would suffer from this same limitation.
Transducers with pretraining Previous work on sequence labeling parsing (Gómez-Rodríguez and Vilares, 2018; Strzyz et al., 2019) has shown that although effective, the models lag a bit behind state-of-the-art accuracy. This setup, inspired by Vilares et al. (2020), aims to evaluate whether general-purpose NLP architectures can achieve strong results when parsing free word order languages. In particular, we fine-tune (iii) pre-trained BERT (Devlin et al., 2019) and (iv) pre-trained DistilBERT (Sanh et al., 2019). BERT and DistilBERT map input words to sub-word pieces (Wu et al., 2016). We align each word with its first sub-word, and use their embedding as the only input for these models.

Experiments
Setup For English, we use the discontinuous Penn Treebank (DPTB) by Evang and Kallmeyer (2011). For German, we use TIGER and NEGRA (Brants et al., 2002; Skut et al., 1997). We use the same splits as previous work, which in turn follow the Dubey and Keller (2003) splits for the NEGRA treebank, the Seddah et al. (2013) splits for TIGER, and the standard splits for the (D)PTB (Sections 2 to 21 for training, 22 for development, and 23 for testing). See also Appendix A.5 for more detailed statistics. We consider gold and predicted PoS tags. For the latter, the parsers are trained on predicted PoS tags, which are generated by a 2-stacked BiLSTM with the same hyper-parameters used to train the parsers. The PoS tagging accuracy (%) on the dev/test sets is: DPTB 97.5/97.7, TIGER 98.7/97.8, and NEGRA 98.6/98.1. BERT and DistilBERT do not use PoS tags as input, but when used to predict the pointer-based encodings, they are required to decode the labels into a parenthesized tree, causing variations in the performance. Table 1 shows the number of labels per treebank.
Metrics We report the labeled bracketing F1-score for all and for discontinuous constituents, using discodop (van Cranenburgh et al., 2016) and the proper.prm parameter file. Model selection is based on the overall bracketing F1-score.

On the dev sets, using simplified PoS tags does not, however, lead to clear improvements, suggesting that the models can learn the sparser original PoS tag set. For the rest of the encodings we also observe interesting tendencies. For instance, when running experiments with stacked BiLSTMs, the relative encoding performs better than the absolute one, which was somewhat expected, as the encoding is less sparse. However, the tendency is the opposite for the Transformer encoders (including BERT and DistilBERT), especially for discontinuous constituents. We hypothesize this is due to the capacity of Transformers to attend to every other word through multi-head attention, which might give them an advantage in encoding absolute positions over BiLSTMs, where the whole left and right context is represented by a single vector. With respect to the Lehmer and Lehmer-of-the-inverse-permutation encodings, the latter performs better overall, confirming the bigger difficulties for the tested sequence labelers to learn the Lehmer code, which in some cases performs even close to the naive absolute-position encoding (e.g. for TIGER using the vanilla Transformer encoder and BERT). As introduced in §4, we hypothesize this is caused by the non-straightforward mapping between words and labels (in the Lehmer code the label generated for a word does not necessarily contain information about the position of that word in the continuous arrangement).

Results
In Table 3 we compare a selection of our models against previous work using both gold and predicted PoS tags. In particular, we include: (i) models using the pointer-based encoding, since they obtained the overall best performance on the dev sets, and (ii) a representative subset of encodings (the absolute-position one and the Lehmer code of the inverse permutation) trained with the best-performing transducer. Additionally, for the case of the (English) DPTB, we also include experiments using a bert-large model, to shed more light on whether the size of the networks plays a role when it comes to detecting discontinuities. We also report speeds on CPU and GPU. The experiments show that the encodings are learnable, but that the model's power makes a difference. For instance, in the predicted setup, BiLSTMs and vanilla Transformers perform in line with pre-deep-learning models (Maier, 2015; Fernández-González and Martins, 2015; Coavoux and Crabbé, 2017); DistilBERT already achieves a robust performance, close to more recent models; and BERT transducers suffice to achieve results close to some of the strongest approaches, e.g. Fernández-González and Gómez-Rodríguez (2020a). Yet, the results lag behind the state of the art. With respect to the best-performing architectures, the main issue is that they are the bottleneck of the pipeline. Thus, the computation of the contextualized word vectors under current approaches greatly decreases the importance, when it comes to speed, of the chosen parsing paradigm used to generate the output trees (e.g. chart-based versus sequence labeling). Finally, Table 4 details the discontinuous performance of our best-performing models.
Discussion on other applications It is worth noting that while we focused on parsing as sequence labeling, encoding syntactic trees as labels is useful to straightforwardly feed syntactic information to downstream models, even if the trees themselves come from a non-sequence-labeling parser. For example, Wang et al. (2019) use the sequence labeling encoding of Gómez-Rodríguez and Vilares (2018) to provide syntactic information to a semantic role labeling model. Apart from providing fast and accurate parsers, our encodings can be used to do the same with discontinuous syntax.

Conclusion
We reduced discontinuous parsing to sequence labeling. The key contribution consists in predicting a continuous tree with a rearrangement of the leaf nodes to shape discontinuities, and defining various ways to encode such a rearrangement as a sequence of labels associated with each word, taking advantage of the fact that in practice they are nearly ordered permutations. We tested whether those encodings are learnable by neural models and saw that the choice of permutation encoding is not trivial, and that there are interactions between encodings and models (i.e., a given architecture may be better at learning a given encoding than another). Overall, the models achieve a good speed/accuracy trade-off without the need for any parsing algorithm or auxiliary structures, while being easily parallelizable.

A Appendices
A.1 Simplified part-of-speech tags for the pointer-based encoding

[Table: the mapping from treebank PoS tags to their simplified versions.]

A.2 Post-processing of ill-formed permutations

We describe below the post-processing of the encodings to ensure that the generated sequences can be later decoded to a well-formed tree. Before post-processing the predicted permutation, we make sure that one, and only one, label (n_i, x_i, u_i, p_i) can be identified as the last word in the continuous arrangement. This is required because the component n_i encodes unique information for the last word (an empty dummy value, as n_i always encodes information between a word and the next one, which does not exist for the last token), which can conflict with some of the predicted p_i's that might put a different word into the last position.
That said, we rely on the value of n_i to identify which word should be located as the last one.

Absolute-position and relative-position encodings Given the sequence p that encodes the permutation π(w) of the words of w in the continuous arrangement ω(t), we: (i) fill the indexes for which the predicted labels indicate that the token should remain in the same position, i.e. p_i = INV, and (ii) for the remaining p_i's we check whether the predicted index has not yet been filled, and otherwise assign it to the closest available index (computed as the minimum absolute difference).
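A sketch of this repair for the absolute-position case (relative labels would first be turned into absolute targets by adding i; the function name and tie-breaking details are ours):

```python
def fix_absolute(pred, n):
    """Repair a possibly ill-formed absolute-position prediction:
    INV tokens keep their position; other tokens take their predicted
    index if it is free and in range, and are otherwise moved to the
    closest still-available index (minimum absolute difference)."""
    tau = [None] * n
    taken = set()
    for i, p in enumerate(pred):          # first, fix the INV tokens
        if p == "INV":
            tau[i] = i
            taken.add(i)
    for i, p in enumerate(pred):
        if p == "INV":
            continue
        if p not in taken and 0 <= p < n:
            tau[i] = p
        else:
            free = [j for j in range(n) if j not in taken]
            tau[i] = min(free, key=lambda j: abs(j - p))
        taken.add(tau[i])
    return tau
```

For example, if two tokens both predict target 4, the second is reassigned to the closest free slot, and the result is always a valid permutation.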
Lehmer encoding Given p and the list of available word indexes idxs (initially all the words), we process the elements in p in a left-to-right fashion: (i) if the corresponding index encoded at p i is in idxs, then we select the index and remove it from idxs, (ii) otherwise, we select the last element in idxs and, again, remove it.
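This fallback decoding can be sketched as follows (an illustrative sketch of our reading of the procedure; the function name is ours):

```python
def decode_lehmer_robust(p):
    """Left-to-right decoding of a possibly ill-formed Lehmer code:
    p_i selects the (p_i + 1)-th smallest index still available in
    idxs; when p_i points past the available indexes, fall back to
    the last remaining one, as described above."""
    idxs = list(range(len(p)))  # available word indexes
    out = []
    for pi in p:
        pos = pi if pi < len(idxs) else len(idxs) - 1
        out.append(idxs.pop(pos))
    return out
```

A well-formed code such as [0, 0, 2, 0, 0] decodes normally, while an out-of-range label silently falls back to the last available index, so the output is always a valid permutation.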

Lehmer of the inverse permutation encoding The post-processing is similar to that of the Lehmer code, but considering the available blanks instead of a list of word indexes.
Pointer-based encodings Given the encoded permutation p, we process the elements left to right and: (i) if p_i = NEXT, then we apply no post-processing, and we consider that the word will be inserted after the current pointer o at the moment of decoding, which is always valid. (ii) Otherwise, we are processing an element p_i that encodes a pointer (j, t), and we try to map it to τ(i). If such a mapping is not possible, this is because j is greater than the number of previously processed words that have the PoS tag t. If so, then we post-process p_i to (k, t), where k is the first processed word with the PoS tag t, or to p_i = NEXT if there is no previous word labeled with the PoS tag t.

A.4 Training hyper-parameters and size of the trained models

Table 8 shows the hyper-parameters used to train the BiLSTMs, both for the gold and predicted setups. We use pre-trained embeddings for English and German (Ling et al., 2015).[14] The embeddings for English have 100 dimensions, while the German ones only have 60. For the BiLSTMs, we did not do any hyper-parameter engineering and just used the hyper-parameters reported by Gómez-Rodríguez and Vilares (2018). Table 9 shows the configuration used to train the vanilla Transformer encoders. As explained in the paper, we found that Transformers were more unstable during training than BiLSTMs. To overcome such instability, we performed a small manual hyper-parameter search, which translated into training for more epochs, with a lower learning rate, and larger dropout.

[14] We consider NEGRA as the reference corpus since it was the treebank that showed the largest percentage of missing p_i elements.
Finally, in Table 11 we list the number of parameters for each of the transducers trained on the pointer-based encoding. For the rest of the encodings, the models have a similar number of parameters, as the only change in the architecture is the small part involving the feed-forward output layer that predicts the label component p i .
Table 9: Main hyper-parameters for training the vanilla Transformer encoder, both for the gold and predicted setups. For the pointer-based encoding, a learning rate of 0.003 was necessary to converge. A smaller character embedding size was used for the TIGER and NEGRA models, so that the size of the input to the model is a multiple of the number of attention heads; as the Ling et al. (2015) embeddings for German only have 60 dimensions, this tweak was necessary for those treebanks.

Table 10: Main hyper-parameters for fine-tuning the BERT-based models: loss = cross-entropy, learning rate = 1e−5, training batch size = 6, training epochs = 45, test batch size = 8.

For both the BiLSTMs and the vanilla Transformers, the TIGER model is larger than the NEGRA one. This is because for these transducers we only store and use the word embeddings from Ling et al. (2015) that were seen in the training and dev sets, and the TIGER treebank is larger and contains more unique words. Also, we see that for the BiLSTMs the TIGER model is slightly larger than the DPTB one, while for the vanilla Transformer the opposite happens. This is due to the smaller character embedding size of the German Transformers, which is required so that the total size of the input vector is divisible by 8, the number of attention heads (the root of the disparity in character embedding sizes is that the pre-trained English and German embeddings have a different number of dimensions). On the contrary, for the BERT-based models we use the same pre-trained model for TIGER and NEGRA, which causes these models to have an almost identical number of parameters.