Bracketing Encodings for 2-Planar Dependency Parsing

We present a bracketing-based encoding that can be used to represent any 2-planar dependency tree over a sentence of length n as a sequence of n labels, hence providing almost total coverage of crossing arcs in sequence labeling parsing. First, we show that existing bracketing encodings for parsing as labeling can only handle a very mild extension of projective trees. Second, we overcome this limitation by taking into account the well-known property of 2-planarity, which is present in the vast majority of dependency syntactic structures in treebanks: the arcs of a dependency tree can be split into two planes such that arcs in a given plane do not cross. We take advantage of this property to design a method that balances the brackets and that encodes the arcs belonging to each of those planes, allowing for almost unrestricted non-projectivity (∼99.9% coverage) in sequence labeling parsing. The experiments show that our linearizations improve on the accuracy of the original bracketing encoding in highly non-projective treebanks (by 0.4 LAS on average), while achieving similar speed. They are also especially suitable when PoS tags are not used as input to the models.


Introduction
In the last few years, approaches that cast syntactic parsing as some form of sequence prediction have gained traction for both dependency and constituency parsing. In sequence-to-sequence (seq2seq) parsing (Vinyals et al., 2015; Li et al., 2018), parse trees are represented as arbitrary-length sequences, where the attention mechanism can be seen as an abstraction of the stack and the buffer in transition-based systems that decides which words are relevant to make a decision at a given time step. In sequence labeling parsing (Gómez-Rodríguez and Vilares, 2018; Strzyz et al., 2019b), the tree for a sentence of length n is represented as a sequence of n labels, one per word, so the parsing process is word-synchronous (Kitaev and Klein, 2019) and can be addressed by frameworks traditionally used for other natural language processing tasks, such as part-of-speech tagging or named-entity recognition. Current sequence labeling parsers combine competitive accuracy with high computational efficiency, while providing extra simplicity: they use off-the-shelf sequence labeling software without the need for ad-hoc parsing algorithms.
In the realm of dependency parsing, pioneering work dates back to Spoustová and Spousta (2010), who used a relative PoS-tag based encoding to represent trees as label sequences, but the resulting accuracy was not practical even for the standards of the time, probably due to the inability of pre-deep-learning architectures to successfully learn the representation. Using more modern architectures with the ability to contextualize words based on the sentence, and various tree encodings, Strzyz et al. (2019b) were the first to show that competitive accuracy could be reached. Subsequently, this accuracy has been improved further by techniques like the use of multi-task learning to parse dependencies and constituents together (Strzyz et al., 2019a) and of contextualized embeddings (Vilares et al., 2020).
Figure 1: Bracketing-based encodings with their plane assignment strategies for a non-projective sentence w_0 w_1 ... w_6. (a) Projective encoding restricted to a single plane: infeasible to reconstruct a non-projective sentence. (b) Non-projective 2-planar encoding with second-plane-averse greedy plane assignment: the arc w_3 → w_6 is not assigned a plane because it would cross arcs belonging to both planes, which is forbidden by the 2-planar constraint. (c) Non-projective 2-planar encoding with plane assignment based on restriction propagation on the crossings graph. The red, dotted lines refer to the arcs represented in the second plane, denoted by * in the encoding labels.

While parsing as sequence labeling does not need specific parsing algorithms or data structures, as in graph-based or transition-based parsing, the responsibility of providing suitable parsing representations
with reasonable coverage and learnability falls instead on the encoding used to represent trees as sequences of labels. Strzyz et al. (2019b) used four different encodings that obtained substantially different parsing accuracies in the experiments, with two encodings achieving competitive accuracy: the relative PoS tag (rel-PoS) encoding of Spoustová and Spousta (2010), and a new encoding based on balanced brackets, inspired by Yli-Jyrä and Gómez-Rodríguez (2017). While the rel-PoS encoding achieves good accuracy and has full coverage of non-projective dependency trees, it requires PoS tags to encode the dependency arcs. This can be seen as a weakness, not just because computing and feeding PoS tags increases latency, but also because the traditional assumption that PoS tagging is needed for parsing is being increasingly called into question (de Lhoneux et al., 2017; Smith et al., 2018; Kitaev and Klein, 2018; Anderson and Gómez-Rodríguez, 2020). Low-frequency PoS tags can cause sparsity in the encoding, and low-quality PoS tags can be a source of errors in low-resource languages. For this reason, Lacroix (2019) proposed two alternative encodings with the same relative indexing philosophy but without PoS tags; however, these encodings require composing two sequence labeling processes instead of one. On the other hand, the bracketing encoding inspired by Yli-Jyrä and Gómez-Rodríguez (2017) represents trees independently of PoS tags or any other previous tagging step, but it has the limitation of being restricted to a very mild extension of projective trees.
Contribution. In this paper, we extend the idea of the bracketing-based encoding to non-projective parsing by defining a variant that can encode all 2-planar dependency trees (Yli-Jyrä, 2003). 2-planar dependency trees have been shown to cover the vast majority of non-projective trees in attested sentences (Gómez-Rodríguez, 2016) and have been used in transition-based parsing (Gómez-Rodríguez and Nivre, 2013; Fernández-González and Gómez-Rodríguez, 2018). We show that our encoding provides better parsing accuracy than the original bracketing-based encoding on highly non-projective UD treebanks, and better accuracy than the rel-PoS encoding when PoS tags are not fed as input to the models. The source code is available at https://github.com/mstrise/dep2label.

Preliminaries
Given a sentence w 1 . . . w n , we associate the words with nodes 0, 1, . . . , n, where 0 is a dummy root node. Then, a dependency graph is an edge-labeled graph (V, E) with V = {0, 1, . . . , n} and E a set of edges of the form (h, d, l) where h ∈ V is the head, d ∈ V \ {0} is the dependent, and l is the dependency label. The goal of a dependency parser is to find a dependency graph that is a tree (i.e. without cycles, and with no dependent having more than one head) rooted at node 0.

Bracketing encoding
Dependency arcs are encoded through a sequence of bracket elements from the set B = {<, \, /, >}. A matched pair of brackets (<, \) in the labels of words w_i and w_j represents a left arc from word w_j to w_{i−1}. A matched pair of brackets (/, >) in the labels of words w_i and w_j represents a right arc from word w_{i−1} to w_j. A token can have one incoming arc and several outgoing arcs, resulting in labels composed of several such brackets, following the regular expression (<)?((\)*|(/)*)(>)?. As shown in Figure 1a, the token w_2 is assigned the label ///>, which can be interpreted as follows: the previous token w_1 has three outgoing arcs to the right, and one of them matches the incoming arc of w_2 from the left (the pair />), meaning that w_1 is the head of w_2. The remaining two dependents are given by the matching >'s in the labels of the following words.
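These rules can be made concrete with a small sketch. The following is a minimal Python illustration, not the dep2label implementation: the `heads` dictionary and the `encode` helper are our own illustrative names, and the left-arc brackets are written with (escaped) backslashes.

```python
def encode(heads):
    """Sketch of the single-plane bracketing encoding.

    `heads` maps each word position 1..n to its head position
    (0 is the dummy root).  The label of word i contains:
      '<'   if word i-1 has its head somewhere to its right,
      '\\'  once per left dependent of word i,
      '/'   once per right dependent of word i-1,
      '>'   if word i has its head somewhere to its left.
    Bracket order in a label follows the regex (<)?((\)*|(/)*)(>)?.
    """
    n = len(heads)
    labels = []
    for i in range(1, n + 1):
        lab = ""
        if i >= 2 and heads[i - 1] > i - 1:
            lab += "<"
        lab += "\\" * sum(1 for d in heads if d < i and heads[d] == i)
        lab += "/" * sum(1 for d in heads if d > i - 1 and heads[d] == i - 1)
        if heads[i] < i:
            lab += ">"
        labels.append(lab)
    return labels

# A tree 0 -> 1, 1 -> 2, 1 -> 3 yields the labels ['/>', '//>', '>']:
# the two unmatched /'s on word 2 are closed by the >'s of later words.
print(encode({1: 0, 2: 1, 3: 1}))
```

Note how a left arc shows up as a (<, \) pair: for the two-word tree with heads {1: 2, 2: 0}, the labels are `/` and `<\>`.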
Since each opening bracket is always matched to the closest same-direction closing bracket, this encoding cannot handle crossing arcs that point in the same direction: attempting to encode such arcs results in decoding non-crossing arcs instead. However, the encoding can handle crossing arcs in opposite directions, as long as left and right brackets are balanced independently (e.g., by using separate stacks for each kind of bracket). Strzyz et al. (2019b) erroneously describe the encoding as only supporting projective trees; in fact, the implementation in that paper supports this mild extension of projectivity where crossing arcs in opposite directions are allowed.

2-Planarity
A dependency graph (V, E) is said to be k-planar, for k ≥ 1, if there is a partition of the edges into sets E 1 , . . . , E k , called planes, in such a way that edges that are in the same plane do not cross. For k = 1, this corresponds to the concept of a noncrossing dependency graph (Kuhlmann and Jonsson, 2015) or planar linear arrangement (Chao and Sha, 1992) (not to be confused with a planar graph). Under the assumption of trees rooted at the dummy root node 0, 1-planar trees are equivalent to the well-known projective trees. For k ≥ 2, this means that the dependency graph (together with the linear order of the words) is a k-page book embedding of a graph (see (Pitler et al., 2013)). Intuitively, a k-planar graph is one where each arc can be assigned one out of k colors in such a way that arcs with the same color do not cross (see Figure 1).
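The crossing condition that underlies k-planarity is easy to state over arc endpoints, and it is the test used by the plane assignment strategies later on. A minimal sketch (the function name `crosses` is our own):

```python
def crosses(a, b):
    """Return True if arcs a and b cross.

    Arcs are (head, dependent) pairs over word positions; direction is
    irrelevant for crossing, so each arc is reduced to its span.
    Two arcs cross iff exactly one endpoint of one arc lies strictly
    inside the span of the other.
    """
    (l1, r1), (l2, r2) = sorted(a), sorted(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1

# 1 -> 3 and 2 -> 4 cross; 1 -> 5 and 2 -> 4 are nested, so they do not.
print(crosses((1, 3), (2, 4)), crosses((1, 5), (2, 4)))
```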

2-Planar bracketing encodings
In order to support the extended non-projective coverage provided by 2-planarity in the bracketing system, we balance a different set of brackets for each plane. We introduce a set of "starred" bracket elements denoting arcs belonging to the second plane, B* = {<*, \*, /*, >*}. A token w_i can be assigned elements from both B and B*. Brackets only match within the same plane, i.e., (<, \) and (/, >) are matching pairs of brackets that encode arcs in the first plane, and (<*, \*) and (/*, >*) are matching pairs that encode arcs in the second plane. The decoding process operates on separate stacks for the first-plane brackets and the second-plane brackets.

Plane assignment strategies
According to the definition in Section 2.2, a tree is 2-planar if its edges can be partitioned into two planes, E_1 and E_2, such that edges in the same plane do not cross. However, this partition is often not unique (for example, for trees that are also 1-planar, any partition satisfies the condition). Thus, for the encoding in Section 3 to provide a single sequence of labels for each gold tree during training, we need to fix a plane assignment strategy, i.e., a canonical way of assigning each arc to a plane so as to obtain such a partition. While the number of possible partitions is exponential in the size of the tree, desirable partitions should be easily learnable, i.e., follow predictable patterns. Given that crossing dependencies are scarce in treebanks (Ferrer-i-Cancho et al., 2018), it makes sense to look for partitions that do not use the extra plane when it is not needed, so that parsing sentences or fragments without crossing arcs does not become more difficult or require more output labels than with the basic bracketing encoding (as they will only use one plane and thus one set of brackets). Following this general principle, we define the following plane assignment strategies:

Second-Plane-Averse Greedy Plane Assignment
Arcs in the gold tree are traversed in left-to-right order of their right endpoint, with shorter arcs first when they share a right endpoint (this is the order in which arcs will be decoded using a stack, see Section 4). For each arc a, we assign the first plane if possible (i.e., if no arc crossing a has already been assigned the first plane). Otherwise, we assign the second plane if possible, or no plane if a crosses arcs already assigned to both planes. The process is formally described in Algorithm 1.

Algorithm 1: 2p-greedy
Input: A set of arcs T, and input length n
Result: Two sets (planes) of arcs P1, P2
P1 ← ∅; P2 ← ∅;
for each nextArc ∈ T, in left-to-right order of right endpoint (shorter arcs first on ties) do
    if nextArc crosses no arc in P1 then
        P1 ← P1 ∪ {nextArc};
    else if nextArc crosses no arc in P2 then
        P2 ← P2 ∪ {nextArc};
    else
        do nothing (failed to assign nextArc to a plane);
    end
end
return P1, P2;
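The greedy strategy fits in a few lines of Python. This is an illustrative sketch only (`crosses` and `greedy_planes` are our own names); arcs are assumed to be (head, dependent) position pairs.

```python
def crosses(a, b):
    """Two arcs cross iff exactly one endpoint of one lies strictly
    inside the span of the other (direction is irrelevant)."""
    (l1, r1), (l2, r2) = sorted(a), sorted(b)
    return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1

def greedy_planes(arcs):
    """Second-plane-averse greedy assignment (sketch of Algorithm 1).

    Arcs are traversed left to right by right endpoint, shorter first
    on ties; each arc goes to plane 1 if it crosses no arc already
    there, else to plane 2, else it is left unassigned.
    """
    p1, p2 = [], []
    for arc in sorted(arcs, key=lambda a: (max(a), max(a) - min(a))):
        if not any(crosses(arc, other) for other in p1):
            p1.append(arc)
        elif not any(crosses(arc, other) for other in p2):
            p2.append(arc)
        # else: arc cannot be assigned a plane (coverage loss)
    return p1, p2

# On the arcs of Figure 1b, (1,3) and (1,5) go to plane 1, (2,4) to
# plane 2, and (3,6) is left out, as described in the text.
print(greedy_planes([(1, 3), (2, 4), (1, 5), (3, 6)]))
```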

Second-Plane-Averse Plane Assignment based on Restriction Propagation on the Crossings Graph
While the greedy approach is very simple, it has the disadvantage that it may make suboptimal decisions leading to reduced coverage: assigning an arc to a given plane may seem like a good local decision, but depending on how arcs cross each other in the whole tree, it may lead to a subsequent situation where an arc cannot be assigned a plane even if the tree is actually 2-planar.
An example of this can be seen in Figure 1b: the greedy strategy will assign the arcs w 1 → w 3 and w 1 → w 5 to the first plane, which in a local context is the simplest thing to do. However, the fact that w 1 → w 3 crosses w 2 → w 4 (which is thus assigned to the second plane) and w 3 → w 6 crosses both w 1 → w 5 (first plane) and w 2 → w 4 (second plane) then means that it is impossible to assign a plane to the arc w 3 → w 6 . This could have been prevented by assigning arc w 1 → w 5 to the second plane, but a greedy algorithm has no way to anticipate this. To deal with this problem, we propagate restrictions by traversing the crossings graph, i.e., a graph where its nodes represent the edges in the gold tree and two nodes are linked if the corresponding edges cross (Gómez-Rodríguez and Nivre, 2013). Whenever we assign a given arc to plane 1, then we forbid plane 1 for its neighbors in the crossings graph (i.e. the arcs that cross it), we forbid plane 2 for the neighbors of its neighbors, plane 1 for the neighbors of those, and so on. For arcs assigned to plane 2, we proceed symmetrically.
Thus, the traversal order of arcs is the same as in the previous strategy, but for each new arc a, we look at the restrictions and assign it to the first plane if allowed, otherwise to the second plane if allowed, and finally to no plane if neither is allowed. In this case, the latter will only happen for non-2-planar trees: it is easy to show that situations where both planes are forbidden for the same arc can only arise if the crossings graph has a cycle of odd length, which is equivalent to the tree not being 2-planar (see Gómez-Rodríguez and Nivre (2013)). Thus, this strategy guarantees full coverage of 2-planar structures. The pseudocode of the strategy can be seen in Algorithm 2, where P̄1 and P̄2 represent the sets of arcs forbidden from planes 1 and 2, respectively.

Algorithm 2: 2p-prop
Input: A set of arcs T, and input length n
Result: Two sets (planes) of arcs P1, P2
function Propagate(Edge sets T, P̄1, P̄2, Edge e, Plane i):
    P̄i ← P̄i ∪ {e};
    for each e' ∈ T that crosses e do
        if e' ∉ P̄(3−i) then
            Propagate(T, P̄1, P̄2, e', 3−i);
        end
    end
P1 ← ∅; P2 ← ∅; P̄1 ← ∅; P̄2 ← ∅;
for each nextArc ∈ T, in left-to-right order of right endpoint (shorter arcs first on ties) do
    if nextArc ∉ P̄1 then
        P1 ← P1 ∪ {nextArc};
        Propagate(T, P̄1, P̄2, nextArc, 2);
    else if nextArc ∉ P̄2 then
        P2 ← P2 ∪ {nextArc};
        Propagate(T, P̄1, P̄2, nextArc, 1);
    else
        do nothing (failed to assign nextArc to a plane);
    end
end
return P1, P2;

Switch-averse plane assignment strategies
Another possibility is to implement variants of the previous two strategies that are switch-averse, rather than second-plane-averse. These variants work like the previous strategies, except that when both planes can be assigned to the current arc, we assign the last plane used, instead of always preferring the first plane.
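The restriction-propagation strategy of Algorithm 2 can be sketched as follows (illustrative only; `propagation_planes` is our own name, and arcs are again (head, dependent) pairs). Assigning an arc to one plane forbids it from the other plane and then alternates the ban along the crossings graph.

```python
def propagation_planes(arcs):
    """Second-plane-averse assignment with restriction propagation
    (sketch of Algorithm 2).  forbidden[i] holds the arcs banned from
    plane i; Propagate(e, i) bans e from plane i, then bans each arc
    crossing e from the opposite plane, and so on recursively.
    """
    def crosses(a, b):
        (l1, r1), (l2, r2) = sorted(a), sorted(b)
        return l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1

    forbidden = {1: set(), 2: set()}

    def propagate(arc, plane):
        forbidden[plane].add(arc)
        other = 3 - plane
        for e in arcs:
            if crosses(arc, e) and e not in forbidden[other]:
                propagate(e, other)

    p1, p2 = [], []
    for arc in sorted(arcs, key=lambda a: (max(a), max(a) - min(a))):
        if arc not in forbidden[1]:
            p1.append(arc)
            propagate(arc, 2)   # now banned from plane 2, neighbors from 1, ...
        elif arc not in forbidden[2]:
            p2.append(arc)
            propagate(arc, 1)
        # else: both planes forbidden -- only happens for non-2-planar trees
    return p1, p2
```

On the arcs of Figure 1, [(1, 3), (2, 4), (1, 5), (3, 6)], this assigns (1, 3) and (3, 6) to the first plane and (2, 4) and (1, 5) to the second, so every arc receives a plane, unlike with the greedy strategy.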
The implementation of the 2-planar transition-based parser by Gómez-Rodríguez and Nivre (2010) used a switch-averse restriction-propagation strategy. This is a reasonable choice because, in their transition-based parser, it minimizes the number of transitions used: the algorithm's state holds the "current" plane being used, and switching to the other plane costs one transition. In our sequence labeling context, where this is no longer true (the model always makes n predictions for a sequence of length n), we ran some initial experiments with switch-averse strategies, but found that they performed consistently (albeit slightly) worse than second-plane-averse strategies, so we discarded them for our experiments.

Bracketing decoding
When a sentence is represented with the bracketing encoding in a single plane, a valid left arc is associated with a pair of matching brackets < and \, while a right arc is associated with a pair / and >. For each sentence we create two initially empty stacks, σL and σR, in order to keep the elements separate with respect to arc direction. The output labels generated by the system are read from left to right and decomposed into their brackets; brackets corresponding to left arcs are processed in σL and those that encode right arcs in σR. To handle a second plane with brackets (<*, \*) and (/*, >*), we simply use two additional stacks: σ*L and σ*R. More specifically, decoding proceeds by reading the label of each token and pushing each opening bracket element onto the corresponding stack, together with the token's index. For instance, when a new label contains <, the bracket element is pushed onto σL and can only be popped once a later label contains the matching closing element \, which creates a left arc by recovering the index stored with the < bracket. Right arcs are processed analogously, in their own stack.
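The stack-based decoding just described can be sketched in Python as follows (a minimal version with our own names; it assumes well-formed label sequences and leaves postprocessing aside):

```python
import re

def decode(labels):
    """Decode bracket labels into (head, dependent) arcs using four
    stacks: one per direction (left/right) and plane (plain/starred).
    Sketch only; assumes the labels are well formed.
    """
    stacks = {("L", ""): [], ("R", ""): [], ("L", "*"): [], ("R", "*"): []}
    arcs = []
    for i, label in enumerate(labels, start=1):
        # split the label into bracket elements, keeping any '*' suffix
        for elem in re.findall(r"[<>/\\]\*?", label):
            plane = "*" if elem.endswith("*") else ""
            b = elem[0]
            if b == "<":            # word i-1 awaits a head on its right
                stacks[("L", plane)].append(i - 1)
            elif b == "\\":         # word i heads the closest pending '<'
                arcs.append((i, stacks[("L", plane)].pop()))
            elif b == "/":          # word i-1 has a pending right dependent
                stacks[("R", plane)].append(i - 1)
            else:                   # '>': word i's head is the closest pending '/'
                arcs.append((stacks[("R", plane)].pop(), i))
    return arcs

# The (<, \) pair within one label yields the left arc 2 -> 1:
print(decode(["/", "<\\>"]))  # → [(2, 1), (0, 2)]
```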
Postprocessing Decoded labels are not guaranteed to yield a well-formed tree. For that reason, we apply some common heuristics, adapted to all encodings, to postprocess them. If some brackets in any of the stacks are left unbalanced, the outermost bracket elements are discarded. Tokens that are not assigned any head are recovered by attaching them to the word that is attached to the dummy root (i.e., the syntactic head of the sentence). Cycles are resolved by removing the leftmost arc in the cycle.
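The head-recovery and cycle-breaking heuristics can be sketched like this. This is our own illustrative version, and the dep2label implementation may differ in details (e.g., exactly which arc of a cycle is removed and where the freed token is re-attached); here we drop the arc whose dependent is the leftmost node of the cycle and re-attach that node to the root's child.

```python
def postprocess(arcs, n):
    """Heuristic repair of a decoded arc set into a head map over
    words 1..n (0 is the dummy root).  Sketch only.
    """
    heads = {}
    for h, d in arcs:                 # keep at most one head per token
        heads.setdefault(d, h)
    # find the word attached to the dummy root (create one if missing)
    root_child = next((d for d, h in heads.items() if h == 0), None)
    if root_child is None:
        root_child = 1
        heads[1] = 0
    # attach headless tokens to the root's child
    for d in range(1, n + 1):
        if d not in heads:
            heads[d] = root_child
    # break cycles: cut the arc entering the leftmost node of the cycle
    # (root_child cannot be in a cycle, since its head is the root)
    for start in range(1, n + 1):
        seen, d = [], start
        while d != 0 and d not in seen:
            seen.append(d)
            d = heads[d]
        if d != 0:                    # found a cycle
            cycle = seen[seen.index(d):]
            heads[min(cycle)] = root_child
    return heads

# The cycle 3 <-> 4 is broken by re-attaching word 3 to the root's child.
print(postprocess([(0, 2), (2, 1), (3, 4), (4, 3)], 4))
```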

Experiments
Data We extracted the most non-projective treebanks from UD v2.4 (Nivre et al., 2019) based on the percentage of non-projective sentences, discarding some of them due to the lack of a pre-trained UDPipe model or of a development set. The selected treebanks were: Ancient Greek-Perseus, Basque-BDT, Hungarian-Szeged, Portuguese-Bosque, Urdu-UDTB, Afrikaans-AfriBooms, Korean-Kaist, Danish-DDT, Gothic-PROIEL, and Lithuanian-HSE. In addition, two fully projective treebanks (Galician-CTG and Japanese-GSD) were included as control treebanks. Table 1 shows the selected treebanks with their percentages of non-projective sentences and dependencies. For all of them, we ran UDPipe models (Straka and Straková, 2017) to obtain predicted segmentation and tokenization. We also computed predicted PoS tags, but they were not used to train any of the models (nor were gold PoS tags), only to decode the labels of the rel-PoS encoding (Strzyz et al., 2019b). In addition, we included dummy beginning- and end-of-sentence tokens (BOS, EOS), as in previous work on parsing as labeling.
Model For our experiments we use bidirectional long short-term memory networks (Hochreiter and Schmidhuber, 1997;Schuster and Paliwal, 1997) as implemented in the NCRF++ framework (Yang and Zhang, 2018). 1 Each input word w i is represented as a vector which comes from a concatenation of (i) an external pre-trained word embedding, which is further fine-tuned during training, and (ii) a second word embedding which results from the output of a char-LSTM, which is trained end-to-end together with the rest of the network.
In this context, let LSTM_θ(x) be a black-box long short-term memory network that processes the sequence of vectors x = [x_1, ..., x_|x|]. Then the output for x_i is a hidden vector h_i that represents the word based on its left and right sentence context:

h_i = LSTM^ℓ_θ(x_{1:i}) ∘ LSTM^r_θ(x_{|x|:i})

where ∘ denotes vector concatenation and LSTM^ℓ and LSTM^r process the sentence left to right and right to left, respectively. More particularly, we stack 2 BiLSTMs before computing the output layer. For this, we consider a simple hard-sharing multi-task learning architecture, where each h_i is sent to three separate layers in order to generate the classifications through regular softmaxes: two bracket labels, one per plane, and a third label for the word's dependency relation. Afterwards, label decoding is followed by a postprocessing step with some heuristics to ensure a valid dependency tree (as described in §4).

Analysis and results
Next, we compare the performance of the original bracketing encoding (1p-brackets), the 2-planar encoding with greedy plane assignment (2p-greedy), and the 2-planar encoding with restriction propagation (2p-prop) with respect to their theoretical arc coverage, as well as their empirical recall and precision. For UAS/LAS, we also report results for models trained on the rel-PoS encoding.

Table 2: Percentage of arcs covered by the proposed encodings on the gold training sets of highly non-projective treebanks.
Theoretical advantage Table 2 compares the dependency arc coverage of the encodings on the gold training sets. The 2-planar encodings reconstruct highly non-projective datasets almost fully, while the original bracketing encoding suffers more. When comparing the two plane assignment strategies, we see that the coverage of 2p-greedy is already so high (99.9% or more in all but two treebanks) that the extra coverage provided by 2p-prop is not large in absolute terms. In fact, in some treebanks, 2p-prop even has slightly lower measured coverage than 2p-greedy, even though (as explained earlier) the former guarantees full coverage of 2-planar trees while the latter does not. This is explained by non-2-planar trees on which 2p-greedy happens to cover more arcs; in such trees, the theoretical guarantee provided by 2p-prop does not apply. With respect to the number of labels that each encoding generates (which directly impacts the output size of the softmax layers), Table 3 compares the output vocabulary sets for each of the tasks in the multi-task learning setup. For most languages, the bracketing encodings generate a smaller tag set than rel-PoS; and in general, the 2-planar encodings do not increase the tag set size with respect to the 1-planar bracketing encoding. In fact, for the most non-projective languages (like Ancient Greek or Basque), the 2-planar encodings clearly compress the tag set: in spite of having a larger variety of brackets, these are distributed among the two planes, so the bracket strings in each label tend to be shorter.

Results To investigate how the coverage in Table 2 translates into non-projective performance in actual parsing, we report the models' precision and recall.
In Table 4, the precision and recall on non-projective sentences increase across the treebanks with the 2-planar models, suggesting that they identify non-projective sentences to a greater extent than the original bracketing model. Table 5 shows that the 2p-greedy and 2p-prop models improve the recall and precision of non-projective dependencies in the majority of treebanks. Again, the 2-planar encodings outperform the original bracketing baseline, even though the latter covers non-projectivity to some degree (crossing arcs pointing in opposite directions). Both 2p-greedy and 2p-prop obtain similar scores, showing that their coverage is comparable.

Table 6: UAS and LAS (%) for the respective encodings on the predicted dev and test sets of highly non-projective treebanks and control treebanks.

Table 6 compares the LAS and UAS of the 1-planar and 2-planar encodings, and also of the rel-PoS encoding. The 2-planar encodings outperform the existing bracketing encoding in the majority of treebanks. The gains vary between languages, but on average 2p-greedy improves UAS by 0.4 and 2p-prop by 0.3, and both improve LAS by 0.4 across the highly non-projective treebanks. Comparing the two assignment strategies, the theoretical advantage in coverage of 2p-prop over 2p-greedy does not translate into accuracy gains in general, as the actual difference in coverage measured on the treebanks is small (as seen in Table 2) and the simpler greedy strategy is likely easier to learn by the machine learning setup.
Since the syntactic dependencies are represented by a finite set of labels seen in the training and development sets, as in all parsing as sequence labeling approaches, our model may encounter unseen labels at test time. In Appendix B we show the label coverage of all encodings on the test set. In general, the unseen labels do not have a significant impact on the overall performance, due to their rare occurrence. Finally, we measured the speed of each encoding on various treebanks, run on a single-core CPU and on a GPU, which we break down in Table 7. The speed is very similar between the 1-planar and 2-planar encodings: the bottleneck of the model is in the BiLSTMs, and computing the softmaxes comes at almost no cost despite the differences in output vocabularies.

Conclusion
We have presented a new bracketing-based linearization of 2-planar trees compatible with parsing as sequence labeling. Our main goal was to introduce a bracketing encoding able to perform almost unrestricted non-projective dependency parsing, which remained an open challenge in sequence labeling parsing under the family of bracketing encodings. Together with the proposed plane assignment strategies and a BiLSTM-based network, our 2-planar bracket representations improve over the existing bracketing-based encoding for parsing as sequence labeling, and also outperform the PoS-based encoding when PoS tags are not fed as input to the model. Thus, they can be a useful alternative where an encoding that depends on PoS tags is undesirable, e.g., in domains with low-frequency or low-quality PoS tags, or to decrease even further the latency of sequence labeling parsers.
Finally, it is worth noting that the plane assignment strategies we have proposed minimize the use of the second plane. Examining alternative strategies based on different criteria is a possible avenue for future work.

A Treebank sizes
We provide some statistics about the chosen treebanks. In Table 8, we report the total number of sentences for each dataset split with their respective non-projectivity percentage.

B Label coverage
At test time, our model assigns a label for each task by choosing one from a finite set learned during training. As a result, the model may be unable to predict some of the labels occurring in the test set. Table 9 reports the number of labels not seen in the training and dev sets and the total number of unique labels found in the test set, together with the percentage of occurrences of unseen labels with respect to the occurrences of all labels in the test set.

Table 9: Label coverage in each task at test time.