Bounded-Depth High-Coverage Search Space for Noncrossing Parses

A recently proposed encoding for noncrossing digraphs can be used to implement generic inference over families of these digraphs and to carry out first-order factored dependency parsing. It is now shown that the recent proposal can be substantially streamlined without information loss. The improved encoding is less dependent on hierarchical processing and it gives rise to a high-coverage bounded-depth approximation of the space of noncrossing digraphs. This subset is presented elegantly by a finite-state machine that recognizes an infinite set of encoded graphs. The set includes more than 99.99% of the 0.6 million noncrossing graphs obtained from the UDv2 treebanks through planarisation. Rather than taking the low probability of the residual as a flat rate, it can be modelled with a joint probability distribution that is factorised into two underlying stochastic processes: the sentence length distribution and the related conditional distribution for deep nesting. This model points out that deep nesting in the streamlined code requires extreme sentence lengths. High depth is categorically out in common sentence lengths but emerges slowly at infrequent lengths that prompt further inquiry.

Syntactic and semantic dependency structures (rooted trees and more general digraphs) have tremendous importance in multilingual language analysis, as demonstrated by the Universal Dependencies (UD) initiative (http://universaldependencies.org/) and many applications of dependency annotations. The main approaches to producing dependency structures include graph-based parsers (Eisner and Satta, 1999; McDonald et al., 2005), which build the structures in a bottom-up fashion, and transition-based parsers, which produce structures while reading the input buffer (Nivre, 2008; Bohnet et al., 2016).
Recently, Yli-Jyrä and Gómez-Rodríguez (2017) have explored a perspective that combines graph-based parsing with coding theory: instead of rewriting digraphs directly, they propose a linear encoding for noncrossing digraphs and then manipulate the code strings using string automata. This method brings the two parsing approaches closer to each other, as graphical parsing reduces to a combination of state-driven processing of the underlying regular component and graphical processing of the context-free component of the encoding. Their main result is that some 50 natural families of dependency structures reduce to unambiguous context-free languages. Generic parsing to noncrossing digraphs can thus be viewed as weighted context-free parsing.
The parsing objective in the framework of Yli-Jyrä and Gómez-Rodríguez (2017) is to maximize the total arc weight of the parse using an exact cubic-time inference procedure over the language associated with the input sentence. Cubic time is often too expensive, since alternative methods such as transition-based parsing with beam search may produce similar accuracy in linear time. Since higher efficiency is welcome in many real-time data applications, one may ask whether the encoded search space could be optimized to allow efficient, linear-time inference over the most plausible candidate graphs.
In this paper, we present an improved representation for the search space. First, the linear encoding of noncrossing graphs is streamlined by a technique called weak edge bracketing. Second, the context-free language of the streamlined encoding is further approximated with a regular subset that contains the most probable dependency analyses. This gives us a finite-state representation of the search space. We also construct a factored joint probability model for the event that the correct parse is outside of the search space. Under the approximation and the current experiments, the probability of a failure is less than 0.3% for all languages and 0.006% on average.
The low error rate means that if the proposed approximation were used to restrict the search space of state-of-the-art parsers, it could improve their efficiency significantly while leaving the parsing accuracy nearly intact. When used in this way as a search space restriction, the regular approximations for the families of digraphs proposed by Yli-Jyrä and Gómez-Rodríguez (2017) become available to higher-order graphical parsers, neural transition-based parsers and generative neural models of syntax.
The current paper focuses on the structure of, and the motivation for, the finite-state search space for noncrossing digraphs. Section 2 presents the problem of finding a good finite-state approximation of the search space. The streamlined context-free encoding for the noncrossing digraphs is introduced in Section 3. Sections 4 and 5 discuss its minimum-DFA state complexity and coverage under a bounded nesting depth. Before the conclusion, the results are related to the prior work and discussed critically in Section 6.

Definitions
In this section, I follow Yli-Jyrä and Gómez-Rodríguez (2017) to define noncrossing dependency graphs and describe how they can be encoded as linear strings.
Graphs A graph is a pair (V, E) where V is a finite set of vertices and E ⊆ {{u, v} ⊆ V } is a set of edges. It is common to assume that the edges do not contain self-loops of the form {v, v}. For convenience, the vertices in graphs are ordered V = [1, ..., n]. Two edges {i, j}, {k, l} in an ordered graph are said to be crossing if min{i, j} < min{k, l} < max{i, j} < max{k, l}. A graph is noncrossing if it has no crossing edges.
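The crossing condition above can be checked mechanically. The following is a minimal sketch, not part of the original framework; the function names `crosses` and `is_noncrossing` are illustrative, and the test is applied symmetrically to both orderings of the two edges.

```python
from itertools import combinations

def crosses(e, f):
    """Return True if the two edges cross in the vertex order."""
    (a, b), (c, d) = sorted(e), sorted(f)
    # Two edges cross when their endpoints strictly interleave.
    return a < c < b < d or c < a < d < b

def is_noncrossing(edges):
    """Return True if no pair of edges in the collection crosses."""
    return not any(crosses(e, f) for e, f in combinations(edges, 2))
```

For instance, the edges {1, 3} and {2, 4} interleave and therefore cross, whereas {1, 2}, {1, 4} and {2, 4} only share endpoints or nest, so they form a noncrossing graph.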
Yli-Jyrä and Gómez-Rodríguez (2017) have proposed a scheme according to which any noncrossing ordered graph ([1, ..., n], E) is encoded as a string of brackets using an algorithm called enc. Intuitively, pairs of brackets of the form {} can be interpreted as spaces between vertices, and then each set of matching brackets [...] encodes an edge that covers the spaces represented inside the brackets. The noncrossing ordered graphs are encoded with strings that constitute a context-free language, and these encoded graphs are generated exactly by a context-free grammar. This language of encoded graphs corresponds bijectively to the set of all noncrossing graphs.
Digraphs The encoding scheme extends to digraphs. A digraph is a pair (V, A) where A ⊆ V × V is a set of arcs u → v. Its underlying graph, (V, E_A), has edges E_A = {{u, v} | (u, v) ∈ A}. A noncrossing digraph is a digraph whose underlying graph is noncrossing. Any noncrossing ordered digraph ([1, ..., n], A) can be encoded with slight modifications to the encoding algorithm. Instead of printing [ ] for an edge {i, j} ∈ E_A, i ≤ j, the algorithm should now print a pair of brackets that also indicates the direction of the arc. In this way, a digraph such as ({1, 2, 3, 4}, {(1, 2), (4, 1), (4, 2)}) can be encoded as a single string. Again, there is a bijection between noncrossing digraphs and their encodings, L_DIGRAPHS. All n-vertex noncrossing digraphs are represented with the language L_DIGRAPHS ∩ W_n, where W_n = B*({}B*)^(n−1) describes the alternation between vertex boundaries {} and edge brackets B that exclude these boundaries.
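The regular constraint W_n = B*({}B*)^(n−1) can be illustrated with an ordinary regular expression. The sketch below is only an illustration: the bracket alphabet `EDGE_BRACKETS` is an assumed stand-in for B, and the helper names are hypothetical. Note that W_n only counts vertex boundaries; it does not check that the edge brackets are balanced.

```python
import re

# Assumed edge-bracket alphabet B; the real inventory depends on the encoding.
EDGE_BRACKETS = "[]<>/\\"

def w_n_pattern(n):
    """Compile a regex for W_n = B*({}B*)^(n-1)."""
    b_star = "[" + re.escape(EDGE_BRACKETS) + "]*"
    boundary = re.escape("{}")
    return re.compile(f"{b_star}(?:{boundary}{b_star}){{{n - 1}}}")

def in_w_n(code, n):
    """Return True if the code string has exactly n-1 vertex boundaries."""
    return w_n_pattern(n).fullmatch(code) is not None
```

Intersecting a family language with W_n then amounts to keeping only those code strings for which `in_w_n(code, n)` holds.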
Yli-Jyrä and Gómez-Rodríguez (2017) construct representations of context-free languages that encode important families of digraphs and graphs. Accordingly, there are context-free languages that correspond to the rooted noncrossing trees, projective trees, noncrossing dags etc.
Dependency Parsing The complete digraph (V, A) of a sentence S = x_1 ... x_n consists of vertices V = {1, ..., n} and all possible arcs A = {(i, j) | i ≠ j}. In the arc-factored model (McDonald et al., 2005), every arc i → j in this digraph is equipped with a positive weight w_ij that is predicted on the basis of the feature vectors associated with tokens i and j in the sentence. To facilitate local weight assignment to the pairs of brackets in the encoding, the brackets in the encoded digraphs can be indexed with the corresponding vertex numbers. The total weight of an indexed string is computed using a dynamically constructed semiring-weighted context-free grammar. Let ⊗ be a commutative monoid operation used to compute the total weight of a derivation under the grammar. For the example string, the grammar then returns the total weight w_12 ⊗ w_24 ⊗ w_14.
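As an illustration of the arc-factored total weight, the following sketch folds a commutative monoid operation ⊗ over the arcs of a candidate parse. It is not the weighted grammar itself; the function name `total_weight` and the numeric weights in the usage note are hypothetical.

```python
from functools import reduce
import operator

def total_weight(arcs, weights, otimes=operator.add, unit=0.0):
    """Combine the weights of the given arcs with the monoid operation otimes.

    `unit` must be the identity element of `otimes`
    (0.0 for addition, 1.0 for multiplication).
    """
    return reduce(otimes, (weights[arc] for arc in arcs), unit)
```

With assumed weights `{(1, 2): 0.5, (2, 4): 1.5, (1, 4): 2.0}`, addition as ⊗ gives 4.0 and multiplication (with unit 1.0) gives 1.5. Because ⊗ is commutative, the order in which the brackets are visited does not matter.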
The task of arc-factored dependency parsing is to find, in the specified family of graphs (e.g. noncrossing dags), the maximum-weight subgraph (V, A′), A′ ⊆ A, of the complete digraph (V, A). When the inference is restricted to a noncrossing family L_FAMILY of digraphs, the natural choice is to carry out the inference using a cubic-time algorithm that recognises the weighted context-free language and finds the indexed string w ∈ L_FAMILY ∩ W_n that maximizes the total weight of the derivation.

The Problem
For long sentences, a linear-time parsing algorithm can be considerably more efficient and attractive than a cubic-time parsing algorithm. Since the encoded digraphs support generic parsing to different families of digraphs, we would now like to provide foundations for a linear time variant of this generic parsing architecture. The most obvious approach would be to replace the context-free grammar of (di)graphs with a regular subset approximation.
Since the approximation is regular, there are at most a finite number of equivalence classes over possible parser states at any moment. The number of equivalence classes tells the size of a deterministic finite automaton. The search space of an arbitrary sentence consists of those digraphs that are encoded by the intersection of the language L FAMILY recognized by this automaton and the language W n defining the position-indexed brackets.
A good, practical approximation must satisfy at least three requirements:
• Complete core. A language-independent and generic approximation for the set of digraphs should certainly contain all digraphs over a small number (n ≤ 7) of tokens.
• Convenient size. The amount of the memory needed to carry out inference over long sentences should not stretch the limits of a convenient implementation.
• Good coverage. The approximation should have a good coverage of the existing analyses in treebanks.
During the inference for the best parse, the memory of a typical algorithm stores the parse candidates either as a non-center-embedding grammar or as a fully expanded finite-state automaton that represents the cross product of two finite-state representations, one for the search space and one for the strings with vertex-indexed brackets. Since the convenience of the expanded representation is not at all obvious, the current work focuses on its deterministic state complexity, i.e. the number of states in a minimal DFA recognizer.
Figure 2: Dependency tree of a Norwegian sentence: "For når f.eks. Harald frå partiet med det meir eller mindre passande namnet "Framstegspartiet" seier at vi har... kan ein spekulere ..." (gloss: "Since when e.g. Harald from the party with the more or less suitable name "Progress Party" says that we have... can one speculate ...").

With the encoding scheme of Yli-Jyrä and Gómez-Rodríguez (2017), the Complete core requirement gives a rough lower bound on the state complexity of the required approximation L_FAMILY. The search space for n = 7 equals the intersection W_7 ∩ L_FAMILY, whose state complexity is 106 372 states. Since the recognizer for W_7 requires exactly 14 states, the recognizer G for L_FAMILY must contain at least 106 372/14 = 7 598 states, because the cross product of the two automata has at most 14 · |G| states and cannot have fewer than 106 372 states. With this 7 598-state automaton G, the parsing of a 500-token sentence would require at least |W_500| · |G| = 1 000 · 7 598 = 7 598 000 states (footnote 2). An automaton of this size is usually inconvenient to operate with and does not satisfy the Convenient size requirement (footnote 3). Since only 6 overlapping edges are observed in 7-word sentences, this approximation would also fail to cover treebanks where more than 10 nested edges are quite common (Figure 2), breaking the Good coverage requirement. Thus, a DFA approximation based on the prior encoding is doomed to fail in a real-life scenario (footnote 4). The current research problem is to improve the representation of the search space in such a way that fewer DFA states are needed and more complex structures can be captured with a convenient number of equivalence classes.
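The lower-bound arithmetic in this argument can be reproduced with a few lines. The sketch assumes, as the figures in the text suggest (14 states for W_7), that the recognizer for W_n has 2n states; the names `w_states`, `g_lower_bound` and `product_500` are illustrative.

```python
import math

def w_states(n):
    """Assumed size of the W_n recognizer: two states per token (14 for n = 7)."""
    return 2 * n

CORE_STATES = 106_372  # state complexity of the n = 7 search space, from the text

# A product automaton has at most |W_7| * |G| states, so |G| >= CORE_STATES / |W_7|.
g_lower_bound = math.ceil(CORE_STATES / w_states(7))

# Size of the expanded product for a 500-token sentence with such a G.
product_500 = w_states(500) * g_lower_bound
```

Under this assumption, `g_lower_bound` is 7 598 and `product_500` is 7 598 000, matching the figures quoted above.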

The Improved Encoding
The prior encoding can be improved through weak, reduced bracketing that packs adjacent closing or opening brackets into a single symbol. The idea has historical links to superbrackets in Interlisp (Teitelman, 1978), but similar ideas have been introduced to the bracketing of phrase structure trees (Langendoen, 1975; Krauwer and des Tombe, 1981). In the context of edge brackets, the idea of weak bracketing appears partially in Yli-Jyrä (2004).

Footnote 2: The seven longest sentences in the UDv2 dataset viewed by the author consist of 399, 428, 493, 496, 504, 534 and 610 tokens.
Footnote 3: A hierarchical grammar representation would be much more succinct but it assumes richer structures that we would like to preserve for optimizations of the implementation.
Footnote 4: We did not even consider the latent encoding of Yli-Jyrä and Gómez-Rodríguez. The latent encoding has a more complex local structure and requires drastically more states in a DFA representation.

Nested Sibling Edges
The key observation is that nested sibling edges give rise to adjacent copies of one-sided brackets. The idea is that if we indicate both sides of the outermost sibling edge, its nested siblings share one of its ends and thus need only a one-sided bracket. Our current contribution to this idea is to observe (1) that both sides of the outermost brackets can be shared and (2) that every edge can be independently directed or undirected. Figure 2 shows how this encoding is applied to a real dependency tree.

Re-encoder
There is a transducer that converts the original "strong" edge bracketing into weak edge bracketing. The transducer is represented with four components in Figure 3:
3. Sequential function E that elides the redundant brackets.
4. A reflexive regular relation T_4 whose iterated application implements the fact that balanced brackets may cancel each other.
If X is a finite set of encoded graphs, we can re-encode these graphs as the corresponding set of weakly bracketed graphs X_w. This set is computed with a composition of the components, where Dom returns the input projection and X, C and ε are viewed as identity relations. Since the composition closure T_4 ∘ T_4 ∘ ... ∘ T_4 ∘ ε maps the balanced bracket strings to the empty word, its input projection is actually a Dyck language D_4, a balanced language over the four kinds of balanced brackets. With this context-free language, we can define the re-encoder as a function that maps the strongly bracketed strings X to the weakly bracketed strings X_w. This re-encoder function extends to encoded digraphs in the natural way by extending the set of brackets.
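The idea that iterated cancellation of adjacent matching brackets characterises a Dyck language can be sketched as follows. This is a deterministic string-rewriting analogue of the iterated relation, not the transducer T_4 itself; the four bracket pairs in `PAIRS` are placeholders chosen for illustration and are not the bracket inventory of the encoding.

```python
# Placeholder inventory of four bracket kinds (an assumption for illustration).
PAIRS = ("[]", "()", "<>", "«»")

def cancel_once(s):
    """Delete every adjacent opening/closing pair of the same kind once."""
    for pair in PAIRS:
        s = s.replace(pair, "")
    return s

def is_balanced(s):
    """Return True if repeated cancellation reduces s to the empty word."""
    while True:
        reduced = cancel_once(s)
        if reduced == s:        # fixpoint reached: nothing left to cancel
            return s == ""
        s = reduced
```

A string is accepted exactly when the fixpoint of the cancellation is the empty word, which mirrors taking the input projection of the composition closure that ends in ε.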

Improved State Complexity
The state complexity of the set of all n-vertex noncrossing digraphs is substantially smaller for the streamlined encoding than for the original one. We are also able to go beyond n = 7 and build complete search spaces of undirected graphs with more vertices. Figure 4 shows that the state complexity difference between the original and the new encoding scheme for digraphs, projective trees and graphs is indeed exponential in the length of the sentences. With the simple trick that replaces /! and <! with [!, but otherwise separates the brackets /, >, ], [, <, / , [!, ]!, / !, >!, the search space generalizes from noncrossing graphs to noncrossing digraphs without any increase in the state complexity. This gives another significant saving in the state complexity of the digraph search space compared to the original encoding.

The Context-Free Set of Digraphs
There is an extended context-free grammar that generates all encoded undirected graphs using the weak edge bracketing. The derivation steps of this grammar do not correspond exactly to the nesting of brackets. The grammar can, however, be converted to an extended context-free grammar where each derivation step corresponds to a new nesting level. This is illustrated on the left of Figure 5. We implement the grammar as a transducer that is iterated until the sentential form is in the language {} | S*. The iterated transducer G_S is shown on the right of Figure 5.

Finite-State Search Space
A finite-state approximation is obtained from the grammar by composing copies of the bottom-up recognizer G_S and by restricting the output language to {} | S*. The language L_GRAPHS(5) is constructed using 6 copies of the G_S transducer. The language L_GRAPHS(5) has 442 distinguishable DFA states and 1388 transitions. This approximation of the grammar can be used to represent most of the search space of n-vertex sentences with fewer states than the length-limited finite subset of the context-free language L_GRAPHS that contains all graphs. When L_GRAPHS(5) and L_GRAPHS are constrained with W_n, we obtain regular languages whose state complexity can be measured. The difference between these is illustrated in Figure 6. It is interesting that although the approximation captures only 6 levels (5 nested ones) of superbrackets, it gives exact results until the graphs have 14 vertices. This is because each pair of superbrackets encodes an edge whose two endpoints are non-incident with the edges that correspond to nested superbrackets. Since the approximated search space grows linearly in the sentence length, the approximated search space of 21-vertex graphs requires 5 501 states. Interestingly, this subset approximation still covers 99% of the complete search space as it contains 3 358 682 892 406 358 016 graphs.
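The notion of a depth bound can be made concrete with a small sketch that measures the maximum number of simultaneously open brackets in a code string and tests it against a bound k. It operates on plain [ ] brackets for simplicity; how levels are counted for superbrackets in the weak encoding is not reproduced here, so the depth convention and the names `nesting_depth` and `within_depth` are assumptions.

```python
def nesting_depth(code, openers="[", closers="]"):
    """Return the maximum number of simultaneously open brackets in code."""
    depth = max_depth = 0
    for ch in code:
        if ch in openers:
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch in closers:
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without a matching opener")
        # Any other symbol, such as the vertex boundary {}, is ignored.
    if depth != 0:
        raise ValueError("unclosed opening bracket")
    return max_depth

def within_depth(code, k):
    """Return True if the code string stays within nesting depth k."""
    return nesting_depth(code) <= k
```

A bounded-depth regular approximation accepts exactly those well-formed code strings for which such a test succeeds, which is why a finite number of states suffices.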

Data Coverage
The length of real sentences in treebanks and in texts varies a lot. At very high lengths, it is not easy to tell without real data how probable it is that a finite-state approximation does not contain the correct analysis graph or digraph. In order to learn about the probability of the out-of-the-search-space event, we used the UD treebanks as the first proxy to find out how often a given nesting depth is exceeded in gold trees.
Our current encoding can handle only noncrossing trees such as projective trees. However, the trees in UD version 2 treebanks do not have this restriction. It is well known that the proportion of non-projective, and thus crossing, trees is relatively high for some languages (Gómez-Rodríguez, 2016). If all nonprojective analyses were discarded, most of the long sentences would have been excluded, with significant effect on our experiments.
For the experiments, we had to enforce a noncrossing structure on the data. The standard approach is to perform lift transformations that move crossing edges higher in the dependency tree. Although there are methods to minimize the number of lifts either heuristically or exactly, the output of such a transformation is not uniquely determined. The second way to projectivise trees is to keep the dominance tree intact but reorder the nodes of the tree into a canonical order that maintains the relative order of the immediate dependents of each head. The third approach, actually used by us, views the trees as undirected trees and takes advantage of the algebraic properties of balanced brackets: when the brackets of crossing edges are read as balanced brackets, they pair up differently, and the resulting string encodes a slightly different set of edges. This method works also with reduced bracketing where the opening and closing superbrackets cancel one another. The resulting code string preserves the number of edges but some ends of these edges may change. The result may be a cyclic or disconnected graph but it keeps the noncrossing parts of the graph intact because there the brackets match the two ends of each edge.
The total number of sentences in the sample was 630 518. This includes both the training and the development sections of the UD v.2 treebanks. The data was encoded with our new encoding scheme and automatically converted to noncrossing undirected graphs before the nesting depth of each sentence was computed. Table 1 describes how the nesting depth corresponds to coverage in the UD v.2 data set. It indicates that depth 5 is a good compromise between depth and coverage. Under this depth, the search space contains 99.994% of all the trees in the treebanks. Only 35 sentences require weak bracketing whose nesting depth is more than 5. Thus the flat out-of-the-search-space failure rate is only 0.006% of the sentences.
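The flat failure rate follows directly from the two counts reported above. The following lines are only a check of that arithmetic; the variable names are illustrative.

```python
TOTAL_SENTENCES = 630_518   # sample size reported in the text
OUTSIDE = 35                # sentences whose nesting depth exceeds 5

failure_rate = OUTSIDE / TOTAL_SENTENCES
coverage = 1 - failure_rate

failure_text = f"{failure_rate:.3%}"   # formatted as a percentage
coverage_text = f"{coverage:.3%}"
```

Rounded to three decimals, the failure rate is 0.006% and the coverage is 99.994%, in agreement with the figures in the text.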
Figure 7: Left: the sentence length distribution P(n-bucket) = count(n-bucket) / count(all sentences) and the conditional probability P(d ≥ k | n) = count(d, n-bucket) / count(n-bucket) of exceeding a given depth of weak bracketing, given the sentence length. Right: the joint distribution P(d ≥ k | n-bucket) = count(d, n-bucket) / bucket size / total for the event of exceeding the depth, given the sentence length, and the conditional distribution P(≤ n | d) = count(≤ n-bucket, d) / count(d) of the maximum sentence length, given the bracketing depth. Some distributions were rescaled because the numeric values of nearby distributions differ by orders of magnitude.
The error rate leaves us with doubts about the representativeness of the sample.
To gain more insight into the underlying process, a more sophisticated statistical model was constructed. For this purpose, the multilingual data was smoothed by splitting the sentence lengths into buckets that contained typically some 30k sentences.
The sentence length and nesting depth combine to form a joint distribution P(d > k, n) that can be factored in different ways. Figure 7 presents the distribution of sentence lengths and the related conditional and joint distributions as they could be observed. The statistics support the following observations:
4. The current experiments seem to break the direct link from long sentences to deep nesting but support the opposite tendency. The probability of a sentence being deep is zero when the number of tokens is less than 14, but very deep nesting predicts high sentence length (P(n > 45 | d = 6) > 99%).
5. The proportion of deep analyses in the search space is higher than the corresponding proportion in the real data for each length. In particular, deep nesting in sentences with 20-21 tokens is expected to happen with a probability of almost 1% if all analyses were equally probable (Figure 6), but the observed probability of the event is under 0.002% (P(d ≥ 6 | n ∈ [20, 21]) = 0.002%).
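The factorisation of the joint distribution into a length distribution and a conditional depth distribution can be sketched with bucketed counts. The sketch below is a minimal illustration under the assumption that each length bucket is summarised by two counts; the function name and the counts in the usage note are hypothetical and do not come from the treebank data.

```python
def exceed_probability(buckets):
    """Sum P(n-bucket) * P(d > k | n-bucket) over all length buckets.

    `buckets` is a list of (sentences_in_bucket, deep_sentences_in_bucket) pairs.
    """
    total = sum(n_sent for n_sent, _ in buckets)
    probability = 0.0
    for n_sent, n_deep in buckets:
        if n_sent == 0:
            continue                         # an empty bucket contributes nothing
        p_length = n_sent / total            # P(n-bucket)
        p_deep_given_length = n_deep / n_sent  # P(d > k | n-bucket)
        probability += p_length * p_deep_given_length
    return probability
```

With made-up buckets `[(1000, 0), (500, 0), (100, 2)]`, only the last bucket contributes, and the result equals 2/1600. Keeping the two factors separate makes it possible to replace the observed length distribution with that of another text type while reusing the conditional depth estimates.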

Discussion
There are three factors that contribute to deep nesting structures in bracketed binarised trees:
• Tail Recursion. Repeated left- and right-branching structures generate initial and final forms of tail recursion, as well as zigzag embedding that involves both.
• Unit Rules. Non-branching trees embed trees in a way that may generate new nesting levels in bracketing.
• Local Factorization. Local factorization of unranked parse trees and unbounded sibling edges can generate unbounded recursion.
Unit rules and tail recursion have been addressed in several prior studies as their naive bracketing may involve unbounded stack or nested brackets. The prior approaches include grammar transformations in parsers (Langendoen, 1975; Johnson, 1998; Nederhof, 2000), weak bracketing of constituent trees (Langendoen and Langsam, 1984; Krauwer and des Tombe, 1981; Black, 1989; Koskenniemi, 1990), edge bracketing of dependency trees (Oflazer, 2003; Yli-Jyrä, 2005, 2012) and further ideas (Church, 1980; Langendoen, 2008; Hulden and Silfverberg, 2014). The current work goes another mile in the study of bounded nesting by introducing weak edge bracketing that avoids unbounded recursion when unbounded local trees are binarized and bracketed (footnote 8). The work also shows that the classic technique can be extended to all noncrossing digraphs. This extends the relevance of finite-state transducer techniques from syntax to semantic dependency parsing.
The weak edge bracketing is effective in improving the state complexity of finite sets of noncrossing digraphs (Figure 4). Thanks to the new bracketing scheme, the coverage of depth-bounded approximations is very high and seems to deteriorate more slowly than the state complexity of the exact search space grows (Figure 6). A further improvement was obtained by the trick that maintained the state complexity when the encoding for graphs was generalized to digraphs.
A careful comparison between methods that ensure noncrossing structures could reveal some interesting differences. While the nonprojective trees can be projectivised (and thus turned into noncrossing graphs) with an optimal number of lifts, the lifted tree is still not unique and it may be intractable to optimize the nesting depth over the choices. It may also be a problem if the input is not a tree, and therefore the existing graph banks should also be explored in the statistics. Natural next steps in the proposed methodology would be to adapt the parametrizable parsing framework of Yli-Jyrä and Gómez-Rodríguez (2017) to weak bracketing and to implement a linear-time arc-factored parser for important families of noncrossing digraphs. It is also important to find ways to combine the current encoding with higher-order dependency parsing and neural network grammars that weight local trees. The main challenge is, however, to extend the current work to crossing graphs and nonprojective trees.

Footnote 8: The idea started to develop already in Yli-Jyrä (2004).

Conclusion
In this paper, we have presented the first weak edge bracketing scheme for noncrossing digraphs and described the bijectively related context-free language of code strings. Our scheme improves over that of Yli-Jyrä and Gómez-Rodríguez (2017), as the minimum DFA state complexity of search space representations is smaller and unranked dependencies can be processed without recursion.
The experiments indicate that 5 nested levels of balanced brackets are sufficient to cover nearly all noncrossing UD v.2 data. According to the joint distribution of length and depth, deep nesting is a rare event that correlates with an exceptional sentence length.
The language L_GRAPHS(5), representing the subset approximation of the search space for all noncrossing graphs with nesting depth 5, can be applied beyond the Yli-Jyrä and Gómez-Rodríguez (2017) framework to guide and restrict syntactic parsing and generation. This is expected to lead to linear-time syntactic and semantic dependency parsing with noncrossing output structures. If combined with multiplanar/book embedding methods (Kuhlmann and Johnsson, 2015), the techniques might extend to non-projective parsing and crossing graphs.