A Linear-Time Transition System for Crossing Interval Trees

We define a restricted class of non-projective trees that 1) covers many natural language sentences; and 2) can be parsed exactly with a generalization of the popular arc-eager system for projective trees (Nivre, 2003). Crucially, this generalization adds only constant overhead in run-time and space, keeping the parser's total run-time linear in the worst case. In empirical experiments, our proposed transition-based parser is more accurate on average than both the arc-eager system and the swap-based system, an unconstrained non-projective transition system with a worst-case quadratic runtime (Nivre, 2009).


Introduction
Linear-time transition-based parsers that use either greedy inference or beam search are widely used today due to their speed and accuracy (Nivre, 2008; Zhang and Clark, 2008; Zhang and Nivre, 2011). Of the many proposed transition systems (Nivre, 2008), the arc-eager transition system of Nivre (2003) is one of the most popular for a variety of reasons. The arc-eager system has a well-defined output space: it can produce all projective trees and only projective trees. For an input sentence with n words, the arc-eager system always performs 2n operations and each operation takes constant time. Another attractive property of the arc-eager system is the close connection between the parameterization of the parsing problem and the final predicted output structure. In the arc-eager model, each operation has a clear interpretation in terms of constraints on the final output tree (Goldberg and Nivre, 2012), which allows for more robust learning procedures (Goldberg and Nivre, 2012).
The arc-eager system, however, cannot produce trees with crossing arcs. Alternative systems can produce crossing dependencies, but at the cost of taking O(n^2) transitions in the worst case (Nivre, 2008; Nivre, 2009; Choi and McCallum, 2013), requiring more transitions than arc-eager to produce projective trees (Nivre, 2008; Gómez-Rodríguez and Nivre, 2010), or producing trees in an unknown output class (Attardi, 2006).
Graph-based non-projective parsing algorithms, on the other hand, have been able to preserve many of the attractive properties of their corresponding projective parsing algorithms by restricting search to classes of mildly non-projective trees (Kuhlmann and Nivre, 2006). Mildly non-projective classes of trees are characterizable subsets of directed trees. Classes of particular interest are those that both have high empirical coverage and that can be parsed efficiently. With appropriate definitions of feature functions and output spaces, exact higher-order graph-based non-projective parsers can match the asymptotic time and space of higher-order projective parsers (Pitler, 2014).
In this paper, we propose a class of mildly non-projective trees (§3) and a transition system (§4) that is sound and complete with respect to this class (§5) while preserving desirable properties of arc-eager: it runs in O(n) time in the worst case (§6), and each operation can be interpreted as a prediction about the final tree structure. At the same time, it can produce trees with crossing dependencies. Across ten languages, on average 96.7% of sentences have dependency trees in the proposed class (Table 1), compared with 79.4% for projective trees. The implemented mildly non-projective transition-based parser is more accurate than a fully projective parser (arc-eager; Nivre, 2003) and a fully non-projective parser (swap-based; Nivre, 2009) (§7.1).

Preliminaries
Given an input sentence w_1 w_2 ... w_n, a dependency tree for that sentence is a set of vertices V = {0, 1, ..., n} and arcs A ⊂ V × V. Each vertex i corresponds to a word in the sentence and vertex 0 corresponds to an artificial root word, which is standard in the literature. An arc (i, j) ∈ A represents a dependency between a modifier w_j and a head w_i. Critically, the arc set A is constrained to form a valid dependency tree: its root is at the leftmost vertex 0; each vertex i has exactly one incoming arc (except 0, which has no incoming arcs); and there are no cycles. A common extension is to add labels of syntactic relations to each arc. For ease of exposition, we will focus on the unlabeled variant during the discussion but use a labeled variant during experiments.
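The three tree constraints above can be checked mechanically. The following Python sketch is illustrative only: the function name is_valid_dependency_tree and the encoding of arcs as (head, modifier) pairs are assumptions made for this example.

```python
def is_valid_dependency_tree(n, arcs):
    """Check that arcs (head, modifier) over vertices 0..n form a tree rooted at 0."""
    heads = {}
    for h, m in arcs:
        if not (0 <= h <= n and 1 <= m <= n):
            return False          # vertex out of range, or an arc into the root
        if m in heads:
            return False          # m would have two incoming arcs
        heads[m] = h
    if len(heads) != n:
        return False              # some word has no incoming arc
    # Following heads from any word must reach the root without revisiting a vertex.
    for v in range(1, n + 1):
        seen = set()
        while v != 0:
            if v in seen:
                return False      # cycle
            seen.add(v)
            v = heads[v]
    return True
```

This version re-walks each head chain, so it is quadratic in the worst case; it is meant as a specification of the constraints rather than an efficient check.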
A dependency tree is projective if and only if the nodes in the yield of each subtree form a contiguous interval with respect to the words and their order in the sentence. For instance, the tree in Figure 1a is non-projective since the subtrees rooted at came and parade do not cover a contiguous set of words. Equivalently, a dependency tree is non-projective if and only if the tree cannot be drawn in the plane above the sentence without crossing arcs. As we will see, these crossing arcs are a useful measure when defining sub-classes of non-projectivity. We will often reason about the set of vertices incident to a particular arc. The incident vertices of an arc are its endpoints: for an arc (u, v), u and v are the two vertices incident to it.
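Under the crossing-arc characterization, projectivity reduces to a test on pairs of arcs: two arcs cross exactly when their endpoints interleave. A minimal sketch follows, assuming arcs are given as (head, modifier) pairs over the vertices 0..n; the function names are ours and the pairwise loop is deliberately naive.

```python
def crosses(arc1, arc2):
    """True if the two arcs cross when drawn above the sentence."""
    a, b = sorted(arc1)
    c, d = sorted(arc2)
    # Endpoints interleave strictly; arcs sharing an endpoint do not cross.
    return a < c < b < d or c < a < d < b

def crossed_arcs(arcs):
    """All arcs that are crossed by at least one other arc."""
    arcs = list(arcs)
    return [x for x in arcs if any(crosses(x, y) for y in arcs)]

def is_projective(arcs):
    """With the root at the leftmost position, no crossing means projective."""
    return not crossed_arcs(arcs)

# Arcs (1, 3) and (0, 2) interleave as 0 < 1 < 2 < 3, so this tree is non-projective.
example = [(0, 1), (1, 3), (0, 2), (3, 4)]
print(crossed_arcs(example))   # [(1, 3), (0, 2)]
```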

k-Crossing Interval Trees
We begin by defining a class of trees based on restrictions on crossing dependencies. The class definition is independent of any transition system; it is easy to check whether a particular tree is within the class or not. We compare the coverage of this class on various natural language datasets with the coverage of the class of projective trees.
Definition 1. Let A be a set of unlabeled arcs. The Interval of A, Interval(A), is the interval from the leftmost vertex in A to the rightmost vertex in A.
Definition 2. For any dependency tree T, the procedure below partitions the crossed arcs in T into disjoint sets A_1, A_2, ..., A_l such that Interval(A_1), Interval(A_2), ..., Interval(A_l) are all vertex-disjoint. These intervals are the crossing intervals of the tree T.
Procedure: Construct an auxiliary graph with a vertex for each crossed arc in the original tree. Two such vertices are connected by an arc if the intervals defined by the arcs they correspond to have a non-empty intersection. Figure 1b shows the auxiliary graph for the sentence in Figure 1a. The connected components of this graph form a partition of the graph's vertices, and so also partition the crossed arcs in the original sentence. The intervals defined by these groups cannot overlap, since then the crossed arcs that span the overlapping portion would have been connected by an arc in the auxiliary graph and hence been part of the same connected component.
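The procedure can be sketched directly: collect the crossed arcs, connect two of them when their intervals intersect, and read off the connected components. The Python below is one possible rendering under stated assumptions: arcs are (head, modifier) pairs, intervals are treated as closed (so arcs that share an endpoint count as intersecting), and all function names are ours.

```python
def span(arc):
    return (min(arc), max(arc))

def crosses(arc1, arc2):
    (a, b), (c, d) = span(arc1), span(arc2)
    return a < c < b < d or c < a < d < b

def crossing_intervals(arcs):
    """Return [((left, right), crossed_arcs_in_group), ...] sorted left to right."""
    arcs = list(arcs)
    crossed = [x for x in arcs if any(crosses(x, y) for y in arcs)]
    # Auxiliary graph: one node per crossed arc, an edge when closed spans intersect.
    neighbours = {i: [] for i in range(len(crossed))}
    for i in range(len(crossed)):
        for j in range(i + 1, len(crossed)):
            (a, b), (c, d) = span(crossed[i]), span(crossed[j])
            if max(a, c) <= min(b, d):
                neighbours[i].append(j)
                neighbours[j].append(i)
    # Connected components by depth-first search.
    seen, groups = set(), []
    for start in range(len(crossed)):
        if start in seen:
            continue
        seen.add(start)
        todo, component = [start], []
        while todo:
            i = todo.pop()
            component.append(crossed[i])
            for j in neighbours[i]:
                if j not in seen:
                    seen.add(j)
                    todo.append(j)
        left = min(min(x) for x in component)
        right = max(max(x) for x in component)
        groups.append(((left, right), sorted(component)))
    return sorted(groups)

# Two separate crossings: (1,3) with (2,4), and (5,7) with (6,8).
tree = [(0, 1), (1, 2), (1, 3), (2, 4), (0, 5), (5, 6), (5, 7), (6, 8)]
print(crossing_intervals(tree))
# [((1, 4), [(1, 3), (2, 4)]), ((5, 8), [(5, 7), (6, 8)])]
```

Because the components of an interval-overlap graph are just maximal runs of overlapping intervals, a sort-and-sweep over the crossed arcs would give the same partition; the explicit graph is kept here only to mirror the wording of the procedure.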
Definition 3. A tree is a k-Crossing Interval tree if for each crossing interval, there exist at most k vertices such that a) all crossed arcs within the interval are incident to at least one of these vertices and b) any vertex in the interval that has a child on the far side of its parent is one of these k vertices.
Figure 1a shows a 2-Crossing Interval tree. For the first crossing interval, think and came satisfy the conditions; for the second, parade and held do. The coverage of 2-Crossing Interval trees is shown in Table 1. Across datasets from ten languages with a non-negligible proportion of crossing dependencies, on average 96.7% of dependency trees are 2-Crossing Interval, within 1.3% of the larger 1-Endpoint-Crossing class and substantially larger than the 79.4% coverage of projective trees. Coverage increases as k increases; for 3-Crossing Interval trees, the average coverage reaches 98.6%. Punctuation tokens are excluded when computing coverage to better reflect language-specific properties rather than treebank artifacts; for example, the Turkish CoNLL data attaches punctuation tokens to the artificial root, causing a 15% absolute drop in coverage for projective trees when punctuation tokens are included (89.9% vs. 74.7%).
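Definition 3 can be tested by exhaustive search for small k. The sketch below is our reading of the definition, not a reference implementation: it assumes the crossing intervals and their crossed arcs have already been computed (Definition 2) and are passed in as groups, that heads maps each modifier to its head with no entry for the root, and that "a vertex in the interval" means a vertex whose position lies in the closed interval. All names are hypothetical.

```python
from itertools import combinations

def far_side_vertices(heads, left, right):
    """Vertices v in [left, right] with a child m on the far side of v's parent g."""
    result = set()
    for m, v in heads.items():            # v is the head of m
        g = heads.get(v)                  # parent of v; None when v is the root
        if g is None or not (left <= v <= right):
            continue
        if v < g < m or m < g < v:
            result.add(v)
    return result

def is_k_crossing_interval(heads, groups, k):
    """groups: list of ((left, right), crossed_arcs), one entry per crossing interval."""
    for (left, right), crossed in groups:
        forced = far_side_vertices(heads, left, right)       # condition (b)
        candidates = sorted(forced | {v for arc in crossed for v in arc})
        found = False
        for size in range(k + 1):
            for chosen in combinations(candidates, size):
                chosen = set(chosen)
                covers = all(h in chosen or m in chosen for h, m in crossed)  # condition (a)
                if forced <= chosen and covers:
                    found = True
                    break
            if found:
                break
        if not found:
            return False
    return True

heads = {1: 0, 2: 1, 3: 1, 4: 2, 5: 0, 6: 5, 7: 5, 8: 6}
groups = [((1, 4), [(1, 3), (2, 4)]), ((5, 8), [(5, 7), (6, 8)])]
print(is_k_crossing_interval(heads, groups, 2))   # True
print(is_k_crossing_interval(heads, groups, 1))   # False
```

The second call illustrates why a single vertex never suffices: a crossing always involves two vertex-disjoint arcs, so no one vertex can be incident to both.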

Connections to Other Tree Classes
k = 0 or k = 1 gives exactly the class of projective trees (even a single crossing implies two vertex-disjoint crossed edges). 2-Crossing Interval trees are a subset of the linguistically motivated 1-Endpoint-Crossing trees (each crossed edge is incident to one of the two vertices for the interval, so all edges that cross it are incident to the other vertex for the interval); all of the examples from the linguistics literature provided in Pitler (2013, p.132-136) for 1-Endpoint-Crossing trees are 2-Crossing Interval trees as well. 2-Crossing Interval trees are not necessarily well-nested and can have unbounded block degree (Kuhlmann, 2013). Figure 2 shows an example of a 2-Crossing Interval tree (all crossed edges are incident to either a or b; no children are on the far side of their parent) in which the subtrees rooted at a and b are ill-nested and each has a block degree of n + 1.
Figure 2: A 2-Crossing Interval tree that is not well-nested and has unbounded block degree. The vertices, in sentence order, are root, b, a_1, b_1, a_2, b_2, ..., a_{n−1}, b_{n−1}, a_n, b_n, a.

Two-Registers Transition System
A transition system for dependency parsing comprises: 1) an initial configuration for an input sentence; 2) a set of final configurations after which the parsing derivation terminates; and 3) a set of deterministic transitions for transitioning from one configuration to another (Nivre, 2008).
Our transition system builds on one of the most commonly used transition systems for parsing projective trees, the arc-eager system (Nivre, 2003). An arc-eager configuration, c, is a tuple, (σ, β, A), where 1) σ is a stack consisting of a subset of processed tokens; 2) β is a buffer consisting of unprocessed tokens; and 3) A is the set of dependency arcs already added to the tree.
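For readers less familiar with arc-eager, the configuration and its four standard transitions (Shift, Reduce, Left-Arc, Right-Arc) can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the class and function names are ours, the arc set A is stored as a modifier-to-head dictionary, and the initial configuration (root on the stack, all words on the buffer) is an assumption in line with common presentations of the system.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Config:
    stack: list                                  # sigma; top of stack is stack[-1]
    buffer: deque                                # beta; front of buffer is buffer[0]
    heads: dict = field(default_factory=dict)    # A, stored as modifier -> head

def initial_config(n):
    # Assumed initial configuration: root 0 on the stack, words 1..n on the buffer.
    return Config(stack=[0], buffer=deque(range(1, n + 1)))

def shift(c):
    """Move the front of the buffer onto the stack."""
    c.stack.append(c.buffer.popleft())

def reduce_top(c):
    """Pop the stack top; only allowed once it has a head."""
    if c.stack[-1] not in c.heads:
        raise ValueError("Reduce requires the stack top to have a head")
    c.stack.pop()

def left_arc(c):
    """Add the arc (buffer front -> stack top) and pop the stack top."""
    s, b = c.stack[-1], c.buffer[0]
    if s == 0 or s in c.heads:
        raise ValueError("Left-Arc requires a headless, non-root stack top")
    c.heads[s] = b
    c.stack.pop()

def right_arc(c):
    """Add the arc (stack top -> buffer front) and push the buffer front."""
    s, b = c.stack[-1], c.buffer[0]
    c.heads[b] = s
    c.stack.append(c.buffer.popleft())

# Derive the projective tree 0 -> 2, 2 -> 1, 2 -> 3 for a three-word sentence.
c = initial_config(3)
for step in (shift, left_arc, right_arc, right_arc):
    step(c)
print(c.heads)   # {1: 2, 2: 0, 3: 2}
```

Every transition touches only the stack top and the buffer front, which is what makes each step constant time.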
We define a new transition system called two-registers. Configurations are updated to include two registers R1 and R2, i.e., c = (σ, β, R1, R2, A). A register contains one vertex or is empty: R1, R2 ∈ V ∪ {null}. Table 2 defines both the arc-eager and two-registers transition systems. The two-registers system includes the arc-eager transitions (top half of Table 2) and three new transitions that make use of the registers (bottom half of Table 2):
• Store: Moves the token at the front of the buffer into the first available register, optionally adding an arc between this token and the token in the first register.
• Clear: Removes tokens from the registers, reducing them completely if they are covered by an edge in A or otherwise placing them back on the stack in order. If either R2 or the top of the stack is the token immediately to the left of the front of the buffer, that token is placed back on the buffer instead.
• Register-Stack: Adds an arc between the top of the stack and one of the registers.
A derivation excerpt for the clause in Figure 3 is shown in Table 3. The two tokens incident to all crossed arcs, helped and paint, are stored in the registers. The crossed arcs are then added through Register-Stack transitions, working outward from the registers through the previous words in the sentence: (paint, house), then (helped, Hans), etc. After all the crossed arcs incident to these two tokens have been added, the registers are cleared.
Table 3: An excerpt from a gold standard derivation of the sentence in Figure 3. The two words helped and paint are added to the registers and then crossed arcs are added between them and the top of the stack.
Preconditions related to rootedness, single-headedness, and acyclicity follow the arc-eager system straightforwardly: each transition that adds an arc (h, m) checks that m is not the root, m does not already have a head, and that h is not a descendant of m. Preconditions used to guarantee that trees output by the system are within the desired class are listed in Table 4. In particular, they ensure that all crossed arcs are incident to registers, and that each pair of registers entails an interval corresponding to a self-contained set of crossed edges. To avoid traversing A while checking preconditions, two helper constants are used: IsCovered(Rk) 2 and last 3 .
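The three generic preconditions on adding an arc (h, m) can be written down as a single predicate. The sketch below is a naive version for illustration: it walks the head chain from h, which takes time proportional to the depth of h, whereas the system described here relies on book-keeping that avoids such traversals. The function name and the modifier-to-head dictionary are assumptions of this example.

```python
def can_add_arc(heads, h, m):
    """heads maps each modifier to its head; the root 0 has no entry.

    Assumes the arcs already in heads are acyclic.
    """
    if m == 0:
        return False              # rootedness: the root never gets a head
    if m in heads:
        return False              # single-headedness: m already has a head
    # Acyclicity: h must not be m or a descendant of m.
    v = h
    while v is not None:
        if v == m:
            return False
        v = heads.get(v)
    return True
```

For instance, with heads = {2: 1}, the arc (2, 1) is rejected because 2 is already a descendant of 1, while the arc (0, 1) is allowed.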
2 IsCovered(R1) is true if there exists an arc in A with endpoints on either side of R1. Rather than enumerating arcs, this boolean can be updated in constant time by setting it to true only after a Register-Stack(2, dir) transition with σ1 < R1; likewise R2 can only be covered with a Register-Stack(1, dir) transition with σ1 > R2.
3 last is used to indicate the rightmost partially processed unreduced vertex after the last pair of registers were cleared (set to the rightmost in γ, ψ after each Clear transition).
Lemma 1. In the two-registers system, all crossed arcs are added through register-stack operations.
Proof. Suppose for the sake of contradiction that a right arc (s, b) added when σ1 = s and β1 = b is crossed in the final output tree (the argument for left arcs is identical). Let (l, r) with l < r be an arc that crosses (s, b). One of {l, r} must be within the open interval (s, b) and the other must lie outside [s, b]. When the arc (s, b) is added, no tokens in the open interval (s, b) remain. They cannot be in the stack or buffer since the stack and buffer always remain in order; they cannot be in registers by the precondition R1 ∉ (σ1, β1) ∧ R2 ∉ (σ1, β1) for Right-Arc transitions. Thus, (l, r) must already have been added. It cannot be that l ∈ (s, b) and r > b, since the rest of the buffer has never been accessible to tokens left of b. The ordering must then be l < s < r < b. Figure 4 shows that for each way (l, r) could have been added (Right-Arc, 4a; Store(right), 4b; Register-Stack(k, to-stack), 4c; Register-Stack(k, to-register), 4d), it is impossible to keep s unreduced without violating one of the preconditions.

Parsing 2-Crossing Interval Trees with the Two-Registers Transition System
In this section we show the correspondence between the two-registers transition system and 2-Crossing Interval trees: each forest output by the transition system is a 2-Crossing Interval tree (soundness) and every 2-Crossing Interval tree can be produced by the two-registers system (completeness).

Soundness: Two-Registers System → 2-Crossing Interval trees
Proof. Every crossed arc is incident to a token that was in a register (Lemma 1). There cannot be any overlap between register arcs where the corresponding tokens were not in the registers simultaneously: the Clear transition updates the book-keeping constant last to be the rightmost vertex associated with the registers being cleared, and subsequent actions cannot introduce crossed arcs to the last token or to its left (by the β1 > last and σ1 > last preconditions on storing and register-stack arcs, respectively). Thus, each set of tokens that were in registers simultaneously defines a crossing interval. Condition (a) of Definition 3 is satisfied, since all crossed arcs are incident to registers and at most two vertices are in registers at the same time.
Figure 4: If a stack-buffer arc (s, b) is added in the two-registers system, there cannot have been an earlier arc (l, r) with l < s < r < b, since it would then be impossible to keep s unreduced without violating the preconditions.
Assume that a vertex h, h ∉ {R1, R2}, has a child m on the far side of its parent g (i.e., either h < g < m or m < g < h). The edge (h, m) is guaranteed to be crossed and so was added through a register-stack arc (Lemma 1). The ordering h < g < m is not possible, since if (g, h) had been added through a left-arc, then h would have been reduced, and if (g, h) and (h, m) were both added through register-stack arcs, then one of them would have violated the (R_close, σ1) ∉ A or the (σ1, R_far) ∉ A precondition. Similar reasoning can rule out m < g < h. Thus Condition (b) of Definition 3 is also satisfied.

Completeness: 2-Crossing Interval trees → Two-Registers System
Proof. The portions of a 2-Crossing Interval tree in between the crossing intervals can be constructed using the transitions from arc-eager. All arcs incident to neither a nor b must lie entirely within L, M, or R. The parser begins by adding all arcs with both endpoints in L, using the standard arc-eager Shift/Reduce/Left-Arc/Right-Arc. It then shifts until a is at the front of the buffer and stores a. It then repeats the same process to add the arcs lying entirely in M until b reaches the front of the buffer, adding the parent of a with a Register-Stack(1, to-register) transition if the parent is in M and the arc is uncrossed. b is then stored, adding the arc between a and b if necessary. Throughout this process, the precondition R1 ∉ (σ1, β1) ∧ R2 ∉ (σ1, β1) for left and right arcs is satisfied.
Next, the parser will repeatedly take Register-Stack transitions, interspersed with Reduce transitions, to add all the arcs with one endpoint in {a, b} and the other in L or M, working right-to-left from b (i.e., from the top of the stack downwards). No shifts are done at this stage, so the σ2 < R2 precondition on Register-Stack arcs is always satisfied. The σ1 > last precondition is also always satisfied since all vertices in the crossing interval will be to the right of the previous crossing interval boundary point. After all these arcs are done, if there are any uncrossed arcs incident to a to the left that go outside of the crossing interval, they are added now with a Register-Stack transition. Finally, the arcs with at least one endpoint in R are added, using Register-Stack arcs for those with the other endpoint in {a, b} and Left-Arc/Right-Arc for those with both endpoints in R. Before any vertex incident to a or b is shifted onto the stack, all tokens on the stack to the right of b are reduced.
After all these arcs are added, the crossing interval is complete. The boundary points of the interval that can still participate in uncrossed arcs with the exterior are left on the stack and buffer after the clear operation, so the rest of the tree is still parsable.

Worst-case Runtime
The two-registers system runs in O(n) time: it completes after at most O(n) transitions and each transition takes constant time.
The total number of arc-adding actions (Left-Arc, Right-Arc, Register-Stack, or a Store that includes an arc) is bounded by n, as there are at most n arcs in the final output. The net result of each {Store, Store, Clear} triple of transitions is to decrease the number of tokens on the buffer by at least one, so the number of these triples, plus the number of Shifts and Right-Arcs, is bounded by n. Finally, each token can be removed completely at most once, so the number of Left-Arcs and Reduces is bounded by n. Every transition falls into one of these categories, so the total number of transitions is bounded by 5n = O(n).
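One way to arrive at the constant 5 from the three bounds is spelled out below. The accounting is our reading of the argument rather than a derivation given in the text: the middle term is multiplied by three because each {Store, Store, Clear} triple contributes three transitions, and some transitions (Right-Arc, for example) are counted in more than one category, so the bound is loose.

```latex
\begin{align*}
\#\text{transitions}
  &\le \underbrace{n}_{\text{arc-adding actions}}
     + \underbrace{3n}_{\text{Store/Store/Clear triples, Shifts, Right-Arcs}}
     + \underbrace{n}_{\text{Left-Arcs and Reduces}} \\
  &= 5n = O(n).
\end{align*}
```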
Each operation can be performed in constant time, as all operations involve moving vertices and/or adding arcs, and at most three vertices are ever moved (Clear) and at most one arc is ever added. Most preconditions can be trivially checked in constant time, such as checking whether a vertex already has a parent or not. The non-trivial precondition to check is acyclicity, and this can also be checked by adding some book-keeping variables that can be updated in constant time (full proof omitted due to space constraints). For example, in the derivation in Table 3, prior to the Register-Stack(2, to-stack) transition, R1 →_A R2 (helped →_A paint). After the arc (R2, σ1) (paint, house) is added, R2 →_A σ1 and by transitivity, R1 →_A σ1. The top of the stack is then reduced, and since σ2 does not have a parent to its right, it is not a descendant of σ1, and so after Hans becomes the new σ1, the system makes the update that R1, R2 ↛_A σ1.

Experiments
The experiments compare the two-registers transition system for mildly non-projective trees proposed here with two other transition systems: the arc-eager system for projective trees (Nivre, 2003) and the swap-based system for all non-projective trees (Nivre, 2009). We choose the swap-based system as our non-projective baseline as it currently represents the state-of-the-art in transition-based parsing (Bohnet et al., 2013), with higher empirical performance than the Attardi system or pseudo-projective parsing (Kuhlmann and Nivre, 2010).
The arc-eager system is a reimplementation of Zhang and Nivre (2011), using their rich feature set and beam search. The features for the two other transition systems are based on the same set, but with slight modifications to account for the different relevant domains of locality. In particular, for the swap transition system, we updated the features to account for the fact that this transition system is based on the arc-standard model and so the most relevant positions are the top two tokens on the stack. For the two-registers system, we added features over properties of the tokens stored in each of the registers. All experiments use beam search with a beam of size 32 and are trained with ten iterations of averaged structured perceptron training. Training set trees that are outside of the reachable class (projective for arc-eager, 2-Crossing Interval for two-registers) are transformed by lifting arcs (Nivre and Nilsson, 2005) until the tree is within the class. The test sets are left unchanged. We use the standard technique of parameterizing arc-creating actions with dependency labels to produce labeled dependency trees.
Experiments use the ten datasets in Table 1 from the CoNLL 2006 and 2007 shared tasks (Buchholz and Marsi, 2006; Nivre et al., 2007). We report numbers using both gold and automatically predicted part-of-speech tags and morphological attribute-values as features. For the latter, the part-of-speech tagger is a first-order CRF model and the morphological tagger uses a greedy SVM per-attribute classifier. Evaluation uses CoNLL-X scoring conventions (Buchholz and Marsi, 2006) and we report both labeled and unlabeled attachment scores.

Results
Table 5 shows the results using gold tags as features, which is the most common set-up in the literature. The two-registers transition system has on average 0.8% absolute higher unlabeled attachment accuracy than arc-eager across the ten datasets investigated. Its UAS is higher than arc-eager for eight out of the ten languages and is up to 2.5% (Dutch) or 3.0% (Turkish) absolute higher, while never more than 0.4% worse (Portuguese). The two-registers transition system is also more accurate than the alternate non-projective swap system on seven out of the ten languages, with more than 1% absolute improvements in UAS for Basque, Dutch, and German. The two-registers transition system is still on average more accurate than either the arc-eager or swap systems using predicted tags as features (Table 6).
Table 7: UAS from Table 5 for tokens in which the incoming arc in the gold tree is crossed or uncrossed (recall of both crossed and uncrossed arcs).
Finally, we analyzed the performance of each of these parsers on both crossed and uncrossed arcs. Even on languages with many non-projective sentences, the majority of arcs are not crossed. Table 7 partitions all scoring tokens into those whose incoming arc in the gold tree is crossed and those whose incoming arc is not crossed, and presents the UAS scores from Table 5 for each of these groups. On the crossed arcs, the swap system does the best, followed by the two-registers system, with the arc-eager system about 20% absolute less accurate. On the uncrossed arcs, the arc-eager and two-registers systems are tied, with the swap system less accurate.
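The crossed/uncrossed breakdown amounts to partitioning tokens by a property of their gold incoming arc and computing UAS on each part. A small sketch of that computation follows, under simplifying assumptions: trees are modifier-to-head dictionaries for a single sentence, every token is treated as a scoring token (the CoNLL-X punctuation conventions are not modeled), and the function names are ours.

```python
def crosses(arc1, arc2):
    a, b = sorted(arc1)
    c, d = sorted(arc2)
    return a < c < b < d or c < a < d < b

def split_by_crossing(gold_heads):
    """Partition tokens by whether their incoming gold arc is crossed."""
    arcs = [(h, m) for m, h in gold_heads.items()]
    crossed = {m for (h, m) in arcs if any(crosses((h, m), other) for other in arcs)}
    return crossed, set(gold_heads) - crossed

def uas(gold_heads, pred_heads, tokens):
    """Fraction of the given tokens whose predicted head matches the gold head."""
    tokens = list(tokens)
    if not tokens:
        return float("nan")
    correct = sum(1 for t in tokens if pred_heads.get(t) == gold_heads[t])
    return correct / len(tokens)

gold = {1: 0, 2: 0, 3: 1, 4: 3}      # arcs (0, 2) and (1, 3) cross
pred = {1: 0, 2: 1, 3: 1, 4: 3}      # token 2 is attached to the wrong head
crossed, uncrossed = split_by_crossing(gold)
print(uas(gold, pred, crossed), uas(gold, pred, uncrossed))   # 0.5 1.0
```

Because the partition is defined on the gold tree, each figure is a recall over gold arcs of that type, which matches the description of Table 7.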

Discussion and Related Work
There has been a significant amount of recent work on non-projective dependency parsing. In the transition-based parsing paradigm, the pseudo-projective parser of Nivre and Nilsson (2005) was an early attempt and modeled the problem by transforming non-projective trees into projective trees via transformations encoded in arc labels. While improving parsing accuracies for many languages, this method was both approximate and inefficient as the increase in the cardinality of the label set affected run time. Attardi (2006) directly augmented the transition system to permit limited non-projectivity by allowing transitions between words not directly at the top of the stack or buffer. While this transition system had significant coverage, it is unclear how to precisely characterize the set of dependency trees that it covers. Nivre (2009) introduced a transition system that covered all non-projective trees via a new swap transition that locally re-ordered words in the sentence. The downside of the swap transition is that it made worst-case run time quadratic. Also, as shown in Table 7, the attachment scores of uncrossed arcs decrease compared with arc-eager.
Two other transition systems that can be seen as generalizations of arc-eager are the 2-Planar transition system (Gómez-Rodríguez and Nivre, 2013), which adds a second stack, and the transition system of Choi (Choi and McCallum, 2013), which adds a deque. The arc-eager, 2-registers, 2-planar, and Choi transition systems can be seen as lying along a continuum that trades off various properties. In terms of coverage, projective trees (arc-eager) ⊂ 2-Crossing Interval trees (this paper) ⊂ 2-planar trees ⊂ all directed trees (Choi). The Choi system uses a quadratic number of transitions in the worst case, while arc-eager, 2-registers, and 2-planar all use at most O(n) transitions. Checking for cycles does not need to be done at all in the arc-eager system, can be done with a constant number of operations in the 2-registers system, and can be done in amortized constant time for the other systems (Gómez-Rodríguez and Nivre, 2013).
In the graph-based parsing literature, there has also been a plethora of work on non-projective parsing (McDonald et al., 2005; Martins et al., 2009; Koo et al., 2010). Recent work by Pitler and colleagues is the most relevant to the work described here (Pitler et al., 2012, 2014). Like this work, Pitler et al. define a restricted class of non-projective trees and then a graph-based parsing algorithm that parses exactly that set. The register mechanism in two-registers transition parsing bears a resemblance to registers in Augmented Transition Networks (ATNs) (Woods, 1970). In ATNs, global registers are introduced to account for a wide range of natural language phenomena. These include long-distance dependencies, which are a common source of non-projective trees. While transition-based parsing and ATNs use quite different control and data structures, this observation does raise an interesting question about the relationship between these two parsing paradigms.
There are many additional points of interest to explore based on this study. A first step would be to generalize the two-registers transition system to a k-registers system that can parse exactly k-Crossing Interval trees. This will necessarily lead to an asymptotic increase in run-time as k approaches n. With larger values of k, the system would need additional transitions to add arcs between the registers (extending the Store transition to consider all subsets of arcs with the existing registers would become exponential in k). If k were to increase all the way to n, such a system would probably look very similar to list-based systems that consider all pairs of arcs (Covington, 2001;Nivre, 2008).
Another direction would be to define dynamic oracles around the two-registers transition system (Goldberg and Nivre, 2012; Goldberg and Nivre, 2013). The additional transitions here have interpretations in terms of which trees are still reachable (Register-Stack(·) adds an arc; Store and Clear indicate that particular vertices should be incident to crossed arcs or are finished with crossed arcs, respectively). The two-registers system is not quite arc-decomposable (Goldberg and Nivre, 2013): if the wrong vertex is stored in a register then a later pair of crossed arcs might both be individually reachable but not jointly reachable. However, there may be a "crossing-sensitive" variant of arc-decomposability that takes into account the vertices crossed arcs are incident to that would apply here.

Conclusion
In this paper we presented k-Crossing Interval trees, a class of mildly non-projective trees with high empirical coverage. For the case of k = 2, we also presented a transition system that is sound and complete with respect to this class, generalizes the arc-eager transition system, and maintains many of its desirable properties, most notably a linear worst-case run-time. Empirically, this transition system outperforms its projective counterpart as well as a quadratic swap-based transition system with larger coverage.