Please Mind the Root: Decoding Arborescences for Dependency Parsing

The connection between dependency trees and spanning trees is exploited by the NLP community to train and to decode graph-based dependency parsers. However, the NLP literature has missed an important difference between the two structures: only one edge may emanate from the root in a dependency tree. We analyzed the output of state-of-the-art parsers on many languages from the Universal Dependency Treebank: although these parsers are often able to learn that trees which violate the constraint should be assigned lower probabilities, their ability to do so unsurprisingly degrades as the size of the training set decreases. In fact, the worst constraint-violation rate we observe is 24%. Prior work has proposed an inefficient algorithm to enforce the constraint, which adds a factor of n to the decoding runtime. We adapt an algorithm due to Gabow and Tarjan (1984) to dependency parsing, which satisfies the constraint without compromising the original runtime.


Introduction
Developing probabilistic models of dependency trees requires efficient exploration over a set of possible dependency trees, which grows exponentially with the length of the input sentence n.
Under an edge-factored model (McDonald et al., 2005; Ma and Hovy, 2017; Dozat and Manning, 2017), finding the maximum-a-posteriori dependency tree is equivalent to finding the maximum-weight spanning tree in a weighted directed graph. More precisely, spanning trees in directed graphs are known as arborescences. The maximum-weight arborescence can be found in O(n^2) time (Tarjan, 1977; Camerini et al., 1979), although many parsers opt for the simpler CLE algorithm (Chu and Liu, 1965; Bock, 1971; Edmonds, 1967), which has a worst-case bound of O(n^3) but is often fast in practice.

However, an oversight in the relationship between dependency trees and arborescences has gone largely unnoticed in the dependency parsing literature. Most dependency annotation standards enforce a root constraint: exactly one edge may emanate from the root node (a notable exception is the Prague Dependency Treebank (Bejček et al., 2013), which allows for multi-rooted trees). For example, the Universal Dependency Treebank (UD; Nivre et al., 2018), a large-scale multilingual syntactic annotation effort, states in its documentation (UD Contributors): "There should be just one node with the root dependency relation in every tree."
This oversight implies that parsers may return malformed dependency trees. Indeed, we examined the output of a state-of-the-art parser (Qi et al., 2020) on 63 UD treebanks. We saw that decoding without a root constraint resulted in 1.80% (on average) of the decoded dependency trees being malformed. This increased to 6.21% on languages with fewer than one thousand training instances, with a worst case of 24% on Kurmanji.
Figure 1: A malformed dependency tree from our experiment, for the sentence "Someplace that is like $30 an entree"; the figure highlights the incorrect and the correct dependency relation for token 8.

The NLP literature has proposed two solutions to enforce the root constraint: (1) allow invalid dependency trees, hoping that the model learns to assign them low probabilities and therefore decodes singly rooted trees, or (2) return the best of n runs of the CLE algorithm, each with a different fixed edge emanating from the root (Dozat et al., 2017). The first solution is clearly problematic, as it may allow parsers to predict malformed dependency trees. This issue is further swept under the rug by "forgiving" evaluation metrics, such as attachment scores, which give partial credit for malformed output (we note that exact-match metrics, which consider the entire arborescence, do penalize root-constraint violations). The second solution, while correct, adds an unnecessary factor of n to the runtime of root-constrained decoding.
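To make solution (2) concrete, here is a minimal sketch (our own code, not from the paper) of the brute-force baseline. The graph is stored as a dict mapping (head, dependent) pairs to scores with node 0 as the root; `max_arborescence` stands in for any unconstrained decoder such as CLE, and all names are hypothetical. The sketch assumes every edge is present, as is typical when scores come from a neural parser.

```python
import math

ROOT = 0  # the artificial root node; tokens are 1..n


def best_single_root_tree(edges, n, max_arborescence):
    """Solution (2): run the unconstrained decoder n times, each with a fixed root edge.

    edges: dict mapping (head, dependent) -> score over nodes {ROOT, 1, ..., n}.
    max_arborescence: any unconstrained decoder that maps such a dict to the set
        of edges of a maximum-weight arborescence (e.g. the CLE algorithm).
    Returns the best tree found; this adds a factor of n to the decoder's runtime.
    """
    best_tree, best_score = None, -math.inf
    for r in range(1, n + 1):
        # keep the root edge ROOT -> r and drop every other root edge
        masked = {e: s for e, s in edges.items() if e[0] != ROOT or e[1] == r}
        tree = max_arborescence(masked)
        score = sum(masked[e] for e in tree)
        if score > best_score:
            best_tree, best_score = tree, score
    return best_tree
```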
In this paper, we identify a much more efficient solution than (2). We do so by unearthing an O(n^2) algorithm due to Gabow and Tarjan (1984) from the theoretical computer science literature. This algorithm appears to have gone unnoticed in the NLP literature (one exception is Corro et al. (2016), who mention Gabow and Tarjan (1984)'s algorithm in a footnote); we adapt it to correctly and efficiently handle the root constraint during decoding in edge-factored non-projective dependency parsing. Much as this paper shows for decoding, efficient root-constrained marginal inference is also possible without picking up an extra factor of n, but it requires some attention to detail (Koo et al., 2007; Zmigrod et al., 2020).

Approach
In this section, a marker indicates that a recently introduced concept is illustrated in the worked example in Fig. 2. Let $G = (\rho, V, E)$ be a rooted weighted directed graph where $V$ is a set of nodes, $E \subseteq \{(i \xrightarrow{w} j) \mid i, j \in V, w \in \mathbb{R}\}$ is a set of weighted edges, and $\rho \in V$ is a designated root node with no incoming edges. (When there is no ambiguity, we may abuse notation and use $G$ to refer to either its node set or its edge set, e.g., we may write $(i \to j) \in G$ to mean $(i \to j) \in E$, and $i \in G$ to mean $i \in V$.) In terms of dependency parsing, each non-$\rho$ node corresponds to a token in the sentence, and $\rho$ represents the special root token that is not a token in the sentence. Edges represent possible dependency relations between tokens, and the edge weights are scores from a model (e.g., linear (McDonald et al., 2005) or neural network (Dozat et al., 2017)). Fig. 1 shows an example. We allow $G$ to be a multigraph, i.e., we allow multiple edges between pairs of nodes. Multigraphs are a natural encoding of labeled dependency relations, where the possible labels between two words are captured by multiple edges between the corresponding nodes. Multigraphs pose no difficulty, as only the highest-weight edge between two nodes may be selected in the returned tree.
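As a purely illustrative representation (our own choice, not the paper's), the sketch below stores the collapsed graph as a dict mapping (head, dependent) pairs to weights, keeping only the best label per pair; the names `labeled_scores` and `ROOT` are our own conventions.

```python
ROOT = 0  # rho, the designated root node; tokens are 1..n


def collapse_multigraph(labeled_scores):
    """Keep only the highest-weight edge between every ordered pair of nodes.

    labeled_scores: dict mapping (head, dependent, label) -> weight, i.e. the
        multigraph of labeled dependency relations.
    Returns a dict mapping (head, dependent) -> weight.  Self-loops and edges
    into ROOT are dropped, since they can never appear in an arborescence.
    """
    edges = {}
    for (i, j, _label), w in labeled_scores.items():
        if j == ROOT or i == j:
            continue
        if (i, j) not in edges or w > edges[(i, j)]:
            edges[(i, j)] = w
    return edges
```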
An arborescence of $G$ is a subgraph $A$ of $G$ that satisfies two constraints: (C1) every non-root node has exactly one incoming edge in $A$, and (C2) $A$ contains no cycles. A dependency tree of $G$ is an arborescence that additionally satisfies (C3) $|\{(\rho \xrightarrow{w} j) \in A\}| = 1$. In words, (C3) says $A$ contains exactly one out-edge from $\rho$. Let $\mathcal{A}(G)$ and $\mathcal{A}^\dagger(G)$ denote the sets of arborescences and dependency trees, respectively. The weight of a graph or subgraph $A$ is defined as $w(A) = \sum_{(i \xrightarrow{w} j) \in A} w$. We write $G^*$ for the best (highest-weight) arborescence of $G$ and $G^\dagger$ for its best dependency tree. In §2.1, we describe an efficient algorithm for finding $G^*$ and, in §2.2, an efficient algorithm for finding $G^\dagger$.
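To make (C1)-(C3) concrete, here is a small hedged sketch (our own code, continuing the dict representation above) that checks whether a set of edges is a well-formed dependency tree and computes its weight.

```python
ROOT = 0


def weight(tree, edges):
    """w(A): the sum of the weights of A's edges."""
    return sum(edges[e] for e in tree)


def is_dependency_tree(tree, tokens):
    """Check (C1)-(C3) for a set of edges `tree` over the given tokens."""
    head = {}
    for (i, j) in tree:
        if j in head:                        # (C1): at most one incoming edge per node
            return False
        head[j] = i
    if set(head) != set(tokens):             # (C1): every token has exactly one head
        return False
    if sum(1 for i in head.values() if i == ROOT) != 1:
        return False                         # (C3): exactly one edge emanates from rho
    for j in tokens:                         # (C2): no cycles, every token reaches rho
        seen, i = set(), j
        while i != ROOT:
            if i in seen:
                return False
            seen.add(i)
            i = head[i]
    return True
```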

Finding the best arborescence
A first stab at finding $G^*$ would be to select the best (non-self-loop) incoming edge for each node. Although this satisfies (C1), it does not (necessarily) satisfy (C2). We call this subgraph the greedy graph of $G$ and denote it $\vec{G}$. If $\vec{G}$ happens to be acyclic, it is clearly equal to $G^*$. What are we to do in the event of a cycle? That answer has two parts.
Part 1: We call any cycle of the greedy graph $\vec{G}$ a critical cycle. Naturally, (C2) implies that critical cycles can never be part of an arborescence. However, they help us identify optimal arborescences for certain subproblems. Specifically, if we were to "break" a critical cycle $C$ at any node $j \in C$ by removing its (unique) incoming edge, we would have an optimal arborescence rooted at $j$ for the subgraph over the nodes in $C$. Let $C^{(j)}$ be the subgraph of $C$ rooted at $j$ that denotes the broken cycle at $j$, and let $G^{(j)}_C$ be the subgraph rooted at $j$, where $G_C$ contains all the nodes in $C$ and all edges between them from $G$. The key to finding the best arborescence of the entire graph is, thus, determining where to break critical cycles.

Figure 2: Worked example. Step (a) shows the contraction $G_{/C}$, in which $C$ is replaced by a new node $c$ and edges are cast as enter, exit, external, or dead edges of $G_{/C}$, with the bookkeeping function $\pi$ recording their original counterparts. Step (b) takes the greedy (sub)graph of $G_{/C}$; since it contains no cycles, it is $(G_{/C})^*$ (if we did not require a dependency tree, we could now use Theorem 1 to break $C$). Step (c) takes $(G_{/C})^*$, which has two root edges, and greedily deletes the root edge whose removal leaves the higher greedy-graph weight; as this deletion does not lead to a critical cycle (optimization case), the edge is removed from the graph and we obtain $(G_{/C})^\dagger$.
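The following sketch (our own code, same dict representation as above) computes the greedy graph and searches it for a cycle; any cycle found in the greedy graph is, by definition, a critical cycle.

```python
ROOT = 0


def greedy_graph(edges):
    """Pick the highest-weight incoming edge for every non-root node ((C1) holds, (C2) may not)."""
    best_in = {}
    for (i, j), w in edges.items():
        if j not in best_in or w > edges[best_in[j]]:
            best_in[j] = (i, j)
    return set(best_in.values())


def find_cycle(tree):
    """Return the set of nodes lying on some cycle of `tree` (a set of edges), or None."""
    head = {j: i for (i, j) in tree}
    for start in head:
        seen, i = set(), start
        while i in head and i not in seen:   # stop at the root (no head) or on a repeat
            seen.add(i)
            i = head[i]
        if i in head:                        # stopped on a repeat: i lies on a cycle
            cycle, k = {i}, head[i]
            while k != i:
                cycle.add(k)
                k = head[k]
            return cycle
    return None
```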
Part 2: Breaking cycles is done with a recursive algorithm that solves the "outer problem" of fitting the (unbroken) cycle into an optimal arborescence. The algorithm treats the cycle as a single contracted node. Formally, a cycle contraction takes a graph $G$ and a (not necessarily critical) cycle $C$, and creates a new graph, denoted $G_{/C}$, with the same root, the node set $(V \setminus C) \cup \{c\}$, where $c \notin V$ is a new node that represents the cycle, and the following set of edges: for any enter edge $(i \xrightarrow{w} j)$ with $i \notin C$ and $j \in C$, $G_{/C}$ contains $(i \xrightarrow{w'} c)$ with $w' = w + w(C^{(j)})$; for any exit edge $(i \xrightarrow{w} j)$ with $i \in C$ and $j \notin C$, $G_{/C}$ contains $(c \xrightarrow{w} j)$; any external edge $(i \xrightarrow{w} j)$ with $i, j \notin C$ is kept unchanged; and dead edges, with both endpoints in $C$, are dropped. This is because such an edge $(c \to c)$ would be a self-cycle, which can never be part of an arborescence. Akin to dynamic programming, this choice of enter-edge weight (due to Georgiadis (2003)) gives the best "cost-to-go" for breaking the cycle at $j$.
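Continuing the sketch (again our own code), the contraction below builds $G_{/C}$ together with the bookkeeping map (the function $\pi$ described just below); `cycle_head` maps each cycle node to its parent inside the cycle, so the enter-edge weight $w + w(C^{(j)})$ can be computed directly.

```python
def contract(edges, cycle_head, c):
    """Contract the cycle described by `cycle_head` into the single new node `c`.

    edges: dict (head, dependent) -> weight.
    cycle_head: dict mapping each cycle node j to its parent inside the cycle,
        so the cycle's edges are {(cycle_head[j], j) for j in cycle_head}.
    Returns (new_edges, pi): the contracted graph and the bookkeeping map from
    its edges back to the original edges they stand for.
    """
    in_cycle = set(cycle_head)
    w_cycle = sum(edges[(cycle_head[j], j)] for j in cycle_head)
    new_edges, pi = {}, {}
    for (i, j), w in edges.items():
        if i in in_cycle and j in in_cycle:
            continue                                           # dead edge: dropped
        if j in in_cycle:                                      # enter edge: i -> c
            w_new = w + (w_cycle - edges[(cycle_head[j], j)])  # w + w(C^(j))
            key = (i, c)
        elif i in in_cycle:                                    # exit edge: c -> j
            w_new, key = w, (c, j)
        else:                                                  # external edge: unchanged
            w_new, key = w, (i, j)
        if key not in new_edges or w_new > new_edges[key]:
            new_edges[key] = w_new
            pi[key] = (i, j)
    return new_edges, pi
```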
Additionally, we define a bookkeeping function, $\pi$, which maps the nodes and edges of $G_{/C}$ to their counterparts in $G$. We overload $\pi$ to apply point-wise to the constituent nodes and edges of a subgraph.
By (C1), for any $A_C \in \mathcal{A}(G_{/C})$ there exists exactly one incoming edge $(i \to c)$ to the cycle node $c$. We can use $\pi$ to infer where the cycle was broken: if $\pi(i \to c) = (i \to j)$, we call $j$ the entrance site of $A_C$. Consequently, we can stitch together an arborescence of $G$ as $\pi(A_C) \cup C^{(j)}$. We write $A_C \oplus C^{(j)}$ as shorthand for this cycle-unraveling operation.
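A matching sketch of the stitching operation $A_C \oplus C^{(j)}$ (our own code, paired with `contract` above): map the contracted tree's edges back through $\pi$ and add the cycle broken at the entrance site.

```python
def expand(tree, pi, cycle_head, c):
    """Undo a contraction: map an arborescence of G_/C back to one of G.

    tree: set of edges of an arborescence of the contracted graph.
    Returns pi(tree) together with the cycle broken at the entrance site, i.e.
    all cycle edges except the one pointing at the entrance site.
    """
    original = {pi[e] for e in tree}
    # the unique edge entering c tells us where the cycle is broken
    _, entrance = next(pi[(i, j)] for (i, j) in tree if j == c)
    for j, i in cycle_head.items():
        if j != entrance:
            original.add((i, j))             # the edges of C^(j) for entrance site j
    return original
```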
$G_{/C}$ may itself have a critical cycle, so we have to apply this reasoning recursively. This is captured by Theorem 1, due to Karp (1971).

Theorem 1. For any graph $G$, either $G^* = \vec{G}$, or $G$ contains a critical cycle $C$ and $G^* = (G_{/C})^* \oplus C^{(j)}$, where $j$ is the entrance site of $(G_{/C})^*$. Furthermore, $w((G_{/C})^*) = w(G^*)$.
Theorem 1 suggests a recursive strategy for finding $G^*$, which is the basis of many efficient algorithms (Tarjan, 1977; Camerini et al., 1979; Georgiadis, 2003; Chu and Liu, 1965; Bock, 1971; Edmonds, 1967). We detail one such algorithm in Alg. 1. Alg. 1 can be made to run in O(n^2) time for dense graphs with the appropriate implementation choices, such as Union-Find (Hopcroft and Ullman, 1973) to maintain the membership of nodes in contracted nodes, and radix sort (Knuth, 1973) to sort the incoming edges of contracted nodes; using a comparison-based sort would add a factor of log n to the runtime.
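Putting the pieces together, the recursion of Theorem 1 can be sketched as below. This reuses the `greedy_graph`, `find_cycle`, `contract`, and `expand` helpers sketched above, assumes every edge is present, and, without the Union-Find and radix-sort bookkeeping of Alg. 1, runs in O(n^3) rather than O(n^2); it is a hedged illustration of the strategy, not the paper's Alg. 1.

```python
ROOT = 0


def max_arborescence(edges):
    """Recursive maximum-arborescence decoder in the style of Theorem 1.

    edges: dict (head, dependent) -> weight over {ROOT, 1, ..., n}.
    Returns the edge set of a maximum-weight arborescence rooted at ROOT.
    """
    tree = greedy_graph(edges)
    cycle = find_cycle(tree)
    if cycle is None:                        # the greedy graph is acyclic, so it is G*
        return tree
    cycle_head = {j: i for (i, j) in tree if j in cycle}
    c = ("contracted",) + tuple(sorted(map(str, cycle)))  # a fresh node id for the cycle
    contracted, pi = contract(edges, cycle_head, c)
    sub_tree = max_arborescence(contracted)               # (G_/C)*
    return expand(sub_tree, pi, cycle_head, c)            # unravel: (G_/C)* with C^(j)
```

Combined with `best_single_root_tree` from the introduction, this already yields a (slow) root-constrained decoder; the point of constrain in the next subsection is to avoid that extra factor of n.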

Finding the best dependency tree
Gabow and Tarjan (1984) propose an algorithm that does additional recursion at the base case of opt(G) (the additional if-statement at Line 5 of Alg. 1) in order to recover $G^\dagger$ instead of $G^*$. Suppose that the set of edges emanating from the root in $\vec{G}$ is given by $\sigma$ and that $|\sigma| > 1$. We consider removing each edge $(\rho \to j) \in \sigma$ from $G$. Since $G$ may have multiple edges from $\rho$ to $j$, we write $G \backslash\backslash e$ to mean deleting all edges with the same endpoints as $e$. Let $G' = G \backslash\backslash e^*$, where $e^* \in \sigma$ is chosen greedily to maximize $w(\overrightarrow{G'})$. Consider the two possible cases.

Optimization case. If $\overrightarrow{G'}$ has no critical cycles, then $\overrightarrow{G'}$ must be the best arborescence with one fewer edge emanating from the root than $\vec{G}$, by our greedy choice of $e^*$.
Reduction case. If $\overrightarrow{G'}$ has a critical cycle $C$, then all edges in $C$ that do not point to $j^*$ (the dependent of $e^*$) are in $\vec{G}$. If $e^* \notin G^\dagger$, then $C$ is a critical cycle in the context of the constrained problem, and so we can apply Theorem 1 to recover $G^\dagger$. Otherwise, $e^* \in G^\dagger$ and we can break $C$ at $j^*$ to get $C^{(j^*)}$, which is comprised of edges in $\vec{G}$. Therefore, we can find $(G_{/C})^\dagger$ to retrieve $G^\dagger$. This notion is formalized in the following theorem; for completeness, App. B provides a proof of Theorem 2.

Theorem 2. For any graph $G$ with $G^* = \vec{G}$, let $\sigma$ be the set of outgoing edges from $\rho$ in $G^*$. If $|\sigma| = 1$, then $G^\dagger = G^*$. Otherwise, let $G' = G \backslash\backslash e^*$ for the $e^* \in \sigma$ that maximizes $w(\overrightarrow{G'})$; then either $G^\dagger = G'^\dagger$ or there exists a critical cycle $C$ in $\overrightarrow{G'}$ such that $G^\dagger = (G_{/C})^\dagger \oplus C^{(j)}$, where $j$ is the entrance site of $(G_{/C})^\dagger$.
Theorem 2 suggests a recursive strategy, constrain (Alg. 1), for finding $G^\dagger$ given $G^*$. Gabow and Tarjan (1984, Theorem 7.1) prove that such a strategy executes in O(n^2), and so, when combined with opt(G) (Alg. 1), it leads to an O(n^2) runtime for finding $G^\dagger$ for a graph $G$. The efficiency of the algorithm rests on the fact that only O(n) calls to constrain can reach the reduction case; moreover, each optimization case permanently removes a root edge, so there can be at most n of those as well. Each recursive call does a linear amount of work to search for the edge to remove and to stitch together the results of the recursion. Rather than computing the greedy graph from scratch, implementations should exploit the fact that each edge removal changes only one element of the greedy graph.
Thus, we can find $w(\overrightarrow{G \backslash\backslash e})$ in constant time.
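The following sketch (our own code, with hypothetical names) illustrates that bookkeeping: after one pass that records each node's best non-root incoming weight, every candidate root-edge deletion is scored in constant time, since only the deleted edge's dependent changes its entry in the greedy graph.

```python
import math

ROOT = 0


def choose_root_edge_to_delete(edges, greedy_tree):
    """Pick e* in sigma maximizing the greedy-graph weight after deleting e.

    greedy_tree: the edge set of the greedy graph of `edges`.
    Returns (e*, weight of the greedy graph after deleting e*).
    """
    w_greedy = sum(edges[e] for e in greedy_tree)
    # one pass: each node's best incoming weight when edges from ROOT are ignored
    runner_up = {}
    for (i, j), w in edges.items():
        if i != ROOT and w > runner_up.get(j, -math.inf):
            runner_up[j] = w
    best_e, best_w = None, -math.inf
    for (i, j) in greedy_tree:
        if i != ROOT:
            continue                          # sigma: the root edges of the greedy graph
        # only node j changes: it loses rho -> j and falls back to its runner-up
        cand = w_greedy - edges[(ROOT, j)] + runner_up.get(j, -math.inf)
        if cand > best_w:
            best_e, best_w = (ROOT, j), cand
    return best_e, best_w
```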

Experiment
How often do state-of-the-art parsers generate malformed dependency trees? We examined 63 Universal Dependency treebanks (Nivre et al., 2018) and computed the rate of malformed trees when decoding using edge weights generated by the pre-trained models supplied by Qi et al. (2020). On average, we observed that 1.80% of trees are malformed. We were surprised to see that, although the edge-factored model used is not expressive enough to capture the root constraint exactly, there are useful correlates of the root constraint in the surface form of the sentence, which the model appears to use to work around this limitation. This becomes further evident when we examine the relative change in UAS (0.0083%) and exact match score (0.60%) when using the constrained algorithm as opposed to the unconstrained algorithm. Nevertheless, given less data, it is harder to learn to exploit the surface correlates; thus, we see an increased average rate of violation, 6.21%, when examining languages with training sets of fewer than 1,000 sentences. Similarly, the relative changes in UAS and exact match score increase to 0.0368% and 2.91%, respectively. Indeed, the worst violation rate, 24%, was seen for Kurmanji, which contains only 20 sentences in its training set. Kurmanji consequently had the largest relative changes to both UAS and exact match score: 0.41% and 22.22%. We break down the malformed rate and accuracy changes by training-set size in Tab. 1. Furthermore, the correlation between training size and malformed tree rate can be seen in Fig. 3, and the correlation between training size and relative accuracy change can be seen in Fig. 4. We provide a full table of the results in App. C.

Table 1: Average malformed rate, relative UAS change, and relative exact match score change for different data settings. The 63 languages are split by their training set size |train| into high (|train| >= 10,000), medium (1,000 <= |train| < 10,000), and low (|train| < 1,000).
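As a small, hedged illustration of how a malformed-tree rate can be computed from parser output (the function and variable names are ours, and we assume the CoNLL-U convention that a head index of 0 marks attachment to the root):

```python
def malformed_rate(predicted_heads):
    """Fraction of predicted trees that violate the root constraint.

    predicted_heads: one head list per sentence, where heads[k] == 0 means the
        k-th token is attached to the root.
    """
    bad = sum(1 for heads in predicted_heads
              if sum(1 for h in heads if h == 0) != 1)
    return bad / len(predicted_heads)


# e.g. two sentences, the second with two root attachments (malformed):
assert malformed_rate([[0, 1, 1], [0, 0, 2]]) == 0.5
```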

Conclusion
In this paper, we have bridged a gap between the graph-theory and dependency-parsing literatures. We presented an efficient O(n^2) algorithm for finding the maximum arborescence of a graph. Furthermore, we highlighted an important distinction between dependency trees and arborescences, namely that dependency trees are arborescences subject to a root constraint. Previous work uses inefficient algorithms to enforce this constraint; we provide a solution which runs in O(n^2). Our hope is that this paper will remind future research in dependency parsing to please mind the root.

A Proof of Theorem 1

To prove Theorem 1, we note a correspondence between graphs and contracted graphs.
Proposition 1. Given a rooted graph $G$ and a (not necessarily critical) cycle $C$ in $G$, for any $A \in \mathcal{A}(G)$ that has a single edge $e = (i \xrightarrow{w} j) \in A$ such that $i \notin C$ and $j \in C$, there exists $A_C \in \mathcal{A}(G_{/C})$ and $A' \in \mathcal{A}(G^{(j)}_C)$ such that $A = A_C \oplus A'$ and $w(A) = w(A_C) - w(C^{(j)}) + w(A')$.

Proof. Since $e$ is the only edge in $A$ from a non-cycle node to a cycle node (an enter edge), the set of edges $e' \in G_{/C}$ such that $\pi(e') \in A$ forms an arborescence $A_C \in \mathcal{A}(G_{/C})$. Note that the edges in $A$ for which there is no corresponding edge in $G_{/C}$ are dead edges. In fact, as $A$ satisfies (C1), these edges form an arborescence $A' \in \mathcal{A}(G^{(j)}_C)$. Therefore, $A = A_C \oplus A'$. Furthermore, consider the weight of $A$:
$w(A) = w(A') + w + \sum_{e''} w_{e''} = w(A') - w(C^{(j)}) + \big(w + w(C^{(j)})\big) + \sum_{e''} w_{e''} = w(A') - w(C^{(j)}) + w(A_C)$,
where the sums range over the exit and external edges of $A$. The first equality holds because $e$ is the only edge in $A$ from a non-cycle node to a cycle node, and the last follows from the construction of the enter, exit, and external edges of $G_{/C}$.
As a corollary, we also have that every arborescence in the contracted graph $G_{/C}$ can be expanded into an arborescence in $G$.
Corollary 1 (Expansion lemma). Given a rooted graph $G$ with a cycle $C$, every arborescence $A_C \in \mathcal{A}(G_{/C})$ is related to an arborescence $A \in \mathcal{A}(G)$ by $A = A_C \oplus C^{(j)}$, where $j$ is the entrance site of $A_C$. Furthermore, $w(A) = w(A_C)$.
Proof. Let $j$ be the entrance site of $A_C$ into $C$. As $A_C \in \mathcal{A}(G_{/C})$ and $C^{(j)} \in \mathcal{A}(G^{(j)}_C)$, Proposition 1 constructs $A \in \mathcal{A}(G)$ as desired. Furthermore, $w(A) = w(A_C) - w(C^{(j)}) + w(C^{(j)}) = w(A_C)$.
Note that Proposition 1 does not account for all arborescences in $\mathcal{A}(G)$. We next show that the arborescences which cannot be constructed using Proposition 1 can never be $G^*$.
Lemma 1. Given a rooted graph $G$ with a critical cycle $C$, for all $j \in C$ we have $C^{(j)} = (G^{(j)}_C)^*$.

Proof. Since $G^{(j)}_C$ is a subgraph of $G$, it must be that $\overrightarrow{G^{(j)}_C}$ is also a subgraph of $\vec{G}$. Since $C$ is a critical cycle, $C^{(j)}$ does not have cycles and equals $\overrightarrow{G^{(j)}_C}$. Therefore, $C^{(j)} = (G^{(j)}_C)^*$.
Lemma 2. Given a rooted graph $G$ with a critical cycle $C$ and $A \in \mathcal{A}(G)$, if $e = (i \to j) \in A$ and $e' = (i' \to j') \in A$ are such that $i, i' \notin C$ and $j, j' \in C$, then there exists an $A' \in \mathcal{A}(G)$ with $e \in A'$ and $e' \notin A'$ such that $w(A) \leq w(A')$.