Dependency Parsing with Bounded Block Degree and Well-nestedness via Lagrangian Relaxation and Branch-and-Bound

We present a novel dependency parsing method which enforces two structural properties on dependency trees: bounded block degree and well-nestedness. These properties are useful to better represent the set of admissible dependency structures in treebanks and connect dependency parsing to context-sensitive grammatical formalisms. We cast this problem as an Integer Linear Program that we solve with Lagrangian Relaxation, from which we derive a heuristic and an exact method based on a Branch-and-Bound search. Experimentally, we see that these methods are efficient and competitive compared to a baseline unconstrained parser, while enforcing structural properties in all cases.


Introduction
We address the problem of enforcing two structural properties on dependency trees, namely bounded block degree and well-nestedness, without sacrificing algorithmic efficiency. Intuitively, bounded block degree constraints force each subtree to have a yield decomposable into a limited number of blocks of contiguous words, while well-nestedness asserts that every two distinct subtrees must not interleave: either the yield of one subtree is entirely inside some gap of the other or they are completely separated. These two types of constraints generalize the notion of projectivity: projective trees actually have a block degree bounded to one and are well-nested.
Our first motivation is the fact that most dependency trees in NLP treebanks are well-nested and have a low block degree which depends on the language and the linguistic representation, as shown in (Pitler et al., 2012). Unfortunately, although polynomial algorithms exist for this class of trees (Gómez-Rodríguez et al., 2009), they are not efficient enough to be of practical use in applications requiring syntactic structures. In addition, if either property is dropped, but not the other, then the underlying decision problem becomes harder. That is why practical parsing algorithms are either completely unconstrained (McDonald et al., 2005) or enforce strict projectivity. This work is, to the best of our knowledge, the first attempt to build a discriminative dependency parser that enforces well-nestedness and/or bounded block degree and to use it on treebank data.
We base our method on the following observation: a non-projective dependency parser, which requires neither well-nestedness nor bounded block degree, returns dependency trees satisfying these constraints for the vast majority of sentences. This suggests that the heavy machinery involved in parsing with these constraints is only needed in very few cases.
We consider arc-factored dependency parsing with well-nestedness and bounded block degree constraints. We formulate this problem as an Integer Linear Program (ILP) and apply Lagrangian Relaxation where the dualized constraints are those associated with bounded block degree and well-nestedness. The Lagrangian dual objective then reduces to a maximum spanning arborescence problem and can be solved very efficiently. This provides an efficient heuristic for our problem. An exact method can be derived by embedding this Lagrangian Relaxation in a Branch-and-Bound procedure to solve the problem with an optimality certificate. Despite the exponential worst-case time complexity of the Branch-and-Bound procedure, it is tractable in practice. Our formulation can enforce both types of constraints or only one of them without changing the resolution method.
As stated in (Bodirsky et al., 2009), well-nested dependency trees with 2-bounded block degree are structurally equivalent to derivations in Lexicalized Tree Adjoining Grammars (LTAGs) (Joshi and Schabes, 1997).¹,² While LTAGs can be parsed in polynomial time, developing an efficient parser for these grammars remains an open problem (Eisner and Satta, 2000) and we believe that this work could be a useful step in that direction.
Related work is reviewed in Section 2. We define arc-factored dependency parsing with block degree and well-nestedness constraints in Section 3. We derive an ILP formulation of this problem in Section 4 and then present our method based on Lagrangian Relaxation in Section 5 and Branch-and-Bound in Section 6. Section 7 contains experimental results on several languages.

Related Work
A dynamic programming algorithm has been proposed for parsing well-nested and $k$-bounded block degree dependency trees in (Gómez-Rodríguez et al., 2009; Gómez-Rodríguez et al., 2011). Unfortunately, it has a prohibitive $O(n^{3+2k})$ time complexity, equivalent to Lexicalized TAG parsing when $k = 2$. Variants of this algorithm have also been proposed for further restricted classes of dependency trees: 1-inherit ($O(n^6)$) (Pitler et al., 2012), head-split ($O(n^6)$) (Satta and Kuhlmann, 2014) and both 1-inherit and head-split ($O(n^5)$) (Satta and Kuhlmann, 2014). Although those restricted classes have good empirical coverage, they do not cover the exact search space of Lexicalized TAG derivations and their time complexity is still prohibitive. Spinal TAGs, described as a dependency parsing task in (Carreras et al., 2008), reduce empirical coverage even further, since they are restricted to projective trees, yet they remain hardly tractable with a complexity of $O(n^4)$. In contrast, the present work does not restrict the search space.
Parsing mildly context-sensitive languages with dependencies has been explored in (Fernández-González and Martins, 2015) but the resulting parser cannot guarantee compliance with strict structural properties. On the other hand, the present method enforces the well-nestedness and bounded block degree of solutions.
¹ It is possible to express a wider class of dependencies with LTAG if we allow the direction of dependencies to differ from the derivation tree (Kallmeyer and Kuhlmann, 2012).
² In order to be fully compatible with LTAGs, we must ensure that the root has only one child. For algorithmic issues see (Fischetti and Toth, 1992) or (Gabow and Tarjan, 1984).
The methods mentioned above all use the graph-based approach and rely on dynamic programming to achieve tractability. There is also a line of work in transition-based parsing for various dependency classes. Systems have been proposed for projective dependency trees (Nivre, 2003), non-projective, or even unknown classes (Attardi, 2006). Pitler and McDonald (2015) propose a transition system for crossing interval trees, a more general class than well-nested trees with bounded block degree. In the case of spinal TAGs, we can mention the work of Ballesteros and Carreras (2015) and Shen and Joshi (2007). Transition-based algorithms offer low space and time complexity, typically linear in the length of sentences. They usually achieve this by relying on local predictors and beam strategies, and thus do not provide any optimality guarantee on the produced structures. The present work follows the graph-based approach, but replaces dynamic programming with a greedy algorithm and Lagrangian Relaxation.
The use of Lagrangian Relaxation to elaborate sophisticated parsing models based on plain maximum spanning arborescence solutions originated in  where this method was used to parse with higher-order features. This technique has been explored to parse CCG dependencies in (Du et al., 2015) without a precise definition of the class of trees. We can also draw connections between our problem reduction procedure and the use of Lagrangian Relaxation to speed up dynamic programming and beam search with exact pruning in (Rush et al., 2013).
In this work we rely on Non-Delayed Relax-and-Cut for lazy constraint generation (Lucena, 2006). This can be linked to (Riedel, 2009), which uses a cutting plane algorithm to solve MAP inference in Markov Logic, and  which uses column and row generation for higher-order dependency parsing.
In NLP, the Branch-and-Bound framework (Land and Doig, 1960) has previously been used for dependency parsing with higher-order features in (Qian and Liu, 2013), and Das et al. (2012) combined Branch-and-Bound with Lagrangian Relaxation in order to retrieve integral solutions for shallow semantic parsing.

Dependency Parsing
We model the dependency parsing problem using a graph-based approach. Given a sentence $s = s_0, \ldots, s_n$ where $s_0$ is a dummy root symbol, we consider the directed graph $D = (V, A)$ with $V = \{0, \ldots, n\}$ and $A \subseteq V \times V$. Vertex $i \in V$ corresponds to word $s_i$ and arc $(i, j) \in A$ models a dependency from word $s_i$ to word $s_j$. In the rest of the paper, we denote $V \setminus \{0\}$ by $V^+$.
An arborescence is a set of arcs $T$ inducing a connected graph with no circuit such that every vertex has at most one entering arc. The set of vertices incident with any arc of $T$ is denoted by $V[T]$. If $V[T] = V$, then $T$ is a spanning arborescence. Among the vertices of $V[T]$, the one with no entering arc is called the root of $T$. A vertex $t$ is reachable from a vertex $s$ with respect to $T$ if there exists a path from $s$ to $t$ using only arcs of $T$. The yield of a vertex $v \in V$ corresponds to the set of vertices reachable from $v$ with respect to $T$.
It is well-known that there is a bijection between dependency trees for s and spanning arborescences with root 0 (McDonald et al., 2005). In what follows, the term dependency tree will refer to both the dependency tree of s and its associated spanning arborescence of D with root 0.
In the dependency parsing problem, one has to find a dependency tree with maximal score. Several scores can be associated with each dependency tree and different conditions can restrict the set of valid dependency trees.
In this paper, we consider an arc-factored model: each arc $(i, j) \in A$ is assigned a score $w_{ij}$; the score of a dependency tree is defined as the sum of the scores of the arcs it contains. Under this model, the best dependency tree can be computed in $O(n^2)$ with Chu-Liu-Edmonds' algorithm for Maximum Spanning Arborescence (MSA) (McDonald et al., 2005). Unfortunately, this algorithm forbids any modification of the score function, for example adding score contributions for combinations of arcs (e.g. grandparent or sibling models). Moreover, adding score contributions for pairs of arcs makes the problem NP-hard (McDonald and Pereira, 2006), although several methods have been developed to tackle this problem, for instance dual decomposition.
Likewise, restrictions on the tree structure such as the well-nestedness and bounded block degree conditions are not permitted in the MSA algorithm. We first give a precise definition of these structural properties, equivalent to (Bodirsky et al., 2009), before we present a method to take them into account. From now on, we suppose that instances are equipped with a positive integer k and we call valid dependency trees those satisfying the k-bounded block degree and well-nestedness conditions. A graph-theoretic definition of these two conditions can be given as follows.
Block degree The block degree of a vertex set $W \subseteq V$ is the number of vertices of $W$ without a predecessor inside $W$, where the predecessor of vertex $i$ is taken to be $i - 1$. Given an arborescence $T$, the block degree of a vertex is the block degree of its yield and the block degree of $T$ is the maximum block degree of its incident vertices. An arborescence satisfies the $k$-bounded block degree condition if its block degree is less than or equal to $k$. We then say it is $k$-BBD for short. Figure 1 (left) gives an example of a 2-BBD arborescence.
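The definition above translates directly into code. The following Python sketch is an illustration only: the head-array encoding (`heads[j]` is the head of word `j`, with `heads[0]` set to `None`) and the function names are assumptions made here, and the yield of a vertex is taken to include the vertex itself.

```python
def block_degree(vertices):
    """Number of vertices of the set whose predecessor (v - 1) is not in the set,
    i.e. the number of maximal blocks of consecutive positions."""
    vs = set(vertices)
    return sum(1 for v in vs if v - 1 not in vs)


def yields(heads):
    """Return a list whose j-th entry is the yield of vertex j.

    `heads` must encode a valid dependency tree rooted at 0 (no cycles),
    otherwise the walk towards the root below would not terminate.
    """
    ys = [{j} for j in range(len(heads))]
    for j in range(1, len(heads)):
        h = heads[j]
        while h is not None:      # add j to the yield of each of its ancestors
            ys[h].add(j)
            h = heads[h]
    return ys


def tree_block_degree(heads):
    """Maximum block degree over the yields of all vertices."""
    return max(block_degree(y) for y in yields(heads))


def is_k_bbd(heads, k):
    return tree_block_degree(heads) <= k
```

For instance, with `heads = [None, 3, 0, 2, 1]` the yield of vertex 1 is `{1, 4}`, which has two blocks, so the tree is 2-BBD but not 1-BBD.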
Well-nestedness Two disjoint subsets $I_1, I_2 \subset V^+$ interleave if there exist $i, j \in I_1$ and $k, l \in I_2$ such that $i < k < j < l$. An arborescence is well-nested if it is not incident to two vertices whose yields interleave. Figure 1 (right) shows an arborescence which is not well-nested.
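A brute-force check of this definition can be sketched as follows. This is an illustrative Python sketch, not the linear-time or quadratic-time tests from the literature: the function names and the head-array encoding (`heads[j]` is the head of word `j`, `heads[0]` is `None`) are assumptions, and the interleaving test is applied in both orders of the pair.

```python
from itertools import combinations


def yields(heads):
    """Yield of every vertex (the vertex itself included) for a valid tree."""
    ys = [{j} for j in range(len(heads))]
    for j in range(1, len(heads)):
        h = heads[j]
        while h is not None:
            ys[h].add(j)
            h = heads[h]
    return ys


def crosses(a, b):
    """True if some i, j in a and k, l in b satisfy i < k < j < l."""
    return any(i < k < j < l for i in a for j in a for k in b for l in b)


def interleave(a, b):
    # The definition is stated for an ordered pair; both orders are tested.
    return crosses(a, b) or crosses(b, a)


def is_well_nested(heads):
    """True if no two disjoint yields interleave.

    The yield of the root (vertex 0) is skipped: it contains every vertex,
    so it is never disjoint from another yield.
    """
    ys = yields(heads)[1:]
    return not any(a.isdisjoint(b) and interleave(a, b)
                   for a, b in combinations(ys, 2))
```

With `heads = [None, 0, 0, 1, 2]`, the yields `{1, 3}` and `{2, 4}` interleave, so the tree is not well-nested.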

ILP Formulation
In this section we formulate the dependency parsing problem described in Section 3 as an ILP. We start with some notation and two theorems characterizing $k$-BBD and well-nested dependency trees. Given a subset $W \subseteq V$, the set of arcs entering $W$ is denoted by $\delta^{in}(W)$ and the set of arcs leaving $W$ is denoted by $\delta^{out}(W)$. The set $\delta(W) = \delta^{in}(W) \cup \delta^{out}(W)$ is called the cut of $W$. Given a positive integer $l$, let $\mathcal{W}_{\geq l}$ be the family of vertex subsets of $V^+$ with block degree greater than or equal to $l$. For instance, given any sentence with more than 6 words, $\{1, 3, 5, 6\} \in \mathcal{W}_{\geq 3}$, while $\{1, 2, 5, 6\} \notin \mathcal{W}_{\geq 3}$. We also denote by $\mathcal{I}$ the family of pairs of disjoint interleaving vertex subsets of $V^+$. For instance, $(\{1, 4\}, \{2, 3, 5\})$ belongs to $\mathcal{I}$. Finally, given a vector $x \in \mathbb{R}^A$ and a subset $B \subseteq A$, $x(B)$ corresponds to $\sum_{a \in B} x_a$.
Theorem 1. A dependency tree $T$ is not $k$-BBD iff there exists a vertex subset $W \in \mathcal{W}_{\geq k+1}$ whose cut $\delta(W)$ contains a unique arc of $T$.
Proof. By definition of block degree, a dependency tree is not $k$-BBD iff it is incident with a vertex whose yield $W$ belongs to $\mathcal{W}_{\geq k+1}$. It is equivalent to say that $T$ contains a subarborescence $T'$ such that $V[T']$ equals $W$. This holds iff $W$ has one entering arc (since $0 \notin W$) and no leaving arc belonging to $T$.
Theorem 2. A dependency tree $T$ is not well-nested iff there exists $(I_1, I_2) \in \mathcal{I}$ such that $\delta(I_1) \cap T$ and $\delta(I_2) \cap T$ are singletons.
Proof. $\delta(I_1)$ and $\delta(I_2)$ both intersect $T$ only once iff $T$ contains two arborescences $T_1$ and $T_2$ such that $V[T_1] = I_1$ and $V[T_2] = I_2$. This means that $T$ is incident with two vertices whose yields are $I_1$ and $I_2$, respectively. The result follows from the definition of $\mathcal{I}$ and of well-nested arborescences.
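Both theorems reduce the structural properties to counting tree arcs in a cut. As a small illustration, the following Python sketch lists the tree arcs of $\delta(W)$ for a given vertex set; the helper names and the head-array encoding (`heads[j]` is the head of word `j`) are assumptions made for this example.

```python
def tree_arcs(heads):
    """Arcs (head, dependent) of the dependency tree encoded by `heads`."""
    return [(heads[j], j) for j in range(1, len(heads))]


def cut_arcs(arcs, W):
    """Arcs with exactly one endpoint in W, i.e. the arcs lying in delta(W)."""
    return [(i, j) for (i, j) in arcs if (i in W) != (j in W)]


# Tree 0 -> 1 -> 3 and 0 -> 2 -> 4: the yields {1, 3} and {2, 4} interleave.
arcs = tree_arcs([None, 0, 0, 1, 2])
single_1 = cut_arcs(arcs, {1, 3})   # one tree arc: {1, 3} is a yield
single_2 = cut_arcs(arcs, {2, 4})   # one tree arc: {2, 4} is a yield
larger = cut_arcs(arcs, {1, 2})     # several tree arcs: {1, 2} is not a yield
```

Here `{1, 3}` has block degree 2 and a cut containing a single tree arc, so by Theorem 1 the tree is not 1-BBD; since the cuts of `{1, 3}` and `{2, 4}` are both singletons, Theorem 2 shows it is not well-nested either.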
A dependency tree will be represented by its incidence vector. Hence, we use variables $z \in \mathbb{R}^A$ such that $z_a = 1$ if arc $a$ belongs to the dependency tree and $0$ otherwise. The dependency parsing problem can be formulated as follows.
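The program numbered (1)-(6) is not reproduced in this text. A reconstruction consistent with the surrounding description and with Theorems 1 and 2 might read as below. It should be taken as a sketch under two assumptions: that $\delta^{in}(v)$ abbreviates $\delta^{in}(\{v\})$, and that the forbidden configurations of the two theorems (a cut containing exactly one tree arc, or two cuts that together contain exactly two tree arcs) are excluded by requiring at least 2 and at least 3 arcs, respectively.

```latex
\begin{align}
\max_{z} \quad & \sum_{a \in A} w_a z_a \tag{1} \\
\text{s.t.} \quad & z(\delta^{in}(v)) = 1 && \forall v \in V^+ \tag{2} \\
& z(\delta^{in}(W)) \geq 1 && \forall W \subseteq V^+ \tag{3} \\
& z(\delta(W)) \geq 2 && \forall W \in \mathcal{W}_{\geq k+1} \tag{4} \\
& z(\delta(I_1)) + z(\delta(I_2)) \geq 3 && \forall (I_1, I_2) \in \mathcal{I} \tag{5} \\
& z \in \{0, 1\}^A \tag{6}
\end{align}
```

Since every $W \subseteq V^+$ has at least one entering arc in a spanning arborescence rooted at 0, the left-hand side of (4) is at least 1 and the left-hand side of (5) is at least 2 for any dependency tree, so each of these constraints removes exactly the case described in the corresponding theorem.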
The objective function (1) maximizes the score of the dependency tree. Constraints (2) ensure that all vertices but the root have one entering arc. Inequalities (3) force the set of arcs associated with $z$ to induce a connected graph. Constraints (2) and (3), together with $z \geq 0$, give a linear description of the convex hull of the incidence vectors of the spanning arborescences with root 0, see e.g. (Schrijver, 2003). Inequalities (4) ensure that the dependency tree is $k$-BBD and inequalities (5) impose well-nestedness. The validity of (4) and (5) follows from Theorems 1 and 2, respectively. Note that (3) could be replaced by a polynomial number of additional flow variables and constraints, see (Martins et al., 2009).⁴

Lagrangian Relaxation
Solving this ILP using an off-the-shelf solver is ineffective due to the huge number of constraints. We tackle this problem with Lagrangian Relaxation, which has become popular in the NLP community, see for instance (Rush and Collins, 2012). Note that contrary to most previous work on Lagrangian Relaxation for NLP, we do not use it to derive a decomposition method.
We note that optimizing objective (1) subject to constraints (2), (3) and (6) amounts to finding a MSA and can be solved combinatorially (McDonald et al., 2005). Thus, since formulation (1)-(6) is based only on arc variables, by relaxing constraints (4) and (5), one obtains a Lagrangian dual objective which is nothing but a MSA problem with reparameterized arc scores. Our Lagrangian approach relies on a subgradient descent where a MSA problem is solved at each iteration. We give more details in the rest of the section.

Dual Problem
Let $Z$ be the set of the incidence vectors of dependency trees. Keeping tree shape constraints (2), (3) and (6) while dualizing $k$-bounded block degree constraints (4) and well-nestedness constraints (5), we build the following Lagrangian (Lemaréchal, 2001):
$$L(z, u) = \sum_{a \in A} w_a z_a + \sum_{W \in \mathcal{W}_{\geq k+1}} u^1_W \big( z(\delta(W)) - 2 \big) + \sum_{(I_1, I_2) \in \mathcal{I}} u^2_{I_1, I_2} \big( z(\delta(I_1)) + z(\delta(I_2)) - 3 \big) \tag{7}$$
with $z \in Z$ and $u = (u^1, u^2) \geq 0$ a vector of Lagrangian multipliers. We refactor to:
$$L(z, u) = \sum_{a \in A} \theta_a z_a + c \tag{8}$$
where $\theta$ are modified scores and $c$ a constant term. The dual objective is $L^*(u) = \max_{z \in Z} L(z, u)$. Note that computing $L^*(u)$ amounts to solving the MSA problem with modified scores $\theta$ and can be done efficiently. The dual problem is $\min_{u \geq 0} L^*(u)$. $L^*$ is a non-differentiable convex piecewise linear function and one can find its minimum via subgradient descent. For any vector $u$, we use the following subgradient. Denote by $Mz \leq b$ the set of constraints given by (4) and (5) and let $z^* = \arg\max_{z \in Z} L(z, u)$. Then $g = b - Mz^*$ is a subgradient at $u$, see (Lemaréchal, 2001) for more details. From this subgradient, we compute the descent direction following (Camerini et al., 1975), which aggregates information during the iterations of the subgradient descent algorithm. Unfortunately, optimizing the dual is expensive with so many relaxed constraints. We handle this problem in the next subsection.
⁴ Based on this remark, we also developed a formulation of this problem with a polynomial number of variables and constraints. However, it requires adding many more variables than (Martins et al., 2009). This leads to a formulation which is not tractable, see Section 7.2. Moreover, it cannot be tackled by our Lagrangian Relaxation approach.
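To make the refactoring into modified scores concrete, here is a minimal Python sketch. It assumes that the dualized block degree constraints have the form $z(\delta(W)) \geq 2$ and the well-nestedness constraints the form $z(\delta(I_1)) + z(\delta(I_2)) \geq 3$, a reading consistent with Theorems 1 and 2. The data layout (dictionaries keyed by arcs and by frozensets) and the function names are hypothetical.

```python
def in_cut(arc, W):
    """True if the arc has exactly one endpoint in W."""
    i, j = arc
    return (i in W) != (j in W)


def reparameterize(w, u_bd, u_wn):
    """Fold the multipliers into the arc scores.

    w    : dict arc -> original score
    u_bd : dict frozenset W -> multiplier of a block degree constraint
    u_wn : dict (frozenset I1, frozenset I2) -> multiplier of a
           well-nestedness constraint
    Returns (theta, c) such that, for any 0/1 arc vector z,
    sum_a w[a] z[a] + penalties == sum_a theta[a] z[a] + c.
    """
    theta = dict(w)
    for W, u in u_bd.items():
        for a in theta:
            if in_cut(a, W):
                theta[a] += u
    for (I1, I2), u in u_wn.items():
        for a in theta:
            # An arc may lie in both cuts, in which case it is counted twice.
            theta[a] += u * (in_cut(a, I1) + in_cut(a, I2))
    c = -2 * sum(u_bd.values()) - 3 * sum(u_wn.values())
    return theta, c
```

Only constraints with a non-zero multiplier need to appear in `u_bd` and `u_wn`, which is what makes the lazy generation of constraints described in the next subsection possible.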

Efficient Optimization with Many Constraints
The Non-Delayed Relax-and-Cut (NDRC) method (Lucena, 2005) tackles the problem of optimizing a Lagrangian dual problem with exponentially many relaxed constraints. In standard subgradient descent, at each iteration $p$ of the descent, the Lagrangian update can be formulated as:
$$u^{p+1} = \left( u^p - s_p \, g^p \right)_+ \tag{9}$$
where $s_p > 0$ is the stepsize⁵ and $(\cdot)_+$ denotes the projection onto $\mathbb{R}_+$, which replaces each negative component by 0. If all Lagrangian multipliers are initialized to 0, the component corresponding to a constraint will not be changed until this constraint is violated for the first time. Indeed, by definition of $g$, we have $[g^p]_i \geq 0$ if constraint $i$ is satisfied at iteration $p$: the projection onto $\mathbb{R}_+$ ensures that $[u^{p+1}]_i$ stays at 0.⁶ Thus we do not need to know constraints that have not been violated yet in order to correctly update the Lagrangian multipliers: this is the main intuition behind the NDRC method.
However, $s_p$ may depend on the full subgradient information. A common step size (Fisher, 1981) is:
$$s_p = \alpha_p \, \frac{L^*(u^p) - LB_p}{\lVert g^p \rVert^2} \tag{10}$$
with $\alpha_p$ a scalar and $LB_p$ the best known lower bound. This is also the case with more recent approaches like AdaGrad (Duchi et al., 2011) and AdaDelta (Zeiler, 2012). As reported in (Beasley, 1993; Lucena, 2006), when dealing with many relaxed constraints, the $\lVert g^p \rVert^2$ term can result in each Lagrangian update being almost equal to 0. Therefore, a good practice is to modify the subgradient such that if $[g^p]_i > 0$ and $[u^p]_i = 0$, then we set $[g^p]_i = 0$: this has the same effect on the multipliers as the projection onto $\mathbb{R}_+$ in (9), but it prevents the stepsize from becoming too small.
Hence, instead of generating a full subgradient at each iteration, which is an expensive operation because we would need to consider all multipliers associated with constraints, we process only a subpart, namely the one associated with constraints that have been violated. Following (Lucena, 2005), at each iteration $p$ of the subgradient descent we define two sets: Currently Violated Active Constraints ($CA_p$) and Previously Violated Active Constraints ($PA_p$). $CA_p$ and $PA_p$ are not necessarily disjoint. The subgradient is computed only for constraints in $CA_p \cup PA_p$. At each iteration $p$, we update $PA_p = PA_{p-1} \cup CA_{p-1}$ and a violation detection step, similar to the separation step in a cutting plane algorithm, generates $CA_p$. Two strategies are possible for the detection: (1) adding to $CA_p$ all the constraints violated by the current dual solution; (2) adding only a subset of them. The latter is justified by the fact that many constraints may overlap, thus leading to an exaggeration of modified scores on some arcs. We found that strategy (2) gives better convergence results.
⁵ As stated above, instead of the subgradient we follow an improved descent direction which aggregates information of previous iterations. However, this does not change the proposal of this subsection.
⁶ $[x]_i$ denotes the $i$th component of vector $x$.
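The update restricted to known constraints can be sketched in Python as follows. This is a simplified illustration rather than the procedure used in the experiments: it uses the plain subgradient instead of the aggregated direction of (Camerini et al., 1975), and the function name, the dictionary-based storage and the `alpha` parameter are assumptions. Each relaxed constraint is assumed to be written as "left-hand side at least right-hand side", so that the subgradient component is the left-hand side evaluated at the current tree minus the right-hand side.

```python
def ndrc_step(u, lhs, rhs, dual_value, lower_bound, alpha=1.0):
    """One projected subgradient update over the constraints seen so far.

    u           : dict constraint id -> multiplier; absent ids are implicitly 0
    lhs         : dict constraint id -> left-hand side evaluated at the current
                  dual solution, for the constraints in CA and PA only
    rhs         : dict constraint id -> right-hand side of the constraint
    dual_value  : value of the dual objective at u
    lower_bound : score of the best known valid dependency tree
    """
    g = {}
    for cid, value in lhs.items():
        gi = value - rhs[cid]            # non-negative iff the constraint holds
        if gi > 0 and u.get(cid, 0.0) == 0.0:
            gi = 0.0                     # would be clipped anyway; keep it out of the norm
        g[cid] = gi
    norm2 = sum(gi * gi for gi in g.values())
    if norm2 == 0.0:
        return dict(u)                   # nothing to update
    step = alpha * (dual_value - lower_bound) / norm2
    new_u = dict(u)
    for cid, gi in g.items():
        new_u[cid] = max(0.0, u.get(cid, 0.0) - step * gi)
    return new_u
```

A violated constraint has a negative component, so its multiplier increases; a satisfied constraint whose multiplier is positive sees it decrease towards 0.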
Detection of violated block degree constraints (4) can be done with the algorithm described in (Möhl, 2006) in $O(n^2)$. If no violated block degree constraint is found, we search for violated well-nestedness constraints (5) using the $O(n^2)$ algorithm described in (Havelka, 2007).

Lagrangian Heuristic
We derive a heuristic from the Lagrangian Relaxation. First, a dependency tree is computed with the MSA algorithm. If it is valid, it then corresponds to the optimal solution. Otherwise, we proceed as follows. The computation of the step size in (10) in the subgradient descent needs a lower bound which can be given by the score of any valid dependency tree. In our experiments, we compute the best projective spanning arborescence (Eisner, 2000). Each iteration of the subgradient descent computes a spanning arborescence. Since violating (4) and (5) is penalized in the objective function, it tends to produce valid dependency trees. The heuristic returns the highest scoring one.
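The overall loop of the heuristic can be sketched in Python as below. Several simplifications are assumed and should not be read as the actual implementation: the MSA is found by exhaustive enumeration (a stand-in for Chu-Liu-Edmonds, usable only on toy inputs), only block degree constraints of the assumed form $z(\delta(W)) \geq 2$ are dualized, the plain subgradient is used, and the initial lower bound is passed in by the caller. All function names are hypothetical.

```python
from itertools import product


def yields(heads):
    ys = [{j} for j in range(len(heads))]
    for j in range(1, len(heads)):
        h = heads[j]
        while h is not None:
            ys[h].add(j)
            h = heads[h]
    return ys


def block_degree(vs):
    return sum(1 for v in vs if v - 1 not in vs)


def reaches_root(heads):
    """True if every vertex reaches 0 by following heads (no cycle)."""
    for j in range(1, len(heads)):
        seen, h = set(), j
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h]
    return True


def brute_force_msa(score):
    """Exhaustive stand-in for Chu-Liu-Edmonds; exponential in n."""
    n = len(score)
    best_value, best_heads = float("-inf"), None
    for cand in product(range(n), repeat=n - 1):
        heads = (None,) + cand
        if not reaches_root(heads):
            continue
        value = sum(score[heads[j]][j] for j in range(1, n))
        if value > best_value:
            best_value, best_heads = value, heads
    return best_heads, best_value


def lagrangian_heuristic(w, k, lower_bound, iterations=100, alpha=1.0):
    n = len(w)
    u = {}                                    # frozenset W -> multiplier, created lazily
    best_heads, best_score = None, float("-inf")
    for _ in range(iterations):
        theta = [[w[i][j] + sum(m for W, m in u.items() if (i in W) != (j in W))
                  for j in range(n)] for i in range(n)]
        heads, value = brute_force_msa(theta)
        dual = value - 2 * sum(u.values())    # upper bound on any valid tree
        violated = {frozenset(y) for y in yields(heads)[1:] if block_degree(y) > k}
        if not violated:
            score = sum(w[heads[j]][j] for j in range(1, n))
            if score > best_score:
                best_heads, best_score = heads, score
        lb = max(lower_bound, best_score)
        if dual - lb < 1e-9:                  # no gap left
            break
        g = {}
        for W in violated | set(u):           # currently and previously violated
            cut = sum(1 for j in range(1, n) if (heads[j] in W) != (j in W))
            gi = cut - 2
            g[W] = 0 if gi > 0 and u.get(W, 0.0) == 0.0 else gi
        norm2 = sum(x * x for x in g.values())
        if norm2 == 0:
            break
        step = alpha * (dual - lb) / norm2
        for W, gi in g.items():
            u[W] = max(0.0, u.get(W, 0.0) - step * gi)
    return best_heads, best_score
```

On a four-vertex toy instance such as `w = [[0, 5, 5, 3], [0, 0, 1, 5], [0, 1, 0, 4], [0, 1, 1, 0]]` with `k = 1` and a lower bound of 13, the unconstrained MSA attaches word 3 to word 1 and violates the bound; after one multiplier update the reparameterized MSA attaches word 3 to word 2 instead, which is valid.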

Branch and Bound
Solving the Lagrangian dual problem may not always give an optimal solution to the original problem because of the potential duality gap. Still, we always obtain an upper bound on the optimal solution and if a dual solution satisfies constraints (4) and (5), its score with the original weights provides a lower bound.⁷ Moreover, the subgradient descent algorithm theoretically converges but we have no guarantee that this will happen in a realistic number of iterations. Therefore, in order to retrieve an optimal solution in all cases, we embed the Lagrangian Relaxation of the problem within a Branch-and-Bound procedure (Land and Doig, 1960).
The search space is recursively split according to an arc variable, creating two subspaces, one where it is fixed to 1 and the other to 0 (branching step). The procedure returns a candidate solution when all arc variables are fixed and constraints are satisfied, and the optimal solution is the highest-scoring candidate solution.
For each subspace, we estimate an upper bound using the Lagrangian Relaxation (bounding step). The recursive exploration of a subspace stops (pruning step) if (1) we can prove that all candidate solutions it contains have a score lower than the best found so far, or (2) we detect an unsatisfiable constraint.
The branching strategy is built upon Lagrangian multipliers: we branch on the variable $z_a$ with highest value $\theta_a - w_a$. Intuitively, if the branching step sets $z_a = 1$, it indicates that we add a hard constraint on an arc which has been strongly promoted by Lagrangian Relaxation. This strategy, compared to other variants, gave the best parsing time on development data.
⁷ Because relaxed constraints are inequalities, constraint satisfaction does not guarantee optimality (Beasley, 1993).

Problem Reduction
The efficiency of the Branch-and-Bound procedure crucially depends on the number of free variables. To prune the search space, we rely on problem reduction (Beasley, 1993), once again based on duality and Lagrangian Relaxation, which provides certificates on optimal variable settings.
We fix a variable to 1 (resp. 0), and compute an upper bound on the optimal score with this new constraint. If it is lower than the score of the best solution found so far without this constraint, we can guarantee that this variable cannot (resp. must) be in the optimal solution and safely set it to 0 (resp. 1).
Problem reduction is performed at each node of the Branch-and-Bound tree after computing the upper bound with subgradient descent.

Fixing Variables to 1
Since a node in $V^+$ must have exactly one parent, fixing $z_{ij} = 1$ for an arc $a = (i, j)$ greatly reduces the problem size, as it will also fix $z_{hj} = 0$ for all $h \neq i$. Among all arc variables that can be set to 1, promising candidates are the arcs in a solution of the unconstrained MSA and the arcs obtained in a solution after the subgradient descent.
There are exactly n such arcs in each set of candidates, so we test fixation for less than 2n variables. In this case, we are ready to pay the price of a quadratic computation for each of these arcs.
Hence, for each candidate arc we obtain an upper bound by seeking the (unconstrained) MSA on the graph where this arc is removed. If this upper bound is lower than the score of the best solution found so far, we can safely say that this arc is in the optimal solution.
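The test for fixing a candidate arc to 1 can be sketched in Python as follows. The sketch assumes a dense score matrix, removes an arc by setting its score to minus infinity, and again uses exhaustive enumeration as a stand-in for the quadratic MSA algorithm, so it only runs on toy inputs. The function names are hypothetical.

```python
from itertools import product


def reaches_root(heads):
    """True if every vertex reaches 0 by following heads (no cycle)."""
    for j in range(1, len(heads)):
        seen, h = set(), j
        while h != 0:
            if h in seen:
                return False
            seen.add(h)
            h = heads[h]
    return True


def brute_force_msa(score):
    """Exhaustive stand-in for Chu-Liu-Edmonds; exponential in n."""
    n = len(score)
    best_value, best_heads = float("-inf"), None
    for cand in product(range(n), repeat=n - 1):
        heads = (None,) + cand
        if not reaches_root(heads):
            continue
        value = sum(score[heads[j]][j] for j in range(1, n))
        if value > best_value:
            best_value, best_heads = value, heads
    return best_heads, best_value


def arcs_fixable_to_one(score, candidates, incumbent):
    """Candidate arcs that must belong to any solution better than `incumbent`.

    For each candidate, the best unconstrained tree that avoids the arc gives
    an upper bound on every valid tree without it. If that bound is below the
    incumbent score, the arc can be fixed to 1.
    """
    fixable = []
    for (i, j) in candidates:
        saved = score[i][j]
        score[i][j] = float("-inf")       # remove the arc from the graph
        _, bound = brute_force_msa(score)
        score[i][j] = saved               # restore the original score
        if bound < incumbent:
            fixable.append((i, j))
    return fixable
```

With the scores `[[0, 5, 5, 3], [0, 0, 1, 5], [0, 1, 0, 4], [0, 1, 1, 0]]`, the MSA arcs `(0, 1)`, `(0, 2)`, `(1, 3)` as candidates and an incumbent of score 14, removing `(0, 1)` or `(0, 2)` drops the bound to 11, so both can be fixed, whereas removing `(1, 3)` still allows a tree of score 14.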

Fixing Variables to 0
We could apply the same strategy for fixing variables to 0. However, this reduction is less rewarding and there are many more variables set to 0 than 1 in a MSA solution. Instead, we solve an easier problem, at the expense of a looser upper bound.
For each arc $a$ which is not in the MSA, we compute a maximum-score directed graph that contains this arc and where all nodes but the root have one parent. Note that if this graph is connected then it corresponds to a dependency tree. Therefore, the score of this directed graph provides an upper bound on a solution containing arc $a$. If this upper bound is lower than the score of the best solution found so far, we can fix the variable $z_a$ to 0.
Note that this whole fixing procedure can be done in $O(n^2)$.
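The quadratic bound used for fixing variables to 0 can be sketched in Python as follows. The sketch assumes a dense score matrix in which `score[i][j]` is the (possibly reparameterized) score of arc `(i, j)`; the function name and the `offset` parameter, which stands for the constant term of the Lagrangian when modified scores are used, are hypothetical.

```python
def arcs_fixable_to_zero(score, incumbent, offset=0.0):
    """Arcs that cannot belong to any solution better than `incumbent`.

    The best graph in which every non-root vertex has exactly one parent is
    obtained by picking the best entering arc of each vertex. Forcing arc
    (i, j) only changes the choice made for vertex j, so every bound is
    obtained in constant time once the best entering scores are known.
    """
    n = len(score)
    best_in = [None] + [max(score[h][j] for h in range(n) if h != j)
                        for j in range(1, n)]
    total = sum(best_in[1:]) + offset
    fixable = []
    for j in range(1, n):
        for i in range(n):
            if i == j:
                continue
            bound = total - best_in[j] + score[i][j]
            if bound < incumbent:
                fixable.append((i, j))
    return fixable
```

Arcs of the MSA never pass the test, since their bound is at least the MSA score, so looping over all arcs rather than only those outside the MSA does not change the result. With the scores `[[0, 5, 5, 3], [0, 0, 1, 5], [0, 1, 0, 4], [0, 1, 1, 0]]` and an incumbent of score 14, five of the nine arcs are eliminated.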

Experiments
We ran a series of experiments to test our method in the case of unlabelled dependency parsing. Our prototype has been developed in Python with some parts in Cython and C++. We use the MSA implementation available in the LEMON library.

Datasets
We ran experiments on 5 different corpora:
English: Dependencies were extracted from the WSJ part of the Penn Treebank (PTB) with additional NP bracketings (Vadas and Curran, 2007) with the LTH converter (default options). Sections 02-21 are used for training, 22 for development and 23 for testing. POS tags were predicted by the Stanford tagger trained with 10-way jackknifing.
German: We used dependencies from the SPMRL dataset (Seddah et al., 2014), with predicted POS tags and the official split. We removed sentences of length greater than 100 in the test set.
Dutch, Spanish and Portuguese: We used the Universal Dependency Treebank 1.2 (Van der Beek et al., 2002; McDonald et al., 2013; Afonso et al., 2002) with gold POS tags and the official split. We removed sentences of length greater than 100 in the test sets.
Those datasets contain different structure distributions as shown in Table 1. Fortunately, our method allows us to easily change the bound on the block degree or toggle the well-nestedness constraint. For each language, we decided to use the most constrained combination of bounded block degree and well-nestedness constraints which covers over 99% of the data. Therefore, we chose to enforce 2-BBD and well-nestedness for English and Spanish, 3-BBD and well-nestedness for Dutch and Portuguese and 3-BBD only for German.

Decoding
In order to compare our methods with previous approaches, we tested five decoding strategies.
(MSA) computes the best unconstrained dependency tree. (Eisner) computes the best projective tree. (LR) and (B&B) are the heuristic and the exact method presented in Sections 5.3 and 6 respectively. Finally (MSA/Eisner) consists in running the MSA algorithm and, if the solution is invalid, returning the (Eisner) solution instead.
Our attempt to run the dynamic programming algorithm of (Gómez-Rodríguez et al., 2009) was unsuccessful. Even with heavy pruning we were not able to run it on sentences above 20 words. We also tried to use CPLEX on a compact ILP formulation based on multi-commodity flows (see footnote 4). Parsing time was also prohibitive: a total of 3473 seconds on English data without the well-nestedness constraint, 7298 for German.
We discuss the efficiency of our methods on data for English and German. Other languages give similar results. Optimality rates after the subgradient descent are reported in Figure 2. We see that Lagrangian Relaxation often returns optimal solutions but fails to give a certificate of their optimality. Table 2 shows parsing times. We see that (LR) and (B&B), while slower than (MSA), are fast in the majority of cases, below the third quartile. Inevitably, there are some rare cases where a large portion of the search space is explored, and thus their parsing time is high. Note that these algorithms are run only when (MSA) returns an invalid structure, so the total time remains acceptable compared to the baseline.
Finally, we stress the importance of problem reduction as a pre-processing step in B&B: after subgradient descent is performed, it removes an average of 83.97% (resp. 76.59%) of arc variables in the English test set (resp. German test set).

Training
Feature weights are trained using the averaged structured perceptron (Collins, 2002) with 10 iterations where the best iteration is selected on the development set. We used the same feature set as in TurboParser (Martins et al., 2010), including features for lemma. For German, we additionally use morpho-syntactic features.
The decoding algorithm used at training time is the MSA. We experimented with Branch-and-Bound and Lagrangian Relaxation decoding during training. It did not significantly improve accuracy and made training and decoding slower.

Parsing Results
Table 3 shows unlabelled attachment score (UAS), percentage of valid dependency trees and time relative to (MSA) for our five decoding strategies. We can see that (B&B) is on a par with (LR) on some corpora and more accurate on the others. The former takes more time, and the improvement is correlated with the time difference for all corpora but the PTB. Dividing the five corpora into three cases, we can see that:
1. For English and Spanish, where projective dependency trees represent more than 90% of the data, (Eisner) outperforms (MSA). Our methods lie between the two. Here it is better to search for projective trees and (LR) and (B&B) are not interesting in terms of UAS. This is confirmed by the results of (MSA/Eisner).
2. For German and Dutch, where projective dependency trees represent less than 70% of the data, (MSA) outperforms (Eisner). For German, where well-nestedness is not required, our methods are as accurate as (MSA), while for Dutch our methods seem to be useful, as (B&B) outperforms all systems. Moreover, our two methods guarantee validity.
3. For Portuguese, where projective dependency trees represent around 80% of the data, (MSA) is as accurate as (Eisner). In this case we see that, while our heuristic is below, the exact method is more accurate. This seems to be an edge case where neither unconstrained nor projective dependency trees seem to adequately capture the solution space. We also see that it is harder for our methods to give solutions (longer computation times, which tend to indicate that LR cannot guarantee optimality). Our methods are best fitted for this case.
In order to see how many well-nested and bounded block degree structures are missed by a state-of-the-art parser, we compare our results with TurboParser. We run the parser with three different feature sets: arc-factored, standard (second-order features), and full (third-order features). The results are shown in Table 4. Our model, by enforcing strict compliance with structural rules (100% valid dependency trees), is closer to the empirical distribution than TurboParser in arc-factored mode on all languages but German. Higher-order scoring functions come closer to the treebank data than our strict thresholds for all languages but Portuguese, at the expense of a significant computational burden. We interpret this fact as an indication that adding higher-order features into our system would make the relaxation method converge more often and faster.

Conclusion
We presented a novel characterization of dependency trees complying with two structural rules: bounded block degree and well-nestedness, from which we derived two methods for arc-factored dependency parsing. The first one is a heuristic which relies on Lagrangian Relaxation and the efficient Chu-Liu-Edmonds maximum spanning arborescence algorithm. The second one is an exact Branch-and-Bound procedure where bounds are provided by Lagrangian Relaxation. We showed experimentally that these methods give results comparable with state-of-the-art arc-factored parsers, while enforcing constraints in all cases.
In this paper we focused on arc-factored models, but our method could be extended to higher-order models, following the dual decomposition method presented in  in which the maximum-weight spanning arborescence component would be replaced by our constrained model.
Our method opens new perspectives for LTAG parsing, in particular using decomposition techniques where dependencies and templates are predicted separately. Moreover, while well-nested dependencies with 2-bounded block degree can represent LTAG derivations, toggling the well-nestedness constraint or setting the block degree bound makes it possible to express the whole range of derivations in lexicalized LCFRS, whether well-nested or with a bounded fan-out. Our algorithm can exactly represent these settings with a comparable complexity.