Parsing to 1-Endpoint-Crossing, Pagenumber-2 Graphs

We study the Maximum Subgraph problem in deep dependency parsing. We consider two restrictions to deep dependency graphs: (a) 1-endpoint-crossing and (b) pagenumber-2. Our main contribution is an exact algorithm that obtains maximum subgraphs satisfying both restrictions simultaneously in time O(n^5). Moreover, ignoring one linguistically rare structure decreases the complexity to O(n^4). We also extend our quartic-time algorithm into a practical parser with a discriminative disambiguation model and evaluate its performance on four linguistic data sets used in semantic dependency parsing.


Introduction
Dependency parsing has long been studied as a central issue in developing syntactic or semantic analysis. Recently, some linguistic projects grounded on deep grammar formalisms, including CCG, LFG, and HPSG, have drawn attention to rich syntactic and semantic dependency annotations that are not limited to trees (Hockenmaier and Steedman, 2007; Ivanova et al., 2012). Parsing for these deep dependency representations can be viewed as the search for Maximum Subgraphs (Kuhlmann and Jonsson, 2015). This is a natural extension of the Maximum Spanning Tree (MST) perspective (McDonald et al., 2005) for dependency tree parsing.
One main challenge of the Maximum Subgraph perspective is to design tractable algorithms for certain graph classes that have good empirical coverage for linguistic annotations. Unfortunately, no previously defined class simultaneously has high coverage and low-degree polynomial parsing algorithms. For example, noncrossing dependency graphs can be found in time O(n^3), but cover only 48.23% of sentences in CCGBank (Kuhlmann and Jonsson, 2015).
* The first two authors contribute equally.
We study two well-motivated restrictions to deep dependency graphs: (a) 1-endpoint-crossing (1EC hereafter; Pitler et al., 2013) and (b) pagenumber at most 2 (P2 hereafter; Kuhlmann and Jonsson, 2015). We will show that if the output dependency graphs are restricted to satisfy both restrictions, the Maximum Subgraph problem can be solved using dynamic programming in time O(n^5). Moreover, if we ignore one linguistically rare sub-problem, we can reduce the time complexity to O(n^4). Though this new algorithm is a degenerated one, it has the same empirical coverage for various deep dependency annotations. We evaluate the coverage of our algorithms on four linguistic data sets: CCGBank, DeepBank, Enju HPSGBank and Prague Dependency TreeBank. They cover 95.68%, 97.67%, 97.28% and 97.53% of dependency graphs in the four corpora, respectively. The relatively satisfactory coverage makes it possible to parse with high accuracy.
Based on the quartic-time algorithm, we implement a parser with a discriminative disambiguation model. Our new parser can be taken as a graph-based parser which is complementary to transition-based (Henderson et al., 2013; Zhang et al., 2016) and factorization-based (Martins and Almeida, 2014; Du et al., 2015a) systems. We evaluate our parser on four data sets: those used in SemEval 2014 Task 8 (Oepen et al., 2014), and the dependency graphs extracted from CCGBank (Hockenmaier and Steedman, 2007). Evaluations indicate that our parser produces very accurate deep dependency analyses. On average, it reaches the state-of-the-art results produced by the transition-based system of Zhang et al. (2016) and the factorization-based systems of Martins and Almeida (2014) and Du et al. (2015a).

Background
Dependency parsing is the task of mapping a natural language sentence into a dependency graph. Previous work on dependency parsing mainly focused on tree-shaped representations. Recently, it has been shown that data-driven parsing techniques are also applicable to generating more flexible deep dependency graphs (Martins and Almeida, 2014; Du et al., 2015b,a; Zhang et al., 2016; Sun et al., 2017). Parsing for deep dependency representations can be viewed as the search for Maximum Subgraphs for a certain graph class 𝒢 (Kuhlmann and Jonsson, 2015), a generalization of the MST perspective for tree parsing. In particular, we have the following optimization problem: Given an arc-weighted graph G = (V, A), find a subgraph G′ = (V, A′ ⊆ A) with maximum total weight such that G′ belongs to 𝒢.
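To make the optimization problem above concrete, the following sketch solves Maximum Subgraph by exhaustive enumeration. It is only an illustration for toy inputs (the search is exponential in the number of arcs), not the algorithm developed in this paper. The names `max_subgraph` and `noncrossing` are ours, and the noncrossing predicate merely stands in for the target graph class.

```python
from itertools import combinations

def noncrossing(arcs):
    """Example graph class: no two arcs cross (pagenumber-1 graphs)."""
    for (a, b), (c, d) in combinations(arcs, 2):
        a, b = sorted((a, b))
        c, d = sorted((c, d))
        if a < c < b < d or c < a < d < b:
            return False
    return True

def max_subgraph(weights, in_class):
    """Brute-force Maximum Subgraph.

    weights  : dict mapping a directed arc (head, dependent) to its score
    in_class : predicate deciding whether a set of arcs belongs to the class
    Returns the best total score and the corresponding arc set.
    """
    arcs = list(weights)
    best_score, best_arcs = 0.0, ()
    for r in range(len(arcs) + 1):
        for subset in combinations(arcs, r):
            if in_class(subset):
                score = sum(weights[a] for a in subset)
                if score > best_score:
                    best_score, best_arcs = score, subset
    return best_score, set(best_arcs)

# Toy input: arcs (0, 2) and (1, 3) cross, so only one of them can be kept.
toy_weights = {(0, 2): 3.0, (1, 3): 2.0, (0, 1): 1.0, (2, 1): -1.0}
print(max_subgraph(toy_weights, noncrossing))
```

Passing a different predicate changes the graph class without touching the search, which mirrors how the choice of class, rather than the objective, drives the complexity of the problem.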
The choice of 𝒢 determines the computational complexity of dependency parsing. For example, if 𝒢 is the set of projective trees, the problem can be solved in time O(|V|^3), and if 𝒢 is the set of noncrossing dependency graphs, the complexity is O(|V|^3). Unfortunately, no previously defined class simultaneously has high coverage on deep dependency annotations and low-degree polynomial decoding algorithms for practical parsing. In this paper, we study well-motivated restrictions: 1EC (Pitler et al., 2013) and P2 (Kuhlmann and Jonsson, 2015). We will show that relatively satisfactory coverage and parsing complexity can be obtained for graphs that satisfy both restrictions.
The 1EC, P2 Graphs

The 1EC Restriction
Pitler et al. (2013) introduced a very useful property for modelling non-projective dependency trees, i.e. 1EC. This property not only covers a large proportion of tree annotations in natural language treebanks, but also allows the corresponding MST problem to be solved in time O(n^4). The formal description of the 1EC property is adopted from Pitler et al. (2013).
Definition 1. Edges e_1 and e_2 cross if e_1 and e_2 have distinct endpoints and exactly one of the endpoints of e_1 lies between the endpoints of e_2.
Definition 2. A dependency graph is 1-Endpoint-Crossing if for any edge e, all edges that cross e share an endpoint p.
Given a sentence s = w_0 w_1 ··· w_{n−1} of length n, the vertices, i.e. words, are indexed with integers. We denote an arc from w_i to w_j as a_(i,j), and the common endpoint, namely the pencil point, of all edges that cross a_(i,j) or a_(j,i) as pt(i, j). We denote an edge as e_(i,j) if we do not consider its direction.
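Definitions 1 and 2 translate directly into a membership test. The sketch below is a straightforward quadratic check written for illustration; the function names are ours, edges are undirected pairs of word positions, and self-loops are assumed not to occur.

```python
def crosses(e1, e2):
    """Definition 1: the edges have distinct endpoints and exactly one
    endpoint of e1 lies strictly between the endpoints of e2."""
    if set(e1) & set(e2):
        return False
    lo, hi = sorted(e2)
    return sum(lo < v < hi for v in e1) == 1

def is_1ec(edges):
    """Definition 2: for every edge e, all edges crossing e share an endpoint."""
    edges = [tuple(sorted(e)) for e in edges]
    for e in edges:
        crossing = [f for f in edges if crosses(e, f)]
        if not crossing:
            continue
        common = set(crossing[0])
        for f in crossing[1:]:
            common &= set(f)
        if not common:
            return False  # e is crossed by edges with no shared endpoint
    return True

# Edge (0, 3) is crossed by (1, 4) and (2, 5), which share no endpoint.
print(is_1ec([(0, 3), (1, 4), (2, 5)]))
# Both edges crossing (0, 2) end at vertex 1, the pencil point pt(0, 2).
print(is_1ec([(0, 2), (1, 3), (1, 4)]))
```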

The P2 Restriction
The term pagenumber is referred to as planar by some other authors, e.g. (Titov et al., 2009;Gómez-Rodríguez and Nivre, 2010;Pitler et al., 2013). We give the definition of related concepts as follows.
Definition 3. A book is a particular kind of topological space that consists of a single line called the spine, together with a collection of one or more half-planes, called the pages, each having the spine as its boundary.
Empirically, a deep dependency graph is not very dense and can typically be embedded onto a very thin book. To measure the thickness of a graph, we can use its pagenumber.
Definition 5. The book pagenumber of G is the minimum number of pages required for a book embedding of G.
For the sake of concision, we say a graph is "pagenumber-k", meaning that its pagenumber is at most k.
Theorem 1. The pagenumber of a 1EC graph may be greater than 2.
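When the order of the vertices along the spine is fixed to the word order of the sentence, which is how we read the restriction here, a graph is pagenumber-2 exactly when its edges can be split into two sets with no crossing inside either set. This amounts to 2-colouring the graph whose nodes are edges and whose links are crossings. The sketch below performs that check with breadth-first search; the function names are ours.

```python
from collections import deque

def crosses(e1, e2):
    """Two edges cross if their endpoints are distinct and interleave."""
    if set(e1) & set(e2):
        return False
    lo, hi = sorted(e2)
    return sum(lo < v < hi for v in e1) == 1

def two_page_assignment(edges):
    """Return a dict edge -> page (0 or 1) if the edges fit on two pages
    under the fixed vertex order, or None if three or more pages are needed."""
    edges = sorted({tuple(sorted(e)) for e in edges})
    page = {}
    for start in edges:
        if start in page:
            continue
        page[start] = 0
        queue = deque([start])
        while queue:
            e = queue.popleft()
            for f in edges:
                if not crosses(e, f):
                    continue
                if f not in page:
                    page[f] = 1 - page[e]  # crossing edges go on opposite pages
                    queue.append(f)
                elif page[f] == page[e]:
                    return None  # odd cycle of crossings
    return page

# Five edges forming a pentagram on positions 0..4: 1EC, but not pagenumber-2.
print(two_page_assignment([(0, 2), (2, 4), (1, 4), (1, 3), (0, 3)]))
```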
Proof. The graph in Figure 1 gives an instance which is 1EC but whose pagenumber is 3. There is a cycle, namely a → c → e → b → d → a, consisting of an odd number of edges. With the vertices placed in the order a, b, c, d, e, each edge of this cycle crosses the two edges that are not adjacent to it in the cycle, so the crossing relation among the five edges itself forms an odd cycle and the edges cannot be distributed over two pages.
Figure 1: A 1EC graph whose pagenumber is 3.
Pitler et al. (2013) proved that 1EC trees are a subclass of graphs whose pagenumber is at most 2. This property provides the foundation for the success in designing dynamic programming algorithms for trees. Theorem 1 indicates that when we consider more general graphs, the case is more complicated. In this paper, we study graphs that are constrained to be both 1EC and P2. We call them 1EC/P2 graphs.
Table 1 (caption fragment): Column "P2" indicates whether the restriction "P2" is satisfied; Column "1EC" indicates whether the restriction "1EC" is satisfied.

Coverage on Linguistic Data
To show that the two restrictions above are well-motivated for describing linguistic data, we evaluate their empirical coverage on four deep dependency corpora (as defined in Section 5.2). These corpora are also used for training and evaluating our data-driven parsers. The coverage is evaluated using sentences in the training sets. Table 1 shows the results. We can see that 1EC is also an empirically well-motivated restriction when it comes to deep dependency structures. The P2 property has even better coverage. Unfortunately, finding optimal P2 graphs is an NP-hard problem (Kuhlmann and Jonsson, 2015). Though theoretically a 1EC graph is not necessarily P2, the empirical evaluation demonstrates that the two classes overlap heavily on linguistic annotations. In particular, almost all 1EC deep dependency graphs are P2. The percentages of graphs satisfying both restrictions vary between 95.68% for CCGBank and 97.67% for DeepBank. The relatively satisfactory coverage enables accurate practical parsing.

The Algorithm
This section contains the main contribution of this paper: a polynomial time exact algorithm for solving the Maximum Subgraph problem for the class of 1EC/P2 graphs.
Theorem 2. Taking 1EC/P2 graphs as the target subgraphs, the Maximum Subgraph problem can be solved in time O(|V|^5).
For the sake of formal concision, we present the algorithm as one whose goal is to calculate the maximum score of a subgraph. Extracting the corresponding optimal graph can be done in a number of ways. For example, we can maintain an auxiliary arc table which is populated in parallel with the computation of the maximum scores.
Our algorithm is highly related to the following property: Every subgraph of a 1EC/P2 graph is also a 1EC/P2 graph. We therefore focus on maximal 1EC/P2 graphs, a particular type of 1EC/P2 graphs defined as follows.
Definition 6. A maximal 1EC/P2 graph is a 1EC/P2 graph that cannot be extended by including one more edge.
Our algorithm is a bottom-up dynamic programming algorithm. It defines different structures corresponding to different sub-problems, and visits all structures from bottom to top, finding the best combination of smaller structures to form a new structure. The key design is to make sure that it can produce all maximal 1EC/P2 graphs. During the search for maximal 1EC/P2 graphs, we can freely delete bad edges whose scores are negative. In particular, in each construction step we identify the edges that can be created without violating either the 1EC or the P2 restriction. Assume the arc weight associated with a_(i,j) is w[i, j]. Then we define a function SELECT(i, j) according to the comparison of 0 with w[i, j] as well as w[j, i]. If w[i, j] ≥ 0 (or w[j, i] ≥ 0), we select a_(i,j) (or a_(j,i)) and add it to the current best solution of a sub-problem. SELECT(i, j) returns max(0, w[i, j]) + max(0, w[j, i]). If we allow at most one arc between two nodes, SELECT(i, j) returns max(0, w[i, j], w[j, i]).
Figure 2: Graphic representations of sub-problems.
The graphical illustration of our algorithm uses undirected graphs (footnote 1). In other words, we use e_(i,j) to cover the discussion of both a_(i,j) and a_(j,i).
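The local arc selection can be sketched as follows. The function `select` follows the stated return value of SELECT(i, j) and additionally reports which directed arcs were kept, in the spirit of the auxiliary arc table mentioned earlier. The variant `select_single` is our assumption about the case where at most one arc is allowed between two nodes. Both names, and the use of a nested list `w` for the weights, are illustrative choices.

```python
def select(w, i, j):
    """Score and arcs gained by creating the undirected edge between i and j.

    Each direction is decided independently: an arc is kept only if its
    weight is positive, so the score equals max(0, w[i][j]) + max(0, w[j][i]).
    """
    score, arcs = 0.0, []
    for head, dep in ((i, j), (j, i)):
        if w[head][dep] > 0:
            score += w[head][dep]
            arcs.append((head, dep))
    return score, arcs

def select_single(w, i, j):
    """Assumed variant: keep at most one of the two directions."""
    best = max((i, j), (j, i), key=lambda arc: w[arc[0]][arc[1]])
    value = w[best[0]][best[1]]
    return (value, [best]) if value > 0 else (0.0, [])

weights = [[0.0, 2.0],
           [1.5, 0.0]]
print(select(weights, 0, 1))         # both directions are kept
print(select_single(weights, 0, 1))  # only the better direction is kept
```

Because each direction is scored on its own, choosing the direction of one edge never affects the choices made for other edges.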

Sub-problems
We consider six sub-problems when we construct a maximum dependency graph on a given (closed) interval [i, k] ⊆ V of vertices. When we focus on the nodes strictly inside this interval, we use an open interval (i, k) to exclude i and k. See Figure 2 for a graphical visualization. The first five are adapted from Pitler et al. (2013)'s solution for trees, and we introduce a new sub-problem, namely C. Because graphs allow for loops as well as disconnectedness, the sub-problems are simplified to some extent, while a special case of LR becomes prominent. C is thus introduced to represent this special case. The sub-problems are explained as follows.
Int: Int[i, j] represents a partial analysis associated with an interval from i to j inclusively. Int[i, j] may or may not contain edge e_(i,j). To parse a given sentence is equivalent to solving the problem Int[0, n − 1].
L: L[i, j, x] represents a partial analysis associated with an interval from i to j inclusively as well as an external vertex x.
R: R[i, j, x] represents a partial analysis associated with an interval from i to j inclusively as well as an external vertex x.
LR: LR[i, j, x] represents a partial analysis associated with an interval from i to j inclusively as well as an external vertex x.
N: N[i, j, x] represents a partial analysis associated with an interval from i to j inclusively and an external vertex x.
C: C[x, i, a, b] represents a partial analysis associated with an interval from i to max{a, b} inclusively and an external vertex x. Intuitively, C depicts a class of graphs constructed by upper- and lower-plane edges arranged in a staggered pattern. a stands for the last endpoint in the upper plane, and b for the last endpoint in the lower plane.
Footnote 1: The single-head property does not hold. We currently do not consider other constraints on directions, so the prediction of the direction of one edge does not affect the prediction of other edges or their directions. The directions can be assigned locally, and in this way our parser builds directed rather than undirected graphs. Undirected graphs are only used to conveniently illustrate our algorithms. All experimental results in Section 5.2 consider directed dependencies in a standard way. We use the official evaluation tool provided by the SDP2014 shared task. The numeric results reported in this paper are directly comparable to results in other papers.
We give a definition of C. There exists in C[x, i, a, b] a series {s_1, ···, s_m} that fulfills the following constraints:
1. s_1 = i < s_2 < ... < s_m = max{a, b}.
The distinction between C1 and C2 is whether there is one more edge below than above.

Decomposing an Int Sub-problem
Consider an Int[i, j] sub-problem. Assume that k (k ∈ (i, j)) is the farthest vertex that is linked with i, and l = pt(i, k). When j − i > 1, there must be such a k given that we consider maximal 1EC/P2 graphs. There are three cases.
Case 2: l ∈ (k, j). In this case, we can freely add e_(i,l) without violating either the 1EC or the P2 condition. Therefore Case 2 does not lead to any maximal 1EC/P2 graph. Our algorithm does not need to handle this case explicitly, given that its graphs can be derived from solutions to other cases.
Case 3: l ∈ (i, k). Now assume that there is an edge from i to a vertex in (l, k). Consider the farthest vertex that is linked with l, say p (p ∈ (k, j)). We can freely add e_(i,p) without violating the 1EC and P2 restrictions. Similar to Case 2, we do not explicitly deal with this case. If there is no edge from i to any vertex in (l, k), then [i, l], [l, k], [k, j] are R, Int, L respectively. The three external edges are e_(i,k), e_(l,j), and e_(i,j). The decomposition is:

Decomposing an L Sub-problem
If there is no edge from x to any node in (i, j), the graph is reduced to Int[i, j]. If there is one, let k be the vertex farthest from i and adjacent to x. There are two different cases.
[Figure residue: panels "Case 1: l = j" and "Case 3: l ∈ (i, k)"; annotation "Does such a dashed edge exist?"]

Decomposing an R Sub-problem
If there is no edge from x to (i, j), then the graph is reduced to Int[i, j]. If there is one, let k be the farthest vertex from j and adjacent to x. There are two different cases. The decomposition is similar to that of L, so we omit the graphical representation to save space.

Decomposing an N Sub-problem
If there is no edge from x to (i, j), then the graph is reduced to Int[i, j].
2. If there is no k that satisfies the condition in (1), we arrive at a much more difficult case, for which we introduce sub-problem C.
Here we put forward the conclusion:
Lemma 1. Assume that k (k ∈ (i, j)) is the vertex that is adjacent to x and farthest from i. The decomposition for the second case is
Proof. The distinction between Case 1 and 2 implies the following property, which is essential: ∀t ∈ (i, j), ∃ e_(pl,pr)
We can recursively generate a series of length n, {e_(sl_k,sr_k)}, in LR[i, j, x] as follows.
k = 1: Let sl_k = i, sr_k = max{p | p ∈ (i + 1, j) and ∃ e_(i,p)}.
k > 1: For sr_{k−1}, we denote all edges that cover it as e_(pl_1,pr_1), ···, e_(pl_s,pr_s). Note that there is at least one such edge. For any two of these edges, viz. e_(pl_u,pr_u) and e_(pl_v,pr_v), (pl_u, pr_u) ⊂ (pl_v, pr_v) or (pl_v, pr_v) ⊂ (pl_u, pr_u). Otherwise, the P2 property no longer holds due to the interaction among e_(sl_{k−1},sr_{k−1}), e_(pl_u,pr_u) and e_(pl_v,pr_v). Assume (pl_w, pr_w) is the largest one; then we let sl_k = pl_w, sr_k = pr_w. When sr_k = j, the recursion ends.
We are going to prove that if we delete two edges e_(x,sr_{n−1}) and e_(i,j) from LR[i, j, x], the series {sl_1, sl_2, sl_3, ..., sl_{n−2}, sl_{n−1}, sl_n, sr_{n−1}, sr_n} satisfies each and all of the conditions of C1.
Condition 1. Because e_(sl_n,sr_n) covers sr_{n−1}, Condition 1 holds for k = m − 3, m − 2. Consider k ≤ m − 4 = n − 2. Assume that s_{k+1} < s_k; then we have that e_(s_{k+1},sr_{k+1}) is larger than e_(s_k,sr_{k+1}). This is impossible because we select the largest edge in every step.
Condition 2. The LR sub-problem discussed here cannot be reduced to L or R, so there must be two edges from x that respectively cross edges linked to i and j. We are going to prove that the two edges must be e_(x,s_2) and e_(x,sr_{n−1}). Assume that there is e_(x,p), where p ∈ (i, j), p ≠ s_2 and p ≠ sr_{n−1}. If p ∈ (i, s_2), then e_(s_1,s_3) crosses e_(x,p) and e_(s_2,s_4) simultaneously, so 1EC is violated. If p ∈ (s_2, sr_{n−1}), e_(x,p) necessarily crosses some edge e_(s_k,s_{k+2}). Furthermore, i < s_k < s_{k+2} < j. Thus 1EC is violated. If p ∈ (sr_{n−1}, j), the situation is similar to p ∈ (i, s_2).
Condition 4. This condition is easy to verify because (s_k, s_{k+2}) is the largest with respect to sr_k.
All in all, the assumption does not hold and thus satisfies Condition 5.
Condition 6. e_(x,s_1) and e_(x,s_m) are disallowed due to the definition of an LR problem. e_(x,s_{m−1}) and e_(s_1,s_m) are disallowed due to the decomposition.
Condition 7. Due to the existence of e_(x,s_2) and e_(x,sr_{n−1}), there must be two edges, e_(x,p_1) and e_(x,p_2), that cross e_(i,s_2) and e_(sr_{n−1},j) respectively. There must be an odd number of edges in the series {e_(sl_k,sr_k)}, otherwise P2 is violated as in the case shown in Figure 1. In summary, the last condition is satisfied and we have a C1 structure in this LR sub-problem.

Decomposing a C Sub-problem
We illustrate the decomposition using the graphical representations shown in Figure 7.
When a < b: a is the upper-plane endpoint farthest to the right and b is its lower-plane counterpart, so in this case a precedes b (i.e., a is to the left of b). Let C[x, i, a, k] be a C in which the lower-plane endpoint k precedes a. Adding e_(k,b) gives a new C sub-problem whose lower-plane endpoint is preceded by the upper-plane one. The decomposition is then
When a > b and n > 2: the lower-plane endpoint b precedes a. By analogy, this case can be obtained by adding e_(k,a) to C[x, i, k, b]. The decomposition:
When n = 2: we reach the most fundamental case. Only 4 vertices are in the series, namely i, k, b, a. Moreover, there are three edges, e_(x,k), e_(i,b) and e_(k,a), and the interval [i, a] is divided by k and b into three parts. The decomposition is

Soundness and Completeness
The algorithm is sound and complete with respect to 1EC/P2 graphs. We present our algorithm by detailing the decomposition rules. Completeness is obvious because we can decompose any 1EC/P2 graph starting from an Int, use our rules to reduce it into smaller sub-problems, and repeat this procedure. The decomposition rules are also construction rules. When constructing graphs by applying these rules, we never violate the 1EC or P2 restrictions. So our algorithm is sound.

Greedy Search during Construction
There is an important difference between our algorithm and Eisner-style MST algorithms (Eisner, 1996b; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010) for trees as well as Kuhlmann and Jonsson's Maximum Subgraph algorithm for noncrossing graphs. In each construction step, our algorithm allows multiple arcs to be constructed, but whether or not such arcs are added to the target graph depends on their arc weights. In each step, we do a greedy search and decide whether to add a related arc according to local scores. If all arcs are assigned scores that are greater than 0, the output of our algorithm is one of the most complicated 1EC/P2 graphs, that is, a graph to which no arc can be added without violating the 1EC or P2 restriction. For all other aforementioned algorithms, in a single construction step, it is clear whether to add a new arc, and which one. There is no local search.

Spurious Ambiguity
To generate the same graph, even a maximal 1EC/P2 graph, we may have different derivations. Figure 8 is an example. This is similar to syntactic analysis licensed by Combinatory Categorial Grammar (CCG; Steedman, 1996, 2000). To derive one surface string, there usually exist multiple CCG derivations. A common practice in CCG parsing is to define one particular derivation as the standard one, namely the normal form (Eisner, 1996a). The spurious ambiguity in our algorithm does not affect the correctness of first-order parsing, because scores are assigned to individual dependencies, rather than derivation processes. There is no need to distinguish one special derivation here.

Complexity
There are O(n^2) Int sub-problems, each of which takes O(n^2) time to compute. The sub-problems L, R, LR, and N each have O(n^3) elements, with a unit computation time of O(n). C has O(n^4) elements, with a unit computation time of O(n). Therefore the full version of the algorithm runs in time O(n^5) with a space requirement of O(n^4).

A Degenerated Version
We find that the graphical structure involved in the C sub-problem, namely the coupled staggered pattern, is extremely rare in linguistic analysis. If we ignore this special case, we get a degenerated version of the dynamic programming algorithm. This algorithm can find only a strict subset of 1EC/P2 graphs, but it improves efficiency without sacrificing expressiveness in terms of linguistic data. This degenerated version of the algorithm requires O(n^4) time and O(n^3) space.
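The two bounds can be read off by multiplying, for each sub-problem type, the number of table entries by the cost of filling one entry. The following worked summary uses only the counts stated in the complexity analysis; the subscripts "full" and "deg" are our labels for the full and the degenerated algorithm.

```latex
\begin{align*}
T_{\mathrm{full}}(n)
  &= \underbrace{O(n^2)\cdot O(n^2)}_{\mathit{Int}}
   + \underbrace{4\cdot O(n^3)\cdot O(n)}_{L,\ R,\ LR,\ N}
   + \underbrace{O(n^4)\cdot O(n)}_{C}
   = O(n^5), \\
T_{\mathrm{deg}}(n)
  &= \underbrace{O(n^2)\cdot O(n^2)}_{\mathit{Int}}
   + \underbrace{4\cdot O(n^3)\cdot O(n)}_{L,\ R,\ LR,\ N}
   = O(n^4).
\end{align*}
```

The space bounds follow the same pattern: the C table with O(n^4) entries dominates the full algorithm, and once it is dropped the largest tables are those with O(n^3) entries.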

Disambiguation
We extend our quartic-time parsing algorithm into a practical parser. In the context of data-driven parsing, this requires an extra disambiguation model. As with many other parsers, we employ a global linear model. Following Zhang et al. (2016), we define rich features extracted from words, POS tags and pseudo trees. For details we refer to the source code. To estimate parameters, we utilize the averaged perceptron algorithm (Collins, 2002).
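For readers unfamiliar with this training scheme, the sketch below shows a generic averaged perceptron for an arc-factored linear model. It is not the parser's actual implementation: `toy_features` and `toy_decode` are hypothetical stand-ins for the rich feature templates and for the quartic-time decoder, and all names are ours.

```python
from collections import defaultdict

def arc_score(weights, feats):
    """Linear score of one arc: the sum of the weights of its features."""
    return sum(weights.get(f, 0.0) for f in feats)

def train_averaged_perceptron(data, decode, features, epochs=5):
    """data: list of (sentence, gold_arcs) pairs, gold_arcs being a set of arcs."""
    w = defaultdict(float)      # current weights
    total = defaultdict(float)  # running sum of the weights after each step
    steps = 0
    for _ in range(epochs):
        for sent, gold in data:
            pred = decode(sent, w, features)
            for arc in gold - pred:  # missed arcs: promote their features
                for f in features(sent, arc):
                    w[f] += 1.0
            for arc in pred - gold:  # spurious arcs: demote their features
                for f in features(sent, arc):
                    w[f] -= 1.0
            for f, v in w.items():
                total[f] += v
            steps += 1
    return {f: v / steps for f, v in total.items()}

def toy_features(sent, arc):
    """Hypothetical feature templates: word pair and signed distance."""
    head, dep = arc
    return [f"{sent[head]}->{sent[dep]}", f"dist={dep - head}"]

def toy_decode(sent, w, features):
    """Hypothetical unconstrained decoder: keep every positively scored arc."""
    n = len(sent)
    return {(h, d) for h in range(n) for d in range(n)
            if h != d and arc_score(w, features(sent, (h, d))) > 0}

train = [(("a", "b", "c"), {(0, 1), (1, 2)})]
model = train_averaged_perceptron(train, toy_decode, toy_features, epochs=3)
print(toy_decode(("a", "b", "c"), model, toy_features))
```

Averaging the weight vectors over all steps, instead of returning the final one, is what distinguishes the averaged perceptron from the plain perceptron.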

Data
We conduct experiments on unlabeled parsing using four corpora: CCGBank (Hockenmaier and Steedman, 2007), DeepBank, Enju HPSGBank (EnjuBank; Miyao et al., 2004) and Prague Dependency TreeBank. The latter three are the data sets used in SemEval 2014 Task 8 (Oepen et al., 2014), and the data splitting policy follows the shared task. All four data sets are publicly available from LDC (Oepen et al., 2016). Experiments for CCG-grounded analysis were performed using automatically assigned POS tags that are generated by a symbol-refined HMM tagger (Huang et al., 2010). Experiments for the other three data sets used POS tags provided by the shared task. We also use features extracted from pseudo trees. We utilize the Mate parser (Bohnet, 2010) to generate pseudo trees. The pre-processing for CCGBank, DeepBank and EnjuBank is exactly the same as in the experiments reported in Zhang et al. (2016).

Accuracy
We evaluate two parsing algorithms: the algorithm for noncrossing dependency graphs (Kuhlmann and Jonsson, 2015), i.e. pagenumber-1 (denoted as P1) graphs, and our quartic-time algorithm (denoted as 1ECP2_d). Table 2 summarizes the accuracy obtained by our parser. The same feature templates are applied for disambiguation. We can see that our new algorithm yields significant improvements on all data sets, as expected. In particular, due to the improved coverage, the improvement in recall is larger.
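To make the link between coverage and recall concrete, the sketch below computes unlabeled precision, recall and F-score over directed arcs. It is a simplified illustration and not the official SDP2014 scorer, which may differ in details; the function name is ours.

```python
def unlabeled_prf(gold_graphs, pred_graphs):
    """Micro-averaged precision, recall and F-score over directed arcs.

    gold_graphs, pred_graphs: parallel lists of sets of (head, dependent) arcs.
    """
    correct = sum(len(g & p) for g, p in zip(gold_graphs, pred_graphs))
    n_pred = sum(len(p) for p in pred_graphs)
    n_gold = sum(len(g) for g in gold_graphs)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# A parser that cannot produce two of the four gold arcs loses recall,
# even though every arc it does produce is correct.
gold = [{(0, 1), (1, 2), (0, 2), (2, 3)}]
pred = [{(0, 1), (1, 2)}]
print(unlabeled_prf(gold, pred))
```

A graph class that excludes arcs present in the gold annotation caps the achievable recall in exactly this way, which is consistent with recall benefiting from the broader class.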

Comparison with Other Parsers
Our new parser can be taken as a graph-based parser which employs a different architecture from transition-based and factorization-based (Martins and Almeida, 2014; Du et al., 2015a) systems. We compare our parser with the best reported systems in the other two architectures. ZDSW (Zhang et al., 2016) is a transition-based parser, while MA (Martins and Almeida, 2014) and DSW (Du et al., 2015a) are two factorization-based systems. All of them achieve state-of-the-art performance. All results on the test set are shown in Table 3. We can see that our parser, as a graph-based parser, is comparable to state-of-the-art transition-based and factorization-based parsers.

Conclusion and Future Work
In this paper, we explore the strength of the graph-based approach. In particular, we enhance the Maximum Subgraph model with new parsing algorithms for 1EC/P2 graphs. Our work indicates the importance of finding appropriate graph classes that on the one hand are linguistically expressive and on the other hand allow efficient search. Within tree-structured dependency parsing, higher-order factorization that conditions on wider syntactic contexts than arc-factored relationships has proved very useful. The arc-factored model proposed in this paper may be enhanced with higher-order features too. We leave this for future investigation.