Quasi-Second-Order Parsing for 1-Endpoint-Crossing, Pagenumber-2 Graphs

We propose a new Maximum Subgraph algorithm for first-order parsing of 1-endpoint-crossing, pagenumber-2 graphs. Our algorithm has two characteristics: (1) it separates the construction of noncrossing edges and crossing edges; (2) in a single construction step, whether to create a new arc is deterministic. These two characteristics make our algorithm relatively easy to extend to incorporate crossing-sensitive second-order features. We then introduce a new algorithm for quasi-second-order parsing. Experiments demonstrate that second-order features are helpful for Maximum Subgraph parsing.


Introduction
Previous work showed that treating semantic dependency parsing as the search for Maximum Subgraphs is not only elegant in theory but also effective in practice (Kuhlmann and Jonsson, 2015). In particular, our previous work showed that 1-endpoint-crossing, pagenumber-2 (1EC/P2) graphs are an appropriate graph class for modelling semantic dependency structures. On the one hand, this class is expressive enough to cover the majority of semantic analyses. On the other hand, the corresponding Maximum Subgraph problem with an arc-factored disambiguation model can be solved in low-degree polynomial time.
Defining disambiguation models over wider contexts than individual bi-lexical dependencies improves various syntactic parsers in different architectures. This paper studies exact algorithms for second-order parsing of 1EC/P2 graphs. The existing algorithm, viz. our previous algorithm (GCHSW, hereafter), has two properties that make it hard to incorporate higher-order features in a principled way. First, GCHSW does not explicitly consider the construction of noncrossing arcs. We will show that incorporating higher-order factors containing crossing arcs without increasing time and space complexity is extremely hard. An effective strategy is to include only higher-order factors that contain only noncrossing arcs (Pitler, 2014), but this crossing-sensitive strategy is incompatible with GCHSW. Second, all existing higher-order parsing algorithms for projective trees, including those of McDonald and Pereira (2006), Carreras (2007) and Koo and Collins (2010), require that which arcs are created in a construction step be deterministic. This design is also incompatible with GCHSW. In summary, it is not convenient to extend GCHSW to incorporate higher-order features while keeping the same time complexity.
In this paper, we introduce an alternative Maximum Subgraph algorithm for first-order parsing of 1EC/P2 graphs. While keeping the same time and space complexity as GCHSW, our new algorithm has two characteristics that make it relatively easy to extend to incorporate crossing-sensitive, second-order features: (1) it separates the construction of noncrossing edges and possibly crossing edges; (2) whether an edge is created is deterministic in each construction rule. We then introduce a new algorithm to perform second-order parsing. When all second-order scores are greater than or equal to 0, it exactly solves the corresponding optimization problem.
We implement a practical parser with a statistical disambiguation model and evaluate it on four data sets: those used in SemEval 2014 Task 8 (Oepen et al., 2014), and the dependency graphs extracted from CCGbank (Hockenmaier and Steedman, 2007). On all data sets, we find that our second-order parsing models are more accurate than the first-order baseline. If we do not use features derived from syntactic trees, we get an absolute unlabeled F-score improvement of 1.3 on average. When syntactic analysis is used, we get an improvement of 0.4 on average.

Maximum Subgraph Parsing
Semantic dependency parsing can be formulated as the search for a Maximum Subgraph for a graph class G: given a graph G = (V, A), find a subset A' ⊆ A with maximum total score such that the induced subgraph G' = (V, A') belongs to G. Formally, we have the following optimization problem:

\hat{G} = \arg\max_{G^* \in \mathcal{G}(s, G)} \sum_{p \in G^*} s_{part}(s, p)

Here \mathcal{G}(s, G) denotes the set of all graphs that belong to G and are compatible with s and G; G is usually a complete digraph. s_part(s, p) evaluates the event that part p (from a candidate graph G*) is good. We define the order of p according to the number of arcs it contains, in analogy with the terminology of tree parsing. Previous work only discussed the first-order case: if G is the set of noncrossing or 1EC/P2 graphs, the above optimization problem can be solved in cubic time (Kuhlmann and Jonsson, 2015) and quintic time (Cao et al., 2017), respectively. Furthermore, ignoring one linguistically rare structure in 1EC/P2 graphs decreases the complexity to O(n^4). This paper is concerned with second-order parsing, with a special focus on sibling factorizations, i.e. pairs of adjacent arcs that share an endpoint. The objective function then becomes:

\hat{G} = \arg\max_{G^* \in \mathcal{G}(s, G)} \Big( \sum_{a \in G^*} s_{arc}(s, a) + \sum_{p \in SIB(G^*)} s_{sib}(s, p) \Big)
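The Maximum Subgraph problem can be stated operationally with a brute-force sketch. The snippet below is a hypothetical specification, not the dynamic program developed later: it enumerates all arc subsets, keeps those whose induced graph satisfies a class predicate, and returns the highest-scoring one (exponential time, so only usable on toy inputs). The names `maximum_subgraph`, `in_class` and the toy `noncrossing` predicate are illustrative, not from the original.

```python
from itertools import combinations

def maximum_subgraph(vertices, candidate_arcs, score, in_class):
    """Brute-force Maximum Subgraph: among all subsets of candidate arcs
    whose induced graph satisfies `in_class`, return the one with the
    highest total arc score.  Exponential time -- a specification of the
    optimization problem, not the polynomial DP algorithm."""
    best, best_score = [], float("-inf")
    n = len(candidate_arcs)
    for r in range(n + 1):
        for subset in combinations(candidate_arcs, r):
            if not in_class(vertices, subset):
                continue
            total = sum(score(a) for a in subset)
            if total > best_score:
                best, best_score = list(subset), total
    return best, best_score

# Toy usage with the class of noncrossing graphs.
def crosses(e1, e2):
    (a, b), (c, d) = sorted(e1), sorted(e2)
    return a < c < b < d or c < a < d < b

def noncrossing(vertices, arcs):
    return not any(crosses(e1, e2) for e1, e2 in combinations(arcs, 2))

arcs = [(0, 2), (1, 3), (0, 3)]
score = {(0, 2): 1.0, (1, 3): 2.0, (0, 3): 0.5}.get
best, s = maximum_subgraph(range(4), arcs, score, noncrossing)
# (0,2) and (1,3) cross, so the best noncrossing subset keeps (1,3), (0,3).
```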
Definition 1. Edges e_1 and e_2 cross if e_1 and e_2 have distinct endpoints and exactly one of the endpoints of e_1 lies between the endpoints of e_2.
Definition 2. A dependency graph is 1-Endpoint-Crossing if, for any edge e, all edges that cross e share an endpoint p, named the pencil point.
Given a sentence s = w_0 w_1 ... w_{n-1} of length n, the vertices, i.e. words, are indexed with integers; we write an arc from w_i to w_j as a(i,j), and the common endpoint, namely the pencil point, of all edges crossing a(i,j) or a(j,i) as pt(i,j). We denote an edge as e(i,j) if we do not consider its direction. Figure 1 is an example.
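Definitions 1 and 2 translate directly into code. The sketch below (hypothetical helper names, edges as unordered index pairs) implements the crossing test and recovers pt(i,j) by intersecting the endpoint sets of all edges that cross a given edge; it raises an error if the 1EC property is violated.

```python
def crosses(e1, e2):
    """Definition 1: edges cross iff their four endpoints are distinct
    and exactly one endpoint of e1 lies strictly between e2's endpoints."""
    (a, b), (c, d) = sorted(e1), sorted(e2)
    if len({a, b, c, d}) < 4:
        return False
    return (a < c < b) != (a < d < b)

def pencil_point(edge, edges):
    """Return pt(i, j): the endpoint shared by all edges crossing `edge`,
    or None if `edge` is uncrossed.  Raises if the graph is not 1EC.
    With a single crossing edge either endpoint qualifies; an arbitrary
    one is returned."""
    crossing = [e for e in edges if crosses(edge, e)]
    if not crossing:
        return None
    shared = set(crossing[0])
    for e in crossing[1:]:
        shared &= set(e)
    if not shared:
        raise ValueError("not 1-endpoint-crossing")
    return shared.pop()
```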
Definition 3. A graph is pagenumber-k if its arcs can be partitioned into at most k half-planes such that the arcs on each half-plane are noncrossing.
These half-planes may be thought of as the pages of a book, with the vertex line corresponding to the book's spine; the embedding of a graph into such a structure is known as a book embedding. Figure 2 is an example. Pitler et al. (2013) proved that 1-endpoint-crossing trees are a subclass of graphs whose pagenumber is at most 2. In our previous work (Cao et al., 2017), we studied graphs that are constrained to be both 1-endpoint-crossing and pagenumber-2. In this paper, we ignore a complex and linguistically rare structure, named the C structure in our previous paper, and study a subset of 1EC/P2 graphs; Figure 3 is the prototype of C structures. We present new algorithms for finding optimal 1EC/P2, C-free graphs.

Figure 4: To decompose this structure, GCHSW focuses on e(i,j) and e(l,j), because these two edges can be optionally created without violating either the 1EC or the P2 restriction. Our algorithm focuses on the existence of e(i,k), and makes it the only edge constructed by applying the corresponding rule.
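Because the vertex order is fixed by the sentence, the pagenumber-2 property is easy to test: two crossing edges can never share a page, so a valid 2-page assignment exists exactly when the "conflict graph" (edges as nodes, crossings as links) is bipartite. The sketch below (hypothetical function names, not part of the paper's algorithm) checks this by 2-coloring the conflict graph.

```python
from itertools import combinations

def crosses(e1, e2):
    (a, b), (c, d) = sorted(e1), sorted(e2)
    if len({a, b, c, d}) < 4:
        return False
    return (a < c < b) != (a < d < b)

def is_pagenumber_2(edges):
    """For the fixed sentence order, a graph is pagenumber-2 iff its
    conflict graph is bipartite: crossing edges must sit on different
    pages, so a 2-page assignment is exactly a 2-coloring."""
    edges = list(edges)
    page = {}
    adj = {i: [] for i in range(len(edges))}
    for i, j in combinations(range(len(edges)), 2):
        if crosses(edges[i], edges[j]):
            adj[i].append(j)
            adj[j].append(i)
    for start in range(len(edges)):
        if start in page:
            continue
        page[start] = 0
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in page:
                    page[v] = 1 - page[u]
                    stack.append(v)
                elif page[v] == page[u]:
                    return False  # odd crossing cycle: needs a 3rd page
    return True
```

Three pairwise-crossing edges, for instance, force three pages, since their conflict graph is a triangle.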

The GCHSW Algorithm
Cao et al. (2017) designed a polynomial-time Maximum Subgraph algorithm, viz. GCHSW, for 1EC/P2 graphs by exploiting the following property: every subgraph of a 1EC/P2 graph is also a 1EC/P2 graph. GCHSW defines a number of prototype backbones for decomposing a 1EC/P2 graph in a principled way. In each decomposition step, GCHSW focuses on the edges that can be created without violating either the 1EC or the P2 restriction. Sometimes, multiple edges can be created simultaneously in a single step. Figure 4 is an example.
There is an important difference between GCHSW and Eisner-style Maximum Spanning Tree algorithms (MST; Eisner, 1996; McDonald and Pereira, 2006; Koo and Collins, 2010). In each construction step, GCHSW allows multiple arcs to be constructed, but whether or not such arcs are added to the target graph depends on their arc weights. If all arcs are assigned scores greater than 0, the output of the algorithm includes the most complicated 1EC/P2 graphs. In the higher-order MST algorithms, by contrast, it is deterministic in each single construction step whether a new arc is added, and which one; there is no local search. This deterministic strategy is also followed by Kuhlmann and Jonsson's Maximum Subgraph algorithm for noncrossing graphs. Higher-order MST models associate higher-order score functions with the construction of individual dependencies, so the deterministic strategy is a prerequisite for incorporating higher-order features. The design of GCHSW is incompatible with this strategy.
Figure 5: A typical structure of crossing arcs.

Challenge of Second-Order Decoding
It is very difficult to enumerate all higher-order features for crossing arcs. Figure 5 illustrates the idea. There is a pair of crossing arcs, viz. e(x,k) and e(i,j). The key strategy for developing a dynamic programming algorithm that generates such crossing structures is to treat parts of the structure as intervals/spans together with an external vertex (Pitler et al., 2013). Without loss of generality, we assume [i, j] makes up such an interval and x is the corresponding external vertex. When we consider e(i,j), its neighboring edges can be e(i,r_i) and e(l_j,j), and therefore we need to search for the best positions of both r_i and l_j. Because we have already taken three vertices into account, viz. x, i and j, the two new positions increase the time complexity to at least quintic.
Now consider e(x,k). When we decompose the whole graph into the interval [i, j] plus x and the remaining part, we factor out e(x,k) in a successive decomposition for resolving [i, j] plus x. We cannot capture the second-order features associated with e(x,k) and e(x,r_x), because they are in different intervals, and when these intervals are combined, we have already hidden the position information of k. Explicitly encoding k increases the time complexity to at least quintic as well. Pitler (2014) showed that it is still possible to build accurate tree parsers by considering only higher-order features of noncrossing arcs. This is in part because only a tiny fraction of neighboring arcs involve crossing arcs. However, this strategy is not easy to apply to GCHSW, because GCHSW does not explicitly analyze subgraphs of noncrossing arcs.

A New Maximum Subgraph Algorithm
Based on the discussion in Sections 2.3 and 2.4, we can see that it is not easy to extend the existing algorithm, viz. GCHSW, to handle second-order features. In this paper, we propose an alternative first-order dynamic programming algorithm. Because ignoring the linguistically rare C structure decreases the complexity of GCHSW, we exclude this structure in our algorithm. Formally, we introduce a new algorithm to solve the following optimization problem:

\hat{G} = \arg\max_{G^* \in \mathcal{G}(s, G)} \sum_{a \in G^*} s_{arc}(s, a)

where \mathcal{G} is the class of 1EC/P2, C-free graphs. Our algorithm has the same time and space complexity as the degenerate version of GCHSW. We represent our algorithm using undirected graphs.

Figure 6: Graphical representations of sub-problems. Gray curves mean the corresponding edge is not constructed in this sub-problem, but should be included in the final generated graph.

Figure 7: A dynamic program to find optimal 1EC/P2, C-free graphs with arc-factored weights.

Sub-problems
Following GCHSW, we consider five sub-problems when we construct a maximum dependency graph on a given interval [i, k]. The sub-problems themselves follow those introduced by GCHSW, but our decomposition rules differ.

Decomposing Sub-problems

Figure 7 gives a sketch of our dynamic programming algorithm. We give a detailed illustration for Int, a rough idea for L and LR, and omit the other sub-problems. More details about the whole algorithm can be found in the supplementary note.

Decomposing an Int Sub-problem
Let k be the farthest vertex that is adjacent to i, and let x = pt(i, k). If there is no such k (i.e. there is no arc from i to any other node in this interval), we denote k as ∅, and similarly for x. We illustrate the different cases below and give a graphical representation in Figure 8.
Because there may be an edge e(i,j), we add one more rule; we do not need to create e(i,j) in all cases.

Decomposing an LR Sub-problem
LR[i, j, x] means i or j is the pencil point of edges from x into (i, j). We show the decomposition of LR[i, j, x] as follows.

Case b: If there is no such vertex k, there must be edges from [i, k') to (k', j] for every k' in (i, j), without considering e(i,j). For i + 1, we assume e(i,a_1) is the farthest edge that goes from i. For a_1, we assume e(b_1,b_2) is the farthest edge from b_1, where b_1 is in (i, a_1) and b_2 is in (a_1, j). For b_2, we assume e(a_1,a_3) is the farthest edge from a_1, where a_3 is in (b_2, j) and a_1 is the pencil point. We then get the series {a_1, a_2, a_3, ..., a_n} and {b_1, b_2, ..., b_m}, with max(a_n, b_m) = j. If b_m = j, we get a graph like Figure 10. If e(x,b_1) exists, this LR sub-problem degenerates to an L sub-problem. If e(x,a_n) exists, this sub-problem degenerates to an R sub-problem.
If a_n = j, we get a graph like Figure 11. If only e(x,b_1) or only e(x,b_m) exists, we can solve it as in the case b_m = j. If both exist, this is a typical C structure like Figure 3, and we cannot obtain it through any other decomposition.
The above discussion gives a rough idea of the correctness of the following conclusion.
Theorem 1. Our new algorithm is sound and complete with respect to 1EC/P2, C-free graphs.

Spurious Ambiguity
An LR, L, R or N sub-problem allows crossing arcs to be built, but does not necessarily create them. For example, L_C[i, j, x] allows e(i,j) to cross e(x,y) (y ∈ (i, j)). Because every subgraph of a 1EC/P2 graph is also a 1EC/P2 graph, we allow an L_C[i, j, x] to degenerate directly to Int_O[i, j]. In this way, we make sure that all subgraphs can be constructed by our algorithm. Figure 12 shows the rough idea: to generate the same graph, we have different derivations. The spurious ambiguity in our algorithm does not affect the correctness of first-order parsing, because scores are assigned to individual dependencies rather than derivation processes; there is no need to distinguish one special derivation.

Quasi-Second-Order Extension
We propose a second-order extension of our new algorithm. We focus on the factorizations introduced in Section 2.1. In particular, the two arcs in a factor should not cross any other arcs. Formally, we introduce a new algorithm to solve the optimization problem with the following objective:

\hat{G} = \arg\max_{G^* \in \mathcal{G}(s, G)} \Big( \sum_{a \in G^*} s_{arc}(s, a) + \sum_{p \in SIB(G^*)} s_{sib}(s, p) \Big)

In the first-order algorithm, all noncrossing edges can be constructed as the frontier edge of an Int_C. So we can develop an exact decoding algorithm by modifying the decomposition for Int_C while keeping intact the decompositions for LR, N, L and R.

Figure 12: Illustration of spurious ambiguity. The two solid curves represent two arcs in the target graph, but not the dashed one. Excluding crossing edges leads to the first derivation: Int_C[a, e] ⇒ e(a,e) + Int_C[a, c] + Int_O[c, e] + e(a,c). Assuming that a pair of crossing arcs may exist yields another derivation.

New Decomposition for Int C
In order to capture second-order features from noncrossing neighbors, we need to find the rightmost node adjacent to i, denoted r_i, and the leftmost node adjacent to j, denoted l_j, where i < r_i ≤ l_j < j. To do this, we split Int_C[i, j] into at most three parts to capture the sibling factors. Denote the score of adjacent edges e(i,j_1) and e(i,j_2) as s_2(i, j_1, j_2). When j is the innermost node adjacent to i, we denote the score as s_2(i, ∅, j). We give a sketch of the decomposition in Figure 14 and a graphical representation in Figure 13. The following is a rough illustration.
Case a: r_i = ∅. We further distinguish three sub-cases.
(a.1) l_j = ∅ too. Both sides form the innermost second-order factor.
(a.2) There is a crossing arc from j. This case is handled in the same way as in the first-order algorithm.
(a.3) l_j ≠ ∅. We introduce a new decomposition rule.

Case b: There is a crossing arc from i.
(b.2) There is a crossing arc from j. Similar to (a.2).
(b.3) There is a noncrossing arc from j. We introduce a new rule to calculate SIB(j, l_j, i).

Case c: There is a noncrossing arc from i.
(c.2) There is a crossing arc from j. Similar to (b.3).
(c.3) There is a noncrossing arc from j too. We introduce a new rule to calculate SIB(i, r_i, j) and SIB(j, l_j, i).
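The sibling factors s_2(i, j_1, j_2) that the cases above accumulate can be illustrated with a simple enumeration. The sketch below is a simplification, with hypothetical names: it walks each head's neighbors outward on each side and pairs adjacent ones, using None for the empty sibling ∅; unlike the actual algorithm, it ignores the restriction that both arcs in a factor be noncrossing.

```python
def sibling_factors(n, edges):
    """Enumerate adjacent-sibling factors (i, j1, j2) on each side of
    every vertex i, with None standing in for the empty sibling of the
    innermost neighbor, as in s_2(i, None, j)."""
    # Simplified representation: edges as unordered (u, v) pairs.
    neighbors = {v: [] for v in range(n)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    factors = []
    for i in range(n):
        right = sorted(x for x in neighbors[i] if x > i)
        left = sorted((x for x in neighbors[i] if x < i), reverse=True)
        for side in (right, left):
            prev = None  # innermost neighbor pairs with the empty sibling
            for j in side:
                factors.append((i, prev, j))
                prev = j
    return factors
```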

Complexity
The complexity of both the first- and second-order algorithms can be analyzed in the same way. The sub-problem Int has O(n^2) entries, each computed in at most O(n^2) time. The sub-problems L, R, LR and N each have O(n^3) entries, with O(n) computation per entry. Therefore both algorithms run in O(n^4) time with an O(n^3) space requirement.

Discussion
A traditional second-order model takes as its objective function \sum_{p \in SIB(G^*)} s_{sib}(p). Our model instead tries to optimize \sum_{p \in SIB(G^*)} \max(s_{sib}(p), 0). This model is somewhat inadequate, given that the second-order score function cannot penalize a bad factor: when a negative score is assigned to a second-order factor, our algorithm treats it as 0. This inadequacy is due to the spurious ambiguity problem illustrated in Section 3.3. Take the two derivations in Figure 12 as an example. The derivation that starts with Int_C[a, e] ⇒ Int_C[a, c] + Int_O[c, e] incorporates the second-order score s_sib(a, c, e). The situation is different for the other derivation: because we temporarily assume that e(a,c) crosses other arcs, we do not consider s_sib(a, c, e). We can see from this example that second-order scores not only depend on the derived graphs but are also sensitive to the derivation processes.
If a second-order score is greater than 0, our algorithm selects a derivation that takes it into account, since it increases the total score. If a second-order score is negative, our algorithm avoids including it by selecting another derivation. In other words, our algorithm treats the score as 0.
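The objective the quasi-second-order decoder effectively optimizes can be written out directly. The snippet below (hypothetical function name and toy scores, mirroring the Figure 12 example) makes the clipping behavior explicit: arc scores are kept as-is, while each sibling score contributes max(s_sib, 0).

```python
def quasi_second_order_objective(arcs, sib_factors, s_arc, s_sib):
    """Score actually optimized by the quasi-second-order decoder:
    arc scores as-is, sibling scores clipped at 0 -- a negative factor
    is dodged via an alternative derivation, never used as a penalty."""
    first = sum(s_arc[a] for a in arcs)
    second = sum(max(s_sib[p], 0.0) for p in sib_factors)
    return first + second

# Toy example: the negative sibling factor (a, c, e) is treated as 0,
# so the total is 1.0 + 2.0 + 0.0 rather than 1.0 + 2.0 - 0.5.
s_arc = {("a", "c"): 1.0, ("a", "e"): 2.0}
s_sib = {("a", "c", "e"): -0.5}
value = quasi_second_order_objective(s_arc, s_sib, s_arc, s_sib)
```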
Figure 13: Decomposition for Int_C[i, j] in the second-order parsing algorithm.

Derivation-Sensitive Training
We extend our quartic-time parsing algorithm into a practical parser. In the context of data-driven parsing, this requires an extra disambiguation model. As with many other parsers, we employ a global linear model. Following the experience of Zhang et al. (2016), we define rich features extracted from words, POS-tags and pseudo trees. To estimate parameters, we utilize the averaged perceptron algorithm (Collins, 2002).
Our training procedure is sensitive to derivations rather than derived graphs. For each sentence, we first apply our algorithm to find the optimal predicted derivation. Then we collect all first- and second-order factors from this derivation to update the parameters. For the first-order model, because our algorithm includes all factors, viz. dependencies, there is no difference between our derivation-based method and a traditional derived-structure-based method. For the second-order model, our method increases the second-order scores to some extent.
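One step of this derivation-sensitive training can be sketched as a structured-perceptron update over the collected factors. The feature names below are hypothetical, and weight averaging (Collins, 2002) is omitted for brevity: gold factors are rewarded, factors from the predicted derivation are penalized, and factors present in both cancel out.

```python
from collections import defaultdict

def perceptron_update(weights, gold_feats, pred_feats):
    """One perceptron step: reward first- and second-order factors
    collected from the gold graph, penalize those from the best
    predicted derivation.  Feature dicts map feature name -> count."""
    for f, v in gold_feats.items():
        weights[f] += v
    for f, v in pred_feats.items():
        weights[f] -= v

# Toy usage with made-up factor features.
weights = defaultdict(float)
gold = {"arc:saw->dog": 1, "sib:saw:dog,cat": 1}
pred = {"arc:saw->dog": 1, "arc:dog->cat": 1}
perceptron_update(weights, gold, pred)
# The shared factor cancels; only the disagreements move the weights.
```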

Experiments

• Following the previous experimental setup for English CCG parsing, we use sections 02-21 as training data, section 00 as development data, and section 23 for testing.
• The DeepBank, Enju HPSGBank and Prague Dependency TreeBank are from SemEval 2014 Task 8 (Oepen et al., 2014), and the data splitting policy follows the shared task.
Experiments for the CCG-grounded analysis were performed using automatically assigned POS-tags generated by a symbol-refined HMM tagger (Huang et al., 2010). Experiments for the other three data sets used the POS-tags provided by the shared task. We also use features extracted from pseudo trees, which we generate with the Mate parser (Bohnet, 2010). All experimental results consider directed dependencies in the standard way. We report Unlabeled Precision (UP), Recall (UR) and F-score (UF), calculated using the official evaluation tool of the SemEval 2014 shared task.

Accuracy

Table 1 lists the accuracy of our system. The output of our parser was evaluated against each dependency in the corpus. We can see that the first-order parser obtains considerably good accuracy with rich syntactic features. Furthermore, the introduction of higher-order features improves parsing substantially on all data sets, as expected. When syntactic trees are utilized, the improvement is smaller but still significant on the three SemEval data sets.

Table 2: Parsing accuracy evaluated on the test sets. "SJW" denotes the book embedding parser introduced in previous work.

We compare our parser with the SJW system in three respects:

1. Both systems have explicit control of the output structures. While Sun et al.'s system constrains the output graph to be P2 only, our system adds an additional 1EC restriction.

2. Their system's second-order features also include both-side neighboring features.

3. Their system uses beam search and dual decomposition and is therefore approximate, while ours performs exact decoding.

Our purely Maximum Subgraph parser obtains better results on DeepBank and CCGBank, while the book embedding parser is better on the other two data sets.

Analysis
Our algorithm is sensitive to the derivation process and may exclude a number of negative second-order scores by selecting alternative derivations. Nevertheless, our algorithm works in an exact way to include all positive second-order scores. Table 3 shows the coverage of all second-order factors. On average, 99.67% of second-order factors are calculated by our algorithm. This relatively satisfactory coverage suggests that our algorithm is very effective at including second-order features; only a very small portion is dropped.

Conclusion
This paper proposed two exact, graph-based algorithms for 1EC/P2 parsing with first-order and quasi-second-order scores. The resulting parser has the same asymptotic run time as the GCHSW algorithm. An exploration of other factorizations that facilitate semantic dependency parsing may be an interesting avenue for future work. Recent work has investigated faster decoding for higher-order graph-based projective parsing, e.g. vine pruning (Rush and Petrov, 2012) and cube pruning (Zhang and McDonald, 2012). It would be interesting to extend these lines of work to decrease the complexity of our quartic algorithm.