Semantic Dependency Parsing via Book Embedding

We model a dependency graph as a book, a particular kind of topological space, for semantic dependency parsing. The spine of the book is made up of a sequence of words, and each page contains a subset of noncrossing arcs. To build a semantic graph for a given sentence, we design new Maximum Subgraph algorithms to generate noncrossing graphs on each page, and a Lagrangian Relaxation-based algorithm to combine pages into a book. Experiments demonstrate the effectiveness of the book embedding framework across a wide range of conditions. Our parser obtains results comparable with those of a state-of-the-art transition-based parser.


Introduction
Dependency analysis provides a lightweight and effective way to encode syntactic and semantic information of natural language sentences. One of its branches, syntactic dependency parsing (Kübler et al., 2009), has been an extremely active research area, with high-performance parsers being built and applied in practical NLP. Semantic dependency parsing, however, has only been addressed in the literature recently (Oepen et al., 2014, 2015; Du et al., 2015; Zhang et al., 2016; Cao et al., 2017).
Semantic dependency parsing employs a graph-structured semantic representation. On the one hand, it is flexible enough to provide analysis for various semantic phenomena (Ivanova et al., 2012). This very flexibility, on the other hand, brings along new challenges for designing parsing algorithms. For graph-based parsing, no previously defined Maximum Subgraph algorithm has both high coverage and low-degree polynomial complexity. For transition-based parsing, no principled decoding algorithm, e.g. dynamic programming (DP), has been developed for existing transition systems.
In this paper, we borrow the idea of book embedding from graph theory, and propose a novel framework to build parsers for flexible dependency representations. In graph theory, a book is a kind of topological space that consists of a spine and a collection of one or more half-planes. In our "book model" of semantic dependency graph, the spine is made up of a sequence of words, and each half-plane contains a subset of dependency arcs. In particular, the arcs on each page compose a noncrossing dependency graph, a.k.a. planar graph. Though a dependency graph in general is very flexible, its subgraph on each page is rather regular. Under the new perspective, semantic dependency parsing can be cast as a two-step task: Each page is first analyzed separately, and then all the pages are bound coherently.
Our work is motivated by the extant low-degree polynomial time algorithm for first-order Maximum Subgraph parsing for noncrossing dependency graphs (Kuhlmann and Jonsson, 2015). We enhance existing work with new exact second-order and approximate higher-order algorithms. Our algorithms make it possible to build, with high accuracy, the partial semantic dependency graphs on each page. To produce a full semantic analysis, we also need to integrate the partial graphs on all pages into one coherent book. To this end, we formulate the problem as a combinatorial optimization problem, and propose a Lagrangian Relaxation-based algorithm to solve it.
We implement a practical parser in the new framework with a statistical disambiguation model. We evaluate this parser on four data sets: those used in SemEval 2014 Task 8 (Oepen et al., 2014), and the dependency graphs extracted from CCGbank (Hockenmaier and Steedman, 2007). On all data sets, we find that our higher-order parsing models are more accurate than the first-order baseline. Experiments also demonstrate the effectiveness of our page binding algorithm. Our new parser can be taken as a graph-based parser extended for more general dependency graphs. It parallels the state-of-the-art transition-based system of Zhang et al. (2016) in performance.
Figure 1: A fragment of a semantic dependency graph.

Semantic Dependency Graphs
A dependency graph $G = (V, A)$ is a labeled directed graph for a sentence $s = w_1, \ldots, w_n$. The vertex set $V$ consists of $n$ vertices, each of which corresponds to a word and is indexed by an integer. The arc set $A$ represents the labeled dependency relations of the particular analysis $G$. Specifically, an arc, viz. $a_{(i,j,l)}$, represents a dependency relation $l$ from head $w_i$ to dependent $w_j$.
Semantic dependency parsing is the task of mapping a natural language sentence into a formal meaning representation in the form of a dependency graph. Figure 1 shows a graph fragment of a noun phrase. This semantic graph is grounded on Combinatory Categorial Grammar (CCG; Steedman, 2000), and can be taken as a proxy for predicate-argument structure. The graph includes most semantically relevant non-anaphoric local (e.g. from "wants" to "Mark") and long-distance (e.g. from "buy" to "company") dependencies.

Maximum Subgraph Parsing
Usually, syntactic dependency analysis employs tree-shaped representations. Dependency parsing, thus, can be formulated as the search for a maximum spanning tree (MST) of an arc-weighted graph. For semantic dependency parsing, where the target representations are not necessarily trees, Kuhlmann and Jonsson (2015) proposed to generalize the MST model to other types of subgraphs. In general, dependency parsing is formulated as the search for a Maximum Subgraph for a graph class $\mathcal{G}$: Given a graph $G = (V, A)$, find a subset $A' \subseteq A$ with maximum total weight such that the induced subgraph $G' = (V, A')$ belongs to $\mathcal{G}$. Formally, we have the following optimization problem:
$$G'(s) = \arg\max_{H \in \mathcal{G}(s, G)} \sum_{p \in H} \mathrm{SCOREPART}(s, p)$$
Here, $\mathcal{G}(s, G)$ is the set of all graphs that belong to $\mathcal{G}$ and are compatible with $s$ and $G$. For parsing, $G$ is usually a complete graph. $\mathrm{SCOREPART}(s, p)$ evaluates the event that a small subgraph $p$ of a candidate graph $H$ is good. We define the order of a part according to the number of dependencies it contains, in analogy with the terminology of tree parsing. Previous work only discussed the first-order case for Maximum Subgraph parsing (Kuhlmann and Jonsson, 2015). In this paper, we are also interested in higher-order parsing, with a special focus on factorizations over pairs of neighboring edges (single-side and both-side factors) and, in the approximate models, larger tuples of edges.
If $\mathcal{G}$ is the set of projective trees or noncrossing graphs, the first-order Maximum Subgraph problem can be solved in cubic time (Eisner, 1996; Kuhlmann and Jonsson, 2015). Unfortunately, these two graph classes are not expressive enough to encode semantic dependency graphs. Moreover, this problem is NP-hard for several well-motivated graph classes, including acyclic or 2-planar graphs, even if one only considers first-order factorization. The lack of appropriate decoding algorithms is a major challenge for semantic dependency parsing.
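To make the optimization problem concrete, the following is a minimal sketch of first-order Maximum Subgraph search by exhaustive enumeration, with the class of noncrossing graphs as the target class. It is only an illustration for toy inputs (it is exponential in the number of arcs) and is not one of the algorithms proposed in this paper; the names `crosses`, `is_noncrossing` and `max_subgraph`, and the arc weights in the example, are hypothetical.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a=(i, j) and b=(k, l) cross when drawn above the sentence."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def is_noncrossing(arcs):
    """Membership test for the graph class: no two arcs cross."""
    return not any(crosses(a, b) for a, b in combinations(arcs, 2))


def max_subgraph(weights, in_class=is_noncrossing):
    """Exhaustive first-order Maximum Subgraph search.

    weights maps each candidate arc (i, j) to its score; the result is the
    highest-scoring arc subset accepted by in_class (the empty graph scores 0).
    """
    arcs = list(weights)
    best, best_score = frozenset(), 0
    for size in range(1, len(arcs) + 1):
        for subset in combinations(arcs, size):
            score = sum(weights[a] for a in subset)
            if score > best_score and in_class(subset):
                best, best_score = frozenset(subset), score
    return best, best_score


# Arcs (0, 2) and (1, 3) cross, so they cannot both be selected.
toy = {(0, 2): 3, (1, 3): 2, (0, 1): 1, (2, 3): -1}
graph, score = max_subgraph(toy)
```

In the toy example the unconstrained optimum would take both crossing arcs; restricting the search to noncrossing graphs forces a choice between them, which is exactly the coverage limitation discussed in the text.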

Book Embedding
This section introduces the basic idea of book embedding from a graph-theoretical point of view.
Definition 1. A book is a kind of topological space that consists of a line, called the spine, together with a collection of one or more half-planes, called the pages, each having the spine as its boundary.
Definition 2. A book embedding of a finite graph G onto a book B satisfies the following conditions.
1. Every vertex of G is depicted as a point on the spine of B.
2. Every edge of G is depicted as a curve that lies within a single page of B.
3. No page of B has any edge crossings.
Figure 2: Book embedding for the graph in Figure 1. Arcs are assigned to two pages.
A book embedding separates a graph into several subgraphs, each of which contains all vertices, but only a subset of arcs that do not cross each other. This kind of graph is named noncrossing dependency graph by Kuhlmann and Jonsson (2015) and planar by Titov et al. (2009), Gómez-Rodríguez and Nivre (2010) and many others.
We can formalize a semantic dependency graph as a book. Take the graph in Figure 1 for example. We can separate the edges into two sets and take each set as a single page, as shown in Figure 2.
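As an illustration of Definition 2 with the spine fixed to the word order, the sketch below checks whether a given assignment of arcs to pages is a valid book embedding. The function names and the three-arc example are hypothetical (they are not the graph of Figure 1); arcs are assumed to be pairs (i, j) of word positions with i < j.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a=(i, j) and b=(k, l) cross when drawn on the same page."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def is_book_embedding(arcs, pages):
    """Check conditions 2 and 3 of Definition 2 for a fixed word order.

    pages is a list of arc sets. Every arc must lie on exactly one page,
    and no page may contain two crossing arcs.
    """
    placed = sorted(a for page in pages for a in page)
    if placed != sorted(set(arcs)):
        return False
    return all(
        not crosses(a, b) for page in pages for a, b in combinations(page, 2)
    )


arcs = [(0, 2), (1, 3), (0, 1)]
two_pages = [{(0, 2), (0, 1)}, {(1, 3)}]   # valid: the crossing pair is separated
one_page = [set(arcs)]                     # invalid: (0, 2) and (1, 3) cross
```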
Empirically, a semantic dependency graph is sparse enough that it can usually be embedded onto a very thin book. To measure the thickness, we use the pagenumber, which is defined as follows. Definition 3. The book pagenumber of G is the minimum number of pages required for a book embedding of G.
We look into the pagenumber of graphs on four linguistic graph banks (as defined in Section 5). These corpora are also used for training and evaluating our data-driven parsers. The pagenumbers are calculated using sentences in the training sets. Table 1 lists the percentages of complete graphs that can be accounted for by books of different thickness. The percentages of noncrossing graphs, i.e. graphs that have pagenumber 1, vary between 48.23% and 78.26%. The practical usefulness of algorithms for computing maximum noncrossing graphs is therefore limited by this relatively low coverage.
The class of graphs with pagenumber no more than two has a considerably satisfactory coverage. It can account for more than 98% of the graphs, and sometimes close to 100%, in each data set. Unfortunately, the power of Maximum Subgraph parsing is limited given that finding the maximum acyclic subgraph when pagenumber is at most k is NP-hard for k ≥ 2 (Kuhlmann and Jonsson, 2015). As an alternative, we propose to model a semantic graph as a book, in which the spine is made up of a sequence of words, and each half-plane contains a subset of dependency arcs. To build a semantic graph for a given sentence, we design new parsing algorithms to generate noncrossing graphs on each page (Section 3), and a Lagrangian Relaxation-based algorithm to integrate pages into a book (Section 4).
Table 1: Coverage in terms of complete graphs with respect to different pagenumbers ("PN" for short). "DM," "PAS," "CCD" and "PSD" are short for DeepBank, Enju HPSGBank, CCGBank and Prague Dependency Treebank.

Maximum Subgraph for Noncrossing Graphs
We introduce several DP algorithms for calculating maximum noncrossing dependency graphs. Each algorithm visits all the spans from bottom to top, finding the best combination of smaller structures to form a new structure, according to the scores of first- or higher-order features. For the sake of conciseness, we focus on undirected graphs and treat the direction of linguistic dependencies as edge labels (see footnote 1). We will use $e_{(i,j,l)}$ ($i < j$), or simply $e_{(i,j)}$, to indicate an edge in either direction between $i$ and $j$.
For the sake of formal concision, we present the algorithms as computing the maximum score of a subgraph. Extracting the corresponding optimal graphs can be done in a number of ways. For example, we can maintain an auxiliary arc table which is populated in parallel with the procedure of obtaining maximum scores. We define two score functions: (1) $s_{fst}(s, e, l)$ assigns a score to an individual edge $e_{(s,e,l)}$ and (2) $s_{scd}(s, e_1, e_2, l_1, l_2)$ assigns a score to a pair of neighboring edges $e_{(s,e_1,l_1)}$ and $e_{(s,e_2,l_2)}$.
Footnote 1: The single-head property does not hold. We currently do not consider other constraints on directions. Thus the predicted direction of one edge does not affect the prediction of other edges or their directions. The directions can be assigned locally, and in this way our parser builds directed rather than undirected graphs. Undirected graphs are only used to conveniently illustrate our algorithms. All experimental results in Section 5 consider directed dependencies in a standard way. We use the official evaluation tool provided by the SDP2014 shared task. The numeric results reported in this paper are directly comparable to results in other papers.

First-Order Factorization
Given a sentence, we define two DP tables, namely $O[s, e]$ and $C[s, e, l]$, which hold the values of the highest-scoring noncrossing graphs that span the words from $s$ to $e$. The two tables correspond to two sub-problems, as graphically shown in Figure 3: in the open sub-problem $O[s, e]$, the edge $e_{(s,e)}$ may or may not exist, while in the closed sub-problem $C[s, e, l]$, $e_{(s,e)}$ exists and carries label $l$. Their decompositions are explained below.
$O[s, e]$ can be obtained by one of the following combinations:
• $C[s, e, l]$ ($l \in L$), if $e_{(s,e)}$ exists with label $l$;
• $C[s, k, l] + O[k, e]$ ($l \in L$, $s < k < e$), if $e_{(s,e)}$ does not exist and there is an edge with label $l$ between $s$ and some node in this span; $k$ is the farthest node linked to $s$;
• $O[s + 1, e]$, if $e_{(s,e)}$ does not exist and $s$ has no edge to its right in this span.
$C[s, e, l]$ can be obtained by one of the following combinations:
• $O[s + 1, e] + s_{fst}(s, e, l)$, if $s$ has no other edge to its right;
• $C[s, k, l'] + O[k, e] + s_{fst}(s, e, l)$ ($l' \in L$, $s < k < e$), if there is an edge from $s$ to some node in the span.
Each edge has two possible directions; we encode the direction into the label $l$ and treat the edge as undirected. We need to search for a best split and a best label for every span, so the time complexity of the algorithm is $O(n^3 |L|)$, where $n$ is the length of the sentence and $L$ is the set of labels.
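The open/closed decomposition can be sketched as a memoized recursion. This is a minimal illustration rather than the parser's implementation: it assumes that the open table follows the three cases "edge between s and e exists", "s is linked to a farthest inner node k", and "s has no edge to its right", it returns only the maximum score (no back-pointers), and it makes no attempt to reach the stated asymptotic bound. The function name and the `s_fst` callback signature are hypothetical.

```python
from functools import lru_cache


def first_order_max_noncrossing(n, labels, s_fst):
    """Maximum score of a noncrossing graph over words 0..n-1.

    labels is an iterable of hashable labels (direction folded into the label);
    s_fst(s, e, l) scores the undirected edge between s and e with label l.
    """
    labels = tuple(labels)

    @lru_cache(maxsize=None)
    def O(s, e):
        # Open problem: the edge between s and e may or may not exist.
        if s >= e:
            return 0
        best = O(s + 1, e)                    # s has no edge to its right
        for l in labels:
            best = max(best, C(s, e, l))      # the edge (s, e) exists with label l
            for k in range(s + 1, e):         # k is the farthest node linked to s
                best = max(best, C(s, k, l) + O(k, e))
        return best

    @lru_cache(maxsize=None)
    def C(s, e, l):
        # Closed problem: the edge (s, e) exists and carries label l.
        inner = O(s + 1, e)                   # s has no other edge in the span
        for k in range(s + 1, e):
            for l2 in labels:
                inner = max(inner, C(s, k, l2) + O(k, e))
        return inner + s_fst(s, e, l)

    return O(0, n - 1)


# Toy scores: only the edges (0, 2) and (1, 3) are attractive, and they cross.
toy = {(0, 2): 3, (1, 3): 2}
best = first_order_max_noncrossing(4, ["x"], lambda s, e, l: toy.get((s, e), -1))
```

Because edges with negative scores can simply be left out, the empty graph (score 0) is always a candidate, which is why the recursion bottoms out at 0.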

Second-Order Single Side Factorization
We propose a new algorithm for single-side second-order factorization. The DP tables, as well as the decomposition for the open problem, are the same as in the first-order factorization. The decomposition of $C[s, e, l]$ is very different. In order to score second-order features from adjacent edges on the same side, which are similar to sibling features for tree parsing (McDonald and Pereira, 2006), we need to find the rightmost node adjacent to $s$, denoted as $r_s$, and the leftmost node adjacent to $e$, denoted as $l_e$, where $s < r_s \le l_e < e$. Sometimes we split $C[s, e, l]$ into three parts to capture the neighbor factors on both endpoints. In summary, $C[s, e, l]$ can be obtained by one of the following combinations (as graphically shown in Figure 4):
• $O[s + 1, e - 1] + s_{fst}(s, e, l) + s_{scd}(s, nil, e, nil, l) + s_{scd}(e, nil, s, nil, l)$, if there is no edge from $s$ or $e$ to any node in the span.
• $C[s, r_s, l'] + O[r_s, e - 1] + s_{fst}(s, e, l) + s_{scd}(s, r_s, e, l', l) + s_{scd}(e, nil, s, nil, l)$ ($s < r_s < e$), if there is no edge from $e$ to any node in the span.
• $O[s + 1, l_e] + C[l_e, e, l'] + s_{fst}(s, e, l) + s_{scd}(e, l_e, s, l', l) + s_{scd}(s, nil, e, nil, l)$ ($s < l_e < e$), if there is no edge from $s$ to any node in the span.
• $C[s, r_s, l'] + O[r_s, l_e] + C[l_e, e, l''] + s_{fst}(s, e, l) + s_{scd}(s, r_s, e, l', l) + s_{scd}(e, l_e, s, l'', l)$ ($s < r_s \le l_e < e$), otherwise.
For the last combination, we need to search for two best separating words, namely $r_s$ and $l_e$, and two best labels, namely $l'$ and $l''$, so the time complexity of this second-order algorithm is $O(n^4 |L|^2)$.
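The single-side second-order decomposition can be sketched in the same memoized style as the first-order recursion. The three-part case is written here as $C[s, r_s, l'] + O[r_s, l_e] + C[l_e, e, l'']$ plus the two neighbor factors, which is an assumption inferred from the description of the "last combination"; `None` plays the role of nil. The function name and callback signatures are hypothetical, and the sketch returns only the maximum score without trying to match the stated complexity.

```python
from functools import lru_cache


def second_order_max_noncrossing(n, labels, s_fst, s_scd):
    """Maximum score with first-order and single-side second-order factors.

    s_fst(s, e, l) scores one edge; s_scd(a, m, b, l1, l2) scores the pair of
    neighboring edges (a, m, l1) and (a, b, l2), with m = l1 = None for nil.
    """
    labels = tuple(labels)

    @lru_cache(maxsize=None)
    def O(s, e):
        # Open problem, decomposed exactly as in the first-order algorithm.
        if s >= e:
            return 0
        best = O(s + 1, e)
        for l in labels:
            best = max(best, C(s, e, l))
            for k in range(s + 1, e):
                best = max(best, C(s, k, l) + O(k, e))
        return best

    @lru_cache(maxsize=None)
    def C(s, e, l):
        nil_s = s_scd(s, None, e, None, l)   # s has no neighbor inside the span
        nil_e = s_scd(e, None, s, None, l)   # e has no neighbor inside the span
        inner = range(s + 1, e)
        best = O(s + 1, e - 1) + nil_s + nil_e
        for rs in inner:                     # only s has an inner neighbor r_s
            for l1 in labels:
                best = max(best, C(s, rs, l1) + O(rs, e - 1)
                           + s_scd(s, rs, e, l1, l) + nil_e)
        for le in inner:                     # only e has an inner neighbor l_e
            for l2 in labels:
                best = max(best, O(s + 1, le) + C(le, e, l2)
                           + nil_s + s_scd(e, le, s, l2, l))
        for rs in inner:                     # both endpoints have inner neighbors
            for le in range(rs, e):
                mid = O(rs, le)
                for l1 in labels:
                    for l2 in labels:
                        best = max(best, C(s, rs, l1) + mid + C(le, e, l2)
                                   + s_scd(s, rs, e, l1, l)
                                   + s_scd(e, le, s, l2, l))
        return best + s_fst(s, e, l)

    return O(0, n - 1)
```

With `s_scd` returning 0 everywhere, the function reduces to the first-order problem, which gives a simple sanity check for an implementation along these lines.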

Generalized Higher-Order Parsing
Both of the above algorithms are exact decoding algorithms. Solutions that allow for exact decoding with higher-order features typically come at a high cost in terms of efficiency. A trade-off between rich features and exact decoding benefits tree parsing (McDonald and Nivre, 2011). In particular, Zhang and McDonald (2012) proposed a generalized higher-order model that abandons exact search in graph-based parsing in favor of freedom in feature scope. They kept Eisner's algorithm for first-order parsing problems intact, while enhancing the scoring function in an approximate way by introducing higher-order features.
We borrow Zhang and McDonald's idea and develop a generalized parsing model for noncrossing dependency representations. The sub-problems and their decomposition are much like those of the first-order algorithm. The difference is that we expand the signature of each structure to include all the larger context required to compute higher-order features. For example, we can record the leftmost and the rightmost edges in the open structure to get the tri-neighbor features. The time complexity is thus always $O(n^3 B^2)$, where $B$ is the number of top-scoring signatures kept per structure, no matter how complicated the incorporated higher-order features are.
Figure 5: Sub-problems of generalized higher-order factorization and some of the combinations.
We focus on five factors introduced in Section 2.2. Consider again single-side second-order factorization. We keep the closed structure the same but modify the open one to $O[s, e; r_s, l_e, l_{s,r_s}, l_{l_e,e}]$. During parsing, we only record the top-$B$ combinations of labels concerning $e_{(s,e)}$ and the related $r_s$, $l_e$, $l_{s,r_s}$ and $l_{l_e,e}$. The split of a structure is similar to that of the first-order algorithm, as shown in Figure 5. Note that $r_s$ may be $e$ and $l_e$ may be $s$. In this way, we know exactly whether or not there is an edge from $s$ to $e$ in a refined open structure. This is different from the intuition behind the design of the open structure in first-order factorization.

Finding and Binding Pages
Statistics presented in Table 1 indicate that the coverage of noncrossing dependency graphs is relatively low. If we treat semantic dependency parsing as Maximum Subgraph parsing, the practical usefulness of the algorithms introduced above is rather limited accordingly. To deal with this problem, we model a semantic graph as a book, and view semantic dependency parsing as finding a book with coherent optimal pages. Given the considerably high coverage of pagenumber at most 2, we only consider 2-page books.
Figure 6: Every non-crossing arc is repeatedly assigned to every page.

Finding Pages via Coloring
In general, finding the pagenumber of a graph is NP-hard (Gómez-Rodríguez and Nivre, 2010). However, the problem is easy to solve when the pagenumber is at most 2. Fortunately, a semantic dependency graph is sparse enough that it can usually be embedded onto a very thin book with only 2 pages. For a structured prediction problem, the structural information of the output produced by a parser is very important. Decomposing a graph into two pages therefore has a defect: the structural information of each page is limited, because only about half of the arcs on average are included in one page. To enrich the structural information, we put into every page the arcs that do not cross any other arc. See Figure 6 for an example.
We utilize an algorithm based on coloring to decompose a graph $G = (V, A)$ into two noncrossing subgraphs $G_A = (V, A_A)$ and $G_B = (V, A_B)$. A detailed description is included in the supplementary note. The key idea of our algorithm is to color each crossing arc in one of two colors using depth-first search. When we color an arc $e_x$, we examine all arcs crossing $e_x$. If one of them, say $e_y$, has not been examined and can be colored in the other color (no arc crossing $e_y$ has the same color as $e_y$), we color $e_y$ and then recursively process $e_y$. Otherwise, $e_y$ is marked as a bad arc and dropped from both $A_A$ and $A_B$. After coloring all the crossing arcs, we add the arcs of each color to the corresponding subgraph. In particular, all noncrossing arcs are assigned to both $A_A$ and $A_B$.
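The coloring procedure can be sketched as follows. This is one possible reading of the description above, not the procedure from the supplementary note: an arc that is already colored is simply skipped, and an unexamined arc that cannot take the opposite color is marked bad. The function names are hypothetical, and arcs are assumed to be pairs (i, j) with i < j.

```python
def crosses(a, b):
    """True if arcs a=(i, j) and b=(k, l) cross when drawn on the same page."""
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def split_into_two_pages(arcs):
    """Decompose arcs into two noncrossing pages by depth-first two-coloring.

    Returns (page_a, page_b, bad). Arcs that cross nothing go to both pages;
    bad arcs could not be colored consistently and are dropped from both.
    """
    arcs = sorted(set(arcs))
    cross = {a: [b for b in arcs if crosses(a, b)] for a in arcs}
    colour, bad = {}, set()

    def paint(x, c):
        colour[x] = c
        for y in cross[x]:
            if y in colour or y in bad:
                continue                       # already examined
            if any(colour.get(z) == 1 - c for z in cross[y]):
                bad.add(y)                     # y cannot take the other color
            else:
                paint(y, 1 - c)

    for a in arcs:
        if cross[a] and a not in colour and a not in bad:
            paint(a, 0)

    shared = {a for a in arcs if not cross[a]}
    page_a = shared | {a for a, c in colour.items() if c == 0}
    page_b = shared | {a for a, c in colour.items() if c == 1}
    return page_a, page_b, bad


# A chain of crossings is two-colorable; (0, 1) crosses nothing and is shared.
page_a, page_b, bad = split_into_two_pages([(0, 2), (1, 3), (0, 1), (2, 4), (3, 5)])
```

Three mutually crossing arcs, such as (0, 3), (1, 4) and (2, 5), cannot be placed on two pages; in that case the sketch keeps two of them and reports the third as bad.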

Binding Pages via Lagrangian Relaxation
Applying the above algorithm, we can obtain two corpora to train two noncrossing dependency parsing models. In other words, we can learn two score functions $f_A$ and $f_B$ to score noncrossing dependency graphs. Given the trained models and a sentence, we can find two optimal noncrossing graphs, i.e. find the solutions for $\arg\max_g f_A(g)$ and $\arg\max_g f_B(g)$, respectively.
We can put all the arcs contained in $g_A = \arg\max_g f_A(g)$ and $g_B = \arg\max_g f_B(g)$ together as our parse for the sentence. This naive combination always gives a graph with a recall much higher than the precision. The problem is that a naive combination does not take the agreement of the graphs on the two pages into consideration, and thus loses some information. To combine the two pages in a principled way, we must do joint decoding to find two graphs $g_A$ and $g_B$ that maximize the score $f_A(g_A) + f_B(g_B)$, under the following constraints.
$$g_A(i, j) \le g_B(i, j) + \sum_{cross(i, j, i', j')} g_B(i', j') \quad \forall i, j$$
$$g_B(i, j) \le g_A(i, j) + \sum_{cross(i, j, i', j')} g_A(i', j') \quad \forall i, j$$
Here $g(i, j) \in \{0, 1\}$ indicates whether $e_{(i,j)}$ is in graph $g$, and cross determines whether $e_{(i,j)}$ and $e_{(i',j')}$ cross. The meaning of the first constraint is: when there is an arc $e_{(i,j)}$ in the first graph, either $e_{(i,j)}$ is also in the second graph, or there is an arc $e_{(i',j')}$ in the second graph which crosses $e_{(i,j)}$. The second constraint is symmetric. All constraints are linear and can be written in a simplified way as
$$A g_A + B g_B \le 0,$$
where $A$ and $B$ are matrices that can be constructed by checking all possible crossing arc pairs. In summary, we have the following constrained optimization problem:
$$\max_{g_A, g_B} f_A(g_A) + f_B(g_B) \quad \text{s.t. } g_A, g_B \text{ are noncrossing graphs and } A g_A + B g_B \le 0.$$
The Lagrangian of the optimization problem is
$$L(g_A, g_B; u) = f_A(g_A) + f_B(g_B) - u^\top (A g_A + B g_B),$$
where $u \ge 0$ is the Lagrangian multiplier. Then the dual is
$$L(u) = \max_{g_A, g_B} L(g_A, g_B; u) = \max_{g_A} \left( f_A(g_A) - u^\top A g_A \right) + \max_{g_B} \left( f_B(g_B) - u^\top B g_B \right).$$
Figure 7: The page binding algorithm.
We instead try to find the solution for $\min_u L(u)$. By using a subgradient method to calculate $\min_u L(u)$, we have an algorithm for joint decoding (see Figure 7). $L(u)$ is divided into two optimization problems which can be decoded easily. Each sub-problem is still a parsing problem for noncrossing graphs. Only the scores of factors are modified (see Lines 3 and 4). Specifically, to modify the first-order weights of edges, we subtract $u^\top A$ in the first model and $u^\top B$ in the second one. In each iteration, after obtaining two new parsing results, we check whether the constraints are satisfied. If the answer is "yes," we stop and return the merged graph. Otherwise, we update $u$ in a way to decrease $L(u)$ (see Line 8).
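The joint decoding loop can be sketched as a projected subgradient method over first-order arc scores. Several simplifications are assumptions of this sketch and not part of the paper's algorithm in Figure 7: an exhaustive search stands in for the noncrossing DP decoders, there is one multiplier per arc and per page instead of explicit matrices, the step size is constant, and the multipliers are clipped at zero. All function names are hypothetical.

```python
from itertools import combinations


def crosses(a, b):
    (i, j), (k, l) = sorted(a), sorted(b)
    return i < k < j < l or k < i < l < j


def decode_noncrossing(n, score):
    """Stand-in page decoder: exhaustive search for the best noncrossing arc set."""
    pairs = list(combinations(range(n), 2))
    best, best_val = frozenset(), 0
    for size in range(1, len(pairs) + 1):
        for sub in combinations(pairs, size):
            if any(crosses(a, b) for a, b in combinations(sub, 2)):
                continue
            val = sum(score(e) for e in sub)
            if val > best_val:
                best, best_val = frozenset(sub), val
    return best


def slack(g, h, pairs):
    """Constraint values g(e) - h(e) - sum of h-arcs crossing e; each must be <= 0."""
    return {e: (e in g) - (e in h) - sum(crosses(e, b) for b in h) for e in pairs}


def bind_pages(n, f_a, f_b, step=1.0, max_iter=50):
    """Return (merged arcs, converged flag) for arc scorers f_a and f_b."""
    pairs = list(combinations(range(n), 2))
    u_a = dict.fromkeys(pairs, 0.0)   # multipliers of the page-A constraints
    u_b = dict.fromkeys(pairs, 0.0)   # multipliers of the page-B constraints

    def adjusted(f, own, other):
        # Arc score after moving the multiplier terms into the page model.
        return lambda e: (f(e) - own[e] + other[e]
                          + sum(other[b] for b in pairs if crosses(e, b)))

    for _ in range(max_iter):
        g_a = decode_noncrossing(n, adjusted(f_a, u_a, u_b))
        g_b = decode_noncrossing(n, adjusted(f_b, u_b, u_a))
        v_a, v_b = slack(g_a, g_b, pairs), slack(g_b, g_a, pairs)
        if max(v_a.values()) <= 0 and max(v_b.values()) <= 0:
            return g_a | g_b, True    # the two pages agree
        for e in pairs:               # projected subgradient step on the dual
            u_a[e] = max(0.0, u_a[e] + step * v_a[e])
            u_b[e] = max(0.0, u_b[e] + step * v_b[e])
    return g_a | g_b, False


# Page A prefers arc (0, 2), page B prefers the crossing arc (1, 3): already coherent.
merged, ok = bind_pages(4, lambda e: 2 if e == (0, 2) else -1,
                        lambda e: 2 if e == (1, 3) else -1)
```

When the loop stops early, the merged graph satisfies the agreement constraints; when the iteration limit is reached, the sketch falls back to the naive union of the last two page graphs.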

Data Sets
To evaluate the effectiveness of book embedding in practice, we conduct experiments on unlabeled parsing using four corpora: CCGBank (Hockenmaier and Steedman, 2007), DeepBank, Enju HPSGBank (EnjuBank; Miyao et al., 2004) and Prague Dependency TreeBank (PCEDT; Hajic et al., 2012). We use "standard" training, validation, and test splits to facilitate comparisons. Following the previous experimental setup for CCG parsing, we use sections 02-21 as training data, section 00 as development data, and section 23 for testing. The other three data sets are from SemEval 2014 Task 8 (Oepen et al., 2014), and the data splitting policy follows the shared task. All four data sets are publicly available from LDC (Oepen et al., 2016).
Experiments for CCG analysis were performed using automatically assigned POS tags generated by a symbol-refined HMM tagger (Huang et al., 2010). For the other three data sets we use the POS tags provided by the shared task. We also use features extracted from trees. We consider two types of trees: (1) syntactic trees provided as a companion analysis by the shared task and CCGBank, and (2) pseudo trees (Zhang et al., 2016) automatically extracted from semantic dependency annotations. We utilize the Mate parser (Bohnet, 2010) to generate pseudo trees for all data sets and also syntactic trees for CCG analysis, and use the companion syntactic analysis provided by the shared task for the other three data sets.

Statistical Disambiguation
Our parsing algorithms can be applied to scores originating from any source, but in our experiments we chose to use the framework of global linear models, deriving our scores as
$$\mathrm{SCOREPART}(s, p) = w^\top \phi(s, p),$$
where $\phi$ is a feature-vector mapping and $w$ is a parameter vector. $p$ may refer to a single arc, a pair of neighboring arcs, or a general tuple of arcs, according to the definition of a parsing model. For details we refer to the source code. We chose the averaged structured perceptron (Collins, 2002) for parameter estimation.

Results of Practical Parsing
We evaluate five decoding algorithms:
M1: first-order exact algorithm,
M2: second-order exact algorithm with single-side factorization,
M3: second-order approximate algorithm with single-side factorization,
M4: second-order approximate algorithm with single- and both-side factorization,
M5: third-order approximate algorithm with single- and both-side factorization.
Table 2 lists the accuracy of Maximum Subgraph parsing. The output of our parser was evaluated against each dependency in the corpus. We report unlabeled precision (UP), recall (UR) and F-score (UF). We can see that the first-order model, with rich features, obtains a considerably good precision. But due to the low coverage of the noncrossing dependency graphs, some dependencies cannot be built. This property has a great impact on recall. Furthermore, we can see that the introduction of higher-order features improves parsing substantially for all data sets, as expected. When pseudo trees are utilized, the improvement is marginal. We think the reason is that we have already included many higher-order features at the stage of pseudo tree parsing.
Table 2: Parsing accuracy evaluated on the development sets. "MS" is short for Maximum Subgraph parsing. "NC" and "LR" are short for naive combination and Lagrangian Relaxation.
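For reference, unlabeled precision, recall and F-score over sets of dependencies can be computed as in the sketch below. It is not the official SDP2014 evaluation tool; the function name is hypothetical, and to score a whole corpus each dependency would need to be paired with its sentence identifier so that arcs from different sentences stay distinct.

```python
def unlabeled_prf(gold, predicted):
    """Unlabeled precision, recall and F-score over two collections of arcs."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Two of the three predicted arcs are among the four gold arcs.
up, ur, uf = unlabeled_prf({(0, 1), (0, 2), (1, 3), (2, 3)},
                           {(0, 1), (0, 2), (2, 4)})
```

A parser restricted to a single noncrossing page can reach high precision while missing every arc that would have to cross another one, which shows up as low recall in this computation.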

Effectiveness of Approximate Parsing
Perhaps surprisingly, approximate parsing with single-side second-order features and cube pruning is even slightly better than exact parsing. This result demonstrates the effectiveness of generalized dependency parsing. Further including third-order features does not improve parsing accuracy.

Effectiveness of Page Binding
When arcs are assigned to two sets, we can separately train two parsers for producing two types of noncrossing dependency graphs. These two parsers can be integrated using a naive merger or an LR-based merger. Table 2 also shows the accuracy obtained by the second-order model M4. The effectiveness of the Lagrangian Relaxation-based algorithm for binding pages is confirmed. Figure 8 presents the termination rate with respect to the number of iterations. Here we apply M4 with syntax and pseudo tree features. In practice the Lagrangian Relaxation-based algorithm finds solutions in a few iterations for a majority of sentences. This suggests that even though joint decoding is an iterative procedure, satisfactory efficiency is still available.
Table 3: Parsing accuracy evaluated on the test sets.

Comparison with Other Parsers
We show the parsing results on the test data together with some relevant results from related work. We compare our parser with two other systems: (1) ZDSW (Zhang et al., 2016) is a transition-based system that obtains state-of-the-art accuracy; we present the results of their best single parsing model; (2) Peking (Du et al., 2014) is the best-performing system in the shared task; it is a hybrid system that integrates more than ten submodels to achieve high accuracy. Our parser can be taken as a graph-based parser. It reaches the state-of-the-art performance produced by the transition-based system. On DeepBank and EnjuBank, the accuracy of our parser is equivalent to that of ZDSW, while on CCGBank, our parser is significantly better.
There is still a gap between our single parsing model and the Peking hybrid model. For a majority of NLP tasks, e.g. parsing (Surdeanu and Manning, 2010) and semantic role labeling (Koomen et al., 2005), hybrid systems that combine the complementary strengths of heterogeneous models perform better. But a good individual system is the cornerstone of hybrid systems. A better design of a single system almost always benefits a system ensemble.

Conclusion
We propose a new data-driven parsing framework, namely book embedding, for semantic dependency analysis, viz. mapping from natural language sentences to bilexical semantic dependency graphs. Our work includes two contributions:
1. new algorithms for maximum noncrossing dependency parsing.
2. a Lagrangian Relaxation-based algorithm to combine noncrossing dependency subgraphs.
Experiments demonstrate the effectiveness of the book embedding framework across a wide range of conditions. Our graph-based parser obtains state-of-the-art accuracy.