Parsing for Grammatical Relations via Graph Merging

This paper is concerned with building deep grammatical relation (GR) analysis using a data-driven approach. To deal with this problem, we propose graph merging, a new perspective on building flexible dependency graphs: construct complex graphs by constructing simple subgraphs. We discuss two key problems in this perspective: (1) how to decompose a complex graph into simple subgraphs, and (2) how to combine subgraphs into a coherent complex graph. Experiments demonstrate the effectiveness of graph merging. Our parser reaches state-of-the-art performance and is significantly better than two transition-based parsers.


Introduction
Grammatical relations (GRs) represent functional relationships between language units in a sentence. Marking not only local but also a wide variety of long distance dependencies, GRs encode in-depth information of natural language sentences. Traditionally, GRs are generated as a byproduct by grammar-guided parsers, e.g. RASP (Carroll and Briscoe, 2002), C&C (Clark and Curran, 2007b) and Enju (Miyao et al., 2007). Very recently, by representing GR analysis using general directed dependency graphs, Sun et al. (2014) and Zhang et al. (2016) showed that considerably good GR structures can be directly obtained using data-driven, transition-based parsing techniques. We follow their encouraging work and study the data-driven approach for producing GR analyses.
The key challenge of building GR graphs is due to their flexibility. Different from surface syntax, GR graphs are not constrained to trees, which is a fundamental consideration in designing parsing algorithms. To deal with this problem, we propose graph merging, a new perspective, for building flexible representations. The basic idea is to decompose a GR graph into several subgraphs, each of which captures most but not all of the information. On the one hand, each subgraph is simple enough to allow efficient construction. On the other hand, the combination of all subgraphs enables the whole target GR structure to be produced.
There are two major problems in the graph merging perspective. First, how to decompose a complex graph into simple subgraphs in a principled way? To deal with this problem, we consider structure-specific properties of the syntactically-motivated GR graphs. One key property is their reachability: in a given GR graph, almost every node is reachable from the same, unique root. If a node is not reachable, it is disconnected from the other nodes. This property ensures that a GR graph can be decomposed into a limited number of forests, which in turn can be accurately and efficiently built via tree parsing. We model the graph decomposition problem as an optimization problem and employ Lagrangian Relaxation to solve it.
Second, how to merge subgraphs into one coherent structure in a principled way? The problem of finding an optimal graph that consistently combines the subgraphs obtained through individual models is non-trivial. We treat this problem as a combinatorial optimization problem and also employ Lagrangian Relaxation to solve it. In particular, the parsing phase consists of two steps. First, graph-based models are applied to assign scores to individual arcs and various tuples of arcs. Then, a Lagrangian Relaxation-based joint decoder is applied to efficiently produce globally optimal GR graphs according to all graph-based models.

Background
In this paper, we focus on building GR analysis for Mandarin Chinese. Mandarin is an analytic language that lacks inflectional morphology (almost) entirely and utilizes highly configurational ways to convey syntactic and semantic information. This analytic nature allows all GRs to be represented as bilexical dependencies. Sun et al. (2014) showed that analyses of a variety of complicated linguistic phenomena, e.g. coordination, raising/control constructions, extraction and topicalization, can be conveniently encoded with directed graphs. Moreover, such deep syntactic dependency graphs can be effectively derived from Chinese TreeBank (Xue et al., 2005) with very high quality. Figure 1 is an example. In this graph, "subj*ldd" between the word "涉及/involve" and the word "文件/documents" represents a long-distance subject-predicate relation. The arguments and adjuncts of the coordinated verbs, namely "颁布/issue" and "实行/practice," are separately yet distributively linked to the two heads. By encoding GRs as directed graphs over words, Sun et al. (2014) and Zhang et al. (2016) showed that the data-driven, transition-based approach can be applied to build Chinese GR structures with very promising results. This architecture is complementary to the traditional approach to English GR analysis, which leverages grammar-guided parsing under deep formalisms, such as LFG (Kaplan et al., 2004), CCG (Clark and Curran, 2007a) and HPSG (Miyao et al., 2007). We follow Sun et al.'s and Zhang et al.'s encouraging work and study discriminative, factorization models for obtaining GR analysis.

The Idea
The key idea of this work is constructing a complex structure via constructing simple partial structures. Each partial structure is simple in the sense that it allows efficient construction. For instance, projective trees, 1-endpoint-crossing trees, non-crossing dependency graphs and 1-endpoint-crossing, pagenumber-2 graphs can be taken as simple structures, given that low-degree polynomial-time parsing algorithms exist (Eisner, 1996; Pitler et al., 2013; Kuhlmann and Jonsson, 2015). To construct each partial structure, we can employ mature parsing techniques. To get the final target output, we also require that the combination of all partial structures enable the whole target structure to be produced. In this paper, we exemplify the above idea by designing a new parser for obtaining GR graphs. Take the GR graph in Figure 1 for example. It can be decomposed into two tree-like subgraphs, shown in Figure 2. If we can parse the sentence into such subgraphs and combine them in a principled way, we get the original GR graph.
Under this perspective, we need to develop a principled method to decompose a complex structure into simple structures, which allows us to generate data to train simple solvers. We also need to develop a principled method to integrate partial structures, which allows us to produce coherent structures as outputs. We are going to demonstrate the techniques we use to solve these two problems.

Figure 2: A graph decomposition for the GR graph in Figure 1. The two subgraphs are shown on the two sides of the sentence respectively. The subgraph on the upper side of the sentence is exactly a tree, while the one on the lower side is slightly different: the edge from the word "文件/document" to "涉及/involve" is tagged "[inverse]" to indicate that the direction of the edge in the subgraph is in fact opposite to that in the original graph.

Graph Decomposition as Optimization
Given a sentence s = w_1 w_2 ... w_n of length n, we use a vector y of length n^2 to denote a graph on it. We use indices i and j to index the elements in the vector: y(i, j) ∈ {0, 1} denotes whether there is an arc from w_i to w_j (1 ≤ i, j ≤ n). Given a graph y, we hope to find m subgraphs y_1, ..., y_m, each of which belongs to a specific class of graphs G_k (k = 1, 2, ..., m). Each class should allow efficient construction. For example, we may need a subgraph to be a tree or a non-crossing dependency graph. The combination of all y_k gives enough information to construct y. Furthermore, the graph decomposition procedure is utilized to generate training data for building sub-models. Therefore, we hope each subgraph y_k is informative enough to train a good disambiguation model. To do so, for each y_k, we define a score function s_k that indicates the "goodness" of y_k. Integrating all ideas, we can formalize graph decomposition as an optimization problem:

    max.  Σ_k s_k(y_k)
    s.t.  y_k ∈ G_k                       (k = 1, 2, ..., m)
          y(i, j) ≤ Σ_k y_k(i, j)          for all i, j

The last condition in this optimization problem ensures that every edge in y appears in at least one subgraph.
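As a concrete illustration, the coverage condition above can be checked in a few lines of Python. The 0/1 matrix encoding and the helper name `covers` are ours, not from the paper; this is a minimal sketch, not the authors' implementation.

```python
def covers(y, subgraphs):
    """Check the coverage condition: every arc y(i, j) = 1 of the
    original graph appears in at least one subgraph y_k."""
    n = len(y)
    return all(
        y[i][j] == 0 or any(g[i][j] == 1 for g in subgraphs)
        for i in range(n) for j in range(n)
    )

# A 3-word toy example: arcs 0->1 and 0->2, split across two subgraphs.
y  = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
g1 = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
g2 = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
print(covers(y, [g1, g2]))   # both arcs covered
print(covers(y, [g1]))       # arc 0->2 left uncovered
```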
For a specific graph decomposition task, we should define good score functions s k and graph classes G k according to key properties of the target structure y.

Decomposing GR Graphs into Tree-like Subgraphs
One key property of GR graphs is their reachability: every node is either reachable from a unique root or is by itself an independent connected component. This property allows a GR graph to be decomposed into a limited number of tree-like subgraphs. By tree-like we mean that if we treat a graph on a sentence as undirected, it is a tree, or it is a subgraph of some tree on the sentence. The advantage of tree-like subgraphs is that they can be effectively built by adapting data-driven tree parsing techniques. Take the sentence in Figure 1 for example. For every word, there is at least one path linking the virtual root and this word. Furthermore, we can decompose the graph into two tree-like subgraphs, as shown in Figure 2. In this decomposition, one subgraph is exactly a tree, and the other is very close to a tree.
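The reachability property can be verified with a standard breadth-first search from the (virtual) root. The adjacency-dict representation and function name below are our own illustrative choices:

```python
from collections import deque

def reachable_from_root(adj, root=0):
    """Return the set of nodes reachable from `root` via directed arcs.
    `adj` maps a node index to the list of its dependents."""
    seen, queue = {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# A toy graph where virtual root 0 dominates every word,
# mirroring the property observed for Figure 1.
adj = {0: [2], 2: [1, 3], 3: [4]}
print(reachable_from_root(adj))
```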
We restrict the number of subgraphs to 3. The intuition is that we use one tree to capture long-distance information and the other two to capture coordination information. In other words, we decompose each given graph y into three tree-like subgraphs g_1, g_2 and g_3. The goal is to let g_1, g_2 and g_3 carry important information of the graph as well as cover all edges in y. The optimization problem can be written as

    max.  s_1(g_1) + s_2(g_2) + s_3(g_3)
    s.t.  g_k is a tree-like subgraph          (k = 1, 2, 3)
          y(i, j) ≤ g_1(i, j) + g_2(i, j) + g_3(i, j)   for all i, j

We score a subgraph in a first-order arc-factored way, which first scores the edges separately and then adds up the scores. Formally, the score function is s_k(g_k) = Σ_{(i,j)} ω_k(i, j) g_k(i, j), where ω_k(i, j) is the score of the edge from i to j. Under this score function, we can use the Maximum Spanning Tree (MST) algorithm (Chu and Liu, 1965; Edmonds, 1967; Eisner, 1996) to decode the tree-like subgraph with the highest score.
After we define the score function, extracting a subgraph from a GR graph works like this: we first assign heuristic weights ω_k(i, j) (1 ≤ i, j ≤ n) to the potential edges between all pairs of words, then compute a best projective tree g_k using Eisner's algorithm. g_k is not exactly a subgraph of y, because there may be edges in the tree but not in the graph. To guarantee that we get a subgraph of the original graph, we add labels to the edges in the trees to encode the necessary information. We label g_k(i, j) with the original label if y(i, j) = 1; with the original label appended by "~R" if y(j, i) = 1; and with "None" otherwise. With this labeling, we can have a function t2g that transforms the extracted trees into tree-like graphs. t2g(g_k) is not necessarily the same as the original graph y, but must be a subgraph of it.
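The relabeling step behind t2g can be sketched as follows. The function name `t2g_labels` and the dict-based edge/label encoding are hypothetical; the three labeling cases are exactly the ones described above:

```python
def t2g_labels(tree_edges, y, labels):
    """Relabel the edges of an extracted tree so it encodes a subgraph
    of the original graph y (a sketch of the t2g labeling step).
    `labels[(i, j)]` is the original GR label of arc i -> j."""
    out = {}
    for (i, j) in tree_edges:
        if y[i][j] == 1:
            out[(i, j)] = labels[(i, j)]          # arc kept as-is
        elif y[j][i] == 1:
            out[(i, j)] = labels[(j, i)] + "~R"   # direction reversed
        else:
            out[(i, j)] = "None"                  # not in the graph at all
    return out

y = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]        # arcs 0->1 and 2->1
labels = {(0, 1): "subj", (2, 1): "obj"}
tree = [(0, 1), (1, 2)]                      # tree edge (1, 2) reverses arc 2->1
print(t2g_labels(tree, y, labels))
```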

Three Variations of Scoring
With different weight assignments, we can extract different trees from a graph, obtaining different subgraphs. We devise three variations of weight assignment: ω_1, ω_2, and ω_3. Each ω_k (k = 1, 2 or 3) consists of two parts: one shared by all, denoted by S, and one different for each, denoted by V_k. Formally, ω_k(i, j) = S(i, j) + V_k(i, j). Given a graph y, S is a weighted combination of four components S_1, S_2, S_3 and S_4 with coefficients c_1, c_2, c_3 and c_4, satisfying c_1 ≫ c_2 ≫ c_3. l_p is a function of i and j: l_p(i, j) is the length of the shortest path from i to j such that either i is a child of an ancestor of j, or j is a child of an ancestor of i. That is to say, the paths are of the form i ← n_1 ← ... ← n_k → j or i ← n_1 → ... → n_k → j. If no such path exists, then l_p(i, j) = n. The intuition behind the design is illustrated below.
S_1 indicates whether there is an edge between i and j, and we want it to matter most; S_2 indicates whether the edge is from i to j, and we want edges with the correct direction to be more likely to be selected; S_3 indicates the distance between i and j, and we prefer edges with short distances because they are easier to predict; S_4 indicates the length of a certain type of path between i and j that reflects c-commanding relationships, and its coefficient remains to be tuned.
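A minimal sketch of the shared weight S follows. The paper specifies only the intuition and the ordering c_1 ≫ c_2 ≫ c_3; the concrete component formulas and coefficient values below are illustrative guesses, not the authors' definitions:

```python
def shared_weight(i, j, y, lp, n, c=(1000.0, 100.0, 1.0, 0.5)):
    """Illustrative sketch of the shared score S(i, j) for i != j.
    c = (c1, c2, c3, c4); only c1 >> c2 >> c3 is required by the text.
    `lp` computes the c-commanding path length l_p(i, j)."""
    c1, c2, c3, c4 = c
    s1 = 1.0 if (y[i][j] or y[j][i]) else 0.0   # any edge between i and j
    s2 = 1.0 if y[i][j] else 0.0                # edge with the correct direction
    s3 = 1.0 / abs(i - j)                       # prefer short edges
    s4 = 1.0 - lp(i, j) / n                     # prefer close c-command paths
    return c1 * s1 + c2 * s2 + c3 * s3 + c4 * s4

y = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
no_path = lambda i, j: 3            # l_p stub: no c-commanding path, so l_p = n
w_edge = shared_weight(0, 1, y, no_path, 3)   # a real edge of the graph
w_none = shared_weight(0, 2, y, no_path, 3)   # a non-edge
```

As intended, a genuine graph edge receives a far higher weight than a non-edge, so the extracted tree strongly prefers edges that exist in y.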
We want the scores V_k to capture different information of the GR graph. In GR graphs, we have additional information (denoted as "*ldd" in Figure 1) for long-distance dependency edges. Moreover, we notice that conjunction is another important structure, and conjunctions can be derived from the GR graph. Assume that we tag the edges relating to conjunctions with "*cjt." The three variation scores, i.e. V_1, V_2 and V_3, reflect the long-distance and conjunction information in different ways.

V_1. First, for edges y(i, j) whose label is tagged with *ldd, we assign V_1(i, j) = d, where d is a coefficient to be tuned on validation data. Whenever we come across a parent p with a set of conjunction children cjt_1, cjt_2, ..., cjt_n, we find the rightmost child gc_1r of the leftmost conjunct cjt_1, and add d to each of V_1(p, cjt_1) and V_1(cjt_1, gc_1r). The conjunction edges that receive the additional d's are shown in blue in Figure 3.
V_2. Different from V_1, for edges y(i, j) whose label is tagged with *ldd, we assign V_2(j, i) = d. Then for each conjunction structure with a parent p and a set of conjunction children cjt_1, cjt_2, ..., cjt_n, we find the leftmost child gc_nl of the rightmost conjunct cjt_n, and add d to each of V_2(p, cjt_n) and V_2(cjt_n, gc_nl). The concerned conjunction edges are shown in green in Figure 3.
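The V_1 bonus assignment can be sketched directly from its description. The function name, the tuple encoding of conjunction structures, and the value of d are our illustrative choices:

```python
def variation_v1(n, ldd_edges, conjunctions, d=10.0):
    """Sketch of V_1: reward long-distance dependency edges in their
    original direction, plus the leftmost conjunct's attachment edges.
    `conjunctions` holds tuples (parent, [cjt_1, ..., cjt_n], gc_1r),
    where gc_1r is the rightmost child of the leftmost conjunct."""
    V = [[0.0] * n for _ in range(n)]
    for (i, j) in ldd_edges:                 # edges labeled *ldd
        V[i][j] += d
    for parent, cjts, gc_1r in conjunctions:
        V[parent][cjts[0]] += d              # parent -> leftmost conjunct
        V[cjts[0]][gc_1r] += d               # leftmost conjunct -> gc_1r
    return V

# One *ldd edge 0->3; one conjunction: parent 1, conjuncts [2, 4], gc_1r = 3.
V = variation_v1(5, [(0, 3)], [(1, [2, 4], 3)])
```

V_2 is symmetric: it reverses the *ldd edges and rewards the rightmost conjunct's attachment edges instead.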
Then the dual is L(u) = max_{g_1, g_2, g_3} L(g_1, g_2, g_3, u). According to the duality principle, min_u max_{g_1, g_2, g_3} L(g_1, g_2, g_3, u) = min_u L(u), so we can find the optimal solution for the problem if we can find min_u L(u). However, it is very hard to compute L(u), not to mention min_u L(u). The challenge is that the g_m appearing in the three maximizations must be consistent.
The idea is to separate the overall maximization into three maximization problems by approximation. We observe that g_1, g_2, and g_3 are very close to g_m, so we can approximate L(u) by replacing g_m with g_k in the k-th maximization, yielding three independent terms of the form max_{g_k} s_k(g_k) + u g_k. In this case, the three maximization problems can be decoded separately, and we can try to find the optimal u using the subgradient method.

The Algorithm
Algorithm 1 is our tree decomposition algorithm.
In the algorithm, we use the subgradient method to find min_u L(u) iteratively. In each iteration, we first compute g_1, g_2, and g_3 to obtain L(u), then update u, until the graph is covered by the subgraphs. The coefficients 1/3 can be merged into the step sizes α^(k), so we omit them. The three separate problems g_k ← arg max_{g_k} s_k(g_k) + u g_k (k = 1, 2, 3) can be solved using Eisner's algorithm, similarly to solving arg max_{g_k} s_k(g_k). Intuitively, the Lagrangian multiplier u in our algorithm can be regarded as additional weights for the score function. The update of u increases the weights of the edges that are not covered by any tree-like subgraph, so that they are more likely to be selected in the next iteration.
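The decomposition loop can be sketched as follows. The solver interface (each solver takes the multiplier matrix u as edge bonuses and returns a 0/1 subgraph) and the constant step size are simplifying assumptions; the real Algorithm 1 uses Eisner's algorithm as each solver and a decaying step size:

```python
def decompose(y, solvers, alpha=1.0, max_iter=50):
    """Sketch of the subgradient decomposition loop (Algorithm 1):
    raise the multiplier u on uncovered edges until every edge of y
    is covered by some subgraph, or the iteration budget runs out."""
    n = len(y)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(max_iter):
        gs = [solve(u) for solve in solvers]          # decode each tree model
        uncovered = [(i, j) for i in range(n) for j in range(n)
                     if y[i][j] and not any(g[i][j] for g in gs)]
        if not uncovered:
            return gs                                 # y is fully covered
        for (i, j) in uncovered:                      # subgradient step on u
            u[i][j] += alpha
    return gs

# Mock solver: selects edge (i, j) iff base score + u bonus is positive.
def make_solver(base):
    def solve(u):
        n = len(u)
        return [[1 if base[i][j] + u[i][j] > 0 else 0 for j in range(n)]
                for i in range(n)]
    return solve

y = [[0, 1], [0, 0]]
solver = make_solver([[0.0, -0.5], [0.0, 0.0]])   # initially rejects edge 0->1
gs = decompose(y, [solver])
```

After one multiplier update the previously rejected edge 0->1 becomes attractive and is covered.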

Graph Merging
The extraction algorithm gives three classes of trees for each graph. We apply the algorithm to the graph training set and obtain three training tree sets. After that, we can train three parsing models on the three tree sets. In this work, we use Mate (Bohnet, 2010), a second-order graph-based dependency parser, to train the models and parse the trees.
Let the score functions of the three models be f_1, f_2 and f_3 respectively. Then each parser finds the tree with the highest score for a sentence, that is, it solves one of the following optimization problems: arg max_{g_1} f_1(g_1), arg max_{g_2} f_2(g_2) and arg max_{g_3} f_3(g_3). We can parse a given sentence with the three models, obtain three trees, transform them into subgraphs, and combine them to obtain the graph parse of the sentence by putting all the edges of the three subgraphs together. That is to say, we obtain the graph y = max{t2g(g_1), t2g(g_2), t2g(g_3)}, where the max is taken edge-wise. We call this process simple merging.
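Simple merging is just the edge-wise union of the three subgraphs; a one-function sketch (our own helper name):

```python
def simple_merge(subgraphs):
    """Simple merging: the edge-wise union (max) of the subgraphs."""
    n = len(subgraphs[0])
    return [[max(g[i][j] for g in subgraphs) for j in range(n)]
            for i in range(n)]

g1 = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
g2 = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
merged = simple_merge([g1, g2])
print(merged)   # union of the two subgraphs' edges
```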
However, the simple merging process ignores the consistency that the three trees extracted from the same graph exhibit, thus losing some important information: when we decompose a graph into three subgraphs, some edges tend to appear in certain classes of subgraphs at the same time. We want to retain this co-occurrence relationship among the edges when parsing and merging. To retain the hidden consistency, we must do joint decoding instead of decoding the three models separately.

Capturing the Hidden Consistency
In order to capture the hidden consistency, we add consistency tags to the labels of the extracted trees to represent the co-occurrence. The basic idea is to use an additional tag to encode the relationship of the edges in the three trees. The tag set is T = {0, 1, 2, 3, 4, 5, 6}. Given a tag t ∈ T, t&1, t&2 and t&4 denote whether the edge is contained in g_1, g_2 and g_3 respectively, where "&" is the bitwise AND operator. Since we do not need to consider the first bit of the tags of edges in g_1, the second bit in g_2, and the third bit in g_3, we always assign 0 to them. For example, if y(i, j) = 1, g_1(i, j) = 1, g_2(j, i) = 1, g_3(i, j) = 0 and g_3(j, i) = 0, we tag g_1(i, j) as 2 and g_2(j, i) as 1.
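The bit encoding can be made concrete with a few lines of Python; the helper name and boolean-triple interface are ours, but the worked example reproduces the one in the text:

```python
IN_G1, IN_G2, IN_G3 = 1, 2, 4   # bit masks for membership in g1, g2, g3

def consistency_tag(own_index, present):
    """Encode which OTHER trees contain this edge (in either direction).
    `present` is a triple of booleans for (g1, g2, g3); the bit for the
    tree the edge itself belongs to (`own_index`, 0-based) stays 0."""
    tag = 0
    for k, bit in enumerate((IN_G1, IN_G2, IN_G3)):
        if k != own_index and present[k]:
            tag |= bit
    return tag

# The example from the text: y(i,j)=1, g1 has (i,j), g2 has (j,i), g3 has neither.
print(consistency_tag(0, (True, True, False)))  # tag of the edge in g1
print(consistency_tag(1, (True, True, False)))  # tag of the edge in g2
```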
When it comes to parsing, we also get labels with consistency information. Our goal is to guarantee that the tags on the edges of the parse trees for the same sentence are consistent during graph merging. With the consistency tags introduced, for convenience we index the graph and tree vector representations using three indices: g(i, j, t) denotes whether there is an edge from word w_i to word w_j with tag t in graph g.
The joint decoding problem can be written as a constrained optimization problem:

    max.  f_1(g_1) + f_2(g_2) + f_3(g_3)
    s.t.  the consistency tags are respected, i.e., for each ordered pair of trees (g_k, g_l), an edge of g_k carries a tag whose g_l-bit is set if and only if g_l contains the corresponding edge.

Lagrangian Relaxation with Approximation
To solve the constrained optimization problem above, we do some transformations and then apply the Lagrangian Relaxation to it with approximation.
Let a_12(i, j) = g_1(i, j, 2) + g_1(i, j, 6); then the first constraint can be written as an equality constraint g_1(:, :, 2) + g_1(:, :, 6) = a_12 .* (Σ_t g_2(:, :, t)), where ":" takes out all the elements in the corresponding dimension, and ".*" denotes pointwise multiplication. The other constraints can be rewritten in the same way. If we take a_12, a_13, ..., a_32 as constants, then all the constraints are linear and can be written as A_1 g_1 + A_2 g_2 + A_3 g_3 = 0, where A_1, A_2, and A_3 are matrices that can be constructed from a_12, a_13, ..., a_32.
The Lagrangian of the optimization problem is L(g_1, g_2, g_3, u) = f_1(g_1) + f_2(g_2) + f_3(g_3) + u^T (A_1 g_1 + A_2 g_2 + A_3 g_3), where u is the Lagrangian multiplier. Then the dual is L(u) = max_{g_1, g_2, g_3} L(g_1, g_2, g_3, u). Again, we use the subgradient method to minimize L(u). During the derivation, we take a_12, a_13, ..., a_32 as constants, but unfortunately they are not. We propose an approximation for the a's in each iteration: using the a's obtained in the previous iteration instead. It is a reasonable approximation given that the u's in two consecutive iterations are similar, and so are the a's.

The Algorithm
The pseudo code of our algorithm is shown in Algorithm 2. The score functions f_1, f_2, and f_3 each consist of first-order scores and higher-order scores, so each can be written as the sum of a first-order part and a higher-order part, where the first-order part is s^1st_k(g) = Σ_{i,j} ω_k(i, j) g(i, j) (k = 1, 2, 3). With this property, each individual problem g_k ← arg max_{g_k} f_k(g_k) + u^T A_k g_k can be decoded easily, with modifications to the first-order weights of the edges in the three models. Specifically, let w_k = u^T A_k; then we modify the ω_k in s_k to ω'_k, such that ω'_k(i, j, t) = ω_k(i, j, t) + w_k(i, j, t) + w_k(j, i, t).

Algorithm 2: The Joint Decoding Algorithm
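The weight modification can be sketched as a simple fold of the multiplier-derived weights into the first-order scores. The dict-of-tuples encoding and the function name are our own illustrative choices:

```python
def modified_first_order_weights(omega, w):
    """Fold the multiplier-derived weights w_k = u^T A_k into the
    first-order scores: omega'(i,j,t) = omega(i,j,t) + w(i,j,t) + w(j,i,t).
    Both arguments map (i, j, t) triples to floats."""
    out = {}
    for (i, j, t), v in omega.items():
        out[(i, j, t)] = v + w.get((i, j, t), 0.0) + w.get((j, i, t), 0.0)
    return out

omega = {(0, 1, 2): 1.0, (1, 0, 1): 0.5}
w = {(0, 1, 2): 0.25, (1, 0, 2): 0.1}
out = modified_first_order_weights(omega, w)
```

Note that each edge picks up the w-weight of both directions, matching the formula above.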
The update of w_1, w_2, w_3 can be understood in an intuitive way. Suppose one of the constraints is not satisfied; without loss of generality, say the first one for edge y(i, j): g_1(i, j) is tagged to represent that g_2(i, j) = 1, but that is not the case. So we increase the weight of that edge with all kinds of tags in g_2, and decrease the weight of the edge with the tag representing g_2(i, j) = 1 in g_1. After the update of the weights, the consistency is more likely to be achieved.

Labeled Parsing
For the sake of formal concision, we have illustrated our algorithms omitting the labels. It is straightforward to extend the algorithms to labeled parsing. In the joint decoding algorithm, we just need to extend the weights w_1, w_2, w_3 to every label that appears in the three tree sets, and the algorithm can be derived similarly.

Experimental Setup
We conduct experiments on Chinese GRBank (Sun et al., 2014), an LFG-style GR corpus for Mandarin Chinese. Linguistically speaking, this deep dependency annotation directly encodes information such as coordination, extraction, raising and control, as well as many other long-range dependencies. The split into training, development and test data also follows Sun et al. (2014).

The measure for comparing two dependency graphs is precision/recall of GR tokens, which are defined as ⟨w_h, w_d, l⟩ tuples, where w_h is the head, w_d is the dependent and l is the relation. Labeled precision/recall (LP/LR) is the ratio of tuples correctly identified by the parser, while unlabeled precision/recall (UP/UR) is the ratio regardless of l. F-score is the harmonic mean of precision and recall. These measures correspond to attachment scores (LAS/UAS) in dependency tree parsing. To evaluate the GR parsing models introduced later, we also report these metrics.

Table 3 shows the results of graph decomposition on the training set. If we use simple decomposition, say, directly extracting three trees from a graph, we get three subgraphs. On the training set, each kind of subgraph covers around 90% of the edges and 30% of the sentences. When we merge them together, they cover nearly 97% of the edges and over 70% of the sentences. This indicates that the capacity of a single tree is limited and that three trees can cover most of the edges.
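The evaluation metric defined above is straightforward to compute over sets of GR tokens; a minimal sketch (function name and toy data are ours):

```python
def pr_f(gold, pred):
    """Labeled precision/recall/F over GR tokens, i.e. sets of
    (head, dependent, label) tuples as defined in the text."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # correctly identified tuples
    p = tp / len(pred) if pred else 0.0         # labeled precision
    r = tp / len(gold) if gold else 0.0         # labeled recall
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return p, r, f

gold = {("涉及", "文件", "subj*ldd"), ("颁布", "文件", "obj")}
pred = {("涉及", "文件", "subj*ldd"), ("颁布", "文件", "comp")}
print(pr_f(gold, pred))
```

Dropping the label l from each tuple before comparison yields the unlabeled variants (UP/UR).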

Results of Graph Decomposition
When we apply Lagrangian Relaxation to the decomposition process, both the edge coverage and the sentence coverage see great error reductions.

Table 1 shows the results of graph merging on the development set, and Table 2 on the test set. The three training sets of trees are from the decomposition with Lagrangian Relaxation, and the models are trained on them. In both tables, simple merging (SM) refers to first decoding the three trees for a sentence and then combining them by putting all the edges together. As is shown, the merged graph achieves a higher f-score than any single model. With Lagrangian Relaxation, the performance of not only the merged graph but also the three subgraphs is improved, due to capturing the consistency information. When we do simple merging, though the recall of each kind of subgraph is much lower than its precision, the opposite holds for the merged graph. This is because consistency between the three models is not required and the models tend to give diverse subgraph predictions. When we require consistency between the three models, precision and recall become comparable, and higher f-scores are achieved.

Results of Graph Merging
The best scores reported by previous work, i.e. Sun et al. (2014) and Zhang et al. (2016), are also listed in Table 2. We can see that our subgraphs already achieve competitive scores, and the merged graph with Lagrangian Relaxation improves both unlabeled and labeled f-scores substantially, with error reductions of 15.13% and 10.86%. We also include Zhang et al.'s parsing result obtained by an ensemble model that integrates six different transition-based models. We can see that parser ensembling is very helpful for deep dependency parsing, and the accuracy of our graph merging parser is slightly lower than this ensemble model. Given that the architecture of graph merging is quite different from transition-based parsing, we think a system combination of our parser and the transition-based parser is promising.

Conclusion
To construct complex linguistic graphs beyond trees, we propose a new perspective, namely graph merging. We take GR parsing as a case study and exemplify the idea. There are two key problems in this perspective, namely graph decomposition and merging. To solve these two problems in a principled way, we treat both as optimization problems and employ combinatorial optimization techniques. Experiments demonstrate the effectiveness of the graph merging framework. This framework can be applied to other types of flexible representations, e.g. semantic dependency graphs (Oepen et al., 2014, 2015) and abstract meaning representations (Banarescu et al., 2013).