Efficient Inner-to-outer Greedy Algorithm for Higher-order Labeled Dependency Parsing

Many NLP systems use dependency parsers as critical components. Joint learning parsers usually achieve better parsing accuracy than two-stage methods. However, classical joint parsing algorithms significantly increase computational complexity, which makes joint learning impractical. In this paper, we propose an efficient dependency parsing algorithm that is capable of capturing multiple edge-label features while maintaining low computational complexity. We evaluate our parser on 14 different languages. Our parser consistently obtains more accurate results than three baseline systems and three popular, off-the-shelf parsers.

Dependency parsers predict both the dependency structure and the dependency type label on each edge. However, most graph-based dependency parsing algorithms only produce unlabeled dependency trees, particularly when higher-order factorizations are used (Koo and Collins, 2010; Ma and Zhao, 2012b; Martins et al., 2013; Ma and Zhao, 2012a). A two-stage method (McDonald, 2006) is often used because the complexity of some joint learning models is unacceptably high. On the other hand, joint learning models can benefit from edge-label information, which has proven important for producing more accurate tree structures and labels (Nivre and Scholz, 2004).
Previous studies have explored the trade-off between computational cost and parsing performance. Some work (McDonald, 2006; Carreras, 2007) restricted labeled information to single-label features. Other work (Johansson and Nugues, 2008; Bohnet, 2010) used richer label features and achieved better parsing accuracy, but increased system complexity significantly. Yet no previous work has addressed the problem of striking a good balance between parsing accuracy and computational cost for joint parsing models.
In this paper, we propose a new dependency parsing algorithm that can utilize edge-label information from more than one edge while maintaining low computational complexity. The key component for resolving this dilemma is an inner-to-outer greedy approximation that avoids an exhaustive search. The contributions of this work are (i) showing the effectiveness of edge-label information on both UAS and LAS, (ii) proposing a joint learning parsing model that achieves both effectiveness and efficiency, and (iii) giving empirical evaluations of this parser on treebanks of 14 languages.

Basic Notations
In the following, x represents a generic input sentence, and y represents a generic dependency tree. Formally, for a sentence x, dependency parsing is the task of finding the dependency tree y with the highest score for x:

y^{*} = \operatorname{argmax}_{y \in Y(x)} \mathrm{Score}(x, y),   (1)

where Y(x) denotes the set of possible dependency trees for sentence x.
In this paper, we adopt the second-order sibling factorization (Eisner, 1996), in which each sibling part consists of a tuple of indices (h, m, c), where (h, m) and (h, c) are a pair of adjacent edges on the same side of the head h. By adding label information to this factorization, Score(x, y) can be rewritten as:

\mathrm{Score}(x, y) = \sum_{(h,m,c) \in y} S_{sib}(h, m, c, l_1, l_2),   (2)

where S_{sib}(h, m, c, l_1, l_2) = \lambda \cdot f(x, h, m, c, l_1, l_2) is the score function for the sibling part (h, m, c), with l_1 and l_2 being the labels of the edges (h, m) and (h, c), respectively, f is the feature function, and \lambda is the parameter vector of the parsing model.
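As a concrete illustration, the factorized labeled score can be sketched in Python; `feats` and `weights` are hypothetical stand-ins for the feature function f and the parameter vector λ, not the paper's actual implementation:

```python
# Sketch of the labeled second-order sibling score. A linear model
# over sparse (string-keyed) features is assumed for illustration.
def score_sib(sent, h, m, c, l1, l2, feats, weights):
    """Score the sibling part (h, m, c) with label l1 on edge (h, m)
    and label l2 on edge (h, c)."""
    return sum(weights.get(f, 0.0) for f in feats(sent, h, m, c, l1, l2))

def score_tree(sent, sib_parts, feats, weights):
    """Score(x, y): sum of labeled sibling-part scores over the tree's
    factorization into (h, m, c, l1, l2) tuples."""
    return sum(score_sib(sent, h, m, c, l1, l2, feats, weights)
               for (h, m, c, l1, l2) in sib_parts)
```

Any scoring function over labeled sibling parts would fit the same interface; the parsing algorithms below only assume that sibling parts can be scored independently.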

Exact Search Parsing
The unlabeled sibling parser introduces three types of dynamic-programming structures: complete spans C_{(s,t)}, which consist of the headword s and its descendants on one side with endpoint t; incomplete spans I_{(s,t)}, which consist of the dependency (s, t) and the region between the head s and the modifier t; and sibling spans S_{(s,t)}, which represent the region between successive modifiers s and t of some head. To capture label information, we have to extend each incomplete span I_{(s,t)} to I_{(s,t,l)} in order to store the label of the dependency edge from s to t. The reason is that an edge is shared by two adjacent sibling parts (e.g., (h, m, c) and (h, c, c') share the edge (h, c)). So the incomplete span I_{(s,t)} depends not only on the label of the dependency (s, t), but also on the label of the dependency (s, r) for each split point r. The dynamic-programming derivation for the new incomplete spans is

I_{(s,t,l)} = \max_{s<r\le t} \max_{l' \in L} [ I_{(s,r,l')} + S_{(r,t)} + S_{sib}(s, r, t, l', l) ],   (3)

where L is the set of all edge labels. This algorithm therefore requires O(|L|^2 n^3) time and O(|L| n^2) space. The graphical specification of this parsing algorithm is provided in Figure 1 (a). Since, for example, the label number in the treebank of the CoNLL-2008 shared task is 70, it is impractical to perform an exhaustive search for parsing, and more efficient approximating algorithms are needed.
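A minimal sketch of this label-extended incomplete-span update makes the two nested label maximizations explicit; the names `I`, `S_span`, and `score_sib` are illustrative, not from the paper:

```python
# Sketch of the exact labeled update: every incomplete span carries a
# label index, so each update maximizes over both the split point r
# and the inner edge label, giving O(|L|^2 n^3) time overall.
NEG_INF = float("-inf")

def update_incomplete(I, S_span, score_sib, s, t, labels):
    """I[(s, r, lp)]: best labeled incomplete span; S_span[(r, t)]:
    best sibling span between r and t. Returns the new values of
    I[(s, t, l)] for every label l of the outer edge (s, t)."""
    best = {l: NEG_INF for l in labels}
    for r in range(s + 1, t + 1):            # split point
        for lp in labels:                    # label of inner edge (s, r)
            inner = I.get((s, r, lp), NEG_INF) + S_span.get((r, t), 0.0)
            for l in labels:                 # label of outer edge (s, t)
                cand = inner + score_sib(s, r, t, lp, l)
                if cand > best[l]:
                    best[l] = cand
    return best
```

The |L| x |L| inner loop is exactly what the intermediate models and the greedy search below are designed to avoid.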

Two Intermediate Models
In this section, we describe two intuitive simplifications of the labeled parsing model presented above. For the two simplified parsing models, efficient algorithms are available.

Model 0: Single-edge Label
In this parsing model, labeled features are restricted to a single edge. Specifically,

S_{sib}(h, m, c, l_1, l_2) = S_{sib}(h, m, c) + S_{sib}(h, c, l_2).

Then the dynamic-programming derivation for each incomplete span becomes

I_{(s,t)} = \max_{s<r\le t} [ I_{(s,r)} + S_{(r,t)} + S_{sib}(s, r, t) ] + S_{sib}(s, t, l(s, t)),   (4)

where l(s, t) = \operatorname{argmax}_{l \in L} S_{sib}(s, t, l).
In this case, therefore, we do not have to extend incomplete spans. The computational cost of calculating l(s, t) for all edges is O(|L| n^2) time, so the overall complexity of this algorithm is O(n^3 + |L| n^2) time and O(n^2) space.
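The precomputation of l(s, t) can be sketched as follows; `score_lbl` is an illustrative single-edge label score, not a function from the paper:

```python
# Sketch of Model 0's label precomputation: the best label of every
# edge (s, t) is fixed independently of the tree search, in
# O(|L| n^2) time, so the main parse stays an unlabeled O(n^3) search.
def precompute_best_labels(n, labels, score_lbl):
    """Return best[(s, t)] = argmax over l in labels of score_lbl(s, t, l),
    for every ordered pair of distinct positions."""
    best = {}
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            best[(s, t)] = max(labels, key=lambda l: score_lbl(s, t, l))
    return best
```

Because the chosen label never depends on the surrounding sibling part, this table can be filled once before parsing and simply looked up inside the dynamic program.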

Model 1: Sibling with Single Label
As remarked in McDonald (2006), Model 0 can be slightly enriched to include single-label features associated with a sibling part. Formally,

S_{sib}(h, m, c, l_1, l_2) = S_{sib}(h, m, c, l_2).

Now, the dynamic-programming derivation is

I_{(s,t)} = \max_{s<r\le t} [ I_{(s,r)} + S_{(r,t)} + S_{sib}(s, r, t, l(s, r, t)) ],   (5)

where l(s, r, t) = \operatorname{argmax}_{l \in L} S_{sib}(s, r, t, l).
The additional computation of the best edge label l(s, r, t) takes O(|L| n^3) time. Therefore, this algorithm requires O(|L| n^3) time and O(n^2) space. Figure 1 (b) and Figure 1 (c) provide the graphical specifications for Model 0 and Model 1, respectively.
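Model 1's best-label table can be sketched analogously; the label is now chosen per sibling part, which is what raises the precomputation to O(|L| n^3). `score_sib1` is an illustrative stand-in for the single-label sibling score:

```python
# Sketch of Model 1's label precomputation: one best label per
# sibling part (s, r, t) rather than per edge. Degenerate index
# combinations (e.g. s == r) are kept for simplicity of the sketch.
def best_sibling_labels(n, labels, score_sib1):
    """Return best[(s, r, t)] = argmax over l of score_sib1(s, r, t, l)."""
    best = {}
    for s in range(n):
        for r in range(n):
            for t in range(n):
                best[(s, r, t)] = max(labels,
                                      key=lambda l: score_sib1(s, r, t, l))
    return best
```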

Inner-to-outer Greedy Search
Though the two intermediate parsing models, Model 0 and Model 1, encode edge-label information and have efficient parsing algorithms, the labeled features they are able to capture are relatively limited, since their labeled feature functions are restricted to a single label. Our experimental results show that utilizing this edge-label information yields only a slight improvement in parsing accuracy (see Section 3 for details). In this section, we describe our new labeled parsing model, which can exploit labeled features involving two edge labels in a sibling part. To achieve an efficient search, we adopt a method characterized by inferring the labels of outer parts from the labels of inner ones.
Formally, consider the maximization problem in Eq 3. It can be treated as a two-layer maximization: the first layer fixes a split point r and maximizes over the edge labels, and the second layer maximizes over all possible split points. Our approach approximates the maximization in the first layer:

\max_{l_1 \in L} \max_{l_2 \in L} S_{sib}(s, r, t, l_1, l_2) \approx \max_{l \in L} S_{sib}(s, r, t, l^{*}_{(s,r)}, l),   (6)

where l^{*}_{(s,r)} is the best label of the edge (s, r), computed when the incomplete span I_{(s,r)} was derived. Then the dynamic-programming derivation for each incomplete span is

I_{(s,t)} = \max_{s<r\le t} [ I_{(s,r)} + S_{(r,t)} + \max_{l \in L} S_{sib}(s, r, t, l^{*}_{(s,r)}, l) ].   (7)

To compute I_{(s,t)}, we need to calculate

l(s, r, t, l^{*}_{(s,r)}) = \operatorname{argmax}_{l \in L} S_{sib}(s, r, t, l^{*}_{(s,r)}, l),   (8)

which is similar to the calculation of l(s, r, t) in Model 1. The only difference is l^{*}_{(s,r)}, which has already been calculated in previous derivations, so their computational costs are almost the same.

The procedure of our algorithm for deriving incomplete spans can be regarded as two steps. In the first step, the algorithm goes through all possible split points (Eq 7). In the second step, at each split point r, it calculates the label l(s, r, t, l^{*}_{(s,r)}) (Eq 8) based on the sibling part (s, r, t) and the label l^{*}_{(s,r)}, which is the "best" label for the dependency edge (s, r) according to the incomplete span I_{(s,r)}.

The key insight of this algorithm is the inner-to-outer dynamic-programming structure: inner modifiers (r) of a head (s) and their "best" labels (l^{*}_{(s,r)}) are generated before outer ones (t). Thus, using the already computed "best" labels of inner dependency edges allows us to avoid maximizing over the two labels l_1 and l_2. Moreover, we do not have to extend each incomplete span with a label index, which keeps the space complexity at O(n^2); this is important in practice. The graphical specification is provided in Figure 1 (d).
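The greedy update of Eqs 7 and 8 can be sketched as follows; the names `I`, `S_span`, `lstar`, and `score_sib` are illustrative. The inner edge's label is read from `lstar` instead of being maximized over, so incomplete spans stay unlabeled:

```python
# Sketch of the inner-to-outer greedy update: the inner edge's label
# is fixed to the stored best label lstar[(s, r)], so only the outer
# label l is maximized (Eq 8) and the span table needs no label index.
NEG_INF = float("-inf")

def greedy_incomplete(I, S_span, lstar, score_sib, s, t, labels):
    """Compute I[(s, t)] and the best label for the edge (s, t)."""
    best_score, best_label = NEG_INF, None
    for r in range(s + 1, t + 1):                 # split points (Eq 7)
        base = I.get((s, r), NEG_INF) + S_span.get((r, t), 0.0)
        lp = lstar.get((s, r))                    # best inner label
        for l in labels:                          # outer label only (Eq 8)
            cand = base + score_sib(s, r, t, lp, l)
            if cand > best_score:
                best_score, best_label = cand, l
    return best_score, best_label
```

The returned label would be stored as `lstar[(s, t)]` for use by later, more outward derivations, which is precisely the inner-to-outer order the text describes.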

Setup
We conduct our experiments on 14 languages, including the English treebank from the CoNLL-2008 shared task (Surdeanu et al., 2008) and all 13 treebanks from the CoNLL-2006 shared task (Buchholz and Marsi, 2006). We train our parser using the k-best version of the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003; Crammer et al., 2006; McDonald, 2006). In our experiments, we set k = 1 and fix the number of iterations to 10, instead of tuning these parameters on development sets. Following previous work, all experiments are evaluated on the metrics of unlabeled attachment score (UAS) and labeled attachment score (LAS), using the official scorer of the CoNLL-2006 shared task.

Non-Projective Parsing
The parsing algorithms described in this paper fall into the category of projective dependency parsers, which exclude crossing dependency edges. Since the treebanks from the CoNLL shared tasks contain non-projective edges, we use the "mountain-climbing" non-projective parsing algorithm proposed in McDonald and Pereira (2006). This approximating algorithm first searches for the highest-scoring projective parse tree and then rearranges edges in the tree until the rearrangements no longer increase the tree's score. (Additional care is required in the non-projective approximation, since a change of one edge could result in a label change for multiple edges.)

Table 1: UAS and LAS of the non-projective versions of our parsing algorithms on 14 treebanks from the CoNLL shared tasks, together with three baseline systems and the best systems for each language reported in the CoNLL shared tasks. MD06 is McDonald et al. (2006), RD06 is Riedel et al. (2006), JN08 is Johansson and Nugues (2008), and NV06 is Nivre et al. (2006). Bold indicates the best result for a language. Red values represent statistically significant improvements over the two-stage baseline system on the corresponding metrics with p < 0.01, using McNemar's test. Blue values indicate statistically significant improvements with p < 0.05.

Table 1 shows the parsing results of our parser with the non-projective parsing algorithm, together with three baseline systems (the two-stage system (McDonald, 2006) and the two intermediate models, Model 0 and Model 1) and the best systems reported in the CoNLL shared tasks for each language. Our parser achieves better parsing performance on both UAS and LAS than all three baseline systems for 12 languages. The two exceptions are Portuguese and Turkish, on which our parser achieves better LAS and comparable UAS. Compared with the best systems from CoNLL, our parser achieves better performance on both UAS and LAS for 9 languages. Moreover, the average UAS of our parser over the 14 languages is better than that of the best CoNLL systems. It should be noted that the best results for the 14 languages in CoNLL are not from one single system, but from the different systems that achieved the best results for the corresponding languages.

Experiments on PTB
To make a thorough empirical comparison with previous studies, we also evaluate our system on the English Penn Treebank (Marcus et al., 1993) with Stanford Basic Dependencies (De Marneffe et al., 2006). We compare our parser with three off-the-shelf parsers: MaltParser (Nivre and Scholz, 2004; Zhang and Clark, 2008; Nivre, 2011), MSTParser (McDonald et al., 2005), and the parser using neural networks (DNNParser) (Chen and Manning, 2014). The results are listed in Table 2. Clearly, our parser is superior in terms of both UAS and LAS.

Table 3: Top 10 dependency labels on which our algorithm achieves the most improvement in the F1 score of UAS, together with the corresponding improvements in LAS. "TST" indicates the two-stage system. The first column is the label name in the treebank; the second column is the label's description from Surdeanu et al. (2008).

Analysis
To better understand the performance of our parser, we analyze the distribution of our parser's UAS and LAS over different dependency labels on the English CoNLL treebank, compared with those of the two-stage model. Table 3 lists the top 10 dependency labels on which our algorithm achieves the most improvement in the F1 score of UAS, together with the corresponding improvements in LAS. From Table 3 we can see that among the 10 labels there are 5 ("MNR", "ADV", "TMP", "DIR", "LOC") that mark specific kinds of adverbials. This illustrates that our parser performs well on the recognition of different kinds of adverbials. Moreover, the labels "OPRD" and "OBJ" also indicate dependency relations between verbs and their modifiers. In addition, our parser significantly improves the accuracy of appositional relations ("APPO").

Conclusion
We proposed a new dependency parsing algorithm that jointly learns dependency structures and edge labels. Our parser is able to use multiple edge-label features while maintaining low computational complexity. Experimental results on 14 languages show that our parser significantly improves the accuracy of both dependency structures (UAS) and edge labels (LAS) over three baseline systems and three off-the-shelf parsers. This demonstrates that jointly learning dependency structures and edge labels can benefit both tree-structure performance and labeling accuracy. Moreover, our parser outperforms the best systems reported in the CoNLL shared tasks for 9 of the 14 languages.
In future work, we are interested in extending our parser to higher-order factorizations by increasing horizontal context (e.g., from siblings to "tri-siblings") and vertical context (e.g., from siblings to "grand-siblings"), and in validating its effectiveness on a wide range of NLP applications.