Transforming Dependencies into Phrase Structures

We present a new algorithm for transforming dependency parse trees into phrase-structure parse trees. We cast the problem as structured prediction and learn a statistical model. Our algorithm is faster than traditional phrase-structure parsing and achieves 90.4% English parsing accuracy and 82.4% Chinese parsing accuracy, close to the state of the art on both benchmarks.


Introduction
Natural language parsers typically produce phrase-structure (constituent) trees or dependency trees. These representations capture some of the same syntactic phenomena, and the two can be produced jointly (Klein and Manning, 2002; Hall and Nivre, 2008; Carreras et al., 2008; Rush et al., 2010). Yet it appears to be completely unpredictable which will be preferred by a particular subcommunity or used in a particular application. Both continue to receive the attention of parsing researchers.
Further, it appears to be a historical accident that phrase-structure syntax was used in annotating the Penn Treebank, and that English dependency annotations are largely derived through mechanical, rule-based transformations (reviewed in Section 2). Indeed, despite extensive work on direct-to-dependency parsing algorithms (which we call d-parsing), the most accurate dependency parsers for English still involve phrase-structure parsing (which we call c-parsing) followed by rule-based extraction of dependencies (Kong and Smith, 2014).
What if dependency annotations had come first? Because d-parsers are generally much faster than c-parsers, we consider an alternate pipeline (Section 3): d-parse first, then transform the dependency representation into a phrase-structure tree constrained to be consistent with the dependency parse. This idea was explored by Xia and Palmer (2001) and Xia et al. (2009) using hand-written rules. Instead, we present a data-driven algorithm using the structured prediction framework (Section 4). The approach can be understood as a specially-trained coarse-to-fine decoding algorithm where a d-parser provides "coarse" structure and the second stage refines it (Charniak and Johnson, 2005; Petrov and Klein, 2007).
Our lexicalized phrase-structure parser, PAD, is asymptotically faster than parsing with a lexicalized context-free grammar: O(n²) plus d-parsing, vs. O(n⁵) worst-case runtime in sentence length n, with the same grammar constant. Experiments show that our approach achieves linear observable runtime, and accuracy similar to state-of-the-art phrase-structure parsers without reranking or semi-supervised training (Section 7).

Background
We begin with the conventional development by first introducing c-parsing and then defining d-parses through a mechanical conversion using head rules. In the next section, we consider the reverse transformation.

CFG Parsing
The phrase-structure trees annotated in the Penn Treebank are derivation trees from a context-free grammar. Define a binary context-free grammar (CFG) as a 4-tuple (N, G, T, r) where N is a set of nonterminal symbols (e.g. NP, VP), T is a set of terminal symbols, consisting of the words in the language, G is a set of binary rules of the form A → β₁ β₂, and r ∈ N is a distinguished root nonterminal symbol.
Given an input sentence x₁, . . . , xₙ of terminal symbols from T, define the set of c-parses for the sentence as Y(x). This set consists of all binary ordered trees with fringe x₁, . . . , xₙ, internal nodes labeled from N, all tree productions A → β₁ β₂ consisting of members of G, and root label r.
For a c-parse y ∈ Y(x), we further associate a span ⟨v⇐, v⇒⟩ with each vertex in the tree. This specifies the subsequence x_{v⇐}, . . . , x_{v⇒} of the sentence covered by this vertex.

Dependency Parsing
Dependency parses provide an alternative, and in some sense simpler, representation of sentence structure. These d-parses can be derived through mechanical transformation from context-free trees. There are several popular transformations in wide use; each provides a different representation of a sentence's structure (Collins, 2003; De Marneffe and Manning, 2008; Yamada and Matsumoto, 2003; Johansson and Nugues, 2007).
We consider the class of transformations that are defined through local head rules. For a binary CFG, define a collection of head rules as a mapping from each CFG rule to a head preference for its left or right child. We use the notation A → β₁* β₂ and A → β₁ β₂* to indicate a left- or right-headed rule, respectively.
The head rules can be used to map a c-parse to a dependency tree (d-parse). In a d-parse, each word in the sentence is assigned as a dependent to a head word, h ∈ {0, . . . , n}, where 0 is a special symbol indicating the pseudo-root of the sentence. For each h we define L(h) ⊂ {1, . . . , h − 1} as the set of left dependencies of h, and R(h) ⊂ {h + 1, . . . , n} as the set of right dependencies.
A d-parse can be constructed recursively from a c-parse and the head rules. For each c-parse vertex v with potential children v_L and v_R, in bottom-up order, we apply the following procedure to both assign heads to the c-parse and construct the d-parse:
[Figure 1: An example conversion for "The₁ automaker₂ sold₃ . . ." The blue and red vertices have the words automaker₂ and sold₃ as heads, respectively. The vertex VP(3) implies that automaker₂ is a left-dependent of sold₃, i.e. that 2 ∈ L(3) in the d-parse.]
1. If the vertex v is a leaf xₘ, then head(v) = m.

2. If the rule at v is A → β₁* β₂, then head(v) = head(v_L) and head(v_R) ∈ R(head(v)), i.e. the head of the right child is a dependent of the head word.

3. If the rule at v is A → β₁ β₂*, then head(v) = head(v_R) and head(v_L) ∈ L(head(v)), i.e. the head of the left child is a dependent of the head word.

Figure 1 shows an example conversion of a c-parse to a d-parse using this procedure.
By construction, these dependencies form a directed tree with arcs (h, m) for all h ∈ {0, . . . , n} and m ∈ L(h) ∪ R(h). While this tree differs from the original c-parse, we can relate the two trees through their spans. Define the dependency tree span ⟨h⇐, h⇒⟩ as the contiguous sequence of words reachable from word h in this tree. This span is equivalent to the maximal span ⟨v⇐, v⇒⟩ of any c-parse vertex with head(v) = h. This property will be important for the parsing algorithm presented in the next section.
[Figure 2 (using the head rules of Collins et al., 1999): A d-parse (left) and several c-parses consistent with it (right). Our goal is to select the best parse from this set.]
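The bottom-up head-assignment procedure above can be sketched in a few lines of code. This is a minimal illustration, not PAD's actual implementation; the `Node` class and `head_side` flag are hypothetical names standing in for a binary c-parse with head rules already applied.

```python
from collections import defaultdict

class Node:
    def __init__(self, label, children=None, index=None, head_side=None):
        self.label = label                # nonterminal, or POS tag for leaves
        self.children = children or []    # [] for leaves, [v_L, v_R] otherwise
        self.index = index                # 1-based word position (leaves only)
        self.head_side = head_side        # "L" for A -> b1* b2, "R" for A -> b1 b2*

def to_dparse(root):
    """Apply the three rules above bottom-up; return deps mapping each
    head h to its dependents, with 0 as the pseudo-root."""
    deps = defaultdict(list)

    def head(v):
        if not v.children:                        # rule 1: leaf
            return v.index
        hl, hr = head(v.children[0]), head(v.children[1])
        if v.head_side == "L":                    # rule 2: left-headed
            deps[hl].append(hr)
            return hl
        else:                                     # rule 3: right-headed
            deps[hr].append(hl)
            return hr

    deps[0].append(head(root))                    # attach sentence head to 0
    return dict(deps)

# "The(1) automaker(2) sold(3)": (S (NP (DT The) (NN automaker)) (VP sold))
np = Node("NP", [Node("DT", index=1), Node("NN", index=2)], head_side="R")
s = Node("S", [np, Node("VP", index=3)], head_side="R")
print(to_dparse(s))   # {2: [1], 3: [2], 0: [3]}
```

The output matches the Figure 1 example: automaker₂ heads The₁, sold₃ heads automaker₂, and sold₃ hangs off the pseudo-root, so 2 ∈ L(3).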

Parsing Dependencies
Now we consider flipping this setup. There has been significant progress in developing efficient direct-to-dependency parsers. These d-parsers are trained only on dependency annotations and do not require full phrase-structure trees. Some prefer this setup, since it allows easy selection of the specific dependencies of interest in a downstream task (e.g., information extraction), and perhaps even training specifically for those dependencies. Other applications make use of phrase structures, so c-parsers enjoy wide use as well.
With these latter applications in mind, we consider the problem of converting a fixed d-parse into a c-parse, with the intent of using off-the-shelf d-parsers for constructing phrase-structure parses. Since this problem is more challenging than its inverse, we use a structured prediction setup: we learn a function to score possible c-parse conversions, and then generate the highest-scoring c-parse given a d-parse. A toy example of the problem is shown in Figure 2.

Parsing Algorithm
Consider the classical problem of predicting the best c-parse under a CFG with head rules, known as lexicalized context-free parsing. Assume that we are given a binary CFG defining a set of valid c-parses Y(x). The parsing problem is to find the highest-scoring parse in this set, i.e. arg max_{y ∈ Y(x)} s(y; x), where s is a scoring function that factors over lexicalized tree productions.
This problem can be solved by extending the CKY algorithm to propagate head information. The algorithm can be compactly defined by the productions in Figure 3 (left). For example, one type of production combines an item over span ⟨i, k⟩ headed by m with an item over span ⟨k + 1, j⟩ headed by h, for all rules A → β₁ β₂* ∈ G and split points i ≤ k < j. This particular production indicates that rule A → β₁ β₂* was applied at a vertex covering ⟨i, j⟩ to produce two vertices covering ⟨i, k⟩ and ⟨k + 1, j⟩, and that the new head index h has dependent index m. We say this production "completes" word m, since it can no longer be the head of a larger span.
Running the algorithm consists of bottom-up dynamic programming over these productions. However, applying this version of the CKY algorithm requires O(n⁵|G|) time (linear in the number of productions), which is not practical to run without heavy pruning. Most lexicalized parsers therefore make further assumptions on the scoring function, which can lead to asymptotically faster algorithms (Eisner and Satta, 1999).
Instead, we consider the same objective, but constrain the c-parses to be consistent with a given d-parse, d. By "consistent," we mean that the c-parse will be converted by the head rules to this exact d-parse. Define the set of consistent c-parses as Y(x, d) and the constrained search problem as arg max_{y ∈ Y(x,d)} s(y; x, d).
Figure 3 (right) shows the algorithm for this new problem. The algorithm has several nice properties. All rules now must select words h and m that are consistent with the dependency parse (i.e., there is an arc (h, m)), so these variables are no longer free. Furthermore, since we have the full d-parse, we can precompute the dependency span ⟨m⇐, m⇒⟩ of each word m. By our definition of consistency, this gives us the c-parse span of m before it is completed, and fixes two more free variables. Finally, the head item's remaining index must match the boundary of its dependency span.
[Figure 3: Deduction rules for lexicalized CKY (left) and the constrained variant (right); the goal in each case is an item (⟨1, n⟩, m, r) for some m ∈ R(0).]
While the terms |L(h)| and |R(h)| could in theory make the runtime quadratic, in practice the number of dependents is almost always constant in the length of the sentence. This leads to linear observed runtime in practice, as we will show in Section 7.
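The span precomputation mentioned above is straightforward. A sketch, assuming the d-parse is given as a map from each head to its dependents (with 0 as the pseudo-root) and words are indexed 1..n:

```python
def dependency_spans(deps, n):
    """Compute each word m's dependency span <m_lo, m_hi>: the contiguous
    range of words reachable from m in the dependency tree."""
    span = {m: [m, m] for m in range(1, n + 1)}

    def visit(h):
        for m in deps.get(h, []):
            visit(m)                                   # children's spans first
            span[h][0] = min(span[h][0], span[m][0])   # widen left boundary
            span[h][1] = max(span[h][1], span[m][1])   # widen right boundary

    for root in deps.get(0, []):                       # pseudo-root's dependents
        visit(root)
    return {m: tuple(s) for m, s in span.items()}

# sold(3) heads automaker(2) and today(5); automaker heads The(1); etc.
deps = {0: [3], 3: [2, 5], 2: [1], 5: [4]}
print(dependency_spans(deps, 5))
```

In the constrained algorithm these precomputed spans fix the c-parse span of each word before it is completed, which is exactly what removes the free span variables from the CKY productions.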

Pruning
In addition to constraining the number of c-parses, the d-parse also provides valuable information about the labeling and structure of the c-parse. We can use this information to further prune the search space. We employ two pruning methods. Method 1 uses the part-of-speech tag of xₕ, tag(h), to limit the possible rule productions at a given span. We build tables G_{tag(h)} and restrict the search to rules seen in training for a particular part-of-speech tag.
Method 2 prunes based on the order in which dependent words are added. By the constraints of the algorithm, a head word xₕ must combine with each of its left and right dependents. However, the order of combination can lead to different tree structures (as illustrated in Figure 2). In total there are |L(h)| × |R(h)| possible orderings of dependents.
In practice, though, it is often easy to predict which side, left or right, will come next. We do this by estimating a distribution over the next side, conditioned on the candidate dependents, where m ∈ L(h) is the next left dependent and m′ ∈ R(h) is the next right dependent. If the conditional probability of left or right is greater than a threshold parameter γ, we make a hard decision to combine with that side next. This pruning further reduces the impact of outliers with multiple dependents on both sides.
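The thresholding step of Method 2 can be sketched as follows. This is a toy illustration: `p_left` stands in for the learned conditional probability estimate, which is not specified here.

```python
def next_side(p_left, gamma=0.95):
    """Hard-decision pruning: commit to a side when its estimated
    probability of contributing the next dependent exceeds gamma."""
    if p_left >= gamma:
        return "left"          # only consider combining with the left dependent
    if p_left <= 1.0 - gamma:
        return "right"         # only consider combining with the right dependent
    return "either"            # keep both orders in the search space
```

With γ = 0.95, a confident estimate on either side removes one of the two combination orders at that step, shrinking the |L(h)| × |R(h)| space of orderings for heads with dependents on both sides.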
We empirically measure how these pruning methods affect observed runtime and oracle parsing performance (i.e., how well a perfect scoring function could do with a pruned Y(x, d)). Table 1 shows a comparison of these pruning methods on development data. The constrained parsing algorithm is much faster than standard lexicalized parsing, and pruning contributes even greater speed-ups. The oracle experiments show that the d-parse constraints do contribute a large drop in oracle accuracy, while pruning contributes a relatively small one. Still, this upper bound on accuracy is high enough that we can still recover c-parses at least as accurate as those of state-of-the-art c-parsers. We will return to this discussion in Section 7.

Binarization and Unary Rules
We have to this point developed the algorithm for a strictly binary-branching grammar; however, the trees we need to produce have rules of varying size. In order to apply the algorithm, we binarize the grammar and add productions to handle unary rules. Consider a non-binarized rule of the form A → β₁ . . . βₘ with head child βₖ*. Relative to the head child βₖ, the rule has left side β₁ . . . β_{k−1} and right side β_{k+1} . . . βₘ. We replace this rule with new binary rules and nonterminal symbols to produce each side independently as a simple chain, left side first. The transformation introduces the following new rules: A → β₁ Ā*, Ā → βᵢ Ā* for i ∈ {2, . . . , k}, and Ā → Ā* βᵢ for i ∈ {k, . . . , m}.
As an example, consider the transformation of an NP rule with four children. [Figure: example binarization of a four-child NP rule into a chain of binary rules.]
These rules can then be reversed deterministically to produce a non-binary tree. (The rules above are slightly modified when k = 1.)
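One simple way to realize the left-side-first chain described above is sketched below. This is an illustrative implementation, not PAD's: the `-bar` symbol naming and the trailing `*` marking the head child are our own notation, and the chain here terminates by combining the head child directly with its innermost sibling.

```python
def binarize(parent, children, h):
    """Binarize parent -> children with head child at 0-based index h.
    Returns a list of (parent, left_child, right_child) binary rules,
    with '*' appended to the head child of each rule."""
    bar = parent + "-bar"
    rules = []
    l, r = 0, len(children) - 1
    cur = parent
    # peel left siblings first, outermost inward; head stays in the right chain
    while l < h:
        nxt = bar if (r - l) > 1 else children[h]
        rules.append((cur, children[l], nxt + "*"))
        cur, l = nxt, l + 1
    # then peel right siblings, outermost inward; head stays in the left chain
    while r > h:
        nxt = bar if (r - l) > 1 else children[h]
        rules.append((cur, nxt + "*", children[r]))
        cur, r = nxt, r - 1
    return rules

print(binarize("NP", ["DT", "JJ", "NN", "PP"], 2))
```

Reversing the transformation is then a matter of collapsing every `-bar` node into its parent, which recovers the original n-ary rule deterministically.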
We also explored binarization using horizontal and vertical Markovization to include additional context of the tree, as found useful in unlexicalized approaches (Klein and Manning, 2003). Preliminary experiments showed that this increased the size of the grammar and the runtime of the algorithm without leading to improvements in accuracy.
Phrase-structure trees also include unary rules of the form A → β₁*. To handle unary rules, we modify the parsing algorithms in Figure 3 to include a unary completion rule, applicable at any span ⟨i, j⟩ with head index i ≤ h ≤ j that is consistent with the dependency parse. In order to avoid unary recursion, we limit the number of applications of this rule at each span (preserving the runtime of the algorithm). Preliminary experiments looked at collapsing the unary rules into the nonterminal symbols, but we found that this hurt performance compared to explicit unary rules.

Structured Prediction
We learn the d-parse to c-parse conversion using a standard structured prediction setup. Define the linear scoring function s for a conversion as s(y; x, d, θ) = θ · f(x, d, y), where θ is a parameter vector and f(x, d, y) is a feature function that maps parse productions to sparse feature vectors. While the parser only requires a d-parse at prediction time, the parameters of this scoring function are learned directly from a treebank of c-parses and a set of head rules. The structured prediction model, in effect, learns to invert the head rule transformation.

Features
The scoring function requires specifying a set of parse features f which, in theory, could be directly adapted from existing lexicalized c-parsers. However, the structure of the dependency parse greatly limits the number of decisions that need to be made, and allows for a smaller set of features.
We model our features after two bare-bones parsing systems. The first set consists of the basic arc-factored features used by McDonald (2006): conjunctions of the rule with the modifier word and part-of-speech, and with the head word and part-of-speech. The second set of features is modeled after the span features described in the X-bar-style parser of Hall et al. (2014). These include conjunctions of the rule with: first and last word of the current span, preceding and following words of the current span, adjacent words at the split of the current span, and binned length of the span.
The full feature set is shown in Figure 4. After training, there are a total of around 2 million nonzero features. For efficiency, we use lossy feature hashing. We found this had no impact on parsing accuracy but made parsing significantly faster.
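Lossy feature hashing maps each sparse string feature to a fixed-size weight table, tolerating occasional collisions. A minimal sketch, assuming a 2²⁰-entry table (the size and the `crc32` choice are illustrative, not PAD's):

```python
import zlib

HASH_BITS = 20
TABLE_SIZE = 1 << HASH_BITS   # fixed-size weight table; collisions are lossy

def feature_index(feature):
    # stable string hash folded into the table; zlib.crc32 is deterministic
    # across runs, unlike Python's built-in hash()
    return zlib.crc32(feature.encode("utf-8")) % TABLE_SIZE

def score(weights, features):
    """Dot product theta . f(x, d, y) over active (binary) features."""
    return sum(weights[feature_index(f)] for f in features)

weights = [0.0] * TABLE_SIZE
weights[feature_index("rule=NP->DT NN ^ head=automaker")] = 1.5
```

The key trade-off is that two distinct features may share a weight; with ~2 million nonzero features and a large enough table, such collisions are rare enough not to hurt accuracy, as reported above.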

Training
The parameters θ are estimated using a structural support vector machine (Taskar et al., 2004). Given a set of gold-annotated c-parse examples (x₁, y₁), . . . , (x_D, y_D), and d-parses d₁, . . . , d_D induced from the head rules, we estimate the parameters to minimize the regularized empirical risk λ‖θ‖² + Σᵢ ℓ(xᵢ, dᵢ, yᵢ; θ), where we define the structured hinge loss ℓ(x, d, y; θ) = −s(y) + max_{y′ ∈ Y(x,d)} [s(y′) + ∆(y, y′)], and where ∆ is a problem-specific cost function. In experiments, we use a Hamming loss ∆(y, y′) = |y − y′|, where y is an indicator vector for production rules firing over pairs of adjacent spans (i.e., ⟨i, j, k⟩). The objective is optimized using AdaGrad (Duchi et al., 2011). The gradient calculation requires computing a loss-augmented max-scoring c-parse for each training example, which is done using the algorithm of Figure 3 (right).
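A single training update can be sketched as follows. This is a simplified illustration of the subgradient step (ignoring regularization), assuming sparse feature dicts and that the loss-augmented argmax has already produced the predicted parse's features; the function names are ours, not PAD's internals.

```python
import math
from collections import defaultdict

def adagrad_step(theta, grad_sq, gold_feats, pred_feats, eta=0.1, eps=1e-8):
    """One AdaGrad step on the structured hinge subgradient:
    grad = f(x, d, y_hat) - f(x, d, y_gold), where y_hat is the
    loss-augmented max-scoring c-parse."""
    grad = defaultdict(float)
    for k, v in pred_feats.items():
        grad[k] += v
    for k, v in gold_feats.items():
        grad[k] -= v
    for k, g in grad.items():
        if g == 0.0:
            continue
        grad_sq[k] += g * g                                   # accumulate g^2
        theta[k] -= eta * g / (math.sqrt(grad_sq[k]) + eps)   # per-feature rate

theta, grad_sq = defaultdict(float), defaultdict(float)
adagrad_step(theta, grad_sq, gold_feats={"a": 1.0}, pred_feats={"b": 1.0})
```

After the step, weights on gold-only features rise and weights on predicted-only features fall, with AdaGrad's per-feature learning rates shrinking for frequently updated features.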

Related Work
The problem of converting dependency trees to phrase-structure trees has been studied previously from the perspective of building multi-representational treebanks. Xia and Palmer (2001) and Xia et al. (2009) develop a rule-based system for the conversion of human-annotated dependency parses. This work focuses on modeling the conversion decisions made and capturing how researchers annotate specific phenomena. Our work focuses on the different problem of learning a data-driven structured prediction model that is also able to handle automatically predicted dependency parses as input. While the aim is different, Table 2 does give a direct comparison of our system to that of Xia et al. (2009) on gold d-parse data.
An important line of previous work also uses dependency parsers to produce phrase-structure trees. In particular Hall et al. (2007) and Hall and Nivre (2008) develop a specialized dependency label set to encode phrase-structure information in the d-parse. After predicting a d-parse this label information can be used to assemble a predicted c-parse. Our work differs in that it does not make any assumptions on the labeling of the dependency tree used and it uses structured prediction to produce the final c-parse.
Very recently, Fernández-González and Martins (2015) also show that an off-the-shelf, trainable dependency parser is enough to build a highly competitive constituent parser. They proposed a new intermediate representation called "head-ordered dependency trees", which encodes head-ordering information in dependency labels. Their algorithm is based on a reduction of constituent parsing to dependency parsing of such trees.
There has been successful work combining dependency and phrase-structure information to build accurate c-parsers. Klein and Manning (2002) construct a factored generative model that scores both context-free syntactic productions and semantic dependencies. Carreras et al. (2008) construct a state-of-the-art parser that uses a dependency parsing model both for pruning and within a richer lexicalized parser. Similarly, Rush et al. (2010) use dual decomposition to combine a powerful dependency parser with a lexicalized phrase-structure model. Our work differs in that we treat the dependency parse as a hard constraint, greatly reducing the runtime of a fully lexicalized phrase-structure parsing model while maintaining the ability, at least in principle, to generate highly accurate phrase-structure parses.
Finally there have also been several papers that use ideas from dependency parsing to simplify and speed up phrase-structure prediction. Zhu et al. (2013) build a high-accuracy phrase-structure parser using a transition-based system. Hall et al. (2014) use a stripped down parser based on a simple X-bar grammar and a small set of lexicalized features.

Methods
We ran a series of experiments to assess the accuracy, efficiency, and applicability of our parser, PAD, to several tasks. These experiments use the following setup.
For English experiments we use the standard Penn Treebank (PTB) experimental setup (Marcus et al., 1993). Training is done on §2-21, development on §22, and testing on §23. We use the development set to tune the regularization parameter, λ = 1e−8, and the pruning threshold, γ = 0.95.
For Chinese experiments, we use version 5.1 of the Penn Chinese Treebank (CTB) (Xue et al., 2005). We followed previous work and used articles 001-270 and 440-1151 for training, 301-325 for development, and 271-300 for test. We also use the development set to tune the regularization parameter, λ = 1e−3.
Part-of-speech tagging is performed for all models using TurboTagger (Martins et al., 2013). Prior to training the d-parser, the training sections are automatically processed using 10-fold jackknifing (Collins and Koo, 2005) for both dependency and phrase structure trees. Zhu et al. (2013) found this simple technique gives an improvement to dependency accuracy of 0.4% on English and 2.0% on Chinese in their system.
During training, we use the d-parses induced by the head rules from the gold c-parses as constraints. There is a slight mismatch here with test time, since these d-parses are guaranteed to be consistent with the target c-parse. We also experimented with using 10-fold jackknifing of the d-parser during training to produce more realistic parses; however, we found that this hurt the performance of the parser.
Unless otherwise noted, in English the test d-parsing is done using the RedShift implementation of the parser of Zhang and Nivre (2011), trained to follow the conventions of Collins head rules (Collins, 2003). This parser is a transition-based beam-search parser, and the size of the beam k controls a speed/accuracy trade-off. By default we use a beam of k = 16. We found that dependency labels have a significant impact on the performance of the RedShift parser, but not on English dependency conversion. We therefore train a labeled parser, but discard the labels.
For Chinese, we use the head rules compiled by Ding and Palmer (2005). For this dataset we trained the d-parser using the YaraParser implementation of the parser of Zhang and Nivre (2011), because it has a better Chinese implementation. We use a beam of k = 64. In experiments, we found that Chinese labels were quite helpful, and added four additional feature templates conjoining the label with the nonterminals of a rule.
Parsing accuracy is reported as labeled F₁ for c-parses and unlabeled attachment score (UAS) for d-parses. We implemented the grammar binarization, head rules, and pruning tables in Python, and the parser, features, and training in C++. Experiments are performed on a Lenovo ThinkCentre desktop computer with 32GB of memory and a Core i7-3770 3.4GHz CPU with 8MB cache.

Experiments
We ran experiments to assess the accuracy of the method, its runtime efficiency, the effect of dependency parsing accuracy, and the effect of the amount of annotated phrase-structure data. Table 3 compares the accuracy and speed of the phrase-structure trees produced by the parser. For these experiments we treat our system together with the Zhang-Nivre parser as an independently trained but complete end-to-end c-parser. Runtime for these experiments includes both the time for d-parsing and conversion. Despite the fixed dependency constraints, the English results show that the parser is comparable in accuracy to many widely-used systems, and is significantly faster. The parser most competitive in both speed and accuracy is that of Zhu et al. (2013), a fast shift-reduce phrase-structure parser. Furthermore, the Chinese results suggest that, even without making language-specific changes to the feature system, we can still achieve competitive parsing accuracy. Table 4 shows experiments comparing the effect of different input d-parses. For these experiments we used the same version of PAD with 11 different d-parsers of varying quality and speed. We measure for each parser its UAS, its speed, and its labeled F₁ when used with PAD and with an oracle converter. The paired figure shows a direct correlation between the UAS of the inputs and labeled F₁.

Runtime
In Section 3 we considered the theoretical complexity of the parsing model and presented the main speed results in Table 1. Despite having a quadratic theoretical complexity, the practical runtime was quite fast. Here we consider the empirical complexity of the model by measuring the time spent on individual sentences. Figure 5 shows parser speed for sentences of varying length for both the full algorithm and with pruning. In both cases the observed runtime is linear.
Recovering Phrase-Structure Treebanks

Annotating phrase-structure trees is often more expensive and slower than annotating unlabeled dependency trees (Schneider et al., 2013). For low-resource languages, an alternative to developing fully annotated phrase-structure treebanks might be to annotate a small number of c-parses and a larger number of cheaper d-parses. Assuming this setup, we ask: how many c-parses are necessary to obtain reasonable performance?
For this experiment, we train PAD on only 5% of the PTB training set and apply it to predicted d-parses from a fully-trained model. Even with this small amount of data, we obtain a parser with a development score of F₁ = 89.1%, which is comparable to Charniak (2000) and the Stanford PCFG parser (Klein and Manning, 2003) trained on the complete c-parse training set. Additionally, if the gold dependencies are available, PAD with 5% training achieves F₁ = 95.8% on development, demonstrating a strong ability to recover phrase-structure trees from dependency annotations.
[Table 5: Error analysis of binary CFG rules. Rules used are split into classes based on correct (+) identification of dependency (h, m), span ⟨i, j⟩, and split k. "Count" is the size of each class. "Acc." is the accuracy of span nonterminal identification.]
Analysis

Finally, we consider an internal error analysis of the parser. For this analysis, we group each binary rule production selected by the parser by three properties: Is its dependency (h, m) correct? Is its span ⟨i, j⟩ correct? Is its split k correct?
The first property is fully determined by the input d-parse; the others are partially determined by PAD itself. Table 5 shows the breakdown. The conversion is almost always accurate (∼98%) when the parser has correct span and dependency information. As expected, the difficult cases come when the dependency was fully incorrect, or when there is a propagated span mistake. As dependency parsers improve, the performance of PAD should improve as well.

Conclusion
With recent advances in statistical dependency parsing, we find that fast, high-quality phrase-structure parsing is achievable using dependency parsing first, followed by a statistical conversion algorithm to fill in phrase-structure trees. Our implementation is available as open-source software at https:// github.com/ikekonglp/PAD.