Transition-based Spinal Parsing

We present a transition-based arc-eager model to parse spinal trees, a dependency-based representation that includes phrase-structure information in the form of constituent spines assigned to tokens. As a main advantage, the arc-eager model can use a rich set of features combining dependency and constituent information, while parsing in linear time. We describe a set of conditions for the arc-eager system to produce valid spinal structures. In experiments using beam search we show that the model obtains a good trade-off between speed and accuracy, and yields state-of-the-art performance for both dependency and constituent parsing measures.


Introduction
There are two main representations of the syntactic structure of sentences, namely constituent and dependency-based structures. In terms of statistical modeling, an advantage of dependency representations is that they are naturally lexicalized, and this allows the statistical model to capture a rich set of lexico-syntactic features. The recent literature has shown that such lexical features greatly improve the accuracy of statistical models for parsing (Collins, 1999; Nivre, 2003; McDonald et al., 2005). Constituent structure, on the other hand, might still provide valuable syntactic information that is not captured by standard dependencies.
In this work we investigate transition-based statistical models that produce spinal trees, a representation that combines dependency and constituent structures. Statistical models that use both representations jointly were pioneered by Collins (1999), who used constituent trees annotated with head-child information in order to define lexicalized PCFG models, i.e. extensions of classic constituent-based PCFGs that make central use of lexical dependencies.
An alternative approach is to view the combined representation as a dependency structure augmented with constituent information. This approach was first explored by Collins (1996), who defined a dependency-based probabilistic model that associates a triple of constituents with each dependency. In our case, we follow the representations proposed by , which we call spinal trees. In a spinal tree (see Figure 1 for an example), each token is associated with a spine of constituents, and head-modifier dependencies are attached to nodes in the spine, thus combining the two sources of information in a tight manner. Since spinal trees are inherently dependency-based, it is possible to extend dependency models to such representations, as shown by  using a so-called graph-based model. The main advantage of such models is that they allow a large family of rich features that include dependency features, constituent features and conjunctions of the two. The drawback is that the additional spinal structure greatly increases the number of dependency relations. Even though a graph-based model remains parseable in cubic time, it is impractical unless some pruning strategy is used.
In this paper we propose a transition-based parser for spinal parsing, based on the arc-eager strategy by Nivre (2003). Since transition-based parsers run in linear time, our aim is to speed up spinal parsing while taking advantage of the rich representation it provides. Thus, the research question underlying this paper is whether we can accurately learn to make greedy parsing decisions for rich but complex structures such as spinal trees. To control the trade-off between speed and accuracy, we use beam search for transition-based parsing, which has been shown to be successful (Zhang and Clark, 2011b). The main contributions of this paper are the following:
• We define an arc-eager statistical model for spinal parsing that is based on the triplet relations by Collins (1996). Such relations, in conjunction with the partial spinal structure available in the stack of the parser, provide a very rich set of features.
• We describe a set of conditions that an arc-eager strategy must guarantee in order to produce valid spinal structures.
• In experiments using beam search we show that our method obtains a good trade-off between speed and accuracy for both dependency-based attachment scores and constituent measures.

Spinal Trees
A spinal tree is a generalization of a dependency tree that adds constituent structure to the dependencies in the form of spines. In this section we describe the spinal trees used by . A spine is a sequence of constituent nodes associated with a word in the sentence. From a linguistic perspective, a spine corresponds to the projection of the word in the constituent tree. In other words, the spine of a word consists of the constituents whose head is the word. See Figure 1 for an example of a sentence and its constituent and spinal trees. In the example the spine of each token is the vertical sequence on top of it. Formally, a spinal tree for a sentence $x_{1:n}$ is a pair $(V, E)$, where $V$ is a sequence of $n$ spinal nodes and $E$ is a set of $n$ spinal dependencies. The $i$-th node in $V$ is a pair $(x_i, \sigma_i)$, where $x_i$ is the $i$-th word of the sentence and $\sigma_i$ is its spine.
A spine $\sigma$ is a vertical sequence of constituent nodes. We denote by $N$ the set of constituent nodes, and we use $\star \in N$ to denote a special terminal node. We denote by $l(\sigma)$ the length of a spine. A spine $\sigma$ is always non-empty, $l(\sigma) \geq 1$, its first node is always $\star$, and for any $2 \leq j \leq l(\sigma)$ the $j$-th node of the spine is an element of $N$.
A spinal dependency is a tuple $(h, d, p)$ that represents a directed dependency from the $p$-th node of $\sigma_h$ to the $d$-th node of $V$. Thus, a spinal dependency is a regular dependency between a head token $h$ and a dependent token $d$, augmented with a position $p$ in the head spine. It must be that $1 \leq h, d \leq n$ and that $1 < p \leq l(\sigma_h)$.
The set of spinal dependencies $E$ satisfies the standard conditions of forming a rooted directed projective tree (Kübler et al., 2009). In addition, the dependencies in $E$ are correctly nested with respect to the constituent structure that the spines represent. Formally, let $(h, d_1, p_1)$ and $(h, d_2, p_2)$ be two spinal dependencies associated with the same head $h$. For left dependents, correct nesting means that if $d_1 < d_2 < h$ then $p_1 \geq p_2$. For right dependents, if $h < d_1 < d_2$ then $p_1 \leq p_2$.
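The position and nesting conditions above can be checked mechanically. The following is a minimal sketch, not the authors' code: the function name `nesting_ok`, the dictionary layout for spines, and the use of `"*"` as an ASCII stand-in for the terminal node are all assumptions made for illustration. It checks only the position bound and the nesting order, and does not verify that the dependencies form a projective tree.

```python
from collections import defaultdict

STAR = "*"  # assumed ASCII stand-in for the special terminal node


def nesting_ok(spines, deps):
    """Check position bounds and nesting of spinal dependencies.

    spines: dict mapping a token index to its spine (list of labels,
            position 1 is spines[i][0], which should be STAR).
    deps:   iterable of (h, d, p) tuples with 1-based position p.
    """
    by_head = defaultdict(list)
    for h, d, p in deps:
        # A dependency must attach above the terminal node: 1 < p <= l(spine).
        if not (1 < p <= len(spines[h])):
            return False
        by_head[h].append((d, p))

    for h, mods in by_head.items():
        left = sorted((d, p) for d, p in mods if d < h)
        right = sorted((d, p) for d, p in mods if d > h)
        # Left dependents: d1 < d2 < h requires p1 >= p2.
        if any(p1 < p2 for (_, p1), (_, p2) in zip(left, left[1:])):
            return False
        # Right dependents: h < d1 < d2 requires p1 <= p2.
        if any(p1 > p2 for (_, p1), (_, p2) in zip(right, right[1:])):
            return False
    return True
```

Comparing adjacent dependents is enough because both conditions are monotonicity requirements over dependents sorted by position in the sentence.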
In practice, it is straightforward to obtain spinal trees from a treebank of constituent trees with head-child annotations in each constituent: starting from a token, its spine consists of the non-terminal labels of the constituents whose head is the token; the parent node of the top of the spine gives information about the lexical head (by following the head children of the parent) and the position at which the spine attaches. Given a spinal tree it is trivial to recover the constituent and dependency trees.
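The conversion from a head-annotated constituent tree can be sketched as a single bottom-up traversal. The code below is an illustrative sketch under stated assumptions: the `Node` class, its `head` and `token` fields, and the function `to_spinal` are hypothetical names, leaves stand for tokens (part-of-speech tags are not included in the spines), and no root dependency is produced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

STAR = "*"  # assumed ASCII stand-in for the special terminal node


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    head: int = 0                 # index of the head child (internal nodes)
    token: Optional[int] = None   # 1-based token index (leaves only)


def to_spinal(root: Node) -> Tuple[Dict[int, List[str]], List[Tuple[int, int, int]]]:
    """Return (spines, deps) where deps holds (h, d, p) tuples."""
    spines: Dict[int, List[str]] = {}
    deps: List[Tuple[int, int, int]] = []

    def visit(node: Node) -> int:
        # A leaf starts a new spine and is its own lexical head.
        if node.token is not None:
            spines[node.token] = [STAR]
            return node.token
        heads = [visit(child) for child in node.children]
        h = heads[node.head]
        # The constituent is a projection of h, so it extends h's spine.
        spines[h].append(node.label)
        p = len(spines[h])  # 1-based position of this node in the spine
        # Every non-head child attaches to h at position p.
        for k, d in enumerate(heads):
            if k != node.head:
                deps.append((h, d, p))
        return h

    visit(root)
    return spines, deps
```

For a tree such as (S (NP John) (VP saw (NP Mary))) with the VP as head of S and the verb as head of VP, this yields the spine `["*", "VP", "S"]` for the verb and attaches the subject at position 3 and the object at position 2.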

Arc-Eager Transition-Based Parsing
The arc-eager transition-based parser (Nivre, 2003) parses a sentence from left to right in linear time. It makes use of a stack that stores tokens that are already processed (partially built dependency structures) and it chooses the highest-scoring parsing action at each point. The arc-eager algorithm adds every arc at the earliest possible opportunity and it can only parse projective trees.
The parser is trained with an oracle, i.e. a sequence of transitions that leads to a parse for a given sentence (see Figure 2), and it learns to choose the best transition given a configuration. The SHIFT transition removes the first node from the buffer and puts it on the stack. The REDUCE transition removes the top node from the stack. The LEFT-ARC t transition introduces a labeled dependency edge between the first element of the buffer (the head) and the top element of the stack (the dependent) with the label t, and removes the top element from the stack (as in a reduce transition). The RIGHT-ARC t transition introduces a labeled dependency edge between the top element of the stack (the head) and the first element of the buffer (the dependent) with the label t, and it performs a shift transition. Each action can have constraints.
In this paper, we used the existing implementation of arc-eager parsing from ZPar 1 (Zhang and Clark, 2009), a beam-search parser implemented in C++ and focused on efficiency. ZPar gives competitive accuracies, yielding state-of-the-art results, and very fast parsing speeds for dependency parsing. In ZPar, the parsing process starts with a root node at the top of the stack (see Figure 3) and the buffer contains the words/tokens to be parsed.
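The four transitions can be illustrated with a small sketch. This is not ZPar's implementation: the `Config` class and `step` function are hypothetical names, labels are plain strings, token 0 is assumed to play the role of the root node that starts on the stack, and only the standard arc-eager preconditions are enforced.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

ROOT = 0  # assumed index of the artificial root token


@dataclass
class Config:
    stack: List[int]
    buffer: List[int]
    arcs: List[Tuple[int, int, Optional[str]]] = field(default_factory=list)

    def has_head(self, tok: int) -> bool:
        return any(d == tok for _, d, _ in self.arcs)


def step(c: Config, action: str, label: Optional[str] = None) -> Config:
    """Apply one arc-eager transition in place and return the configuration."""
    if action == "SHIFT":
        c.stack.append(c.buffer.pop(0))
    elif action == "REDUCE":
        # Only tokens that already have a head may be reduced.
        assert c.has_head(c.stack[-1])
        c.stack.pop()
    elif action == "LEFT-ARC":
        top = c.stack[-1]
        assert top != ROOT and not c.has_head(top)
        # Head is the first buffer element, dependent is the stack top.
        c.arcs.append((c.buffer[0], top, label))
        c.stack.pop()
    elif action == "RIGHT-ARC":
        # Head is the stack top, dependent is the first buffer element.
        c.arcs.append((c.stack[-1], c.buffer[0], label))
        c.stack.append(c.buffer.pop(0))
    else:
        raise ValueError(f"unknown action: {action}")
    return c
```

Parsing a three-token sentence with SHIFT, LEFT-ARC, RIGHT-ARC, RIGHT-ARC empties the buffer after four actions, which reflects the linear number of transitions per sentence.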

Transition-based Spinal Parsing
In this section we describe an arc-eager transition system that produces spinal trees. Figure 3 shows a parsing example. In essence, the strategy we propose builds the spine of a token piece by piece, adding a piece of spine each time the parser produces a dependency involving that token.
We first describe a labeling of dependencies that encodes a triplet of constituent labels and is the basis for defining an arc-eager statistical model. Then we describe a set of constraints that guarantees that the arc-eager derivations we produce correspond to spinal trees. Finally we discuss how to map arc-eager derivations to spinal trees.
1 http://sourceforge.net/projects/zpar/

Constituent Triplets
We follow Collins (1996) and define a labeling for dependencies based on constituent triplets.
Consider a spinal tree $(V, E)$ for a sentence $x_{1:n}$. A constituent triplet of a spinal dependency $(h, d, p) \in E$ is a tuple $\langle a, b, c \rangle$ where:
• $a \in N$ is the node at position $p$ of $\sigma_h$ (parent label)
• $b$ is the node at position $p-1$ of $\sigma_h$ (head label)
• $c$ is the node at the top of $\sigma_d$ (dependent label)
For example, a dependency labeled with $\langle S, VP, NP \rangle$ is a subject relation, while the triplet $\langle VP, \star, NP \rangle$ represents an object relation. Note that a constituent triplet, in essence, corresponds to a context-free production in a head-driven PCFG (i.e. $a \to b\, c$, where $b$ is the head child of $a$).
Figure 2: Arc-eager transition system with spinal constraints. $\Sigma$ represents the stack, $B$ represents the buffer, $A$ represents the set of arcs, $t$ represents a given triplet when its components are not relevant, $\langle a, b, c \rangle$ represents a given triplet when its components are relevant, and $i$, $j$ and $k$ represent tokens of the sentence. The constraints labeled with (1) ... (5) are described in Section 3.2. The constraints that are not labeled are standard constraints of the arc-eager parsing algorithm (Nivre, 2003).

In the literature, these triplets have been shown to provide very rich parameterizations of statistical models for parsing (Collins, 1996; Collins, 1999). For our purposes, we associate with each spinal dependency $(h, d, p) \in E$ a triplet dependency $(h, d, \langle a, b, c \rangle)$, where the triplet is defined as above. We then define a standard statistical model for arc-eager parsing that uses constituent triplets as dependency labels. An important advantage of this model is that left-arc and right-arc transitions can have feature descriptions that combine standard dependency features with phrase-structure information in the form of constituent triplets. As shown by , this rich set of features can obtain significant gains in parsing accuracy.
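Reading a triplet off a spinal dependency amounts to three lookups. The sketch below assumes that the head label b is the node just below a in the head spine and that the dependent label c is the top of the dependent's spine, as the properties listed in the constraints section suggest. The function name `triplet`, the dictionary of spines, and `"*"` as a stand-in for the terminal node are illustrative choices.

```python
STAR = "*"  # assumed ASCII stand-in for the special terminal node


def triplet(spines, dep):
    """Map a spinal dependency (h, d, p) to its triplet (a, b, c).

    spines: dict mapping a token index to its spine, bottom-up,
            with the terminal node at position 1 (list index 0).
    """
    h, d, p = dep
    head_spine = spines[h]
    if not (1 < p <= len(head_spine)):
        raise ValueError("p must satisfy 1 < p <= length of the head spine")
    a = head_spine[p - 1]   # node at position p (parent label)
    b = head_spine[p - 2]   # node at position p - 1 (head label)
    c = spines[d][-1]       # top of the dependent spine (dependent label)
    return (a, b, c)
```

With spines `["*", "VP", "S"]` for a verb and `["*", "NP"]` for its arguments, the subject attached at position 3 gets the label (S, VP, NP) and the object attached at position 2 gets (VP, *, NP), matching the two examples given in the text.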

Spinal Arc-Eager Constraints
We now describe constraints that guarantee that any derivation produced by a triplet-based arc-eager model corresponds to a spinal structure.
Let us make explicit some properties that relate a derivation $D$ with a token $i$, the arcs in $D$ involving $i$, and its spine $\sigma_i$:
• $D$ has at most a single arc $(h, i, \langle a, b, c \rangle)$ where $i$ is in the dependent position. The dependent label $c$ of this triplet defines the top of $\sigma_i$. If $c = \star$ then $\sigma_i = \star$, and $i$ cannot have dependents.
• Consider the subsequence of $D$ of left arcs with head $i$, of the form $(i, j, \langle a, b, c \rangle)$. In an arc-eager derivation this subsequence follows a head-outwards order. Each of these arcs has in its triplet a pair of contiguous nodes $b$−$a$ of $\sigma_i$. We call such pairs spinal edges. The subsequence of spinal edges is ordered bottom-up, because arcs appear head-outwards. In addition, sibling arcs may attach to the same position in $\sigma_i$. Thus, the subsequence of left spinal edges of $i$ in $D$ is a subsequence with repeats of the sequence of edges of $\sigma_i$.
• Analogously, the subsequence of right spinal edges of $i$ in $D$ is a subsequence with repeats of the sequence of edges of $\sigma_i$.
We constrain the arc-eager transition process such that these properties hold. Recall that a well-formed spine starts with a terminal node $\star$, and so does the first edge of the spine and only the first. Let $C$ be a configuration, i.e. a partial derivation. The constraints are:
(1) An arc $(h, i, \langle a, b, \star \rangle)$ is not valid if $i$ has dependents in $C$.
(2) An arc $(i, j, \langle a, b, c \rangle)$ is not valid if $C$ contains a dependency of the form $(h, i, \langle a', b', \star \rangle)$.
(3) A left arc $(i, j, \langle a, \star, c \rangle)$ is only valid if all sibling left arcs in $C$ are of the form $(i, j', \langle a, \star, c' \rangle)$.
(5) If $C$ has a left arc $(i, j, \langle a, \star, c \rangle)$, then a right arc $(i, j', \langle a', \star, c' \rangle)$ is not valid if $a \neq a'$.
In essence, constraints 1-2 relate the top of a spine with the existence of descendants, while constraints 3-5 enforce that the bottom of the spine is well formed. We enforce no further constraints looking at edges in the middle of the spine. This means that left and right arc operations can add spinal edges in a free manner, without explicitly encoding how these edges relate to each other. In other words, we rely on the statistical model to correctly build a spine by adding left and right spinal edges along the transition process in a bottom-up fashion.
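The listed constraints can be read as a validity test applied to a candidate arc before it is added to the configuration. The following sketch is an interpretation, not the modified ZPar code: the function name `is_valid`, the arc layout `(head, dependent, (a, b, c))`, and `"*"` as a stand-in for the terminal node are assumptions. Constraint (5) is read as requiring the same parent label on the bottom edge of both sides. Constraint (4) is not stated in the text as given here, so it is left out.

```python
STAR = "*"  # assumed ASCII stand-in for the special terminal node


def is_valid(arc, arcs):
    """Return True if `arc` may be added to the partial derivation `arcs`.

    Each arc is (head, dependent, (a, b, c)) with token indices for
    head and dependent and a constituent triplet as the label.
    """
    head, dep, (a, b, c) = arc

    # (1) A dependent whose spine is just the terminal node cannot
    #     already have dependents of its own.
    if c == STAR and any(h == dep for h, _, _ in arcs):
        return False

    # (2) A head whose own top is the terminal node cannot take dependents.
    if any(d == head and t[2] == STAR for _, d, t in arcs):
        return False

    left_sibs = [t for h, d, t in arcs if h == head and d < head]
    if b == STAR:
        if dep < head:
            # (3) A bottom left edge requires all earlier left siblings
            #     to use the same bottom edge.
            if any(t[0] != a or t[1] != STAR for t in left_sibs):
                return False
        else:
            # (5) A bottom right edge must agree with any bottom left edge.
            if any(t[1] == STAR and t[0] != a for t in left_sibs):
                return False
    return True
```

In a beam-search parser, a check of this kind would simply filter the candidate actions at each step, so that the statistical model only scores arcs that keep the derivation well formed.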
It is easy to see that these constraints do not prevent the transition process from ending. Specifically, even though the constraints invalidate arc operations, the arc-eager process can always finish by leaving tokens in the buffer without any head assigned, in which case the resulting derivation is a forest of several projective trees.

Mapping Derivations to Spinal Trees
The constrained arc-eager derivations correspond to spinal structures, but not necessarily to single spinal trees, for two reasons. First, from the derivation we can extract two subsequences of left and right spinal edges, but the derivation does not encode how these sequences should merge into a spine. Second, as in the basic arc-eager process, the derivation might be a forest rather than a single tree. Next we describe processes to turn a spinal arc-eager derivation into a tree.
Forming spines. For each token $i$ we start from the top of the spine $t$, a sequence $L$ of left spinal edges, and a sequence $R$ of right spinal edges. The goal is to form a spine $\sigma_i$ such that its top is $t$, and such that $L$ and $R$ are subsequences with repeats of the edges of $\sigma_i$. We look for the shortest spine satisfying these properties. For example, consider the derivation in Figure 3
Rooting Forests. The arc-eager transition system is not guaranteed to generate a single root in a derivation (though see Nivre and Fernández-González (2014) for a solution). Thus, after mapping a derivation to a spinal structure, we might get a forest of projective spinal trees. In this case, to produce a constituent tree from the spinal forest, we promote the last tree and place the rest of the trees as children of its top node.
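The spine-forming step described above can be phrased as a shortest-path search. The sketch below is one possible realization and not necessarily the procedure the authors used: it runs a breadth-first search over states (edges of L matched, edges of R matched, last spine node), appending one node at a time. The names `form_spine` and `dedupe`, and `"*"` as a stand-in for the terminal node, are assumptions.

```python
from collections import deque

STAR = "*"  # assumed ASCII stand-in for the special terminal node


def dedupe(edges):
    """Collapse consecutive repeats, since siblings may share a spinal edge."""
    out = []
    for e in edges:
        if not out or out[-1] != e:
            out.append(e)
    return out


def form_spine(top, left_edges, right_edges):
    """Shortest spine (bottom-up list of labels) starting at STAR and ending
    at `top` whose edge sequence contains both edge lists as subsequences.
    Edges are (lower, upper) pairs. Returns None if no such spine exists."""
    L, R = dedupe(left_edges), dedupe(right_edges)
    labels = sorted(({x for e in L + R for x in e} | {top}) - {STAR})
    start = (0, 0, STAR)
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        i, j, last = state
        if i == len(L) and j == len(R) and last == top:
            spine = []
            while state is not None:
                spine.append(state[2])
                state = parent[state]
            return spine[::-1]
        for x in labels:
            edge = (last, x)
            # Appending x creates the edge (last, x); consume it if it is
            # the next unmatched edge on either side.
            ni = i + 1 if i < len(L) and L[i] == edge else i
            nj = j + 1 if j < len(R) and R[j] == edge else j
            nxt = (ni, nj, x)
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None
```

For a verb with a left spinal edge VP−S (from its subject) and a right spinal edge *−VP (from its object), the search merges the two sides into the spine *, VP, S. Because each step appends exactly one node, the first goal state reached by the search corresponds to a shortest spine.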

Experiments
In this section we describe the performance of the transition-based spinal parser by running it with different sizes of the beam and by comparing it with the state of the art. We used the ZPar implementation modified to incorporate the constraints for spinal arc-eager parsing. We used the exact same features as Zhang and Nivre (2011), which extract a rich set of features that encode higher-order interactions between the current action and elements of the stack. Since our dependency labels are constituent triplets, these features encode a mix of constituent and dependency structure.
2 However, this is not always the case. For example, in the Penn Treebank adjuncts create an additional constituent level in the verb-phrase structure, and this can result in a series of contiguous VP spinal nodes. The effect of flattening such structures is mild, see below.
3 These limitations have relatively mild effects on recovering constituent trees in the style of the Penn Treebank. To measure the effect, we took the correct spinal trees of the development section and mapped them to the corresponding arc-eager derivation. Then we mapped the derivation back to a spinal tree using this process and recovered the constituent tree. This process obtained 98.4% bracketing recall, 99.5% bracketing precision, and an F1 measure of 99.0%.

Data
We use the WSJ portion of the Penn Treebank, augmented with head-dependent information using the rules of Yamada and Matsumoto (2003). This results in a total of 974 different constituent triplets, which we use as dependency labels in the spinal arc-eager model. We use predicted part-of-speech tags.

Results in the Development Set
Table 1 shows the results of our parser on the development set. For the dependency trees, the table reports unlabeled attachment score (UAS), triplet accuracy (TA, the counterpart of label accuracy, LA), triplet attachment score (TAS), and spinal accuracy (SA, the percentage of complete spines that the parser predicts correctly). In order to be fully comparable, for the dependency-based metrics we report results including and excluding punctuation symbols for evaluation. The table also shows the speed (sentences per second) on standard hardware. We trained the parser with different beam values, ran a number of iterations until the model converged, and report the results of the best iteration.
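The dependency-based metrics can be computed per token. The sketch below assumes the usual per-token definitions of attachment and label accuracy, with triplets taking the place of dependency labels. The function name `attachment_scores` and the `skip` argument (a set of token positions to exclude, e.g. punctuation) are hypothetical.

```python
def attachment_scores(gold, pred, skip=()):
    """Return (UAS, TA, TAS) as percentages.

    gold, pred: equal-length lists of (head, triplet) pairs, one per token.
    skip:       token positions excluded from evaluation.
    """
    idx = [i for i in range(len(gold)) if i not in skip]
    n = len(idx)
    # UAS: the predicted head is correct.
    uas = sum(gold[i][0] == pred[i][0] for i in idx) / n
    # TA: the predicted triplet is correct, regardless of the head.
    ta = sum(gold[i][1] == pred[i][1] for i in idx) / n
    # TAS: both head and triplet are correct.
    tas = sum(gold[i] == pred[i] for i in idx) / n
    return 100 * uas, 100 * ta, 100 * tas
```

Passing the positions of punctuation tokens in `skip` gives the scores excluding punctuation, and an empty `skip` gives the scores including it.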
As can be observed, the best model is the one trained with beam size 64, and larger beam sizes help to improve the results. Nonetheless, a larger beam also makes the parser slower. This result is expected since the number of dependency labels, i.e. triplets, is 974, so a larger beam allows the parser to test more of them when new actions are included in the agenda. This model already provides high results, over 92.34% UAS, and it can also predict most of the triplets that label the dependency arcs (91.45 TA and 89.84 TAS, including punctuation symbols for evaluation). Table 1 also shows the results of the parser on the development set after transforming the dependency trees by following the method described in Section 3. The result even surpasses 89.5% F1, which is a competitive accuracy. As we can see, the parser also provides a good trade-off between parsing speed and accuracy.
In order to test whether the number of dependency labels is an issue for the parser, we also trained a model on dependency trees labeled with the rules of Yamada and Matsumoto (2003), and the results are comparable to ours. For a beam of size 64, the best model with dependency labels provides 92.3% UAS for the development set including punctuation and 93.0% excluding punctuation, while our spinal parser for the same beam size provides 92.3% UAS including punctuation and 93.1% excluding punctuation. This means that the beam-search arc-eager parser is capable of coping with the dependency triplets, since it even provides slightly better results for unlabeled attachment scores. However, unlike , the arc-eager parser does not substantially benefit from using the triplets during training.
Table 1: UAS with predicted part-of-speech tags for the development set, including and excluding punctuation symbols. Constituent results for the development set. Parsing speed in sentences per second (an estimate that varies depending on the machine). TA and TAS refer to label accuracy and labeled attachment score where the labels are the different constituent triplets described in Section 3. SA is the spinal accuracy.

Final Results and State-of-the-art Comparison
Our best model ( Table 3 compares our model with other constituent parsers, including shift-reduce parsers like ours. Our best model is competitive compared with the rest.

Related Work
Collins (1996) defined a statistical model for dependency parsing based on using constituent triplets in the labels, which forms the basis of our arc-eager model. In that work, a chart-based algorithm was used for parsing, while here we use greedy transition-based parsing.
   was the first to use spinal representations to define an arc-factored dependency parsing model based on the Eisner algorithm, which parses in cubic time. Our work can be seen as the transition-based counterpart of that, with a greedy parsing strategy that runs in linear time. Because of the extra complexity of spinal structures, they used three probabilistic non-spinal dependency models to prune the search space of the spinal model. In our work, we show that a single arc-eager model can obtain very competitive results, even though the accuracies of our model are lower than theirs.
In terms of parsing spinal structures, Rush et al. (2010) introduced a dual decomposition method that uses constituent and dependency parsing routines to parse a combined spinal structure.
In a similar style to our method, Hall and Nivre (2008) and Hall (2008) introduced an approach for parsing Swedish and German, in which MaltParser is used to predict dependency trees whose dependency labels are enriched with constituency labels. They used tuples that encode dependency labels, constituent labels, head relations and the attachment. The last step is to make the inverse transformation from a dependency graph to a constituent structure.
Recently, Kong et al. (2015) proposed a structured prediction model for mapping dependency trees to constituent trees, using the CKY algorithm. They assume a fixed dependency tree used as a hard constraint. Also recently, Fernández-González and Martins (2015) proposed an arc-factored dependency model for constituent parsing. In that work dependency labels encode the constituent node where the dependency arises as well as the position index of that node in the head spine. In contrast, we use constituent triplets as dependency labels.
Our method is based on constraining a shift-reduce parser using the arc-eager strategy. Nivre (2003) and Nivre (2004) established the basis for the arc-eager and arc-standard parsing algorithms, which are central to most recent transition-based parsers (Zhang and Clark, 2011b; Zhang and Nivre, 2011; Bohnet and Nivre, 2012). These parsers are very fast, because the number of parsing actions is linear in the length of the sentence, and they obtain state-of-the-art performance, as shown in Section 4.3.
For shift-reduce constituent parsing, Sagae and Lavie (2005; presented a shift-reduce phrase structure parser. The main difference to ours is that their models do not use lexical dependencies. Zhang and Clark (2011a) presented a shift-reduce parser based on CCG, and as such is lexicalized. Both spinal and CCG representations are very expressive. One difference is that spinal trees can be directly obtained from constituent treebanks with head-child information, while CCG derivations are harder to obtain.
More recently, Zhang and Clark (2009) and the subsequent work of Zhu et al. (2013) described beam-search shift-reduce parsers obtaining very high results. These models use dependency information via stacking, by running a dependency parser as a preprocessing step. In the literature, stacking is a common technique to improve accuracies by combining dependency and constituent information, in both directions (Wang and Zong, 2011; Farkas and Bohnet, 2012). Our model differs from stacking approaches in that it natively produces the two structures jointly, in such a way that a rich set of features is available.

Conclusions and Future Work
There are several lessons to learn from this paper. First, we show that a simple modification to the arc-eager strategy results in a competitive greedy spinal parser which is capable of predicting dependency and constituent structure jointly. In order to make it work, we introduce simple constraints to the arc-eager strategy that ensure well-formed spinal derivations. Second, by doing this, we provide a good trade-off between speed and accuracy, while at the same time providing a dependency structure that can be useful for downstream applications. Even though the dependency model needs to cope with a large number of dependency labels (in the form of constituent triplets), the unlabeled attachment accuracy does not drop and the labeling accuracy (for the triplets) is good enough to obtain a good phrase-structure parse. Overall, our work shows that greedy strategies for dependency parsing can be successfully augmented to include constituent structure.
In the future, we plan to explore spinal derivations in new transition-based dependency parsers (Chen and Manning, 2014; Dyer et al., 2015; Weiss et al., 2015; Zhou et al., 2015). This would allow us to explore spinal derivations in new ways and to test their potential.