Stacking or Supertagging for Dependency Parsing – What’s the Difference?

Supertagging was recently proposed to provide syntactic features for statistical dependency parsing, contrary to its traditional use as a disambiguation step. We conduct a broad range of controlled experiments to compare this specific application of supertagging with another method for providing syntactic features, namely stacking. We find that in this context supertagging is a form of stacking. We furthermore show that (i) a fast parser and a sequence labeler are equally beneficial in supertagging, (ii) supertagging/stacking also improve parsing in a cross-domain setting, and (iii) there are small gains when combining supertagging and stacking, but only if both methods use different tools. The important consideration is therefore not the method but rather the diversity of the tools involved.


Introduction
We present a systematic comparison of two methods that have been proposed to improve statistical dependency parsers: supertagging and stacking.
Supertags are labels for tokens, much like POS tags, but they also encode syntactic information, e.g., the head direction or the subcategorization frame. Supertagging was developed for deep grammar formalisms as the disambiguation of supertag assignment prior to parsing (Bangalore and Joshi, 1999; Clark and Curran, 2004; Ninomiya et al., 2006). Recently, it was presented as a method to provide syntactic information to the feature model of a statistical dependency parser: Ambati et al. (2013) provide CCG supertags to a dependency parser, whereas Ouchi et al. (2014) extract their supertag tag set from a dependency treebank (see Figure 1).

[Figure 1: Supertags derived from dependency trees for each token. For the tree over "John loves Mary" (arcs: subj, root, obj), the supertags are subj/R for John, root+L_R for loves, and obj/L for Mary. They encode the label, the head direction, and the presence of left and right dependents.]

In this paper, we adopt
this particular definition and take supertagging as a way of incorporating syntactic features instead of the traditional use for disambiguation.

Parser stacking was introduced by Nivre and McDonald (2008) and Martins et al. (2008). In stacking, two parsers are run in sequence so that the second parser can use the output of the first parser as features, for example, whether a particular arc was already predicted by the first parser.
When Joshi and Bangalore (1994) first proposed supertags, they called supertagging almost parsing, because supertags anticipate much of the syntactic disambiguation. In stacking, the first step is running a parser, or in other words: real parsing. In this paper, we investigate the difference between almost and real parsing for improving a statistical dependency parser.
We conduct an extensive number of comparative experiments with two state-of-the-art dependency parsers and a state-of-the-art sequence labeler on 10 different data sets. In the first set of experiments, we use only the two parsers and compare both methods in artificial and realistic settings. In the second set of experiments, we control for the methods and compare different ways of realizing them. In the last set, we evaluate the benefit of combining both methods.
Intuitively, stacking should give higher improvements than the version of supertagging defined by Ouchi et al. (2014), since trees in stacking are more informative than supertag sequences in supertagging. However, our experiments show that both methods perform on par. Based on an in-depth analysis of these findings, we argue that supertagging is a form of stacking.
One apparent advantage of supertagging is the fact that one can predict supertags without a parser and thus possibly faster. However, greedy transition-based parsers are extremely fast as well. We show that the output of a CRF sequence labeler and a greedy transition-based parser are of equal usefulness when used in supertagging. This setup suggests application to large-scale (e.g. web) data. We test both methods on the English Web Treebank (Bies et al., 2012) and show that they also improve parsing in a cross-domain setting.
Our experiments on combining supertagging and stacking show small gains only when supertags and trees are predicted by different tools. Surdeanu and Manning (2010) demonstrate that diversity of algorithms is important when stacking parsers. Since supertagging is a form of stacking, this also holds for supertagging, and we argue that this is a more important factor than the choice between the two methods.
We give background on supertagging and stacking in Section 2 and describe our experimental setup in Section 3. We present our experiments in Sections 4 to 6 and conclude with Section 7.

Background
The term supertag originated with Joshi and Bangalore (1994) as an elementary structure associated with a lexical item. These elementary structures carry more information than POS tags, hence the name super POS tags, or supertags. Within Lexicalized Tree Adjoining Grammar (LTAG) (Schabes et al., 1988), supertags correspond to trees that localize dependencies. A supertagger assigns supertags to each word of a sentence, and a parser then combines these structures into a full parse (Bangalore and Joshi, 1999), which simplifies and speeds up parsing. The same approach applied to Combinatory Categorial Grammar (CCG) (Clark and Curran, 2004) and Head-Driven Phrase Structure Grammar (HPSG) (Ninomiya et al., 2006) speeds up the respective parsers dramatically.

Foth et al. (2006) were the first to utilize supertags in a dependency parsing context by incorporating them as soft constraints into their rule-based parser (Foth et al., 2004). In LTAG, CCG, or HPSG, supertags are the elementary components of the framework in question; in Foth et al. (2006), they are specifically designed to capture syntactic properties. Ouchi et al. (2014) use supertags as features in a statistical dependency parser for English, while Ambati et al. (2014) instead utilize CCG categories for English and Hindi. Both demonstrate significant improvements. Björkelund et al. (2014) extend the positive results to nine other languages.
Another way of exploiting one parser's output as features in another parser is stacking. Nivre and McDonald (2008) define a simple set of local features that mark whether an arc is present in the input tree. They demonstrate that stacking parsers leads to higher parsing accuracy than a non-stacked baseline. Martins et al. (2008) extend this feature set to include non-local information, e.g., information about siblings and grandparents of dependents. However, the additional non-local features provide only minor further gains over the local ones if the parser itself already uses non-local features. Surdeanu and Manning (2010) present a study on parser stacking for English. They find that one important factor is the diversity of the parsing algorithms involved; specifically, stacking a parser on itself does not lead to gains. This effect was also observed by Martins et al. (2008).

Data Sets and Preprocessing
We perform experiments on the data from the SPMRL 2014 Shared Task (Seddah et al., 2014), which consists of data sets for 9 languages (see Table 1). To these 9, we add the English Penn Treebank converted to Stanford Dependencies. We use sections 2-21 for training, section 24 as the development set, and section 23 as the test set.
Contrary to most previous work, we use automatically predicted preprocessing in all the parsing experiments. POS tags and morphological features are jointly assigned using MarMoT (Müller et al., 2013), a state-of-the-art morphological CRF tagger. To improve tagging accuracy, we integrate the analyses of language-specific morphological analyzers as additional features into MarMoT (see Table 1). We use the mate-tools for lemmatization.

[Table 1: Analyzers used in the tagger.]

Foth et al. (2006) experiment with different tag set designs and show that richer supertags improve their parser's accuracy more. However, richer tags increase the tag set size considerably and make the tags more difficult to predict automatically. Ouchi et al. (2014) test two models for English. Model 1 includes the relative head position of a word (hdir), its dependency relation (label), and information about dependents to the left or right (hasLdep, hasRdep). The tag set is derived from the treebank; an example is shown in Figure 1. Model 2 additionally uses the dependency relations of obligatory dependents of verbs. The difference between the two models has no impact on the performance of a parser, however, and Björkelund et al. (2014) confirm this on other languages. Based on these results, we decided to use Model 1 in all of the experiments. The supertags are extracted from the respective training sets and follow the template label/hdir+hasLdep hasRdep. Table 2 gives the tag set sizes for each data set.
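To make the extraction concrete, the following is a minimal sketch of deriving Model 1 supertags from a dependency tree. The input format (1-based head indices with 0 for the root) and the exact tag strings (in particular how the root's missing head direction is rendered) are illustrative assumptions; the paper only specifies the template label/hdir+hasLdep hasRdep.

```python
def extract_supertags(heads, labels):
    """Derive Model 1 supertags from a dependency tree.

    heads[i] is the 1-based head index of token i+1 (0 = root);
    labels[i] is its dependency relation.
    """
    n = len(heads)
    has_ldep = [False] * (n + 2)
    has_rdep = [False] * (n + 2)
    for dep, head in enumerate(heads, start=1):
        if head > dep:
            has_ldep[head] = True   # dependent sits left of its head
        elif 0 < head < dep:
            has_rdep[head] = True   # dependent sits right of its head
    tags = []
    for dep, (head, label) in enumerate(zip(heads, labels), start=1):
        # Head direction: R if the head is to the right, L if to the left;
        # omitted for the root (an assumption about the tag format).
        hdir = "" if head == 0 else ("/R" if head > dep else "/L")
        if has_ldep[dep] and has_rdep[dep]:
            deps = "+L_R"
        elif has_ldep[dep]:
            deps = "+L"
        elif has_rdep[dep]:
            deps = "+R"
        else:
            deps = ""
        tags.append(label + hdir + deps)
    return tags
```

On the Figure 1 sentence "John loves Mary" (heads [2, 0, 2], labels subj/root/obj) this yields subj/R, root+L_R, and obj/L.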

Notation
We denote stacking and supertagging by STACK and STAG, respectively. When a tool y uses the output of another tool x, we mark this by superscript and subscript. For example, STACK y x means that tool y uses the output of tool x in stacking. Similarly, STAG y x means that tool y uses the supertags predicted by tool x. We follow Martins et al. (2008)'s terminology and call x the Level 0 tool and y the Level 1 tool.

Parsers and Feature Models
In the experiments, Level 1 tools will always be dependency parsers since we are interested in the effect of supertagging and stacking on parsing performance. We experiment with one graph-based and one transition-based parser to cover the two major paradigms in dependency parsing.
We extend the parsers' baseline feature sets in two directions: (1) to extract features for stacking, i.e., to extract features from a provided dependency tree, and (2) to extract features from a sequence of supertags. For stacking, features are taken from Nivre and McDonald (2008) and slightly adapted to our setting. For supertagging, we mirror the features from stacking to the best extent possible given the more limited information that is contained in the supertags to begin with.
We note that feature engineering can be done more elaborately both for stacking (Martins et al., 2008) and supertagging (Ouchi et al., 2014). However, since not all types of features that can be extracted necessarily carry over from one method to the other, a simpler feature set is more useful for a comparison. Moreover, both of the aforementioned papers only demonstrate minor performance gains with more elaborate features.
Transition-based Parser. We use the parser by Björkelund and Nivre (2015) as our transition-based parser. It uses the arc-standard decoding algorithm extended with a SWAP transition (Nivre, 2009) to handle non-projective structures. The system applies arc transitions between the two topmost items of the stack (denoted s0 and s1). A lazy SWAP oracle is used during training. The parser is globally trained using beam search and early update (Zhang and Clark, 2008). The implementation uses the passive-aggressive perceptron (Crammer et al., 2006) and a hash kernel for feature mapping, following Bohnet (2010). The parser is trained for 25 iterations with a beam size of 20. We omit the definition of the baseline feature set of the transition-based parser; it is primarily based on that of Zhang and Nivre (2011), with adaptations to the arc-standard setting.

Table 3 outlines the feature templates used for stacking and supertagging.

[Table 3: Feature templates used for stacking and supertagging in the transition-based parser. ⊕ denotes conjunctions of basic templates. All templates are conjoined with the POS tags of the topmost stack items s0 and s1.]

The predicates in these templates extract, for a token d, the head of d, the direction of d's head, the arc label of d, whether d has left/right dependents, and (via stag_X(d)) the supertag according to the Level 0 prediction X. X is either a dependency graph (in stacking) or a supertag assignment (in supertagging), denoted G and S in Table 3, respectively. The predicates l/rdep(d) extract the leftmost/rightmost dependent of d given the current parser state, and pos(d) extracts the POS tag of d, with a special placeholder if d is undefined.
The stacking features are mostly taken from Nivre and McDonald (2008), with the exception of the last two rows. These features encode whether d should have left/right dependents according to G, conjoined with whether d has left/right dependents in the current configuration. We added these features because the existence of left/right dependents is also encoded in the supertags. Conjoining the existence of left/right dependents according to the Level 0 predictions with the POS tag of the left/right dependents in the current parser state thus encodes whether dependents have been attached yet. Since the arc-standard algorithm works bottom-up, every token needs to collect all its dependents before it can be attached to its own head.
The supertag features mimic the information provided by stacking. For instance, in stacking the Level 0 predictions explicitly include whether s0 is the head of s1; in supertagging this is approximated by combining the direction of the head of s1 with whether s0 expects dependents on the left.

[Table 4: Feature templates used for stacking and supertagging in the graph-based parser. ⊕ denotes conjunctions of basic templates. All templates are conjoined with the direction of the arc and with the POS tags of the head and the dependent.]
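The correspondence between an exact stacking feature and its supertag approximation can be sketched as follows. The feature strings, the 1-based token indexing, and the tag format are illustrative assumptions, not the parser's actual templates; s1 is taken to precede s0 in the sentence, as in arc-standard parsing.

```python
def stack_head_feature(g_heads, s0, s1):
    # Exact check: the Level 0 tree G says whether s0 is the head of s1.
    # g_heads[i] is the predicted 1-based head of token i+1.
    return "G:s0_heads_s1=" + str(g_heads[s1 - 1] == s0)

def stag_head_feature(stags, s0, s1):
    # Approximate check from Level 0 supertags (label/hdir+deps format):
    # s1's supertag predicts a head to its right, and s0's supertag
    # expects dependents on its left.
    head_right = "/R" in stags[s1 - 1]
    expects_left = "+L" in stags[s0 - 1]
    return "S:s0_may_head_s1=" + str(head_right and expects_left)
```

On the Figure 1 example the two agree: with John on the stack below loves, the tree feature fires exactly and the supertag feature fires approximately, because John's tag subj/R points right and loves' tag expects a left dependent.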
Graph-based Parser. The graph-based parser we use is TurboParser, which solves the parsing task by global inference using a dual decomposition algorithm and natively outputs non-projective structures (Martins et al., 2013). Table 4 shows the stacking and supertagging features as we implemented them in TurboParser. They are synchronized with the features for the transition-based parser where possible. We extract these features only on first-order factors, with d and h denoting the dependent and the head, respectively. Unlike in Nivre and McDonald (2008), the features cannot access the label of the current arc during feature extraction, as labels are automatically conjoined with the features after extraction.
Like in the transition-based parser, supertag and stacking features are modeled to capture similar information. However, features that combine information about dependents of dependents with information about the head are not included since these would require higher-order factors.

Evaluation
We evaluate the parsing experiments using Labeled Attachment Score (LAS). We mark statistical significance against the respective baselines by † and ‡, denoting p < 0.05 and p < 0.01, respectively. Significance testing is carried out using the Wilcoxon signed-rank test. Averages and oracle experiments are not tested for significance.
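The statistic behind these significance marks can be sketched in a few lines. This is a minimal pure-Python illustration on toy paired scores; the sampling unit used in the paper and any off-the-shelf implementation are not specified here, and a p-value would additionally require the statistic's null distribution.

```python
def wilcoxon_signed_rank(xs, ys):
    """Return (W+, W-, n) for paired samples, dropping zero differences.

    Ties among |differences| receive average ranks. The test statistic
    is min(W+, W-); small values indicate a significant difference.
    """
    diffs = [y - x for x, y in zip(xs, ys) if y != x]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank for the tie group
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus, len(diffs)
```

Comparing toy baseline scores against toy system scores then reduces to checking whether min(W+, W-) falls below the critical value for n pairs at the chosen significance level.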

Comparing Supertagging and Stacking
The purpose of the following experiments is to compare supertagging and stacking and to derive some conclusions about their relationship to each other. We use one parser as the Level 0 parser and the other one as Level 1 parser. In stacking, the Level 1 parser exploits the tree produced by the Level 0 parser as additional features. In supertagging, we derive the supertag of each token from the tree that is output by the Level 0 parser. The Level 1 parser then uses these supertags as additional features. Although supertags are normally predicted with sequence labelers, using a parser on Level 0 in both cases ensures that the only difference between the two settings is the means by which the information is given to the Level 1 parser, i.e. as a tree or as a sequence of supertags. Figure 2 illustrates this setup. The training sets are annotated with predicted dependency trees or supertags via 5-fold jackknifing. In the tables, GB stands for the graph-based parser and TB for the transition-based parser.
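The jackknifing loop described above can be sketched as follows. `train_level0` and `annotate` are hypothetical stand-ins for training and applying the Level 0 tool (a parser or a sequence labeler), and the round-robin fold split is one simple choice; the point is that each fold is annotated by a model that never saw it, so the Level 1 parser trains on realistically noisy Level 0 input.

```python
def jackknife(sentences, train_level0, annotate, k=5):
    """Annotate the training data with k-fold jackknifed Level 0 output."""
    folds = [sentences[i::k] for i in range(k)]
    annotated = []
    for i, heldout in enumerate(folds):
        # Train on the other k-1 folds, then annotate the held-out fold.
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_level0(rest)
        annotated.extend(annotate(model, heldout))
    return annotated
```

At test time, by contrast, the Level 0 tool trained on the full training set annotates the evaluation data directly.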

Supertagging and Stacking Accuracy
First of all we convince ourselves that both strategies, supertagging and stacking, indeed improve over the baseline. Table 5 gives the performance of the Level 1 parser on the test sets: In the baseline setting (BL) the parser is run without any additional information. STAG and STACK show the performance of the Level 1 parser when provided with supertags or a tree from the Level 0 parser.
As demonstrated by previous work, both stacking and supertagging consistently improve the parsing performance of the Level 1 parser. Moreover, both methods improve parsing accuracy to the same extent, with average improvements of about 0.7% points absolute for both the graph-based and the transition-based parser. Almost all of the improvements are statistically significant, with a few exceptions, most notably Polish. For supertagging, our results confirm the findings of Ouchi et al. (2014) and Ambati et al. (2014); the stacking results are in line with Nivre and McDonald (2008) and Martins et al. (2008). It is worth noting that even though dependency parsers have markedly advanced since 2008, stacking parsers still improves performance. We now continue with a more in-depth analysis to find out where the improvements come from. We perform the analysis on the development sets in order not to compromise our test sets. The corresponding accuracies for the development sets can be found in rows 1 to 3 of Tables 6 and 7.

In-Depth Analysis
The overall improvements with supertagging and stacking are similar, but they might still come about in different ways. To investigate this, we follow McDonald and Nivre (2007) and look into accuracy distributions of comparable systems relative to sentence length and dependency length, i.e. the distance between the dependent and the head.
We present the analysis on the concatenation of all the development sets. We also examined the corresponding plots for the individual treebanks: while the absolute numbers vary across the data sets, the relative differences between the baseline, supertagging, and stacking models are consistent with the concatenation. Figure 3 gives the accuracy of both parsers relative to sentence length in bins of size 10; bin sizes are represented as grey bars. Figure 4 displays the graph-based parser's performance relative to dependency length in terms of precision and recall, where precision is the percentage of correct arcs among all predicted arcs of a given length and recall is the percentage of gold arcs of a given length that were correctly predicted. In both figures, the stacked and supertagged systems show a consistent improvement over the baseline. Moreover, the curves of the stacked and supertagged systems are mostly parallel and close to each other.
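The binning behind the sentence-length plot is straightforward; below is a minimal sketch with an assumed per-sentence input of (sentence length, number of correctly attached and labeled tokens) and bins of size 10 as in Figure 3.

```python
def las_by_length_bin(sentences, bin_size=10):
    """Aggregate per-sentence arc counts into LAS per sentence-length bin."""
    correct, total = {}, {}
    for length, n_correct in sentences:
        b = (length - 1) // bin_size  # bin 0 covers lengths 1-10, etc.
        correct[b] = correct.get(b, 0) + n_correct
        total[b] = total.get(b, 0) + length  # one arc per token in LAS
    return {b: 100.0 * correct[b] / total[b] for b in total}
```

The dependency-length breakdown works the same way, except that arcs are grouped by the distance between dependent and head, and precision and recall are computed over predicted and gold arcs, respectively.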
Supertagging and stacking thus do not just appear similar at the macro level in terms of LAS. The analysis shows that their contributions are also very similar when broken down by dependency or sentence length and the improvements are not restricted to sentences or arcs of particular lengths. We therefore conclude that both methods are indeed doing the same thing.

Oracle Experiments
In order to assess the potential utility of supertags we provided the parsers with gold supertags. We expect the gold supertags to give a considerable boost to accuracy as they encode correct syntactic information. Intuitively, we would expect the corresponding stacking experiment (providing gold trees) to reach 100% accuracy since the parser receives the full solution as features. However, this assumption turns out not to hold.
Row 4 in Tables 6 and 7 shows the results for the supertag experiments. Comparing row 4 with row 2, we find big jumps in performance (between 7 and 20% absolute). For German, English, and Polish, performance even goes up to 97-98%. These huge jumps are due to the amount of syntactic information encoded in the supertags, which is much higher than in POS tags, for example.
Row 5 in Tables 6 and 7 shows the results for the stacking experiments. Surprisingly, stacking with gold dependency trees does not reach 100% accuracy. Moreover, comparing rows 4 and 5 we find that on average supertagging and stacking improve performance of a parser to the same extent.
The fact that gold supertags do not yield maximum accuracy is not so surprising, since a supertag sequence does not encode the full dependency tree but merely indicates the direction of heads and dependents. However, it is puzzling that stacking with gold trees does not lead to perfect parsing results. In the case of the transition-based parser, the reason might be that the parser does not do exhaustive search but uses beam search to explore only a fraction of the search space; that is, the gold solution can get pruned early enough that the parser never considers it. For the graph-based parser this result is more unexpected, since this parser does exact search. We currently do not have an explanation for this, but we hypothesize that the lack of regularization during training might put enough weight on the regular features that they can override the few stacking features that convey the correct solution.

Self-Application
Rows 6 and 7 in Tables 6 and 7 show experiments where we use the same parser at Level 0 and Level 1. We know from Martins et al. (2008) that self-application, i.e. stacking a parser on its own output, leads to at most tiny improvements, especially compared to a setting with different parsers. Our results corroborate these findings. More interestingly, we find a similar effect for supertagging. 10 This effect demonstrates that it is important that Level 0 and Level 1 use different ways of modeling the data in order to benefit from the combination (cf. Surdeanu and Manning (2010)).

Supertagging Without Parsers
One potential advantage of supertagging over stacking is the fact that one can predict supertags without a parser. Most previous work predicts supertags using classifiers or sequence models, which is the standard for tagging problems. As tagging is commonly considered an "easier" task than parsing, one could assume that supertags can be predicted very efficiently using standard sequence labeling algorithms. But sequence labelers would not be able to predict the dependency tree in a stacking setup.
The two parsers that we use in the experiments are indeed unlikely to outperform standard sequence labelers in terms of speed. However, greedy arc-standard parsers are very fast. In the next experiment, we therefore compare a greedy arc-standard parser, which is the transition-based parser without beam search, with MarMoT (see Section 3.1). We follow Ouchi et al. (2014) in adding POS tags and morphological information to the feature model of the sequence labeler.
The purpose of this experiment is twofold. So far, we predicted supertags by predicting a tree first and then deriving the supertags from that tree. Now we test how our previous results compare to supertags predicted by a sequence labeler, which is the common way of predicting supertags. Furthermore, we want to see how supertagging with a sequence labeler compares to supertagging and stacking with a parser that is equally efficient.

Table 8 gives the results of the experiment. We denote the sequence labeler by SL and the greedy parser by GTB. Rows 2 and 6 show that, on average, parsing performance is not harmed by predicting supertags with the sequence labeler instead of one of the parsers (compare to row 2 in Tables 6 and 7). Whether the sequence labeler is more useful than one of the parsers depends on the individual data set; the supertags predicted by the sequence labeler improve parsing performance to a similar extent.

10 Note that most of the improvements in STAG GB GB are actually statistically significant. However, the difference to BL GB is considerably smaller than in the predicted setting in row 2 (avg. difference is 0.28% vs. 0.65% points absolute).
The experiments with the greedy parser yield different results for the graph-based and the transition-based parser on Level 1. When the graph-based parser acts as Level 1, the greedy parser is slightly behind the sequence labeler. This holds both for the supertagging and the stacking experiments (compare row 2 to rows 3 and 4), which again suggests that supertagging and stacking are interchangeable. However, when Level 1 is the transition-based parser, we find a self-application effect for the greedy parser, both in supertagging and stacking (rows 5 vs. 7 and 8). This is not surprising, since the decoding algorithms in the beam-search and the greedy transition-based parser are identical. It simply underlines the importance of having different algorithms in the setup.

Out-of-Domain Application
The previous experiment shows that the greedy parser at Level 0 gives competitive results compared to a sequence labeler. Having fast predictors available for stacking or supertagging suggests an application where speed matters, e.g., Ambati et al. (2014) propose supertags to improve the performance of fast parsers in a web-scale scenario.
As web data can be any kind of text, the question is whether the positive effects of supertagging and stacking are actually preserved in such an out-of-domain setting. To test this, we conduct experiments on the English Web Treebank (Bies et al., 2012).

[Table 9: Results (LAS) on the English Web Treebank.]
The results in Table 9 show consistent improvements across the five genres of the data set, both for supertagging and stacking. Both are thus good methods to improve parsing accuracy on out-of-domain data. Since parsing speed also depends on the Level 1 parser, a greedy transition-based parser would be preferable in such an application. Using supertagging with a sequence labeler to provide syntactic information to the greedy parser is then a good choice, because it avoids a self-application effect.
The last two rows in Table 9 show the performance when the greedy parser acts as Level 1. Supertagging improves over the baseline significantly on 4 out of 5 data sets. However, the baseline for the greedy parser is on average about 2% points absolute behind the other two parsers. This loss in accuracy buys a significant speedup, though: the greedy parser is about 29 times faster 11 than the graph-based parser on the English data set and even 80 times faster on the Arabic data set. As the Arabic data set has very long sentences, the higher complexity of the graph-based parser has a notable effect on its runtime. Compared to the beam-search transition-based parser, the greedy parser is about 10 times faster on English and 5 times faster on Arabic.

11 We report parsing time. Exact runtimes depend on implementation and hardware; we therefore give relative numbers so the reader gets an impression of the magnitude.

Combining Supertagging and Stacking
We now explore whether the combination of supertagging and stacking yields even better parsers.
In rows 3 and 6 in Table 10, we show results where the supertag and stacking features come from different sources, i.e., they were predicted by different tools. 12 For both parsers, the sequence labeler predicts the supertags and the respective other parser provides the tree for the stacking features. The combinations are better than the baseline. Rows 2 and 5 give results for the best single source, i.e., either STAG y SL or STACK y x. For most of the languages the difference between the combination and the best single component is not statistically significant; the exceptions are Arabic, German, Hungarian, and English for (STAG SL + STACK TB) GB, and Arabic for (STAG SL + STACK GB) TB. The increment goes up to 0.51 in the case of Hungarian. On average, however, the gains are marginal: the graph-based parser's accuracy increases by 0.2% absolute and the transition-based parser improves by 0.18% absolute. Although these differences denote improvements, they are not nearly as high as the improvements over the baseline for the single components, and it depends on the actual data set whether the combination is worth the effort.

12 We also experimented with combining supertag and stacking features from the same Level 0 tool; since the features were derived from the same tree, the differences compared to stacking alone were negligible, as expected.

In Section 4, we argued that supertagging and stacking are similar and that the diversity of tools is the more important factor. The improvements from the combination can also be interpreted along these lines: they are caused by using different tools rather than by the fact that we are combining the two methods. It is like stacking onto two parsers instead of one.

Conclusion
In this paper, we have shown that supertagging as a method for providing syntactic features for statistical dependency parsing (Ambati et al., 2014; Ouchi et al., 2014) is a form of stacking. Although supertags do not convey as much information as full trees, they improve dependency parsers to the same extent. The two methods are thus in principle interchangeable.
Combining stacking and supertagging only gives improvements if different tools are used. In this case, the improvements come from the involvement of different tools rather than their combination. Furthermore, using supertags in a parser that predicted them itself does not lead to improvements. This is in line with findings by Surdeanu and Manning (2010) on stacking, of which supertagging is a variant. Therefore, while it is not so important which method is used, it is important to use different algorithms in these setups.
Finally, we have shown that sequence labelers can be replaced by greedy parsers in supertagging without compromising quality or speed. We also applied both methods in a cross-domain parsing scenario and demonstrated that supertagging and stacking improve parsing in this setting as well.
However, there are circumstances where one method might be preferable over the other, for example, when one wants to stack on a slow parser (cf. Øvrelid et al. (2009)). Rather than running the slow parser on every sentence in a stacking setup, it can be run once on some training data. A supertagger can then be trained on this data to provide syntactic information at a fraction of the cost (see Ambati et al. (2014) for CCG).