How (not) to train a dependency parser: The curious case of jackknifing part-of-speech taggers

In dependency parsing, jackknifing part-of-speech taggers is used indiscriminately as a simple adaptation strategy. Here, we empirically evaluate when and how (not) to use jackknifing in parsing. Across 26 languages, we reveal a preference that conflicts with, and surpasses, the ubiquitous ten-fold setup. We further show no clear benefit to tagging the training data in cross-lingual parsing.


Introduction
Dependency parsers are trained on manually annotated treebank data. By contrast, when applied in the real world, they parse sequences of predicted parts of speech. As POS tagging accuracy drops due to domain change, parsing quality declines proportionally. Bringing these two POS tag sources closer together thus makes for a reasonable adaptation strategy.
Arguably the simplest such adaptation is n-fold jackknifing. In it, a treebank is divided into n equal parts, and each part is POS-tagged with a tagger trained on the remaining n − 1 parts. The procedure is repeated until all n parts carry predicted POS tags. A parser is then trained on the treebank thus altered, under the assumption that its POS features will now more closely resemble those of the input data.
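As a minimal sketch of this procedure, the following could serve; note that `train_tagger` and `tag` are placeholder callables standing in for any concrete tagger's training and tagging routines, not a specific API:

```python
import random

def nfold_jackknife(treebank, n, train_tagger, tag):
    """Return a retagged copy of `treebank` in which every sentence is
    POS-tagged by a model that never saw that sentence in training."""
    sents = list(treebank)
    random.shuffle(sents)
    # Partition into n near-equal, non-overlapping folds.
    folds = [sents[i::n] for i in range(n)]
    retagged = []
    for i, held_out in enumerate(folds):
        # Train on the other n - 1 folds, then tag the held-out fold.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        tagger = train_tagger(train)
        retagged.extend(tag(tagger, held_out))
    return retagged
```

The parser is then trained on the concatenation of the retagged folds, so its POS features are predicted rather than gold.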
Jackknifing is simplistic in that it i) has a very limited adaptation range for n ∈ ℕ⁺, and ii) does not in any way take the input data into account, other than through a vague assumption of an undefined amount of tagging noise in the input. As such, it yields very mixed results. Still, the method is now ubiquitous in the parsing literature.
In Figure 1, we survey the ACL Anthology (http://aclweb.org/anthology/) for POS jackknifing. We uncover that ∼80% of the 70 parsing papers we retrieved make use of ten-fold jackknifing. This choice spans the various languages and domains parsed in these papers, and is even motivated by simply "following the traditions in literature".

Figure 1: Jackknifing in the ACL Anthology. Distribution of n over 70 parsing papers that use tagger n-folding.

Our contributions. We evaluate jackknifing to establish whether its use is warranted in dependency parsing. Controlling for tagging quality in training and testing, we experiment with monolingual and delexicalized cross-lingual parsers over 26 languages, showing that: i) indiscriminate use of ten-fold jackknifing results in sub-optimal parsing; ii) tagging the training data does not yield clear benefits in realistic cross-lingual parsing; iii) our jackknifing extension improves parsing through finer-grained adaptation.

Method
Jackknifing generally refers to a leave-one-out procedure for reducing bias in parameter estimation from an unbiased sample (Quenouille, 1956; Tukey, 1958). More recently, in machine learning, the term is used synonymously with "cross-validation" for estimating predictive model performance. In NLP, jackknifing has commonly described a procedure by which the training input is adjusted to correspond more closely to the expected test input, and it is in this latter sense that we use the term here. In parsing research in particular, n-fold jackknifing proceeds as follows. The treebank is first partitioned into n non-overlapping subsets of equal size. Then, iteratively, each part acts as a test subset and is tagged using a model induced from the remaining n − 1 parts, the training subset, until the entire treebank is tagged.
We want to control for POS tagging accuracy through the jackknifing method. To do this, we train a tagger on subsets of the training set of increasing size. In fold terminology, this corresponds to dividing the training set into equal parts of size 1/n, training on (n − 1)/n of the training set, and testing on the remaining 1/n. However, this constrains the training subset to be larger than half the original data, and thus concentrates our study on models that use almost all the data, since the nonlinear curve f(n) = (n − 1)/n becomes very flat very fast. Thus, varying fold numbers reveals very little variation in POS tagging accuracy in the lower accuracy range.
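The flatness of this curve is easy to verify numerically; the values below follow directly from f(n) = (n − 1)/n:

```python
# Fraction of the treebank used for tagger training under n-fold
# jackknifing: f(n) = (n - 1) / n.
fractions = {n: (n - 1) / n for n in range(2, 21)}

# Even the smallest admissible value, n = 2, already trains on half
# the data; by n = 10 we are at 90%, and the gains flatten out fast.
assert fractions[2] == 0.5
assert fractions[10] == 0.9
assert fractions[20] == 0.95
```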
Linear extension. We now propose a simple extension of the jackknifing paradigm to study parser accuracy given a percentage p of the training set: linear jackknifing.
Let p ∈ (0, 1) be the percentage of the randomly shuffled training set D used to induce a model that tags some of the remaining instances. A training subset of this size allows a test subset of size at most ⌈|D| ⋅ (1 − p)⌉. Given a test subset to tag, we induce a model from a random subset of the remaining examples, of size approximately p ⋅ |D|, which becomes our training subset. We randomize the choice of examples in the training subset to avoid introducing bias. In order to tag all of D, the minimum number of models we need to generate is ⌈1/(1 − p)⌉. We thus separate D into test subsets accordingly.

Intrinsic evaluation. For increasing values of p, at 5% increments, we carried out linear jackknifing on 26 languages. For each p, we averaged the performance of the induced taggers on the respective gold standards. Figure 2 illustrates the difference in informativeness of the two approaches, where each tagging accuracy score is averaged across the 26 languages. We see that with n-fold jackknifing, tagging accuracy is constrained to between approximately 92% and 95%, whereas linear jackknifing explores accuracies as low as approximately 86%. Moreover, the confidence intervals are consistent across values of p, demonstrating that tagging models generated on less data (lower p) remain unbiased. We now show that these smaller values of p are essential for good parser performance in some cases of jackknifing.
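A sketch of this partitioning scheme could look as follows; as before, `train_tagger` and `tag` are placeholder callables, not a specific tagger's interface:

```python
import math
import random

def linear_jackknife(treebank, p, train_tagger, tag):
    """Tag all of `treebank` using models each trained on roughly a
    fraction p of it, following the linear jackknifing scheme."""
    sents = list(treebank)
    random.shuffle(sents)
    # Each test chunk holds at most ceil(|D| * (1 - p)) sentences,
    # so about ceil(1 / (1 - p)) models are induced in total.
    chunk = max(1, math.ceil(len(sents) * (1 - p)))
    retagged = []
    for start in range(0, len(sents), chunk):
        held_out = sents[start:start + chunk]
        rest = sents[:start] + sents[start + chunk:]
        # Random sample of ~p * |D| training sentences, to avoid bias.
        k = min(len(rest), round(p * len(sents)))
        tagger = train_tagger(random.sample(rest, k))
        retagged.extend(tag(tagger, held_out))
    return retagged
```

Unlike n-folding, p can be pushed arbitrarily low, which is what lets the method reach the lower tagging-accuracy range.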

Experiments
Our experiment aims at judging the adequacy of jackknifing in dependency parsing. First, we outline the experiment setup, which comprises two sets of experiments: i) monolingual, where lexicalized parsers are trained on treebanks for their respective languages, and ii) cross-lingual, which features SINGLE-best and MULTI-source delexicalized parsers.
Tagging sources. Through jackknifing, we explore how the mismatch between training and test POS tags affects parsing. Our setup thus critically relies on the sources of tags. We tag our test sets using: i) PRED, the monolingual taggers, and ii) PROJ, the low-resource taggers by Agić et al. (2016), based on annotation projection. We do not experiment with gold POS tags in the test sets; instead, we focus only on realistic parsing over predicted tags. The tags in our training sets can be GOLD, PROJ, or predicted through n-fold or linear jackknifing.
In n-fold jackknifing, we experiment with n ∈ {2, 3, …, 20}, while for the linear extension we set p ∈ {5%, 10%, …, 95%}. We report the average parsing scores over 5 runs for each n and p, so as to mitigate the effects of random shuffling in the two jackknifing procedures. In finding the optimal parameter values n_max and p_max, we report the highest values in case of ties. For example, if n = 5 and n = 10 both yield the same maximum UAS, we set n_max = 10.
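The tie-breaking rule can be stated compactly; the UAS scores below are invented purely for illustration:

```python
def best_param(scores):
    """Return the parameter value with the highest UAS; break ties
    by preferring the largest parameter (e.g. n_max = 10 over 5)."""
    best_uas = max(scores.values())
    return max(p for p, uas in scores.items() if uas == best_uas)

# Invented UAS scores per fold number n, with a tie at the top:
uas_by_n = {2: 81.3, 5: 82.0, 10: 82.0, 20: 81.9}
assert best_param(uas_by_n) == 10
```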
We emphasize the importance of realistic settings, especially in cross-lingual parsing. Thus, we commit to using PROJ taggers with an outlook on true low-resource languages.
Data. We use the Universal Dependencies (UD) treebanks version 1.2 (Nivre et al., 2016). 3 As the projection-based taggers are trained on the WTC dataset by Agić et al. (2016), we intersect the list of WTC languages with the UD list for a total of 26 languages.
Tagging and parsing. For POS tagging, we use the TNT tagger by Brants (2000). The PRED taggers score 94.1±1.1%, while the low-resource PROJ taggers are on average 71.7±5.7% accurate. We experiment with two parsers. Bohnet's (2010) second-order graph-based system MATE is the primary one. Further, we verify all parsing results using YARA, a transition-based parser with dynamic oracles (Rasooli and Tetreault, 2015). The following CoNLL 2009 features are used for training the parsers: ID, FORM (in monolingual parsing only), POS, and HEAD. Since ours is not a benchmarking effort, we apply all systems with their default settings.

Results
In monolingual parsing over PRED tags (Table 1), we achieve an identical average UAS with linear and n-fold jackknifing. Our adaptations surpass training with GOLD data by +1.1 UAS. Linear jackknifing improves over GOLD training by +8.1 UAS when parsing over low-resource PROJ tags. There, we top GOLD training with n-fold jackknifing as well, but it trails the linear variant by −3.9 UAS. In the low-resource PROJ setup, PROJ-trained parsers are dominant. They score +6.8 UAS over linear, +10.7 UAS over n-folding, and +14.8 UAS over GOLD training.

Figure 3: Parsing accuracy (UAS) in relation to linear jackknifing over 26 languages, with two sources of test set POS tags.

Figure 3 plots the relation between the sample size p in linear jackknifing and the resulting UAS in parsing, split by PRED and PROJ test-set taggings. Parsing over PRED tags, UAS generally increases with p, but the increase is rather small: over 26 languages, moving p from 5% to 95% yields only +0.7 UAS on average. By contrast, adapting to the lower-quality PROJ tags sees a larger +5 UAS benefit from decreasing p all the way to 5%, which is well outside the n-fold range, as indicated for n ∈ {2, …, 20} by the dotted lines in the figure.
Our cross-lingual parsing experiment (Table 2) contrasts two options: we either tag (PROJ↝) or do not tag (GOLD↝) the parser training data. To reflect realistic low-resource parsing, the test data is tagged with PROJ taggers only. On average, the unadapted parsers are slightly better (UAS: +1.1 MULTI, +0.1 SINGLE). However, they are almost evenly split with the adapted ones in terms of offering the best performance for 12-14 out of the 26 test languages each. These results suggest, at least for simplicity, a preference for not tagging the treebanks.

Discussion
Linear or n-fold? In resource-rich PRED parsing, the two jackknifing methods are evenly split, with identical average UAS and an overlap on 13 languages. In low-resource PROJ parsing, n-folding falls far behind, as the constraint n ≥ 2 prevents it from adapting accordingly. The median p_max in PRED and PROJ are 75% and 5%. The former roughly corresponds to 4-fold jackknifing, while the latter is far below the two-fold range. The median n_max are 11 and 2, and we note that n_max is rarely ∼10 in Table 1.
If we simply use ten-fold jackknifing for PRED tags, we match the p_max scores for only 9 of 26 languages, and we score −0.2 UAS on average. With n = 10 and PROJ tags, the disconnect is much more substantial, and we are unable to reach p_max (−4.6 UAS).
The GOLD training data is never the best choice in our monolingual parsing experiment, regardless of whether the test tags are PRED or PROJ. This result in itself justifies the usage of jackknifing as adaptation for the monolingual setting, provided that it is not indiscriminate.
Finding p_max. For choosing the optimal linear jackknifing in real-world parsing, we note that p_max correlates closely with test set tagging accuracy (Spearman's ρ = 0.76), and negatively with treebank size (ρ = −0.42, for |D| ≤ 10k sentences). Thus, to adapt via linear jackknifing, we must i) approximate the expected tagging accuracy on the input data, while at the same time ii) account for the fact that the accuracy associated with any p depends on treebank size as well. In other words, given two treebanks with |D_1| < |D_2|, we would typically need p_1 > p_2 to match the same test-set accuracy.
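For illustration, a minimal tie-free Spearman ρ on toy data; the per-language (accuracy, p) pairs here are invented, not our actual measurements:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation; assumes no ties, for illustration."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Toy per-language pairs of (test-set tagging accuracy, best p):
acc = [0.71, 0.80, 0.88, 0.94]
p_best = [0.05, 0.25, 0.60, 0.80]
rho = spearman_rho(acc, p_best)  # perfectly monotonic toy data: 1.0
```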
The other parser. Replacing the MATE parser with the transition-based YARA does not change the outcome of our monolingual parsing experiment, save for an average 0.58–1.65 drop in UAS. On the other hand, in cross-lingual parsing, YARA highlights the benefits of not tagging the training data, as the GOLD↝PROJ parsers are there the best choice for 17/26 languages. On average, we see +2.1 UAS for MULTI, and +0.7 for SINGLE, over PROJ↝PROJ. This is especially worth noting since large-scale parsing generally favors transition-based systems.

GOLD test tags. Thus far, we have shown the need for more careful jackknifing in parser training with respect to the expected tagging quality at parse time. Fixing n = 10 was suboptimal in parsing over the fully supervised PRED tags, while n = 2, 10 were way below the threshold in low-resource parsing over our cross-lingual PROJ tags. We have purposely excluded GOLD test tags from the discussion so far.
Still, while parsing over GOLD POS input does not hold much significance for real-world applications, it is worth noting how jackknifing performs at the upper limit: trying to reach the accuracy of parsers trained and tested on GOLD tags. In that particular setup, we observe a maximum UAS of 83.8 for median n_max = 12 and p_max = 80%. The respective modal values are n = 20 and p = 95%, meaning that for most languages, we come closest to GOLD↝GOLD by maximizing tagging accuracy. The overall score falls 0.7 UAS below the upper bound.

Related work
Jackknifing itself is for the most part incidental to the work that employs it. Here, we mention a few notable exceptions. Che et al. (2012) compare jackknifing to using gold tags in parsing Chinese constituents and dependencies, where they observe mixed results: an improvement with one parser, and a decrease with the other. Seeker and Kuhn (2013) briefly touch upon the importance of jackknifing in bridging the gap between training and test data, and experiment with 5- and 10-folds. Honnibal and Johnson (2015) contrast jackknifing with joint learning in passing, giving precedence to the latter for simplicity. Finally, Kong et al. (2015) follow Zhu et al. (2013) in ten-folding for Chinese and English, citing 2.0% and 0.4% improvements. Incidentally, jackknifing parsers then hurts their performance in tree conversions.

Conclusions
The parsing literature is riddled with indiscriminate use of n-fold part-of-speech tagger jackknifing as makeshift domain adaptation.
In this paper we have proposed a careful empirical treatment of jackknifing in dependency parsing, far surpassing ten-folding via fine-grained control over the data adjustment.