It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool

The last few years have seen a surge in the number of accurate, fast, publicly available dependency parsers. At the same time, the use of dependency parsing in NLP applications has increased. It can be difficult for a non-expert to select a good "off-the-shelf" parser. We present a comparative analysis of ten leading statistical dependency parsers on a multi-genre corpus of English. For our analysis, we developed a new web-based tool that gives a convenient way of comparing dependency parser outputs. Our analysis will help practitioners choose a parser to optimize their desired speed/accuracy trade-off, and our tool will help practitioners examine and compare parser output.


Introduction
Dependency parsing is a valuable form of syntactic processing for NLP applications due to its transparent lexicalized representation and its robustness with respect to flexible word order languages. Thanks to over a decade of research on statistical dependency parsing, many dependency parsers are now publicly available. In this paper, we report on a comparative analysis of leading statistical dependency parsers using a multi-genre corpus. Our purpose is not to introduce a new parsing algorithm but to assess the performance of existing systems across different genres of language use and to provide tools and recommendations that practitioners can use to choose a dependency parser. The contributions of this work include:
• A comparison of the accuracy and speed of ten state-of-the-art dependency parsers, covering a range of approaches, on a large multi-genre corpus of English.
• A new web-based tool, DEPENDABLE, for side-by-side comparison and visualization of the output from multiple dependency parsers.
• A detailed error analysis for these parsers using DEPENDABLE, with recommendations for parser choice for different factors.
• The release of the set of dependencies used in our experiments, the test outputs from all parsers, and the parser-specific models.

Related Work
There have been several shared tasks on dependency parsing conducted by CoNLL (Buchholz and Marsi, 2006; Nivre and others, 2007; Surdeanu and others, 2008; Hajič and others, 2009), SANCL (Petrov and McDonald, 2012), SPMRL (Seddah and others, 2013), and SemEval (Oepen and others, 2014). These shared tasks have led to the public release of numerous statistical parsers. The primary metrics reported in these shared tasks are: labeled attachment score (LAS), the percentage of predicted dependencies where both the arc and the label are assigned correctly; unlabeled attachment score (UAS), where the arc is assigned correctly; label accuracy score (LS), where the label is assigned correctly; and exact match (EM), the percentage of sentences whose predicted trees are entirely correct. Although shared tasks have been tremendously useful for advancing the state of the art in dependency parsing, most English evaluation has employed a single-genre corpus, the WSJ portion of the Penn Treebank (Marcus et al., 1993), so it is not immediately clear how these results generalize.[1] Furthermore, a detailed comparative error analysis is typically lacking. The most detailed comparison of dependency parsers to date was performed by McDonald and Nivre (2007); they analyzed accuracy as a function of sentence length, dependency distance, valency, non-projectivity, part-of-speech tags and dependency labels.[2] Since then, additional analyses of dependency parsers have been performed, but either with respect to specific linguistic phenomena (e.g. Nivre et al., 2010; Bender et al., 2011) or to downstream tasks (e.g. Miwa and others, 2010; Petrov et al., 2010; Yuret et al., 2013).

              BC       BN       MZ       NW       PT       TC      WB       ALL
  Training    171,120  206,057  163,627  876,399  296,437  85,466  284,975  2,084,081
  Development  29,962   25,274   15,422  147,958   25,206  11,467   36,351    291,640
  Test         35,952   26,424   17,875   60,757   25,883  10,976   38,490    216,357
  Training     10,826   10,349    6,672   34,492   21,419   8,969   12,452    105,179
  Development   2,117    1,295      642    5,896    1,780   1,634    1,797     15,161
  Test          2,211    1,357      780    2,327    1,869   1,366    1,787     11,697

Table 1: Distribution of data used for our experiments. The first three rows show the number of tokens and the last three rows the number of trees in each genre. BC: broadcast conversation, BN: broadcast news, MZ: news magazine, NW: newswire, PT: pivot text, TC: telephone conversation, WB: web text, ALL: all genres combined.
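To make the four metric definitions above concrete, the following minimal sketch computes them from parallel gold and predicted (head, label) sequences. The function name and data layout are our own illustration, not part of any shared-task evaluation script:

```python
def attachment_scores(gold_sents, pred_sents):
    """LAS/UAS/LS/EM over parallel lists of sentences.

    Each sentence is a list of (head, label) pairs, one per token.
    """
    arc_ok = lab_ok = both_ok = total = exact = 0
    for gold, pred in zip(gold_sents, pred_sents):
        sent_correct = True
        for (g_head, g_lab), (p_head, p_lab) in zip(gold, pred):
            total += 1
            arc_ok += (g_head == p_head)   # UAS counts correct arcs
            lab_ok += (g_lab == p_lab)     # LS counts correct labels
            if g_head == p_head and g_lab == p_lab:
                both_ok += 1               # LAS requires both
            else:
                sent_correct = False
        exact += sent_correct              # EM requires a perfect tree
    return {"UAS": 100.0 * arc_ok / total,
            "LS":  100.0 * lab_ok / total,
            "LAS": 100.0 * both_ok / total,
            "EM":  100.0 * exact / len(gold_sents)}
```

Note that LAS ≤ min(UAS, LS) by construction, since a token counts toward LAS only if it counts toward both of the other scores.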

OntoNotes 5
We used the English portion of the OntoNotes 5 corpus, a large multi-lingual, multi-genre corpus annotated with syntactic structure, predicate-argument structure, word senses, named entities, and coreference (Weischedel and others, 2011; Pradhan and others, 2013). We chose this corpus rather than the Penn Treebank used in most previous work because it is larger (2.9M vs. 1M tokens) and more diverse (7 vs. 1 genres). We used the standard data split used in CoNLL'12,[3] but removed sentences containing only one token so as not to artificially inflate accuracy. Table 1 shows the distribution across genres of training, development, and test data. For the most strict and realistic comparison, we trained all ten parsers using automatically assigned POS tags from the tagger in ClearNLP (Choi and Palmer, 2012a), which achieved accuracies of 97.34 and 97.52 on the development and test data, respectively. We also excluded any "morphological" features from the input, as these are often not available in non-annotated data.

[1] The SANCL shared task used OntoNotes and the Web Treebanks instead for better generalization.
[2] A detailed error analysis of constituency parsing was performed by Kummerfeld and others (2012).
[3] conll.cemantix.org/2012/download/ids/

Dependency Conversion
OntoNotes provides annotation of constituency trees only. Several programs are available for converting constituency trees into dependency trees. Table 2 shows a comparison between three of the most widely used: the LTH (Johansson and Nugues, 2007), Stanford (de Marneffe and Manning, 2008), and ClearNLP (Choi and Palmer, 2012b) dependency converters. Compared to the Stanford converter, the ClearNLP converter produces a similar set of dependency labels but generates fewer unclassified dependencies (0.23% vs. 3.62%), which makes the training data less noisy.
Both the LTH and ClearNLP converters produce long-distance dependencies and use function tags for the generation of dependency relations, which allows them to generate rich dependency structures, including non-projective dependencies. However, only the ClearNLP converter has adopted the new Treebank guidelines used in OntoNotes. It can also produce secondary dependencies (e.g. right-node raising, referent), which can be used for further analysis. We used the ClearNLP converter to produce the dependencies for our experiments.

Parsers
We compared ten state-of-the-art parsers representing a wide range of contemporary approaches to statistical dependency parsing (Table 3). We trained each parser using the training data from OntoNotes. For all parsers, we trained using the automatic POS tags generated during data preprocessing, as described above.
Training settings For most parsers, we used the default settings for training. For the SNN parser, following the recommendation of the developers, we used the word embeddings of Collobert and others (2011).

Development data ClearNLP, LTDP, SNN and Yara make use of the development data (for parameter tuning). Mate and Turbo self-tune parameter settings using the training data. The others were trained using their default/"standard" parameter settings.
Beam search ClearNLP, LTDP, Redshift and Yara have the option of different beam settings. The higher the beam size, the more accurate the parser usually becomes, but typically at the expense of speed. For LTDP and Redshift, we experimented with beams of 1, 8, 16 and 64 and found that the highest accuracy was achieved at beam 8. For ClearNLP and Yara, a beam size of 64 produced the best accuracy, while a beam size of 1 for LTDP, ClearNLP, and Yara produced the best speed performance. Given this trend, we also include how those three parsers perform at beam 1 in our analyses.
Feature sets RBG, Turbo and Yara offer different feature sets. A more complex or larger feature set has the advantage of higher accuracy, but often at the expense of speed. For RBG and Turbo, we used the "Standard" setting, and for Yara, we used the default ("not basic") feature setting.
Output All the parsers other than LTDP output labeled dependencies. The ClearNLP, Mate, RBG, and Turbo parsers can generate non-projective dependencies.

DEPENDABLE: Web-based Evaluation and Visualization Tool
There are several very useful tools for evaluating the output of dependency parsers, including the venerable eval.pl script used in the CoNLL shared tasks, and newer Java-based tools that support visualization of and search over parse trees, such as TedEval (Tsarfaty et al., 2011), MaltEval (Nilsson and Nivre, 2008) and "What's wrong with my NLP?". Recently, there has been momentum towards web-based tools for annotation and visualization of NLP pipelines (Stenetorp and others, 2012). For this work, we used a new web-based tool, DEPENDABLE, developed by the first author of this paper. It requires no installation and so provides a convenient way to evaluate and compare dependency parsers.

Figure 1: Screenshot of our evaluation tool.

The following are key features of DEPENDABLE:
• It reads any type of Tab Separated Value (TSV) format, including the CoNLL formats.
• It computes LAS, UAS and LS for parse outputs from multiple parsers against gold (manual) parses.
• It computes exact match scores for multiple parsers, and "oracle ensemble" output, the upper bound performance obtainable by combining all parser outputs.
• It allows the user to exclude symbol tokens, projective trees, or non-projective trees.
• It produces detailed analyses by POS tags, dependency labels, sentence lengths, and dependency distances.
• It reports the statistical significance of differences between parse outputs (using McNemar's test).
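For intuition, McNemar's test compares two parsers using only the tokens on which exactly one of them is correct. Below is a minimal sketch of the exact (binomial) form of the test; the function name and data layout are illustrative, not DEPENDABLE's actual implementation:

```python
from math import comb

def mcnemar_exact_p(correct_a, correct_b):
    """Exact (binomial) McNemar test on paired per-token results.

    correct_a / correct_b are parallel booleans: whether parser A / B
    got each token's attachment right. Only discordant tokens (where
    exactly one parser is right) enter the statistic.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # parsers never disagree
    # two-sided exact p-value under Binomial(n, 0.5)
    p = 2.0 * sum(comb(n, i) for i in range(k + 1)) / 2.0 ** n
    return min(p, 1.0)
```

With large test sets a chi-square approximation is common, but the exact form avoids approximation error when the number of discordant tokens is small.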
DEPENDABLE can also be used for visualizing and comparing multiple dependency trees together (Figure 2). A key feature is that the user may select parse trees by specifying a range of accuracy scores; this enabled us to perform the error analyses in Section 6.5. DEPENDABLE also allows one to filter trees by sentence length, and it highlights arc and label errors. The evaluation and comparison tools are publicly available at http://nlp.mathcs.emory.edu/clearnlp/dependable.

Results and Error Analysis
In this section, we report overall parser accuracy and speed. We analyze parser accuracy by sentence length, dependency distance, non-projectivity, POS tags and dependency labels, and genre. We report detailed manual error analyses focusing on sentences that multiple parsers parsed incorrectly.[22] All analyses, other than parsing speed, were conducted using the DEPENDABLE tool.[23] The full set of outputs from all parsers, as well as the trained models for each parser, are available at http://amandastent.com/dependable/. We also include the greedy parsing results of ClearNLP, LTDP, and Yara in two of our analyses to better illustrate the differences between the greedy and non-greedy settings. The greedy parsing results are denoted by the subscript 'g'. These two analyses are the overall accuracy results, presented in Section 6.1 (Table 4), and the overall speed results, presented in Section 6.2 (Table 5 and Figure 3). All other analyses exclude ClearNLP_g, LTDP_g and Yara_g.

[22] For one sentence in the NW data, the LTDP parser failed to produce a complete parse containing all tokens, so we removed this sentence for all parsers, leaving 11,696 trees (216,313 tokens) in the test data.
[23] We compared the results produced by DEPENDABLE with those produced by eval07.pl, and verified that LAS, UAS, LS, and EM were the same when punctuation was included. Our tool uses a slightly different symbol set than eval07.pl: !"#$%&'()*+,-

Table 4: Overall parsing accuracy. The top 6 rows and the bottom 7 rows show accuracies for greedy and non-greedy parsers, respectively.

Overall Accuracy
In Table 4, we report overall accuracy for each parser. For clarity, we report results separately for the greedy and non-greedy versions of the parsers. Over all the different metrics, Mate is a clear winner, though ClearNLP, RBG, Redshift, Turbo and Yara are very close in performance. Looking at only the greedy parsers, ClearNLP_g shows a significant advantage over the others. We conducted statistical significance tests for the parsers (greedy versions excluded). All LAS differences are statistically significant at p < .01 (using McNemar's test), except for: RBG vs. Redshift, Turbo vs. Yara, Turbo vs. ClearNLP and Yara vs. ClearNLP. All UAS differences are statistically significant at p < .01, except for: SNN vs. LTDP, Turbo vs. Redshift, Yara vs. RBG and ClearNLP vs. Yara.

Overall Speed
We ran timing experiments on a 64-core machine with 16 Intel Xeon E5620 2.40 GHz processors and 24GB RAM, and used the Unix time command to time each run. Some parsers are multithreaded; for these, we ran in single-thread mode (since any parser can be externally parallelized). Most parsers do not report model load time, so we first ran each parser five times with a test set of 10 sentences, and then averaged the middle three times to get the average model load time. Next, we ran each parser five times with the entire test set and derived the overall parse time by averaging the middle three parse times. We then subtracted the average model load time from the average parse time and averaged over the number of sentences and tokens (recall that we exclude single-token sentences from our tests).

Table 5 shows overall parsing speed for each parser. spaCy is the fastest greedy parser and Redshift is the fastest non-greedy parser. Figure 3 shows an analysis of parsing speed by sentence length in bins of length 10. As expected, as sentence length increases, parsing speed decreases markedly.
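The timing protocol above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions (the parser command lines and file names are placeholders, not the actual commands used):

```python
import subprocess
import time

def middle_three_mean(times):
    """Average of the middle three of five measurements,
    discarding the fastest and slowest runs."""
    return sum(sorted(times)[1:4]) / 3.0

def timed_runs(cmd, n=5):
    """Run an external command n times and average the middle three."""
    runs = []
    for _ in range(n):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        runs.append(time.perf_counter() - start)
    return middle_three_mean(runs)

# Hypothetical usage (command and file names are placeholders):
# load_time  = timed_runs(["parser", "ten_sentence_warmup.conll"])
# total_time = timed_runs(["parser", "full_test_set.conll"])
# per_token  = (total_time - load_time) / num_tokens
```

Taking the middle three of five runs is a simple trimmed mean that discards one-off outliers caused by machine load or cold caches.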

Detailed Accuracy Analyses
For the following more detailed analyses, we used all tokens (including punctuation). As mentioned earlier, we exclude ClearNLP_g, LTDP_g and Yara_g from these analyses and instead use the parsers' respective non-greedy modes, which yield higher accuracy.
Sentence Length We analyzed parser accuracy by sentence length in bins of length 10 (Figure 4).

Dependency Distance We analyzed parser accuracy by dependency distance (the distance from each dependent to its head; Figure 5). Accuracy falls off more slowly as dependency distance increases for the top 6 parsers vs. the rest.
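One common definition of dependency distance is the surface distance between a dependent and its head; a minimal sketch, assuming CoNLL-style 1-based head indices with 0 marking the root (the function name is our own):

```python
def dependency_distances(heads):
    """Surface distance |dependent - head| for each non-root token.

    heads uses 1-based token indices, with 0 marking the root.
    Root attachments are excluded, since the artificial root has
    no surface position.
    """
    return [abs((i + 1) - h) for i, h in enumerate(heads) if h != 0]
```

Binning these distances per token is what produces an analysis like Figure 5.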
Projectivity Some of our parsers only produce projective parses. Table 6 shows parsing accuracy for trees containing only projective arcs (11,231 trees, 202,521 tokens) and for trees containing non-projective arcs (465 trees, 13,792 tokens). As before, all differences are statistically significant at p < .01 except for: Redshift vs. RBG for overall LAS; LTDP vs. SNN for overall UAS; and Turbo vs. spaCy for overall UAS. For strictly projective trees, the LTDP parser is 5th from the top in UAS. Apart from this, the grouping between "very good" and "good" parsers does not change.

Table 6: Accuracy for projective and non-projective trees.
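A tree is non-projective when two of its arcs cross. The standard check can be sketched as follows (an illustrative helper, not taken from any of the evaluated parsers):

```python
def is_projective(heads):
    """True iff no two arcs cross (heads: 1-based, 0 = root)."""
    arcs = [(h, i + 1) for i, h in enumerate(heads) if h != 0]
    for h, d in arcs:
        lo, hi = min(h, d), max(h, d)
        for k in range(lo + 1, hi):
            if not lo <= heads[k - 1] <= hi:
                # token k sits under the arc (h, d) but its head lies
                # outside the arc's span, so the two arcs cross
                return False
    return True
```

Partitioning test trees with a predicate like this is how the projective/non-projective split in Table 6 can be derived.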

Dependency Relations
We were interested in which dependency relations were parsed with high or low overall accuracy, and for which relations accuracy varied between parsers. The dependency relations with the highest average LAS scores (> 97%) were possessive, hyph, expl, hmod, aux, det and poss. These relations have strong lexical clues (e.g. possessive) or occur very often (e.g. det). Those with the lowest LAS scores (< 50%) were csubjpass, meta, dep, nmod and parataxis. These either occur rarely or are very general (dep). The most "confusing" dependency relations (those with the biggest range of accuracies across parsers) were csubj, preconj, csubjpass, parataxis, meta and oprd (all with a spread of > 20%). The Mate and Yara parsers each had the highest accuracy for 3 of the top 10 "confusing" dependency relations. The RBG parser had the highest accuracy for 4 of the top 10 "most accurate" dependency relations. SNN had the lowest accuracy for 5 of the top 10 "least accurate" dependency relations, while the RBG parser had the lowest accuracy for another 4.

POS Tags
We also examined error types by the part-of-speech tag of the dependent. The POS tags with the highest average LAS scores (> 97%) were the highly unambiguous tags POS, WP$, MD, TO, HYPH, EX, PRP and PRP$. With the exception of WP$, these tags occur frequently. Those with the lowest average LAS scores (< 75%) were the punctuation tags "(", ")" and ":", and the rare tags AFX, FW, NFP and LS.
Genres Table 7 shows parsing accuracy for each parser on each of the seven genres comprising the English portion of OntoNotes 5. Mate and ClearNLP achieve the highest accuracy for some genres, although accuracy differences among the top four parsers are generally small. Accuracy is highest for PT (pivot text, the Bible) and lowest for TC (telephone conversation) and WB (web data). The web data is itself multi-genre and includes translations from Arabic and Chinese, while the telephone conversation data includes disfluencies and informal language.

Oracle Ensemble Performance
One popular method for achieving higher accuracy on a classification task is system combination (Björkelund and others, 2014; Le Roux and others, 2012; Le Roux et al., 2013; Sagae and Lavie, 2006; Sagae and Tsujii, 2010; Haffari et al., 2011). DEPENDABLE reports ensemble upper bound performance assuming that the best tree can be identified by an oracle (macro), or that the best arc can be identified by an oracle (micro). Table 8 reports these upper bounds. Some parsers' outputs overlapped comparatively little with those of other parsers; their respective "best match" scores were never higher than 55.
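The two oracle upper bounds can be sketched as follows: the macro oracle picks the single best whole tree per sentence, while the micro oracle picks the best arc per token across all parsers. This is a minimal illustration of the idea, not DEPENDABLE's actual code:

```python
def oracle_upper_bounds(gold_sents, parser_outputs):
    """Macro/micro oracle LAS upper bounds for an ensemble.

    gold_sents: list of sentences, each a list of (head, label) pairs.
    parser_outputs: one such list per parser, parallel to gold_sents.
    """
    macro_ok = micro_ok = total = 0
    for i, gold in enumerate(gold_sents):
        preds = [out[i] for out in parser_outputs]
        # macro: an oracle selects the single best whole tree
        macro_ok += max(sum(g == p for g, p in zip(gold, pred))
                        for pred in preds)
        # micro: a token counts if ANY parser attached it correctly
        micro_ok += sum(any(pred[j] == g for pred in preds)
                        for j, g in enumerate(gold))
        total += len(gold)
    return 100.0 * macro_ok / total, 100.0 * micro_ok / total
```

The micro bound always dominates the macro bound, since the per-token oracle is free to mix arcs from different parsers within one sentence.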

Error Analysis
From the test data, we extracted parses where only one parser achieved very high accuracy, and parses where only one parser had low accuracy (Table 9). As with the detailed performance analyses, we used the most accurate version of each parser for this analysis. Mate has the highest number of "generally good" parses, while the SNN parser has the highest number of "uniquely bad" parses. The SNN parser tended to choose the wrong root, but this did not appear to be tied to the number of verbs in the sentence; rather, the SNN parser simply makes the earliest "reasonable" choice of root.

Table 9: Differential parsing accuracies.
To further analyze these results, we first looked at the parse trees for "errorful" sentences where the parsers agreed. From the test data, we extracted parses for sentences where at least two parsers got a UAS of < 50%. This gave us 253 sentences. The distribution of these errors across genres varied: PT 2.8%, MZ 3.5%, BN 9.8%, NW 10.3%, WB 17.4%, BC 25.3%, TC 30.8%. By manual comparison using the DEPENDABLE tool, we identified frequently occurring potential sources of error. We then manually annotated all sentences for these error types. Figure 6 shows the number of "errorful" sentences of each type. Punctuation attachment "errors" are prevalent. For genres with "noisy" text (e.g. broadcast conversation, telephone conversation), a significant proportion of errors come from fragmented sentences or sentences containing backchannels or disfluencies. There are also a number of sentences with what appear to be manual dependency labeling errors in the gold annotation.

Table 7: Parsing accuracy by genre.

Figure 6: Common error types in erroneous trees.

Recommendations
Each of the transition-based parsers included in this evaluation can use varying beam widths to trade off speed vs. accuracy, and each parser has numerous other parameters that can be tuned. Notwithstanding all these variables, we can make some recommendations. Figure 7 illustrates the speed vs. accuracy trade-off across the parsers. For highest accuracy (e.g. in dialog systems), Mate, RBG, Turbo, ClearNLP and Yara are good choices. For highest speed (e.g. in web-scale NLP), spaCy and ClearNLP_g are good choices; SNN and Yara_g are also good choices when accuracy is less critical.

Conclusions and Future Work
In this paper we have: (a) provided a detailed comparative analysis of several state-of-the-art statistical dependency parsers, focusing on accuracy and speed; and (b) presented DEPENDABLE, a web-based tool for evaluating and comparing dependency parsers. In the future, we plan to add regular expression search over parses, and sorting within results tables. Our hope is that the results from the evaluation, as well as the tool, will give non-experts in parsing better insight into which parsing tool works well under differing conditions. We also hope that the tool can be used to facilitate evaluation and as a teaching aid in NLP courses. Supplements to this paper include the tool, the parse outputs, the statistical models for each parser, and the new set of dependency trees for OntoNotes 5 created using the ClearNLP dependency converter. We do recommend examining one's data and task before choosing and/or training a parser. Are non-projective parses likely or desirable? Does the data contain disfluencies, sentence fragments, and other "noisy text" phenomena? What are the mean and standard deviation of sentence length and dependency distance? The analyses in this paper can be used to select a parser if one has the answers to these questions.
In this work we did not implement an ensemble of parsers, partly because an ensemble necessarily entails complexity and/or speed penalties that render it unusable by all but experts. However, our analyses indicate that it may be possible to achieve small but significant increases in dependency parsing accuracy through ensemble methods. A good place to start would be with ClearNLP, Mate, or Redshift in combination with LTDP and Turbo, SNN or spaCy. In addition, it may be possible to achieve good performance in particular genres by building "mini-ensembles" trained on general-purpose data and genre-specific data (e.g. WB). We leave this for future work. We also leave for future work the comparison of these parsers across languages.
It remains to be seen what downstream impact differences in parsing accuracy of 2-5% have on the goal task. If the impact is small, then speed and ease of use are the criteria to optimize, and here spaCy, ClearNLP g , Yara g and SNN are good choices.