Some Languages Seem Easier to Parse Because Their Treebanks Leak

Cross-language differences in dependency parsing performance are mostly attributed to treebank size, average sentence length, average dependency length, morphological complexity, and domain differences. In this paper I point to a factor not previously discussed: If we abstract away from words and dependency labels, how many graphs in the test data were seen in the training data? I discuss how to compute graph isomorphisms, and show that, treebank size aside, overlap between training and test graphs explains more of the observed variation than the standard explanations listed above.


Introduction
The state of the art in dependency parsing varies a lot across languages: on Polish, the best system in the CoNLL 2018 shared task achieved a labeled attachment score of 94.9% on held-out data; on Basque, the same number was 19.5%. Just a few years ago, a major source of variation was the complexity of the annotation schemes used in the different treebanks; with the Universal Dependencies project,[1] treebanks now follow the same annotation guidelines, but these performance differences nevertheless persist.[2] Differences are typically attributed to training set size (Vania et al., 2019), linguistic variation (Nivre et al., 2007), sentence length or average gold dependency length (in the test data) (McDonald and Nivre, 2011), and domain differences between training and test data (Foster et al., 2011). Training set size is undoubtedly a very strong predictor of parsing performance, but in this paper, overlap between unlabeled graphs in the training and test sections of a treebank is shown to be more predictive than any of the other factors. Specifically, we compute equivalence classes over unlabeled dependency graphs, directed or undirected, and compute the ratio of trees in the treebanks' test sections that are isomorphic to graphs observed in the training section, i.e., the graph-level train-test leakage, and correlate this number with state-of-the-art performance numbers across languages. To the best of our knowledge, no one has previously considered this predictor of parsing performance, and we show that it is more predictive than the factors previously discussed in the literature.

[1] https://universaldependencies.org/
[2] While Universal Dependencies have made the available dependency treebanks more compatible, treebanks were of course developed using very different protocols; some are automatically or semi-automatically converted from other formalisms, others written with the Universal Dependencies guidelines in mind; some, again, were developed by big teams, some by a single person. While protocol is hard to isolate and study, and while protocol may correlate either positively or negatively with parsing performance (it is easy to imagine a poorly designed treebank that is easy to parse), protocol likely has a significant downstream effect on performance; this means we can only hope to explain some of the variance in the experiments below.

Figure 1: Isomorphic examples from UD_English-Pronouns. Left: Yours drove responsibly. Right: It is hers. The two sentences are associated with the same unlabeled directed graph.

Contribution
We present a way to quantify graph-level train-test leakage and an empirical evaluation of it across parsing results for 45 languages; we show that, next to treebank size, graph-level train-test leakage is a better predictor of parsing performance than any of the factors previously considered.

Unlabeled Graph Isomorphisms
Exact graph isomorphism is in NP, but it remains an open problem whether it is NP-complete or in P. We use the VF2 algorithm (Cordella et al., 2001), which is known to be fast in practice and to have low memory requirements (Foggia et al., 2001). The algorithm proceeds by iteratively expanding a candidate subgraph isomorphism until the procedure fails, or until the mapping covers the input graphs. We compute isomorphisms over dependency trees in the training set by first reducing each tree to a more abstract graph. In our experiments below, we consider two such reductions: to undirected, unlabeled graphs (UUGs; removing labels and edge directions) and to directed, unlabeled graphs (DUGs; removing only labels). Once we have computed the isomorphisms, we count how many of the dependency trees in the test data are members of one of these equivalence classes, and report the fraction of test dependency trees that are isomorphic to at least one dependency tree in the training data. This number can be seen as a metric of graph-level train-test leakage. See Figure 2 for the top five most leaking treebanks in the Universal Dependencies project (version 2.5); the worst has only 3/100 unseen test graphs. In the Appendix, we report the full set of results with UUGs, both for exact computation of isomorphisms with VF2 and for a heuristic that simply matches sets of edge degrees.
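
A minimal sketch of this computation, assuming dependency trees are represented as lists of head indices (heads[i] is the head of token i+1, with 0 for the root), as in CoNLL-U; the function names and the bucketing are our own illustration. networkx's is_isomorphic implements VF2, and matching on the degree-sequence invariant alone roughly corresponds to the edge-degree heuristic reported in the Appendix.

```python
# Sketch of graph-level train-test leakage; illustrative, not the released code.
from collections import defaultdict
import networkx as nx


def tree_to_graph(heads, directed=True):
    """Reduce a dependency tree to an unlabeled (di)graph: DUG or UUG."""
    G = nx.DiGraph() if directed else nx.Graph()
    for dep, head in enumerate(heads, start=1):
        G.add_edge(head, dep)  # edge from head to dependent; 0 is the root
    return G


def invariant(G):
    """Cheap isomorphism invariant: size plus sorted degree sequence.
    Matching on this alone gives a degree-based heuristic; it is also a
    sound pre-filter before running exact VF2 within each bucket."""
    return (G.number_of_nodes(), tuple(sorted(d for _, d in G.degree())))


def leakage(train_trees, test_trees, directed=True, exact=True):
    """Fraction of test trees isomorphic to at least one training tree."""
    buckets = defaultdict(list)
    for heads in train_trees:
        G = tree_to_graph(heads, directed)
        buckets[invariant(G)].append(G)
    hits = 0
    for heads in test_trees:
        G = tree_to_graph(heads, directed)
        candidates = buckets.get(invariant(G), [])
        if candidates and (not exact or
                           any(nx.is_isomorphic(G, H) for H in candidates)):
            hits += 1
    return hits / len(test_trees)
```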

Usual Suspects
We briefly discuss other factors assumed to be predictive of the performance of dependency parsers.
Treebank size It is trivially true that parser performance depends on treebank size, and it is unsurprising that the correlation is strong. Obviously, if the treebank does not contain any training data, supervised parsers have to resort to blind guessing, and the more data they see, the less variance they have to resolve. That said, it is well established that increasing the size of a treebank often comes with diminishing returns (Sagae et al., 2008). Since treebank size is nevertheless trivially related to parsing performance, we correlate all other factors φ in combination with treebank size (see §4).

Morphology Previous work has pointed to morphology as a source of lower parsing performance (Tsarfaty et al., 2013; Coltekin and Rama, 2018). In languages with rich morphology, many relations that are expressed by word order and adjacency in languages like English are instead encoded in morphological affixes, which requires subword-level processing to detect (in the tail). Expressing functional information morphologically also allows for a high degree of word-order variation. In our experiments, we use the most predictive morphological feature in WALS[3] and impute the missing values.

[3] https://wals.info/
Sentence length Parser performance unsurprisingly also depends on input length, i.e., on the size of the search space of possible parses (McDonald and Nivre, 2011). This, for example, is why unsupervised dependency parsing has successfully relied on baby-steps training (Spitkovsky et al., 2010). We correlate state-of-the-art parser performance with training set size and average test sentence length.

Graph properties McDonald and Nivre (2011) discuss graph properties that seem to correlate with parsing performance. We include average dependency length in our experiments below, which we compute by simply dividing the total length of all dependencies in the test section by the number of word tokens in it; see the sketch below.
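
A small sketch of this quantity, under the same head-index representation as in the leakage sketch above; whether the artificial root attachment counts toward the total is our own assumption, made explicit in a comment.

```python
def avg_dependency_length(trees):
    """Total dependency length over total word tokens in a test section."""
    total_len = sum(abs(head - dep)
                    for heads in trees
                    for dep, head in enumerate(heads, start=1)
                    if head != 0)  # assumption: skip the root attachment
    total_tokens = sum(len(heads) for heads in trees)
    return total_len / total_tokens
```
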
Open class ratio Nivre and Fang (2017) argue that open word classes (especially nouns and verbs) tend to be harder to attach than other parts of speech, and that languages with many open-class tokens will therefore be harder to parse. We therefore evaluate the ratio of open-class word tokens in the test data as a factor.
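
A sketch of this ratio, assuming test sentences come with UD UPOS tags; UD defines ADJ, ADV, INTJ, NOUN, PROPN, and VERB as its open classes. Whether to restrict to the classes Nivre and Fang highlight (nouns and verbs) is a choice; this sketch counts all open classes.

```python
# UD's open word classes, per the Universal Dependencies POS guidelines.
OPEN_CLASSES = {"ADJ", "ADV", "INTJ", "NOUN", "PROPN", "VERB"}


def open_class_ratio(tagged_sentences):
    """Fraction of word tokens carrying an open-class UPOS tag."""
    tags = [t for sent in tagged_sentences for t in sent]
    return sum(t in OPEN_CLASSES for t in tags) / len(tags)
```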

Empirical Comparison of Factors
We correlate the factors φ assumed to influence syntactic dependency parser performance with state-of-the-art performance figures from the CoNLL 2018 shared task, i.e., the performance of the best-performing system per language. See the Appendix for the full statistics. While computing Pearson's ρ coefficients is standard methodology for validating performance metrics (Lin, 2004; Miculicich Werlen and Popescu-Belis, 2017) and has also been used to evaluate factors predicting system performance (Martin and Foltz, 2004; Søgaard and Haulrich, 2010), this is inadequate in our case: many factors potentially covary, and we are, for example, not interested in factors that correlate strongly with treebank size, e.g., out-of-vocabulary rate or type-token ratio (Kettunen, 2014). Instead, we compute the explained variance and mean absolute error of a linear regression model with treebank size and φ as input, i.e., a·t_s + b·φ + c, where t_s is the treebank size and a, b, c are learned parameters. We report explained variance and mean absolute error from three-fold cross-validation experiments to avoid overfitting; a sketch of this setup is given at the end of this section. We make our code publicly available.

Results Our main results are presented in Table 1. Treebank size correlates strongly with parser performance; see the plot in Figure 3 (Left). Morphological complexity and open class ratio are not very predictive: neither correlates strongly with parser performance, and in combination with treebank size, they do not seem to add much predictive power. A-distance (a measure of domain difference between training and test data) correlates strongly with parsing performance; the explained variance improves a little, and the error decreases a bit. Average dependency length is only weakly, negatively correlated with parsing performance (ρ ∼ −0.067), a result that is not significant; the absolute error of the linear regression model decreases only a little from adding the feature, and the explained variance improves to 0.05. Sentence length, perhaps unsurprisingly, correlates more strongly with parsing performance, and the explained variance of our linear regression model increases substantially from adding this feature. Graph-level train-test leakage, however, is more predictive of parsing performance than any of these factors; see the corresponding plot over DUG-level train-test leakage in Figure 3 (Right). It also leads to much better performance of our linear regression model, both in terms of explained variance and mean absolute error. We note that using DUGs to compute the isomorphisms is slightly more predictive than relying on undirected graphs.
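
A minimal sketch of this experimental setup, assuming per-treebank arrays of training-set sizes, factor values φ, and best CoNLL 2018 LAS scores; the variable names are illustrative, and scikit-learn stands in for whatever the released code actually uses.

```python
# Sketch: explained variance and MAE of las ~ a*size + b*phi + c, 3-fold CV.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate


def evaluate_factor(treebank_size, phi, las):
    """Score one factor phi jointly with treebank size, as in the text."""
    X = np.column_stack([treebank_size, phi])
    scores = cross_validate(
        LinearRegression(), X, las, cv=3,
        scoring=("explained_variance", "neg_mean_absolute_error"),
    )
    return (scores["test_explained_variance"].mean(),
            -scores["test_neg_mean_absolute_error"].mean())
```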

Related work
The factors evaluated above, from Nivre et al. (2007), Van Asch and Daelemans (2010), McDonald and Nivre (2011), Nivre and Fang (2017), Coltekin and Rama (2018), and Berdicevskis et al. (2018), were already discussed. A few other factors have been pointed to in the literature that were not applicable to our experiments: Søgaard and Haulrich (2010) show that the perplexity of the derivation orders of a transition-based dependency parser is also predictive of parser performance.
They report Pearson's ρ scores that are considerably higher than those we found. Their study suffers from two biases, though: one imposed by the transition-based parser and the other imposed by the language model used to calculate the perplexity. Moreover, the results they report are for only the non-converted dependency treebanks in the CoNLL 2006 (Buchholz and Marsi, 2006) and CoNLL 2007 (Nivre et al., 2007) treebank releases. These treebanks form a very small set, providing limited statistical support, and, moreover, rely on very different linguistic formalisms and annotation guidelines, leading to very different levels of derivational complexity. In other words, a comparison would be inconclusive because of the free parameters imposed by the language model and the transition oracle, and because no code is publicly available. Also, their high correlation scores are unlikely to transfer to Universal Dependencies.

Discussion and Conclusion
This paper suggested a factor contributing to variance in (universal) dependency parser performance across languages: graph-level train-test leakage in treebanks. This form of leakage can be quantified by computing graph isomorphisms over the training sections and counting the ratio of trees in the test sections that are isomorphic to some tree in the training data. I compared this factor to previous attempts to explain variance in parser performance across languages through a series of correlation and linear regression experiments, and showed that graph-level train-test leakage, treebank size aside, is the most predictive factor among those proposed, yet complementary to them. The result is perhaps not too surprising, since graph isomorphisms correlate with syntactic constructions, which in turn correlate with the occurrence of linguistic markers and tail linguistic phenomena.

The observation that treebanks leak, quite dramatically, at the graph level is not only interesting for explaining variance in parser performance. It also suggests a new and improved evaluation methodology: since language is Zipfian not only at the level of words but also at the level of phrases (Ha et al., 2002; Williams et al., 2015), standard evaluation methodology relying on random samples (Gorman and Bedrick, 2019; Dodge et al., 2019) is biased toward frequent phenomena. Evaluating only on non-isomorphic trees, i.e., leaving out graphs that have been seen at training time from the test sections of treebanks, would reduce this bias; a sketch of such a filter is given below. We hope this is a factor that designers of future syntactic treebanks will take into account. It is an open question whether graph-level train-test leakage is predictive of performance in other sentence-level NLP tasks, i.e., whether the ratio of test sentences whose (predicted) syntactic dependency structure is identical to that of one of our training examples correlates with state-of-the-art performance.
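
One possible realization of the suggested protocol, kept self-contained; the representation and helper mirror the leakage sketch in Section 2, and the function name is our own.

```python
# Sketch: drop test trees whose unlabeled graph was seen in training.
import networkx as nx


def tree_to_graph(heads):
    """Unlabeled directed graph from head indices (0 denotes the root)."""
    G = nx.DiGraph()
    for dep, head in enumerate(heads, start=1):
        G.add_edge(head, dep)
    return G


def unseen_only(train_trees, test_trees):
    """Keep only test trees not isomorphic to any training tree."""
    train_graphs = [tree_to_graph(h) for h in train_trees]
    return [heads for heads in test_trees
            if not any(nx.is_isomorphic(tree_to_graph(heads), G)
                       for G in train_graphs)]
```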