Using Universal Dependencies in cross-linguistic complexity research

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.


Introduction
Analyses of linguistic complexity are gaining ground in different domains of language sciences, such as sociolinguistic typology (Dahl, 2004; Wray and Grace, 2007; Dale and Lupyan, 2012), language learning (Hudson Kam and Newport, 2009; Perfors, 2012; Kempe and Brooks, 2018), and computational linguistics (Brunato et al., 2016). Here are a few examples of the claims that are being made: creole languages are simpler than "old" languages (McWhorter, 2001); languages with high proportions of non-native speakers tend to simplify morphologically (Trudgill, 2011); morphologically rich languages seem to be more difficult to parse (Nivre et al., 2007).
Ideally, strong claims have to be supported by strong empirical evidence, including quantitative evidence. An important caveat is that complexity is notoriously difficult to define and measure, and that there is currently no consensus about how proposed measures themselves can be evaluated and compared.
To overcome this, the first shared task on measuring linguistic complexity was organized in 2018 at the EVOLANG conference in Toruń. Seven teams of researchers contributed a total of 34 measures for 37 pre-defined languages (Berdicevskis and Bentz, 2018). All corpus-based measures had to be obtained using Universal Dependencies (UD) 2.1 corpora (Nivre et al., 2017).
The shared task was unusual in several senses. Most saliently, there was no gold standard against which the results could be compared. Such a benchmark will in fact never be available, since we cannot know what the real values of the constructs we label "linguistic complexity" are.
In this paper, we attempt to evaluate corpus-based measures of linguistic complexity in the absence of a gold standard. We view this as a small step towards exploring how complexity varies across languages and identifying important types of variation that relate to intuitive senses of "linguistic complexity". Our results also indicate to what extent UD in its current form can be used for cross-linguistic studies. Finally, we believe that the methods we suggest in this paper may be relevant not only for complexity, but also for other quantifiable typological parameters. Section 2 describes the shared task and the proposed complexity measures; Section 3 describes the evaluation methods we suggest and the results they yield; Section 4 analyzes whether some of the problems we detect are corpus artefacts that can be eliminated by harmonizing the annotation and/or using parallel treebanks; Section 5 concludes with a discussion.

Data and measures
For the shared task, participants had to measure the complexities of 37 languages (using the "original" UD treebanks, unless indicated otherwise in parentheses): Afrikaans, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Greek, Dutch, English, Estonian, Finnish, French, Galician, Hebrew, Hindi, Hungarian, Italian, Latvian, Norwegian-Bokmål, Norwegian-Nynorsk, Persian, Polish, Portuguese, Romanian, Russian (SynTagRus), Serbian, Slovak, Slovenian, Spanish (Ancora), Swedish, Turkish, Ukrainian, Urdu and Vietnamese. Other languages from the UD 2.1 release were not included because they were represented by a treebank which either was too small (less than 40K tokens), or lacked some levels of annotation, or was suspected (according to the information provided by the UD community) to contain many annotation errors. Ancient languages were not included either. In this paper, we also exclude Galician from consideration since it transpired that its annotation was incomplete.
The participants were free to choose which facet of linguistic complexity they wanted to focus on; the only requirement was to provide a clear definition of what was being measured. This is another peculiarity of the shared task: different participants were measuring different (though often related) constructs.
All corpus-based measures had to be applied to the corpora available in UD 2.1, but participants were free to decide which level of annotation (if any) to use. The corpora were obtained by merging the train, dev and test sets provided in the release.
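Merging the release splits into a single corpus is straightforward. The sketch below assumes the standard UD file naming scheme (`<prefix>-ud-<split>.conllu`, e.g. `en_ewt-ud-train.conllu`) and allows for the fact that some treebanks lack a dev set; the function and parameter names are our own.

```python
from pathlib import Path

def merge_splits(treebank_dir, prefix, out_path):
    """Concatenate the train, dev and test CoNLL-U files of one treebank
    into a single corpus file, in that order."""
    parts = []
    for split in ("train", "dev", "test"):
        f = Path(treebank_dir) / f"{prefix}-ud-{split}.conllu"
        if f.exists():  # small treebanks may ship without a dev set
            parts.append(f.read_text(encoding="utf-8"))
    Path(out_path).write_text("".join(parts), encoding="utf-8")
```

Since CoNLL-U sentences are separated by blank lines and each split file ends in one, plain concatenation yields a well-formed merged corpus.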
From every contribution to the shared task, we selected those UD-based measures that we judged to be most important. Table 1 lists these measures and briefly describes their key properties, including those levels of treebank annotation on which the measures are directly dependent (this information will be important in Section 4). We divide measures into those that gauge morphological complexity and those that gauge syntactic complexity, although these can of course be interdependent.
In Appendix A, we provide the complexity rank of each language according to each measure.
It should be noted that all the measures are in fact gauging complexities of treebanks, not complexities of languages. The main assumption of corpus-based approaches is that the former are reasonable approximations of the latter. It can be questioned whether this is actually the case (one obvious problem is that treebanks may not be representative in terms of genre sample), but in this paper we largely abstract away from this question and focus on testing quantitative approaches.

Evaluation
We evaluate robustness and validity. By robustness we mean that two applications of the same measure to the same corpus of the same language should ideally yield the same results. See Section 3.1 for the operationalization of this desideratum and the results.
To test validity, we rely on the following idea: if we take two languages that we know from qualitative typological research to be very similar to each other (phylogenetic closeness is probably necessary for this, but not sufficient) and compare their complexities, the difference should on average be lower than if we compare two random languages from our sample. For the purposes of this paper we define 'very similar' as 'often claimed to be variants of the same language'. Three language pairs in our sample potentially meet this criterion: Norwegian-Bokmål and Norwegian-Nynorsk; Serbian and Croatian; Hindi and Urdu. For practical reasons, we focus on the first two pairs in this paper (one important problem with Hindi and Urdu is that vowels are not marked in the Urdu UD treebank, which can strongly affect some of the measures, making the languages seem more different than they actually are). Indeed, while there certainly are differences between Norwegian-Bokmål and Norwegian-Nynorsk and between Serbian and Croatian, the members of each pair are structurally very close (Sussex and Cubberley, 2006; Faarlund, Lie and Vannebo, 1997) and we would expect their complexities to be relatively similar. See Section 3.2 for the operationalization of this desideratum and the results.
See Appendix B for data, detailed results and scripts.

Evaluating robustness
For every language, we randomly split its treebank into two parts containing the same number of sentences (the sentences are randomly drawn from anywhere in the corpus; if the total number of sentences is odd, one part contains one extra sentence), then apply the complexity measure of interest to both halves, and repeat the procedure for n iterations (n = 30). We want the measure to yield similar results for the two halves, and we test whether it does by performing a paired t-test on the two samples of n measurements each (some of the samples are not normally distributed, but paired t-tests with sample size 30 are considered robust to non-normality, see Boneau, 1960). We also calculate the effect size (Cohen's d; see Kilgarriff, 2005 on the insufficiency of significance testing in corpus linguistics). We consider the difference significant and non-negligible if p is lower than 0.10 and the absolute value of d is larger than 0.20. Note that our cutoff point for p is higher than the conventional thresholds for significance (0.05 or 0.01), which in our case means a more conservative approach. For d, we use the conventional threshold below which the effect size is typically considered negligible.
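The resampling procedure above can be sketched as follows. Here `measure` stands for any of the shared-task measures (a function over a list of sentences), and instead of a statistics library we use the closed-form paired t statistic together with the critical value t ≈ 1.699 (two-tailed, p = 0.10, df = 29); all names are our own.

```python
import math
import random
from statistics import mean, stdev

def paired_t_and_d(a_vals, b_vals):
    """Paired t statistic and Cohen's d for two matched samples."""
    diffs = [a - b for a, b in zip(a_vals, b_vals)]
    sd = stdev(diffs)
    if sd == 0:
        return 0.0, 0.0
    d = mean(diffs) / sd           # Cohen's d for paired samples
    t = d * math.sqrt(len(diffs))  # t = mean(diff) / (sd / sqrt(n))
    return t, d

def robustness_test(sentences, measure, n_iter=30, seed=0):
    """Split the treebank into two random halves n_iter times, apply
    `measure` to both halves, and flag the result if the paired
    difference is significant (p < 0.10) and non-negligible (|d| > 0.20)."""
    rng = random.Random(seed)
    a_vals, b_vals = [], []
    for _ in range(n_iter):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        half = (len(shuffled) + 1) // 2  # odd counts: one extra sentence
        a_vals.append(measure(shuffled[:half]))
        b_vals.append(measure(shuffled[half:]))
    t, d = paired_t_and_d(a_vals, b_vals)
    t_crit = 1.699  # two-tailed critical value at p = 0.10, df = 29
    return abs(t) > t_crit and abs(d) > 0.20
```

A robust treebank-measure pair is one for which this test returns False in the overwhelming majority of runs.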
We consider the proportion of cases where the difference is significant and non-negligible a measure of non-robustness. See Figure 1 for the non-robustness of treebanks (i.e. the proportion of measures that yielded a significant and non-negligible difference for a given treebank according to the resampling test); see Figure 2 for the corresponding values for measures. The Czech and Dutch treebanks are the least robust according to this measure: resampling yields unwanted differences in 20% of all cases, i.e. for three measures out of 15. Twelve treebanks exhibit non-robustness for two measures, nine for one, and thirteen are fully robust.
It is not entirely clear which factors affect treebank robustness. There is no correlation between non-robustness and treebank size in tokens (Spearman's r = 0.14, S = 6751.6, p = 0.43). It is possible that more heterogeneous treebanks (e.g. those that contain large proportions of both very simple and very complex sentences) are less robust, but heterogeneity is difficult to measure. Note also that the differences are small and may to a large extent be random.
As regards measures, CR_POSP is the least robust, yielding unwanted differences for seven languages out of 36, while TL_SemDist, TL_SemVar and PD_POS_TRI_UNI are fully robust. Interestingly, the average non-robustness of morphological measures (see Table 1) is 0.067, while that of syntactic measures is 0.079 (our sample, however, is neither large nor representative enough for any meaningful estimation of the significance of this difference). A probable reason is that syntactic measures are likely to require larger corpora. Ross (2018: 28-29), for instance, shows that no UD 2.1 corpus is large enough to provide a precise estimate of RO_DEP. The heterogeneity of the propositional content (i.e. genre) can also affect syntactic measures (this has been shown for EH_SYNT, see Ehret, 2017).

Evaluating validity
For every measure, we calculate differences between all possible pairs of languages. Our prediction is that differences between Norwegian-Bokmål and Norwegian-Nynorsk and between Serbian and Croatian will be close to zero or at least lower than average differences. For the purposes of this section, we operationalize lower than average as 'lying below the first (25%) quantile of the distribution of the differences'.
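A minimal sketch of this check, assuming the complexity values for one measure are already computed per language (the function name and the input mapping are our own):

```python
from itertools import combinations
from statistics import quantiles

def below_first_quartile(values, lang_a, lang_b):
    """Check whether the complexity difference between two focus
    languages lies below the first (25%) quantile of the distribution
    of all pairwise differences. `values` maps language -> value."""
    diffs = [abs(values[x] - values[y])
             for x, y in combinations(sorted(values), 2)]
    q1 = quantiles(diffs, n=4)[0]  # first cut point = 25% quantile
    return abs(values[lang_a] - values[lang_b]) < q1
```

The prediction in the text is that this returns True for Norwegian-Bokmål vs. Norwegian-Nynorsk and for Serbian vs. Croatian, for every measure.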
We plot the distributions of differences for these measures, highlighting the differences between Norwegian-Bokmål and Norwegian-Nynorsk and between Serbian and Croatian (see Figure 3).
It should be noted, however, that the UD corpora are not parallel and that the annotation, while meant to be universal, can in fact be quite different for different languages. In the next section, we explore if these two issues may affect our results.

Harmonization and parallelism
The Norwegian-Bokmål and Norwegian-Nynorsk treebanks are of approximately the same size (310K and 301K tokens, respectively) and are not parallel. They were, however, converted by the same team from the same resource (Øvrelid and Hohle, 2016). The annotation is very similar, but Norwegian-Bokmål has some additional features. We harmonize the annotation by eliminating the prominent discrepancies (see Table 2). We ignore the discrepancies that concern very few instances and thus are unlikely to affect our results.
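Feature-level harmonization of this kind amounts to a filter over the FEATS column of the CoNLL-U files. In the sketch below, the set of features to drop is a placeholder argument, since the actual discrepancies are those listed in Table 2; the function name is our own.

```python
def strip_features(conllu_text, features_to_drop):
    """Remove the given morphological features from the FEATS column
    (6th of the 10 CoNLL-U columns), leaving everything else intact."""
    out_lines = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            out_lines.append(line)  # comments and blank lines pass through
            continue
        feats = [f for f in cols[5].split("|")
                 if f != "_" and f.split("=")[0] not in features_to_drop]
        cols[5] = "|".join(feats) if feats else "_"
        out_lines.append("\t".join(cols))
    return "\n".join(out_lines)
```

For example, dropping a feature such as Definite from every token of one treebank removes a feature-inventory difference that would otherwise inflate feature-based complexity measures.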
The Croatian treebank (Agić and Ljubešić, 2015) has richer annotation than the Serbian one (though Serbian has some features that Croatian is missing) and is much bigger (197K and 87K tokens, respectively); the Serbian treebank is parallel to a subcorpus of the Croatian treebank (Samardžić et al., 2017). We created three extra versions of the Croatian treebank: Croatian-parallel (the parallel subcorpus with no changes to the annotation); Croatian-harmonized (the whole corpus with the annotation harmonized as described in Table 3); and Croatian-parallel-harmonized (the parallel subcorpus with the harmonized annotation).
It should be noted that our harmonization (for both language pairs) is based on comparing the stats.xml file included in the UD releases and the papers describing the treebanks (Øvrelid and Hohle, 2016;Agić and Ljubešić, 2015;Samardžić et al., 2017). If there are any subtle differences that do not transpire from these files and papers (e.g. different lemmatization principles), they are not eliminated by our simple conversion.
Using the harmonized version of Norwegian-Bokmål does not affect the difference for CR_POSP (which is unsurprising, given that the harmonization changed only feature annotation, to which this measure is not sensitive).
For Croatian, we report the effect of the three manipulations in Table 4. Using Croatian-parallel solves the problems with CR_TTR, CR_MSP, EH_SYNT, PD_POS_TRI and PD_POS_TRI_UNI. Using Croatian-harmonized and Serbian-harmonized has an almost inverse effect: it solves the problems with CR_MFE, CR_CFEWM and CR_POSP, but not with any other measures. It does, however, strongly diminish the difference for RO_DEP.
Finally, using Croatian-parallel-harmonized and Serbian-harmonized turns out to be the most effective: it solves the problems with all the measures apart from RO_DEP, though the difference becomes smaller for this measure as well. Note that this measure had the biggest original difference (see Section 3.2).
Some numbers are positive, which indicates that the difference increases after the harmonization.
Small changes of this kind (e.g. for CR_MSP, EH_SYNT) are most likely random, since many measures use some kind of random sampling and never yield exactly the same value. The behaviour of EH_MORPH also suggests that the changes are random (this measure cannot be affected by harmonization, so Croatian-harmonized and Croatian-parallel-harmonized should yield similar results). The most surprising result, however, is the big increase for PD_POS_TRI_UNI after harmonization. A possible reason is imperfect harmonization of the POS annotation, which introduced additional variability into POS trigrams. Note, however, that the difference for CR_POSP, which is similar to PD_POS_TRI_UNI, was reduced almost to zero by the same manipulation.
It can be argued that these comparisons are not entirely fair. By removing the unreasonable discrepancies between the languages we are focusing on, but not doing that for all language pairs, we may have introduced a certain bias. Nonetheless, our results should still indicate whether the harmonization and parallelization diminish the differences (though they might overestimate their positive effect).

Discussion
As mentioned in Section 1, some notion of complexity is often used in linguistic theories and analyses, both as an explanandum and an explanans. A useful visualization of many theories that involve the notion of complexity can be obtained, for instance, through The Causal Hypotheses in Evolutionary Linguistics Database (Roberts, 2018). Obviously, we want to be able to measure complexity reliably if such theories are to be tested empirically. In this paper, we leave aside the question of how well we understand what complexity "really" is and focus on how good we are at quantifying it using corpus-based measures (it should be noted that other types of complexity measures exist, e.g. grammar-based measures, with their own strengths and weaknesses).
Our non-robustness metric shows to what extent a given measure or a given treebank can be trusted. Most often, two equal treebank halves yield virtually the same results. For some treebanks and measures, on the other hand, the proportion of cases in which the differences are significant (and large) is relatively high. Interestingly, measures of syntactic complexity seem to be on average less robust in this sense than measures of morphological complexity. This might indicate that language-internal variation of syntactic complexity is greater than language-internal variation of morphological complexity, and larger corpora are necessary for its reliable estimation. In particular, syntactic complexity may be more sensitive to genres, and heterogeneity of genres across and within corpora may affect robustness. It is hardly possible to test this hypothesis with UD 2.1, since detailed genre metadata are not easily available for most treebanks. Yet another possible explanation is that there is generally less agreement between different conceptualizations of what "syntax" is than what "morphology" is.
Our validity metric shows that closely related languages which should yield minimally divergent results can, in fact, diverge considerably. However, this effect can be diminished by using parallel treebanks and harmonizing the UD annotation. The latter result has practical implications for the UD project. While Universal Dependencies are meant to be universal, in practice language-specific solutions are allowed on all levels. This policy has obvious advantages, but as we show, it can inhibit cross-linguistic comparisons. The differences in Table 2 and Table 3 strongly affect some of our measures, but they do not reflect any real structural differences between languages, merely different decisions adopted by treebank developers. For quantitative typologists, it would be desirable to have a truly harmonized (or at least easily harmonizable) version of UD.
The observation that non-parallelism of treebanks also influences the results has further implications for corpus-based typology. Since obtaining parallel treebanks even for all current UD languages is hardly feasible, register and genre variation are important confounds to be aware of. Nonetheless, the Norwegian treebanks, while non-parallel, did not pose any problems for most of the measures. Thus, we can hope that if the corpora are sufficiently large and well-balanced, quantitative measures of typological parameters will still yield reliable results despite the non-parallelism. In general, our results allow for some optimism with regard to quantitative typology in general and using UD in particular. However, both measures and resources have to be evaluated and tested before they are used as a basis for theoretical claims, especially regarding the interpretability of the computational results.