The relation between dependency distance and frequency

This pilot study investigates the relationship between dependency distance and frequency based on an analysis of an English dependency treebank. The preliminary results show that the relation between dependency distance and frequency is non-linear. This relation can be further formalized as a power law function, which can be used to predict the distribution of dependency distances in a treebank.


Introduction
As a well-discussed measure (Hudson, 1995; Temperley, 2007; Futrell et al., 2015; Liu et al., 2017), dependency distance shows several attractive features for quantitative studies. First, its definition is clear: it is the linear distance between a word and its head. Second, it is very easy to quantify. We can simply compute dependency distance as the difference between the word ID and its head's ID in a CoNLL-style treebank (Buchholz & Marsi, 2006). These features, together with the emergence of large-scale dependency treebanks, have made dependency distance one of the popular topics in quantitative syntactic studies.
Among various interesting discussions, the most striking finding is probably the dependency distance minimization phenomenon. After empirically examining the dependency distance distributions of different human languages and comparing the results with different random baselines, Liu (2008, 2010) found that there is a universal trend of minimizing the dependency distance in human languages. Futrell et al. (2015) conducted a similar study which widened the language range and added one more random baseline. Their results are consistent with Liu's finding. Both Liu (2008) and Futrell et al. (2015) connect this phenomenon with the short-term memory (or working memory) storage of human beings and the least effort principle (Zipf, 1949). Since long dependencies occupy more short-term memory storage, they are more difficult or inefficient to process. Therefore, to lower the processing difficulty and boost the efficiency of communication, short dependencies are preferable according to the least effort principle.
Initially, the least effort principle was proposed by Zipf to explain the observed power-law distributions of word frequencies. Later on, similar power-law frequency distributions have been repeatedly observed in various linguistic units, such as letters, phonemes, and word lengths (Altmann & Gerlach, 2016). The power law distribution has therefore been considered a universal linguistic law. Investigations of the relationships between different word features (such as length vs. frequency or frequency vs. polysemy) revealed an interesting phenomenon: the relations between two highly correlated word features are usually non-linear and can be formulated as a power law function (Köhler, 2002). Köhler (1993) further proposed a word synergetic framework to model the interactions between different word features. This model has proved quite successful and was later adapted to syntactic features. The first studies mainly focused on the analysis of phrase structure treebanks (Köhler, 2012), which are naturally limited in language types since phrase structure grammar is less suitable for describing free word order languages (Mel'čuk, 1988). As dependency treebanks are becoming dominant, studies based on dependency grammar are starting to take the lead. Recent studies discuss the relations between sentence lengths, tree heights, tree widths, and mean dependency distances (Jing & Liu, 2017; Zhang & Liu, 2018; Jiang & Liu, 2015).
Knowing that short dependencies are preferred by languages due to the least effort principle and that syntactic features behave similarly to word features, we can easily draw our hypothesis:
• The relation between dependency distance and frequency can be formulated as a non-linear function (probably also a power law function).
Contrary to the above-mentioned studies, our study does not focus on mean dependency distances but on the distribution of the distance of every single dependency. In the dependency minimization studies and the synergetic syntax studies, the observed feature is the mean dependency distance per sentence. In a way, these observed dependency distances are treated as a dependent feature of dependency trees. This is a very reasonable choice since the dependency distance is defined as the linear distance between two words in the same sentence. In particular, when the studies discuss other tree-related features such as tree heights and widths, mean dependency distance is a more easily comparable feature than a group of individual dependency distances. However, we believe the value of individual dependency distances has been neglected. Individual dependency distances (Liu, 2010; Chen & Gerdes, 2017, 2018) provide more detail about the fluctuation than the average, which levels out the differences between the dependencies in a sentence, and they should be given the same attention as the mean dependency distance. Therefore, our study tries to pick up the missing detail of previous studies by investigating the relations between individual dependency distances and their frequencies.
The paper is structured as follows. Section 2 describes the data set, the Parallel Universal Dependencies (PUD) English treebank of the Universal Dependencies treebanks, and introduces our method of computing dependency distance and frequency. Section 3 presents the empirical results and discussions. Finally, Section 4 presents our conclusions.

Material and Methods
Universal Dependencies (UD) is a project developing a cross-linguistically consistent treebank annotation scheme for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of Stanford dependencies (De Marneffe et al., 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages while allowing language-specific extensions when necessary. UD is also an open resource which allows for easy replication and validation of the experiments (all treebank data on its page is fully open and accessible to everyone). For the present paper, we used the PUD English treebank from the UD 2.3 dataset, since English is a reasonable choice for a pilot study. Furthermore, PUD is a parallel treebank covering a wide range of languages, namely Arabic, Chinese, Czech, English, Finnish, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish. This makes PUD a good choice for future studies that test whether our finding generalizes to different human languages. We use the Surface-syntactic UD (SUD) version of the treebank, which is more suitable for studies in distributional dependency syntax as it corrects the artificially long dependency distances of UD into a more standard syntactic analysis based on distributional criteria (Osborne & Gerdes, 2019).
We first compute the dependency distance for every single dependency in the treebank except the root relation. The dependency distance is computed as the absolute difference between the word ID and its head's word ID. For instance, in Figure 1, there are 4 dependencies. We take three of them into account, leaving out the root dependency. The dependency distances of these three dependencies are: abs(1-2) = 1 (for subj), abs(4-2) = 2 (for comp), and abs(3-4) = 1 (for det). After computing all the dependency distances of the treebank, we count the frequency of each dependency distance, i.e. we count how many dependencies with distance 1 occur in the treebank, how many dependencies with distance 2 occur, and so on. We then try to formulate the relation as a non-linear function. We test different non-linear functions to see which one predicts the empirical data best. In particular, we try to see whether our data can be fitted by a power law function. This result can then either confirm or reject our hypothesis.
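The computation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the four-token sentence is hypothetical (it is not the sentence of Figure 1) and was chosen so that its heads reproduce the IDs used in the example (1→2, 2→root, 3→4, 4→2). The function name `dependency_distances` is likewise an assumption, not part of any existing tool.

```python
from collections import Counter

# Hypothetical four-token sentence in CoNLL-U format (tab-separated columns).
# Its heads mirror the example above: 1->2, 2->root, 3->4, 4->2.
SAMPLE = "\n".join([
    "1\tShe\tshe\tPRON\t_\t_\t2\tsubj\t_\t_",
    "2\treads\tread\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tthe\tthe\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tbook\tbook\tNOUN\t_\t_\t2\tcomp\t_\t_",
    "",
])


def dependency_distances(conllu_text):
    """Yield (relation, distance) for every non-root dependency."""
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit() or not cols[6].isdigit():
            continue
        word_id, head_id = int(cols[0]), int(cols[6])
        if head_id == 0:
            continue  # the root relation is excluded
        yield cols[7], abs(word_id - head_id)


distances = list(dependency_distances(SAMPLE))
frequencies = Counter(d for _, d in distances)
print(distances)
print(sorted(frequencies.items()))
```

Keeping the relation label next to each distance makes it possible to filter by relation type later without re-reading the treebank.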
We also introduce two random baselines to see whether we can observe a similar phenomenon in random dependency trees. Based on the PUD English treebank, we generate two random treebanks. For the random treebank RT, we simply reorder the words of each sentence at random. For the random treebank PRT, we randomly reorder the words in a way that keeps the sentence's dependencies projective (non-crossing).
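The two baselines can be illustrated with the following sketch. The text does not specify how the projective reordering is sampled, so the procedure below is an assumption: it uses one common approach in which each head and the subtrees of its dependents are shuffled as whole blocks, so that every subtree stays contiguous. All function names are hypothetical.

```python
import random


def random_order(heads, rng):
    """RT-style baseline: assign every token a random new position.

    heads[i] is the 1-based head ID of token i + 1 (0 marks the root).
    Returns new_pos, where new_pos[i] is the new position of token i + 1.
    """
    positions = list(range(1, len(heads) + 1))
    rng.shuffle(positions)
    return positions


def random_projective_order(heads, rng):
    """PRT-style baseline: random linearization without crossing arcs.

    Each head and its dependents' subtrees are shuffled as whole blocks, so
    every subtree stays contiguous, which guarantees projectivity.
    """
    children = {i: [] for i in range(len(heads) + 1)}
    for tok, head in enumerate(heads, start=1):
        children[head].append(tok)

    def linearize(node):
        blocks = [[node]] + [linearize(child) for child in children[node]]
        rng.shuffle(blocks)
        return [tok for block in blocks for tok in block]

    order = [tok for root in children[0] for tok in linearize(root)]
    new_pos = [0] * len(heads)
    for position, tok in enumerate(order, start=1):
        new_pos[tok - 1] = position
    return new_pos


def distances(heads, new_pos):
    """Dependency distances under a reordering; the root relation is skipped."""
    return [abs(new_pos[tok - 1] - new_pos[head - 1])
            for tok, head in enumerate(heads, start=1) if head != 0]


def is_projective(heads, new_pos):
    """True if no two arcs cross in the given linear order."""
    arcs = [tuple(sorted((new_pos[t - 1], new_pos[h - 1])))
            for t, h in enumerate(heads, start=1) if h != 0]
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)


heads = [2, 0, 4, 2]  # hypothetical tree: 1->2, 2->root, 3->4, 4->2
rng = random.Random(0)
prt = random_projective_order(heads, rng)
print(distances(heads, random_order(heads, rng)))
print(distances(heads, prt), is_projective(heads, prt))
```

In both cases the tree itself is untouched; only the linear positions change, so the resulting distances can be counted exactly as for the original treebank.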

Results and Discussion
The PUD English treebank is part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on multilingual parsing (Zeman et al., 2017). There are 1000 sentences in each language. The sentences are taken from the news domain and from Wikipedia. The PUD English treebank contains 21,176 tokens. See Appendix for the frequencies of dependency distances in the treebank.
The scatter plot in Figure 2 shows that the relationship between dependency distance and frequency is indeed non-linear. Since the observed data points form an L-like shape, we tried to fit the data to four non-linear functions, namely quadratic, exponential, logarithmic, and power law functions. Although there are different ways of measuring the goodness-of-fit (Mačutek & Wimmer, 2013), we choose to use the most common one, the Pearson chi-square goodness-of-fit test, to evaluate the fitting results in this study. The formula of the test is defined as

$$\chi^2 = \sum_{i=1}^{n} \frac{(f_i - N P_i)^2}{N P_i} \qquad (1)$$

with $f_i$ being the observed frequency of the value $i$, $P_i$ being the expected probability of the value $i$, $n$ being the number of different data values, and $N$ being the sample size. The obtained R-squared results are presented in Table 1.
The results show that the observed data can indeed be formulated as a power law function. However, the data also fits an exponential regression very well. This is a very common issue in quantitative linguistic studies (Baixeries et al., 2013). In many situations, both exponential and power-law models can describe the data fluctuation reasonably well. One way to decide which model is better is to add more observations from other languages. However, this is beyond the scope of this pilot study. Another solution is to introduce baselines for comparison, which is our choice in this paper. By comparing the results in Table 1 with the results of two different random treebanks, we try to answer the question of which model better represents the relation between dependency distance and frequency: exponential or power law?
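To make the model comparison concrete, the sketch below fits a power law and an exponential function and computes R-squared together with the Pearson statistic defined above. It rests on several assumptions: the fitting procedure used in the study is not stated, so the sketch swaps in ordinary least squares on log-transformed values, which is a simple stand-in and not necessarily the authors' method; the data are synthetic and generated from an exact power law, not the PUD counts; and all function names are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_power_law(dist, freq):
    """Fit freq = a * dist**b by linear regression on log-log values."""
    intercept, b = linear_fit([math.log(d) for d in dist],
                              [math.log(f) for f in freq])
    a = math.exp(intercept)
    return lambda d: a * d ** b, (a, b)


def fit_exponential(dist, freq):
    """Fit freq = a * exp(b * dist) by linear regression on log(freq)."""
    intercept, b = linear_fit(list(dist), [math.log(f) for f in freq])
    a = math.exp(intercept)
    return lambda d: a * math.exp(b * d), (a, b)


def r_squared(observed, predicted):
    """Coefficient of determination on the original (untransformed) scale."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def pearson_chi_square(observed, probabilities):
    """Pearson statistic: sum over i of (f_i - N * P_i)**2 / (N * P_i)."""
    total = sum(observed)
    return sum((f - total * p) ** 2 / (total * p)
               for f, p in zip(observed, probabilities))


# Synthetic frequencies generated from an exact power law (NOT the PUD counts).
dist = list(range(1, 21))
freq = [1000 * d ** -1.5 for d in dist]

for name, fit in (("power law", fit_power_law), ("exponential", fit_exponential)):
    model, params = fit(dist, freq)
    predicted = [model(d) for d in dist]
    print(name, params, round(r_squared(freq, predicted), 4))
```

Fitting on log-transformed values weights small frequencies more heavily than a direct non-linear least-squares fit would, so the resulting R-squared values may differ from those of a dedicated curve-fitting routine.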
For the two random variations of the English PUD treebank, RT and PRT, we replicate the same computation of frequency and dependency distance (see the Appendix). The scatter plots in Figures 3 and 4 show that the relations between dependency distance and frequency in RT and PRT are both non-linear. As before, we fit the data points to four non-linear models; see Tables 2 and 3 for the results. We can see from Table 2 that RT fits all non-linear models very well except the power law function. This is very different from the PUD English treebank, which fits the power law very well but does not fit the quadratic and exponential models. When we add the projectivity restriction, the fitting results of PRT seem more 'human language' like. Similar to the results of PUD in Table 1, PRT fits the exponential and power-law models better. However, the power law fitting result of PUD is clearly more satisfying than the result of PRT.
Beyond considering the projectivity feature of dependency trees, which deals with the crossing problem, we would also like to have a closer look at the role of syntax in this question. Our way of addressing this is to exclude less syntactic dependencies from the analysis. The UD/SUD annotation scheme includes predefined dependency structures for some constructions, in particular for multiword expressions (MWE) and punctuation. The distances of relations such as fixed, compound, flat, and punct are not based on distributional criteria of the tokens involved. Therefore, we also tested the results when these dependencies are excluded from our analysis, taking into account only syntactic dependencies (subj, aux, cop, case, mark, cc, dislocated, vocative, expl, discourse, det, clf). See the Appendix for the details. We first tested these three data sets with a linear regression model, and the results are similar to the previous analysis (PUD R² = 0.21, RT R² = 0.77, PRT R² = 0.34).
We then repeated the same non-linear regression analysis on these three selected data sets and the results are presented in Table 4.
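The restriction to syntactic relations described above amounts to a simple filter over labelled dependencies. The following sketch assumes that the dependencies are available as (relation, distance) pairs and that relation subtypes such as "subj:pass" should be matched on the part before the colon; both points are assumptions, and the sample pairs are hypothetical rather than taken from the treebank.

```python
from collections import Counter

# Relation labels listed in the text as syntactic dependencies.
SYNTACTIC = {"subj", "aux", "cop", "case", "mark", "cc", "dislocated",
             "vocative", "expl", "discourse", "det", "clf"}


def syntactic_frequencies(dependencies, keep=SYNTACTIC):
    """Count distances of the kept relations only.

    dependencies is an iterable of (relation, distance) pairs. Subtypes such
    as "subj:pass" are matched on the part before the colon (an assumption).
    """
    return Counter(dist for rel, dist in dependencies
                   if rel.split(":")[0] in keep)


# Hypothetical (relation, distance) pairs, not taken from the treebank.
sample = [("subj", 1), ("det", 1), ("punct", 5), ("fixed", 1), ("case", 2)]
print(sorted(syntactic_frequencies(sample).items()))
```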

Table 4. Syntactic data set: non-linear regression results.

Very similar to the results of the previous analysis, PRT is closer to the PUD English results. However, the results with syntactic dependencies demonstrate more clearly that a power law model is the better choice for representing the relation between dependency distance and frequency. First, the original PUD data fits the power law function best, whereas in the previous analysis we could not easily draw such a conclusion due to the very similar R² values for the power law and exponential models. Second, the goodness of the power law fit can, to some extent, distinguish the natural PUD data from the random baselines.

Conclusion
Our results are consistent with our hypothesis that there is indeed a non-linear relation between dependency distance and frequency. Furthermore, this relation can be formulated as a power law function.
However, the results in Table 1 show that the power-law model is not the only candidate for formulating the relation, and we could also apply an exponential model to it. To find out which model is better for representing the relation, we introduce two random baselines. By randomly reordering the words in a sentence while preserving the words' dependencies, we generate random treebanks: PRT with and RT without the projectivity restriction, where PRT possesses a more 'natural' structure that reproduces more closely the rarity of non-projective relations. We replicate the same analysis on these two random treebanks and compare the results with the PUD results. We found that we can distinguish PUD from RT and PRT by looking at the results of the power-law fitting. Therefore, we cautiously conclude that the power law model is probably a better choice for representing the relation between dependency distance and frequency, a hypothesis that is further strengthened by the results on purely syntactic dependency relations.

Figure 5a: All functions. Figure 5b: Syntactic functions only.

Another interesting phenomenon we can observe in our data is that the projective random data set fits a power law function almost as well as the syntactically parsed true treebank. Although we need more samples to test the statistical significance of the difference, it seems that if we compare the natural PUD and the control PRT on the most relevant "syntactic functions only" data, for example in the logarithmic presentation of Figure 5b, there is practically no difference between the linearity of PRT and PUD. This suggests that projectivity plays a major role in producing the power-law relation between dependency distance and frequency. Of course, our conclusion based on this pilot study needs to be tested with more languages in the future. This leads to the open question of pinpointing the additional syntactic constraint of PUD, compared to random treebanks, that results in the power law distribution.
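The reason why linearity in a logarithmic plot points to a power law can be made explicit by taking logarithms of both candidate models. The symbols below are assumptions introduced for illustration: f(d) denotes the frequency of dependency distance d, and a and b are positive fitted constants.

```latex
\begin{align*}
\text{power law:}   \quad f(d) &= a\, d^{-b}  &&\Longrightarrow& \log f(d) &= \log a - b \log d \\
\text{exponential:} \quad f(d) &= a\, e^{-b d} &&\Longrightarrow& \log f(d) &= \log a - b\, d
\end{align*}
```

A power law is therefore a straight line with slope -b when both axes are logarithmic, whereas an exponential relation is a straight line only when the distance axis is left linear and bends downward in a log-log plot.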
We believe the result presented here has several potential applications. We can use the power law model to predict the distribution of dependency distances in a treebank. Since natural language treebanks fit the power law model better than random treebanks, we might even use the goodness of fit as an index for assessing the quality of parse results.