What can we learn from natural and artificial dependency trees

This paper makes two main contributions: first, we introduce several procedures for generating random dependency trees under constraints; second, we use these artificial trees to compare their properties with those of natural trees (i.e. trees extracted from treebanks) and analyze the relationships between these properties in natural and artificial settings, in order to determine which relationships are formally constrained and which are linguistically motivated. We consider five metrics: tree length, height, maximum arity, mean dependency distance and mean flux weight, and also examine the distribution of local configurations of nodes. This analysis is based on UD treebanks (version 2.3, Nivre et al. 2018) for four languages: Chinese, English, French and Japanese.


Introduction
We are interested in the linguistic constraints on syntactic dependency trees, in order to understand what makes certain structures plausible while others are implausible. To do this effectively, we need to observe natural trees (syntactic trees resulting from linguistic analysis) to see what this population looks like. Similar work has been done, for example, by Jiang and Liu (2015) on the relation between sentence length, dependency distance and dependency direction. But observing natural trees alone has its limits: we cannot see what is special about them and their properties, and we cannot disentangle the effects of the various constraints that shape them; we can only observe the structures that result from all these constraints and their interactions. If, on the other hand, we start from a blank canvas, randomly generated trees, and incrementally add constraints, we can study the effects of each constraint one by one and progressively move closer to natural trees. Artificially generated trees can also help determine which constraints are formally motivated (resulting from the mathematical structure of the tree) and which are linguistically or cognitively motivated. Research in the line of Gildea and Temperley (2009), who used random and optimal linearisations to study dependency length and its varying degrees of minimization, can help us discover constraints that explain why we find only a small subset of all potential trees in syntactic analyses of real data.
Our objective is therefore twofold: first, we want to see how different properties of syntactic dependency trees correlate, in particular properties related to syntactic complexity such as height, mean dependency distance and mean flux weight; second, we want to find out whether these properties allow us to distinguish artificial dependency trees (trees generated using random components and constraints) from dependency trees extracted from real data.

Features
In this work we use the following five metrics to analyze the properties of dependency trees: tree length, height, maximum arity, mean dependency distance (MDD) and mean flux weight. We chose these properties because we believe that they all interact with linearization strategies, that is, how words are ordered in sentences, and with the effects of those strategies. Recently, many quantitative works (Futrell et al., 2015; Liu, 2008) have focused on dependency length and its minimization across many natural languages. Complementing these linear properties, we also use "flux weight", a metric proposed by Kahane et al. (2017) which captures the level of nestedness of a syntactic construction (the more nested the construction, the higher its weight in terms of dependency flux). They claim the existence of a universal upper bound on flux weight, having found it to be 5 across 70 treebanks in 50 languages.
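To make these metrics concrete, here is a minimal sketch computing four of the five on a toy tree. The `{node: governor}` dict encoding and the toy sentence are our own illustrative assumptions; flux weight is omitted because it requires finding, at each inter-word position, the largest set of pairwise disjoint dependencies crossing it.

```python
# Sketch of four of the five tree metrics, assuming a tree encoded as
# {node: governor} with the root's governor set to None and nodes numbered
# by linear (word) order.

def tree_length(tree):
    return len(tree)

def height(tree):
    # Depth of the deepest node, counting edges up to the root.
    def depth(node):
        g = tree[node]
        return 0 if g is None else 1 + depth(g)
    return max(depth(n) for n in tree)

def max_arity(tree):
    # Largest number of direct dependents of any governor.
    counts = {}
    for node, gov in tree.items():
        if gov is not None:
            counts[gov] = counts.get(gov, 0) + 1
    return max(counts.values(), default=0)

def mean_dependency_distance(tree):
    # Mean linear distance between each node and its governor.
    dists = [abs(node - gov) for node, gov in tree.items() if gov is not None]
    return sum(dists) / len(dists)

# Toy sentence "The cat sat": sat(2) is the root, cat(1) its subject,
# The(0) depends on cat.
toy = {0: 1, 1: 2, 2: None}
print(tree_length(toy), height(toy), max_arity(toy),
      mean_dependency_distance(toy))   # → 3 2 1 1.0
```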
In addition to these tree-based metrics, we propose to look at local configurations in the linearised dependency trees. To do so, we extract and compare the proportions of all potential configurations of bigrams (two successive nodes) and trigrams (three successive nodes). For bigrams, there are three possible configurations: a -> b, indicating that a and b are linked by a rightward relation; a <- b, indicating that a and b are linked by a leftward relation; and a ♢ b, indicating that a and b are not linked by a dependency. For trigram configurations (a, b, c), the space of possibilities is much larger: we obtain 25 possible configurations. There are projective configurations such as a -> b -> c, (a -> b) & (a -> c), and (a -> c) & (b <- c), but also non-projective ones such as (a <- c) & (b -> c).
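The bigram extraction can be sketched as follows; the `{node: governor}` encoding and the example tree are our own illustrative assumptions, not data from the paper.

```python
# Extracting bigram configurations from a linearised tree, assuming nodes
# are listed in linear order and the tree is given as {node: governor}.
# The three labels follow the text: "->" (rightward relation),
# "<-" (leftward relation), "♢" (no direct relation).

def bigram_configs(order, tree):
    labels = []
    for a, b in zip(order, order[1:]):
        if tree[b] == a:
            labels.append("->")   # a governs b
        elif tree[a] == b:
            labels.append("<-")   # b governs a
        else:
            labels.append("♢")    # a and b not directly linked
    return labels

order = [0, 1, 2, 3]
tree = {0: 1, 1: None, 2: 1, 3: 2}   # hypothetical example tree, root = 1
print(bigram_configs(order, tree))   # → ['<-', '->', '->']
```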

Hypotheses
In this section, we describe some of our hypotheses concerning the relationships between the selected properties. First, we expect tree length to be positively correlated with the other properties: as the number of nodes increases, the number of possible trees increases, including more complex trees with longer dependencies (which would increase MDD) and more nestedness (which would result in a higher mean flux weight). The relationship with maximum arity is less clear, as there could be an upper limit, which would make the relation between these two properties non-linear. We are also particularly interested in the relationship between mean dependency distance and mean flux weight: an increase in nestedness is likely to result in more descendants being placed between a governor and its direct dependents, which would mean an overall increase in mean dependency distance.
For local configurations, we know that in natural trees most dependencies occur between neighbours (see e.g. Liu, 2008), with the proportion varying across languages. It will be interesting to see to what extent this still holds in the different random treebanks, depending on the added constraints.
For trigrams of nodes, we are interested in the distribution of four groups of configurations representing four different linearization strategies: "chain" subtrees, which introduce more height in the dependency tree, with both dependencies pointing in the same direction; "balanced" subtrees, which alternate dependents on both sides of the governor; "zigzag" subtrees, which are similar to chains but with the second dependency pointing in the direction opposite to the first; and "bouquet" subtrees, where the two dependents are linked to the same governor (see examples in Figure A1 in the Appendix). If one group of configurations is preferred in natural trees compared to artificial ones, it could indicate that linguistic and/or cognitive constraints make that configuration more likely to appear. We are also interested in the hypothesis of Temperley (2008), who proposes that languages strongly favoring head-initial or head-final dependencies will still tend to have some short phrases pointing in the opposite direction, which could be a way of limiting dependency distances.
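A hedged sketch of the four-way classification follows. Figure A1 is not reproduced here, so the exact edge patterns below are our reading of the description above, not the paper's definitions; edges are (governor, dependent) pairs among three consecutive nodes a, b, c.

```python
# One possible reading of the chain/balanced/zigzag/bouquet classification,
# assuming the trigram's nodes appear in linear order a, b, c and that
# edges are (governor, dependent) pairs restricted to these three nodes.

def classify(edges):
    e = set(edges)
    if e == {("a", "b"), ("b", "c")} or e == {("c", "b"), ("b", "a")}:
        return "chain"       # both dependencies point the same way
    if e == {("b", "a"), ("b", "c")}:
        return "balanced"    # one dependent on each side of the governor
    if e == {("a", "b"), ("a", "c")} or e == {("c", "a"), ("c", "b")}:
        return "bouquet"     # two dependents on the same side of the governor
    if e == {("a", "c"), ("c", "b")} or e == {("c", "a"), ("a", "b")}:
        return "zigzag"      # chain whose second dependency reverses direction
    return "other"

print(classify({("b", "a"), ("b", "c")}))   # → balanced
```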

Random tree generation with constraints
In this section we look at random dependency tree generation with constraints. We distinguish two steps in the tree generation process: the generation of the unordered structure, and the linearisation of the nodes. Throughout this generation process, we limit ourselves to projective trees. In order to compare the properties of natural and random trees, we use three different tree generation algorithms, which we name as follows: original random (1), original optimal (2) and random random (3). The first algorithm, "original random", samples an unordered dependency structure from a treebank and produces a simple random projective linearisation of its nodes.
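A random projective linearisation of this kind can be sketched as below; the `{node: [dependents]}` encoding and the example tree are our own illustrative assumptions.

```python
import random

# Sketch of a random projective linearisation, assuming the unordered tree
# is given as {node: list of dependents} with a designated root. Each
# dependent's subtree is linearized recursively and placed as a contiguous
# block on a random side of its governor, which guarantees projectivity.

def random_projective(tree, node):
    deps = list(tree.get(node, []))
    random.shuffle(deps)                 # random relative order of dependents
    left, right = [], []
    for d in deps:
        block = random_projective(tree, d)
        (left if random.random() < 0.5 else right).append(block)
    order = []
    for block in left:
        order.extend(block)
    order.append(node)
    for block in right:
        order.extend(block)
    return order

tree = {0: [1, 2, 3], 2: [4, 5]}     # hypothetical unordered structure, root 0
print(random_projective(tree, 0))    # one random projective order of the 6 nodes
```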
The second algorithm, "original optimal", also samples an unordered dependency structure from a treebank, but instead of generating a simple projective linearisation, we add a constraint to minimize dependency distances inside the linearised dependency tree. The idea comes from Temperley (2008): to minimize dependency distances in a projective setting, the dependents of a governor should be linearized alternately on opposing sides of the governor, with the smallest dependents (i.e. those with the smallest number of direct and indirect descendants) linearized first, closest to the governor. Using the same unordered structure as in fig. 1, we describe the procedure below: 1. We start the linearisation at the root. 2. We select its dependent nodes [1, 2, 3] and sort them by increasing number of descendant nodes, which gives us [1, 3, 2]. 3. We select a first direction at random, for example "left", and place these nodes alternating between left and right, which gives us the linearisation steps [0], [10], [103], [2103]. 4. We repeat steps 2 and 3 for each node until every node has been linearized, which gives us for example [425103].
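The steps above can be sketched as follows. The `{node: [dependents]}` encoding is our own assumption, and where the text picks the first direction at random, the sketch takes it as a parameter for reproducibility; with "left" it reproduces the [425103] order from the example.

```python
# Sketch of the dependency-distance-minimizing linearisation: dependents are
# sorted by increasing subtree size and placed on alternating sides of the
# governor, smallest closest, so large subtrees end up outermost.

def subtree_size(tree, node):
    return 1 + sum(subtree_size(tree, d) for d in tree.get(node, []))

def optimal_linearise(tree, node, first_side="left"):
    deps = sorted(tree.get(node, []), key=lambda d: subtree_size(tree, d))
    left, right = [], []
    side = first_side
    for d in deps:
        block = optimal_linearise(tree, d, first_side)
        if side == "left":
            left = block + left      # later (larger) blocks go further out
            side = "right"
        else:
            right = right + block
            side = "left"
    return left + [node] + right

# The example from the text: root 0 with dependents 1, 2, 3; node 2 governs 4 and 5.
tree = {0: [1, 2, 3], 2: [4, 5]}
print(optimal_linearise(tree, 0, "left"))   # → [4, 2, 5, 1, 0, 3]
```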
The third algorithm, "random random", is the only one to implement two random steps: we first generate a completely random structure, then linearize it following the same procedure as in algorithm (1). The unordered structure generation step is described in fig. 2: 1. We start the generation process with a single node. 2. We introduce a new node and randomly draw its governor; since there is only one potential governor so far, the edge has a probability of 1. 3. We introduce a new node and randomly draw its governor; there are now two potential governors, giving a probability of 0.5 of drawing node 0 and the same probability for node 1 (these potential edges are drawn in green on the graph). 4. We repeat this last step until all nodes have been drawn and attached to their governor.
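This structure-generation step amounts to building a random recursive tree, sketched below; the dict encoding and the `seed` parameter are our own illustrative choices.

```python
import random

# Sketch of the "random random" structure-generation step: each new node
# draws its governor uniformly among the nodes already present, so node i's
# governor is drawn from {0, ..., i-1} (a random recursive tree).

def random_structure(n, seed=None):
    rng = random.Random(seed)
    tree = {0: None}                      # node 0 is the root
    for node in range(1, n):
        tree[node] = rng.randrange(node)  # uniform over existing nodes
    return tree

t = random_structure(6, seed=42)
print(t)   # one random unordered structure over 6 nodes
```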
These tree generation algorithms are only some of the many possible algorithms that could be implemented, but they give us tools to analyze how different generation strategies affect the properties of the generated trees as we incorporate more and more constraints into the two generation steps. They are also easily extensible: for example, during the linearisation process we could introduce a probability of creating a head-final edge, to produce trees that more closely resemble those of a head-final language like Japanese. For the unordered structure generation, we could introduce constraints to limit length, arity or height. We need to distinguish constraints that apply during the unordered structure generation step from constraints that concern the linearisation, such as constraints on dependency distances and on flux weights.
One remaining question concerns the ordering of the two steps: unordered structure generation and linearisation. So far we have only implemented the full generation starting with the unordered structure and then moving on to the linearisation, which is a synthesis approach as described in Meaning-Text Theory (Mel'čuk, 1998), but it would be interesting to go in the analysis direction, starting with a sequence of nodes and then randomly producing a structure for it. This would allow us to see how generation algorithms impact the distribution of trees, especially as we add constraints to the generation. We could then see whether one type of random generation (synthesis vs. analysis) produces structures that more closely resemble natural dependency structures, or whether each introduces biases towards certain types of structures.

Correlation between properties
For each pair of properties presented in section 2.1, we measured the Pearson correlation coefficient to determine the extent to which the relationship between the two variables can be linearly captured. We examined these results for the natural treebanks ("original") and the artificial ones ("original random", "random random" and "original optimal"). The full results are presented in tables 1-4 in the appendix, with rankings for the correlations in parentheses.
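The per-treebank measurement can be sketched as below; the property values in the example are made up for illustration, not taken from the paper's tables.

```python
import math

# Sketch of the Pearson correlation coefficient between two per-tree
# property vectors (one value per sentence of a treebank).

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

mdd  = [1.2, 1.5, 2.1, 2.4, 3.0]   # hypothetical mean dependency distances
flux = [1.1, 1.3, 1.9, 2.2, 2.6]   # hypothetical mean flux weights
print(round(pearson(mdd, flux), 3))
```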
Based on these results, we notice that mean dependency distance and mean flux weight are overall the most strongly correlated properties, with values ranging from 0.70 (jp_pud, "original") to 0.95 (fr_partut, "original optimal"). This can be explained by the fact that mean flux weight increases with the number of disjoint dependencies in the flux, which in turn tends to create longer dependencies than structures with few disjoint dependencies. Interestingly, this correlation is intensified in all the artificial treebanks, and is strongest in the "original optimal" version: introducing a dependency distance minimization constraint favours shorter dependencies, which provides fewer opportunities for configurations that introduce disjoint dependencies, and therefore the mean flux weight also decreases.
If we look at the correlation between length and height, we find that it is strong in original structures (0.78) as well as in random ones (0.71 in "random random", the only setting in which the height of the tree is affected by the manipulation). This means that the relationship between these two properties is not motivated by linguistic factors alone: from a mathematical point of view, longer sentences have the potential to introduce more hierarchy, which increases height. Thus, these two properties are correlated regardless of whether the structure is natural or random. Zhang and Liu (2018) have proposed that the relationship between these two properties in natural treebanks of English and Chinese can be described by a power law. Further examination could tell us whether this is also the case for randomly generated trees, or whether the relationship is better modelled by another type of function.
We also find fairly strong correlations between mean dependency distance and height in the artificial treebanks (0.76, 0.79 and 0.72 respectively for "original random", "original optimal" and "random random"), while this correlation is weaker in the natural treebanks (0.46). It is quite interesting that the correlation decreases in the original trees. Our interpretation is that there may be a more complex relationship between height and mean dependency distance in real data that cannot be linearly captured, and that this relationship is altered by the random components used when generating the artificial trees, especially as we relinearize the nodes.

Distribution of configurations
In this section we look at the distribution of local syntactic configurations by extracting trigrams and examining their dependency relations. First we look at the non-linearized configurations, a→b→c and b←a→c, to analyze the differences in local structures between natural and randomly generated trees. Then we analyze the distribution of the four groups presented in section 2.2, and how this distribution is affected by the language and the type of treebank (natural or artificial). We discuss a few points here and present the full results in the appendix.
In fig. 3, we show the distribution of non-linearized configurations for one example language, French. For the "random random" trees, we find 45% of b←a→c configurations and 55% of a→b→c configurations; for all other treebank types, the first configuration is by far the most frequent. This will likely have repercussions on the distribution of linearized configurations.
We also observe that the results are fairly similar across all four languages, with "original optimal" showing the most unequal distribution (80%-20% respectively for b←a→c and a→b→c configurations), followed by "original" and "original random" (around 60%-40%, with some variation depending on the language). One possible explanation for favouring b←a→c is that it helps minimize dependency distances, since it can lead to "balanced" configurations, which are the optimal way to arrange dependents without introducing longer dependencies. If that is the case, we should see a high proportion of "balanced" configurations when we look in more detail at how these configurations are linearized. Another line of explanation is that having too many a→b→c configurations introduces too much height in the trees, which could be a factor of complexity that natural languages avoid whenever possible. Differences between "original optimal", "original random" and "original" can be explained by the linearization process: the optimal trees tend to favour shorter dependencies, which means that a higher percentage of node triplets will be fully connected, while non-optimal trees will sometimes linearize the nodes further apart, thus excluding them from the extraction of triplets. It would be interesting to see whether the distribution is similar when we look at all configurations of triplets and not just local ones.
We then subdivide these configurations according to the classification proposed in section 2.2. Note that the "bouquet" and "balanced" configurations result from b←a→c structures, while a→b→c structures produce either "chain" or "zigzag". We show the distribution for French in fig. 4. First, we comment on the results that are stable across languages: "random random" trees have a slight preference for "chain" and "zigzag", as a result of their slight preference for a→b→c structures, but inside each group ("chain"/"zigzag" and "bouquet"/"balanced") the distribution is evenly divided. The "original optimal" trees have a very marked preference for "balanced", which is to be expected because alternately ordering the dependents of a governor is the preferred strategy for minimizing dependency length; next come "zigzag" configurations, followed by "bouquet" and very few "chain". Contrary to the potential explanation we advanced for the high frequency of b←a→c structures, "balanced" configurations are not particularly frequent in the original trees (23% in Chinese, 14% in English, 21% in French and 27% in Japanese), especially compared to "bouquet" configurations (37%, 52%, 48% and 30% respectively). Bouquet configurations are much more frequent in natural trees than in artificial ones, and we have yet to find a satisfactory explanation for this. Even though some arbitrary choices in the UD annotation scheme inflate the percentage of bouquets (conj, fixed and flat relations are always encoded as bouquets), this does not seem sufficient to explain the difference with the other configurations. We also remark that, if we were to use a scheme with functional heads, most of these "bouquet" configurations would become "zigzag" or "chain", so investigating this could potentially yield an explanation.
For the optimal model, the bouquet is not an optimal strategy for minimizing dependency distances, so the bouquet configuration is naturally less frequent there. Compared to the other languages, Japanese has an interestingly high percentage of "zigzag" configurations. This can be partly explained by the segmentation used in the Japanese treebanks: particles and agglutinated markers (for polarity, aspect, politeness, etc.) are annotated as separate tokens, which often creates many dependents on a single governor. Many of these dependencies fall outside the trigram window and are excluded from our analysis. Japanese being a head-final language, the captured configurations will often contain a head-final dependency (obj, acl, nmod, etc.) and a marker of the dependent, which means they often fall into the "zigzag" bin. Nonetheless, "bouquet" configurations are still quite frequent, as a governor often has several markers, and "balanced" configurations capture nominal modifiers or compounds together with their case or topic marker.

Conclusion
In this paper we introduced several ways to generate artificial syntactic dependency trees and proposed to use those trees as a way of investigating the structural and linguistic constraints on syntactic structures for four different languages. We propose to incrementally add constraints on these artificial trees to observe the effects these constraints produce and how they interact with each other. We limited ourselves to generating projective trees, which we now realize was a very strong constraint that severely restricts the types of structures available, and therefore the variation in the observed properties; it would be interesting to also look at the results when allowing non-projective edges.
To expand on this work, we would also like to see how the observed properties and the relations between them are affected by the annotation scheme, in particular contrasting schemes where content words are governors (as in UD) with schemes where function words are governors (for example the SUD scheme proposed by Gerdes et al. (2018)), as this will have an impact on height, dependency distances, and the types of configurations that can be extracted from the treebanks.
In the present paper, we have looked at local syntactic configurations through the extraction of sequences of nodes (pairs and triplets). However, these configurations are not representative of all configurations inside the trees, as some syntactic relations are more likely to appear in larger configurations. In the future, we plan to look at these larger configurations by extracting subtrees and analyzing their distribution. We also intend to dig deeper into the analysis of the present data, and to propose predictive models that could help clarify the relationships (linear or not) between the different features, in order to build a more solid basis to verify our hypotheses and propose explanations for our observations.