Are formal restrictions on crossing dependencies epiphenomenal?

Characterizing the distribution of crossing dependencies in natural language dependency trees is a crucial task for building parsers and understanding the formal properties of human language. A number of formal restrictions on crossing dependencies have been proposed, including bounds on gap degree, edge degree, and end-point crossings. Here we ask whether the empirical distribution of crossing dependencies in dependency treebanks offers evidence for these formal restrictions as true, independent constraints on dependency trees, or whether the distribution can be explained using other, more generic constraints affecting dependency trees. Specifically, we explore the null hypothesis that crossing dependencies are formally unrestricted, but occur at a low rate. We implement the null hypothesis using random trees where crossing dependencies occur at the same rate as in natural language trees, but without any formal restrictions. We find that this baseline generally does not reproduce the same distribution of gap degree, edge degree, end-point-crossing, and heads’ depth difference as real trees, suggesting that these formal constraints are a consequence of factors beyond the rate of crossing dependencies alone.


Introduction
In dependency grammar formalisms, the syntactic structure of a sentence is encoded in the form of head-dependent relations. For the most part, the dependents of a given head form a contiguous substring of the sentence, i.e., all the nodes occurring between the head and its dependent are (transitively) dominated by the head. Such dependencies have been termed projective. In addition to projective dependencies, we also find instances where the dependents of a head are discontinuous. This happens when a node in the span of a head and its dependents is not (directly or indirectly) dominated by the head. Such dependencies are known as crossing or non-projective dependencies. Formally, a dependency X h → X d is deemed crossing if and only if there is at least one node X i between X h and X d that X h does not dominate. In Figure 1 the dependency arc from the node X h to its dependent X d is crossing because X i is headed by a node (X j) which is outside the span of X h → X d. Note that all other arcs in the dependency tree shown in Figure 1 are projective. For example, the arc X j → X i is a projective arc as X h is dominated by X j.
Figure 1: The dependency arc X h → X d is a crossing dependency. All other arcs are non-crossing. (Nodes in linear order: X d, X i, X h, X j.)
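The formal definition of a crossing dependency can be checked mechanically. The following is a minimal sketch, assuming a tree is encoded as a list `heads` in which `heads[i]` is the linear position of the head of word `i` and the root has head `-1`. This encoding and the function names are illustrative assumptions, not part of the paper's released code. The example tree is one encoding consistent with the description of Figure 1.

```python
def dominates(heads, ancestor, node):
    """Return True if `ancestor` is `node` itself or a transitive head of `node`."""
    while node != -1:
        if node == ancestor:
            return True
        node = heads[node]
    return False


def is_crossing(heads, dep):
    """The arc heads[dep] -> dep is crossing iff some node strictly between
    head and dependent is not dominated by the head. `dep` must not be the root."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return any(not dominates(heads, head, k) for k in range(lo + 1, hi))


def crossing_arcs(heads):
    """List all crossing arcs as (head, dependent) pairs."""
    return [(h, d) for d, h in enumerate(heads)
            if h != -1 and is_crossing(heads, d)]


# Assumed encoding of Figure 1, linear order X d, X i, X h, X j:
# X h -> X d, X j -> X i, X j -> X h, and X j is the root.
figure1 = [2, 3, 3, -1]
print(crossing_arcs(figure1))  # [(2, 0)], i.e. only X h -> X d crosses
```

The quadratic-time check is chosen for readability; for the short sentences considered in this study it is more than fast enough.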
The most basic cross-linguistic generalization about crossing dependencies is that they are rare (see e.g. Straka et al., 2015). The rarity of crossing dependencies poses several interesting questions that are relevant from formal, computational, and cognitive perspectives. Most fundamentally, why are these
Figure 2: (a) Projection chains (nodes in linear order: X k, X i, X g, X d, X h, X j); (b) Gap degree (nodes in linear order: X g, X k, X d, X i, X h, X j). The projection chain of a node X is the set of all the nodes dominated by X which lie in a single path from X to a terminal node. For example, in the dependency tree (a), {X j, X h, X d, X g} and {X j, X i, X k} are two projection chains from the node X j. A projection chain is continuous if it forms a continuous substring of the sentence. For example, the projection chain of X h, i.e., {X h, X d, X g}, is a continuous substring of the sentence {X k, X i, X g, X d, X h, X j}. The dependency tree (b) shows a dependency schema to illustrate gap degree. The gap degree of a node is the largest number of discontinuities in any projection chain. In (b), the projection chain for X h is {X h, X d, X g}, which contains 2 discontinuities or gaps, so the gap degree of node X h is 2.
Crossing constraints are important in two domains: in the development of computational parsers, and in theoretical formal syntax, because these restrictions correspond to the formal language class of natural language. Crossing dependencies indicate deviations from context-free grammar (Marcus, 1965; Shieber, 1985). More specifically, the hierarchy of mildly context-sensitive languages is defined by restrictions on gap degree. Gap degree corresponds to the number of components in a Multiple Context-Free Grammar (Seki et al., 1991) and to the number of distinct selector features in Minimalist Grammars (Michaelis, 1998). It also corresponds to the 'limited amount of cross-serial dependencies' allowed in TAG derivations (Joshi, 1985; see also Bodirsky et al., 2005). In the computational linguistics literature it is common to provide statistics showing that there are only a small number of dependency trees violating any given crossing constraint. For example, Kuhlmann (2013) shows that as gap degree increases, there are fewer and fewer trees per language with that gap degree.
These proposals across the theoretical syntax and parsing literature raise the possibility that crossing constraints might constitute independent, causal constraints on natural language syntax. However, it is also possible that the observed distribution of crossing dependencies may be epiphenomenal, i.e., a consequence of other constraints affecting dependency trees which have nothing to do with crossing dependencies themselves, such as a general pressure to minimize dependency length (e.g., as investigated in Ferrer-i-Cancho and Gómez-Rodríguez, 2016; Gómez-Rodríguez and Ferrer-i-Cancho, 2017). In this paper, we investigate the status of crossing constraints using dependency corpora, asking whether the empirical distribution of crossing dependencies gives evidence for crossing constraints, or whether the data is best explained by an extremely simple null hypothesis: that crossing dependencies are formally unrestricted but simply rare.
As an example of how crossing constraints might be epiphenomenal, consider gap degree. Gap degree refers to the number of discontinuities in the projection chain headed by a node (see Figure 2). So, for example, if the longest projection chain in a sentence is of length 6, then gap degree cannot exceed 5. Now suppose that linguistic dependency trees typically have short projection chains and that crossing dependencies are rare but randomly distributed across dependency trees. Then it is unlikely that we will observe a projection chain with many discontinuities, simply because projection chains are usually short; so we will measure low gap degree. From this measurement, we might falsely conclude that there exists a bound on gap degree. These considerations suggest that gap degree might not have a causal role as a restriction on crossing dependencies, but rather emerges as a result of the rarity of crossing dependencies plus low tree depth.
In this work, we evaluate a number of crossing constraints to determine if dependency corpora give evidence for them as true independent constraints. Our null hypothesis is that crossing dependencies are formally unrestricted, but occur at a certain low rate per dependency arc. The alternative to the null hypothesis is the true constraint hypothesis (TCH), which is that there is a real dispreference for crossing dependencies violating that specific constraint, arising from grammar or cognitive pressures.
We compare the TCH against the null hypothesis by comparing natural language dependency trees with randomly generated baseline trees. The baseline trees simulate the null hypothesis: they consist of randomly generated trees where crossing dependencies have been inserted randomly at the same overall rate per dependency as in the real trees, but with no formal restrictions (more on this in Section 3.2). If the distribution of gap degree, edge degree, etc., in random baseline trees is indistinguishable from real language trees, then we cannot reject the null hypothesis: in that case dependency corpora would not show evidence for the TCH. On the other hand, if a formal measure like gap degree is minimized in observed data over the random baseline, then this is evidence against the null hypothesis and for the TCH.
Our paper is organized as follows. In Section 2, we review the crossing constraints that we will test. In Section 3, we discuss the natural language dataset and the random baselines. We present the results in Section 4. Section 5 concludes.

Measures
In order to test the TCH, we compare the distributions of violations of crossing constraints in random baseline trees vs. real language trees. Below we discuss the crossing constraints used in our investigation. In addition, we also discuss the properties of the dependency tree that are used in our comparison of real vs. random trees. In particular, we will be testing whether the correlation between these dependency tree properties and crossing constraint violations is the same in real vs. random trees.

Crossing Constraints
Gap degree: The gap degree of a node X is the number of discontinuities in the projection of node X. For example, in Figure 2, the projection chain of node X h contains two discontinuities; these discontinuities are present in X h → X d and in X d → X g. Therefore, the gap degree of node X h is 2. On the other hand, the gap degree of node X d is 1. The gap degree of a dependency tree is the maximum among the gap degrees of its nodes (Kuhlmann and Nivre, 2006). In Figure 2, the gap degree of the tree is 2, as the highest gap degree (associated with X h) is 2. Since gap degree is the number of discontinuities in a projection chain, it is bounded above by the length of projection chains.
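Gap degree can be computed directly from a head-index encoding of a tree. The sketch below is illustrative only: it follows the yield-based formulation of Kuhlmann and Nivre (2006), in which the gap degree of a node is the number of discontinuities in the set of positions it dominates, and it assumes a hypothetical encoding `heads` (position of each word's head, `-1` for the root). The example tree assumes that Figure 2(b) has the linear order X g, X k, X d, X i, X h, X j and the projection chains named in the Figure 2 caption.

```python
def yields(heads):
    """Map each node to the sorted list of positions it dominates (itself included)."""
    result = {i: [] for i in range(len(heads))}
    for node in range(len(heads)):
        k = node
        while k != -1:            # walk up to the root, registering `node`
            result[k].append(node)
            k = heads[k]
    return result                 # lists are sorted because `node` increases


def node_gap_degree(positions):
    """Number of discontinuities in a sorted list of positions."""
    return sum(1 for a, b in zip(positions, positions[1:]) if b - a > 1)


def tree_gap_degree(heads):
    """Maximum gap degree over all nodes of the tree."""
    return max(node_gap_degree(p) for p in yields(heads).values())


# Assumed encoding of Figure 2(b): X g=0, X k=1, X d=2, X i=3, X h=4, X j=5
# with arcs X d -> X g, X i -> X k, X h -> X d, X j -> X i, X j -> X h.
figure2b = [2, 3, 4, 5, 5, -1]
y = yields(figure2b)
print(node_gap_degree(y[4]))      # 2 (node X h)
print(node_gap_degree(y[2]))      # 1 (node X d)
print(tree_gap_degree(figure2b))  # 2
```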
Edge degree: Let e be the span of dependency arc X h → X d. The span e consists of nodes between a head X h and its dependent X d, which are X i, X a, and X b in Figure 3. The edge degree of a dependency arc X h → X d is the number of nodes in the span e which are neither dominated by some node in the span e nor dominated by the head X h. For example, arc X h → X d in Figure 3(a) and 3(b) has an edge degree of 2 because nodes X i and X b are not dominated by any node in the span e. In addition, they are also not dominated by head X h. The edge degree of a dependency tree is the highest edge degree among the arcs of the tree.
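The edge degree definition above translates into a short procedure. The following sketch is an illustration under stated assumptions rather than the authors' implementation: trees are encoded as a hypothetical list `heads` (position of each word's head, `-1` for the root), and the example tree is a made-up configuration loosely modeled on the description of Figure 3(a), not a transcription of the figure.

```python
def ancestors(heads, node):
    """Yield the proper ancestors of `node`, nearest first."""
    node = heads[node]
    while node != -1:
        yield node
        node = heads[node]


def uncovered_span_nodes(heads, dep):
    """Nodes strictly inside the span of heads[dep] -> dep that are dominated
    neither by the arc's head nor by another node in the span.
    `dep` must not be the root."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    span = set(range(lo + 1, hi))
    return [k for k in sorted(span)
            if not any(a == head or a in span for a in ancestors(heads, k))]


def edge_degree(heads, dep):
    return len(uncovered_span_nodes(heads, dep))


def tree_edge_degree(heads):
    """Highest edge degree among the arcs of the tree."""
    return max((edge_degree(heads, d) for d, h in enumerate(heads) if h != -1),
               default=0)


# Hypothetical tree: X h=0, X i=1, X a=2, X b=3, X d=4, X j=5 (root);
# X j -> X h, X j -> X i, X i -> X a, X j -> X b, X h -> X d.
example = [5, 5, 1, 5, 0, -1]
print(uncovered_span_nodes(example, 4))  # [1, 3], i.e. X i and X b
print(edge_degree(example, 4))           # 2
```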
There are cognitive reasons to suspect edge degree might be limited in natural language. From an online processing perspective, higher edge degree in a subtree results in the need to maintain an unresolved crossing dependency across a longer span of words, which may result in online processing difficulty due to higher working memory load (Gibson, 1998).
End-point crossing: The number of end-point crossings is the number of heads which dominate the gap of an arc. Given an arc X h → X d with a span e containing X i, X a and X b as in Figure 3, the end-point crossing of arc X h → X d is defined as the number of heads modified by the nodes in e that are not part of the projection chain of X h. For example, in Figure 3(a) and 3(b), X i and X b are not part of the projection chain of X h; in other words, they are not dominated by either X h or any node in the span e. In 3(a), the number of heads modified by X i and X b is 1 (corresponding to X j); therefore, the end-point crossing is 1. In 3(b), the number of heads modified by X i and X b is 2 (corresponding to X j and X r respectively); therefore, the end-point crossing is 2.
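Under the definition given here, the end-point crossing count of an arc is the number of distinct heads of the span nodes that are dominated neither by X h nor by another span node. A minimal sketch follows, assuming a hypothetical `heads` encoding (position of each word's head, `-1` for the root). The two example trees are invented configurations in the spirit of Figure 3(a) and 3(b), and skipping a root that falls inside the span is our own simplifying choice.

```python
def end_point_crossings(heads, dep):
    """Count distinct heads of span nodes that are dominated neither by the
    head of the arc heads[dep] -> dep nor by another node in its span.
    `dep` must not be the root."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    span = set(range(lo + 1, hi))
    outside_heads = set()
    for k in span:
        a, covered = heads[k], False
        while a != -1:                      # climb towards the root
            if a == head or a in span:
                covered = True
                break
            a = heads[a]
        # A root inside the span has no head to count (simplifying assumption).
        if not covered and heads[k] != -1:
            outside_heads.add(heads[k])
    return len(outside_heads)


# Hypothetical (a): X h=0, X i=1, X a=2, X b=3, X d=4, X j=5 (root);
# X i and X b both attach to X j.
tree_a = [5, 5, 1, 5, 0, -1]
# Hypothetical (b): as above plus X r=6 attached to X j; X b attaches to X r.
tree_b = [5, 5, 1, 6, 0, -1, 5]
print(end_point_crossings(tree_a, 4))  # 1
print(end_point_crossings(tree_b, 4))  # 2
```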
It has been argued that natural language dependency trees tend to have not more than one end-point crossing, which is called the 1-end-point crossing constraint (Pitler et al., 2013). Pitler et al. (2013) argue that this constraint is related to the Phase Impenetrability Condition from Minimalist syntax (Chomsky, 2007). From a processing-based perspective, higher end-point crossings in a subtree should lead to multiple heads/dependents being maintained/stored at the same time in the parse stack. This should lead to increased storage cost (Gibson, 1998). In addition, a longer span of the crossing dependency could lead to similarity-based interference (Lewis and Vasishth, 2005) at the head.
Heads' depth difference (HDD): For a crossing dependency X h → X d, suppose that X i is the node which creates discontinuity, i.e. X i is not directly or indirectly dominated by X h (see Figure 4). For this configuration, we call X i the intervener, X j the head of the intervener, and X h the head of the crossing dependency. The heads' depth difference (HDD) is defined as the difference between the depth of the head of the crossing dependency X h and the depth of the head of the intervener X j. This is schematically shown in Figure 4. Depth of a node is computed as the hierarchical position of that node in a projection chain. The depth of X h is 2 while the depth of X j is 0, making the HDD for this configuration equal to 2. Thus, HDD for a crossing dependency X h → X d is:
\[ \mathrm{HDD}(X_h \rightarrow X_d) = \mathrm{depth}(X_h) - \mathrm{depth}(X_j) \]
where depth(X h) is the hierarchical position of the head of the non-projective dependency (X h) and depth(X j) is the hierarchical position of the head of the intervening element (X i). The HDD of a dependency tree is the maximum HDD among the HDDs of the arcs in the tree.
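The HDD computation can be sketched as follows. This is an illustration under several assumptions that the text does not fix: trees are encoded as a hypothetical list `heads` (position of each word's head, `-1` for the root), depth is counted as the number of arcs from the root, every span node not dominated by X h is treated as an intervener, and the difference is taken with its sign. The example tree is an invented configuration chosen to reproduce the depths stated for Figure 4 (depth 2 for X h, depth 0 for X j).

```python
def depth(heads, node):
    """Number of arcs between `node` and the root (the root has depth 0)."""
    d = 0
    while heads[node] != -1:
        node = heads[node]
        d += 1
    return d


def dominates(heads, ancestor, node):
    while node != -1:
        if node == ancestor:
            return True
        node = heads[node]
    return False


def arc_hdds(heads, dep):
    """HDD values for the arc heads[dep] -> dep, one per intervener.
    `dep` must not be the root; an intervening root is skipped (no head)."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return [depth(heads, head) - depth(heads, heads[k])
            for k in range(lo + 1, hi)
            if heads[k] != -1 and not dominates(heads, head, k)]


def tree_hdd(heads):
    """Maximum HDD over all crossing configurations, or None if there are none."""
    values = [v for d, h in enumerate(heads) if h != -1
              for v in arc_hdds(heads, d)]
    return max(values) if values else None


# Hypothetical tree: X d=0, X i=1, X h=2, X m=3, X j=4 (root);
# X j -> X m -> X h -> X d and X j -> X i, so X i intervenes in X h -> X d.
example = [2, 4, 3, 4, -1]
print(arc_hdds(example, 0))  # [2]
print(tree_hdd(example))     # 2
```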
In terms of formal syntax, HDD can correspond to the hierarchical depth between a filler and a gap in a long distance dependency (e.g., wh movement). Based on the theoretical syntax literature, HDD should be unbounded, at least for leftward wh-dependencies (Sag et al., 1999). However, increasing HDD seems to correlate with increased online processing difficulty for humans (Phillips et al., 2005). More generally, HDD has been proposed (see Yadav et al., 2017) to formalize the experimental findings that increased embedding depth leads to processing difficulty (e.g., Yngve, 1960; Gibson and Thomas, 1999). Therefore, it is possible that HDD is restricted in dependency trees due to cognitive constraints.

Dependency tree properties
We study violations of crossing constraints as a function of the following properties of dependency trees.
Sentence length: Sentence length is measured as the total number of nodes in a dependency tree.
Arity: The arity of a node is the total number of dependents of that node. We quantify arity as a global property of a tree by taking the maximum arity per node in the tree.
Tree depth: Tree depth is the number of heads in the longest projection chain in a dependency tree (see Figure 2). Tree depth represents the maximum number of levels of embedding occurring in a tree.
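The three tree properties can be read off a head-index encoding in a few lines. The sketch below is illustrative; it assumes a hypothetical list `heads` (position of each word's head, `-1` for the root) and counts tree depth as the number of arcs on the longest root-to-leaf path, which equals the number of heads in the longest projection chain.

```python
from collections import Counter


def tree_properties(heads):
    """Return sentence length, maximum arity and tree depth of a tree."""
    arity = Counter(h for h in heads if h != -1)   # dependents per head

    def depth(node):
        d = 0
        while heads[node] != -1:                   # arcs up to the root
            node = heads[node]
            d += 1
        return d

    return {
        "length": len(heads),
        "max_arity": max(arity.values(), default=0),
        "depth": max(depth(i) for i in range(len(heads))),
    }


# Six-word tree with chains 5 -> 4 -> 2 -> 0 and 5 -> 3 -> 1.
print(tree_properties([2, 3, 4, 5, 5, -1]))
# {'length': 6, 'max_arity': 2, 'depth': 3}
```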

Natural language dataset
We use the Universal Dependencies (UD v2.3) treebanks of 14 languages as a dataset (Nivre et al., 2018). The languages were selected for typological diversity: the dataset contains 8 head-initial languages and 6 head-final languages. We exclude from our analysis the dependencies marking punctuation (labeled 'punct' in the UD scheme) and the abstract root of the tree (labeled 'root' in the UD scheme).
As we discuss below, the process of sampling random baseline trees makes it prohibitively difficult to study all languages in the UD dataset. Therefore we study treebanks of 14 languages: German, English, Hindi, French, Arabic, Russian, Czech, Italian, Spanish, Afrikaans, Japanese, Korean, Bulgarian and Slovak. We present results aggregating over dependency trees from all these languages.

Random baseline
Our null hypothesis is that the only restriction on crossing dependencies is that they are rare, i.e. that they occur at a certain low rate per dependency in a sentence. We instantiate the null hypothesis by sampling random trees which are constrained to have the same distribution over sentence length and number of crossings per dependency as a corpus of some natural language.
We control for sentence length and crossing rate in the random trees in the following way. For each real dependency tree t of length n in a corpus, we sample random trees t′ from a uniform distribution over the n^{n−1} directed labeled tree structures with n nodes using Prüfer codes (Prüfer, 1918). We control for the crossing rate by rejection sampling: we reject random samples t′ which do not have the same number of crossings as the original tree t. For long sentences (length 12 or more), the rejection sampling process is prohibitively slow, because the vast majority of random trees for n ≥ 12 have a very large number of crossings. So in the present work we only consider sentences of length less than 12.
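The sampling procedure can be sketched as follows. This is a minimal illustration, not the released analysis code, and it rests on labeled assumptions: a uniformly random Prüfer sequence gives a uniform labeled tree, a uniformly chosen root then gives a uniform rooted (directed) tree, node labels are read as linear positions, and "number of crossings" is taken to be the number of crossing arcs. The function names and the `max_tries` cap are hypothetical.

```python
import heapq
import random


def prufer_to_edges(seq, n):
    """Decode a Prüfer sequence of length n - 2 into the n - 1 edges of a labeled tree."""
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        edges.append((heapq.heappop(leaves), v))   # attach smallest leaf to v
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


def random_tree(n, rng):
    """Sample a rooted labeled tree uniformly; return it as a head list (root = -1)."""
    if n == 1:
        return [-1]
    seq = [rng.randrange(n) for _ in range(n - 2)]
    adj = {v: [] for v in range(n)}
    for a, b in prufer_to_edges(seq, n):
        adj[a].append(b)
        adj[b].append(a)
    root = rng.randrange(n)
    heads = [None] * n
    heads[root] = -1
    stack = [root]
    while stack:                                   # orient edges away from the root
        v = stack.pop()
        for w in adj[v]:
            if heads[w] is None:
                heads[w] = v
                stack.append(w)
    return heads


def count_crossing_arcs(heads):
    """Number of arcs whose span contains a node not dominated by the arc's head."""
    def dominated_by(node, anc):
        while node != -1:
            if node == anc:
                return True
            node = heads[node]
        return False

    total = 0
    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            total += 1
    return total


def sample_baseline(real_heads, rng, max_tries=100_000):
    """Rejection-sample a random tree matching the length and crossing count
    of `real_heads`; return None if no match is found within `max_tries`."""
    n, target = len(real_heads), count_crossing_arcs(real_heads)
    for _ in range(max_tries):
        candidate = random_tree(n, rng)
        if count_crossing_arcs(candidate) == target:
            return candidate
    return None


baseline = sample_baseline([2, 3, 3, -1], random.Random(0))
```

The acceptance rate falls quickly as n grows, which is the practical reason rejection sampling becomes infeasible for longer sentences.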
Since we are only controlling the number of crossings and the sentence length, the distribution of arity and depth in random baseline trees is quite different from real language trees. In particular, we find that the growth of tree depth with respect to sentence length is faster for random baseline trees. In addition, the growth of arity with sentence length is slower in random baseline trees. In sum, random baseline trees are typically deeper than real trees.

Testing the Null and True Constraint Hypotheses
We compare the rate at which crossing constraints are violated in real trees as compared with random baseline trees, as a function of sentence length, arity, and tree depth. We evaluate the difference between real and random trees statistically using mixed-effects Poisson regression (Gelman and Hill, 2007; Baayen et al., 2008). We fit the regression to predict the rate of constraint violations as a function of dependency tree features (length, depth, and arity) and a dummy-coded variable encoding whether a given tree is real or random. We also include by-language random intercepts. For example, we predict the gap degree g_i of the ith sentence s_i in the jth language as:
\[ g_i \sim \mathrm{Poisson}(\lambda_i), \qquad \log \lambda_i = \beta_0 + \beta_l |s_i| + \beta_r r_i + \beta_{lr} |s_i| r_i + \gamma_j \]
where |s_i| is the length of sentence s_i, r_i is an indicator variable with value 1 for a real tree and 0 for a baseline tree, β_0 is the intercept, β_l and β_r are the main effects of length and tree type, and γ_j, subject to L2 regularization, is a random intercept for the jth language. The fitted value of the interaction coefficient β_lr gives the extent to which the growth rate of gap degree as a function of sentence length differs between the real and the random trees. If β_lr is significantly negative, then this would mean that gap degree grows more slowly with sentence length in real trees as compared with random trees, i.e. gap degree would be minimized in real trees.
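The role of the interaction coefficient can be illustrated numerically. The sketch below assumes the usual log link of Poisson regression, so that the expected count is the exponential of a linear predictor. The names `b0`, `b_l` and `b_r` for the intercept and main effects, and all numeric values, are hypothetical and chosen only for illustration; they are not fitted estimates from this study.

```python
import math


def expected_count(length, is_real, b0, b_l, b_r, b_lr, gamma_j=0.0):
    """Expected value of a constraint measure under a log-linear (Poisson) model
    with a length effect, a real-vs-random effect, their interaction, and a
    by-language intercept gamma_j."""
    r = 1 if is_real else 0
    return math.exp(b0 + gamma_j + b_l * length + b_r * r + b_lr * length * r)


def growth_per_word(is_real, b_l, b_lr):
    """Multiplicative change in the expected count for one additional word."""
    return math.exp(b_l + (b_lr if is_real else 0.0))


# Hypothetical coefficients: a negative interaction slows growth in real trees.
b_l, b_lr = 0.2, -0.1
print(growth_per_word(False, b_l, b_lr) > growth_per_word(True, b_l, b_lr))  # True
```

With a negative interaction coefficient, each additional word multiplies the expected measure by a smaller factor in real trees than in baseline trees, which is the pattern the true constraint hypothesis predicts.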

Results
We compared the regression pattern of each measure with length, arity and depth between observed and random baseline trees. Below we report the results for each crossing constraint. A summary of all regression results is found in Table 1.

Gap degree
We find that the distribution of gap degree as a function of sentence length and arity is not significantly different between real and random trees (see Figure 5). In particular, the interaction between length/arity and tree type was not significant in the respective models (see Table 1). However, the growth rate of gap degree with tree depth is significantly different between real and random trees (p < .001). In other words, we found no evidence for the TCH for gap degree as a function of length and arity: the distribution of gap degree in natural language trees can be fully explained without formal restrictions on crossing dependencies or tree structures. However, the results with respect to depth provide support for the TCH.
Figure 5: Gap degree as a function of sentence length and tree depth in real and random trees. In this and all other figures, for visual clarity, we only display results for trees with at least one crossing dependency. All statistical tests are performed using all trees.

Edge degree
As shown in Figure 6, edge degree grows faster in random trees in comparison to real trees as a function of sentence length, arity and depth. The mixed-effects Poisson regression models show that the three interaction coefficients (for length, arity, and depth) are significant in the respective models (see Table 1). This provides evidence for the TCH for edge degree.
Figure 6: Edge degree as a function of sentence length and tree maximum arity for real and random trees.

End-point crossings
As shown in Figure 7, we find that end-point crossings grow at a slower rate in real trees as a function of tree depth as compared with random baselines. The results support the TCH for end-point crossings. Similar to gap degree, end-point crossing as a function of maximum arity and sentence length does not differ significantly between real and random trees (see Table 1).

Heads' Depth Difference (HDD)
The results show that HDD decreases with sentence length in real trees, and the rate of decrease is less than in random trees (Figure 8). HDD is also much higher in random trees compared to real trees as a function of tree depth. These results support the TCH for HDD. HDD does not differ between real and random trees with respect to maximum arity (see Table 1).
We found that the distribution of gap degree, edge degree, end-point crossing and HDD cannot be explained solely in terms of sentence length and the rate of crossings. These constraints are violated at a different rate as a function of various tree properties than would be expected in random trees, suggesting that they may constitute real formal restrictions on trees.
The results show that the behavior of these crossing constraints differs depending on dependency tree properties. Gap degree and end-point crossings in real vs. random trees are only different as a function of tree depth (which itself has a very different distribution between real and random trees). HDD in real vs. random trees is indistinguishable as a function of arity, but is different for tree depth and sentence length. Edge degree, on the other hand, emerges as the crossing constraint that is most distinct between real and random trees: its distribution is significantly different as a function of all three tree properties.
Our results do not rule out the possibility that the correlations reported here might themselves be epiphenomenal, resulting from other graph-theoretic properties of real dependency trees which were not controlled for here. For example, a great deal of work has shown that syntactic dependency trees are subject to dependency length minimization: a pressure for the linear distance between syntactic heads and dependents to be short (Hawkins, 1994; Gibson, 1998; Liu, 2008; Futrell et al., 2015; for recent reviews, see Liu et al., 2017; Temperley and Gildea, 2018), and this pressure has been argued to underlie the scarcity of crossing dependencies in general (Ferrer-i-Cancho, 2006; Ferrer-i-Cancho and Gómez-Rodríguez, 2016; Gómez-Rodríguez and Ferrer-i-Cancho, 2017). It is also possible that the differences between real trees and random trees in our results are driven by differences in the depth and arity of these trees, or by UD annotation decisions such as the use of content-head dependencies.
Our work provides a strong framework for evaluating any such theory that aims to predict the particular distribution of crossing dependencies in natural language. A syntactic theory can be tested in our framework by creating random baselines that control for the stipulations of the theory and then statistically comparing the distribution of crossing constraint violations with real trees. To that end, we make the code for our analysis freely available at http://github.com/yadavhimanshu059/measures_of_nonProjectivity.