Explicit Causal Connections between the Acquisition of Linguistic Tiers: Evidence from Dynamical Systems Modeling

In this study, we model the causal links between the complexities of different macroscopic aspects of child language. We consider pairs of sequences of measurements of the quantity and diversity of the lexical and grammatical properties. Each pair of sequences is taken as the trajectory of a high-dimensional dynamical system, some of whose dimensions are unknown. We use Multispatial Convergent Cross Mapping to ascertain the directions of causality between the pairs of sequences. Our results provide support for the hypothesis that children learn grammar through distributional learning, where the generalization of smaller structures enables generalizations at higher levels, consistent with the proposals of construction-based approaches to language.


Introduction
A crucial question in language acquisition concerns how (or, according to some, whether) children learn the grammars of their native languages. Some researchers, mainly coming from the generative tradition, argue that, although the grammatical rules are possibly 'innate' (e.g., Pinker, 1994), children still need to learn how to map the different semantic/grammatical roles onto the different options offered by Universal Grammar (e.g., 'parameter-setting'). The evidence, however, does not seem to support this hypothesis. For instance, Bowerman (1990) notes that the type of semantic aspects learned by the child do not match well into the prototypical roles that would be required to map into hard linguistic rules (e.g., learning an AGENT category to map onto the SUBJECT syntactic role). Other researchers (e.g., Goldberg, 2003;Tomasello, 1992;Tomasello, 2005) propose that there is a gradual increase in the generality of the structures learned by the child, which are slowly acquired through distributional learning. Such a picture is strongly supported by the remarkably little creativity exhibited by children, most of whose utterances are often literal repetitions of those that they have previously heard (Lieven et al., 1997;Pine and Lieven, 1993), with little or no generalization in the early stages. It appears as though children progressively and conservatively increase the level at which they generalize linguistic constructions, building from the words upwards, in what some have termed 'lexically-based positional analysis ' (Lieven et al., 1997).
The Theory of Dynamical Systems offers powerful tools for modeling human development (e.g, Smith and Thelen, 2003;van Geert, 1991). It provides a mathematical framework for implementing the principle that development involves the mutual and continuous interaction of multiple levels of the developing system, which simultaneously unfold over many time-scales. Typically, a dynamical system is described by a system of coupled differential equations governing the temporal evolution of multiple parts of the system and their interrelations. One difficulty that arises when trying to model a dynamical system as complex as the development of language is that many factors that are important for the evolution of the system might not be available or might not be easily measurable or -even worse-there are additional variables relevant for the system of which the modeler is not even aware. In this respect, a crucial development was the discovery that, in a deterministic coupled dynamical system -even in the presence of noisethe dynamics of the whole system can be satisfactorily recovered using measurements of a single of the system's variables (Takens' Embedding Theorem;Takens, 1981).
The finding above opens an interesting avenue for understanding the processes involved in language acquisition. In the same way that systems of differential equations can be used to model the evolution of ecosystems (e.g., predator-prey systems), one could take measurements of the detailed properties of child language, and build a detailed system of equations capturing the macroscopic dynamics of the process. However, in order to achieve this, it is necessary to ascertain the ways in which different measured variables in the system affect each other. This problem goes beyond estimating correlations (as could be obtained, for instance, using regression models), as one needs to detect asymmetrical causal relations between the variables of interest, so that these causal influences can be incorporated into the models.
In this study, we investigate the causal relations different macroscopic-level measures characterizing the level of development of different tiers child language (i.e., number of words produced, lexical diversity, inflectional diversity and mean length of utterances), using the longitudinal data provided in the Manchester Corpus (Theakston et al., 2001). In order to detect causal relations between the different measures, we make use of state space reconstruction relying on Takens (1981)'s Embedding Theorem, and recently developed techniques for assessing the strength of causal relations in dynamical systems (Multispatial Convergent Cross Mapping; Clark et al., 2015). Here, we provide a detailed picture of the causal connections between the development of different aspects of a child's language (while acquiring English). Our result provide support for theories that advocate distributional learning of linguistic constructions by gradual generalizations from the level of words to larger scale constructions.  For instance, consider E. Lorenz's often studied dynamical system, which includes three coupled variables X(t), Y (t), and Z(t) whose coevolution is described by the system of differential equations

Causality Detection in Dynamical Systems
(1) The first equation in this system indicates that there is a relation by which Y causes X, as the change in X (i.e., its future value) depends on the value of Y (i.e., the future of X depends on the past of Y even after the past of X itself has been considered), a causal relation whose strength is indexed by parameter σ. The manifold defined by these three variables (Lorenz's famous strange attractor), which we can denote by M , is plotted in the top of Fig. 1. In many circumstances, however, not all variables of the system are available (some might be difficult to measure, or we might not even be aware of their relevance). It is at this point that Takens (1981)'s Embedding Theorem comes into play. Informally speaking, the theorem states that the properties of a coupled dynamical system's attractor can be recovered using only measurements from a single one of its variables. This is achieved by considering multiple versions of the same variable lagged in time, that is, instead of plotting ( . These reconstructed manifolds are termed "shadow" manifolds. M X denotes the shadow manifold of M reconstructed on the basis of X alone. There are well-studied techniques for finding the appropriate values for the parameters for the lag τ and the number of dimensions E (c.f., Abarbanel et al., 1993) so that the properties of the original manifold M are recovered by the shadow manifold M X . Fig. 1 illustrates this point by plotting the shadow manifolds M X (bottom-left) and M Y (bottom-right) for the Lorenz system. Notice how both shadow manifolds recover much of the original's structure, using only knowledge of one of its three variables.
Each point in the original manifold M maps onto points in its shadow manifolds, as is illustrated by the points labelled m(t), x(t), and y(t) in Fig. 1. The preservation of the topological properties of the original manifold in its shadow manifolds entails that points that are close-by in the original manifold will also be close-by in its shadow versions. This implies that, for causally linked variables within the same dynamical system, the state of one variable can identify the states of the others. Sugihara et al. (2012) noticed that, when one variable X stochastically drives another variable Y , information about the states of X can be recovered from Y , but not vice-versa. This is the basic insight of the CCM method. To test for causality from X to Y , CCM looks for the signature of X in Y 's time series by seeing whether the time indices of nearby points on M Y can be used to identify nearby points on M X . Crucially, in order to distinguish causation from mere correlation, CCM requires convergence, that is, that crossmapped estimates improve in estimation accuracy with the sample size (i.e., "library size") used for reconstructing the manifolds. As the library size increases, the trajectories defining the manifolds fill in, resulting in closer nearest neighbors and declining estimation error, which is reflected in a higher correlation coefficient between the points in the neighborhoods of the shadow manifolds. Con-vergence then becomes the necessary condition for inferring causation. Using both artificial systems and ecological time-series with known dynamics, Sugihara and his colleagues demonstrated that this technique successfully recovers true directional causal relations when these are present, and -crucially-is able to discard spurious causation in the case when both variables are causally driven by a third, unknown, variable, but there is no true direct causation between them.
An inconvenience of CCM, and in general of techniques that rely on manifold reconstruction, is that they generally require that relatively long time-series of the behavior of the system are available. Such long series are, however, very difficult, if not impossible, to obtain in many fields, including of course language acquisition. One can however obtain multiple short time series from different instances of a similar dynamical system. In ecology, for instance, one can obtain short sequences of measurements of the population densities of a group of species measured at different places and times. In language acquisition, we might have multiple, relatively short longitudinal sequences of measurements from different children. With this in mind Clark et al. (2015) developed Multispatial CCM (mCCM), an extension of CCM able to infer causal relations from multiple short time-series measured at different sites, making use of dewdrop regression (Hsieh et al., 2008) to take the additional heterogeneity into account.

Materials
We obtained from the CHILDES database (MacWhinney, 2000) the transcriptions contained in the Manchester Corpus (Theakston et al., 2001). This corpus contains annotated transcripts of audio recordings from a longitudinal study of 12 British English-speaking children (6 girls and 6 boys) between the ages of approximately two and three years. The children were recorded at their homes for an hour while they engaged in normal play activities with their mothers. Each child was recorded on two separate occasions in every threeweek period for one year. Each recording session is divided into two half-hour periods. The annotations include the lemmatized form of the words produced by the children (incomplete words and small word-internal errors were manually corrected in the lemmatization).
In order to increase the sample size in each period, we followed a sliding window technique of (Irvin et al., in press): We computed measures for the samples contained in overlapping windows of three consecutive corpus files. In this way, at each point we obtained samples originating from two files from the same recording session, and a file from either the previous or the next recording session.

Measures of Linguistic Development
As in previous studies (Irvin et al., in press; Moscoso del Prado Martín, in press), in order to measure the overall amount of speech produced by each child, we counted the total number of word tokens produced by each child in each temporal window. We refer to this measure as the child's loquacity.
In order to measure the diversity of the words used by the children, we use the lexical diversity measure (Irvin et al., in press; Moscoso del Prado Martín, in press). This is just the information entropy (Shannon, 1948) of the probability distribution of word lemmas found in the sample, where L refers to the set of word lemmas found in a sample, and p( ) is the probability with which the particular lemma is found in that sample. Entropy estimates obtained using Eq. 2 using maximum likelihood estimates of the probabilities are known to be strongly biased (Miller, 1955), with the bias magnitude correlating with the size of the sample used. Importantly, the sample size is nothing else than the loquacity measure described above. Therefore, using this plain maximum likelihood method would result in spurious correlations. For this reason, Moscoso del Prado Martín (in press) recommends using the bias-adjusted entropy estimator (Chao et al., 2013, see Appendix A) instead of Eq. 2. In order to measure the acquisition of inflectional morphological paradigms, we make use of the inflectional diversity measure (Moscoso del Prado Martín, in press). This is a macroscopic generalization of inflectional entropy (Moscoso del Prado Martín et al., 2004), a measure that is known to index morphological influences on adult lexical processing (Baayen and Moscoso del Prado Martín, 2005; Moscoso del Prado Martín et al., 2004) as well as in child language acquisition (Stoll et al., 2012). The inflectional entropy of a lemma (H[W | ]) is the information entropy of the inflected variants of that lemma. Our inflectional diversity is just the average value of inflectional entropy across all lemmas, where H[L] is the lexical diversity measure described above, and H[W, L] is the joint entropy between the inflected word forms and their corresponding lemmas, where L denotes the set of all distinct lemmas encountered in the sample, W is the set of all distinct inflected word forms encountered, and p(w, ) is the joint probability with which lemma occurs as the specific inflected form w. Inflectional diversity takes non-negative values, measuring how large are the average inflectional paradigms used in the language sample. Estimating H[W, L] using Eq. 4 is subject to the same estimation biases that were described for lexical diversity. Therefore, we also follow Moscoso del Prado Martín (in press) in using the bias-adjusted estimate (Chao et al., 2013, see Appendix A) for this magnitude, and then combining it with the lexical diversity using Eq. 3 to obtain our inflectional diversity estimates. Finally, in order to measure the degree of syntactic development of the children we used their mean length of utterances (MLU). Instead of measuring MLU in morphemes (Brown, 1973), we used the simpler, but equally accurate measure in number of words (c.f., Parker and Bronson, 2005). In these ages, MLUs are well known to provide an accurate measure of the syntactic richness of the utterances produced (Brown, 1973), and in fact correlate almost perfectly with explicit measurements of grammatical diversity (Moscoso del Prado Martín, in press).

Reconstruction of Shadow Manifolds
Using the windowing technique, for each child we obtained four time series, one corresponding to each of the four measures described above: loquacity, lexical diversity, inflectional diversity, and MLU. These time series are plotted in Fig. 2   In order to ensure that applying the non-linear dynamics techniques on these time series was sensible, the series were checked to ensure that they contained non-linear signal not dominated by noise. This was achieved using a prediction test (Clark et al., 2015): We ensured that, for all four variables, the ability to predict future values significantly decreased as one increases the distance in the future at which the predictions are being made. This increasing unpredictability is the hallmark of non-linear dynamical systems. Therefore, we could safely proceed to reconstruct the shadow attractors.
Following Clark et al. (2015), we reconstructed the shadow attractors from each of these collections of time series. The optimal time-lags (τ ) for constructing the shadow manifolds were estimated as the first local minimum of the lagged self-information in each of the time series (c.f., Abarbanel et al., 1993). The optimal embedding dimensionalities (E) were estimated by optimizing next-step prediction accuracy. The estimates were not found to differ significantly across children, and therefore for each measure, we used a single estimate of (τ, E) for all children. The estimated optimal parameter values used for the reconstruction of each shadow attractor are given in Table 1.

Detection of Causal Relationships
The presence of directional causal relations was tested for each of the six possible pairs of variables using mCCM. We performed 1,000 bootstrapping iterations for assessing the p-values of the relations. 1 Finally, to account for our lack of a priori predictions on the causal directions to be tested, the p-values were adjusted for multiple comparisons using the false discovery rate procedure for correlated data (FDR; Benjamini and Yekutieli, 2001).

Results
As plotted in Fig. 2, the four groups of time-series considered here exhibit different patterns of development. On the one hand, the loquacity, inflectional diversity, and MLU series show evidence of a more or less linear increase along the child's development, with their values towards the end of the studied interval being close to what was found for their mothers in those same conversations. On the other hand, the lexical diversity measure exhibits quite constant patterns across all children, with their values being pretty much indistinguishable from those observed for their mothers. This latter pattern is slighly different in two chilren (Ruth and Nick), who seem to be experiencing their 'vocabulary burst' later than the rest of the children did. In fact, if one examines panel (c) in detail, one sees that the inflectional diversity curves for these two children only begin their linear increases after the children have experienced their vocabulary bursts. A similar pattern can be seen in the MLU curves (panel (d)) for these two particular children, with syntactic development apparently being delayed by their late vocabulary bursts. These two patterns suggest that the development of both grammatical components of their language (inflectional morphology and syntax) depends on having attained a certain degree of vocabulary richness. However, just examining these curves does not provide explicit evidence on whether these hypothesized causal relationships are actually reliable ones or they are just statistical mirages. The mCCM method addresses such question directly. Fig. 3 plots the results of mCCM for each pair of reconstructed shadow manifolds. The curves plot how the correlations between nearest neighbors across shadow attractors evolve as one considers increasingly larger library sizes. The p-values report whether these correlation values are significantly increasing (the p-values are obtained by a Monte Carlo method with 1,000 resamplings, and further corrected for the twelve comparisons using the FDR procedure).
Using the p-values in Fig. 3 enables the reconstruction of the network of causal relations depicted in Fig. 4. In this graph, the causal relation between the loquacity and the lexical diversity is considered weaker than the rest. The reason for this is that the comparisons reported here are in fact part of a larger study considering many more comparisons (including many factors of the mothers as well), on which we did not have any clear a priori predictions on the relations that would be found. When applying the FDR method on the whole set of 56 comparisons that we actually considered, the relation plotted by the dashed arrow is in fact not significant. In short, one should not trust the reliability of that particular relation.
Considering only the fully reliable relations, one finds that, as was suspected from the curves in Fig. 2, there is an explicit causal relation between the development of vocabulary richness (i.e., lexical diversity) and the acquisition of inflectional paradigms (i.e., inflectional diversity). The increase in lexical diversity indeed causes the development of inflectional paradigms. In turn, that the inflectional paradigms begin to be in place enables the child to begin generalizing more syntactic relationships (as is reflected by the feedback loop found between the inflectional diversity and MLU manifolds). Importantly, that the inflectional paradigms are developed is also strongly coupled (i.e., forms a feedback loop) with the increase the children's overall loquacity; once children begin to get a hold of grammar (inflection and syntax) they are enabled to speak more, which in turn furthers their ability to generalize morphological relations and -by association-syntactic relations.

Discussion
In this study, we have -for the first time-documented the explicit causal relations between different tiers of children's linguistic development. At a macroscopic level, we find strictly causal relations between the acquisition of vocabulary, inflectional paradigms, and syntactic relationships. As schematized in Fig. 4, the development of a sufficiently large vocabulary is a crucial trigger for the successful acquisition of the grammatical aspects of language, which are in turn necessary for children to be able to speak more. These results are consistent with theories advocating the importance distributional learning for the acqui- sition of constructions (Lieven et al., 1997;Pine and Lieven, 1993;Tomasello, 1992;Tomasello, 2005). It shows how the level of generality of the constructions is progressively increased (Goldberg, 2003) by the use of 'lexically-based positional analysis ' (Lieven et al., 1997) to achieve early grammatical generalizations.
The picture of causal relations observed here could be put into an informal narrative as follows: The acquisition of sufficient lexical forms enables children to generalize their relations into inflectional paradigms. When a sufficient command of the language's inflectional morphology has arisen, children are able to begin generalizing syntactic relations. The presence of these early syntactic developments in turn serves to increase the child's awareness of the functional roles served by different paradigm members. From this point, one observes the strong bidirectional coupling between the development of syntax and inflectional morphology. An increasing awareness of the functional roles of the individual forms within these paradigms, and noticing the formal relations between them, in turn trigger further generalizations of the paradigms into inflectional classes (Milin et al., 2009), further increasing the productivity of the inflectional morphology system. This study also stresses the importance of macroscopic level linguistic analyses. Whereas much research in language acquisition has focused on the acquisition of specific individual constructions (microscopic level) or groups thereof (mesoscopic level), the investigation of the properties of the whole lexicon, inflectional and syntactic systems uncovers relations which are difficult to pinpoint at the other levels. This fits in well with the multiscale investigation of language development proposed from the point of view of the Theory of Dynamical Systems (van Geert, 1991). Indeed, one can see, at the mesoscopic level, that -also consistent with the distributional learning hypothesis-there is a causal chain by which the development of single word utterances triggers the development of two-word utterances, which in turn trigger three-word utterances, and so forth (Bassano and van Geert, 2007). The macroscopic analyses provided here complement that picture by indicating how that evolution of utterance lengths is strongly coupled with the development (or 'growth' in van Geert's terms) of grammatical knowledge.
An innovative aspect of the methods we have developed in this paper is that they provide an explicit procedure for testing whether there are ex-  plicit causal relations between the development of different aspects of language. Here we have used the methods at a macroscopic level, but it would be equally possible to apply them to both microscopic-or mesoscopic-level time series. Previous research on dynamical systems on language acquisition (e.g., Bassano and van Geert, 2007; Steenbeek and van Geert, 1991) relies on proposing different candidate models in terms of systems of differential equations, each including different sets of causal relations and couplings between time series. Our methods, using techniques for explicitly testing causal relations borrowed from ecology (a field whose study bears uncanny similarities with the study of human development), complement the curve-fitting by explicitly testing which couplings and causalities should be included in the models, thus significantly reducing the model space that needs to be explored.