Sentence Length

The distribution of sentence length in ordinary language is not well captured by the existing models. Here we survey previous models of sentence length and present our random walk model, which offers both a better fit to the data and a better understanding of the distribution. We develop a generalization of KL divergence, discuss measuring the noise inherent in a corpus, and present a hyperparameter-free Bayesian model comparison method with strong conceptual ties to Minimal Description Length modeling. The models we obtain require only a few dozen bits, orders of magnitude less than naive nonparametric MDL models would require.


Introduction
Traditionally, statistical properties of the sentence length distribution were investigated with the goal of settling disputed authorship (Mendenhall, 1887; Yule, 1939). Simple models, such as a "monkeys and typewriters" Bernoulli process (Miller, 1957), do not fit the data well, and this problem is inherited by n-gram Markov and n-gram Hidden Markov models, such as those found in standard language modeling tools like SRILM (Stolcke et al., 2011). Today, length modeling is used more often as a downstream task to probe the properties of sentence vectors (Adi et al., 2017; Conneau et al., 2018), but the problem is highly relevant in other settings as well, in particular for the current generation of LSTM/GRU-based language models, which generally use an ad hoc cutoff mechanism to regulate sentence length. The first modern study interested in the entire shape of the sentence-length distribution is Sichel (1974), who briefly summarizes the earlier proposals, in particular the negative binomial (Yule, 1944) and the lognormal (Williams, 1944), being rather critical of the latter:

The lognormal model suggested by Williams and used by Wake must be rejected on several grounds: In the first place the number of words in a sentence constitutes a discrete variable whereas the lognormal distribution is continuous. Wake (1957) has pointed out that most observed log-sentence-length distributions display upper tails which tend towards zero much faster than the corresponding normal distribution. This is also evident in most of the cumulative percentage frequency distributions of sentence-lengths plotted on log-probability paper by Williams (1970). The sweep of the curves drawn through the plotted observations is concave upwards which means that we deal with sublognormal populations. In other words, most of the observed sentence-length distributions, after logarithmic transformation, are negatively skew. Finally, a mathematical distribution model which cannot fit real data -as shown up by the conventional χ² test- cannot claim serious attention. (Sichel, 1974, p. 26)

Sichel's own model is a mixture of Poisson distributions given as

P(r) = ((1−θ)^{γ/2} / K_γ(α√(1−θ))) · ((αθ/2)^r / r!) · K_{r+γ}(α),  r = 0, 1, 2, ...   (1)

where K_γ is the modified Bessel function of the second kind of order γ. As Sichel notes, "a number of known discrete distribution functions such as the Poisson, negative binomial, geometric, Fisher's logarithmic series in its original and modified forms, Yule, Good, Waring and Riemann distributions are special or limiting forms of (1)". While Sichel's own proposal certainly cannot be faulted on the grounds enumerated above, it still leaves something to be desired, in that the parameters α, γ, θ are not at all transparent and the model lacks a clear genesis. In Section 2 of this article we present our own model aimed at remedying these defects, and in Section 3 we analyze its properties. Our results are presented in Section 4. The relation between the sentence length model and grammatical theory is discussed in the concluding Section 5.

The random walk model
In the following section we introduce our model of random walks. The predicted sentence length is essentially the return time of these stochastic processes, i.e. the probability of a given length is the probability of the corresponding return time. Let X_k be a random walk on Z and X_k(t) the position of the walk at time t, with initial condition X_k(0) = k. The walk is given by the following parameters:

X_k(t+1) − X_k(t) = s with probability p_s, s ∈ {−1, 0, 1, 2},  p_{−1} + p_0 + p_1 + p_2 = 1.   (2)

The random walk is the sum of these independent steps.
(2) is a simple model of valency (dependency) tracking: at any given point we may introduce, with probability p_2, some word with two open valences (e.g. a transitive verb); with probability p_1, one that brings one new valence (e.g. an intransitive verb or an adjective); with probability p_0, one that doesn't alter the count of open valencies (e.g. an adverbial); and with probability p_{−1}, one that fills an open valency, e.g. a proper noun. More subtle cases, where a single word can fill more than one valency, as in Latin accusativus cum infinitivo or (arguably) English equi, are discussed in Section 5. The return time is defined as

τ_k := min{t > 0 : X_k(t) = 0}.   (3)

In particular, τ_1 is the time needed to go from 1 to 0. We will calculate the probability-generating function to find the probabilities.
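Under these assumptions the process is straightforward to simulate directly. The sketch below is our own illustrative code (function name and interface are ours, not the paper's); the parameter choice mirrors the validation setting used later, with at most one upward step:

```python
import random

def sample_return_time(k, probs, max_steps=10_000):
    """Simulate the valency walk from X(0) = k until it first hits 0.

    probs maps step sizes (e.g. -1, 0, 1, 2) to probabilities summing to 1.
    Returns the return time tau_k, or None if 0 is not reached in max_steps.
    """
    steps, weights = zip(*probs.items())
    position = k
    for t in range(1, max_steps + 1):
        position += random.choices(steps, weights)[0]
        if position == 0:
            return t
    return None

# Downward drift (E[step] = -0.25), so the walk returns almost surely.
random.seed(0)
probs = {-1: 0.5, 0: 0.25, 1: 0.25}    # at most one new valency per word
times = [sample_return_time(3, probs) for _ in range(10_000)]
times = [t for t in times if t is not None]
```

Note that every simulated return time is at least k = 3, illustrating the observation below that τ_k ≥ k with probability 1.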
The generating function of τ_k easily follows from that of τ_1: since τ_k is the sum of k independent copies of τ_1, the generating function of τ_k is simply f(x)^k.
In order to calculate f(x), we condition on the first step:

f(x) = x (p_{−1} + p_0 f(x) + p_1 f(x)² + p_2 f(x)³).   (5)

Therefore, f(x) is the solution of the following equation (solved for f, with x as a parameter):

f − x (p_{−1} + p_0 f + p_1 f² + p_2 f³) = 0.   (6)

This cubic in f can be solved with Cardano's formula. The probabilities are given by the Taylor coefficients

P(τ_k = i) = [x^i] f(x)^k.

For given parameters p_{−1}, p_0, p_1, p_2 and k, and a given i, one can evaluate these probabilities numerically, but we need a somewhat more analytical form. Let us define the following:

F(y) := p_{−1} + p_0 y + p_1 y² + p_2 y³,  g(y) := y / F(y).
With these functions Eq. 6 becomes x = g(f(x)), meaning that we are looking for the inverse function of g. One can see that g(0) = 0 and g′(0) = 1/p_{−1} ≠ 0, therefore we can apply the Lagrange inversion theorem. Calculations detailed in the Appendix yield the following formula:

P(τ_k = i) = (k/i) [y^{i−k}] F(y)^i.
Since F is a polynomial, one can calculate its powers by polynomial multiplication and obtain P(τ_k = i) by looking up the appropriate coefficient. Here k is an integer (discrete) model parameter and p_{−1}, p_0, p_1, p_2 are real (continuous) ones, which makes the above probabilities differentiable in the continuous parameters. We call the parameter k, the starting point of the random walk, the total valency. Note that τ_k ≥ k with probability 1, therefore one cannot model sentences shorter than k. To overcome this obstacle, we introduce a mixture model that consists of several models with various k values, combined as a convex linear combination:

P(τ = i) = Σ_j α_j P(τ_{k_j} = i),
where the parameters α_j are mixture coefficients: positive, and summing to 1. Every term in the mixture also has its own p_{−1}, p_0, p_1 and p_2 values (all positive, summing to one). In this way we can model sentences of length at least min_j k_j. Theoretically, there is no obstacle to allowing a different number of p values for different k values: the model can be a mixture of random walks in which the individual processes have different upward steps.
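Both the single-walk probabilities and the mixture can be computed by plain polynomial multiplication. The sketch below is our own illustrative code (the names walk_pmf and mixture_pmf are ours); it assumes the Lagrange-inversion form P(τ_k = i) = (k/i)·[y^{i−k}] F(y)^i with F(y) = p_{−1} + p_0 y + p_1 y² + p_2 y³, consistent with the Appendix:

```python
import numpy as np

def walk_pmf(k, p, i_max):
    """P(tau_k = i) for i = 0..i_max via Lagrange inversion:
    P(tau_k = i) = (k/i) * [y^(i-k)] F(y)^i,
    where F(y) = p_-1 + p_0 y + p_1 y^2 + p_2 y^3 and p = (p_-1, p_0, p_1, p_2)."""
    F = np.asarray(p, dtype=float)
    out = np.zeros(i_max + 1)
    Fi = np.ones(1)                            # running power F(y)^i
    for i in range(1, i_max + 1):
        Fi = np.convolve(Fi, F)                # F^i from F^(i-1)
        if k <= i:
            out[i] = k / i * Fi[i - k]         # coefficient of y^(i-k)
    return out

def mixture_pmf(components, i_max):
    """P(i) = sum_j alpha_j P(tau_{k_j} = i); components = (alpha_j, k_j, p_j)."""
    return sum(a * walk_pmf(k, p, i_max) for a, k, p in components)

# Sanity check against the simple +-1 walk (p_0 = p_2 = 0):
# P(tau_1 = 2n+1) = Catalan(n) * p^n * q^(n+1).
pmf = walk_pmf(1, (0.5, 0.0, 0.5, 0.0), 5)
mix = mixture_pmf([(0.7, 1, (0.6, 0.2, 0.2, 0.0)),
                   (0.3, 3, (0.5, 0.3, 0.1, 0.1))], 200)
```

Note that mix[i] = 0 for every i below min_j k_j, matching the observation that the mixture models sentences of length at least min_j k_j.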

Model analysis
Here we introduce and analyze the experimental setup that we will use in Section 4 to fit our model to various datasets. The raw data is a set X of positive integers, the sentence lengths, together with their corresponding weights (absolute frequencies) {n_x}_{x∈X}. We call n := Σ_{x∈X} n_x the size and X the support of the data. Since the model is differentiable in the continuous parameters (including the mixing coefficients), the direct approach is to perform gradient descent on the dissimilarity as an objective function to find the parameters. With fixed valency parameters k_j this is a constrained optimization task dist(P_sample, P_modeled) → min.
In some cases, especially for smaller datasets, we might find it expedient to bin the data; for example, Adi et al. (2017) use bins (5-8), (9-12), (13-16), (17-20), (21-25), (26-29), (30-33), and (34-70). On empirical data (for English we will use the BNC 1 and the UMBC Webbase 2 , and for other languages the SZTAKI corpus 3 ) this particular binning leaves a lot to be desired. We discuss this matter in Section 3.1, together with the choice of dissimilarity (figure of merit). An important consideration is that a higher number of mixture components fits the data better but brings more model parameters; this trade-off is discussed in Section 3.2.

Short utterances
Short utterances such as the imperatives Stop! or Help! are common both in spoken corpora and in written materials: in fiction, where incomplete sentences abound, especially in dialog intended to sound natural; and in nonfiction, where they are often encountered in titles, subtitles, and itemized lists. The prevailing tokenization convention, where punctuation is counted as equivalent to a full word, also has an effect on the distribution, more perceptible at the low end.
Since the eight bins used by Adi et al. (2017) actually ignore the very low (1-4) and very high (71+) ranges of the data, we will use ordinary deciles, setting the ten bins as the data dictates. In this regard, it is worth noting that in the 18 non-English corpora used in this study the neglected low bin contains on average 17.4% of the data (variance 6.3%, low 8.1% on Romanian, high 33.7% on Serbian_sr). Besides tokenization, perhaps the most important factor is morphological complexity, since in highly agglutinating languages a single word may suffice for what would require a multiword sentence in English, as in Hungarian elvihetlek 'I can give you a ride'.
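Setting the ten bins "as the data dictates" amounts to reading decile boundaries off the weighted length histogram. A minimal sketch (our own interface, not the paper's code):

```python
import numpy as np

def decile_bins(lengths, counts):
    """Upper edges that split a weighted length histogram into ten
    near-equal-mass parts, instead of using fixed cutoffs."""
    lengths = np.asarray(lengths)
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(lengths)
    cum = np.cumsum(counts[order]) / counts.sum()   # cumulative mass
    qs = np.arange(0.1, 1.0, 0.1)                   # the nine inner deciles
    return [int(lengths[order][np.searchsorted(cum, q)]) for q in qs]

# Toy data: lengths 1..50, each equally frequent.
edges = decile_bins(range(1, 51), [1] * 50)
```

With heavily skewed empirical length distributions the resulting bins are narrow at the short end and wide at the long end, unlike the fixed bins criticized above.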
At the high end (sentences with 71 or more words) the original binning omits on average 4.3% of the data (variance 3.1%, low 1.2% Nynorsk, high 15.1% Serbian_sr). Comparable figures for English are 3.7% (UMBC) and 14.4% (BNC) for the low bin, and 1.0% (UMBC) and 0.8% (BNC) for the high bin. The last column of Table 1 shows the length of the longest sentence in each of the subcorpora considered. Since the number of datapoints is high, ranging from 1.3m (Nynorsk) to 136.6m (UMBC), the conventional χ² test does not provide a good figure of merit on the original data (no fit is ever significant, especially as there is a lot of variation at the high end, where only a few lengths are extant), nor on the binned data, where every fit is highly significant. A better choice is the Kullback-Leibler divergence, but this still suffers from problems when the supports of the distributions do not coincide. In our case we have this problem both at the low end, where the model predicts P(τ = i) = 0 for i < k, and at the high end, where we predict positive (albeit astronomically small) probabilities of arbitrarily long sentences. To remedy this defect, we define the generalized KL divergence, gKL, as follows.
Definition 3.1 (Motivated by Theorem A.2). Let P and Q be probability measures over the same measurable space (X, Σ), both absolutely continuous with respect to a third measure dx, and let λ be P(supp(P) ∩ supp(Q)). Then gKL(P, Q) is obtained by taking the KL integral over supp(P) ∩ supp(Q), with a correction term in λ (see Theorem A.2). Clearly, gKL reduces to the usual KL divergence if the supports of the distributions coincide (λ = 1). Perhaps the high end of the distribution could be ignored, at least for English, at the price of losing 1% of the data, but ignoring the short sentences, 14.4% of the BNC, is hard to countenance. As a practical matter this means we need to bring in mixture components with total valency k < 4, and these each bring 4 parameters (the mixture weight and 3 p_i values) in tow. Obviously, the more components we use, the better the fit will be, so we need to control the trade-off between the two. In Section 3.2 we introduce a method derived from Bayesian model comparison (MacKay, 2003) that will remedy the zero modeled probabilities and resolve the model complexity trade-off.

Bayesian model comparison
If a dataset D has support X, with n_x > 0 the number of times length x occurred, the data size is |D| = Σ_{x∈X} n_x and the observed probabilities are p_x := n_x/|D|. Let H_i be the i-th model in some list of models. Each model is represented by a parameter vector w_i in a (continuous) parameter space, and its support supp(H_i) is not necessarily equal to X. Clearly, different H_i may have different supports, but a given model has the same support for every w_i. Model predictions are given by Q_{w_i}(x) := P(x | w_i), and the evidence for the i-th model is

P(D | H_i) = ∫ P(D | w_i, H_i) P(w_i | H_i) dw_i.

If one supposes that no model is preferred over any other (P(H_i) is constant), then the decision simplifies to finding the model that maximizes P(D | H_i). We make sure that no model parameter is preferred by setting a uniform prior:

P(w_i | H_i) = 1 / Vol(H_i).

We estimated this integral with Laplace's method, introducing f(w_i) := −(1/|D|) ln Q_{w_i}(D), i.e. the cross entropy (measured in nats).
Taking −(1/|D|) ln(·) of the evidence amounts to minimizing in i the following quantity:

f(w_i^*) + (d/(2|D|)) ln(|D|/(2π)) + (1/(2|D|)) ln det f″(w_i^*) + (1/|D|) ln Vol(H_i),   (17)

where d is the dimension of H_i (the number of parameters), f″ is the Hessian of f, and w_i^* = arg min_{w_i∈H_i} f(w_i) for a given i. Since the theoretical optimum of f(w_i) is the entropy of the data (ln 2 · H(D)), we subtract this quantity from Eq. 17 so that the term f(w_i^*) becomes the relative entropy (measured in nats), with a theoretical minimum of 0.
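The structure of this score, and the way the penalty terms vanish as the data grows, can be illustrated in a few lines. This is our own toy sketch of the Laplace-approximation form referenced as Eq. 17, not the paper's exact implementation; all arguments are in nats:

```python
import numpy as np

def evidence_score(rel_entropy_nats, d, n, log_volume, log_det_hessian):
    """Quantity to be *minimized* across candidate models:
    fit term + (d/2n) ln(n/2pi) + (1/2n) ln det Hessian + (1/n) ln Vol."""
    return (rel_entropy_nats
            + d / (2 * n) * np.log(n / (2 * np.pi))
            + log_det_hessian / (2 * n)
            + log_volume / n)

# Equal fit: the smaller model (fewer parameters d) wins.
small = evidence_score(0.01, d=4,  n=10**6, log_volume=0.0, log_det_hessian=0.0)
big   = evidence_score(0.01, d=12, n=10**6, log_volume=0.0, log_det_hessian=0.0)

# Huge n: the penalty vanishes, so the better-fitting model wins regardless of d.
precise = evidence_score(0.01, d=40, n=10**12, log_volume=0.0, log_det_hessian=0.0)
coarse  = evidence_score(0.02, d=4,  n=10**12, log_volume=0.0, log_det_hessian=0.0)
```

The second comparison illustrates the problem discussed next: for |D| → ∞ all but the first term become negligible, so the most precise model wins regardless of model size.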
To deal with the datapoints where Q_{w_i}(x) = 0, we introduce an augmented model:

Q(x) = λ Q_{w_i}(x) for x ∈ supp(H_i), and (1−λ) q_x for x ∈ X \ supp(H_i),   (18)

where λ is the covered probability of the sample, λ = Σ_{x∈supp(H_i)} p_x.
The newly introduced model parameters q = (q_x)_{x∈X\supp(H_i)} are also constrained: they have to be positive and sum to one, i.e. lie inside the probability simplex. After finding the optimum of q, modifying Eq. 17 with the auxiliary terms, and subtracting the entropy of the data (ln 2 · H(D)) as discussed above, one gets

gKL(P_sample, Q_{w_i^*}) + (d/(2|D|)) ln(|D|/(2π)) + (1/(2|D|)) ln det f″(w_i^*) + (1/|D|) ln Vol(H_i),   (19)

where d is the original model dimension plus the auxiliary model dimension. For sufficiently large corpora (|D| → ∞) all but the first term become negligible, meaning that the most precise model (in terms of gKL divergence) wins regardless of model size. One way out would be to choose an 'optimum corpus size' (Zipf, 1949), a move that has already drawn strong criticism (Powers, 1998) and would amount to little more than the addition of an extra hyperparameter to be set heuristically.
Another, more data-driven approach is based on the observation that corpora have inherent noise, measurable as the KL divergence between a random subcorpus and its complement (Kornai et al., 2013), both about the same size (half the original). Here we need to take into account the fact that large sentence lengths appear with frequency 1 or 0, so subcorpora D_1 and D_2 = D \ D_1 will not have exactly the same support as the original, and we need to use a symmetrized gKL: the inherent noise δ_D of a corpus D is (1/2)(gKL(D_1, D_2) + gKL(D_2, D_1)), where D_1 and D_2 are equal-size subsets of the original corpus D, and the gKL divergence is measured on their empirical distributions.
δ_D is largely independent of the choice of the subsets D_1, D_2 of the original corpus, and can easily be estimated from randomly sampled D_i. To the extent crawl data and classical corpora are sequentially structured (Curran and Osborne, 2002), we sometimes obtain noise estimates based on random D_i that differ from those obtained by comparing the first half of a corpus to the second, the procedure we followed here. In the Minimum Description Length (MDL) setting where this notion was originally developed, it is obvious that we need not approximate corpora to a precision better than δ, but in the Bayesian setup we use here matters are a bit more complicated.
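A minimal sketch of the noise estimate follows. This is our own code; for the gKL term it uses the simplest variant, restricting the sum to the common support, rather than the full Definition 3.1 with its λ correction:

```python
import numpy as np
from collections import Counter

def gkl(p_counts, q_counts):
    """Simplified gKL sketch: sum p ln(p/q) over the common support only."""
    common = set(p_counts) & set(q_counts)
    p_tot = sum(p_counts.values())
    q_tot = sum(q_counts.values())
    return sum((p_counts[x] / p_tot) *
               np.log((p_counts[x] / p_tot) / (q_counts[x] / q_tot))
               for x in common)

def inherent_noise(lengths, seed=0):
    """delta_D = (gKL(D1, D2) + gKL(D2, D1)) / 2 for a random half/half split."""
    rng = np.random.default_rng(seed)
    lengths = np.asarray(lengths)
    mask = rng.permutation(len(lengths)) < len(lengths) // 2
    d1 = Counter(lengths[mask].tolist())
    d2 = Counter(lengths[~mask].tolist())
    return 0.5 * (gkl(d1, d2) + gkl(d2, d1))

# Toy corpus: geometric "sentence lengths" instead of real data.
rng = np.random.default_rng(42)
sample = rng.geometric(0.3, size=20_000)
noise = inherent_noise(sample)
```

Even with this simplified gKL the symmetrized sum is non-negative, since each common term contributes (p − q) ln(p/q) ≥ 0.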
For a sample P with inherent noise δ, a model Q is called tolerable if gKL_δ(P, Q) := max(gKL(P, Q) − δ, 0) is zero. If gKL_δ is used instead of gKL in Eq. 19, then the model size d becomes important: if a model fits within δ, the first term becomes zero, and for large |D| the number of model parameters (including auxiliary parameters) will dominate the evidence. The limiting behavior of our evidence formula, with tolerance for inherent noise, is determined by the following observations: 1. Any tolerable model beats any non-tolerable one.
2. If two models are both tolerable and have different numbers of model parameters (including the auxiliary model), then the one with fewer parameters wins.
3. If two models are both tolerable and have the same number of parameters, then the model volume and the Hessian decide.
An interesting case arises when no model can reach the inherent noise: here we recover the original situation where the best fit wins, no matter the model size.

Results
A single model H_i fit to some dataset is identified by its order, defined as the number of upward steps the random walk can take at once (1, 2 or 3), marked by the number before the first dot; and by its mixture, a non-empty subset of {1, 2, 3, 4, 5} of valencies k of the individual components (for example, k1.2.4 marks {1, 2, 4} ⊆ {1, 2, 3, 4, 5}). Altogether, we trained 3 × 31 = 93 locally optimal models for each dataset and compared them with Eq. 19, except that gKL_δ was used with the appropriate tolerance.
We computed w^* with a (non-batched) gradient descent algorithm. 4 We used Adagrad with initial learning rate η = 0.9, starting from uniform p and α values, and iterated until every coordinate of the gradient fell within ±10^{−3}. The gradient descent typically took 10²-10³ iterations to reach a plateau, but about 0.1% of the models were more sensitive and required a smaller learning rate η = 0.1 with more (10k) iterations.
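The fitting loop can be sketched as follows. This is our own toy code: plain Adagrad with the stopping rule above, demonstrated on a simple quadratic objective rather than the actual constrained gKL objective:

```python
import numpy as np

def adagrad_minimize(grad, w0, lr=0.9, tol=1e-3, max_iter=100_000):
    """Non-batched Adagrad, iterated until every gradient coordinate
    falls within +-tol."""
    w = np.array(w0, dtype=float)
    G = np.zeros_like(w)                       # accumulated squared gradients
    for _ in range(max_iter):
        g = grad(w)
        if np.all(np.abs(g) <= tol):           # stopping rule from the text
            break
        G += g * g
        w -= lr * g / (np.sqrt(G) + 1e-8)      # per-coordinate adaptive step
    return w

# Toy objective f(w) = ||w - 1||^2 / 2, with gradient w - 1.
w = adagrad_minimize(lambda w: w - 1.0, [0.0, 3.0])
```

In the actual setup the positivity and sum-to-one constraints on p and α would additionally have to be enforced, e.g. by reparametrization.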

Validation
The model comparison methodology was first tested on artificially generated data. We generated 1M+1M samples of pseudo-random walks with parameters p_{−1} = 0.5, p_0 = p_1 = 0.25 (at most one step upward) and k = 3 (no mixture), and obtained the inherent noise and the length distribution. The inherent noise was about 3.442·10^{−4} nats. We trained all 93 models and compared them as described above.
The validation data size is 2·10^6, but we also replaced |D| in Eq. 19 with a hyperparameter n, i.e. we pretended that the sample was bigger (or smaller) while keeping the same empirical distribution. We did this with the goal of imitating the adverse effect of an 'optimum corpus size'.
As seen in Table 2, the true model wins. We also tested the case where the true model was simply excluded from the competing models: here the tolerance is needed to ensure a stable result as n → ∞. As there are strong conceptual similarities between MDL methods and the Bayesian approach (MacKay, 2003), we also compared the models with MDL, using the same locally optimal parameters as before, but encoding them in bits. To this end we used a technique from Kornai et al. (2013) where all of the continuous model parameters are discretized on a log scale unless the discretization error exceeds the tolerance. The model that requires the fewest bits wins if it fits within tolerance. (The constraints are hard-coded in this model, meaning that we re-normalized the parameters after discretization.) In the artificial test example, the model 1.k3 wins, which is also the winner of the Bayesian comparison. If the true model is excluded, the winner is 1.k2.3.
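The log-scale discretization step can be sketched as follows. This is our own illustrative code, following the description above (quantize the logs, then re-normalize to keep the constraints hard-coded), not the exact procedure of Kornai et al. (2013):

```python
import numpy as np

def discretize_log(p, bits):
    """Quantize a probability vector on a log scale with 2**bits levels,
    then renormalize so the sum-to-one constraint stays hard-coded."""
    p = np.asarray(p, dtype=float)
    logs = np.log(p)
    lo, hi = logs.min(), logs.max()
    if hi == lo:                               # already uniform
        return p / p.sum()
    levels = 2 ** bits - 1
    q = np.round((logs - lo) / (hi - lo) * levels)   # integer codes
    approx = np.exp(lo + q / levels * (hi - lo))
    return approx / approx.sum()

p = np.array([0.5, 0.25, 0.125, 0.125])
coarse = discretize_log(p, 2)
fine = discretize_log(p, 8)
```

Spending more bits per parameter shrinks the quantization error, so the bit budget can be lowered until the error first exceeds the tolerance.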

Empirical data
Let us now turn to the natural language corpora summarized in Table 1. Not only are the webcrawl datasets larger than the BNC sections, they are also somewhat noisier and contain suspiciously long sentences. To ease the computation, we excluded sentences longer than 1,000 tokens; this cutoff is always well above the 99.9th percentile given in the next-to-last column of Table 1. The results, summarized in Table 3, show several major tendencies.
First, most of the models (115 out of 145, 79.7%) fit the sentence length distribution of the entire subcorpus better than the empirical distribution of the first half fits that of the second half. When this criterion is not met for the best model, i.e. the gKL distance of the model from the data is above the internal noise, the ill-fitting model form is shown in italics.
Second, this phenomenon of not achieving a tolerable fit is seen primarily (20 out of 30 cases) in the first column of Table 3, corresponding to a radically undersampled condition n = 1,000, and (9 out of 30) in a somewhat undersampled condition n = 10,000. Only on the largest corpus (UMBC, 136.6m sentences) does the phenomenon persist to n = 100,000, but even there, by setting n = 10^6, still less than 1% of the actual dataset size, we get a good fit, within two thirds of the inherent noise.
Third, and perhaps most important, for sufficiently large n the Bayesian model comparison technique we advocate here actually selects rather simple models, with order 1 (no ditransitives, a matter we return to in Section 5) and only one or two mixture components. We emphasize that 'sufficiently large' is still in the realistic range: one does not have to take the limit n → ∞ to obtain the correct model. The last two columns (gigadata and infinity) always coincide, and for 21 of the 29 corpora the 1M column already yields the same result.
For a baseline, we discretized the naive (nonparametric) model in the same way. Not only does this quantization require on average two bits more, we also have to countenance a considerably larger number of parameters to specify the distribution within the inherent noise, so that the random walk model offers a size savings of at least 95.3% (BNC-A) to 99.7% (Polish).
With the random walk model, the total number of bits required for characterizing the most complex distributions (66 for BNC-A and 60 for Spanish) appears to be more related to the high consistency (low internal noise) of these corpora than to the complexity of the length distributions.

Conclusion
At the outset of the paper we criticized the standard mixture Poisson length model of Eq. 1 for lack of a clear genesis: there is no obvious candidate for 'arrivals' or for the mixture. In contrast, our random walk model is based on the suggestive idea of total valency, the 'number of things you want to say', and we see some rather clear methods for probing this further. First, we have extensive lexical data on the valency of individual words, and know in advance that e.g. color adjectives will be dependent on nouns, while relational nouns such as sister can bring further nouns or NPs. Combining the lexical knowledge with word frequency statistics is somewhat complicated by the fact that a single word form may have different senses with different valency frames, but these cause no problems for a statistical model that convolves the two distributions.
Third, we can extend the analysis in a typologically sound manner to morphologically more complex languages. Using a morphologically analyzed Hungarian corpus (Oravecz et al., 2014), we measured the per-word morpheme distribution and the per-sentence word distribution, and found that the random sum of 'number of words in a sentence' independent copies of 'number of morphemes in a word' estimates the per-sentence morpheme distribution within the inherent noise.
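The random-sum estimate is a compound distribution, computable by repeated convolution. A minimal sketch (our own code, with toy distributions standing in for the measured Hungarian ones):

```python
import numpy as np

def random_sum_pmf(words_pmf, morphemes_pmf, m_max):
    """PMF of a random sum: morphemes per sentence as the sum of
    'words per sentence' i.i.d. copies of 'morphemes per word'.
    words_pmf[n] = P(N = n); morphemes_pmf[j] = P(M = j)."""
    out = np.zeros(m_max + 1)
    conv = np.array([1.0])                     # distribution of a sum of 0 copies
    for pn in words_pmf:                       # pn = P(N = n), n = 0, 1, 2, ...
        if pn > 0:
            m = min(len(conv), m_max + 1)
            out[:m] += pn * conv[:m]
        conv = np.convolve(conv, morphemes_pmf)   # one more copy of M
    return out

# Toy example: 1 or 2 words per sentence, 1 or 2 morphemes per word.
words = np.array([0.0, 0.5, 0.5])
morph = np.array([0.0, 0.5, 0.5])
total = random_sum_pmf(words, morph, 4)
```

The mean of the compound distribution is E[N]·E[M], which gives a quick check of the computation.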
Another avenue of research, alluded to above, is the study of subject- and object-control verbs and infinitival constructions, where single nouns or NPs can fill more than one open dependency. This would complicate the calculations in Eq. 5 in a non-trivial way; we plan to extend our model in future work.
One of the authors (Kornai and Tuza, 1992) has already suggested that the number of dependencies open at any given point in the sentence must be subject to the limitations of short-term memory (Miller, 1956); this may act as a reflecting barrier that keeps asymptotic sentence length smaller than the pure random walk model would suggest. In particular, Bernoulli and other well-known models predict exponential decay at the high end, whereas our data shows polynomial decay proportional to n^{−C}, with C somewhere around 4 (in the 3-5 range). This is one area where our corpora are too small to draw reliable conclusions, but overall we should emphasize that the corpora already collected (and, in the case of UD treebanks, already analyzed) offer a rich empirical field for studying sentence length phenomena, and the model presented here makes it possible to use statistics to shed light on the underlying grammatico-semantic structure.
References

Yule, G. U. (1939). On sentence-length as a statistical characteristic of style in prose: with applications to two cases of disputed authorship. Biometrika, 30(3/4), 363-390.

A Appendix

Theorem A.1. Let f be defined implicitly by x = f(x)/F(f(x)) with F(0) > 0. Then

[x^i] f(x)^k = (k/i) [y^{i−k}] F(y)^i.

Proof. By the Lagrange-Bürmann formula with composition function H(x) = x^k.

Theorem A.2. Taking −(1/|D|) ln(·) of the evidence amounts to minimizing Eq. 17, where f(w_i) is the cross entropy of the measured and the modeled distributions. If the augmented model (18) is used, then Eq. 19 follows.

Proof.
Using Laplace's method, and taking −(1/n) ln(·) for scaling (which does not affect the relative order of the models), the uniform prior 1/Vol(H_i) yields Eq. 17. As for the augmented model, its parameters are the concatenation of the original and the auxiliary parameters, so the overall Hessian is the block-diagonal matrix of the original and the auxiliary Hessians. Similarly, the overall model volume is the product of the original and the auxiliary volumes, and, trivially, the logarithm of the product is the sum of the logarithms.
Since the auxiliary model can fit the uncovered part perfectly, p_x = (1−λ) · q_x for x ∉ supp(H_i); see (18), where λ is the covered probability of the sample.
where d is the overall number of parameters. Further, if one subtracts the entropy of the sample, then only the first two terms remain.