Taylor’s law for Human Linguistic Sequences

Taylor’s law describes the fluctuation characteristics underlying a system in which the variance of an event within a time span grows by a power law with respect to the mean. Although Taylor’s law has been applied in many natural and social systems, its application to language has been scarce. This article describes a new way to quantify Taylor’s law in natural language and conducts Taylor analysis of over 1100 texts across 14 languages. We found that the Taylor exponents of natural language written texts exhibit almost the same value. The exponent was also compared for other language-related data, such as child-directed speech, music, and programming languages. The results show how the Taylor exponent serves to quantify the fundamental structural complexity underlying linguistic time series. The article also shows the applicability of these findings in evaluating language models.


Introduction
Taylor's law characterizes how the variance of the number of events for a given time and space grows with respect to the mean, forming a power law. It is a quantification method for the clustering behavior of a system. Since the pioneering studies of this concept (Smith, 1938; Taylor, 1961), a substantial number of studies have been conducted across various domains, including ecology, life science, physics, finance, and human dynamics, as summarized well by Eisler, Bartos, and Kertész (2007).
More recently, Cohen and Xu (2015) reported Taylor exponents for random sampling from various distributions, and Calif and Schmitt (2015) reported Taylor's law in wind energy data using a non-parametric regression. Those two papers also refer to research about Taylor's law in a wide range of fields.
Despite such diverse application across domains, there has been little analysis based on Taylor's law in studying natural language. The only such report, to the best of our knowledge, is Gerlach and Altmann (2014), but they measured the mean and variance by means of the vocabulary size within a document. This approach essentially differs from the original concept of Taylor analysis, which fundamentally counts the number of events, and thus the theoretical background of Taylor's law as presented in Eisler, Bartos, and Kertész (2007) cannot be applied to interpret the results.
For the work described in this article, we applied Taylor's law to texts, in a manner close to the original concept. We considered lexical fluctuation within texts, which involves the co-occurrence and burstiness of word alignment. The results can thus be interpreted according to the analytical results of Taylor's law, as described later. We found that the Taylor exponent is indeed a characteristic of texts and is universal across various kinds of texts and languages. These results are shown here for data including over 1100 single-author texts across 14 languages and large-scale newspaper data.
Moreover, we found that the Taylor exponents for other symbolic sequential data, including child-directed speech, programming language code, and music, differ from those for written natural language texts, thus distinguishing different kinds of data sources. The Taylor exponent in this sense could categorize and quantify the structural complexity of language. The Chomsky hierarchy (Chomsky, 1956) is, of course, the most important framework for such categorization. The Taylor exponent is another way to quantify the complexity of natural language: it allows for continuous quantification based on lexical fluctuation.
Since the Taylor exponent can quantify and characterize one aspect of natural language, our findings are applicable in computational linguistics to assess language models. At the end of this article, in §5, we report how the most basic character-based long short-term memory (LSTM) unit produces texts with a Taylor exponent of 0.50, equal to that of a sequence of independent and identically distributed random variables (an i.i.d. sequence). This shows how such models are limited in producing consistent co-occurrence among words, as compared with a real text. Taylor analysis thus provides a possible direction to reconsider the limitations of language models.

Related Work
This work can be situated as a study to quantify the complexity underlying texts. As summarized in (Tanaka-Ishii and Aihara, 2015), measures for this purpose include the entropy rate (Takahira, Tanaka-Ishii, and Lukasz, 2016; Bentz et al., 2017) and those related to the scaling behaviors of natural language. Regarding the latter, certain power laws are known to hold universally in linguistic data. The most famous among these are Zipf's law (Zipf, 1965) and Heaps' law (Heaps, 1978). Other kinds of power laws, different from Zipf's law, are obtained through various methods of fluctuation analysis, but the question of how to quantify the fluctuation existing in language data has been controversial. Our work is situated as one such case of fluctuation analysis.
In real data, the occurrence timing of a particular event is often biased in a bursty, clustered manner, and fluctuation analysis quantifies the degree of this bias. Originally, this was motivated by a study of how floods of the Nile River occur in clusters (i.e., many floods coming after an initial flood) (Hurst, 1951). Such clustering phenomena have been widely reported in both natural and social domains (Eisler, Bartos, and Kertész, 2007).
Fluctuation analysis for language originates in (Ebeling and Pöeschel, 1994), which applied the approach to characters. That work corresponds to observing the average of the variances of each character's number of occurrences within a time span. Their method is strongly related to ours but differs in two respects: (1) Taylor analysis considers the variance with respect to the mean, rather than time; and (2) Taylor analysis does not average results over all elements. Because of these differences, the method in (Ebeling and Pöeschel, 1994) cannot distinguish real texts from an i.i.d. process when applied to word sequences (Takahashi and Tanaka-Ishii, 2018).
Event clustering phenomena cause a sequence to resemble itself in a self-similar manner. Therefore, studies of the fluctuation underlying a sequence can take another form of long-range correlation analysis, to consider the similarity between two subsequences underlying a time series. This approach requires a function to calculate the similarity of two sequences, and the autocorrelation function (ACF) is the main function considered. Since the ACF only applies to numerical data, both Altmann, Pierrehumbert, and Motter (2009) and Tanaka-Ishii and Bunde (2016) applied long-range correlation analysis by transforming text into intervals and showed how natural language texts are long-range correlated. Another recent work (Lin and Tegmark, 2016) proposed using mutual information instead of the ACF. Mutual information, however, cannot detect the long-range correlation underlying texts. All these works studied correlation phenomena via only a few texts and did not show any underlying universality with respect to data and language types. One reason is that analysis methods for long-range correlation are nontrivial to apply to texts.
Overall, the analysis based on Taylor's law in the present work belongs to the former approach of fluctuation analysis and shows the law's vast applicability and stability for written texts and even beyond, quantifying universal complexity underlying human linguistic sequences.

Proposed method
Given a set of elements W (words), let X = X_1, X_2, ..., X_N be a discrete time series of length N, where X_i ∈ W for all i = 1, 2, ..., N, i.e., each X_i represents a word. For a given segment length ∆t ∈ ℕ (a positive integer), a data sample X is segmented by the length ∆t. The number of occurrences of a specific word w_k ∈ W is counted for every segment, and the mean µ_k and standard deviation σ_k across segments are obtained. Doing this for all word kinds w_1, ..., w_|W| ∈ W gives the distribution of σ with respect to µ. Following a previous work (Eisler, Bartos, and Kertész, 2007), in this article Taylor's law is defined to hold when µ and σ are correlated by a power law in the following way:
$$\sigma \propto \mu^{\alpha}.$$
Experimentally, the Taylor exponent α is known to take a value within the range of 0.5 ≤ α ≤ 1.0 across a wide variety of domains as reported in (Eisler, Bartos, and Kertész, 2007), including finance, meteorology, agriculture, and biology. Mathematically, it is analytically proven that α = 0.5 for an i.i.d. process, and the proof is included as Supplementary Material. On the other hand, α = 1.0 when all segments always contain the same proportion of the elements of W. For example, suppose that W = {a, b}. If b always occurs twice as often as a in all segments (e.g., three a and six b in one segment, two a and four b in another, etc.), then both the mean and standard deviation for b are twice those for a, so the exponent is 1.0.
In a real text, this cannot occur for all W, so α < 1.0 for natural language text. Nevertheless, for a subset of words in W, this could happen, especially for a template-like sequence. For instance, consider a programming statement: while (i < 1000) do i--. Here, the words while and do always occur once, whereas i always occurs twice. This example shows that the exponent indicates how consistently words depend on each other in W, i.e., how words co-occur systematically in a coherent manner, thus indicating that the Taylor exponent is partly related to grammaticality.
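The two limiting values of the exponent can be motivated by a short calculation. The following is only a sketch under stated assumptions (words drawn independently with a fixed probability p_k for the first case, and a hypothetical per-word constant r_k with a common per-segment factor m_j for the second); it is not the proof given in the Supplementary Material.

```latex
% Case 1 (assumption: i.i.d. words, word w_k drawn with probability p_k).
% The count n_k of w_k in a segment of length \Delta t is then binomial.
\begin{align*}
\mu_k &= \langle n_k \rangle = \Delta t \, p_k, \\
\sigma_k^2 &= \Delta t \, p_k (1 - p_k) = \mu_k (1 - p_k) \approx \mu_k
  \quad (p_k \ll 1), \\
\sigma_k &\approx \mu_k^{1/2}
  \quad\Longrightarrow\quad \alpha = 0.5 .
\end{align*}
% Case 2 (assumption: fixed proportions, i.e. segment j contains
% n_{k,j} = r_k m_j occurrences of w_k, with r_k constant for each word
% and m_j shared by all words in segment j).
\begin{align*}
\mu_k = r_k \langle m \rangle, \qquad
\sigma_k = r_k \, \sigma_m
\quad\Longrightarrow\quad
\sigma_k = \frac{\sigma_m}{\langle m \rangle} \, \mu_k, \qquad \alpha = 1.0 .
\end{align*}
```

In the W = {a, b} example, r_b = 2 r_a, so both µ and σ double from a to b, and the two points lie on a line of slope 1 in log-log coordinates.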
To measure the Taylor exponent α, the mean and standard deviation are computed for every word kind and then plotted in log-log coordinates. (In this work, words are not lemmatized, e.g., "say," "said," and "says" are all considered different words. This choice was made because the Taylor exponent considers systematic co-occurrence of words, and idiomatic phrases should thus be considered in their original forms.) The number of points in this work was the number of different words. We fitted the points to a linear function in log-log coordinates by the least-squares method. We naturally took the logarithm of both cµ^α and σ to estimate the exponent, because Taylor's law is a power law. The coefficient ĉ and exponent α̂ are then estimated as the following:
$$\hat{c}, \hat{\alpha} = \arg\min_{c, \alpha} \epsilon(c, \alpha), \qquad \epsilon(c, \alpha) = \sqrt{\frac{1}{|W|} \sum_{k=1}^{|W|} \left( \log \sigma_k - \log c\mu_k^{\alpha} \right)^2}.$$
This fit function could be a problem depending on the distribution of errors between the data points and the regression line. As seen later, the error distribution seems to differ with the kind of data: for a random source the error seems Gaussian, and so the above formula is relevant, whereas for real data, the distribution is biased. Changing the fit function according to the data source, however, would cause other essential problems for fair comparison. Here, because Cohen and Xu (2015) reported that most empirical works on Taylor's law used least-squares regression (including their own), this work also uses the above scheme, with the error defined as ϵ(ĉ, α̂).
[...] large representative archives, parsed, and stripped of natural language comments), and 12 pieces of musical data (long symphonies and so forth, transformed from MIDI into text with the software SMF2MML (http://shaw.la.coocan.jp/smf2mml/), with annotations removed). As for the randomized data listed in the last block, we took the text of Moby Dick and generated 10 different shuffled samples and bigram-generated sequences. We also introduced LSTM-generated texts to consider the utility of our findings, as explained in §5.
Figure 1 shows typical distributions for natural language texts, with two single-author texts ((a) and (b)) and two multiple-author texts (newspapers, (c) and (d)), in English and Chinese, respectively. The segment size was ∆t = 5620 words, i.e., each segment had 5620 words, and the horizontal axis indicates the averaged frequency of a specific word within a segment of 5620 words.
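The measurement procedure described in this section (segment the sequence, count each word per segment, then fit log σ against log µ by least squares) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `taylor_points` and `fit_taylor` are hypothetical, and two choices are assumptions made here, namely that a trailing partial segment is dropped and that words with zero standard deviation are skipped because their logarithm is undefined.

```python
import math
from collections import Counter


def taylor_points(words, dt):
    """Return {word: (mean, std)} of per-segment counts for segments of length dt."""
    n_seg = len(words) // dt  # assumption: a trailing partial segment is dropped
    if n_seg < 2:
        raise ValueError("need at least two full segments")
    sums, sq_sums = Counter(), Counter()
    for s in range(n_seg):
        for w, c in Counter(words[s * dt:(s + 1) * dt]).items():
            sums[w] += c
            sq_sums[w] += c * c
    points = {}
    for w in sums:
        mu = sums[w] / n_seg
        var = sq_sums[w] / n_seg - mu * mu  # population variance across segments
        points[w] = (mu, math.sqrt(max(var, 0.0)))
    return points


def fit_taylor(points):
    """Least-squares fit of log(sigma) = log(c) + alpha * log(mu).

    Returns (c, alpha, err), where err is the root-mean-square log residual.
    """
    # assumption: points with sigma == 0 are skipped (log undefined)
    xy = [(math.log(mu), math.log(sd)) for mu, sd in points.values() if sd > 0]
    n = len(xy)
    if n < 2:
        raise ValueError("need at least two points with nonzero deviation")
    mx = sum(x for x, _ in xy) / n
    my = sum(y for _, y in xy) / n
    sxx = sum((x - mx) ** 2 for x, _ in xy)
    sxy = sum((x - mx) * (y - my) for x, y in xy)
    alpha = sxy / sxx
    log_c = my - alpha * mx
    err = math.sqrt(sum((y - log_c - alpha * x) ** 2 for x, y in xy) / n)
    return math.exp(log_c), alpha, err
```

For a tokenized text `words` (a list of strings), `fit_taylor(taylor_points(words, 5620))` would return the fitted coefficient, the exponent, and the residual error for the segment size used in this article.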

Taylor Exponents for Real Data
The points at the upper right represent the most frequent words, whereas those at the lower left represent the least frequent. Although the plots exhibited different distributions, they could globally be considered roughly aligned in a power-law manner. This finding is non-trivial, as seen in other analyses based on Taylor's law (Eisler, Bartos, and Kertész, 2007). The exponent α was almost the same even though English and Chinese are different languages using different kinds of script. As explained in §3.1, the Taylor exponent indicates the degree of consistent co-occurrence among words. The value of 0.58 obtained here suggests that the words of natural language texts are not strongly or consistently coherent with respect to each other. Nevertheless, the value is well above 0.5, and for the real data listed in Table 1 (first to third blocks), not a single sample gave an exponent as low as 0.5.
Although the overall global tendencies in Figure 1 followed power laws, many points deviated significantly from the regression lines. The words with the greatest fluctuation were often keywords. For example, among words in Moby Dick with large µ, those with the largest σ included whale, captain, and sailor, whereas those with the smallest σ included functional words such as to, that, and with.
The Taylor exponent depended only slightly on the data size. Figure 2 shows this dependency for the two largest data sets used, The New York Times (NYT, 1.5 billion words) and The Mainichi (24 years) newspapers. When the data size was increased, the exponent exhibited a slight tendency to decrease. For the NYT, the decrease seemed to have a lower limit, as the figure shows that the exponent stabilized at around 10^7 words.
Figure 2: Taylor exponent α̂ (vertical axis) calculated for the two largest texts: The New York Times and The Mainichi newspapers. To evaluate the exponent's dependence on the text size, parts of each text were taken and the exponents were calculated for those parts, with points taken logarithmically. The window size was ∆t = 5620. As the text size grew, the Taylor exponent slightly decreased.
The reason for this decrease can be explained as follows. The Taylor exponent becomes larger when some words occur in a clustered manner. Making the text size larger increases the number of segments (since ∆t was fixed in this experiment). If the number of clusters does not increase as fast as the increase in the number of segments, then the number of clusters per segment becomes smaller, leading to a smaller exponent. In other words, the influence of each consecutive co-occurrence of a particular word decays slightly as the overall text size grows.
Analysis of different kinds of data showed how the Taylor exponent differed according to the data source. Figure 3 shows plots for samples from enwiki8 (tagged Wikipedia), the child-directed speech of Thomas (taken from CHILDES), programming language data sets, and music. The distributions appear different from those for the natural language texts, and the exponents were significantly larger. This means that these data sets contained expressions with fixed forms much more frequently than did the natural language texts.
Figure 4 summarizes the overall picture among the different data sources. The median and quantiles of the Taylor exponent were calculated for the different kinds of data listed in Table 1. The first two boxes show results with an exponent of 0.50. These results were each obtained from 10 random samples of the randomized sequences. We will return to these results in the next section.
The remaining boxes show results for real data. The exponents for texts from Project Gutenberg ranged from 0.53 to 0.68. Figure 5 shows a histogram of these texts with respect to the value of α. The number of texts decreased significantly at a value of 0.63, showing that the distribution of the Taylor exponent was rather tight. The kinds of texts at the upper limit of exponents for Project Gutenberg included structured texts of fixed style, such as dictionaries, lists of histories, and Bibles.
The majority of texts were in English, followed by French and then other languages, as listed in Table 1. Whether α distinguishes languages is a difficult question. The histogram suggests that Chinese texts exhibited larger values than did texts in Indo-European languages. We conducted a statistical test to evaluate whether this difference was significant as compared to English. Since the numbers of texts were very different, we used the non-parametric Brunner-Munzel test, among various possible methods, to test the null hypothesis that α was equal for the two distributions (Brunner and Munzel, 2000). The p-value for Chinese was p = 1.24 × 10^−16, thus rejecting the null hypothesis at the significance level of 0.01. This confirms that α was generally larger for Chinese texts than for English texts. Similarly, the null hypothesis was rejected for Finnish and French, but it was not rejected for German and Japanese at the 0.01 significance level. Since the null hypothesis was not rejected for Japanese despite its large difference from English, we could not conclude whether the Taylor exponent distinguishes languages.
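To make the comparison of two sets of exponents concrete, the Brunner-Munzel statistic can be sketched in plain Python as below. This is an illustrative sketch, not the procedure used for the reported p-values: the function names are hypothetical, and the p-value here uses the large-sample normal approximation, which is an assumption (the article does not state which reference distribution was used). SciPy also provides `scipy.stats.brunnermunzel`, which may be preferable in practice.

```python
import math


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def brunner_munzel(x, y):
    """Brunner-Munzel statistic with a two-sided normal-approximation p-value.

    A positive statistic means values in y tend to be larger than those in x.
    """
    nx, ny = len(x), len(y)
    pooled = midranks(list(x) + list(y))
    rcx, rcy = pooled[:nx], pooled[nx:]        # ranks within the pooled sample
    rx, ry = midranks(list(x)), midranks(list(y))  # ranks within each sample
    mcx, mcy = sum(rcx) / nx, sum(rcy) / ny
    mx, my = (nx + 1) / 2, (ny + 1) / 2        # mean of within-sample ranks
    sx = sum((a - b - mcx + mx) ** 2 for a, b in zip(rcx, rx)) / (nx - 1)
    sy = sum((a - b - mcy + my) ** 2 for a, b in zip(rcy, ry)) / (ny - 1)
    denom = (nx + ny) * math.sqrt(nx * sx + ny * sy)
    if denom == 0:
        raise ValueError("zero rank variance; statistic undefined")
    w = nx * ny * (mcy - mcx) / denom
    p = math.erfc(abs(w) / math.sqrt(2))  # two-sided tail of the standard normal
    return w, p
```

Here `x` and `y` would be the lists of exponents α̂ measured for the texts of two languages; a small p-value indicates that the exponents in one group tend to be larger than in the other.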
Turning to the last four columns of Figure 4, representing the enwiki8, child-directed speech (CHILDES), programming language, and music data, the Taylor exponents clearly differed from those of the natural language texts. Given the template-like nature of these four data sources, the results were somewhat expected. The kind of data thus might be distinguishable using the Taylor exponent. To confirm this, however, would require assembling a larger data set. Applying this approach with Twitter data and adult utterances would produce interesting results and remains for our future work.
The Taylor exponent also differed according to ∆t, and Figure 6 shows the dependence of α̂ on ∆t. For each kind of data shown in Figure 4, the mean exponent is plotted for various ∆t. As reported in (Eisler, Bartos, and Kertész, 2007), the exponent is known to grow when the segment size gets larger. The reason is that words occur in a bursty, clustered manner at all length scales: no matter how large the segment size becomes, a segment will include either many or few instances of a given word, leading to larger variance growth. This phenomenon suggests how word co-occurrences in natural language are self-similar.
The Taylor exponent is initially 0.5 when the segment size is very small. This can be analytically explained as follows (Eisler, Bartos, and Kertész, 2007). Consider the case of ∆t = 1. Let n be the frequency of a particular word in a segment. We have ⟨n⟩ ≪ 1.0, because the probability of a specific word appearing in a segment becomes very small. Because ⟨n⟩² ≈ 0, σ² = ⟨n²⟩ − ⟨n⟩² ≈ ⟨n²⟩. Because n = 1 or 0 (with ∆t = 1), ⟨n²⟩ = ⟨n⟩ = µ. Thus, σ² ≈ µ, i.e., σ ≈ µ^0.5, which gives α = 0.5.
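The ∆t = 1 argument above can be checked numerically on a toy sequence. The sketch below (the function name `unit_segment_stats` and the toy sentence are hypothetical) treats every position as a segment of length 1, so each word's count per segment is 0 or 1, and computes the mean and variance directly from those counts.

```python
def unit_segment_stats(words):
    """Mean and variance of each word's count over segments of length 1."""
    n = len(words)
    stats = {}
    for w in set(words):
        counts = [1 if x == w else 0 for x in words]  # one count per segment
        mu = sum(counts) / n
        var = sum((c - mu) ** 2 for c in counts) / n
        stats[w] = (mu, var)
    return stats


if __name__ == "__main__":
    toy = "the whale and the captain saw the white whale".split()
    for word, (mu, var) in sorted(unit_segment_stats(toy).items()):
        # var equals mu * (1 - mu), which approaches mu when mu is small
        print(word, round(mu, 4), round(var, 4), round(mu * (1 - mu), 4))
```

In a toy sequence this short the means are not small, so the variance is visibly below µ; in a real text with a large vocabulary, µ ≪ 1 for almost every word and the variance is close to µ.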
Overall, the results show the possibility of applying Taylor's exponent to quantify the complexity underlying coherence among words. Grammatical complexity was formalized by Chomsky via the Chomsky hierarchy (Chomsky, 1956), which describes grammar via rewriting rules. The constraints placed on the rules distinguish four different levels of grammar: regular, context-free, context-sensitive, and phrase structure. As indicated in (Badii and Politi, 1997), however, this does not quantify the complexity on a continuous scale. For example, we might want to quantify the complexity of child-directed speech as compared to that of adults, and this could be addressed in only a limited way through the Chomsky hierarchy. Another point is that the hierarchy is sentence-based and does not consider fluctuation in the kinds of words appearing.
Figure 4: Box plots of the Taylor exponents for different kinds of data. Each point represents one sample, and samples from the same kind of data are contained in each box plot. The first two boxes are for the randomized data, while the remaining boxes are for real data, including both the natural language texts and language-related sequences. Each box ranges between the quantiles, with the middle line indicating the median, the whiskers showing the maximum and minimum, and some extreme values lying beyond.
Figure 5: Histogram of Taylor exponents for long texts in Project Gutenberg (1129 texts). The legend indicates the languages, in frequency order. Each bar shows the number of texts with that value of α̂. Because of the skew of languages in the original conception of Project Gutenberg, the majority of the texts are in English, shown in blue, whereas texts in other languages are shown in other colors. The histogram shows how the Taylor exponent ranged fairly tightly around the mean, and natural language texts with an exponent larger than 0.63 were rare.

Evaluation of Machine-Generated Text by the Taylor Exponent
The main contribution of this paper is the findings of Taylor's law behavior for real texts as presented thus far. This section explains the applicability of these findings, through results obtained with baseline language models. As mentioned previously, i.i.d. mathematical processes have a Taylor exponent of 0.50. We show here that, even if a process is not trivially i.i.d., the exponent often takes a value of 0.50 for random processes, including texts produced by standard language models such as n-gram based models. A more complete work in this direction is reported in (Takahashi and Tanaka-Ishii, 2018).
Figure 7 shows samples from each of two simple random processes. Figure 7a shows the behavior of a shuffled text of Moby Dick. Obviously, since the sequence was almost i.i.d. following a Zipf distribution, the Taylor exponent was 0.50. Given that the Taylor exponent becomes larger for a sequence with words dependent on each other, as explained in §3, we would expect that a sequence generated by an n-gram model would exhibit an exponent larger than 0.50. The simplest such model is the bigram model, so a sequence of 300,000 words was probabilistically generated using a bigram model of Moby Dick. Figure 7b shows the Taylor analysis, revealing that the exponent remained 0.50. This result does not depend much on the quality of the individual samples. The first and second box plots in Figure 4 show the distribution of exponents for 10 different samples for the shuffled and bigram-generated texts, respectively. The exponents were all around 0.50, with small variance.
Figure 6: Growth of α̂ with respect to ∆t, averaged across data sets within each data kind. The plot labeled "random" shows the average for the two datasets of randomized text from Moby Dick (shuffled and bigrams, as explained in §5). Since this analysis required a large amount of computation, for the large data sets (such as newspaper and programming language data), 4 million words were taken from each kind of data and used here. When ∆t was small, the Taylor exponent was close to 0.5, as theoretically described in the main text. As ∆t was increased, the value of α̂ grew. The maximum ∆t was about 10,000, or about one-tenth of the length of one long literary text. For the kinds of data investigated here, α̂ grew almost linearly. The results show that, at a given ∆t, the Taylor exponent has some capability to distinguish different kinds of text data.
Figure 8: Taylor analysis for two texts produced by standard neural language models: (a) a stacked LSTM model that learned the complete works of Shakespeare; and (b) a machine translation of Les Misérables (originally in French, translated into English), from a neural language model. Panel titles: (a) Text produced by LSTM (3-layer stacked character-based); (b) Machine-translated text using neural language model.
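The two randomized baselines discussed here can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact generation procedure used for the experiments: the function names are hypothetical, the bigram model is a plain maximum-likelihood model without smoothing, and restarting from a random word when the current word has no observed successor is a choice made only for this sketch.

```python
import random
from collections import defaultdict


def shuffled_copy(words, rng):
    """Return a random permutation of the words (unigram frequencies preserved)."""
    out = list(words)
    rng.shuffle(out)
    return out


def bigram_sample(words, length, rng):
    """Generate `length` words from a maximum-likelihood bigram model of `words`."""
    successors = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        successors[prev].append(nxt)  # repeated entries encode bigram frequencies
    current = rng.choice(words)  # start word drawn by unigram frequency
    out = [current]
    while len(out) < length:
        options = successors.get(current)
        # assumption: restart from a random word if no successor was observed
        current = rng.choice(options) if options else rng.choice(words)
        out.append(current)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    text = "the whale saw the captain and the captain saw the whale".split()
    print(" ".join(shuffled_copy(text, rng)))
    print(" ".join(bigram_sample(text, 20, rng)))
```

Both outputs keep the source text's vocabulary, and the bigram sample keeps its local word-pair statistics; neither reproduces the longer-range clustering of words, which is consistent with the exponent of 0.50 reported for these baselines.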
State-of-the-art language models are based on neural models, and they are mainly evaluated by perplexity and in terms of the performance of individual applications. Since their architecture is complex, quality evaluation has become an issue. One possible improvement would be to use an evaluation method that qualitatively differs from judging application performance. One such method is to verify whether the properties underlying natural language hold for texts generated by language models. The Taylor exponent is one such possibility, among various properties of natural language texts.
As a step toward this approach, Figure 8 shows two results produced by neural language models. Figure 8a shows the result for a sample of 2 million characters produced by a standard (three-layer) stacked character-based LSTM unit that learned the complete works of Shakespeare. The model was optimized to minimize the cross-entropy with a stochastic gradient algorithm to predict the next character from the previous 128 characters. See (Takahashi and Tanaka-Ishii, 2017) for the details of the experimental settings. The Taylor exponent of the generated text was 0.50. This indicates that the character-level language model could not capture or reproduce the word-level clustering behavior in text. This analysis sheds light on the quality of the language model, separate from the prediction accuracy.
The application of Taylor's law for a wider range of language models appears in (Takahashi and Tanaka-Ishii, 2018). Briefly, state-of-the-art word-level language models can generate text whose Taylor exponent is larger than 0.50 but smaller than that of the dataset used for training. This indicates both the capability of modeling burstiness in text and the room for improvement. Also, the perplexity values correlate well with the Taylor exponents. Therefore, the Taylor exponent can reasonably serve for evaluating machine-generated text.
In contrast to character-level neural language models, neural-network-based machine translation (NMT) models are, in fact, capable of maintaining the burstiness of the original text. Figure 8b shows the Taylor analysis for a machine-translated text of Les Misérables (from French to English), obtained from Google NMT (Wu et al., 2016). We split the text into 5000-character portions because of the API's limitation (see (Takahashi and Tanaka-Ishii, 2017) for the details). As is expected and desirable, the translated text retains the clustering behavior of the original text, as the Taylor exponent of 0.57 is equivalent to that of the original text.

Conclusion
We have proposed a method to analyze whether a natural language text follows Taylor's law, a scaling property quantifying the degree of consistent co-occurrence among words. In our method, a sequence of words is divided into given segments, and the mean and standard deviation of the frequency of every kind of word are measured. The law is considered to hold when the standard deviation varies with the mean according to a power law, thus giving the Taylor exponent.
Theoretically, an i.i.d. process has a Taylor exponent of 0.5, whereas larger exponents indicate sequences in which words co-occur systematically. Using over 1100 texts across 14 languages, we showed that written natural language texts follow Taylor's law, with the exponent distributed around 0.58. This value differed greatly from the exponents for other data sources: enwiki8 (tagged Wikipedia, 0.63), child-directed speech (CHILDES, around 0.68), and programming language and music data (around 0.79). These Taylor exponents imply that a written text is more complex than programming source code or music with regard to fluctuation of its components. None of the real data exhibited an exponent equal to 0.5. We conducted more detailed analysis varying the data size and the segment size.
Taylor's law and its exponent can also be applied to evaluate machine-generated text. We showed that a character-based LSTM language model generated text with a Taylor exponent of 0.5. This indicates one limitation of that model.
Our future work will include an analysis using other kinds of data, such as Twitter data and adult utterances, and a study of how Taylor's law relates to grammatical complexity for different sequences. Another direction will be to apply fluctuation analysis in formulating a statistical test to evaluate the structural complexity underlying a sequence.