A Three-Parameter Rank-Frequency Relation in Natural Languages

We show that the rank-frequency relation in textual data follows f \propto r^{-\alpha}(r+\gamma)^{-\beta}, where f is the token frequency and r is the rank by frequency, with (\alpha, \beta, \gamma) as parameters. The formulation is derived from the empirical observation that d^2 (x+y)/dx^2 is a typical impulse function, where (x,y)=(\log r, \log f). The formulation reduces to the power law when \beta=0 and to the Zipf–Mandelbrot law when \alpha=0. Based on an investigation of multilingual corpora, we illustrate that \alpha is related to the analytic features of syntax and \beta+\gamma to those of morphology in natural languages.


Introduction
Zipf's law (Zipf, 1935, 1949) is an empirical law that formulates the rank-frequency (r-f) relation in physical and social phenomena. Linguistically, Zipf's law can be observed in the distribution of words in corpora of natural languages, where the frequency (f) of a word is inversely proportional to its rank (r) by frequency; that is, f ∝ r^{−1}. Zipf's law is a special form of the general power law f ∝ r^{−α}, with α = 1.
Zipf's law and the power law are usually examined on a log-log plot of rank and frequency, where the data points lie on a straight line. This simple proportionality can be observed on randomly generated textual data (Li, 1992), and it only roughly depicts the r-f relation in real textual data. A two-parameter generalization of the Zipf's/power law is the Zipf-Mandelbrot law, f ∝ (r + β)^{−α} (Mandelbrot, 1965). Li et al. (2010) considered the reversed rank r_max + 1 − r, where r_max is the maximum rank index, and proposed the two-parameter formulation f ∝ r^{−α}(r_max + 1 − r)^{β}.
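As a rough illustration of how the straight-line check on a log-log plot can be carried out, the following minimal sketch counts token frequencies, ranks them, and estimates the slope of log_10 f against log_10 r by ordinary least squares. The function names and the least-squares estimate are assumptions made for illustration; they are not taken from the cited works.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return token frequencies in descending order (index 0 is rank 1)."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope and intercept of log10(f) against log10(r).

    For an ideal power law f = K * r**(-alpha), the slope is -alpha
    and the intercept is log10(K).
    """
    xs = [math.log10(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

On real corpora the points bend away from a single line, which is the deviation the formulations above try to capture, so a single slope is only a coarse diagnostic.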
As a straightforward observation, the coefficients of proportionality should be distinguished for common and rare words (Powers, 1998; Li et al., 2010). Therefore, an extension of the original Zipf's/power law requires at least two parameters. In this study, a three-parameter formulation, f ∝ r^{−α}(r + γ)^{−β}, is derived based on the observation and analysis of multilingual corpora. It is a natural generalization of the power law and the Zipf-Mandelbrot law. The third parameter depicts the rigidness of the different coefficients of proportionality. The proposed formulation can also fit non-Zipfian phenomena in natural languages, such as the r-f relation on Chinese characters. Figure 1 shows examples on English words from Europarl (Koehn, 2005)¹ and Chinese characters of Academia Sinica from the data of Sproat and Emerson (2003).²

Figure 1: Rank-frequency plots on English words (left) and Chinese characters (right). The x- and y-axes are log_10 r and log_10 f, respectively. The gray curves are the proposed formulation under logarithm: y = C − αx − β log_10(10^x + 10^γ), where C is a constant. The dashed lines are the asymptotes C − (αx + βγ) and C − (α + β)x.

Proposed and Related Formulation
In logarithmic form, Zipf's law states that x + y = C, where (x, y) = (log r, log f) and C is roughly a constant. We further investigate the property of C = g(x). The first- and second-order differences of g(x) are calculated as

g'_i = (g_i − g_{i−1}) / (x_i − x_{i−1}), (1)
g''_i = (g'_i − g'_{i−1}) / (x_i − x_{i−1}). (2)

Here (x_i, y_i) is the data point of the i-th most frequent token, g_i = x_i + y_i, the differences are taken for i > 1, and g'_1 = g''_1 = 0.³ Because the differences are intrinsically nonsmooth, Bézier curves are applied for smoothing in the investigation. Figure 2 shows examples of the smoothed g'' on English words and Chinese characters from the same datasets used for Fig. 1. An artificial Zipfian dataset generated in the manner of Li (1992)⁴ is also used for comparison. It can be observed that g'' on English words and Chinese characters has an impulse, whereas g'' on the artificial data does not. Generally, the impulse becomes more obvious as the data become more non-Zipfian.
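The difference computation described above can be sketched as follows. This is a minimal illustration that assumes backward differences over adjacent retained points and keeps only the lowest-rank point among tokens with the same frequency; the Bézier smoothing step is not included, and the function name is hypothetical.

```python
import math


def g_differences(freqs):
    """First- and second-order differences of g = x + y, (x, y) = (log r, log f).

    `freqs` must be sorted in descending order (rank 1 first).
    Among tokens sharing a frequency, only the one with the smallest
    rank is kept, which avoids runs of zero-valued differences.
    Returns (xs, g, g1, g2) with g1[0] = g2[0] = 0.
    """
    xs, g = [], []
    previous = None
    for rank, f in enumerate(freqs, start=1):
        if f == previous:
            continue  # same frequency as the retained point: skip
        previous = f
        x, y = math.log(rank), math.log(f)
        xs.append(x)
        g.append(x + y)

    n = len(xs)
    g1 = [0.0] * n
    g2 = [0.0] * n
    for i in range(1, n):
        g1[i] = (g[i] - g[i - 1]) / (xs[i] - xs[i - 1])
    for i in range(1, n):
        g2[i] = (g1[i] - g1[i - 1]) / (xs[i] - xs[i - 1])
    return xs, g, g1, g2
```

For exactly Zipfian frequencies, g is constant, so both difference sequences stay at zero up to rounding; an impulse in the second-order differences signals a departure from Zipf's law.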
If we consider g'' as a general impulse function, then g' is a general sigmoid function, and g can be modeled by a general softplus function of the form b log(exp(x − c) + 1). Replacing x by a generalized linear form ax + d and substituting (log r, log f) for (x, y), we obtain

f ∝ r^{−α}(r + γ)^{−β}. (3)

The obtained proportional form is a natural two-component extension of the power law and the Zipf-Mandelbrot law. Because the softplus function is a differentiable form of a rigid ramp function, Eq. (3) can also be considered a smoothed piecewise broken power law. As shown in Fig. 1, α and (α + β) depict the proportional coefficients at the two ends, and the proportional coefficients are switched smoothly around x = γ.

The formulation of Li et al. (2010) is also a two-component formulation. One more parameter (i.e., γ) in Eq. (3) is used to identify the location of the impulse observed in g''. Under Li's formulation, we obtain g = y + αx = β log(r_max + 1 − exp(x)) and g'' = −C_1 exp(x)(C_2 − exp(x))^{−2}, where C_1 and C_2 are constants. g'' is a monotonically decreasing function with x = log(C_2) as the asymptote for x < log(C_2). Therefore, Li's formulation always has a steep tail and lacks the capacity to depict the switching between two stable proportional coefficients. Figure 3 shows examples of using Li's formulation to fit the data in Fig. 1. It can be observed that the non-Zipfian Chinese characters are fitted well, but the tail part of the more Zipfian English words is not. This can be explained from the shape of g'' in Fig. 2. It is reasonable to model the g'' of Chinese characters using a monotonically decreasing function because the γ in Eq. (3) is quite large (around r_max). However, it is not proper for English words, where a proper γ is required.

Figure 3: Data of Figure 1 fitted by the gray curve of y = C − αx + β log_10(r_max + 1 − 10^x). The dashed lines are of C − (αx + β log_10(r_max + 1)) and C − β log_10(r_max + 1 − 10^x) for the two ends. (α, β) is (1.15, 9.16) for English words and (0.62, 157.13) for Chinese characters.

³ To avoid too many meaningless zeros in the differences, only the data point with the minimum x is used for data points with the same y, i.e., tokens with the same frequency.
⁴ Two letters a and b are used. The frequencies of a, b, and space are in the ratio 3 : 1 : 1, and 10^7 characters are randomly generated.
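The two derivations used in this section can be written out step by step. The block below is a worked sketch rather than a quotation of the original equations: it assumes that the x on the left-hand side of g = x + y is the term replaced by ax + d, and that C_2 = r_max + 1 in Li's formulation. Under those assumptions, (α, β, γ) = (a, −b, e^c) and C_1 = β C_2.

```latex
% Softplus model to the three-parameter form, with x = \log r, y = \log f
\begin{align*}
ax + d + y &= b \log\bigl(\exp(x - c) + 1\bigr) \\
\log f &= -a \log r - d + b \log\bigl(r\,e^{-c} + 1\bigr) \\
       &= -a \log r - (d + bc) + b \log\bigl(r + e^{c}\bigr) \\
f &= e^{-(d + bc)}\, r^{-a}\,\bigl(r + e^{c}\bigr)^{b}
   \;\propto\; r^{-\alpha}(r + \gamma)^{-\beta},
   \qquad (\alpha, \beta, \gamma) = (a,\, -b,\, e^{c}).
\end{align*}

% Li's formulation, with C_2 = r_{\max} + 1 and additive constants dropped
\begin{align*}
g   &= y + \alpha x = \beta \log\bigl(C_2 - e^{x}\bigr) \\
g'  &= -\frac{\beta\, e^{x}}{C_2 - e^{x}} \\
g'' &= -\frac{\beta\, C_2\, e^{x}}{\bigl(C_2 - e^{x}\bigr)^{2}}
     = -C_1\, e^{x}\bigl(C_2 - e^{x}\bigr)^{-2},
     \qquad C_1 = \beta\, C_2 .
\end{align*}
```

Since g'' in the second derivation is negative and diverges as x approaches log(C_2) from below, it has no isolated impulse, which matches the steep tail noted for Li's formulation.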
Based on the analysis, it can be concluded that the formulation f ∝ r^{−α}(r + γ)^{−β} is a generalized form that covers the Zipf's/power law, the Zipf-Mandelbrot law, the piecewise broken power law, and Li's two-parameter formulation. In the next section, we show the linguistic interpretation of the parameters (α, β, γ).
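The special cases and the two asymptotes of the log-form curve can be checked numerically. The sketch below uses the logarithmic form given in the Figure 1 caption, where γ is on the log_10 scale; the function names and the parameter values in the usage note are illustrative assumptions, not fitted results.

```python
import math


def log10_model(x, C, alpha, beta, gamma):
    """y = C - alpha*x - beta*log10(10**x + 10**gamma), with x = log10(r).

    beta = 0 gives the power law (a straight line in x);
    alpha = 0 gives the Zipf-Mandelbrot law with shift 10**gamma.
    """
    return C - alpha * x - beta * math.log10(10.0 ** x + 10.0 ** gamma)


def head_asymptote(x, C, alpha, beta, gamma):
    """Limit for x far below gamma: slope -alpha."""
    return C - (alpha * x + beta * gamma)


def tail_asymptote(x, C, alpha, beta, gamma):
    """Limit for x far above gamma: slope -(alpha + beta)."""
    return C - (alpha + beta) * x
```

With, say, (C, α, β, γ) = (8, 1, 2, 3), the curve stays close to the head asymptote at x = 0 and to the tail asymptote at x = 6, and the slope changes from −α to −(α + β) around x = γ.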

Experiment and Discussion
We used the proposed formulation to fit data of various European languages and typical Asian languages. The Europarl corpus (Koehn, 2005) and the data from the Second International Chinese Word Segmentation Bakeoff (ICWB2) (Sproat and Emerson, 2003) were mentioned in the Introduction. We also used English-Japanese patent data from the 7th NTCIR Workshop (Fujii et al., 2008). The Europarl data and the English data from NTCIR were lower-cased and tokenized using the toolkit provided by MOSES⁵ (Koehn et al., 2007). Fitting was performed on a logarithmic scale using the fit function⁶ in gnuplot.⁷ Specifically, rank-frequency data were used to fit (α, β, γ) and C in y = C − αx − β log_10(10^x + 10^γ). For the initialization, (α, β, γ) = (1, 1, r_max/2) and C = 3γ were applied.

Table 1 lists the fitting results for all the languages⁸ in the Europarl corpus. The (α, β, γ) with the asymptotic standard error (±) are listed. Because γ may depend on the vocabulary size, the normalized γ_norm = γ/r_max is also listed. It can be observed that all the language data were fitted well with an α of around 1.0, which is in accordance with the original Zipf's law. β and γ_norm for each language are plotted on the left of Fig. 4.⁹ On the β-γ_norm plane, we can observe a rough tendency for β and γ_norm to be linearly related, in addition to a separation of different language branches. Further principal component analysis on (α, β, γ_norm) suggests that α and β + γ_norm can generally be considered as two dominant components.¹⁰ The plot on the right of Fig. 4 shows that the language branches can be separated roughly by lines parallel to the axes of α and β + γ_norm. This indicates the linguistic explainability of the two axes.

⁵ http://www.statmt.org/moses/
⁶ An implementation of the nonlinear least-squares Marquardt-Levenberg algorithm was used.
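For readers who want to reproduce a fit of the log-form model without gnuplot, a dependency-free sketch is given below. It does not implement the Marquardt-Levenberg algorithm used in the experiments. Instead it swaps in a simpler technique: for a fixed γ the model is linear in (C, α, β), so the sketch scans a user-supplied grid of γ values and solves a 3 × 3 least-squares problem at each one. All function names are hypothetical, and no standard errors are computed.

```python
import math


def feature(x, gamma):
    """log10(10**x + 10**gamma), the gamma-dependent term of the model."""
    return math.log10(10.0 ** x + 10.0 ** gamma)


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, v):
    """Solve a 3x3 linear system by Cramer's rule (adequate for a sketch)."""
    d = det3(m)
    solution = []
    for k in range(3):
        mk = [row[:] for row in m]
        for r in range(3):
            mk[r][k] = v[r]
        solution.append(det3(mk) / d)
    return solution


def fit_three_parameter(xs, ys, gamma_grid):
    """Fit y = C - alpha*x - beta*feature(x, gamma) on log10-scale data.

    Grid search over gamma; linear least squares for (C, alpha, beta)
    at each grid value. Returns (C, alpha, beta, gamma) with the
    smallest sum of squared residuals.
    """
    best = None
    for gamma in gamma_grid:
        rows = [(1.0, -x, -feature(x, gamma)) for x in xs]
        # Normal equations (A^T A) p = A^T y for p = (C, alpha, beta).
        ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
               for i in range(3)]
        aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
        C, alpha, beta = solve3(ata, aty)
        sse = sum((C * r[0] + alpha * r[1] + beta * r[2] - y) ** 2
                  for r, y in zip(rows, ys))
        if best is None or sse < best[0]:
            best = (sse, C, alpha, beta, gamma)
    return best[1:]
```

Called on points generated from the model itself, with a grid that contains the true γ, the function should recover the generating parameters up to rounding. The grid should stay inside the range of x, because the γ-dependent term becomes nearly collinear with x or with the constant when γ lies far outside it.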
From the nature of these languages, we consider that α can be explained as an axis of analysis-synthesis on syntax and β + γ_norm as that on morphology. A large α suggests that a few extremely frequent words dominate the corpus. As typical examples, languages with a relatively large α, that is, Romance and Germanic languages, generally contain abundant prepositions, particles, and determiners to mark syntactic roles, whereas those with a smaller α, that is, Slavic and Uralic languages, tend to use complex declension and conjugation within words to convey syntactic information. An interesting piece of evidence is that bg, as a very analytic Slavic language, has a larger α than the other Slavic languages. In the other dimension, a large β + γ_norm suggests a dramatic decrease in the frequency of rare words. Hence, languages with a small β + γ_norm, that is, Germanic and Uralic languages, have a more gradual decrease in rare words, which are instances of various phenomena of derivation and compounding from complex morphology. By contrast, languages with a large β + γ_norm, such as en and fr, tend to use phrases composed of multiple common words to express complex concepts, so that the drop in the frequency of rare words is relatively dramatic. As β + γ_norm is sensitive to the portion of rare words, this dimension may be easily affected by the properties of specific data. An example is ro, for which a much larger β than for other languages was fitted.

Table 2 lists the fitting results on the ICWB2 Chinese data. a.*, c.*, m.*, and p.* denote Academia Sinica, City University of Hong Kong, Microsoft Research, and Peking University data, respectively. *.w and *.c denote manually segmented words and characters, respectively. For the results on words, a trade-off between α and β + γ_norm can be observed. Based on the previous analysis, we can consider that a.w has more segmentations on function words.
One piece of evidence is the segmentation of the expression shibushi (whether or not), which is composed of the three characters shi (to be), bu (not), and shi (to be). The expression is segmented into shi / bu / shi in most cases in a.w, but is always kept together in m.w. Regarding characters, we have a small α and a huge β + γ_norm. Note that both common function words and rare specific concepts in Chinese are commonly composed of multiple characters. Therefore, the contrast between common and rare characters is not so obvious, which leads to a small α (no overwhelmingly dominant function words in syntax) and a huge β + γ_norm (extremely analytic in morphology).

Figure 5 provides further evidence. The data size of typical languages in Europarl is gradually halved, and the change in the fitted parameters is shown in the plot on the left of Fig. 5. *.0 denotes the original data and *.n denotes the data of one n-th size. α does not change substantially for smaller data because of the stable syntactic features and function words. However, β + γ_norm becomes larger, which suggests that there are fewer morphological varieties because of the smaller data size. The plot on the right of Fig. 5 shows how different word segmentations in Japanese affect the parameters. There are three common Japanese morphological analysis tools: kytea, mecab, and juman. kytea provides the most fragmentary segmentation, and juman tends to attach suffixes to stems. For example, the three tools segment wakarimashita (understood, in polite form) as follows: waka / ri / ma / shi / ta (5 tokens) by kytea, wakari / mashi / ta (3 tokens) by mecab, and wakari / mashita (2 tokens) by juman. As the most fragmentary segmentation by kytea contains more functional suffixes as words, it has the largest α; by contrast, the segmentation by juman has the smallest α. Furthermore, mecab has a smaller β + γ_norm because it may keep proper nouns unsegmented, which can be considered as introducing more compounded words. For example, tōkyōdaigaku (The University of Tokyo) is kept as one word by mecab, but segmented as tōkyō / daigaku (Tokyo / university) by the other two tools.

Figure 5: Effects on α and β + γ_norm.

Conclusion and Future Work
We have shown that the rank-frequency relation in natural languages follows f ∝ r^{−α}(r + γ)^{−β}. This is an explainable extension of several related formulations, with α related to the analytic features of syntax and β + γ to those of morphology. A more general form, f ∝ ∏_k (r + γ_k)^{−β_k}, can be considered for further investigation. The k terms can depict k different proportional coefficients.
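The relation between the general product form and the three-parameter formulation can be made concrete with a short sketch. It assumes the general form is a product of shifted power terms, so that its logarithm is a sum, and it uses natural logarithms with γ on the rank scale; the function name and argument layout are hypothetical.

```python
import math


def log_general(r, C, terms):
    """log f = C - sum_k beta_k * log(r + gamma_k).

    `terms` is a list of (gamma_k, beta_k) pairs. With
    terms = [(0, alpha), (gamma, beta)] this is the logarithm of the
    three-parameter form r**(-alpha) * (r + gamma)**(-beta), up to C.
    """
    return C - sum(beta * math.log(r + gamma) for gamma, beta in terms)
```

Each additional pair adds one more point at which the slope of the log-log curve can change, which is one way to read the statement that k terms depict k different proportional coefficients.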