A Three-Parameter Rank-Frequency Relation in Natural Languages

Chenchen Ding, Masao Utiyama, Eiichiro Sumita


Abstract
We show that the rank-frequency relation in textual data follows f ∝ r^(−𝛼) (r+𝛾)^(−𝛽), where f is the token frequency and r is the rank by frequency, with (𝛼, 𝛽, 𝛾) as parameters. The formulation is derived from the empirical observation that d²(x+y)/dx² is a typical impulse function, where (x, y) = (log r, log f). The formulation reduces to the power law when 𝛽=0 and to the Zipf–Mandelbrot law when 𝛼=0. Based on an investigation of multilingual corpora, we illustrate that 𝛼 is related to the analytic features of syntax and 𝛽+𝛾 to those of morphology in natural languages.
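The relation stated in the abstract can be sketched in a few lines of Python. This is only an illustration and not the authors' code: the function names, the `scale` argument, and the sample parameter values are assumptions introduced here. The second function is obtained by differentiating the stated formula twice with respect to x = log r, which gives d²(x+y)/dx² = −𝛽𝛾r/(r+𝛾)², a single bump whose magnitude peaks at r = 𝛾.

```python
import math


def three_param_freq(rank, alpha, beta, gamma, scale=1.0):
    """Unnormalized frequency f = scale * r^(-alpha) * (r + gamma)^(-beta).

    `scale` is a hypothetical proportionality constant; the abstract only
    states f up to proportionality.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return scale * rank ** (-alpha) * (rank + gamma) ** (-beta)


def log_log_curvature(rank, beta, gamma):
    """Closed form of d^2(x + y)/dx^2 with x = log r and y = log f.

    Derived here from the formula above (alpha drops out):
    -beta * gamma * r / (r + gamma)^2.
    """
    return -beta * gamma * rank / (rank + gamma) ** 2


if __name__ == "__main__":
    # Illustrative parameter values only; they are not fitted to any corpus.
    alpha, beta, gamma = 0.8, 1.2, 2.7
    for r in (1, 10, 100, 1000):
        f = three_param_freq(r, alpha, beta, gamma)
        c = log_log_curvature(r, beta, gamma)
        print(f"r={r:5d}  f={f:.6g}  curvature={c:.4f}")
```

Setting `beta=0` leaves only the r^(−𝛼) factor (the power law), and setting `alpha=0` leaves only (r+𝛾)^(−𝛽) (the Zipf–Mandelbrot form), matching the two special cases named in the abstract.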
Anthology ID:
2020.acl-main.44
Volume:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Month:
July
Year:
2020
Address:
Online
Editors:
Dan Jurafsky, Joyce Chai, Natalie Schluter, Joel Tetreault
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
460–464
URL:
https://aclanthology.org/2020.acl-main.44
DOI:
10.18653/v1/2020.acl-main.44
Cite (ACL):
Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. 2020. A Three-Parameter Rank-Frequency Relation in Natural Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 460–464, Online. Association for Computational Linguistics.
Cite (Informal):
A Three-Parameter Rank-Frequency Relation in Natural Languages (Ding et al., ACL 2020)
PDF:
https://aclanthology.org/2020.acl-main.44.pdf
Video:
http://slideslive.com/38928699