Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution

Leanne Spracklin, Diana Inkpen, Amiya Nayak


Abstract
Traditional Authorship Attribution models extract normalized counts of lexical elements such as nouns, common words and punctuation and use these normalized counts or ratios as features for author fingerprinting. The text is viewed as a “bag-of-words” and the order of words and their position relative to other words is largely ignored. We propose a new method of feature extraction which quantifies the distribution of lexical elements within the text using Kolmogorov complexity estimates. Testing carried out on blog corpora indicates that such measures outperform ratios when used as features in an SVM authorship attribution model. Moreover, by adding complexity estimates to a model using ratios, we were able to increase the F-measure by 5.2-11.8%
Anthology ID:
L08-1031
Volume:
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:
May
Year:
2008
Address:
Marrakech, Morocco
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/892_paper.pdf
DOI:
Bibkey:
Cite (ACL):
Leanne Spracklin, Diana Inkpen, and Amiya Nayak. 2008. Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):
Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution (Spracklin et al., LREC 2008)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2008/pdf/892_paper.pdf