CCOHA: Clean Corpus of Historical American English

Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, Sabine Schulte im Walde


Abstract
Modelling language change is an increasingly important area of interest within the fields of sociolinguistics and historical linguistics. In recent years, there has been a growing number of publications whose main concern is studying changes that have occurred within the past centuries. The Corpus of Historical American English (COHA) is one of the most commonly used large corpora in diachronic studies in English. This paper describes methods applied to the downloadable version of the COHA corpus in order to overcome its main limitations, such as inconsistent lemmas and malformed tokens, without compromising its qualitative and distributional properties. The resulting corpus CCOHA contains a larger number of cleaned word tokens which can offer better insights into language change and allow for a larger variety of tasks to be performed.
Anthology ID:
2020.lrec-1.859
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
6958–6966
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.859
DOI:
Bibkey:
Cite (ACL):
Reem Alatrash, Dominik Schlechtweg, Jonas Kuhn, and Sabine Schulte im Walde. 2020. CCOHA: Clean Corpus of Historical American English. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6958–6966, Marseille, France. European Language Resources Association.
Cite (Informal):
CCOHA: Clean Corpus of Historical American English (Alatrash et al., LREC 2020)
Copy Citation:
PDF:
https://aclanthology.org/2020.lrec-1.859.pdf