Shamela: A Large-Scale Historical Arabic Corpus

Yonatan Belinkov; Alexander Magidow; Maxim Romanov; Avi Shmidman; Moshe Koppel

Abstract

Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies in which we show its application to the digital humanities.

Anthology ID: W16-4007
Volume: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
Month: December
Year: 2016
Address: Osaka, Japan
Editors: Erhard Hinrichs, Marie Hinrichs, Thorsten Trippel
Venue: LT4DH
Publisher: The COLING 2016 Organizing Committee
Pages: 45–53
URL: https://aclanthology.org/W16-4007
Cite (ACL): Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, and Moshe Koppel. 2016. Shamela: A Large-Scale Historical Arabic Corpus. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), pages 45–53, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal): Shamela: A Large-Scale Historical Arabic Corpus (Belinkov et al., LT4DH 2016)
PDF: https://aclanthology.org/W16-4007.pdf
