Correcting Whitespace Errors in Digitized Historical Texts

Sandeep Soni, Lauren Klein, Jacob Eisenstein


Abstract
Whitespace errors are common in digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized corpus of newspapers from the 19th-century United States. Our technique identifies and corrects most whitespace errors while introducing a minimal amount of oversegmentation: it achieves 77% recall at a false positive rate of less than 1%, and 91% recall at a false positive rate of less than 3%.
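The likelihood ratio test mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the count tables, the interpolation weight LAM, the floor EPS, the threshold, and the function names interp_prob, split_llr, and best_split are all assumptions made for illustration, and the sketch conditions only on the two preceding words. It compares the interpolated probability of a token left intact against the probability of the same characters split into two words, and proposes a split when the log ratio exceeds the threshold.

```python
import math

# Toy counts standing in for Google n-gram statistics (hypothetical values).
UNI = {"the": 1000, "will": 60, "of": 800, "people": 50}
BI = {("the", "will"): 30, ("will", "of"): 20,
      ("of", "the"): 300, ("the", "people"): 20}
TRI = {("the", "will", "of"): 10, ("will", "of", "the"): 8,
       ("of", "the", "people"): 15}

LAM = 0.7    # interpolation weight (assumed value)
EPS = 1e-9   # floor so unseen events keep a finite log probability (assumed)


def interp_prob(w, u, v):
    """P(w | u, v) as a mix of trigram and bigram relative frequencies."""
    p_tri = TRI.get((u, v, w), 0) / BI[(u, v)] if BI.get((u, v)) else 0.0
    p_bi = BI.get((v, w), 0) / UNI[v] if UNI.get(v) else 0.0
    return LAM * p_tri + (1 - LAM) * p_bi + EPS


def split_llr(left, right, u, v):
    """Log-likelihood ratio of 'left right' versus the unsplit token."""
    split = (math.log(interp_prob(left, u, v))
             + math.log(interp_prob(right, v, left)))
    whole = math.log(interp_prob(left + right, u, v))
    return split - whole


def best_split(token, u, v, threshold=0.0):
    """Return the highest-scoring split above the threshold, or None."""
    candidates = [(token[:i], token[i:]) for i in range(1, len(token))]
    scored = [(split_llr(a, b, u, v), (a, b)) for a, b in candidates]
    if not scored:
        return None
    score, pair = max(scored)
    return pair if score > threshold else None


# "ofthe" after "the will" is a candidate for a missing space;
# "people" after "of the" should be left intact.
print(best_split("ofthe", "the", "will"))
print(best_split("people", "of", "the"))
```

In a sketch like this, raising the threshold makes the test more conservative, which is one way to trade recall for a lower false positive rate of the kind the abstract reports.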
Anthology ID:
W19-2513
Volume:
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Month:
June
Year:
2019
Address:
Minneapolis, USA
Editors:
Beatrice Alex, Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, Stan Szpakowicz
Venue:
LaTeCH
SIG:
SIGHUM
Publisher:
Association for Computational Linguistics
Pages:
98–103
URL:
https://aclanthology.org/W19-2513
DOI:
10.18653/v1/W19-2513
Cite (ACL):
Sandeep Soni, Lauren Klein, and Jacob Eisenstein. 2019. Correcting Whitespace Errors in Digitized Historical Texts. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 98–103, Minneapolis, USA. Association for Computational Linguistics.
Cite (Informal):
Correcting Whitespace Errors in Digitized Historical Texts (Soni et al., LaTeCH 2019)
PDF:
https://aclanthology.org/W19-2513.pdf
Code:
sandeepsoni/whitespace-normalizer