Incorporating an Error Corpus into a Spellchecker for Maltese

Michael Rosner, Albert Gatt, Andrew Attard, Jan Joachimsen


Abstract
This paper discusses the ongoing development of a new Maltese spell checker, focusing on the methodologies best suited to such a language. We review several previous attempts and identify what we believe to be their weakest point: a lack of attention to context. Two developments are of particular interest, both concerning the availability of language resources relevant to spellchecking: (i) the Maltese Language Resource Server (MLRS), which now includes a representative corpus of c. 100M words extracted from diverse documents, including Maltese legislation, press releases and extracts from Maltese web pages; and (ii) an extensive and detailed corpus of spelling errors that was collected while part of the MLRS texts were being prepared. We describe the structure of these resources as well as the context-focused experimental approaches that we are now in a position to adopt. We then outline the framework within which a variety of approaches to spellchecking and evaluation will be carried out, and briefly discuss the first baseline system we have implemented. We conclude the paper with a roadmap for future improvements.
Anthology ID:
L12-1620
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
743–750
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/1040_Paper.pdf
Cite (ACL):
Michael Rosner, Albert Gatt, Andrew Attard, and Jan Joachimsen. 2012. Incorporating an Error Corpus into a Spellchecker for Maltese. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 743–750, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Incorporating an Error Corpus into a Spellchecker for Maltese (Rosner et al., LREC 2012)