Government Domain Named Entity Recognition for South African Languages

Roald Eiselen


Abstract
This paper describes the named entity language resources created as part of a development project for the South African languages. The development efforts focused on creating protocols and annotated data sets with at least 15,000 annotated named entity tokens for ten of the official South African languages. The description of the protocols and annotated data sets provides an overview of the problems encountered during annotation. Based on these annotated data sets, conditional random field (CRF) named entity recognition systems are developed that leverage existing linguistic resources. The newly created named entity recognisers are evaluated, achieving F-scores between 0.64 and 0.77, and an error analysis is performed to identify possible avenues for improving the quality of the systems.
Anthology ID:
L16-1533
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3344–3348
URL:
https://aclanthology.org/L16-1533
Cite (ACL):
Roald Eiselen. 2016. Government Domain Named Entity Recognition for South African Languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 3344–3348, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Government Domain Named Entity Recognition for South African Languages (Eiselen, LREC 2016)
PDF:
https://aclanthology.org/L16-1533.pdf