NeoTag: a POS Tagger for Grammatical Neologism Detection

Maarten Janssen


Abstract
POS taggers typically fail to tag grammatical neologisms correctly: for known words, a tagger only takes known tags into account, and hence discards any possibility that the word is used in a novel or deviant grammatical category in the text at hand. Grammatical neologisms are relatively rare, and therefore do not pose a significant problem for the overall performance of a tagger. For studies on neologisms and grammaticalization processes, however, this makes traditional taggers poorly suited. This article describes a modified POS tagger that explicitly considers new tags for known words, making it better suited for neologism research. This tagger, called NeoTag, has an overall accuracy comparable to that of other taggers, but scores much better on grammatical neologisms. To achieve this, the tagger applies a system of lexical smoothing, which adds new categories to known words based on known homographs. NeoTag also lemmatizes words as part of the tagging system, achieving high lemmatization accuracy for both known and unknown words without the need for an external lexicon. The use of NeoTag is not restricted to grammatical neologism detection; it can be used for other purposes as well.
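The abstract does not spell out how lexical smoothing is computed, so the following is only a minimal sketch of one way the idea could work, not NeoTag's actual implementation. It assumes a toy lexicon mapping each word to its set of known tags, estimates from homographs how often a word carrying one tag also carries another, and uses that rate to give a known word a small weight for tags it has not been seen with. The function names `homograph_rates` and `smoothed_tags`, the `weight` parameter, and the example lexicon are all hypothetical.

```python
from collections import defaultdict


def homograph_rates(lexicon):
    """For each ordered tag pair (a, b), estimate the share of words
    carrying tag a that also carry tag b (i.e. are a/b homographs)."""
    has_tag = defaultdict(int)
    co_occur = defaultdict(int)
    for tags in lexicon.values():
        for a in tags:
            has_tag[a] += 1
            for b in tags:
                if a != b:
                    co_occur[(a, b)] += 1
    return {pair: n / has_tag[pair[0]] for pair, n in co_occur.items()}


def smoothed_tags(word, lexicon, rates, weight=0.1):
    """Return the word's known tags with weight 1.0, plus unseen tags
    with a small weight derived from the homograph rates.
    The discount `weight` is an arbitrary assumption."""
    known = lexicon[word]
    candidates = {tag: 1.0 for tag in known}
    for (a, b), rate in rates.items():
        if a in known and b not in known:
            candidates[b] = max(candidates.get(b, 0.0), weight * rate)
    return candidates


# Toy lexicon (invented for illustration): "table" is only known as a noun.
lexicon = {
    "walk": {"NOUN", "VERB"},
    "run": {"NOUN", "VERB"},
    "table": {"NOUN"},
    "quickly": {"ADV"},
}
rates = homograph_rates(lexicon)
result = smoothed_tags("table", lexicon, rates)
```

In this toy setup, "table" keeps its known NOUN tag at full weight and gains a low-weight VERB candidate because other nouns in the lexicon are also verbs, while ADV is never proposed because no noun is a known adverb homograph. A real tagger would combine such lexical weights with contextual probabilities before choosing a tag.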
Anthology ID:
L12-1653
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2118–2124
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/1098_Paper.pdf
Cite (ACL):
Maarten Janssen. 2012. NeoTag: a POS Tagger for Grammatical Neologism Detection. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2118–2124, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
NeoTag: a POS Tagger for Grammatical Neologism Detection (Janssen, LREC 2012)