Part-of-Speech Annotation of English-Assamese code-mixed texts: Two Approaches

Ritesh Kumar; Manas Jyoti Bora

Abstract

In this paper, we discuss the development of a part-of-speech tagger for English-Assamese code-mixed texts. We compare two approaches to annotating code-mixed data: (a) annotating the text in each of the two languages using monolingual resources for that language, and (b) annotating the text using a separate resource created specifically for code-mixed data. We present a comparative study of the effort required by each approach and of the final performance of the system. Based on this, we argue that it might be better to develop new technologies using code-mixed data instead of monolingual, ‘clean’ data, especially for languages for which significant tools and technologies are not yet available.
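The two approaches contrasted in the abstract can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' system: the lexicon-lookup taggers, the dictionary-based language identifier, the romanized toy word lists, the tag names, and every function name below are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def make_lexicon_tagger(lexicon: Dict[str, str], default: str = "X") -> Tagger:
    """Build a toy tagger that looks each token up in a word-to-tag lexicon."""
    def tag(tokens: List[str]) -> List[str]:
        return [lexicon.get(tok.lower(), default) for tok in tokens]
    return tag


def tag_via_monolingual(
    tokens: List[str],
    lang_of: Callable[[str], str],
    taggers: Dict[str, Tagger],
) -> List[Tuple[str, str]]:
    """Approach (a): identify each token's language, then send every
    contiguous same-language run to that language's monolingual tagger."""
    tags: List[str] = []
    i = 0
    while i < len(tokens):
        lang = lang_of(tokens[i])
        j = i
        while j < len(tokens) and lang_of(tokens[j]) == lang:
            j += 1
        tags.extend(taggers[lang](tokens[i:j]))
        i = j
    return list(zip(tokens, tags))


def tag_via_code_mixed(tokens: List[str], mixed_tagger: Tagger) -> List[Tuple[str, str]]:
    """Approach (b): one tagger built directly from code-mixed data."""
    return list(zip(tokens, mixed_tagger(tokens)))


# Toy, hypothetical lexicons (illustrative only; not data from the paper).
EN_LEXICON = {"book": "NOUN", "read": "VERB"}
AS_LEXICON = {"moi": "PRON", "kitap": "NOUN"}
MIXED_LEXICON = {**EN_LEXICON, **AS_LEXICON}


def toy_lang_of(token: str) -> str:
    """Assumed language identifier: English if in the English lexicon, else Assamese."""
    return "en" if token.lower() in EN_LEXICON else "as"


tokens = ["moi", "book", "read", "kitap"]
result_a = tag_via_monolingual(
    tokens,
    toy_lang_of,
    {"en": make_lexicon_tagger(EN_LEXICON), "as": make_lexicon_tagger(AS_LEXICON)},
)
result_b = tag_via_code_mixed(tokens, make_lexicon_tagger(MIXED_LEXICON))
```

In this sketch, approach (a) needs a language identifier plus one tagger per language, so errors in language identification carry over into tagging, while approach (b) needs only a single resource, but one that has to be built from code-mixed data.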

Anthology ID: W18-4110
Volume: Proceedings of the First International Workshop on Language Cognition and Computational Models
Month: August
Year: 2018
Address: Santa Fe, New Mexico, USA
Editors: Manjira Sinha, Tirthankar Dasgupta
Venue: LCCM
Publisher: Association for Computational Linguistics
Pages: 94–103
URL: https://aclanthology.org/W18-4110
Cite (ACL): Ritesh Kumar and Manas Jyoti Bora. 2018. Part-of-Speech Annotation of English-Assamese code-mixed texts: Two Approaches. In Proceedings of the First International Workshop on Language Cognition and Computational Models, pages 94–103, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Part-of-Speech Annotation of English-Assamese code-mixed texts: Two Approaches (Kumar & Bora, LCCM 2018)
PDF: https://aclanthology.org/W18-4110.pdf