A Twitter Corpus for Hindi-English Code Mixed POS Tagging

Kushagra Singh, Indira Sen, Ponnurangam Kumaraguru


Abstract
Code-mixing is a linguistic phenomenon in which multiple languages are used within the same utterance, and it is increasingly common in multilingual societies. Code-mixed content on social media is also on the rise, prompting the need for tools that can automatically understand such content. Automatic Part-of-Speech (POS) tagging is an essential step in any Natural Language Processing (NLP) pipeline, but there is a lack of annotated data to train such models. In this work, we present a language-tagged and POS-tagged dataset of code-mixed English-Hindi tweets related to five incidents in India that led to substantial Twitter activity. Our dataset is unique in two dimensions: (i) it is larger than previous annotated datasets and (ii) it closely resembles typical real-world tweets. Additionally, we present a POS tagging model trained on this dataset as an example of how the dataset can be used. The model also shows the efficacy of our dataset in enabling the creation of code-mixed social media POS taggers.
Anthology ID:
W18-3503
Volume:
Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Lun-Wei Ku, Cheng-Te Li
Venue:
SocialNLP
Publisher:
Association for Computational Linguistics
Pages:
12–17
URL:
https://aclanthology.org/W18-3503
DOI:
10.18653/v1/W18-3503
Cite (ACL):
Kushagra Singh, Indira Sen, and Ponnurangam Kumaraguru. 2018. A Twitter Corpus for Hindi-English Code Mixed POS Tagging. In Proceedings of the Sixth International Workshop on Natural Language Processing for Social Media, pages 12–17, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
A Twitter Corpus for Hindi-English Code Mixed POS Tagging (Singh et al., SocialNLP 2018)
PDF:
https://aclanthology.org/W18-3503.pdf