Twitter corpus of Resource-Scarce Languages for Sentiment Analysis and Multilingual Emoji Prediction

Nurendra Choudhary, Rajat Singh, Vijjini Anvesh Rao, Manish Shrivastava


Abstract
In this paper, we leverage social media platforms such as Twitter to develop corpora across multiple languages. The corpus creation methodology is applicable to resource-scarce languages, provided that speakers of the language are active users of social media platforms. We present an approach to extract social media microblogs such as tweets (Twitter). We create corpora for multilingual sentiment analysis and emoji prediction in Hindi, Bengali and Telugu. Further, we perform and analyze multiple NLP tasks using the corpus and report the resulting observations.
Anthology ID: C18-1133
Volume: Proceedings of the 27th International Conference on Computational Linguistics
Month: August
Year: 2018
Address: Santa Fe, New Mexico, USA
Editors: Emily M. Bender, Leon Derczynski, Pierre Isabelle
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 1570–1577
URL: https://aclanthology.org/C18-1133
Cite (ACL): Nurendra Choudhary, Rajat Singh, Vijjini Anvesh Rao, and Manish Shrivastava. 2018. Twitter corpus of Resource-Scarce Languages for Sentiment Analysis and Multilingual Emoji Prediction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1570–1577, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal): Twitter corpus of Resource-Scarce Languages for Sentiment Analysis and Multilingual Emoji Prediction (Choudhary et al., COLING 2018)
PDF: https://aclanthology.org/C18-1133.pdf