Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture

Soumil Mandal, Anil Kumar Singh


Abstract
An accurate language identification tool is an absolute necessity for building complex NLP systems to be used on code-mixed data. A lot of work has recently been done on this problem, but there is still room for improvement. Inspired by recent advances in neural network architectures for computer vision tasks, we have implemented multichannel neural networks combining CNN and LSTM for word-level language identification of code-mixed data. Combined with a Bi-LSTM-CRF context capture module, this approach achieves accuracies of 93.28% and 93.32% on our two testing sets.
Anthology ID:
W18-6116
Volume:
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
Month:
November
Year:
2018
Address:
Brussels, Belgium
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
116–120
URL:
https://aclanthology.org/W18-6116
DOI:
10.18653/v1/W18-6116
Cite (ACL):
Soumil Mandal and Anil Kumar Singh. 2018. Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 116–120, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture (Mandal & Singh, WNUT 2018)
PDF:
https://aclanthology.org/W18-6116.pdf