Byte-based Language Identification with Deep Convolutional Networks

Johannes Bjerva


Abstract
We report on our system for the shared task on discriminating between similar languages (DSL 2016). The system, named ResIdent, uses only byte representations as input to a deep residual network (ResNet) and is trained only on the data released with the task (closed training). We obtain 84.88% accuracy on subtask A, 68.80% accuracy on subtask B1, and 69.80% accuracy on subtask B2. Relatively minor changes to the network’s architecture and hyperparameters produce large differences in accuracy on development data. We therefore expect further tuning of these parameters to yield higher accuracies.
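As an illustration of what "byte representations" can look like in practice, the sketch below maps a string to a fixed-length sequence of integer byte IDs suitable as input to an embedding layer. It is not taken from the paper: the helper name `encode_bytes`, the `max_len` and `pad_id` parameters, and the choice to shift byte values by one so that 0 can serve as padding are all assumptions made for this example.

```python
def encode_bytes(text: str, max_len: int = 16, pad_id: int = 0) -> list[int]:
    """Map text to a fixed-length list of byte IDs (hypothetical helper).

    Assumptions for illustration only: UTF-8 encoding, truncation to
    max_len bytes, and IDs shifted by 1 so that pad_id=0 is unambiguous.
    """
    # Encode to raw UTF-8 bytes; non-ASCII characters become several bytes.
    # Truncation is byte-wise, so it may cut a multi-byte character in half.
    raw = text.encode("utf-8")[:max_len]
    # Shift each byte value (0..255) to 1..256, leaving 0 free for padding.
    ids = [b + 1 for b in raw]
    # Right-pad so that every sequence has exactly max_len entries.
    return ids + [pad_id] * (max_len - len(ids))


# "š" occupies two bytes in UTF-8 (0xC5 0xA1), so three characters yield four IDs.
print(encode_bytes("šta", max_len=6))  # [198, 162, 117, 98, 0, 0]
```

Working on bytes keeps the input vocabulary at 256 symbols plus padding regardless of script, which is one reason a byte-level model needs no language-specific tokenisation.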
Anthology ID:
W16-4816
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
119–125
URL:
https://aclanthology.org/W16-4816
Cite (ACL):
Johannes Bjerva. 2016. Byte-based Language Identification with Deep Convolutional Networks. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 119–125, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Byte-based Language Identification with Deep Convolutional Networks (Bjerva, VarDial 2016)
PDF:
https://aclanthology.org/W16-4816.pdf