Discriminating between Similar Languages using Weighted Subword Features

Adrien Barbaresi


Abstract
The present contribution revolves around a contrastive subword n-gram model which has been tested in the Discriminating between Similar Languages shared task. I present and discuss the method used in this 14-way language identification task comprising varieties of 6 main language groups. It features the following characteristics: (1) the preprocessing and conversion of a collection of documents to sparse features; (2) weighted character n-gram profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams features can be used as a system in a straightforward way; my approach outperforms most of the systems used in the DSL shared task (3rd rank).
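
The three characteristics listed in the abstract can be illustrated with a minimal standard-library sketch. It is not the author's implementation: the TF-IDF weighting, the n-gram range of 2 to 4 characters, the additive smoothing value, the helper names (`char_ngrams`, `WeightedNgramNB`) and the toy Spanish/Portuguese sentences are all assumptions made for illustration, and none of the data comes from the DSL corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Step 1: lowercase, pad with spaces, and list all character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class WeightedNgramNB:
    """Multinomial naive Bayes over TF-IDF-weighted character n-grams.

    TF-IDF is an assumed weighting scheme, chosen only for illustration.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing, assumed value

    def fit(self, texts, labels):
        docs = [Counter(char_ngrams(t)) for t in texts]
        n_docs = len(docs)
        # Document frequency: number of documents containing each n-gram.
        df = Counter(g for d in docs for g in d)
        # Step 2: smoothed inverse document frequency as the n-gram weight.
        self.idf = {g: math.log((1 + n_docs) / (1 + c)) + 1.0
                    for g, c in df.items()}
        self.vocab = set(df)

        # Accumulate weighted n-gram mass per class.
        mass = defaultdict(lambda: defaultdict(float))
        for doc, label in zip(docs, labels):
            for g, tf in doc.items():
                mass[label][g] += tf * self.idf[g]

        # Step 3: multinomial naive Bayes parameters in log space.
        class_counts = Counter(labels)
        self.log_prior = {y: math.log(c / n_docs)
                          for y, c in class_counts.items()}
        self.log_like = {}
        for y in class_counts:
            total = sum(mass[y].values()) + self.alpha * len(self.vocab)
            self.log_like[y] = {
                g: math.log((mass[y].get(g, 0.0) + self.alpha) / total)
                for g in self.vocab
            }
        return self

    def predict(self, text):
        # N-grams never seen in training are ignored.
        feats = Counter(g for g in char_ngrams(text) if g in self.vocab)
        scores = {
            y: self.log_prior[y] + sum(tf * self.idf[g] * self.log_like[y][g]
                                       for g, tf in feats.items())
            for y in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy training data (hypothetical, not taken from the shared task).
train_texts = [
    "el gato come pescado",
    "la casa es muy grande",
    "o gato come peixe",
    "a casa é muito grande",
]
train_labels = ["es", "es", "pt", "pt"]

model = WeightedNgramNB().fit(train_texts, train_labels)
print(model.predict("el pescado"))
print(model.predict("o peixe"))
```

A real system for this task would train on far more text per variety and would typically store the features in a sparse matrix; the dictionaries here only keep the sketch dependency-free.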
Anthology ID:
W17-1223
Volume:
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Preslav Nakov, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann, Shevin Malmasi, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
184–189
URL:
https://aclanthology.org/W17-1223
DOI:
10.18653/v1/W17-1223
Cite (ACL):
Adrien Barbaresi. 2017. Discriminating between Similar Languages using Weighted Subword Features. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 184–189, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Discriminating between Similar Languages using Weighted Subword Features (Barbaresi, VarDial 2017)
PDF:
https://aclanthology.org/W17-1223.pdf
Code
 adbar/vardial-experiments