Dialect Identification under Domain Shift: Experiments with Discriminating Romanian and Moldavian

Çağrı Çöltekin


Abstract
This paper describes a set of experiments for discriminating between two closely related language varieties, Moldavian and Romanian, under a substantial domain shift. The experiments were conducted as part of the Romanian dialect identification task in the VarDial 2020 evaluation campaign. Our best system, based on a linear SVM classifier, obtained the first position in the shared task with an F1 score of 0.79, supporting earlier results showing the (unexpected) success of machine learning systems in this task. The additional experiments reported in this paper also show that adapting to the test set is useful when the training data comes from another domain. However, the benefit of adaptation becomes doubtful even when a small amount of data from the target domain is available.
Anthology ID:
2020.vardial-1.17
Volume:
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer
Venue:
VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
Pages:
186–192
URL:
https://aclanthology.org/2020.vardial-1.17
Cite (ACL):
Çağrı Çöltekin. 2020. Dialect Identification under Domain Shift: Experiments with Discriminating Romanian and Moldavian. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 186–192, Barcelona, Spain (Online). International Committee on Computational Linguistics (ICCL).
Cite (Informal):
Dialect Identification under Domain Shift: Experiments with Discriminating Romanian and Moldavian (Çöltekin, VarDial 2020)
PDF:
https://aclanthology.org/2020.vardial-1.17.pdf
Data
MOROCO