Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training

Hila Gonen, Yoav Goldberg


Abstract
We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the effectiveness of training with large amounts of monolingual data followed by fine-tuning with small amounts of code-switched data, for both the generative and discriminative cases.
Anthology ID:
D19-1427
Volume:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Kentaro Inui, Jing Jiang, Vincent Ng, Xiaojun Wan
Venues:
EMNLP | IJCNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4175–4185
URL:
https://aclanthology.org/D19-1427
DOI:
10.18653/v1/D19-1427
Cite (ACL):
Hila Gonen and Yoav Goldberg. 2019. Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4175–4185, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training (Gonen & Goldberg, EMNLP-IJCNLP 2019)
PDF:
https://aclanthology.org/D19-1427.pdf
Attachment:
 D19-1427.Attachment.zip