Improving the Language Model for Low-Resource ASR with Online Text Corpora

Nils Hjortnaes, Timofey Arkhangelskiy, Niko Partanen, Michael Rießler, Francis Tyers


Abstract
In this paper, we expand on previous work on automatic speech recognition in a low-resource scenario typical of data collected by field linguists. We train DeepSpeech models on 35 hours of dialectal Komi speech recordings and correct the output using language models constructed from various sources. Previous experiments showed that transfer learning using DeepSpeech can improve the accuracy of a speech recognizer for Komi, though the error rate remained very high. In this paper we present further experiments with language models created using KenLM from text materials available online. We construct three language models: one from a corpus of literary texts, one from a corpus of social media content, and one from the two corpora combined. We then trained the model using each language model to explore the impact of the language model data source on the speech recognition model. Our results show significant improvements of over 25% in character error rate and nearly 20% in word error rate. This offers important methodological insight into how ASR results can be improved under low-resource conditions: transfer learning can be used to compensate for the lack of training data in the target language, and online texts are a very useful resource when developing language models in this context.
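The abstract reports results as character error rate (CER) and word error rate (WER). As a minimal sketch of how these two metrics are conventionally defined (edit distance between hypothesis and reference, divided by the reference length), the following may help. It is not the authors' evaluation code; the function names and the example strings are hypothetical and chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted word, one substituted character.
reference = "the cat sat"
hypothesis = "the cat sit"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

A single wrong character makes the whole word count as an error, so WER is typically higher than CER for the same output, which is consistent with the two metrics being reported separately.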
Anthology ID:
2020.sltu-1.47
Volume:
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Dorothee Beermann, Laurent Besacier, Sakriani Sakti, Claudia Soria
Venue:
SLTU
Publisher:
European Language Resources Association
Pages:
336–341
Language:
English
URL:
https://aclanthology.org/2020.sltu-1.47
Cite (ACL):
Nils Hjortnaes, Timofey Arkhangelskiy, Niko Partanen, Michael Rießler, and Francis Tyers. 2020. Improving the Language Model for Low-Resource ASR with Online Text Corpora. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 336–341, Marseille, France. European Language Resources Association.
Cite (Informal):
Improving the Language Model for Low-Resource ASR with Online Text Corpora (Hjortnaes et al., SLTU 2020)
PDF:
https://aclanthology.org/2020.sltu-1.47.pdf