RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition

Alexandru-Lucian Georgescu, Horia Cucu, Andi Buzo, Corneliu Burileanu


Abstract
Although many efforts have been made in the last decade to enhance the speech and language resources for Romanian, the language is still considered under-resourced. While large speech corpora are available for research and commercial applications in many other languages, the largest publicly available Romanian corpus to date comprises less than 50 hours of speech. In this context, the Speech and Dialogue research group releases the Read Speech Corpus (RSC), a Romanian speech corpus developed in-house that comprises 100 hours of speech recordings from 164 different speakers. The paper describes the development of the corpus and presents baseline automatic speech recognition (ASR) results obtained with state-of-the-art ASR technology: the Kaldi speech recognition toolkit.
Anthology ID:
2020.lrec-1.814
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6606–6612
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.814
Cite (ACL):
Alexandru-Lucian Georgescu, Horia Cucu, Andi Buzo, and Corneliu Burileanu. 2020. RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6606–6612, Marseille, France. European Language Resources Association.
Cite (Informal):
RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition (Georgescu et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.814.pdf