The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource

Andrej Žgank, Mirjam Sepesy Maučec, Darinka Verdonik


Abstract
This paper presents a new Slovenian spoken language resource built from TEDx Talks. The speech database contains 242 talks with a total duration of 54 hours. The annotation and transcription of the acquired spoken material were generated automatically by applying acoustic segmentation and automatic speech recognition. The development and evaluation subsets were also manually transcribed following the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k are distinct.
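The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and the standard definitions: word error rate as word-level edit distance divided by reference length, and out-of-vocabulary rate as the share of running tokens missing from the recognizer vocabulary. The function names `word_error_rate` and `oov_rate` are hypothetical, and the abstract does not state which tokenization, text normalization, or scoring tool the authors actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the database.
    print(word_error_rate("a b c d", "a x c"))
    print(oov_rate(["a", "b", "c", "d"], {"a", "b", "c"}))
```

In the toy example, the hypothesis contains one substitution and one deletion against a four-word reference, and one of the four tokens is outside the vocabulary.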
Anthology ID:
L16-1740
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4670–4673
URL:
https://aclanthology.org/L16-1740
Cite (ACL):
Andrej Žgank, Mirjam Sepesy Maučec, and Darinka Verdonik. 2016. The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4670–4673, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource (Žgank et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1740.pdf