TED-LIUM: an Automatic Speech Recognition dedicated corpus

Anthony Rousseau, Paul Deléglise, Yannick Estève


Abstract
This paper presents the corpus developed by the LIUM for Automatic Speech Recognition (ASR), based on the TED Talks. The corpus was built during the IWSLT 2011 Evaluation Campaign and comprises 118 hours of speech with accompanying automatically aligned transcripts. We describe the content of the corpus, how the data was collected and processed, how it will be made publicly available, and how we built an ASR system from this data that reaches a word error rate (WER) of 17.4%. The official results we obtained at the IWSLT 2011 evaluation campaign are also discussed.
Anthology ID:
L12-1405
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
125–129
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/698_Paper.pdf
Cite (ACL):
Anthony Rousseau, Paul Deléglise, and Yannick Estève. 2012. TED-LIUM: an Automatic Speech Recognition dedicated corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 125–129, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
TED-LIUM: an Automatic Speech Recognition dedicated corpus (Rousseau et al., LREC 2012)