The KIT Lecture Corpus for Speech Translation

Sebastian Stüker, Florian Kraft, Christian Mohr, Teresa Herrmann, Eunah Cho, Alex Waibel


Abstract
Academic lectures offer valuable content, but often do not reach their full potential audience because of the language barrier. Human translation of lectures is too expensive to be widely used. Speech translation technology can be an affordable alternative. State-of-the-art speech translation systems use statistical models that need to be trained on large amounts of in-domain data. To support the KIT lecture translation project in its effort to introduce speech translation technology in KIT's lecture halls, we have collected a corpus of German lectures at KIT. In this paper we describe how we recorded and annotated the lectures. We further give detailed statistics on the types of lectures in the corpus and its size. We collected the corpus so that it is suited not only for training a spoken language translation system in the traditional way, but also for researching techniques that allow the translation system to adapt itself automatically and autonomously to the varying topics and speakers of lectures.
Anthology ID:
L12-1661
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3409–3414
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/1121_Paper.pdf
Cite (ACL):
Sebastian Stüker, Florian Kraft, Christian Mohr, Teresa Herrmann, Eunah Cho, and Alex Waibel. 2012. The KIT Lecture Corpus for Speech Translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3409–3414, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
The KIT Lecture Corpus for Speech Translation (Stüker et al., LREC 2012)