MuST-Cinema: a Speech-to-Subtitles corpus

Alina Karakanta, Matteo Negri, Marco Turchi


Abstract
The growing need to localise audiovisual content in multiple languages through subtitles calls for the development of automatic solutions for human subtitling. Neural Machine Translation (NMT) can contribute to the automatisation of subtitling, facilitating the work of human subtitlers and reducing turn-around times and related costs. NMT requires high-quality, large, task-specific training data. The existing subtitling corpora, however, lack both alignments to the source language audio and important information about subtitle breaks. This poses a significant limitation for developing efficient automatic approaches for subtitling, since the length and form of a subtitle depend directly on the duration of the utterance. In this work, we present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles. The corpus comprises (audio, transcription, translation) triplets. Subtitle breaks are preserved by inserting special symbols. We show that the corpus can be used to build models that efficiently segment sentences into subtitles, and we propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the constraint of length.
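As a rough illustration of how text with break symbols might be consumed, the sketch below splits an annotated sentence into subtitle blocks and lines and checks a per-line length limit. The abstract does not name the symbols or the limit, so the markers `<eol>` and `<eob>`, the 42-character limit, the example sentence, and the helper names `split_subtitles` and `conforms` are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List

# Assumed marker names; the abstract only says "special symbols".
EOL = "<eol>"   # assumed: line break inside a subtitle block
EOB = "<eob>"   # assumed: end of a subtitle block
MAX_CHARS = 42  # assumed per-line length limit, for illustration only


def split_subtitles(annotated: str) -> List[List[str]]:
    """Split an annotated sentence into blocks, each a list of lines."""
    blocks = []
    for block in annotated.split(EOB):
        # Break each block into lines and drop empty fragments.
        lines = [line.strip() for line in block.split(EOL)]
        lines = [line for line in lines if line]
        if lines:
            blocks.append(lines)
    return blocks


def conforms(blocks: List[List[str]], max_chars: int = MAX_CHARS) -> bool:
    """Return True if every subtitle line respects the length limit."""
    return all(len(line) <= max_chars for lines in blocks for line in lines)


# Hypothetical annotated sentence, not taken from the corpus.
example = (
    "Subtitles must be short enough <eol> to read comfortably. <eob> "
    "They also follow the speech. <eob>"
)
blocks = split_subtitles(example)
for number, lines in enumerate(blocks, start=1):
    print(number, lines)
print("conforms:", conforms(blocks))
```

Keeping the markers inline means the segmentation travels with the text, so the same string can serve as training data for a sequence model and be converted back into subtitle blocks afterwards.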
Anthology ID:
2020.lrec-1.460
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3727–3734
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.460
Cite (ACL):
Alina Karakanta, Matteo Negri, and Marco Turchi. 2020. MuST-Cinema: a Speech-to-Subtitles corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3727–3734, Marseille, France. European Language Resources Association.
Cite (Informal):
MuST-Cinema: a Speech-to-Subtitles corpus (Karakanta et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.460.pdf
Data
MuST-Cinema, JESC, MuST-C, OpenSubtitles