A Fully Annotated Corpus of Russian Speech

Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, Vera Evdokimova


Abstract
The paper introduces CORPRES ― a fully annotated Russian speech corpus developed at the Department of Phonetics, St. Petersburg State University as a result of a three-year project. The corpus includes samples of different speaking styles produced by 4 male and 4 female speakers. Six levels of annotation cover all phonetic and prosodic information about the recorded speech data, including labels for pitch marks, phonetic events, narrow and wide phonetic transcription, orthographic and prosodic transcription. Precise phonetic transcription of the data provides an especially valuable resource for both research and development purposes. Overall corpus size is 528 458 running words and contains 60 hours of speech made up of 7.5 hours from each speaker. 40% of the corpus was manually segmented and fully annotated on all six levels. 60% of the corpus was partly annotated; there are labels for pitch period and phonetic event labels. Orthographic, prosodic and ideal phonetic transcription for this part was generated and stored as text files. The fully annotated part of the corpus covers all speaking styles included in the corpus and all speakers. The paper contains information about CORPRES design and annotation principles, overall data description and some speculation about possible use of the corpus.
Anthology ID:
L10-1188
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/274_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Pavel Skrelin, Nina Volskaya, Daniil Kocharov, Karina Evgrafova, Olga Glotova, and Vera Evdokimova. 2010. A Fully Annotated Corpus of Russian Speech. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
A Fully Annotated Corpus of Russian Speech (Skrelin et al., LREC 2010)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/274_Paper.pdf