Design and Data Collection for the Accentological Corpus of the Russian Language

Elena Grishina, Svetlana Savchuk, Alexej Poljakov


Abstract
The Accentological corpus gives researchers an opportunity to study word stress and stress variation, both of which are very important in Russian. It also makes it possible to study the historical development of Russian stress. This paper presents the main characteristics of the Accentological corpus, available at ruscorpora.ru: its size, the types and sources of its text material, the way that material is represented, the types of linguistic annotation, the corpus composition, and ways to use it effectively for different purposes. The corpus has two zones. 1) The prose zone includes spoken texts and film transcripts in which stressed syllables are marked according to the actual pronunciation. 2) The poetry zone contains texts with marked accented syllables, from which the exact word stress can be determined using special rules. The Accentological corpus has four types of annotation (metatextual, morphological, semantic and sociological) as well as its own accentological mark-up. Thanks to the accentological annotation, each word is supplied with stress marks, so a user can query and retrieve stressed or unstressed word forms in combination with grammatical and semantic features.
Anthology ID:
L10-1247
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/358_Paper.pdf
Cite (ACL):
Elena Grishina, Svetlana Savchuk, and Alexej Poljakov. 2010. Design and Data Collection for the Accentological Corpus of the Russian Language. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Design and Data Collection for the Accentological Corpus of the Russian Language (Grishina et al., LREC 2010)