A Corpus of Spontaneous Multi-party Conversation in Bosnian Serbo-Croatian and British English

Emina Kurtić, Bill Wells, Guy J. Brown, Timothy Kempton, Ahmet Aker


Abstract
In this paper we present a corpus of audio and video recordings of spontaneous, face-to-face multi-party conversation in two languages. Freely available, high-quality recordings of mundane, non-institutional, multi-party talk are still scarce, and this corpus aims to contribute data suitable for the study of multiple aspects of spoken interaction. In particular, it constitutes a unique resource for spoken Bosnian Serbo-Croatian (BSC), an under-resourced language with no spoken resources available at present. The corpus consists of just over 3 hours of free conversation in each of the target languages, BSC and British English (BE). The audio was recorded on separate channels using headset microphones, and also with a microphone array containing 8 omnidirectional microphones. The data has been segmented and transcribed using segmentation notions and transcription conventions developed from those of the conversation analysis research tradition. Furthermore, the transcriptions have been automatically aligned with the audio at the word and phone level using forced alignment. We describe the procedures behind the creation of the corpus and present its main features for the study of conversation.
Anthology ID:
L12-1282
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1323–1327
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/513_Paper.pdf
Cite (ACL):
Emina Kurtić, Bill Wells, Guy J. Brown, Timothy Kempton, and Ahmet Aker. 2012. A Corpus of Spontaneous Multi-party Conversation in Bosnian Serbo-Croatian and British English. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 1323–1327, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
A Corpus of Spontaneous Multi-party Conversation in Bosnian Serbo-Croatian and British English (Kurtić et al., LREC 2012)