SpiCE: A New Open-Access Corpus of Conversational Bilingual Speech in Cantonese and English

Khia A. Johnson, Molly Babel, Ivan Fong, Nancy Yiu


Abstract
This paper describes the design, collection, orthographic transcription, and phonetic annotation of SpiCE, a new corpus of conversational Cantonese-English bilingual speech recorded in Vancouver, Canada. The corpus includes high-quality recordings of 34 early bilinguals in both English and Cantonese; to date, 27 have been recorded, for a total of 19 hours of participant speech. Participants completed a sentence reading task, a storyboard narration, and a conversational interview in each language. Transcription and annotation for the corpus are currently underway. Transcripts produced with Google Cloud Speech-to-Text are available for all participants and will be included in the initial SpiCE corpus release. Hand-corrected orthographic transcripts and force-aligned phonetic transcripts will be released periodically and, once complete for all recordings, will comprise the second release of the corpus. As an open-access language resource, SpiCE will promote bilingualism research for a typologically distinct pair of languages, of which Cantonese remains understudied despite having millions of speakers around the world. The SpiCE corpus is especially well-suited for phonetic research on conversational speech, and it enables researchers to study cross-language within-speaker phenomena for a diverse group of early Cantonese-English bilinguals. These are areas with few existing high-quality resources.
Anthology ID:
2020.lrec-1.503
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4089–4095
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.503
Cite (ACL):
Khia A. Johnson, Molly Babel, Ivan Fong, and Nancy Yiu. 2020. SpiCE: A New Open-Access Corpus of Conversational Bilingual Speech in Cantonese and English. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4089–4095, Marseille, France. European Language Resources Association.
Cite (Informal):
SpiCE: A New Open-Access Corpus of Conversational Bilingual Speech in Cantonese and English (Johnson et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.503.pdf