Collecting Voices from the Cloud

Ian McGraw, Chia-ying Lee, Lee Hetherington, Stephanie Seneff, Jim Glass


Abstract
The collection and transcription of speech data is typically an expensive and time-consuming task. Voice over IP and cloud computing are poised to greatly reduce this impediment to research on spoken language interfaces in many domains. This paper documents our efforts to deploy speech-enabled web interfaces to large audiences over the Internet via Amazon Mechanical Turk, an online marketplace for work. Using the open source WAMI Toolkit, we collected corpora in two different domains, which together constitute over 113 hours of speech. The first corpus contains 100,000 utterances of read speech and was collected by asking workers to record street addresses in the United States. For the second task, we collected conversations with FlightBrowser, a multimodal spoken dialogue system. The resulting FlightBrowser corpus contains 10,651 utterances making up 1,113 individual dialogue sessions from 101 distinct users. The aggregate time spent collecting the data for both corpora was just under two weeks. At times, our servers were logging audio from workers at rates faster than real time. We describe the collection and transcription of these corpora and analyze the advantages and limitations of this data collection method.
Anthology ID:
L10-1567
Volume:
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Month:
May
Year:
2010
Address:
Valletta, Malta
Editors:
Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2010/pdf/822_Paper.pdf
Cite (ACL):
Ian McGraw, Chia-ying Lee, Lee Hetherington, Stephanie Seneff, and Jim Glass. 2010. Collecting Voices from the Cloud. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta. European Language Resources Association (ELRA).
Cite (Informal):
Collecting Voices from the Cloud (McGraw et al., LREC 2010)