Massively Multilingual Pronunciation Modeling with WikiPron

Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, Kyle Gorman


Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
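As a rough illustration of how word/pronunciation pairs of this kind might be consumed and scored, here is a minimal Python sketch. It assumes a two-column, tab-separated layout (word, then space-separated pronunciation segments). The sample entries are made up for illustration, and the function names `parse_pron_tsv` and `word_error_rate` are hypothetical, not part of WikiPron. Exact-match word error rate is a commonly used grapheme-to-phoneme metric, but this is not necessarily the evaluation procedure used in the paper.

```python
from typing import Dict, List, Tuple

Pron = List[str]


def parse_pron_tsv(text: str) -> List[Tuple[str, Pron]]:
    """Parse lines of the assumed form 'word<TAB>seg seg seg'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Only the first two columns are used; extra columns are ignored.
        word, pron = line.split("\t")[:2]
        entries.append((word, pron.split()))
    return entries


def word_error_rate(gold: List[Tuple[str, Pron]],
                    predicted: Dict[str, Pron]) -> float:
    """Fraction of gold words whose predicted pronunciation is not an exact match."""
    if not gold:
        raise ValueError("gold must contain at least one entry")
    errors = sum(1 for word, pron in gold if predicted.get(word) != pron)
    return errors / len(gold)


# Illustrative data only; not taken from the actual database.
sample = "cat\tk æ t\ndog\td ɔ ɡ\n"
gold = parse_pron_tsv(sample)

# Hypothetical model output: one correct word, one with a wrong vowel.
predicted = {"cat": ["k", "æ", "t"], "dog": ["d", "ɑ", "ɡ"]}
wer = word_error_rate(gold, predicted)
print(f"WER: {wer:.2f}")
```

The sketch keeps one pronunciation per word for simplicity; a word with several listed pronunciations would need a policy for what counts as a match.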
Anthology ID:
2020.lrec-1.521
Original:
2020.lrec-1.521v1
Version 2:
2020.lrec-1.521v2
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4223–4228
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.521
Cite (ACL):
Jackson L. Lee, Lucas F.E. Ashby, M. Elizabeth Garza, Yeonju Lee-Sikka, Sean Miller, Alan Wong, Arya D. McCarthy, and Kyle Gorman. 2020. Massively Multilingual Pronunciation Modeling with WikiPron. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4223–4228, Marseille, France. European Language Resources Association.
Cite (Informal):
Massively Multilingual Pronunciation Modeling with WikiPron (Lee et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.521.pdf