Grapheme-to-Phoneme Models for (Almost) Any Language

Grapheme-to-phoneme (g2p) models are rarely available in low-resource languages, as the creation of training and evaluation data is expensive and time-consuming. We use Wiktionary to obtain more than 650k word-pronunciation pairs in more than 500 languages. We then develop phoneme and language distance metrics based on phonological and linguistic knowledge; applying those, we adapt g2p models for high-resource languages to create models for related low-resource languages. We provide results for models for 229 adapted languages.


Introduction
Grapheme-to-phoneme (g2p) models convert words into pronunciations, and are ubiquitous in speech- and text-processing systems. Due to the diversity of scripts, phoneme inventories, phonotactic constraints, and spelling conventions among the world's languages, they are typically language-specific. Thus, while most statistical g2p learning methods are language-agnostic, they are trained on language-specific data, namely a pronunciation dictionary consisting of word-pronunciation pairs, as in Table 1.
Building such a dictionary for a new language is both time-consuming and expensive, because it requires expertise in both the language and a notation system like the International Phonetic Alphabet, applied to thousands of word-pronunciation pairs. Unsurprisingly, resources have been allocated only to the most heavily-researched languages. GlobalPhone, one of the most extensive multilingual text and speech databases, has pronunciation dictionaries in only 20 languages.1

1 We have been unable to obtain this dataset.

word   eng        deu       nld
gift   ɡ ɪ f tʰ   ɡ ɪ f t   ɣ ɪ f t
class  kʰ l ae s  k l aː s  k l ɑ s
send   s e̞ n d    z ɛ n t   s ɛ n t

Table 2: Example pronunciations of English words using English, German, and Dutch g2p models.
For most of the world's more than 7,100 languages (Lewis et al., 2009), no data exists and the many technologies enabled by g2p models are inaccessible.
Intuitively, however, pronouncing an unknown language should not necessarily require large amounts of language-specific knowledge or data. A native German or Dutch speaker, with no knowledge of English, can approximate the pronunciations of an English word, albeit with slightly different phonemes. Table 2 demonstrates that German and Dutch g2p models can do the same.
Motivated by this, we create and evaluate g2p models for low-resource languages by adapting existing g2p models for high-resource languages using linguistic and phonological information. To facilitate our experiments, we create several notable data resources, including a multilingual pronunciation dictionary with entries for more than 500 languages.
The contributions of this work are:
• Using data scraped from Wiktionary, we clean and normalize pronunciation dictionaries for 531 languages. To our knowledge, this is the most comprehensive multilingual pronunciation dictionary available.
• We synthesize several named entity corpora to create a multilingual corpus covering 384 languages.
• We develop a language-independent distance metric between IPA phonemes.
• We extend previous metrics for language-to-language distance with additional information and metrics.
• We create two sets of g2p models for "high-resource" languages: 97 simple rule-based models extracted from Wikipedia's "IPA Help" pages, and 85 data-driven models built from Wiktionary data.
• We develop methods for adapting these g2p models to related languages, and describe results for 229 adapted models.
• We release all data and models.

Related Work
Because of the severe lack of multilingual pronunciation dictionaries and g2p models, different methods of rapid resource generation have been proposed. Schultz (2009) reduces the amount of expertise needed to build a pronunciation dictionary by providing a native speaker with an intuitive rule-generation user interface. Schlippe et al. (2010) crawl web resources like Wiktionary for word-pronunciation pairs. More recently, attempts have been made to automatically extract pronunciation dictionaries directly from audio data (Stahlberg et al., 2016). However, the requirement of a native speaker, web resources, or audio data specific to the language still blocks development, and the number of g2p resources remains very low. Our method avoids these issues by relying only on text data from high-resource languages.
Instead of generating language-specific resources, we draw inspiration from research on cross-lingual automatic speech recognition (ASR) such as that of Vu et al. (2014), which exploits linguistic and phonetic relationships in low-resource scenarios. Although these works focus on ASR instead of g2p models and rely on audio data, they demonstrate that speech technology is portable across related languages.

Method
Given a low-resource language l without g2p rules or training data, we adapt resources (either an existing g2p model or a pronunciation dictionary) from a high-resource language h to create a g2p model for l. We assume the existence of two modules: a phoneme-to-phoneme distance metric phon2phon, which allows us to map between the phonemes used by h and the phonemes used by l, and a closest-language module lang2lang, which provides us with a related high-resource language h.
Using these resources, we adapt resources from h to l in two different ways:
• Output mapping (Figure 1a): We use h's g2p model to pronounce a word in l, then map the output to the phonemes used by l with phon2phon.
• Training data mapping (Figure 1b): We use phon2phon to map the pronunciations in h's pronunciation dictionary to the phonemes used by l, then train a g2p model using the adapted data.
The next sections describe how we collect data, create phoneme-to-phoneme and language-to-language distance metrics, and build high-resource g2p models.

Data
This section describes our data sources, which are summarized in Table 3.

Phoible
Phoible (Moran et al., 2014) is an online repository of cross-lingual phonological data. We use two of its components: language phoneme inventories and phonetic features.

Table 3: Summary of data resources obtained from Phoible, named entity resources, Wikipedia IPA Help tables, and Wiktionary. Note that, although our Wiktionary data technically covers over 500 languages, fewer than 100 include more than 250 entries (Wiktionary train).

Phoneme inventories
A phoneme inventory is the set of phonemes used to pronounce a language, represented in IPA. Phoible provides 2156 phoneme inventories for 1674 languages. (Some languages have multiple inventories from different linguistic studies.)

Phoneme feature vectors
For each phoneme included in its phoneme inventories, Phoible provides information about 37 phonological features, such as whether the phoneme is nasal, consonantal, sonorant, or a tone. Each phoneme thus maps to a unique feature vector, with features expressed as +, -, or 0.

Named Entity Resources
For our language-to-language distance metric, it is useful to have written text in many languages. The most easily accessible source of this data is multilingual named entity (NE) resources.

Wikipedia IPA Help tables
To explain different languages' phonetic notations, Wikipedia users have created "IPA Help" pages,3 which provide tables of simple grapheme examples of a language's phonemes. For example, on the English page, the phoneme z has the examples "zoo" and "has." We automatically scrape these tables for 97 languages to create simple grapheme-phoneme rules.
Using the phon2phon distance metric and mapping technique described in Section 5, we clean each table by mapping its IPA phonemes to the language's Phoible phoneme inventory, if it exists. If it does not exist, we map the phonemes to valid Phoible phonemes and create a phoneme inventory for that language.

Wiktionary pronunciation dictionaries
Ironically, to train data-driven g2p models for high-resource languages, and to evaluate our low-resource g2p models, we require pronunciation dictionaries for many languages. A common and successful technique for obtaining this data (Schlippe et al., 2010; Schlippe et al., 2012a; Yao and Kondrak, 2015) is scraping Wiktionary, an open-source multilingual dictionary maintained by Wikimedia. We extract unique word-pronunciation pairs from the English, German, Greek, Japanese, Korean, and Russian sites of Wiktionary. (Each Wiktionary site, while written in its respective language, contains word entries in multiple languages.) Since Wiktionary data is very noisy, we apply length filtering as discussed by Schlippe et al. (2012b), as well as simple regular expression filters for HTML. We also map Wiktionary pronunciations to valid Phoible phonemes and language phoneme inventories, if they exist, as discussed in Section 5. This yields 658k word-pronunciation pairs for 531 languages. However, this data is not uniformly distributed across languages: German, English, and French account for 51% of the data.
We extract test and training data as follows: For each language with at least one word-pronunciation pair with a valid word (alphabetic, with at least 3 letters), we extract a test set of a maximum of 200 valid words. From the remaining data, for every language with 50 or more entries, we create a training set with the available data.
Ultimately, this yields a training set with 629k word-pronunciation pairs in 85 languages, and a test set with 26k pairs in 501 languages.

Phonetic Distance Metric
Automatically comparing pronunciations across languages is especially difficult in text form. Although two versions of the "sh" sound, "ʃ" and "ɕ," sound very similar to most people and very different from "m," to a machine all three characters seem equidistant.
Previous research (Özbal and Strapparava, 2012; Vu et al., 2014) has addressed this issue by matching exact phonemes by character or manually selecting comparison features; however, we are interested in an automatic metric covering all possible IPA phoneme pairs. We handle this problem by using Phoible's phoneme feature vectors to create phon2phon, a distance metric between IPA phonemes. In this section we also describe how we use this metric to clean open-source data and build phoneme-mapping models between languages.

phon2phon
As described in Section 4.1.2, each phoneme in Phoible maps to a unique feature vector; each feature value is +, -, or 0, representing whether a feature is present, not present, or not applicable. (Tones, for example, can never be syllabic or stressed.) We convert each feature vector into a bit representation by mapping each value to 3 bits: + to 110, - to 101, and 0 to 000. This captures the idea that the features + and - are more similar to each other than to 0.

Algorithm 1: A condensed version of our procedure for mapping scraped phoneme sets from Wikipedia and Wiktionary to Phoible language inventories. The full algorithm handles segmentation of the scraped pronunciation and heuristically promotes coverage of the Phoible inventory.
We then compute the normalized Hamming distance between every phoneme pair p_1, p_2 with feature vectors f_1, f_2 and feature vector length n as follows:

d(p_1, p_2) = (1/3n) Σ_{i=1}^{3n} 1[b_1[i] ≠ b_2[i]]

where b_1 and b_2 are the 3n-bit representations of f_1 and f_2.
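The bit encoding and distance can be sketched in a few lines of Python; for illustration, this assumes feature vectors are given as lists of '+', '-', and '0' strings (a simplification of Phoible's actual data format):

```python
# 3-bit encodings of the three feature values.
BITS = {'+': (1, 1, 0), '-': (1, 0, 1), '0': (0, 0, 0)}

def to_bits(features):
    """Concatenate the 3-bit encoding of each feature value."""
    return [b for value in features for b in BITS[value]]

def phon2phon(features1, features2):
    """Normalized Hamming distance between two phonemes' bit vectors."""
    b1, b2 = to_bits(features1), to_bits(features2)
    assert len(b1) == len(b2), "feature vectors must have equal length"
    return sum(x != y for x, y in zip(b1, b2)) / len(b1)
```

Identical feature vectors yield a distance of 0, and the distance is symmetric by construction.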

Data cleaning
We now combine phon2phon distances and Phoible phoneme inventories to map phonemes from scraped Wikipedia IPA help tables and Wiktionary pronunciation dictionaries to Phoible phonemes and inventories. We describe a condensed version of our procedure in Algorithm 1, and provide examples of cleaned Wiktionary output in Table 4.

Phoneme mapping models
Another application of phon2phon is to transform pronunciations in one language to another language's phoneme inventory. We can do this by creating a single-state weighted finite-state transducer (wFST) W for input language inventory I and output language inventory O, with an arc i:o weighted by phon2phon(i, o) for every i ∈ I and o ∈ O. W can then be used to map a pronunciation to a new language; this has the interesting effect of modeling accents by foreign-language speakers: think in English (pronounced "θ ɪ ŋ kʰ") becomes "s̪ ɛ ŋ k" in German; the capital city Dhaka (pronounced in Bengali with a voiced aspirated "ɖ ̤ ") becomes the unaspirated "d ae kʰ ae" in English.
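A greedy approximation of this mapping can be sketched as follows; substituting a per-phoneme nearest-neighbor search for the wFST's best path is an assumption that ignores context, but for a single-state transducer with independent arcs it selects the same minimum-cost output:

```python
def map_pronunciation(pron, target_inventory, phon2phon):
    """Map each phoneme in `pron` to its nearest neighbor in the target
    language's inventory, using a phon2phon distance function."""
    return [min(target_inventory, key=lambda q: phon2phon(p, q))
            for p in pron]
```

For example, an English pronunciation mapped through a German inventory would replace each English-only phoneme with its phonologically closest German phoneme.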

Language Distance Metric
Since we are interested in mapping high-resource languages to low-resource related languages, an important subtask is finding the related languages of a given language. The URIEL Typological Compendium (Littell et al., 2016) is an invaluable resource for this task. By using features from linguistic databases (including Phoible), URIEL provides 5 distance metrics between languages: genetic, geographic, composite (a weighted composite of genetic and geographic), syntactic, and phonetic. We extend URIEL by adding two additional metrics, providing averaged distances over all metrics, and adding additional information about resources. This creates lang2lang, a table which provides distances between and information about 2,790 languages.

Phoneme inventory distance
Although URIEL provides a distance metric between languages based on Phoible features, it only takes into account broad phonetic features, such as whether each language has voiced plosives. This can result in some non-intuitive results: based on this metric, there are almost 100 languages phonetically equivalent to the South Asian language Gujarati, among them Arawak and Chechen.
To provide a more fine-grained phonetic distance metric, we create a phoneme inventory distance metric using phon2phon. For each pair of language phoneme inventories L_1, L_2 in Phoible, we compute the total distance

d(L_1, L_2) = Σ_{p ∈ L_1} min_{q ∈ L_2} phon2phon(p, q)

and normalize by dividing by Σ_i d(L_1, L_i).
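One plausible reading of this computation can be sketched as follows; taking the aggregate as the sum, over L_1, of each phoneme's distance to its closest match in L_2 is an assumption for illustration:

```python
def inventory_distance(inv1, inv2, phon2phon):
    """Sum of each phoneme's distance to its closest match in inv2."""
    return sum(min(phon2phon(p, q) for q in inv2) for p in inv1)

def normalized_distances(target_inv, other_inventories, phon2phon):
    """Normalize the target's distance to each language by the total
    distance to all languages, so the distances sum to 1."""
    raw = {lang: inventory_distance(target_inv, inv, phon2phon)
           for lang, inv in other_inventories.items()}
    total = sum(raw.values()) or 1.0
    return {lang: d / total for lang, d in raw.items()}
```

Under this reading, languages sharing many near-identical phonemes with the target receive distances near zero, addressing the Gujarati/Arawak problem above.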

Script distance
Although Urdu is very similar to Hindi, its different alphabet and writing conventions would make it difficult to transfer an Urdu g2p model to Hindi. A better candidate language would be Nepali, which shares the Devanagari script, or even Bengali, which uses a similar South Asian script.
A metric comparing the character sets used by two languages is very useful for capturing this relationship.
We first use our multilingual named entity data to extract character sets for the 232 languages with more than 500 NE pairs; then, we note that Unicode character names are similar for linguistically related scripts. This is most notable in South Asian scripts: for example, the Bengali ক, Gujarati ક, and Hindi क have Unicode names BENGALI LETTER KA, GUJARATI LETTER KA, and DEVANAGARI LETTER KA, respectively.
We remove script, accent, and form identifiers from the Unicode names of all characters in our character sets, to create a set of reduced character names used across languages. Then we create a binary feature vector f for every language, with each feature indicating the language's use of a reduced character (like LETTER KA). The distance between two languages L_1, L_2 with feature vectors f_1, f_2 can then be computed with a spatial cosine distance:

d(L_1, L_2) = 1 - (f_1 · f_2) / (‖f_1‖ ‖f_2‖)
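A minimal sketch of this metric, using Python's unicodedata module; stripping only the leading script word from each character name is a simplification of the full reduction, which also removes accent and form identifiers:

```python
import unicodedata

def reduced_name(char):
    """Strip the leading script identifier from a Unicode character name,
    e.g. 'BENGALI LETTER KA' -> 'LETTER KA'."""
    name = unicodedata.name(char, '')
    return name.split(' ', 1)[1] if ' ' in name else name

def script_distance(chars1, chars2):
    """Cosine distance between binary feature vectors over reduced names."""
    f1 = {reduced_name(c) for c in chars1}
    f2 = {reduced_name(c) for c in chars2}
    if not f1 or not f2:
        return 1.0
    # For binary vectors, the dot product is the size of the intersection.
    return 1.0 - len(f1 & f2) / ((len(f1) * len(f2)) ** 0.5)
```

Under this sketch, the Devanagari and Bengali forms of KA reduce to the same feature, so scripts with parallel character sets come out close even though they share no code points.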

Resource information
Each entry in our lang2lang distance table also includes the following features for the second language: the number of named entities, whether it is in Europarl (Koehn, 2005), whether it has its own Wikipedia, whether it is primarily written in the same script as the first language, whether it has an IPA Help page, whether it is in our Wiktionary test set, and whether it is in our Wiktionary training set. Table 5 shows examples of the closest languages to English, Hindi, and Vietnamese, according to different lang2lang metrics.

Evaluation Metrics
The next two sections describe our high-resource and adapted g2p models. To evaluate these models, we compute the following metrics:
• % of words skipped: This shows the coverage of the g2p model. Some g2p models do not cover all character sequences. All other metrics are computed over non-skipped words.
• Word error rate (WER): The percent of incorrect 1-best pronunciations.
• Word error rate 100-best (WER 100): The percent of 100-best lists without the correct pronunciation.
• Phoneme error rate (PER): The percent of errors per phoneme. A PER of 15.0 indicates that, on average, a linguist would have to edit 15 out of 100 phonemes of the output.
We then average these metrics across all languages (weighting each language equally).
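The WER and PER computations can be sketched as follows, with hypothesis and reference pronunciations represented as phoneme lists (a straightforward implementation, not our exact evaluation code):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance over phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def wer(hyps, refs):
    """Percent of 1-best pronunciations that are not exactly correct."""
    wrong = sum(h != r for h, r in zip(hyps, refs))
    return 100.0 * wrong / len(refs)

def per(hyps, refs):
    """Percent of phoneme edits needed, relative to reference length."""
    edits = sum(edit_distance(h, r) for h, r in zip(hyps, refs))
    return 100.0 * edits / sum(len(r) for r in refs)
```

For instance, the hypothesis "t ɪ n" against the reference "tʰ ɪ n" counts as one word error but only one phoneme edit out of three.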

High Resource g2p Models
We now build and evaluate g2p models for the "high-resource" languages for which we have either IPA Help tables or sufficient training data from Wiktionary. Table 6 shows our evaluation of these models on Wiktionary test data, and Table 7 shows results for individual languages.

IPA Help models
We first use the rules scraped from Wikipedia's IPA Help pages to build rule-based g2p models. We build a wFST for each language, with a path for each rule g → p and weight w = 1/count(g).
This method prefers rules with longer grapheme segments; for example, for the word tin, the output "ʃ n" is preferred over the correct "tʰ ɪ n" because of the rule ti→ʃ. We build 97 IPA Help models, but have test data for only 91; some languages, like Mayan, do not have any Wiktionary entries.
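The rule weighting can be sketched as follows, where we take count(g) to be the number of scraped rules sharing the grapheme segment g (an assumption made for illustration):

```python
from collections import Counter

def rule_weights(rules):
    """Assign each g -> p rule the weight w = 1/count(g), where count(g)
    is the number of rules whose grapheme segment is g."""
    counts = Counter(g for g, _ in rules)
    return {(g, p): 1.0 / counts[g] for g, p in rules}
```

Each (grapheme, phoneme) pair then becomes a wFST path with the computed weight.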
As shown in Table 6, these rule-based models do not perform very well, suffering especially from a high percentage of skipped words. This is because IPA Help tables explain phonemes' relationships to graphemes, rather than vice versa. Thus, the English letter x is omitted, since its composite phonemes are better explained by other letters.

Wiktionary-trained models
We next build models for the 85 languages in our Wiktionary train data set, using the wFST-based Phonetisaurus (Novak et al., 2011) and MITLM (Hsu and Glass, 2008), as described by Novak et al. (2012). We use a maximum of 10k pairs of training data, a 7-gram language model, and 50 iterations of EM.
These data-driven models outperform IPA Help models by a considerable amount, achieving a WER of 44.69 and PER of 15.06 averaged across all 85 languages. Restricting data to 2.5k or more training examples boosts results to a WER of 28.02 and PER of 7.20, but creates models for only 29 languages.
However, in some languages good results are obtained with very limited data; Figure 2 shows the varying quality across languages and data availability.

Unioned models
We also use our rule-based IPA Help tables to improve Wiktionary model performance. We accomplish this very simply, by prepending IPA Help rules like the German sch→ʃ to the Wiktionary training data as word-pronunciation pairs, then running the Phonetisaurus pipeline.
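This concatenation is trivial to sketch; for illustration, we assume rules arrive as (grapheme, phoneme-string) pairs and dictionary entries as (word, phoneme-list) pairs:

```python
def unioned_training_data(ipa_help_rules, wiktionary_pairs):
    """Prepend grapheme -> phoneme rules to the dictionary as if they
    were tiny word-pronunciation pairs, ready for g2p training."""
    rule_pairs = [(g, p.split()) for g, p in ipa_help_rules]
    return rule_pairs + list(wiktionary_pairs)
```

The combined list is then handed unchanged to the Phonetisaurus training pipeline.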
Overall, the unioned g2p models outperform both the IPA Help and Wiktionary models; however, as shown in Table 7, this improvement is not consistent across languages.

Table 7: WER scores for Bengali, Tagalog, Turkish, and German models. Unioned models with IPA Help rules tend to perform better than Wiktionary-only models, but not consistently.

Adapted g2p Models
Having created a set of high-resource models and our phon2phon and lang2lang metrics, we now explore different methods for adapting high-resource models and data for related low-resource languages. For comparable results, we restrict the set of high-resource languages to those covered by both our IPA Help and Wiktionary data.

No mapping
The simplest experiment is to run our g2p models on related low-resource languages, without adaptation. For each language l in our test set, we determine the top high-resource related languages h 1,2,... according to the lang2lang averaged metric that have both IPA Help and Wiktionary data and the same script, not including the language itself. For IPA Help models, we choose the 3 most related languages h 1,2,3 and build a g2p model from their combined g-p rules. For Wiktionary and unioned models, we compile 5k words from the closest languages h 1,2,... such that each h contributes no more than one third of the data (adding IPA Help rules for unioned models) and train a model from the combined data.
For each test word-pronunciation pair, we trivially map the word's letters to the characters used in h 1,2,... by removing accents where necessary; we then use the high-resource g2p model to produce a pronunciation for the word. For example, our Czech IPA Help model uses a model built from g-p rules from Serbo-Croatian, Polish, and Slovenian; the Wiktionary and unioned models use data and rules from these languages and Latin as well.
This expands 56 g2p models (the languages covered by both IPA Help and Wiktionary models) to models for 211 languages. However, as shown in Table 8, results are very poor, with a very high WER of 92% using the unioned models and a PER of more than 50%. Interestingly, IPA Help models perform better than the unioned models, but this is primarily due to their high skip rate.

Output mapping
We next attempt to improve these results by creating a wFST that maps phonemes from the inventories of h 1,2... to l (as described in Section 5.3). As shown in Figure 1a, by chaining this wFST to h 1,2... 's g2p model, we map the g2p model's output phonemes to the phonemes used by l. In each base model type, this process considerably improves accuracy over the no mapping approach; however, the IPA Help skip rate increases (Table 8).

Training data mapping
We now build g2p models for l by creating synthetic data for the Wiktionary and unioned models, as in Figure 1b. After compiling word-pronunciation pairs and IPA Help g-p rules from the closest languages h 1,2,... , we map the pronunciations to l and use the new pronunciations as training data. We again create unioned models by adding the related languages' IPA Help rules to the training data.
This method performs slightly worse in accuracy than output mapping, with a WER of 87%, but has a much lower skip rate of 7%.

Rescripting
Adaptation methods thus far have required that h and l share a script. However, this excludes languages with related scripts, like Hindi and Bengali. We replicate our data mapping experiment, but now allow related languages h 1,2,... with different scripts from l but a script distance of less than 0.2. We then build a simple "rescripting" table based on matching Unicode character names; we can then map not only h's pronunciations to l's phoneme set, but also h's word to l's script.
Although performance is relatively poor, rescripting adds 10 new languages, including Telugu, Gujarati, and Marwari. Table 8 shows evaluation metrics for all adaptation methods. We also show results using all 85 Wiktionary models (using unioned where IPA Help is available) and rescripting, which increases the total number of languages to 229. Table 9 provides examples of output with different languages.

Discussion
In general, mapping combined with IPA Help rules in unioned models provides the best results.
Training data mapping achieves scores similar to output mapping's, as well as a lower skip rate. Word skipping remains problematic, but could be lowered by collecting g-p rules for the low-resource language.
Although the adapted g2p models make many individual phonetic errors, they nevertheless capture overall pronunciation conventions, without requiring language-specific data or rules. Specific points of failure include rules that do not exist in related languages (e.g., the silent "e" at the end of "fuse" and the conversion of "d̪ ʃ" to "ɡ" in Egyptian Arabic), mistakes in phoneme mapping, and overall "pronounceability" of the output.

Limitations
Although our adaptation strategies are flexible, several limitations prevent us from building a g2p model for any language. If there is not enough information about the language, our lang2lang table will not be able to provide related high-resource languages. Additionally, if the language's script is not closely related to another language's and thus cannot be rescripted (as with Thai and Armenian), we are not able to adapt related g2p data or models.

Conclusion
Using a large multilingual pronunciation dictionary from Wiktionary and rule tables from Wikipedia, we build high-resource g2p models and show that adding g-p rules as training data can improve g2p performance. We then leverage lang2lang distance metrics and phon2phon phoneme distances to adapt g2p resources from high-resource languages to 229 related low-resource languages. Our experiments show that adapting training data for low-resource languages outperforms adapting output. To our knowledge, these are the most broadly multilingual g2p experiments to date.
With this publication, we release a number of resources to the NLP community: a large multilingual Wiktionary pronunciation dictionary, scraped Wikipedia IPA Help tables, compiled named entity resources (including a multilingual gazetteer), and our phon2phon and lang2lang distance tables.4 Future directions for this work include further improving the number and quality of g2p models, as well as performing external evaluations of the models in speech- and text-processing tasks. We plan to use the presented data and methods for other areas of multilingual natural language processing.