The SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion

We describe the design and findings of the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. Participants were asked to submit systems which take in a sequence of graphemes in a given language as input, then output a sequence of phonemes representing the pronunciation of that grapheme sequence. Nine teams submitted a total of 23 systems, at best achieving an 18% relative reduction in word error rate (macro-averaged over languages), versus strong neural sequence-to-sequence baselines. To facilitate error analysis, we publicly release the complete outputs for all systems, a first for the SIGMORPHON workshop.


Introduction
Speech technologies such as automatic speech recognition and text-to-speech synthesis require mappings between written words and their pronunciations. Even recent attempts to do away with explicit pronunciation models via "end-to-end" systems (e.g., Watts et al. 2013, Chan et al. 2016, Sotelo et al. 2017, Chiu et al. 2018, Pino et al. 2019) must induce an implicit mapping of this sort. For open-vocabulary applications, these mappings must generalize to unseen words, and so must be expressed as mappings between sequences of graphemes (i.e., glyphs) and phonemes or phones (i.e., sounds). For some languages, this mapping is sufficiently consistent that a literate, linguistically sophisticated speaker can simply enumerate the necessary rules; this sequence of rules can then be compiled into a finite-state transducer (e.g., Sproat 1996, Black et al. 1998). However, rule-based systems require linguistic expertise to develop and maintain, and may be brittle or inaccurate. Therefore, modern speech engines usually treat grapheme-to-phoneme conversion as a machine learning problem, either using generative models expressed as weighted finite-state transducers (e.g., Taylor 2005, Bisani and Ney 2008, Wu et al. 2014, Novak et al. 2016) or discriminative models based on conditional random fields (Lehnen et al. 2013), recurrent neural networks (e.g., Rao et al. 2015, Yao and Zweig 2015, van Esch et al. 2016, Lee et al. 2020), or transformers (Yolchuyeva et al. 2019).
While the grapheme-to-phoneme conversion (or G2P) task is crucial to speech technology, the vast majority of published research focuses on English or a few other highly-resourced, globally hegemonic languages for which free pronunciation dictionaries are available. One exception, a recent study by van Esch et al. (2016), compares naïve rule-based systems and neural network-based sequence-to-sequence models for 20 languages; unfortunately, the data used in this study is proprietary. Like many other types of language resources, pronunciation dictionaries are expensive to create and maintain, and until recently, free high-quality dictionaries were only available for a small number of languages.
This limitation to a handful of languages is unfortunate because, as we discuss below, writing systems are almost as diverse as languages themselves. Therefore, we present a multilingual grapheme-to-phoneme conversion task with data sets, evaluation metrics, and strong baselines. In this we are aided by the recent release of WikiPron (Lee et al. 2020), a freely available collection of pronunciation dictionaries. The resulting task, the first of its kind, included data from fifteen languages and scripts, and received 23 submissions from nine teams.

Data
Fifteen language/script pairs were chosen to cover a wide variety of script types. Ten of the scripts are alphabetic systems known to descend from Phoenician (and ultimately from Egyptian hieroglyphs); of these, seven are variants of the Latin script. Two others, the Armenian aybuben and the Georgian mkhedruli, are alphabetic scripts of unknown origin, but may ultimately be modeled on Greek (Sanjian 1996). The devanāgarī script, used to write Hindi, is an alphasyllabary, in which most glyphs (known traditionally as akṣara) denote consonant and consonant-vowel sequences; vowels (or their absence) are primarily indicated with diacritics. It too is thought to ultimately descend from Phoenician. Hiragana, one of several scripts used to write Japanese, is a syllabary, in which most glyphs denote entire syllables; the glyphs themselves are derived from Chinese characters. Like hiragana, the Korean hangul script is also a syllabary. It may have been inspired by 'phags-pa, a Tibetan alphabet which is itself a distant cousin of devanāgarī (Ledyard 1966).
It is important to note that languages, and the scripts used to write them, differ enormously in their affordances for grapheme-to-phoneme conversion. Writing systems are, at their core, linguistic analyses, albeit sometimes quite naïve ones, and (as argued in DeFrancis 1989) explicitly encode details of the phonological and phonetic structure of the language they are used to write. Still, the exact details of these mappings can vary greatly between even closely related languages and/or scripts. Whereas related languages may retain telltale grammatical features across millennia, dozens of languages have abruptly switched from one script to another in just the last century, usually in response to political, rather than linguistic, concerns. It is thus unsurprising that Bjerva and Augenstein (2018) find that grapheme embeddings induced by training G2P systems are poorly correlated with gross phonological typology, and that experiments with "polyglot" G2P models (e.g., Peters et al. 2017) have produced equivocal results.
While we did not pay particular attention to language families when selecting languages, we note that nine of the languages are Indo-European (though no two are closely related) and that none of the remaining six (Adyghe, Georgian, Hungarian, Japanese, Korean, and Vietnamese) are known to be genetically related to each other.

Methods
The primary data for the shared task is derived from WikiPron (Lee et al. 2020), a massively multilingual resource of grapheme-phoneme pairs extracted from Wiktionary, an online multilingual dictionary. Depending on language and script, these pronunciations may be manually entered by human volunteers-usually working from language-specific pronunciation guidelines-or generated using server-side scripting routines; some languages (e.g., Bulgarian and French) use a mixture of the two approaches. WikiPron is configured to apply case-folding where appropriate. It removes stress and syllable boundary markers and segments pronunciation strings-encoded in the International Phonetic Alphabet-using the segments library (Moran and Cysouw 2018).
For this task, words with multiple pronunciations-both homographs and free pronunciation variants-were excluded, since pronunciations for such words are often selected by a rather different procedure: they are chosen from a small, predetermined set of possible pronunciations using classifiers conditioned on local context (e.g., Gorman et al. 2018).
Training and development data for ten languages, the "development" languages, was released at the start of the task; equivalent data for the five "surprise" languages was released one week before the start of the evaluation phase. Table 1 provides sample training data pairs for the development and surprise languages.
As there is considerable variation in the number of available examples for any given language, each language's data was downsampled to 4,500 examples. We regard this as a "medium-resource" setting for this task; these data sets are, for instance, several orders of magnitude smaller than the proprietary G2P data used by van Esch et al. (2016). Following similar procedures in other shared tasks (e.g., Cotterell et al. 2017), words were sampled according to their frequency in the largest available Wortschatz corpus (Goldhahn et al. 2012); no such corpus was available for Adyghe, so uniform sampling was used for this language.
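Frequency-weighted sampling without replacement can be sketched, for example, with the Efraimidis-Spirakis key method. This is an illustrative reconstruction, not the task's actual sampling code; the function name and the fallback count of 1 for unlisted words are our own assumptions.

```python
import random

def downsample(words, freq, k=4500, seed=0):
    """Sample k distinct words with probability proportional to corpus
    frequency. `freq` is a hypothetical word -> count mapping; words
    absent from the frequency list get a count of 1. Uses the
    Efraimidis-Spirakis trick: key = u ** (1/weight), keep the top k."""
    rng = random.Random(seed)
    key = {w: rng.random() ** (1.0 / freq.get(w, 1)) for w in words}
    return sorted(words, key=key.get, reverse=True)[:k]
```

With a uniform frequency table (or an empty one), this degenerates to uniform sampling, matching the Adyghe fallback described above.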
The downsampled data was then randomly split into training (80%; 3,600 examples), development (10%; 450 examples), and testing (10%; 450 examples) shards. For some languages, Wiktionary contains pronunciations for both lemmas (i.e., headwords, citation forms) and inflectional variants; for others, pronunciations are only available for lemmas. We hypothesized that cases where one inflectional variant of a lemma is present in the training data and another in the test data, as might occur if the data was split totally at random, would make the overall task somewhat easier. To forestall this possibility, the splitting procedure was constrained so that all inflectional variants of any given lemma, according to the UniMorph (Kirov et al. 2018) paradigm tables, also extracted from Wiktionary, are limited to a single shard. For example, since the French word acteur 'actor' occurs in the training shard, so must its plural form acteurs. This additional constraint was applied to all languages but Japanese and Vietnamese, for which no UniMorph data was available. We note that Wiktionary does not generally provide pronunciations for inflectional variants in Japanese, and that Vietnamese is a highly isolating language with no discernible system of inflection (Noyer 1998), so this is unlikely to have introduced bias.
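The constrained splitting can be sketched as follows, assuming a lemma lookup function built from paradigm tables. The function name, the greedy shard-filling strategy, and the lookup interface are all illustrative, not the task's actual implementation.

```python
import random

def lemma_constrained_split(words, lemma_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split words into train/dev/test shards so that all inflectional
    variants of a lemma land in the same shard. `lemma_of` is a
    hypothetical lookup (e.g., built from UniMorph paradigm tables)."""
    # Group words by lemma so variants like acteur/acteurs stay together.
    groups = {}
    for w in words:
        groups.setdefault(lemma_of(w), []).append(w)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    targets = [f * len(words) for f in fractions]
    shards = [[] for _ in fractions]
    i = 0
    for k in keys:
        # Advance to the next shard once this one reaches its target size.
        while i < len(shards) - 1 and len(shards[i]) >= targets[i]:
            i += 1
        shards[i].extend(groups[k])
    return shards
```

Because whole lemma groups are assigned at once, shard sizes are only approximately 80/10/10 when paradigms are large, which is the price of the consistency constraint.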

Evaluation
The primary metric for this task was word error rate (WER); we also report phone error rate (PER).
WER This is the percentage of words for which the hypothesized transcription sequence is not identical to the gold reference transcription; lower WER indicates better performance. Following common practice in speech research, we multiply the WER by 100 and display it as a percentage. We choose this as the primary metric for the shared task because we hypothesize that any G2P error, no matter how small, will result in a substantial degradation in subjective quality for downstream speech applications.
PER This is a more forgiving measure, quantifying the normalized distance (i.e., in number of insertions, deletions, and substitutions) between the predicted and reference transcriptions. It is computed by summing the minimum edit distance (computed with the Wagner and Fischer 1974 algorithm) between the predicted and reference transcriptions, and dividing by the sum of the reference transcription lengths. That is,

    PER := 100 × (∑_{i=1}^{n} edits(p_i, r_i)) / (∑_{i=1}^{n} |r_i|)

where p_i is the i-th predicted pronunciation sequence, r_i is the corresponding reference sequence, and edits(p_i, r_i) is the Levenshtein distance between the two. Once again, we multiply by 100, though strictly speaking PER is not a true percentage because it can hypothetically exceed 100. As with WER, lower PER indicates better performance.
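For concreteness, both metrics can be sketched in a few lines of Python. This is an illustrative reimplementation, not the official evaluation script; pronunciations are assumed to be sequences of phone strings.

```python
def edit_distance(p, r):
    """Levenshtein distance via the Wagner-Fischer dynamic program,
    using a single rolling row."""
    d = list(range(len(r) + 1))
    for i, pi in enumerate(p, 1):
        prev, d[0] = d[0], i
        for j, rj in enumerate(r, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (pi != rj))   # substitution
    return d[-1]

def wer(pairs):
    """Percentage of (prediction, reference) pairs that do not match exactly."""
    return 100 * sum(p != r for p, r in pairs) / len(pairs)

def per(pairs):
    """Total edit distance divided by total reference length, times 100."""
    return (100 * sum(edit_distance(p, r) for p, r in pairs)
            / sum(len(r) for _, r in pairs))
```

Note that `wer` counts a word as wrong no matter how close the prediction is, whereas `per` gives partial credit for near misses, which is why the two metrics can rank systems differently.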
Participants were provided with two evaluation scripts: one which computes the two metrics for a single language, and another which macro-averages the metrics across all languages.
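Macro-averaging here means an unweighted mean over languages, as in the following sketch (illustrative, not the official script):

```python
def macro_average(scores_by_language):
    """Unweighted mean over per-language scores, so every language
    counts equally regardless of its test-set size."""
    return sum(scores_by_language.values()) / len(scores_by_language)
```

Since every language contributes 450 test examples, the macro- and micro-averages coincide for this task, but the macro formulation makes the equal weighting explicit.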

Baselines
Three baselines were made available at the start of the task. To aid reproducibility, participants were also provided with a Conda "environment", a schematic that allows users to reconstruct the exact software environment used to train and evaluate the baselines. Several submissions made use of the baselines for data augmentation or ensemble construction. We make these baseline implementations available under the task1/baselines subdirectory of the shared task repository.

Pair n-gram model The first baseline is a pair n-gram model, which can be thought of as a finite-state approximation of a hidden Markov model with states representing graphemes and emissions representing output phones. The model is quite similar to the Phonetisaurus toolkit (Novak et al. 2016), but here is implemented using the OpenGrm toolkit (Roark et al. 2012, Gorman 2016); see Lee et al. 2020 for a full description. The sole hyperparameter for this model, the Markov model order, is tuned separately for each language using the development set.
Encoder-decoder LSTM The second baseline is a neural network sequence-to-sequence model consisting of a single-layer bidirectional LSTM encoder and a single-layer unidirectional LSTM decoder connected using an attention mechanism (Luong et al. 2015). It is implemented using the fairseq library (Ott et al. 2019). LSTM-based encoder-decoder models have been claimed to outperform pair n-gram G2P models, both in monolingual (e.g., Rao et al. 2015, Yao and Zweig 2015) and multilingual (e.g., van Esch et al. 2016, Lee et al. 2020) evaluations, though these prior studies use substantially more training data than is available in this task. During training, we perform 4,000 updates to minimize label-smoothed cross-entropy (Szegedy et al. 2016) with a smoothing rate of .1. We use the Adam optimizer (Kingma and Ba 2015) with a learning rate of α = .001 and momentum coefficients β = (.9, .98), and clip norms exceeding 1.0. We use the development set to tune, for each language, batch size (256, 512, 1024), dropout (.1, .2, .3), and the size of the encoder and decoder modules. A module is said to be "small" when it has a 128-dimension embedding layer and a 512-unit hidden layer, and "large" when it has a 256-dimension embedding layer and a 1024-unit hidden layer. In both cases, the decoder shares a single embedding layer for both inputs and outputs. Altogether, this defines a 36-element hyperparameter grid. During tuning, we employ a form of early stopping; we save a checkpoint every 5 epochs, and then use the checkpoint that achieves the lowest WER on the development set. We use a beam of size 5 for decoding.
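The tuning grid described above (3 batch sizes × 3 dropout rates × 2 encoder sizes × 2 decoder sizes = 36 configurations) can be enumerated as follows; the dictionary keys are illustrative labels, not fairseq option names:

```python
from itertools import product

# (embedding dimension, hidden units) for the two module sizes.
SIZES = {"small": (128, 512), "large": (256, 1024)}

# 3 batch sizes x 3 dropout rates x 2 encoder sizes x 2 decoder sizes.
grid = [
    {"batch_size": b, "dropout": p, "encoder": SIZES[e], "decoder": SIZES[d]}
    for b, p, e, d in product((256, 512, 1024), (0.1, 0.2, 0.3), SIZES, SIZES)
]
```

Each of the 36 configurations is then trained per language and scored by development-set WER under the early-stopping scheme described above.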

Encoder-decoder transformer The third baseline is a transformer, a neural sequence-to-sequence model that replaces hidden-layer recurrence with layers of multi-head self-attention (Vaswani et al. 2017). Once again, it is implemented using fairseq. Here the model consists of four encoder layers and four decoder layers, both with pre-layer normalization, tuned for character-level tasks (Wu et al. 2020). The hyperparameter grid, tuning procedures, and beam size are the same as for the LSTM model above, except that the learning rate is decayed on an inverse square-root schedule after a 1,000-update linear warm-up period. While most participants chose to compare their results to the transformer rather than the LSTM in their system description papers, the transformer was outperformed by the LSTM baseline in most settings given the hyperparameter exploration budget.
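A common formulation of the warm-up plus inverse square-root schedule is sketched below; the function name is our own, and the baseline's exact implementation (via fairseq) may differ in details such as the handling of step zero:

```python
import math

def inverse_sqrt_lr(step, peak_lr=1e-3, warmup=1000):
    """Linear warm-up to peak_lr over `warmup` updates, then decay
    proportional to 1/sqrt(step)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * math.sqrt(warmup / step)
```

The schedule reaches its peak exactly at the end of warm-up and has decayed to half the peak rate by four times the warm-up length.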

System descriptions
Below we provide brief descriptions of submissions to the shared task.
CLUZH The Institute of Computational Linguistics at the University of Zurich submitted a single system (Makarov and Clematide 2020) extending earlier work (Makarov and Clematide 2018) on imitation learning-based transducers that output a sequence of edit actions rather than the target string itself. To adapt to the G2P task, where the input (grapheme) and output (phone) vocabularies are largely disjoint, they add a substitution action. The costs of the edit actions are drawn from a weighted finite-state transducer (WFST). The authors suggest that external lexical information such as part of speech, etymology (particularly borrowing), and morphological segmentation would further improve systems. During preprocessing, they decompose Korean hangul characters into their constituent jamo, each corresponding roughly to a single phoneme.
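Hangul decomposition of this kind can be performed with Unicode canonical (NFD) normalization, which splits each precomposed syllable block into its conjoining jamo. This is only an illustrative sketch; the teams' actual preprocessing pipelines may differ.

```python
import unicodedata

def decompose_hangul(text):
    """Split precomposed hangul syllable blocks into conjoining jamo,
    each of which corresponds roughly to one phoneme."""
    return unicodedata.normalize("NFD", text)
```

The mapping is lossless: NFC normalization recomposes the jamo into the original syllable blocks, so predictions can be converted back if needed.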
CU One team from the University of Colorado Boulder (Prabhu and Kann 2020) ensembled several transformer models created with different random seeds using majority voting. They also experiment with a form of multi-task learning: they train a "bidirectional" model to do both grapheme-to-phoneme and phoneme-to-grapheme prediction.
CUZ A second team from the University of Colorado Boulder (Ryan and Hulden 2020) uses a "slice-and-shuffle" data augmentation strategy. First, they perform character-level one-to-one alignment between graphemes and phonemes. Then they concatenate frequent subsequence pairs to each other to create nonce training examples. Their submission is an LSTM model with a bidirectional encoder trained on this augmented data. While they also developed transformer models, these did not finish training in time for submission. Results for their transformer system, not reported here, are included in their system description.

DeepSPIN Researchers at the Instituto Superior Técnico and Unbabel produced four submissions (Peters and Martins 2020) based on sparse attention models. Each submission consists of a single multilingual neural model in which separate learned "language embeddings" are concatenated to all encoder and decoder states, rather than prepending a language-identification token to the input sequence. Their submissions use either LSTM- or transformer-based encoder-decoder sequence-to-sequence models, with different values of a hyperparameter enforcing sparsity in the final layer (Peters et al. 2019). Like CLUZH, they preprocess Korean hangul characters, decomposing them into constituent jamo, each corresponding roughly to a single phoneme.
IMS A single submission from the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart (Yu et al. 2020) uses self-training (Yarowsky 1995) and ensembles of the baseline models. The components of the ensemble are selected using a genetic algorithm. They report that their data augmentation does not affect performance substantially, except in a simulated low-resource setting with 200 training examples. They romanize Japanese and Korean text as a preprocessing step, and they use external word frequency lists.
NSU The Novosibirsk State University team did not provide a system description.
UA The submissions from the University of Alberta (Hauer et al. 2020) use either a non-neural discriminative string transduction model (DTLM; Nicolai et al. 2018) or transformers. They leverage both grapheme-to-phoneme and phoneme-to-grapheme models to filter candidates for data augmentation, enforcing a cyclic consistency constraint. They further show strong performance in a simulated low-resource scenario with 100 training examples. They note that the DTLM system is much faster to train than transformer models. Their six submissions vary the amount of training data and use either DTLM, a transformer, or a transformer with data augmentation.
UBCNLP The University of British Columbia submitted two systems (Vesik et al. 2020). One is a multilingual model akin to Peters et al. (2017), in which a language-identification token is prepended to the input sequence. They also ensemble multiple checkpoints. Their second submission adds self-training on Wikipedia text; they report that this data augmentation strategy does not improve scores.
UZH For all three of their submissions, the team from the Department of Informatics at the University of Zurich (ElSaadany and Suter 2020) used a single set of encoder-decoder parameters shared across all languages. UZH-1 is a large transformer model with large embedding, hidden layers, and batches, with a high dropout probability. UZH-2 augments this model with WikiPron data for six other languages. UZH-3 is an ensemble of the previous two models which selects from the predictions of the two component models using whichever model's prediction has a higher posterior probability. The ensemble outperformed the component models for most languages. During preprocessing they also decompose Korean hangul characters into their constituent jamo; they report this results in a 46% relative word error reduction.

Results
We now review baseline and submission results.

Baseline results
Baseline results are shown in Table 2. The encoder-decoder LSTM (Lee et al. 2020) performed best for nine of the fifteen languages; the transformer was strongest for four languages, and for the remaining two, Modern Greek and Hungarian, there was a virtual tie between the two neural network baselines. The pair n-gram model was outperformed by the neural baselines on all languages, and by 10 or more WER points on Bulgarian, Georgian, and Korean. This suggests that this model is no longer competitive with powerful discriminative neural methods, at least in this medium-resource G2P task.
While this task was not designed explicitly to compare LSTM and transformer sequence-to-sequence models, it does suggest an advantage for LSTM models. However, we speculate that additional training data, or a more generous hyperparameter tuning budget, might favor transformer models. Indeed, anticipating the results below, the one team that directly compared transformer and LSTM systems, DeepSPIN, achieved the third-best submission overall using a transformer.
We also note that for four languages, the baseline system that achieves the best WER does not achieve the best PER, though the two metrics produce the same one-best ranking for the remaining eleven languages.

Table 3 shows, for each language, the system or systems that achieved the best WER, as well as the best baseline WER. For all fifteen languages, at least one team outperformed the baselines, sometimes quite substantially. Six of the nine teams achieved the best WER on at least one language. More detailed per-language, per-submission results are available online.

Table 4 gives the macro-averaged WER and PER for the three baselines, and for the best overall submission from each team. As expected, the strongest baseline is the LSTM model. Across all submissions, the IMS team achieves both the lowest average WER, a 3% absolute (18% relative) word error reduction over the LSTM baseline, and the lowest overall PER, a 1% absolute (31% relative) phone error reduction over the LSTM baseline. The CLUZH and DeepSPIN-3 submissions achieve second and third place, respectively; the CU, UBCNLP, and UZH teams also submitted systems that outperform the LSTM baseline's WER.

Discussion
When this task was initially proposed, there was some concern that the submissions, if not the baselines themselves, would easily achieve perfect or near-perfect performance on some languages. This was not the case. Even on the "easiest" language, the best submission has a 0.89% WER, and for three languages, no submission achieves an error rate below 20%.
At the same time, we observe a large range of error rates across languages. It is tempting to speculate that word and/or phone error rates actually represent differences in difficulty. Insofar as this is correct, we can begin to ask what makes a language "hard to pronounce", much like how Mielke et al. (2019) ask what makes a language "hard to language-model".
One thing that may make a language hard to pronounce is data sparsity. Consider the case of Korean, which has by far the highest baseline error rate of all fifteen languages. Three features of Korean and of hangul conspire to make this task particularly challenging. First, hangul is a syllabary, and therefore necessarily has a much larger graphemic inventory than an alphabet or alphasyllabary: a whopping 889 unique hangul characters appear across the 4,500 words used for this task. Second, hangul is a relatively deep or abstract orthography (in the sense of Rogers 2005); it operates at a roughly morphophonemic level, whereas Lithuanian and Hungarian orthography, for example, is roughly phonemic. Third, Korean has many phonological processes that operate across syllable boundaries. Since the effect of these processes is not indicated by the highly abstract, morphophonemic orthography, they can only be learned by observing the targeted syllable bigrams during training. We apply the error analysis of Lee et al. (2020) to the LSTM baseline and observe errors caused by underapplication of these coda-onset cluster rules. It is unsurprising, then, that several submissions achieved substantial gains by either romanizing hangul or decomposing it into its constituent jamo during preprocessing, since both techniques reduce the size of the input vocabulary.
The results suggest that G2P technologies are not yet language-agnostic (in the sense of Bender 2009). However, some caution is in order here: inter-language differences in word error rate may also reflect inconsistencies in the WikiPron data itself. During the task, participants reported apparent transcription inconsistencies in the Bulgarian, Georgian, and Lithuanian Wiktionary data. If these inconsistencies are due to overly-narrow allophonic transcriptions, one might suspect that they can be learned by sufficiently sophisticated sequence-to-sequence models. However, if they represent free variation, inconsistent application of the transcription guidelines, or even typographical errors, they inflate error rates and increase the risk of overfitting. In response to this, we have begun development of quality assurance software for WikiPron, including a phone-based whitelisting approach. We anticipate that manual error analysis will reveal errors in the Wiktionary data, similar to the large number of test data errors identified for the 2017 CoNLL-SIGMORPHON morphological inflection task. To encourage this sort of error analysis, for the first time in the history of the SIGMORPHON workshop, we publicly release the predictions made by all 23 submissions. Finally, we plan to apply large-scale consistency-enforcing edits upstream, i.e., to Wiktionary itself.
While the baselines are somewhat naïve and lack the sophisticated data augmentation and ensembling techniques used by the top submissions, we were pleasantly surprised by the substantial reductions in error achieved by the participating teams. As mentioned above, the best submissions handily outperform the baselines for all languages. Interestingly, this is true not only for the most challenging languages, like Korean, where the best submission achieves a 45% relative word error reduction over the baseline, but also for Vietnamese, the language with the lowest baseline WER; there, the best submission achieves an impressive 81% relative word error reduction.
As mentioned above, top submissions make use of techniques such as preprocessing, data augmentation, ensembling, multi-task learning (e.g., phoneme-to-grapheme conversion), and self-training. These techniques are commonly used in shared tasks and are essentially task-agnostic. However, we were surprised that few teams made use of task-specific resources such as the PHOIBLE phonemic inventories and feature specifications (Moran and McCloy 2019) or rule-based G2P systems like Epitran (Mortensen et al. 2018). Nor do any of the submissions make use of morphological analyzers or lexicons, which were found to be helpful in earlier work (e.g., Coker et al. 1990, Demberg et al. 2007), or of unsupervised tokenization techniques such as byte-pair encoding (Schuster and Nakajima 2012). We speculate that such resources might further improve performance. Finally, we note that several participants expressed interest in a low-resource version of this challenge, and two teams simulated a low-resource setting. We leave the design of a low-resource task for future work.

Conclusion
SIGMORPHON, under whose auspices this task was conducted, was once known as SIGPHON and was primarily focused on computational phonetics and phonology.
The shared task on multilingual grapheme-to-phoneme conversion, a uniquely phonological problem, thus represents something of a return to the roots of this special interest group. In this task, nine teams submitted 23 G2P systems for fifteen languages and achieved substantial improvements over the provided baselines. The results suggest many directions for improving G2P systems and the pronunciation dictionaries used to train them.