One Model to Pronounce Them All: Multilingual Grapheme-to-Phoneme Conversion With a Transformer Ensemble

The task of grapheme-to-phoneme (G2P) conversion is important for both speech recognition and synthesis. As with other speech and language processing tasks, learning G2P models is challenging when only small-sized training data are available. We describe a simple approach of exploiting model ensembles, based on multilingual Transformers and self-training, to develop a highly effective G2P solution for 15 languages. Our models were developed as part of our participation in SIGMORPHON 2020 Shared Task 1, focused on G2P. Our best models achieve 14.99% word error rate (WER) and 3.30% phoneme error rate (PER), a sizeable improvement over the competitive shared task baselines.


Introduction
Speech technologies are becoming increasingly pervasive in our lives. The task of grapheme-to-phoneme (G2P) conversion is an important component of both speech recognition and synthesis. In G2P conversion, sequences of graphemes (the symbols used to write words) are mapped to corresponding phonemes (pronunciation symbols, e.g., symbols of the International Phonetic Alphabet). Members of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON) have proposed a G2P shared task (SIGMORPHON 2020 Shared Task 1) 1 involving multiple languages. In this paper, we describe our submissions to the shared task. The organizers provide an overview of the task and submitted systems in the shared task overview paper (this volume). 1 The shared task webpage is accessible at: https://sigmorphon.github.io/sharedtasks/2020/task1.
The task was introduced with data from 10 languages, with an additional 5 'surprise' languages released during the task timeline. Our goal was to develop an effective system based on modern deep learning methods. However, deep learning technologies work best with sufficiently large training data. Hence, a clear challenge was the limited size of the shared task training data for each of the 15 individual languages. To ease this bottleneck, we decided to view the task through a multilingual machine translation lens, building a single model that maps from input to output across all the languages simultaneously. Specifically, we hypothesized that a multilingual model would allow for shared representations across the various languages that may be more powerful than the individual representations of monolingual models. Abundant evidence now exists for approaching machine translation tasks from a multilingual perspective (Johnson et al., 2017a; Dong et al., 2015; Firat et al., 2016), which inspired our choice.
In order to make use of unlabeled data, we also explore a straightforward self-training approach. In particular, we employ our trained models to convert sequences of multilingual unlabeled graphemes, taken from Wikipedia data, into multilingual phonemes. We then select the sequences of phonemes predicted above a certain confidence threshold to augment the shared task training data, and re-train our models from scratch on the larger (gold and silver) training data. Our models are based on the Transformer architecture, which exploits effective self-attention. We show that both our multilingual model and its self-trained variation outperform the competitive baseline monolingual models provided by the task organizers. Ultimately, we demonstrate how our simple modeling choices enable us to provide an effective solution to the problem in spite of the low-resource challenge. Intrinsically, our approach also enjoys the simplicity of a single model rather than 15 different models.
The rest of the paper is organized as follows: Section 2 is a description of the shared task data, evaluation metrics, and baselines. Section 3 introduces both our fully supervised, multilingual models (Section 3.1) and self-trained model (Section 3.2). We present our results in Section 4. We provide an analysis of results and report on an ablation study in Section 5. We overview related work in Section 6, and conclude in Section 7.

Task Data, Evaluation, and Baselines
The data provided by the organizers of the shared task are extracted from Wiktionary using the WikiPron library (Lee et al., 2020), and consist of 4,050 gold-labeled grapheme-phoneme pairs for each of the 15 languages, split into a training set (3,600 pairs per language) and a development set (450 pairs per language). Table 1 lists each language along with its source and target (IPA) alphabets.

Evaluation. For evaluation, the task organizers use both Word Error Rate (WER) and Phoneme Error Rate (PER). WER is the percentage of words whose predicted transcription does not match the gold transcription; PER is the micro-averaged edit distance between predicted and gold transcriptions. We follow this set-up in evaluating our models on the development data as well, as reported in this paper.
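One plausible reading of these two metrics can be sketched in a few lines of Python (the function names are ours, and we assume whitespace-tokenized phoneme strings, as in the shared task data):

```python
def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def wer(predictions, references):
    # Percentage of words whose predicted transcription differs from gold.
    errors = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * errors / len(references)

def per(predictions, references):
    # Micro-averaged edit distance: total phoneme edits over total gold phonemes.
    total_edits = sum(edit_distance(p.split(), r.split())
                      for p, r in zip(predictions, references))
    total_length = sum(len(r.split()) for r in references)
    return 100.0 * total_edits / total_length
```

Note that PER is micro-averaged: edits and lengths are summed over the whole evaluation set before dividing, so longer words contribute proportionally more.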
Baselines. The organizers provide a number of monolingual baselines. The first is a pair n-gram model encoded as a weighted finite-state transducer (FST), implemented using the OpenGRM toolkit. The second is a bi-LSTM encoder-decoder sequence model implemented using the Fairseq toolkit. The third is a Transformer model, also implemented using Fairseq. The organizer-provided shared task baselines are shown in Table 2 as WER and PER averages over the 15 languages. We now introduce our models.

Models
As explained, our models are based on Transformers, and we offer two primary types of models, depending on how we supervise each. We first introduce fully supervised multilingual models; then we introduce our semi-supervised models (also multilingual). Our semi-supervised models follow a self-training setup. We now explain each of these models.

Supervised, Multilingual Models
We use a multilingual approach where we train a single model on data from all 15 languages. For this purpose, we prepend a token comprising a language code (e.g., fre) to each source grapheme sequence. For our implementation, we use the PyTorch Transformer architecture in the OpenNMT neural machine translation toolkit (Klein et al., 2017). We set the model hyper-parameters as shown in Table 3, following those adopted by Vaswani et al. (2017). We train the model with 3 different random seeds, and at inference we employ an ensemble consisting of the models from 4 training checkpoints (at 50k, 100k, 150k, and 200k steps) for each of the 3 models generated by the random seeds. We note that OpenNMT averages the individual models' prediction distributions, which is how we deploy our ensemble. We use beam search with the OpenNMT default beam width of 5.
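The language-token trick amounts to a one-line preprocessing step; a minimal sketch (the helper name is ours, and OpenNMT simply treats the prepended code as one more source-vocabulary token):

```python
def add_language_token(lang_code, graphemes):
    # Prepend a language-code token (e.g. "fre") to the space-separated
    # grapheme sequence, so that a single multilingual model can condition
    # its phoneme output on the intended language.
    return f"{lang_code} {' '.join(graphemes)}"

# e.g. a French source word, split into graphemes:
source_line = add_language_token("fre", list("chat"))
```

The same transformation is applied to every training, development, and inference source line, while target phoneme sequences are left untouched.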

Wikipedia Data Augmentation
One of the models we submitted to the task employs a self-training approach as a way to augment the training data. The additional data are sourced from Wikipedia articles in 12 of the 15 languages (excluding Adyghe, Japanese, and Vietnamese). We download the Wikipedia dumps from the Wikimedia website and use an off-the-shelf tool to extract the text. Further pre-processing involved removing any remaining XML markup, discarding leading and trailing punctuation and numerals for each word, and ignoring any words with remaining word-internal punctuation or numerals. Due to time constraints, only one million words from each language were used, and of those only the unique entries were submitted to the model for conversion and subsequent evaluation as candidates for augmenting the training data.
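The word-level cleanup described above can be sketched as follows (a minimal illustration under our stated rules; the function name and the one-million-word cap parameter are ours):

```python
import string

# Characters treated as punctuation or numerals for trimming/filtering.
STRIP_CHARS = string.punctuation + string.digits

def clean_words(text, limit=1_000_000):
    # Strip leading/trailing punctuation and numerals from each word,
    # discard words that still contain such characters word-internally,
    # and keep only unique entries from the first `limit` words.
    kept = []
    seen = set()
    for word in text.split()[:limit]:
        word = word.strip(STRIP_CHARS)
        if not word or any(c in STRIP_CHARS for c in word):
            continue
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return kept
```

Note that `str.strip` removes any run of the given characters from both ends, which matches the "leading and trailing punctuation and numerals" rule; hyphenated forms like "re-booting" are dropped by the word-internal check.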

Procedure
As explained, self-training data are drawn from the translations of Wikipedia text in 12 languages, as predicted by an ensemble model. In order to select pairs to augment the training set, we first calculate the mean per-class softmax value in the development set (which we find to be 0.11). 10 Comparatively, the average per-class softmax value for the predicted Wikipedia targets for each language ranges from 0.12 to 0.30. Based on this analysis, we select only those Wikipedia pairs whose predicted targets have a probability greater than 0.2. 11 The selected data are combined with the original (i.e., official task) training set, and the models are re-trained using the same hyper-parameters as in the fully supervised setting.

Results

Both our multilingual and self-trained models demonstrate lower word error rates (WER) and phoneme error rates (PER), averaged across languages, than the baseline monolingual models provided by the task organizers (see Table 2 in Section 2). Error rates per language are shown in Table 5 for the development set and in Table 6 for the blind test set (results published by the organizers). Table 7 shows examples of prediction errors, which demonstrate some of the typical minor errors in phenomena such as voicing (e.g., k vs. ɡ), epenthesis and elision (e.g., p ʁ u vs. p ʁ u l), and coarticulation (e.g., bʲ vs. b).

10 As is known, the softmax function produces a probability distribution over the classes.
11 There could be different ways to select predicted data for augmentation. For example, one could arbitrarily choose the top n% most confidently predicted points (with n being a hyper-parameter).
On average, the fully supervised models performed slightly better than the self-trained model. We expected the self-trained model to perform (at least slightly) better than the fully supervised one; however, due to time constraints, we were not able to augment the training data to a degree at which this hypothesized improvement would be tangible. We leave it as a question for future work whether, and if so to what extent, self-training can improve our models. We now provide an analysis of our findings and report on an ablation study under a number of settings.
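The confidence-based selection used in our self-training (Section 3.2) can be sketched as follows. This is a minimal illustration under our reading of the procedure: each candidate carries the per-step softmax probabilities of its predicted phonemes, and pairs whose mean probability exceeds the 0.2 threshold are kept; the function name and data layout are ours:

```python
def select_silver_pairs(candidates, threshold=0.2):
    # `candidates` is a list of (grapheme_seq, predicted_phonemes, token_probs)
    # triples, where `token_probs` are the softmax probabilities assigned to
    # the predicted phonemes at each decoding step. Pairs whose mean
    # probability exceeds `threshold` become silver training data.
    selected = []
    for graphemes, phonemes, probs in candidates:
        if sum(probs) / len(probs) > threshold:
            selected.append((graphemes, phonemes))
    return selected
```

The selected (grapheme, phoneme) pairs are then simply concatenated with the gold training set before re-training from scratch.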

Analysis & Ablation Study
We suspected that languages with shared writing systems would benefit from the shared representation in our multilingual models and hence see better results, posing a challenge to those languages with a unique orthography (i.e., an orthography not shared by any of the other languages considered). However, our results do not support this hypothesis; there did not appear to be a significant correlation between writing system and results on G2P conversion. For example, a total of 7 of the languages (i.e., dut, fre, hun, ice, lit, rum, vie) use the Roman alphabet, but the WERs for these languages cover a reasonably wide range (from first- to eleventh-best) of the results. It is worth noting, however, that the two languages that use the Cyrillic alphabet (ady, bul) were the two worst-performing languages of the set.

Table 7: Examples of prediction errors.
Lang  Source   Target          Prediction
jpn   こたま    k o̞ d a̠ m a̠    k o̞ t a̠ m a̠
jpn   ひぞう    ç i z oː        ç i z o̞ ː
rum   ceri     t S e r j       Ù e r j
rum   iubeau   j u b j ae u    j u b e̯ a w
Both prior and subsequent to the task deadline, we performed several ablations in order to assess the effectiveness of our approach. First, we compare results based on single models vs. those based on the ensemble. Table 8 shows the development set error rates of the four training checkpoints used in the ensemble, in this case trained with the default (random) seed. Given that each of these results is poorer than our ensemble results for the multilingual model (WER 14.83 / PER 3.41), it is clear that the ensemble approach is superior: exploiting multiple predictions for each word reduces error rates compared to any individual model.
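The checkpoint-ensemble decoding we rely on (OpenNMT averages the models' prediction distributions at each decoding step) can be illustrated roughly as follows; the function names are ours, and greedy selection stands in for the beam search actually used:

```python
def ensemble_distribution(step_distributions):
    # Average the per-model softmax distributions for one decoding step.
    # `step_distributions` is a list of equal-length probability vectors,
    # one per ensemble member (checkpoint x seed).
    n = len(step_distributions)
    size = len(step_distributions[0])
    return [sum(d[i] for d in step_distributions) / n for i in range(size)]

def greedy_pick(distribution):
    # Choose the highest-probability phoneme index (a greedy stand-in
    # for the beam search used in the real system).
    return max(range(len(distribution)), key=lambda i: distribution[i])
```

Averaging distributions (rather than, say, voting over finished hypotheses) lets a phoneme that is second-best for one model but strongly preferred by the others still win at a given step, which is one intuition for why the ensemble outperforms its individual members.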
We also compare our multilingual model's error rates on a given language to those obtained by the respective monolingual models. We note that each of the monolingual models is otherwise initialized with the same parameters as the multilingual model described in Section 3.1. Results for the 15 monolingual models are shown in Table 9. The average WER across all languages is almost twice that of our multilingual model (whether individual or ensemble), and the per-

Related Work
Various data-driven models have been successfully applied to G2P conversion. In terms of English conversion, Bisani and Ney (2008) use co-segmentation and joint sequence models for early data-driven G2P. Novak et al.

Multilingual training is a crucial component of our system. Our approach is closely related to multilingual neural machine translation (Johnson et al., 2017b), where a single model is trained to translate between multiple source and target languages. Others have also explored multilingual approaches to G2P. Deri and Knight (2016) use multilingual G2P conversion for the purpose of adapting models from high-resource languages to train weighted finite-state transducers for related low-resource languages. Ni et al. (2018) experiment with multilingual training for deep learning models. They use pre-trained character embeddings with LSTM encoder-decoders in order to train multilingual G2P models for Chinese, Japanese, Korean, and Thai. In contrast to Ni et al. (2018), we inspect multilingual training in the context of Transformer models.

For our second model, whose training data is augmented from Wikipedia, we use a self-training method. Sun et al. (2019) investigate self-training together with ensemble distillation for English G2P conversion, using Transformer models. Their setting resembles ours: a teacher model is first trained on a gold-standard labeled G2P training set. The teacher model is then used to label additional grapheme data, producing a silver-standard training set. Subsequently, a model ensemble is trained on the combination of the gold and silver data. Sun et al. (2019) train on nearly 200k gold-standard examples and 2M silver-standard examples and report small improvements. In contrast, we do not observe improvements from self-training. This might be a consequence of the small size of the shared task datasets and of our silver-standard Wikipedia data.

Conclusion
We introduced a multilingual approach to G2P conversion, exploiting Transformers in a fully supervised multilingual setting. Strikingly, our choice to model all languages in a shared, multilingual space reduces error rates (in both WER and PER) by almost one half. We also showed that an ensemble of individually trained multilingual Transformers is an improvement over non-ensemble models. We further leveraged multilingual Wikipedia data via a self-training strategy, though due to time constraints we were not able to incorporate enough silver-labeled data into training to see the results we had hoped for. Nevertheless, the multilingual models successfully surpassed all organizer-provided baselines on the task and compared favorably to several other submitted models. Our future work includes scaling up our self-training with larger Wikipedia data and choosing only fully trained models (e.g., in our case, ones at 200k steps) to include in the ensemble.