Grapheme-to-Phoneme Conversion with a Multilingual Transformer Model

In this paper, we describe our three submissions to the SIGMORPHON 2020 shared task 1 on grapheme-to-phoneme conversion for 15 languages. We experimented with a single multilingual transformer model. We observed that the multilingual model achieves results on par with our separately trained monolingual models and is even able to avoid a few of the errors made by the monolingual models.


Introduction
Grapheme-to-phoneme conversion is the task of predicting the phonemic representation for a given orthographic word, where a phoneme is the smallest unit of sound which can distinguish one word from another. In many languages, some phonemes have different realizations depending on their context, and these variants are called allophones. While the task is about predicting phonemes and not allophones, in fact most datasets (e.g., the datasets for Hungarian, Bulgarian, and Armenian) also contain allophones. However, since the distribution of allophones conditioned on the context is learnable, this is not an issue.
The shared task training data consists of 15 languages which have diverse phonologies, ranging from tonal languages to languages with glottalized consonants, and they are written in eight different writing systems. The data comes from the English version of Wiktionary. Each training set contains 3600 words, and each development and test set contains 450 words. The official metrics for the task are Word Error Rate (WER) and Phoneme Error Rate (PER).
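The two metrics can be illustrated with a minimal sketch. It assumes that WER is the percentage of words whose predicted phoneme sequence differs from the gold sequence, and that PER is the summed edit distance divided by the summed gold length (micro-averaged); the official evaluation script may differ in details such as averaging. The function names are our own.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Percentage of words with at least one wrong phoneme."""
    wrong = sum(1 for g, p in zip(gold, pred) if list(g) != list(p))
    return 100.0 * wrong / len(gold)


def per(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Total edit distance as a percentage of total gold length
    (micro-averaging is an assumption of this sketch)."""
    edits = sum(edit_distance(g, p) for g, p in zip(gold, pred))
    total = sum(len(g) for g in gold)
    return 100.0 * edits / total
```

With two gold words of two phonemes each and a single substituted phoneme in one prediction, this sketch yields a WER of 50.0 and a PER of 25.0.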
A multilingual approach to grapheme-to-phoneme conversion has been explored by Milde et al. (2017). They propose a sequence-to-sequence multilingual model that benefits from training on additional phonetic representations for the same language (which was not permitted in our shared task).
The Transformer (Vaswani et al. 2017) with its attention mechanism has been applied very successfully to machine translation tasks, and it has also been used for grapheme-to-phoneme conversion. Yolchuyeva et al. (2019) suggested using a Transformer-based approach for grapheme-to-phoneme conversion, and Yu et al. (2020) proposed a multilingual Transformer model for languages with different writing systems by employing a byte-level input representation.
In our submission to the shared task, we explore the performance of a multilingual Transformer model with an augmented input representation, which can transduce a word from any language present in the training data into its IPA representation.

IPA
The phonemic representation in this task uses the International Phonetic Alphabet (IPA). Interestingly, the IPA suffers from the lack of an "orthography". This might seem surprising given that the IPA aims at representing the pronunciation of words with more rigor than typical orthographies. However, different levels of depth of analysis are possible with the IPA, and this makes inconsistent use of symbols among annotators unavoidable. To give an example, Bulgarian exhibits a voiceless coronal plosive /t/~/t̪/. The phoneme is articulated as a dental plosive in Bulgarian. Somewhat arbitrarily, the IPA provides an atomic symbol for the voiceless alveolar plosive (/t/), but only a composed symbol for the voiceless dental plosive (/t̪/). In principle, /t̪/ would be the correct representation for the phoneme in question, but since there is no phonemic contrast between dental and alveolar articulation in Bulgarian, a simple /t/ suffices to represent the voiceless coronal plosive phoneme in Bulgarian. Hence, as is to be expected, the phoneme is not transcribed consistently in the training data: /t̪/ is used 1588 times, whereas /t/ is used 681 times. Similar issues are found frequently for other phonemes and for other languages.

Languages
In our monolingual baseline models trained with the Transformer baseline published by the task organizers, the WER (PER) ranged from only 3.78 (0.66) for Hungarian up to 40.00 (16.38) for Korean. Seeing these huge differences in performance, it seemed worth analyzing the difficulties faced by the model for the three languages with the worst WER, viz. Korean (40.00), Bulgarian (30.67), and Georgian (28.44).

Georgian
We were particularly surprised to see Georgian among the seemingly most difficult languages. Georgian has a fully phonemic alphabet; each character represents exactly one phoneme, and each phoneme is represented by exactly one character (Hewitt 1995). Grapheme-to-phoneme conversion (and phoneme-to-grapheme conversion) for Georgian is thus a trivial task and can be done in principle with 100% accuracy using a simple 1-to-1 look-up table.
We actually implemented this look-up table, and this allowed us to identify and quantify the issues in the Georgian dataset. We found that there are three phonemes that are each inconsistently represented by two IPA symbols (distributed roughly 50/50): i~ɪ; x~χ; ɣ~ʁ. The difference between these symbols is neither phonemic nor allophonic. Rather, it is caused by different annotators using different representations for a given phoneme, in line with the orthographic weakness of the IPA outlined above in Section 2.1.
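The kind of check described here can be sketched as follows. The sketch assumes that each entry pairs a Georgian word with a list of phoneme tokens of the same length, so that graphemes and phonemes align by position. Choosing the majority symbol as the canonical one is an assumption made purely for illustration; it is not necessarily how our table was built. All function names and the toy entries are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, List[str]]


def symbol_counts(entries: Iterable[Entry]) -> Dict[str, Counter]:
    """Count, for each grapheme, the IPA symbols it was transcribed with.

    Entries whose grapheme and phoneme counts differ cannot be aligned
    position by position and are skipped.
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, phonemes in entries:
        if len(word) != len(phonemes):
            continue
        for grapheme, phoneme in zip(word, phonemes):
            counts[grapheme][phoneme] += 1
    return counts


def inconsistent(counts: Dict[str, Counter]) -> Dict[str, Dict[str, int]]:
    """Graphemes that were transcribed with more than one symbol."""
    return {g: dict(c) for g, c in counts.items() if len(c) > 1}


def build_table(counts: Dict[str, Counter]) -> Dict[str, str]:
    """1-to-1 table using the majority symbol (illustrative assumption)."""
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}


def transcribe(word: str, table: Dict[str, str]) -> List[str]:
    """Apply the look-up table character by character."""
    return [table[g] for g in word]


# Toy entries invented for illustration, not taken from the task data.
toy = [("ხე", ["x", "ɛ"]), ("ხე", ["χ", "ɛ"]), ("ხე", ["x", "ɛ"])]
toy_counts = symbol_counts(toy)
```

On the toy entries, `inconsistent(toy_counts)` reports only the grapheme ხ, which appears with both x and χ.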
We reported these data inconsistencies, and we prepared a consistent dataset produced with our look-up table. Together with the organizers, we planned to update the Georgian data directly on Wiktionary and then re-retrieve the training data from there. Unfortunately, bulk uploading to Wiktionary is not trivial, and it was not possible for us to update the data before the task deadline. For the current task, it means that the WER cannot be substantially reduced for Georgian due to these inconsistencies.

Bulgarian
Bulgarian exhibits vowel reduction in unstressed syllables (similar phenomena are found, for instance, in English, German, and Russian), which leads to many allophones for vowels in unstressed positions (Leafgren 2020). These allophones should not be present in a purely phonemic transcription; however, they are present in the given training set. Furthermore, the pronunciation of a vowel in Bulgarian depends on the position of stress, yet Bulgarian word stress can fall on any syllable and is not completely predictable. We experimented with a self-written tool that predicts the stress position in Bulgarian based on heuristics. However, a stress-annotated training set decreased the WER only marginally, which is why we abandoned this approach. Issues similar to those discussed above for Georgian are present in the Bulgarian training data, and these were also discussed on GitHub. However, these issues are somewhat more difficult to solve automatically than those for Georgian.

Korean
Korean uses an alphabet that provides a symbol for each consonant and for each vowel, yet it groups symbols into square syllable blocks, which makes it look somewhat close to Chinese and Japanese writing, although it is much simpler. By default, Unicode encodes Korean in syllable blocks and not as single sounds, which results in a character set comprising thousands of characters. Luckily, Unicode also provides code points for the single-sound characters (called Jamo), and syllable characters can easily be decomposed to single-sound characters. We used hangul-jamo for this decomposition. To give an example of the decomposition, 가감 /k a̠ ɡ a̠ m/ is decomposed to ㄱㅏㄱㅏㅁ. With this approach, we were able to decrease the WER and PER of our Korean baseline Transformer model considerably: the WER was reduced from 40.00 to 21.50, and the PER from 16.38 to 3.86. We use this preprocessing step for Korean for all our submitted models.
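The decomposition itself can be sketched without any external package by using the arithmetic layout of the Unicode Hangul Syllables block (19 initial consonants, 21 vowels, 28 final positions including "no final"). This is a minimal stand-in for illustration, not the hangul-jamo implementation; the function name `decompose` and the choice to emit Hangul Compatibility Jamo, as in the example above, are assumptions of this sketch.

```python
HANGUL_BASE = 0xAC00  # first precomposed syllable (가)
HANGUL_LAST = 0xD7A3  # last precomposed syllable

# Hangul Compatibility Jamo consonants U+3131..U+314E (30 letters).
CONSONANTS = [chr(c) for c in range(0x3131, 0x314F)]
# Consonant clusters occur only in final position.
CLUSTERS = {0x3133, 0x3135, 0x3136, *range(0x313A, 0x3141), 0x3144}
# These three tense consonants never occur in final position.
NOT_FINAL = {0x3138, 0x3143, 0x3149}

LEADS = [c for c in CONSONANTS if ord(c) not in CLUSTERS]
VOWELS = [chr(c) for c in range(0x314F, 0x3164)]
TAILS = [""] + [c for c in CONSONANTS if ord(c) not in NOT_FINAL]
assert (len(LEADS), len(VOWELS), len(TAILS)) == (19, 21, 28)


def decompose(text: str) -> str:
    """Replace every precomposed Hangul syllable by its single jamo."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead] + VOWELS[vowel] + TAILS[tail])
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return "".join(out)
```

Applied to 가감, the sketch returns ㄱㅏㄱㅏㅁ, which reduces the source-side vocabulary from thousands of syllable blocks to a few dozen letters.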

Approach
We trained a multilingual model which can transduce a word in any of the 15 source languages into its IPA representation. Multilingual models can be of the types many-to-one, one-to-many, or many-to-many. In our case, there are obviously multiple languages on the source side. On the target side, there is usually exactly one desired phoneme sequence for a given source word. Superficially, we thus have a many-to-one problem. However, many character sequences exist in more than one language. For instance, the character sequence <transformation> without further context can be read as an English word or as a French word, and its pronunciation depends on the choice of language (/tɹæns.fɔɹ.meɪ.ʃən/ vs. /tʁɑ̃s.fɔʁ.ma.sjɔ̃/). This makes it a many-to-many problem for a subset of the data.
The possibility of multiple desired sequences on the target side for a given source word makes it necessary to annotate the source words with the desired language. In our approach, we prefix each source word with its two-letter ISO language code, followed by an underscore, e.g. 'fr_maison', or 'ka_ავტორი'. This is similar to the approach in Johnson et al. (2017).
A side effect of our multilingual approach is that the size of the training data is increased from 3600 to 54000 (15 × 3600) samples. Ideally, a model might profit from this enlarged dataset, and languages can learn from each other. Given the various source-side writing systems and differences in phoneme sets across languages, we expect cross-language learning to be somewhat limited.
The multilingual approach proposed here allows for language-specific preprocessing where needed. In our case, we only used a preprocessing step for Korean, as outlined above in Section 2.2.3.
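The input augmentation described in this section can be sketched as follows. The sketch prefixes each source word with its language code and an underscore, after applying an optional language-specific preprocessing function. The function names and the `preprocessors` mapping are hypothetical, and how the tagged string is subsequently tokenized for the model is not covered here.

```python
from typing import Callable, Dict, List, Optional, Tuple

Preprocessor = Callable[[str], str]
Pair = Tuple[str, str]  # (source word, space-separated phonemes)


def tag_source(word: str, lang: str,
               preprocessors: Optional[Dict[str, Preprocessor]] = None) -> str:
    """Apply language-specific preprocessing, then add the language prefix."""
    if preprocessors and lang in preprocessors:
        word = preprocessors[lang](word)
    return f"{lang}_{word}"


def build_multilingual(data_by_lang: Dict[str, List[Pair]],
                       preprocessors: Optional[Dict[str, Preprocessor]] = None
                       ) -> List[Pair]:
    """Merge per-language training sets into one tagged training set."""
    merged: List[Pair] = []
    for lang, pairs in data_by_lang.items():
        for word, phonemes in pairs:
            merged.append((tag_source(word, lang, preprocessors), phonemes))
    return merged
```

For example, `tag_source("maison", "fr")` gives `fr_maison`. Merging 15 training sets of 3600 words each in this way gives the 54000 samples mentioned above, and a Korean decomposition function would be registered under the key `"ko"`.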

Model UZH-1
For our first submission, we used the Transformer baseline provided by the organizers and experimented with different hyperparameters. The Transformer (Vaswani et al. 2017) is implemented in Fairseq (Ott et al. 2019) and uses Adam (Kingma and Ba 2015) for optimization and ReLU as an activation function. It has 4 encoder and decoder layers with 4 attention heads each.
Our submitted model uses the largest value in our search space for each tuned hyperparameter: an embedding dimension of 256, a hidden size of 1024, a batch size of 1024, and a dropout probability of 0.3. Due to limited computational resources, further tuning with even larger hyperparameter values was not feasible for us.

Model UZH-2
For our second submission, we added extra language data from 6 languages not addressed in the task, viz. English, Italian, Portuguese, Czech, Danish, and Macedonian. Some of these languages have rather small data sets available on Wiktionary, so we added only 2400 training samples and 300 development samples per language, which is two thirds of the data for the other languages.
We selected the additional languages based on our intuition regarding whether a language might be useful for one or more of the 15 languages in the task. An additional restriction was the fact that large enough data sets are available mainly for European languages. Of the selected additional languages, some are closely related to another one from the official training set (e.g., Macedonian to Bulgarian, or, to a lesser degree, Danish to Dutch). Others have similar phonologies (e.g., Spanish and Greek, or Czech and Hungarian). In addition, some training sets (e.g., the one for French) contain English loanwords whose irregular pronunciation might be learned from additional English data.
The data was retrieved from Wiktionary using WikiPron (Lee et al. 2020) and sampled randomly. We used the same model architecture and the same hyperparameter search space for this experiment as in UZH-1, and the final model has the same hyperparameter values as UZH-1.
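The random sampling step can be sketched as follows, assuming the retrieved entries for one language are already available as a list of (word, pronunciation) pairs. The function name, the fixed seed, and the error handling are assumptions of this sketch; the default sizes are the 2400 training and 300 development samples stated above.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (word, pronunciation)


def sample_split(entries: List[Pair], n_train: int = 2400,
                 n_dev: int = 300, seed: int = 0
                 ) -> Tuple[List[Pair], List[Pair]]:
    """Draw disjoint random training and development sets."""
    needed = n_train + n_dev
    if len(entries) < needed:
        raise ValueError(f"need {needed} entries, got {len(entries)}")
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = rng.sample(entries, needed)  # sampling without replacement
    return sample[:n_train], sample[n_train:]
```

Because the sample is drawn without replacement and then cut in two, no entry can end up in both the training and the development set.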

Model UZH-3
Our third submission is an ensemble model. It uses the predictions of UZH-1 and UZH-2, and for each word it takes the higher-probability prediction of the two models.
As can be seen from Table 1, our basic multilingual system (UZH-1) achieved a macro-average WER of 17.07 and a PER of 3.47 on the official test set.
For the multilingual model with additional data from six extra languages (UZH-2), we achieved a macro-average WER of 17.63 and a PER of 3.56. While performance did not increase with this approach, it also did not decrease dramatically, which indicates that it would be possible to have an even larger multilingual model for more than 15 languages without major performance loss.
More interestingly, even though the performance of UZH-2 was slightly worse, the model was able to resolve some of the errors made by UZH-1, while at the same time introducing others. We assume that there is indeed a cross-language interference which can influence the result both positively and negatively. We observed similar behavior on the development set during our experiments, which brought us to the idea of combining the results of both systems to get the best of both. Indeed, our ensemble model (UZH-3), which takes the prediction with the higher probability from UZH-1 and UZH-2, was the best-performing model among our submissions with a macro-average WER of 16.34 and PER of 3.27.
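The selection rule of the ensemble can be sketched as follows. The sketch assumes that each model returns, for every word, a predicted phoneme sequence together with a sequence-level score (for instance a log-probability), and that both prediction lists are in the same word order. Preferring the first model on ties is an assumption, as is the function name.

```python
from typing import List, Tuple

Prediction = Tuple[List[str], float]  # (phoneme sequence, model score)


def ensemble(preds_a: List[Prediction],
             preds_b: List[Prediction]) -> List[List[str]]:
    """For each word, keep the prediction with the higher score."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both models must predict the same words")
    chosen = []
    for (hyp_a, score_a), (hyp_b, score_b) in zip(preds_a, preds_b):
        # Ties go to the first model (assumption of this sketch).
        chosen.append(hyp_a if score_a >= score_b else hyp_b)
    return chosen
```

The rule needs no additional training: it only compares the scores that the two models already produce.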