Frustratingly Easy Multilingual Grapheme-to-Phoneme Conversion

In this paper, we describe two CU-Boulder submissions to the SIGMORPHON 2020 Task 1 on multilingual grapheme-to-phoneme conversion (G2P). Inspired by the high performance of a standard transformer model (Vaswani et al., 2017) on the task, we improve over this approach by adding two modifications: (i) Instead of training exclusively on G2P, we additionally create examples for the opposite direction, phoneme-to-grapheme conversion (P2G). We then perform multi-task training on both tasks. (ii) We produce ensembles of our models via majority voting. Our approaches, though conceptually simple, result in systems that place 6th and 8th among 23 submitted systems, and obtain the best results out of all systems on Lithuanian and Modern Greek, respectively.


Introduction
This paper describes the CU Boulder submissions to the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion (G2P). G2P is an important task, due to its applications in text-to-speech and automatic speech recognition systems. It is explained by Jurafsky and Martin (2009) as follows: "The process of converting a sequence of letters into a sequence of phones is called grapheme-to-phoneme conversion, sometimes shortened g2p. The job of a grapheme-to-phoneme algorithm is thus to convert a letter string like cake into a phone string like [K EY K]."
While the earliest G2P algorithms used handwritten parser-based rules in the format of Chomsky-Halle rewrite rules, often called letter-to-sound, or LTS, rules (Chomsky and Halle, 1968), later techniques moved on to generating semi-automatic alignment tables. Today, much work aims at using machine learning, in particular deep learning techniques, to solve sequence-to-sequence problems like this one.
We explore using a transformer model (Vaswani et al., 2017) for this problem, since it has shown great promise in several areas of natural language processing (NLP), outperforming the previous state of the art on a large variety of tasks, including machine translation (Vaswani et al., 2017), summarization (Raffel et al., 2019), question answering (Raffel et al., 2019), and sentiment analysis (Munikar et al., 2019). While previous work has used transformers for G2P, experiments have only been performed on English, specifically on the CMUDict (Weide, 2005) and NetTalk datasets (Yolchuyeva et al., 2020; Sun et al., 2019). Our approach builds upon the standard architecture by adding two straightforward modifications: multi-task training (Caruana, 1997) and ensembling. We find that these simple additions lead to performance improvements over the standard model, and our models place 6th and 8th among 23 submissions to the SIGMORPHON 2020 shared task on multilingual grapheme-to-phoneme conversion. Our two models further perform best on Lithuanian and Modern Greek, respectively.
Task and Background

Grapheme-to-Phoneme Conversion

G2P can be cast as a sequence-to-sequence task, where the input sequence is a sequence of graphemes, i.e., the spelling of a word, and the output sequence is a sequence of IPA-like symbols, representing the pronunciation of the same word.
Formally, let Σ_G be an alphabet of graphemes and Σ_P be an alphabet of phonemes. For a word w in a language, G2P then refers to the mapping g(w) → p(w), with g(w) ∈ Σ*_G and p(w) ∈ Σ*_P being the grapheme and phoneme representations of w, respectively.
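As a concrete toy illustration of this mapping (the symbols below are our own ARPAbet-style example, not shared-task data), the English word cake from the quotation above could be represented as:

```python
# Toy illustration of the G2P mapping g(w) -> p(w) for a single word.
# The phoneme symbols are illustrative ARPAbet-style symbols, not
# taken from the shared-task data.
word = "cake"
graphemes = list(word)        # g(w) in Sigma_G*: ['c', 'a', 'k', 'e']
phonemes = ["K", "EY", "K"]   # p(w) in Sigma_P*

# A G2P model must learn this mapping; note that the two sequences
# may differ in length, so the task is not a per-symbol substitution.
print(graphemes, "->", phonemes)
```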

Related Work
Many different approaches to G2P exist in the literature, including rule-based systems, LSTMs (Rao et al., 2015), joint-sequence models (Galescu and Allen, 2002), and encoder-decoder architectures, based on convolutional neural networks (Yolchuyeva et al., 2019), LSTMs (Yao and Zweig, 2015), or transformers (Yolchuyeva et al., 2020; Sun et al., 2019). In this paper, we improve over previous work by exploring two straightforward extensions of a standard transformer (Vaswani et al., 2017) model for the task: multi-task training (Caruana, 1997) and ensembling. Multi-task training has been explored previously for G2P (Milde et al., 2017), with the tasks being training on different languages and alphabet sets. Sun et al. (2019) successfully used token-level ensemble distillation for G2P to boost accuracy and reduce model size, ensembling models based on multiple different architectures.

Proposed Approach
We submit two different systems to the shared task, which are based on the transformer architecture, multi-task learning, and ensembling. We describe all components individually in this section.

Model
Our model architecture is shown in Figure 1; it is the vanilla transformer proposed by Vaswani et al. (2017). In short, the transformer is an autoregressive encoder-decoder architecture, which uses stacked self-attention and fully-connected layers for both the encoder and decoder. The decoder is connected to the encoder via multi-head attention over the encoder outputs. Details can be found in the original paper.

Multi-task Training
We propose to train our model jointly on two tasks: (i) G2P and (ii) phoneme-to-grapheme conversion (P2G). Using our formalization from before, given a word w, P2G corresponds to the inverse mapping p(w) → g(w). We denote the set of our original G2P training examples as D_g2p and our P2G examples, which we obtain by inverting all examples in D_g2p, as D_p2g. We then aim to obtain model parameters θ that maximize the joint log-likelihood of both datasets:

θ* = argmax_θ Σ_{(x,y) ∈ D_g2p ∪ D_p2g} log p_θ(y | x)

λ_g, λ_p ∉ Σ_G ∪ Σ_P are special symbols which we prepend to each input. These so-called task embeddings indicate to our model which task each individual input belongs to. Examples for both tasks are shown in Table 2.
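The data construction described above can be sketched as follows (a minimal sketch; the token strings "<g2p>" and "<p2g>" stand in for λ_g and λ_p, and the function name is ours):

```python
# Sketch: build the joint multi-task training set from G2P pairs.
# LAMBDA_G / LAMBDA_P stand in for the special task symbols; the
# actual symbols only need to be absent from both alphabets.
LAMBDA_G = "<g2p>"
LAMBDA_P = "<p2g>"

def make_multitask_data(g2p_pairs):
    """g2p_pairs: list of (grapheme_seq, phoneme_seq) tuples."""
    data = []
    for graphemes, phonemes in g2p_pairs:
        # Original direction: task symbol + graphemes -> phonemes.
        data.append(([LAMBDA_G] + list(graphemes), list(phonemes)))
        # Inverted direction: task symbol + phonemes -> graphemes.
        data.append(([LAMBDA_P] + list(phonemes), list(graphemes)))
    return data

pairs = [(["c", "a", "k", "e"], ["K", "EY", "K"])]
for source, target in make_multitask_data(pairs):
    print(source, "->", target)
```

Prepending the task symbol to the input (rather than, e.g., using separate encoders) keeps a single shared model for both directions.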
Intuition. By training our model jointly on G2P and P2G, we expect it to learn properties that both tasks have in common. First, both tasks require learning of a monotonic left-to-right mapping. Second, for some languages, Σ_G ∩ Σ_P ≠ ∅, cf. Table 2 for Dutch as an example. Symbols in Σ_G ∩ Σ_P are commonly mapped onto each other in both directions, such that we expect the model to learn this from both tasks.

Ensembling
Our second straightforward modification of the standard transformer model is that we create ensembles via majority voting. In particular, each of our two submitted systems is an ensemble of multiple different models for each language, which we generate using different random seeds. We then create predictions with all models participating in each ensemble, and choose the solution that occurs most frequently, with ties being broken randomly.
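Majority voting with random tie-breaking can be sketched as (a minimal sketch; the function name is ours):

```python
from collections import Counter
import random

def majority_vote(predictions, rng=random):
    """Pick the most frequent prediction; break ties randomly.

    predictions: one predicted phoneme string per ensemble member.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    tied = [pred for pred, count in counts.items() if count == best]
    return rng.choice(tied)  # random tie-breaking

# Three ensemble members vote; the majority prediction wins 2-1.
votes = ["k e i k", "k e i k", "k e k"]
print(majority_vote(votes))
```

Voting operates on whole output sequences, so the chosen prediction is always one that some ensemble member actually produced.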
Our first submitted model -CU-1 -is an ensemble of 5 standard G2P transformers and 5 multi-task transformers. Our second system -CU-2 -is an ensemble of 5 multi-task transformers.

Data
The datasets provided for the shared task span 15 individual languages, with each training set consisting of 3600 pairs of graphemes and their associated phonemes. The datasets include an initial set of core languages - Armenian (arm), Bulgarian (bul), French (fre), Georgian (geo), Modern Greek (gre), Hindi (hin), Hungarian (hun), Icelandic (ice), Korean (kor), and Lithuanian (lit) - and a set of surprise languages, which were released shortly before the shared task deadline - Adyghe (ady), Dutch (dut), Japanese (hiragana) (jap), Romanian (rum), and Vietnamese (vie). The data is primarily extracted from Wiktionary using the WikiPron library (Lee et al., 2020).

Hyperparameters
Following the official shared task baseline, we employ the hyperparameters shown in Table 1. All models are trained for 150 epochs. Starting from epoch 100, we evaluate every 5 epochs for early stopping. Encoder and decoder embeddings are tied, and the maximum sequence length is 24. Our system is built on the transformer implementation by Wu et al. (2020), and our final code is available on GitHub.
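The evaluation schedule above can be sketched as follows (a sketch only; train_epoch and dev_wer are hypothetical placeholders for the actual training step and development-set evaluation):

```python
# Sketch of the training/early-stopping schedule described above:
# train for 150 epochs, evaluating on the development set every
# 5 epochs starting from epoch 100, and keep the best checkpoint.
def run_training(train_epoch, dev_wer, total_epochs=150,
                 eval_start=100, eval_every=5):
    best_wer, best_epoch = float("inf"), None
    for epoch in range(1, total_epochs + 1):
        train_epoch(epoch)
        if epoch >= eval_start and (epoch - eval_start) % eval_every == 0:
            wer = dev_wer(epoch)
            if wer < best_wer:
                # In a real system, the model checkpoint is saved here.
                best_wer, best_epoch = wer, epoch
    return best_epoch, best_wer
```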

Metrics
Word error rate (WER). Word error rate is the percentage of words for which the model's prediction does not exactly match the gold transcription.

Phoneme error rate (PER). Phoneme error rate is the percentage of wrong characters in the model's prediction as compared to the gold standard.

Both metrics are calculated using the official evaluation script provided for the shared task.
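A minimal sketch of the two metrics, assuming the common formulation of PER as Levenshtein distance over total reference length (the official script may differ in details such as averaging):

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def wer(predictions, references):
    """Percentage of words whose prediction is not an exact match."""
    wrong = sum(p != r for p, r in zip(predictions, references))
    return 100.0 * wrong / len(references)

def per(predictions, references):
    """Total edit distance over total reference length, in percent."""
    dist = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total = sum(len(r) for r in references)
    return 100.0 * dist / total
```

Note that a word counts fully toward WER even if only one phoneme is wrong, so WER is always at least as pessimistic as PER.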

Development Results
The results on the development sets are shown in Table 3. CU-TB represents a transformer baseline trained by us (an average of 5 models), while CU-1 and CU-2 are our submitted systems, which are described in Section 3.3. CU-1 performs best with an average performance of 14.43 WER and 3.33 PER, followed by CU-2 with 14.65 WER and 3.34 PER, respectively. Both CU-1 and CU-2 improve over the baseline for each of the 15 languages, with an average improvement of 1.82 WER and 1.60 WER, respectively. Both systems show an average improvement of 0.47 PER over the baseline, performing better on all languages, with the sole exception of Bulgarian, where the baseline slightly outperforms CU-2.

Official Shared Task Results
The results on the test set in Table 4 mirror our development set results. Our systems CU-1 and CU-2 are compared with the two best official baselines: a transformer (SIG-TB) and an LSTM sequence-to-sequence model (SIG-LSTM). CU-1 gives the best performance, with an average of 14.52 WER and 3.24 PER, followed by CU-2, with 14.96 WER and 3.31 PER. CU-1 shows an average improvement of 2.99 WER and 2.32 WER as well as 1.06 PER and 0.75 PER over SIG-TB and SIG-LSTM, respectively. CU-2 shows an average improvement of 2.55 WER and 0.99 PER and, respectively, 1.88 WER and 0.68 PER. Compared to all system submissions (Gorman et al.), CU-1 performs best on Lithuanian, with 18.67 WER and 3.53 PER. CU-2 performs best on Modern Greek.

Ablation Study
We further perform an ablation study to explicitly investigate the impact of our two modifications - multi-task training and ensembling - with results shown in Table 5. T and MT are the standard and multi-task transformer, while T-E and MT-E are the ensembled versions of the same. The ensembles obtain better results: T-E shows an average improvement of 1.50 WER over T, and MT-E outperforms MT by 0.72 WER. Multi-task training also leads to performance gains, with MT improving over T by 0.88 WER and MT-E over T-E by 0.11 WER, showing that the effect of multi-task training is not as strong as that of ensembling. We conclude that both multi-task training and ensembling boost performance overall.

Conclusion
We described two CU Boulder submissions to SIGMORPHON 2020 Task 1. Our systems consisted of transformer models, some of which were trained in a multi-task fashion on G2P and P2G. We further created ensembles consisting of multiple individual models via majority voting.
Our internal experiments and the official results showed that these two straightforward extensions of the transformer model enabled our systems to improve over the official shared task baselines and a standard transformer model for G2P. Our final models, CU-1 and CU-2, placed 6th and 8th out of 23 submissions, and obtained the best results of all systems for Lithuanian and Modern Greek, respectively.