CLUZH at SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion

This paper describes the submission of the Institute of Computational Linguistics, University of Zurich, to the Multilingual Grapheme-to-Phoneme Conversion (G2P) task of the SIGMORPHON 2020 challenge. The submission adapts our system from the 2018 edition of the SIGMORPHON shared task. Our system is a neural transducer that operates over explicit edit actions and is trained with imitation learning. It is well suited for morphological string transduction partly because it exploits the fact that the input and output character alphabets overlap. The challenge posed by G2P is to adapt the model and the training procedure to work with disjoint alphabets. We adapt the model to use substitution edits and train it with a weighted finite-state transducer acting as the expert policy. An ensemble of such models produces competitive results on G2P. Our submission ranks second out of 23 submissions by a total of nine teams.


Introduction
G2P requires mapping a sequence of characters in some language to a sequence of International Phonetic Alphabet (IPA) symbols, which represent the pronunciation of the input character sequence in some abstract way (not necessarily phonemic, despite the task's name) (Figure 1).
Multilingual G2P is Task 1 of this year's SIGMORPHON challenge. It features fifteen languages from various phylogenetic families, written in a variety of scripts. We refer the reader to Gorman et al. (2020) for an overview of the language data. Each language comes with 3,600 training and 450 development set examples. It is permitted to use external resources as well as to build a single multilingual model.
Figure 1: fathaigh → /fa:/ ("giants"), Irish of Cois Fhairrge (de Bhaldraithe, 1953).

We participate in this shared task with an adaptation of our SIGMORPHON 2018 system (Makarov and Clematide, 2018b), which was particularly successful in type-level morphological inflection generation. Our system is a neural transducer that operates over explicit edit actions and is trained with imitation learning (IL; Daumé III et al., 2009; Ross et al., 2011; Chang et al., 2015). It has a number of useful inductive biases, one of which is the familiar bias towards copying the input (implemented as the traditional copy edit). This is particularly useful for morphological string transduction problems, which typically involve small and local edits and in which most of the input is preserved in the output. This contrasts with models that rely purely on generating characters, such as generic encoder-decoder models, which consequently suffer, particularly on smaller datasets.
Copying requires that the input and output character alphabets overlap, preferably substantially. This also allows our IL training to leverage a simple-to-implement expert policy (which during training provides the learner with demonstrations of how to optimally solve the task). Optimally completing the target given the prediction generated so far during training requires finding edits that extend the prediction such that the Levenshtein distance (Levenshtein, 1966) between the target and the concatenation of the partial prediction and the future suffix is minimized. Unfortunately, this objective alone does not discriminate between the multiple edit action sequences that relate the input to the partial prediction plus future suffix. To address this spurious ambiguity, our IL training adds edit sequence scores, computed using traditional costs, into the objective. This naturally encourages the system to copy; however, it fails on any editing problem with disjoint alphabets.
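To make the combined objective concrete, here is a minimal sketch (not the authors' implementation; function names are ours) of the standard Levenshtein dynamic program with traditional unit costs, which underlies both terms of β·ED(ŷ, y) + ED(x, ŷ):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with traditional unit costs for insert, delete, substitute."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                           # delete a[i-1]
                D[i][j - 1] + 1,                           # insert b[j-1]
                D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitute or copy
            )
    return D[n][m]

def objective(x: str, y: str, y_hat: str, beta: float = 1.0) -> float:
    # the combined task objective: beta * ED(y_hat, y) + ED(x, y_hat)
    return beta * levenshtein(y_hat, y) + levenshtein(x, y_hat)
```

The second term is what rewards copy-heavy edit sequences when the alphabets overlap.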
G2P poses an interesting challenge for a system like ours. On the one hand, G2P shares many similarities with morphological string transduction: The changes are mostly local, a traditional left-to-right transduction suffices, and a substantial part of the work arguably consists of applying equivalence rules (e.g. the German letter "g" most often maps to /g/, and "a" to /a/ or /a:/), which is similar to copying. Yet, a general solution to G2P cannot rely on overlapping alphabets, since many scripts share few symbols, if any at all, with the IPA (e.g. Korean or Georgian).
Our solution adapts the model to use substitution edits and trains it with a weighted finite-state transducer acting as the expert policy.

Model description
The underlying model is the neural transducer introduced by Aharoni and Goldberg (2017). It defines a conditional distribution p(a | x) over edit action sequences, where x is an input sequence of graphemes and a = a_1 ... a_|a| is a sequence of traditional edits. (The output sequence of IPA symbols y is computed deterministically from x and a.) The model is equipped with a long short-term memory (LSTM) decoder and a bidirectional LSTM encoder (Graves and Schmidhuber, 2005). The challenge lies in training this model: Due to the recurrent decoder, it cannot be trained with exact marginal likelihood, unlike the more familiar weighted finite-state transducer (WFST; Mohri, 2004; Eisner, 2002) or its neuralizations (Yu et al., 2016). For a more detailed description of the model, we refer the reader to Makarov and Clematide (2018a).

Figure 2: Stochastic edit distance (Ristad and Yianilos, 1998): a memoryless probabilistic FST with arcs for insertions INS(Ω), deletions DEL(Σ), and substitutions SUB(Σ, Ω), where Σ and Ω stand for any input and output symbol, respectively.

IL training

Makarov and Clematide (2018a) propose training the model using IL, a general model-fitting framework for sequential problems over exponentially sized output spaces. IL has been applied successfully to natural language processing (NLP) problems, e.g. transition-based parsing (Goldberg and Nivre, 2012) and language generation (Welleck et al., 2019). IL relies on the availability of demonstrations of how the task can be solved optimally from any configuration. Due to the nature of many NLP problems, such demonstrations can often be provided by a rule-based program (known as an expert policy). Makarov and Clematide (2018a) use a combination of the Levenshtein distance and the edit sequence cost as the task objective, β ED(ŷ, y) + ED(x, ŷ) with β ≥ 1, and devise an expert policy for it.
Given a target sequence y, a partially completed prediction ŷ_{1:n}, and the remaining input sequence x_{k:l}, the expert needs to (1) identify the set of target suffixes y_{j:m} that, when appended to ŷ_{1:n}, lead to a prediction with minimum Levenshtein distance from the target, and (2) check which of the edit sequences producing those suffixes have the lowest cost, i.e. minimum Levenshtein distance from the remaining input.
The second part is crucial for training accurate models, especially in the limited-resource setting, as it reduces the spurious ambiguity arising under the first part of the objective alone. It is also the second part of the training objective that hinges on the overlap of the input and output alphabets, as this permits minimization using the edit distance dynamic program with traditional costs.
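The first step, finding the target suffixes that keep the Levenshtein distance minimal, can be read off the last row of the edit distance prefix matrix. A minimal illustrative sketch (ours, not the released implementation):

```python
def optimal_suffixes(pred: str, target: str) -> set:
    """Return the target suffixes that, appended to pred, minimize the
    Levenshtein distance of the completed prediction from the target."""
    n, m = len(pred), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (pred[i - 1] != target[j - 1]))
    # completing with target[j:] yields total distance D[n][j]
    best = min(D[n])
    return {target[j:] for j in range(m + 1) if D[n][j] == best}
```

For the paper's French example (prediction "abZe", target "abZEkt"), this yields the completing suffixes "Ekt" and "kt".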

Adaptation to G2P
The adaptation is twofold: First, we introduce substitution edits, which were not employed in previous versions of the system in order to keep the total number of edit actions to a minimum. For each output character c, there is now a substitution action SUBS[c], which writes c in place of any input character x.
When the alphabets are disjoint, completing edit sequences cannot be scored very informatively using traditional edit costs. For example, for the data sample кит → /kʲit/ (Russian: "whale"), we would like the most natural edit sequence, SUBS[k] INS[ʲ] SUBS[i] SUBS[t], to attain the lowest cost. Yet, it is clear that under traditional costs, this sequence attains the same cost as any other that consists of three substitutions and one insertion. Our solution is to learn costs from the training data to ensure an intuitive ranking of edit sequences.
SED policy

Learning costs, as well as computing string distances, can be achieved with a very simple WFST: stochastic edit distance (SED; Ristad and Yianilos, 1998), a probabilistic version of the Levenshtein distance (Figure 2). We use the traditional multinomial parameterization.
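In the multinomial parameterization, SED keeps a single distribution over all edits (substitutions, insertions, deletions) plus a stop event, so the weights sum to one. A small sketch with illustrative alphabets (the edit-tuple encoding is our own convention):

```python
# illustrative alphabets
sigma = ["к", "и", "т"]       # input (grapheme) alphabet
omega = ["k", "ʲ", "i", "t"]  # output (IPA) alphabet

# one multinomial over all SUB/INS/DEL edits plus a stop event
edits = ([("SUB", x, y) for x in sigma for y in omega]
         + [("INS", y) for y in omega]
         + [("DEL", x) for x in sigma]
         + [("END",)])

# uniform initialization before EM training
params = {e: 1.0 / len(edits) for e in edits}
assert abs(sum(params.values()) - 1.0) < 1e-9
```

EM then reallocates this probability mass so that frequent grapheme-phoneme correspondences become cheap.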
Before training the neural transducer, we train a SED model using the Expectation-Maximization algorithm (Dempster et al., 1977). For each edit e, we use the following update in the M-step:

p(e) = max(θ_e + α − 1, 0) / Σ_e' max(θ_e' + α − 1, 0),

where θ_e is the unnormalized weight computed in the E-step and 0 < α < 1 is a sparse Dirichlet prior parameter associated with this edit. This corresponds to sparse regularization via a Dirichlet prior (Johnson et al., 2007), which results in many edits having zero probability. We found this training to lead to more accurate SED models. Furthermore, it dramatically reduces the size of the edit action set that the neural transducer is defined over.
SED is integrated into the expert policy. During training, given a configuration consisting of a partial prediction, the remainder of the input, and the target, we query the expert policy for the next optimal edits. We minimize the first part of the objective much as before, and we minimize the second part by decoding SED with the Viterbi algorithm.
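Decoding SED with Viterbi amounts to a best-path dynamic program in negative log space. A self-contained sketch (the edit probabilities and tuple keys are illustrative):

```python
import math

def viterbi_cost(src: str, tgt: str, params: dict) -> float:
    """Negative log probability of the best SED edit sequence
    transducing src into tgt (lower = cheaper completion)."""
    inf = float("inf")
    def nll(edit):
        p = params.get(edit, 0.0)
        return -math.log(p) if p > 0.0 else inf
    n, m = len(src), len(tgt)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                D[i][j] = min(D[i][j], D[i - 1][j] + nll(("DEL", src[i - 1])))
            if j > 0:
                D[i][j] = min(D[i][j], D[i][j - 1] + nll(("INS", tgt[j - 1])))
            if i > 0 and j > 0:
                D[i][j] = min(D[i][j],
                              D[i - 1][j - 1] + nll(("SUB", src[i - 1], tgt[j - 1])))
    return D[n][m]
```

The expert uses such a cost as the cost-to-go when ranking permissible edits.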
Suppose we transduce the French word x = abject ("vile") into the target y = a b Z E k t. Suppose also that the neural transducer currently attends to the character x_4 = e and that the prediction built so far during training is ŷ = a b Z e (note the error). We query the SED policy to get the optimal edit action whose likelihood we will maximize. First, much as before, we find the edits that are optimal with respect to the first term of the training objective (call them permissible): those that do not increase the Levenshtein distance of the prediction from the target, assuming all subsequent edits are permissible too. (This can be verified by inspecting the Levenshtein distance prefix matrix for the strings ŷ and y.) Each such edit starts a suffix that completes the target, e.g. "E k t" for SUBS[E] and "k t" for SUBS[k]. Next, we use SED to rank the permissible edits by cost-to-go. For each of the edits and their corresponding suffixes, the expert needs to execute the edit (e.g. SUBS[E] writes E and moves the attention to x_5 = c) and then decode SED with Viterbi on the remaining input and the suffix (both possibly modified by the edit). In this way, we obtain the optimal action with the lowest cost-to-go, i.e. the negative sum of the log probabilities of the edit and of the Viterbi-decoded edit sequence.

Exploration

This time, we also train the transducer with an aggressive exploration schedule: p_sampling(i) = 1 / (1 + exp(i)), where i is the training epoch number. After a couple of training epochs, training configurations are generated entirely by executing edit actions sampled from the model.
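The exploration schedule decays extremely fast; a quick check of p_sampling(i) = 1/(1 + exp(i)), read here as the probability of still following the expert at epoch i, shows why rollouts are almost entirely model-generated after the first few epochs:

```python
import math

def p_sampling(i: int) -> float:
    # schedule value at epoch i; as it vanishes, training configurations
    # come from edit actions sampled from the model rather than the expert
    return 1.0 / (1.0 + math.exp(i))

# epoch 0: 0.5; epoch 2: ~0.12; epoch 5: ~0.0067
```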

Submission details
We train separate models for each language on the official training data and use the development set for model selection. Our submission does not use any additional lexical resources.
For most of the models, we employ Unicode decomposition normalization (NFKD) as a data preprocessing step. Importantly, this helps decompose Unicode syllable blocks used, e.g., in Hangul.
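NFKD splits precomposed Hangul syllable blocks into their constituent jamo, giving the transducer smaller and more regular input units. For example, in Python:

```python
import unicodedata

word = "한글"  # two precomposed Hangul syllable blocks
decomposed = unicodedata.normalize("NFKD", word)

# each block decomposes into its jamo: 한 → ᄒ ᅡ ᆫ, 글 → ᄀ ᅳ ᆯ
assert len(word) == 2 and len(decomposed) == 6
```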
The size of the development set is rather small (450 examples), and having examined the data, we suspect that relying too heavily on the development set for model selection might hurt generalization. For example, the French development set contains three exceptions to the "ill"-/j/ equivalence; thus, a single model that achieves a high score on the development set might, in fact, be overfitting. To counter this, we build an eleven-model-strong majority-vote ensemble. Fortunately, training a neural transducer is fast, as one epoch takes just about four minutes on average on a single CPU, due to the relatively small number of model parameters.
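Majority voting over the eleven models can be as simple as picking the most frequent output string per input word; a minimal sketch (the tie-breaking policy is our assumption):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: the output strings proposed by each ensemble member
    for one input word; returns the most frequent one (first seen wins ties)."""
    return Counter(predictions).most_common(1)[0][0]
```

Voting over full output strings, rather than per-symbol, keeps each prediction internally consistent.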

Results and Discussion
Our system ranks second among 23 submissions by a total of nine teams (Table 1). It ties for first place on four languages (Hindi, Hungarian, Icelandic, Lithuanian) and outperforms every other submission on Armenian. It achieves strong gains over the neural baselines. Ensembling gains us 16% in error reduction compared to test set averages, a substantial improvement. We leave it for future work to see whether dropout and a larger model size could be used as effectively as ensembling. Unicode decomposition normalization boosts the performance of our Korean models. On average, at least one model predicts the output correctly for all but 7.93% of the words (⊥), with Adyghe, Lithuanian, and Bulgarian being the most difficult languages. For some languages, the WER standard deviation is high, likely confirming our hypothesis that model selection on the small development set leads to poor generalization. Table 2 shows the most frequent errors of our system for each language and helps to qualitatively assess their strongly varying error profiles. We take a closer look at the errors in French and Korean. Additional lexical information could improve our French models. E.g., the word's lexical category and/or morphological segmentation would probably help to correctly transduce the word-final "-ent" (pronounced /ɑ̃/ in the adverb "vraiment" (truly) but silent in the verb "viennent" (they come)). Many errors in French occur in English borrowings.

Error analysis
We look in some detail at the errors on the Korean test data that all or almost all of the individual models in the ensemble make. As expected, lexicalized phenomena contribute most of the errors: vowel length (which is neither phonemic nor phonetically realized in the speech of all except elderly speakers (Sohn, 2001)) and tensification. Vowel length is not indicated in Korean orthography, and neither is tensification (with some exceptions). Knowing whether a word is an English borrowing (e.g. 섹스 seksŭ (sex)) or whether a word is a compound and where the morpheme boundary lies (초승달 ch'osŭng-tal (new moon)) could help predict non-automatic tensification correctly in a small number of cases.

How good is the SED policy?

Somewhat surprisingly, using SED as part of the expert policy results in competitive performance. Yet, SED is a very crude model: e.g., because of the lack of context, when used as a conditional model, SED assigns less probability to any edit sequence containing insertions than to the same sequence with all the insertions removed; this, for instance, makes it unusable as a standalone model for G2P. On top of this, we also do not use learned roll-outs, which would be recommended when training with a sub-optimal expert (Chang et al., 2015). We leave it for future work to examine whether the neural transducer's performance on G2P would improve from replacing SED with a more powerful model.

Conclusion
This paper presents the approach taken by the CLUZH team to the SIGMORPHON 2020 Multilingual Grapheme-to-Phoneme Conversion challenge. Our submission is based on our successful SIGMORPHON 2018 system, which is a majority-vote ensemble of neural transducers trained with imitation learning. We adapt the 2018 system to work on transduction problems with disjoint input and output alphabets. We add substitution actions (not available in previous versions of the system) and employ a memoryless probabilistic finite-state transducer to define the expert policy for imitation learning. We use majority-vote ensembling to counter overfitting to the small development sets. These simple modifications result in highly competitive performance, even without the use of any external resources or the learning of a single multilingual model. Our ensemble ranks second out of 23 submissions by a total of nine teams. Our error analysis indicates that addressing many of the errors requires additional information, such as the word's lexical category, morphological segmentation, or etymology. We will make our code publicly available.