Do we need bigram alignment models? On the effect of alignment quality on transduction accuracy in G2P

We investigate the need for bigram alignment models and the benefit of supervised alignment techniques in grapheme-to-phoneme (G2P) conversion. Moreover, we quantitatively estimate the relationship between alignment quality and overall G2P system performance. We find that, in English, bigram alignment models do perform better than unigram alignment models on the G2P task. Moreover, we find that supervised alignment techniques may perform considerably better than their unsupervised brethren and that few manually aligned training pairs suffice for them to do so. Finally, we estimate a highly significant impact of alignment quality on overall G2P transcription performance and find that this relationship is linear in nature.


Introduction
Grapheme-to-phoneme (G2P) conversion is the problem of converting a string of letters into a string of phonetic symbols. Closely related to G2P are other string transduction problems in natural language processing (NLP) such as transliteration, lemmatization (Dreyer et al., 2008), and spelling error correction (Brill and Moore, 2000). The classical learning paradigm in each of these settings is to train a model on pairs of strings {(x, y)} and then to evaluate model performance on test data. While there are exceptions (e.g., Rao et al., 2015), most state-of-the-art models (e.g., Jiampojamarn et al., 2007; Bisani and Ney, 2008; Jiampojamarn et al., 2008; Novak et al., 2012) view string transduction as a two-stage process in which string pairs (x, y) in the training data are first aligned, and then a subsequent (e.g., sequence labeling) module is learned on the aligned data.

ph  oe  n  i  x
f   i   n  I  ks
Table 1: Sample monotone many-to-many alignment between x = phoenix and y = finIks.
State-of-the-art alignments in G2P are characterized by the following properties: (i) Alignments are monotone in that the ordering of characters in input and output sequences is preserved by the alignments. Furthermore, they are many-to-many in the sense that several x sequence characters may be matched up with several y sequence characters as illustrated in Table 1.
(ii) The alignment is a latent variable and learnt in an unsupervised manner from pairs of strings in the training data.
(iii) The unsupervised alignment models are unigram alignment models insofar as the overall score that the alignment model assigns an alignment is the same for all orderings of the matched-up subsequences (context independence).
To illustrate point (iii), consider, in the field of lemmatization, the case of aligning an inflected word form with the extended infinitive in German, such as absagt ('rejects') with abzusagen ('to reject'). Critically, the insertion -zu- appears in infixal position and a plausible alignment might be as in Table 2.

a  b  ε   s  a  g  t
a  b  zu  s  a  g  en
Table 2: Alignment between absagt and abzusagen. Empty string denoted by ε.

Then, correctly aligning certain analogous forms such as zusagt ('accepts') with their corresponding extended infinitive zuzusagen ('to accept') is beyond the scope of a unigram alignment model, since such a model cannot distinguish the linguistically correct alignment from the following linguistically incorrect alignment

ε   z  u  s  a  g  t
zu  z  u  s  a  g  en

precisely because it has no notion of context.
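The context independence of a unigram model can be made concrete with a small sketch. It assumes that a unigram model scores an alignment by adding up similarities of matched-up subsequences, and it uses a purely hypothetical similarity function `toy_sim1` that is not taken from any of the cited systems. Since the correct and the incorrect alignment of zusagt with zuzusagen consist of the same columns in a different order, any such additive score treats them identically:

```python
from collections import Counter

# Columns are (x-segment, y-segment) pairs; "" stands for the empty string.
correct = [("z", "z"), ("u", "u"), ("", "zu"), ("s", "s"),
           ("a", "a"), ("g", "g"), ("t", "en")]
incorrect = [("", "zu"), ("z", "z"), ("u", "u"), ("s", "s"),
             ("a", "a"), ("g", "g"), ("t", "en")]


def toy_sim1(u, v):
    """Hypothetical similarity: identical segments score 0, all others -1."""
    return 0.0 if u == v else -1.0


def unigram_score(columns, sim1):
    """Additive (unigram) alignment score: one term per aligned column."""
    return sum(sim1(u, v) for u, v in columns)


# Both alignments contain exactly the same multiset of columns ...
same_columns = Counter(correct) == Counter(incorrect)
# ... so an order-insensitive sum cannot tell them apart.
score_correct = unigram_score(correct, toy_sim1)
score_incorrect = unigram_score(incorrect, toy_sim1)
print(same_columns, score_correct, score_incorrect)
```

Whatever values the similarity function takes, the two sums range over the same terms, which is why a model of this form cannot prefer the infixal placement of -zu-.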
In this work, we first address bigram alignment models in G2P. We investigate whether there are phenomena in G2P that require bigram alignment models and, more generally, whether bigram alignment models produce better alignments (with respect to a human gold standard) than unigram alignment models within the G2P setting. We do so, secondly, in a supervised setting where the model learns from gold-standard alignments. While this may seem an odd scenario at first sight, modern alignment toolkits in the related field of machine translation typically include the possibility to learn in both a supervised and an unsupervised manner (Liu et al., 2010; Liu and Sun, 2015). The rationale behind supervised learning models may be that they perform better than unsupervised models, and if alignment quality has a large impact upon subsequent string translation performance, then a supervised model may be a suitable alternative. Thirdly, we investigate how alignment quality affects overall G2P performance. This allows us to address whether it is worthwhile to work on better alignment models, which bigram and supervised alignment models promise to be. To our knowledge, all three outlined aspects of alignments (bigram models, supervised learning, and systematically estimating the relationship between alignment quality and overall string transduction performance) are novel in the G2P setting and its related fields as outlined; however, see also the related work section.
This work is structured as follows. Section 2 presents definitions and algorithms for uni- and bigram alignment models. Section 3 surveys related work. Section 4 presents our data and Section 5 our experiments. We conclude in Section 6.

Uni- and bigram alignment models
We first formally define the problem of aligning two strings x and y over arbitrary alphabets in a monotone and many-to-many manner. Let $\ell_x = |x|$ and $\ell_y = |y|$ denote the lengths of x and y, respectively. Let $\mathbb{N} = \{0, 1, 2, \ldots\}$, and let $S \subseteq \mathbb{N}^2 \setminus \{(0, 0)\}$ be a set defining the valid match-up operations between x characters and y characters. In other words, when $(s, t) \in S$, we allow matches of subsequences of x of length s with subsequences of y of length t. It is convenient to define a monotone many-to-many alignment of x and y as a $2 \times k$ (for $k \ge 1$ arbitrary) nonnegative integer matrix $A_{x,y} \in \mathbb{N}^{2 \times k}$ satisfying $A_{x,y} \mathbf{1}_k = (\ell_x, \ell_y)^\top$, i.e., the two rows of $A_{x,y}$ sum up to the lengths of the respective strings, and where each column of $A_{x,y}$ lies in S. For any such alignment, we let $(x_1, \ldots, x_k)$ be the corresponding induced segmentation of x and $(y_1, \ldots, y_k)$ be the corresponding induced segmentation of y.
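Let $\mathcal{A}_S(x, y)$ denote the class of all alignments of x and y. We call a function $f : \mathcal{A}_S(x, y) \to \mathbb{R}$ an alignment model. We call an alignment model f a unigram alignment model if f takes the form, for any $A_{x,y} \in \mathcal{A}_S(x, y)$,

$$f(A_{x,y}) = \sum_{i=1}^{k} \mathrm{sim}_1(x_i, y_i),$$

where $\mathrm{sim}_1$ is an arbitrary (real-valued) similarity function measuring similarity of two subsequences. We call an alignment model f a bigram alignment model if f takes the form

$$f(A_{x,y}) = \sum_{i=1}^{k} \mathrm{sim}_2\big((x_{i-1}, y_{i-1}), (x_i, y_i)\big),$$

where $\mathrm{sim}_2$ is an arbitrary (real-valued) similarity function measuring similarity of successive pairs of subsequences (with a suitable convention for the initial pair $(x_0, y_0)$).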
In statistical alignment modeling, the task is to find an optimal alignment (i.e., one with maximal score) given strings x and y and given the alignment model f. When f is a unigram model, this can be solved efficiently via dynamic programming (DP). When f is a bigram alignment model, finding the optimal alignment can still be solved via DP, by introducing a variable $M_{i,j,q,w}$ denoting the score of the best alignment of $x(1:i)$ and $y(1:j)$ that ends in the match-up of $x(q:i)$ with $y(w:j)$. The variable $M_{i,j,q,w}$ satisfies a recurrence leading to a DP algorithm, shown in Algorithm 1. The actual alignment can be found by storing pointers to the maximizing steps taken. The running time of the algorithm is $O(\ell_x^2 \ell_y^2 |S|)$. Note also that the sketched algorithm is supervised insofar as it assumes that the similarity values $\mathrm{sim}_2(\cdot, \cdot)$ are known. Typically, such alignment algorithms can be converted into unsupervised algorithms in which similarity measures sim are learnt iteratively, e.g., in an EM-like fashion (cf., e.g., Eger (2012), Eger (2013)); however, in this paper, we only investigate the supervised base version as indicated.
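The following Python sketch illustrates how such a dynamic program might look. It is not a transcription of Algorithm 1: as a simplifying assumption, states are indexed by the lengths (s, t) of the last match-up rather than by its start positions (q, w), the boundary pair `START` is a hypothetical convention, and `toy_sim2` is an invented similarity function for the zusagt example, not an estimated model.

```python
from itertools import product

START = ("<s>", "<s>")  # hypothetical boundary pair preceding the first column


def best_bigram_alignment(x, y, steps, sim2):
    """Viterbi-style search for the highest-scoring monotone alignment.

    A state (i, j, s, t) means: x[:i] and y[:j] are aligned and the last
    column matches x[i-s:i] with y[j-t:j].
    """
    best, back = {}, {}
    for i, j in product(range(len(x) + 1), range(len(y) + 1)):
        for s, t in steps:
            if s > i or t > j:
                continue
            cur = (x[i - s:i], y[j - t:j])
            pi, pj = i - s, j - t
            if pi == 0 and pj == 0:  # cur is the first column
                best[i, j, s, t] = sim2(START, cur)
                back[i, j, s, t] = None
                continue
            candidates = []
            for ps, pt in steps:  # try every admissible previous column
                prev_key = (pi, pj, ps, pt)
                if prev_key in best:
                    prev = (x[pi - ps:pi], y[pj - pt:pj])
                    candidates.append((best[prev_key] + sim2(prev, cur), prev_key))
            if candidates:
                best[i, j, s, t], back[i, j, s, t] = max(candidates, key=lambda c: c[0])
    finals = [k for k in best if k[0] == len(x) and k[1] == len(y)]
    if not finals:
        return None  # no alignment exists under the given steps
    key = max(finals, key=best.get)
    score, columns = best[key], []
    while key is not None:  # follow the stored pointers back to the start
        i, j, s, t = key
        columns.append((x[i - s:i], y[j - t:j]))
        key = back[key]
    return score, columns[::-1]


def toy_sim2(prev, cur):
    """Hypothetical context-sensitive similarity for the zusagt example."""
    u, v = cur
    if cur == ("", "zu"):
        # Inserting -zu- is penalized only when nothing precedes it.
        return -1.0 if prev == START else 0.0
    return 0.0 if u == v or cur == ("t", "en") else -5.0


STEPS = [(1, 1), (0, 2), (1, 2)]  # the set S of valid match-up lengths
score, columns = best_bigram_alignment("zusagt", "zuzusagen", STEPS, toy_sim2)
print(score, columns)
```

Because the score of the inserted -zu- depends on the preceding column, the search places it after the prefix rather than at the beginning of the word, which an additive unigram score could not express.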

Related work
In probably the most closely related work to ours, older and specialized alignment techniques such as ALINE (Kondrak, 2000) (as well as partly heuristic/semi-automatic alignment methods) are compared with variants of the M2M alignment algorithm, which we also survey. That work does not consider supervised alignments or bigram alignments, as we do. Moreover, it also evaluates the impact of alignment quality on overall G2P system accuracy by running a few experiments, finding that better alignment quality does not always translate into better G2P accuracy, but that there is a "strong correlation" between the two. We investigate this question more thoroughly, using, arguably, more heterogeneous aligners and many more experiments. We also quantitatively estimate how alignment quality influences G2P system accuracy on two different languages via linear regression. Goldwater et al. (2006) study the effect of context in (unsupervised) word/sequence segmentation, which may be considered the one-dimensional specialization of sequence alignment, using a Bayesian method. They find that bigram models greatly outperform unigram models for their task.
Of course, our study is also related to the field of machine translation and its studies on the relationship between alignment quality and translation performance (Ganchev et al., 2008). In machine translation, however, the monotonicity assumption of string transduction typically does not hold, rendering alignment and translation techniques different and more heuristic in nature.

Data and systems

Data
For English, we conduct experiments on the General American (GA) variant of the Combilex data set (Richmond et al., 2009). This contains about 128 000 grapheme-phoneme pairs as exemplified in Table 3. Importantly, Combilex provides gold-standard alignments, which we will make use of for the supervised alignment models as well as for measuring alignment quality. For German, we ran-

Grapheme string    Phoneme string
g-e-n-e-r-a-l      dZ-E-n-@-r-@-l
Table 3: Sample grapheme-phoneme string pairs in Combilex, using Combilex notation for the phoneme strings. Gold-standard alignments indicated in an intuitive manner.

Alignment toolkits/models
The M2M aligner (Jiampojamarn et al., 2007), which is based on EM maximum likelihood estimation of alignment parameters, is the classical unsupervised unigram many-to-many aligner in G2P. As has been pointed out (Kubo et al., 2011), M2M greatly overfits the data. This means that when the M2M aligner is given the freedom to align two sequences without restrictions, it matches them up as a whole. The reason is that a (probabilistic) unigram alignment model adds log-probabilities of matched-up subsequences, which, if not appropriately corrected for, makes alignments with few match-ups a priori more likely than alignments with many match-ups when probabilities of individual match-ups are uniformly or randomly initialized (as is typically the case for EM maximum likelihood estimation in unsupervised models). To address this, M2M must artificially restrict, in our language, the set S to be {(1, 1), (1, 2), (2, 1)}. In contrast, the Mpaligner (Kubo et al., 2011) introduces a prior (or penalty) in the alignment model which favors 'short' matches (s, t) over 'long' ones. Finally, the Phonetisaurus aligner (Novak et al., 2012) modifies the M2M aligner by adding additional soft constraints. Our own alignment model is, as indicated, supervised. We implement a unigram alignment model where we specify $\mathrm{sim}_1(u, v)$ as

$$\mathrm{sim}_1(u, v) = \alpha \cdot \log p((u, v)) + \beta \cdot \log p((|u|, |v|)) + \gamma \cdot \log p(u) + \delta \cdot \log p(v).$$
Here, $\log p(z)$ denotes the log-probability, estimated from the training data, of observing the object z, and α, β, γ and δ are parameters. This specification says that the subsequences u and v are similar insofar as (i) u and v have been paired frequently in the training data, (ii) the length of u and the length of v have been paired frequently, (iii)/(iv) u/v by itself is likely. We refer to this unigram alignment model as $\mathrm{uni}_{\alpha,\beta,\gamma,\delta}$. We also implement a bigram alignment model where we specify $\mathrm{sim}_2\big((u, v), (u', v')\big)$, in analogy to $\mathrm{sim}_1$, as

$$\mathrm{sim}_2\big((u, v), (u', v')\big) = \alpha \cdot \log p((u', v') \mid (u, v)) + \beta \cdot \log p((|u'|, |v'|) \mid (|u|, |v|)) + \gamma \cdot \log p(u' \mid u) + \delta \cdot \log p(v' \mid v).$$

Here, $\log p(z' \mid z)$ denotes the logarithm of the conditional probability of observing the object z' following the object z. We refer to this bigram alignment model as $\mathrm{bi}_{\alpha,\beta,\gamma,\delta}$.
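As a rough illustration of the unigram similarity, the sketch below estimates the four log-probabilities as relative frequencies over the columns of gold-aligned training pairs. The function name `fit_unigram_sim`, the default parameter values, and in particular the probability `floor` used for unseen events are assumptions made for this example; the text does not specify how unseen match-ups are handled.

```python
import math
from collections import Counter


def fit_unigram_sim(aligned_pairs, alpha=1.0, beta=1.0, gamma=0.0, delta=0.0,
                    floor=1e-6):
    """Build sim_1 from gold alignments given as lists of (u, v) columns."""
    pair_c, len_c, u_c, v_c = Counter(), Counter(), Counter(), Counter()
    for columns in aligned_pairs:
        for u, v in columns:
            pair_c[u, v] += 1            # how often u was matched with v
            len_c[len(u), len(v)] += 1   # how often these lengths were matched
            u_c[u] += 1
            v_c[v] += 1
    total = max(sum(pair_c.values()), 1)

    def logp(counter, key):
        # Relative frequency; the floor for unseen events is an assumption.
        return math.log(max(counter[key] / total, floor))

    def sim1(u, v):
        return (alpha * logp(pair_c, (u, v))
                + beta * logp(len_c, (len(u), len(v)))
                + gamma * logp(u_c, u)
                + delta * logp(v_c, v))

    return sim1


# Two gold alignments taken from the examples in the text.
train = [
    [("ph", "f"), ("oe", "i"), ("n", "n"), ("i", "I"), ("x", "ks")],
    [("g", "dZ"), ("e", "E"), ("n", "n"), ("e", "@"), ("r", "r"),
     ("a", "@"), ("l", "l")],
]
sim1 = fit_unigram_sim(train)
print(sim1("ph", "f"), sim1("p", "f"))
```

With these two training alignments, the seen match-up ph/f scores higher than the unseen match-up p/f, even though one-to-one matches are more frequent overall.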

Transduction systems
We use two string transduction systems for our experiments. The first one is DirecTL+ (Jiampojamarn et al., 2010), a discriminative string-to-string translation system incorporating joint n-gram features. DirecTL+ is an extension of the model presented in Jiampojamarn et al. (2008), which treats string transduction as a source sequence segmentation and subsequent sequence labeling task. In addition, we use Phonetisaurus (Novak et al., 2012), a weighted finite-state-based joint n-gram model employing recurrent neural network language model N-best rescoring in decoding. Both systems take aligned pairs of strings as input and construct a monotone translation model from them.

Measuring alignment quality
We employ two measures of alignment quality. First, we use word accuracy, defined as the fraction of correctly aligned sequence pairs in a test sample. This is a very strict measure that penalizes even tiny deviations from the gold standard. Additionally, we measure the edit distance between the true alignment $A_{x,y}$ and the predicted alignment $\hat{A}_{x,y}$. To implement this, we view the two induced segmentations that constitute an alignment, e.g., (ph,oe,n,i,x) and (f,i,n,I,ks), as strings including splitting signs. Thus, we can compute the edit distance between the gold-standard segmented x string and the predicted segmentation, and analogously for the y sequence. Then, we define the edit distance between $A_{x,y}$ and $\hat{A}_{x,y}$ as the sum of these two string edit distances. For a test sample, we report the average of this edit distance over all pairs in the sample.
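Both measures can be sketched in a few lines of Python. The representation of an alignment as a pair of segment tuples, the splitting sign `|`, and the function names are assumptions made for this illustration; the example pair is the twinkled alignment discussed in the error analysis.

```python
def levenshtein(a, b):
    """Standard edit distance with unit costs for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def alignment_edit_distance(gold, pred, sep="|"):
    """Sum of the edit distances between the segmented x and y strings.

    gold and pred are (x_segments, y_segments) pairs; sep is a splitting
    sign assumed not to occur in either alphabet.
    """
    (gold_x, gold_y), (pred_x, pred_y) = gold, pred
    return (levenshtein(sep.join(gold_x), sep.join(pred_x))
            + levenshtein(sep.join(gold_y), sep.join(pred_y)))


def word_accuracy(golds, preds):
    """Fraction of pairs whose predicted alignment equals the gold standard."""
    return sum(g == p for g, p in zip(golds, preds)) / len(golds)


gold = (("t", "w", "i", "n", "k", "l", "ed"),
        ("t", "w", "I", "N", "k", "@l", "d"))
pred = (("t", "w", "i", "n", "k", "le", "d"),
        ("t", "w", "I", "N", "k", "@l", "d"))
distance = alignment_edit_distance(gold, pred)
accuracy = word_accuracy([gold, gold], [gold, pred])
print(distance, accuracy)
```

The example shows why the two measures complement each other: the predicted alignment counts as entirely wrong under word accuracy, while its edit distance to the gold standard is small.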

Alignment quality
To measure alignment quality for the different systems, for English, we run experiments on sets of size x + 5 000, where x = 1 000, 2 000, 5 000, 10 000, and 20 000. For the supervised models, we consider x as the training data and the 5 000 additional string pairs as test data. To quantify effects when very little training data is available, we let x also range over 100 and 500 string pairs for the supervised models. For the unsupervised models, we simply take all x + 5 000 string pairs as data to learn from (but evaluate performance only on the 5 000 string pairs, for comparability).
Results are shown in Tables 4, 5, and 6. We first note (Table 4) that the unsupervised models perform decently, obtaining accuracy rates of 80% and beyond under appropriate parametrizations. We also observe the M2M aligner's deterioration in performance as we increase its degrees of freedom (allowing it to match subsequences of larger length), confirming our previous remarks. The Mpaligner does not suffer from this problem as it penalizes large matches. Phonetisaurus suffers from the same problems as M2M, but to a lesser degree. Overall, we find that, under optimal parametrizations, Phonetisaurus produces the best alignments, followed by Mpalign and M2M. However, peak performances of all three unsupervised aligners are close. Unsurprisingly, the supervised alignment models perform better than the unsupervised ones (Tables 5 and 6). Surprisingly, however, they do so with very little training data; fewer than 100 aligned string pairs suffice to outperform the unsupervised models under good calibrations. When there is sufficient training data, the supervised models perform splendidly, with a peak accuracy of 99.43% for the bigram alignment model that includes appropriate features (scoring lengths of aligned subsequences, etc.). We also note that the bigram alignment model is almost consistently better than the unigram alignment model, with a margin of about 1 percentage point, depending on specific parametrizations.
We performed an analogous analysis for the German data. Results are quite similar except that unigram and bigram alignment models have indistinguishable performance on the German data, indicating (the known fact) that G2P is a more complex task in English than in German, where bigram alignment models are apparently not required.

Error analysis
Concerning errors that the unigram model commits and the bigram model does not, the majority of errors (roughly 80%) involve match-ups of ed/d and d. For example, the unigram model aligns as in

t  w  i  n  k  le  d
t  w  I  N  k  @l  d

while the gold-standard alignment is

t  w  i  n  k  l   ed
t  w  I  N  k  @l  d

While all match-ups in both alignments are plausible, the bigram model assigns here higher probability to the correct ed/d match-up in terminal position (consistently favored in the data set), which has a particular meaning there, namely, that of a suffix marker for past tense. In the German data, there is a single instance where the unigram and bigram alignment models disagree, namely, in the alignment of s-t-o-ff-f-l-a-sch-e/S-t-O-f-f-l-&-S-@, which the unigram model falsely aligns as s-t-o-f-ff-l-a-sch-e/S-t-O-f-f-l-&-S-@; note that in the correct alignment f must follow ff, not vice versa, which depends on context information, e.g., that o/O signifies a short vowel which is followed by a double consonant, not a single consonant.
All remaining errors that the bigram alignment model commits are, for the best considered parametrization and training set size, typically due to match-up types not seen in the training data, and thus mostly concern foreign names or writings (e.g., Bh-u-tt-o/b-u-t-F, falsely aligned as B-hu-tto/b-u-t-F). A few other errors might be corrected if the feature coefficients α, β, γ, δ were optimized on a development set rather than set manually. We find no indication that our G2P data, either for English or German, would further benefit from n-gram alignment models of order n > 2.

Alignment quality vs. overall G2P performance
Next, we estimate the relationship between alignment quality and overall G2P performance (transcription accuracy). To this end, for the English data, we use the 5 000 aligned string pairs from the previous experiment on alignment quality and feed them in, as training data, to either DirecTL+ or Phonetisaurus as outlined in Section 4. We then evaluate G2P performance, in terms of word accuracy (fraction of correctly transcribed strings), on a distinct test set of size 10 000. Figure 1 shows a plot of overall G2P accuracy vs. training set size for the aligner (ranging over the x values in the last section), and a second plot that sketches G2P accuracy as a function of corresponding alignment accuracy.

Table 4 (columns: x, Mpalign, M2M 2,2, M2M 3,3, M2M 6,6, Phon 2,2, Phon 3,3): Unsupervised aligners and their alignment accuracies in % for various data sizes as described in the text. Subscripts a, b denote restrictions on maximal lengths of subsequences allowed in match-ups (a/b corresponds to x/y subsequences).

We first note that, as the supervised aligner receives more training data from which to align the 5 000 string pairs, the overall G2P accuracy of both DirecTL+ and Phonetisaurus increases substantially (and as a convex function of training set size). Apparently, the better alignments produced by more training data for the particular supervised aligner considered directly translate into better overall G2P accuracy. The other plot in the figure shows that, indeed, there seems to be a linear trend coupling alignment quality with overall G2P performance. Table 7 pairs G2P accuracy with alignment accuracy of selected systems, all run in the x = 20 000 setting.
While, in the table, better alignments do not necessarily imply better overall G2P performance, the two best alignments also lead to the two best overall G2P performances (although, in this case, the second best alignment is paired with the best overall G2P performance); conversely, the worst alignment quality is coupled with the worst overall G2P performance.
Overall, we ran 249 experiments (including the German data) in which we trained DirecTL+ or Phonetisaurus with alignments of specific qualities obtained from particularly parametrized aligners. In each of these cases, we obtained an alignment quality score and a subsequent overall G2P system performance. The English part of this data is sketched in Figure 2. This figure seems to corroborate the linear relationship (apparently present in Figure 1) between alignment quality and overall G2P system accuracy, particularly when alignment quality is measured in the more fine-grained metric of edit distance. To formally test this, we regress overall G2P system performance (measured in word accuracy) on edit distance and other variables. This yielded the coefficients given in Table 8; in each case, the goodness of fit of the linear model was high, with $R^2$ values above 90% for the English data and about 84% for the German data. Also, the coefficients on alignment quality were highly significantly different from zero. The table shows that the coefficients are on the order of about −3.80% to −4.70%, meaning that, all else being equal, bringing an alignment 1 edit distance unit closer to the gold-standard alignment increases overall G2P accuracy by about 3.80 to 4.70%.

Figure 2 (right panel): Alignment quality measured in edit distance. English data only.
            DirecTL+    Phonetisaurus
English     −3.80***    −4.14***
German      -           −4.68***
Table 8: Coefficients on edit distance in the regression of G2P accuracy on edit distance and further variables. For German, DirecTL+ is omitted due to its long run times.
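To make the interpretation of such a coefficient concrete, the following sketch fits a least-squares line to a handful of points. It is deliberately simplified: it uses a single regressor, whereas the regression above includes further variables, and the numbers are made up for illustration only and are not results from our experiments.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative values, NOT measurements from the experiments:
# average edit distance to the gold standard vs. G2P word accuracy in %.
edit_distance = [0.0, 0.5, 1.0, 1.5]
g2p_accuracy = [80.0, 78.0, 76.0, 74.0]

slope, intercept = fit_line(edit_distance, g2p_accuracy)
print(slope, intercept)
```

A slope of −4 in this toy setting would be read exactly like the entries of Table 8: each additional unit of edit distance from the gold standard goes along with a drop of 4 in accuracy, all else being equal.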
So far, we have estimated the effects of alignment quality on overall G2P system performance for a fixed size of training data, namely, 5 000 aligned string pairs. To see whether this relationship changes when we vary the amount of training data, we run several more experiments. In these, we align training sets of sizes 100, 500, 1 000, 2 000, 10 000, 20 000, 40 000 and 60 000 via our several alignment systems. Then we feed the aligned data to the Phonetisaurus system (we omit DirecTL+ here because of its long run times) and compute overall G2P accuracy on a disjoint test set of approximately 28 000 string pairs. This time, we only use the unsupervised aligners and the gold-standard alignments directly, omitting results for our various supervised aligners. Note, however, that these aligners could, in principle, imitate the gold-standard alignments with a very high degree of precision, as previously seen. Table 9 shows that training G2P systems from the human gold-standard alignments in each case yields better overall G2P transcriptions than training them from any of the three unsupervised alignments considered here. However, we note that the surplus over the unsupervised alignments decreases as training set size increases. This may be due to the fact that the unsupervised aligners themselves create better alignments once they are bootstrapped from larger data sets (cf. Table 4). Additionally, the effect of alignment quality on overall G2P system performance may simply vanish as training set sizes become large enough because the translation modules can better accommodate 'noisy' data as long as its size is sufficiently large. Figure 3 sketches the decreasing influence of alignment system on overall G2P system performance as the size of the aligned data increases.

Table 9: Overall G2P accuracy in % as a function of training size of aligned data and alignment system.

Figure 3: Ratio of transcription accuracy when using gold-standard alignments (GOLD) and when using alignments generated by T = $\mathrm{M2M}_{3,3}$, Mpalign, and $\mathrm{Phon}_{3,3}$, respectively, as a function of size of aligned training set.

Conclusion
We have investigated the need for bigram alignment models and the benefit of supervised alignment techniques in G2P. We have also quantitatively estimated the relationship between alignment quality and overall G2P system performance.
We have found that, in English, bigram alignment models do perform better than unigram alignment models on the G2P task (we find almost no differences between unigram and bigram models for the German sample of G2P data we considered). Moreover, we have found that supervised alignment techniques may perform considerably better than their unsupervised brethren and that few manually aligned training pairs suffice for them to do so. Finally, we have estimated a highly significant impact of alignment quality on overall G2P transcription performance and found that this relationship is linear in nature. At a particular training size, a linear regression model has estimated that improving alignment quality by 1 edit distance toward the gold-standard alignments leads to a 3.80-4.70% increase in G2P transcription accuracy. However, we have also found that the importance of good alignments for G2P accuracy appears to diminish as data set size increases, possibly because the translation modules can accommodate more 'noisy' data in this scenario. As a 'policy' implication, we recommend the use of supervised alignment techniques particularly when the size of the G2P corpus is small or when high-quality alignments, as an end in themselves, are required. In this case, constructing a few dozen or a few hundred alignments in an unsupervised manner and correcting them by hand (to serve as an input for a supervised technique) may be highly beneficial.
In future work, it may be worthwhile to study the impact of alignment techniques on overall system performance in other string transduction problems such as transliteration, lemmatization, and spelling error correction.