Multiple Many-to-Many Sequence Alignment for Combining String-Valued Variables: A G2P Experiment

We investigate multiple many-to-many alignments as a primary step in integrating supplemental information strings in string transduction. Besides outlining DP-based solutions to the multiple alignment problem, we detail an approximation of the problem in terms of multiple sequence segmentations satisfying a coupling constraint. We apply our approach to boosting baseline G2P systems using homogeneous as well as heterogeneous sources of supplemental information.


Introduction
String-to-string translation (string transduction) is the problem of converting one string x over an alphabet Σ into another string y over a possibly different alphabet Γ. The most prominent applications of string-to-string translation in natural language processing (NLP) are grapheme-to-phoneme conversion (G2P), in which x is a letter string and y is a string of phonemes, transliteration, lemmatization (Dreyer et al., 2008), and spelling error correction (Brill and Moore, 2000). The classical learning paradigm in each of these settings is to train a model on pairs of strings {(x, y)} and then to evaluate model performance on test data. All state-of-the-art models we are aware of (e.g., Jiampojamarn et al., 2007; Bisani and Ney, 2008; Jiampojamarn et al., 2008; Jiampojamarn et al., 2010; Novak et al., 2012) proceed by first aligning the string pairs (x, y) in the training data. These models also acknowledge that alignments may be of a rather complex nature in which several x sequence characters may be matched up with several y sequence characters; Table 1 illustrates. Once the training data is aligned, the x and y sequences are segmented into an equal number of segments, so string-to-string translation may be seen as a sequence labeling (tagging) problem in which x (sub-)sequence characters are observed variables and y (sub-)sequence characters are hidden states (Jiampojamarn et al., 2007; Jiampojamarn et al., 2010).

ph   oe   n   i   x
f    i    n   I   ks
Table 1: Sample monotone many-to-many alignment between x = phoenix and y = finIks.
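The tagging view described above can be made concrete with a minimal sketch. It assumes that an alignment is stored as two equally long lists of segments; the function name `to_tagging_instance` is hypothetical and not part of any toolkit mentioned in this paper.

```python
from typing import List, Tuple


def to_tagging_instance(x_segments: List[str],
                        y_segments: List[str]) -> List[Tuple[str, str]]:
    """Pair each input segment (observation) with its output segment (label)."""
    if len(x_segments) != len(y_segments):
        raise ValueError("aligned strings must have the same number of segments")
    return list(zip(x_segments, y_segments))


# Alignment from Table 1: x = phoenix, y = finIks.
x_segments = ["ph", "oe", "n", "i", "x"]
y_segments = ["f", "i", "n", "I", "ks"]
instance = to_tagging_instance(x_segments, y_segments)
```

Each pair in `instance` is one (observation, label) position of the kind a sequence labeler would be trained on.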
In this work, we extend the problem of classical string-to-string translation by assuming that, at training time, we have available $(M + 2)$-tuples of strings $\{(x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)}, y)\}$, where $x$ is the input string, $\hat{y}^{(m)}$, for $1 \le m \le M$, are supplemental information strings, and $y$ is the desired output string; at test time, we wish to predict $y$ from $(x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)})$. Generally, we may think of $\hat{y}^{(1)}, \dots, \hat{y}^{(M)}$ as arbitrary strings over arbitrary alphabets $\Sigma^{(m)}$, for $1 \le m \le M$. For example, $x$ might be a letter string and $\hat{y}^{(m)}$ might be a transliteration of $x$ in language $L_m$ (cf. Bhargava and Kondrak (2012)). Alternatively, and this is our model scenario in the current work, $x$ might be a letter input string and $\hat{y}^{(m)}$ might be the predicted string of phonemes, given $x$, produced by an (offline) system $T_m$. This situation is outlined in Table 3. In the table, we also illustrate a multiple (monotone) many-to-many alignment of $(x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)}, y)$. By this, we mean an alignment where (1) subsequences of all $M + 2$ strings may be matched up with each other (many-to-many alignments), and where (2) the matching up of subsequences obeys monotonicity. Note that such a multiple alignment generalizes classical monotone many-to-many alignments between pairs of strings, as shown in Table 1. Furthermore, such an alignment may be quite useful. For instance, while none of the strings $\hat{y}^{(m)}$ in the table equals the true phonetic transcription $y$ of $x$, taking a position-wise majority vote of the multiple alignment of $(\hat{y}^{(1)}, \dots, \hat{y}^{(M)})$ yields $y$. Moreover, analogously to the case of pairs of aligned strings, we may perceive the thus extended string-to-string translation problem as a sequence labeling task once $(x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)}, y)$ are multiply aligned, but now with additional observed variables (or features), namely, (sub-)sequence characters of each string $\hat{y}^{(m)}$.
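The position-wise majority vote mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the three aligned predictions are invented toy data (they are not the systems of Table 3), skips are written as '-', and the name `majority_vote` is hypothetical.

```python
from collections import Counter
from typing import List

SKIP = "-"  # assumed skip marker, as in Table 3


def majority_vote(aligned: List[List[str]]) -> str:
    """Column-wise majority vote over equally segmented strings; skips are dropped."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("all rows must have the same number of segments")
    output = []
    for column in zip(*aligned):
        winner, _count = Counter(column).most_common(1)[0]
        if winner != SKIP:
            output.append(winner)
    return "".join(output)


# Toy predictions for x = phoenix; none of them equals finIks on its own.
toy_predictions = [
    ["f", "oU", "n", "I", "ks"],
    ["p", "i", "n", "I", "ks"],
    ["f", "i", "n", "i", "ks"],
]
consensus = majority_vote(toy_predictions)
```

In this toy case every column has a clear majority; how ties should be broken is a design choice that the sketch leaves to the order in which `Counter` first saw the candidates.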
To further motivate our approach, consider the situation of training a new G2P system on the basis of, e.g., Combilex (Richmond et al., 2009). For each letter form in its database, Combilex provides a corresponding phonetic transcription. Now, suppose that, in addition, we can poll an external knowledge source such as Wiktionary for (its) phonetic transcriptions of the respective Combilex letter words, as outlined in Table 2. The central question we want to answer is: can we train a system using this additional information which performs better than the 'baseline' system that ignores the extra information? Clearly, a system with more information should not perform worse than a system with less information (unless the additional information is highly noisy), but it is a priori not clear at all how the extra information can be included, as Bhargava and Kondrak (2012) note: output predictions may be in distinct alphabets and/or follow different conventions, and simple rule-based conversions may even deteriorate a baseline system's performance. Their solution to the problem is to let the baseline system output its n-best phonetic transcriptions, and then to re-rank these n-best predictions via an SVM reranker trained on the supplemental representations (see their figure 2). Our approach differs markedly from this: we character (or substring) align the supplemental information strings with the input letter strings and then sequentially transduce input character substrings as in the standard G2P approach, but where the sequential transducer is aware of the corresponding subsequences of the supplemental information strings.

String                            Alignment
$x$ = schizo                      s ch i z o
$\hat{y}^{(1)}$ = skaIz@U         s k aI z @U
$\hat{y}^{(2)}$ = saIz@U          s - aI z @U
$\hat{y}^{(3)}$ = skIts@          s k I ts @
$\hat{y}^{(4)}$ = Sits@U          Si ts @U
$\hat{y}^{(5)}$ = skIts@          s k I ts @
$y$ = skIts@U                     s k I ts @U
Table 3: Left: Input string $x$, predictions of 5 systems, and output string $y$. Right: A multiple many-to-many alignment of $(x, \hat{y}^{(1)}, \dots, \hat{y}^{(5)}, y)$. Skips are marked by a dash ('-').
Our goals in the current work are, first, in Section 2, to formally introduce the multiple many-to-many alignment problem, which, to our knowledge, has not yet been formally considered, and to indicate how it can be solved (by standard extensions of well-known DP recursions). Secondly, we outline an 'approximation algorithm', also in Section 2, with much better runtime complexity, for solving the multiple many-to-many alignment problem. This proceeds by optimally segmenting the individual strings to align under the global constraint that the number of segments must agree across strings. Thirdly, we demonstrate experimentally, in Section 5, that multiple many-to-many alignments may be an extremely useful first step in boosting the performance of a G2P model. In particular, we show that by conjoining a base system with additional systems very high performance increases can be achieved. We also investigate the effects of using our introduced approximation algorithm instead of 'exactly' determining alignments. We discuss related work in Section 3, present data and systems in Section 4 and conclude in Section 6.

Mult. Many-to-Many Alignm. Models
We now formally define the problem of multiply aligning several strings in a monotone and many-to-many manner. For notational convenience, in this section, let the $N$ strings to align be denoted by $w_1, \dots, w_N$ (rather than $x$, $\hat{y}^{(m)}$, $y$, etc.). Let each $w_n$, for $1 \le n \le N$, be an arbitrary string over some alphabet $\Sigma^{(n)}$. Let $\ell_n = |w_n|$ denote the length of $w_n$. Moreover, assume that a set $S \subseteq \prod_{n=1}^{N} \{0, \dots, \ell_n\} \setminus \{0_N\}$ of allowable steps is specified, where $0_N = (0, \dots, 0)$ ($N$ times).¹ We interpret the elements of $S$ as follows: if $(s_1, s_2, \dots, s_N) \in S$, then subsequences of $w_1$ of length $s_1$, subsequences of $w_2$ of length $s_2$, ..., subsequences of $w_N$ of length $s_N$ may be matched up with each other. In other words, $S$ defines the types of valid 'many-to-many match-up operations'.² While we could drop $S$ from consideration and simply allow every possible matching up of character subsequences, it is convenient to introduce $S$ because algorithmic complexity may then be specified in terms of $S$, and by choosing particular $S$, one may retrieve special cases otherwise considered in the literature (see next section).

A multiple many-to-many alignment of $(w_1, \dots, w_N)$ is then a matrix of substrings

$$\begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\ \vdots & \vdots & & \vdots \\ w_{N,1} & w_{N,2} & \cdots & w_{N,k} \end{pmatrix}$$

such that $(|w_{1,i}|, \dots, |w_{N,i}|) \in S$, for all $i = 1, \dots, k$, and such that $w_n = w_{n,1} \cdots w_{n,k}$, for all $1 \le n \le N$. Let $A_S = A_S(w_1, \dots, w_N)$ denote the set of all multiple alignments of $(w_1, \dots, w_N)$. For an alignment $a \in A_S$, denote by $\mathrm{score}(a) = f(a)$ the score of alignment $a$ under alignment model $f$, where $f : A_S(w_1, \dots, w_N) \to \mathbb{R}$. We now investigate solutions to the problem of finding the alignment with maximal score under different choices of alignment models $f$, i.e., we seek to efficiently solve

$$\max_{a \in A_S(w_1, \dots, w_N)} f(a). \tag{1}$$

Unigram alignment model For our first alignment model $f$, we assume that $f(a)$, for $a \in A_S$, is the score

$$f(a) = \sum_{i=1}^{k} \mathrm{sim}_1(w_{1,i}, \dots, w_{N,i}) \tag{2}$$

for a real-valued similarity function $\mathrm{sim}_1 : \prod_{n=1}^{N} (\Sigma^{(n)})^* \to \mathbb{R}$. We call the model $f$ in (2) a unigram model because $f(a)$ is the sum of the similarity scores of the matched-up subsequences $(w_{1,i}, \dots, w_{N,i})$, ignoring context. Due to this independence assumption, solving the maximization problem in Eq. (1) under specification (2) is straightforward via a dynamic programming (DP) recursion. To do so, define by $M_{S,\mathrm{sim}_1}(i_1, i_2, \dots, i_N)$ the score of the best alignment, under the alignment model given by $\mathrm{sim}_1$ and the set of steps $S$, of the prefixes $(w_1(1:i_1), \dots, w_N(1:i_N))$. Then $M = M_{S,\mathrm{sim}_1}$ satisfies the recurrence

$$M(i_1, \dots, i_N) = \max_{(j_1, \dots, j_N) \in S} \Big\{ M(i_1 - j_1, \dots, i_N - j_N) + \mathrm{sim}_1\big(w_1(i_1 - j_1 + 1 : i_1), \dots, w_N(i_N - j_N + 1 : i_N)\big) \Big\}, \tag{3}$$

with $M(0, \dots, 0) = 0$ and $M(i_1, \dots, i_N) = -\infty$ whenever some $i_n < 0$. This recurrence directly leads to a DP algorithm, shown in Algorithm 1, for computing the score of the best alignment of $(w_1, \dots, w_N)$; the actual alignment can be found by storing pointers to the maximizing steps taken. If similarity evaluations $\mathrm{sim}_1(w_{1,i}, \dots, w_{N,i})$ are thought of as taking constant time, this algorithm's run time is $O\big(\prod_{n=1}^{N} \ell_n \cdot |S|\big)$. When $\ell = \ell_1 = \cdots = \ell_N$ and $|S|$ is on the order of $\ell^N$ (the 'worst case' size of $S$), the algorithm's runtime is thus $O(\ell^{2N})$, which quickly becomes intractable as $N$, the number of strings to align, increases.

¹ Here, $\prod$ denotes the Cartesian product of sets.
² In the case of two strings, this is sometimes denoted in the manner M-N (e.g., 3-2, 1-0), indicating that M characters of one string may be matched up with N characters of the other string. Analogously, we could write here $s_1$-$s_2$-$s_3$-···.
Of course, the unigram alignment model could be generalized to an m-gram alignment model. An m-gram alignment model would exhibit worst-case runtime complexity of $O(\ell^{(m+1)N})$ under analogous DP recursions as for the unigram model.
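To make the unigram DP concrete, the following is a minimal sketch for an arbitrary number of strings. It assumes that steps are given as tuples of segment lengths and that the similarity function takes a tuple of segments; the names `unigram_align` and `toy_sim`, as well as the toy similarity values, are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Step = Tuple[int, ...]
Column = Tuple[str, ...]


def unigram_align(strings: Sequence[str], steps: Sequence[Step],
                  sim: Callable[[Column], float]) -> Tuple[float, List[Column]]:
    """Best-scoring monotone many-to-many alignment under a unigram model."""
    lengths = tuple(len(w) for w in strings)
    origin = (0,) * len(strings)
    best: Dict[Step, float] = {origin: 0.0}
    back: Dict[Step, Step] = {}
    # product() enumerates index tuples in lexicographic order, so every
    # predecessor idx - step (step non-negative, non-zero) is visited first.
    for idx in product(*(range(n + 1) for n in lengths)):
        if idx == origin:
            continue
        for step in steps:
            prev = tuple(i - s for i, s in zip(idx, step))
            if prev not in best:  # out of range or unreachable
                continue
            segs = tuple(w[p:i] for w, p, i in zip(strings, prev, idx))
            score = best[prev] + sim(segs)
            if idx not in best or score > best[idx]:
                best[idx] = score
                back[idx] = step
    if lengths not in best:
        raise ValueError("no alignment exists under the given steps")
    # Follow the stored pointers back to the origin to recover the columns.
    columns: List[Column] = []
    idx = lengths
    while idx != origin:
        prev = tuple(i - s for i, s in zip(idx, back[idx]))
        columns.append(tuple(w[p:i] for w, p, i in zip(strings, prev, idx)))
        idx = prev
    columns.reverse()
    return best[lengths], columns


# Toy similarity (an assumption): reward the match-ups of Table 1, penalize others.
GOOD = {("ph", "f"), ("oe", "i"), ("n", "n"), ("i", "I"), ("x", "ks")}


def toy_sim(segs: Column) -> float:
    return 2.0 if segs in GOOD else -1.0


pair_steps = [(a, b) for a in range(3) for b in range(3) if (a, b) != (0, 0)]
score, columns = unigram_align(("phoenix", "finIks"), pair_steps, toy_sim)
```

The dictionary `best` holds one entry per reachable index tuple, which mirrors the product-of-lengths factor in the runtime bound; the inner loop over `steps` contributes the factor $|S|$.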

Algorithm 1
1: procedure UNIGRAM-ALIGN($w_1, \dots, w_N$; $S$, $\mathrm{sim}_1$)
2:   $M(0, \dots, 0) \leftarrow 0$
3:   for $(i_1, \dots, i_N) \in \prod_{n=1}^{N} \{0, \dots, \ell_n\} \setminus \{0_N\}$, in lexicographic order, do
4:     $M(i_1, \dots, i_N) \leftarrow \max_{(j_1, \dots, j_N) \in S,\ j_n \le i_n} \big\{ M(i_1 - j_1, \dots, i_N - j_N) + \mathrm{sim}_1(w_1(i_1 - j_1 + 1 : i_1), \dots, w_N(i_N - j_N + 1 : i_N)) \big\}$
5:   return $M(\ell_1, \dots, \ell_N)$

Separable alignment models For our second model class, assume that, for any $a \in A_S$,

$$f(a) = \Psi\big(f_{w_1}(a_1), \dots, f_{w_N}(a_N)\big) \tag{4}$$

for some models $f_{w_1}, \dots, f_{w_N}$, where $a_n$ denotes the segmentation of $w_n$ induced by $a$ and where $\Psi : \mathbb{R}^N \to \mathbb{R}$ is non-decreasing in its arguments (e.g., $\Psi(z_1, \dots, z_N) = \sum_{n=1}^{N} z_n$). The advantage of separable models is that we can solve the 'subproblems' $f_{w_1}, \dots, f_{w_N}$ independently. Thus, in order to find optimal multiple alignments of $(w_1, \dots, w_N)$ under such a specification, we would only have to find the best segmentations of sequences $w_n$ under models $f_{w_n}$, for $1 \le n \le N$, subject to the constraint that the segmentations must agree in their number of segments (the coupling variable). Let $S_{w_n} \subseteq \{0, 1, \dots, \ell_n\}$ denote the constraints on segment lengths, similar to the interpretation of steps in $S$. If $f_{w_n}$ is a unigram segmentation model, then the problem of finding the best segmentation of $w_n$ with exactly $j$ segments can be solved in time $O(\ell_n |S_{w_n}| j)$. Thus, if each $f_{w_n}$ is a unigram segmentation model, worst-case time complexity for each subproblem would be $O(\ell_n^3)$ (if string $w_n$ can be segmented into at most $\ell_n$ segments), and then the overall problem (1) under specification (4) is solvable in worst-case time $N \cdot O(\ell^3)$. More generally, if each $f_{w_n}$ is an m-gram segmentation model, then worst-case time complexity amounts to $N \cdot O(\ell^{m+2})$. Importantly, this scales linearly with the number $N$ of strings to align, rather than exponentially as the $O(\ell^{(m+1)N})$ under the (non-separable) m-gram alignment model discussed above.
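The coupling idea can be sketched as follows, assuming unigram segmentation models, $\Psi$ chosen as a sum, and, for brevity, one shared set of segment lengths and one shared segment scorer for all strings (in general each string would have its own). All names and the toy scores are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

NEG_INF = -math.inf


def best_scores_by_segment_count(w: str, seg_lengths: Sequence[int],
                                 seg_score: Callable[[str], float],
                                 max_segments: int) -> List[float]:
    """result[j] = best score of splitting w into exactly j segments (-inf if impossible)."""
    n = len(w)
    # prev[i]: best score of splitting the prefix w[:i] into the current number of segments
    prev = [0.0] + [NEG_INF] * n
    result = [prev[n]]
    for _ in range(max_segments):
        cur = [NEG_INF] * (n + 1)
        for i in range(n + 1):
            for s in seg_lengths:
                if s <= i and prev[i - s] > NEG_INF:
                    cur[i] = max(cur[i], prev[i - s] + seg_score(w[i - s:i]))
        result.append(cur[n])
        prev = cur
    return result


def best_common_segment_count(strings: Sequence[str], seg_lengths: Sequence[int],
                              seg_score: Callable[[str], float],
                              max_segments: int) -> Tuple[int, float]:
    """Pick the segment count j that maximizes the summed per-string scores."""
    tables = [best_scores_by_segment_count(w, seg_lengths, seg_score, max_segments)
              for w in strings]
    totals = [sum(t[j] for t in tables) for j in range(max_segments + 1)]
    best_j = max(range(max_segments + 1), key=lambda j: totals[j])
    if totals[best_j] == NEG_INF:
        raise ValueError("no common number of segments is feasible")
    return best_j, totals[best_j]


# Toy segment scores (an assumption): plausible segments score +1, others -1.
PLAUSIBLE = {"ph", "oe", "n", "i", "x", "f", "I", "ks"}


def toy_score(seg: str) -> float:
    return 1.0 if seg in PLAUSIBLE else -1.0


count, total = best_common_segment_count(["phoenix", "finIks"], (1, 2), toy_score, 7)
```

Each string is processed on its own, so the work grows linearly in the number of strings; only the final choice of the common segment count couples them. Recovering the segmentations themselves would additionally require back-pointers, which the sketch omits.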
Unsupervised alignments The algorithms presented may be applied iteratively in order to induce multiple alignments in an unsupervised (EM-like) fashion in which $\mathrm{sim}_1$ is gradually learnt (e.g., starting from a uniform initialization of $\mathrm{sim}_1$). We skip the details, as we do not make use of this in our current experiments. Rather, in our experiments below, we directly specify $\mathrm{sim}_1$ as a sum of pairwise similarity scores which we extract from alignments produced by an off-the-shelf pairwise aligner.

Related work
Monotone alignments have a long tradition, both in NLP and bioinformatics.
As Gusfield (1997) shows, the Steiner consensus string may be retrieved from a multiple alignment of $s_1, \dots, s_N$ by concatenating the column-wise majority characters in the alignment, ignoring skips. Since median string computation (and hence also the multiple many-to-many alignment problem, as we consider it) is an NP-hard problem (Sim and Park, 2003), designing approximations is an active field of research. For example, Marti and Bunke (2001) ignore part of the search space by declaring match-ups of distant characters as unlikely, and Jiang et al. (2012) apply an approximation based on string embeddings in vector spaces. Paul and Eisner (2012) apply dual decomposition to compute Steiner consensus strings. Via the approach taken in this paper, median strings may be computed in case d is a (distance) function taking substring-to-substring edit operations into account, a seemingly straightforward, yet extremely useful generalization in several NLP applications, as indicated in the introduction.
Our approach may also be seen in the context of classifier combination for string-valued variables. While ensemble methods for structured prediction have been considered in several works (see, e.g., Nguyen and Guo (2007), Cortes et al. (2014), and references therein), a typical assumption in this situation is that the sequences to be combined have equal length, which clearly cannot be expected to hold when, e.g., the outputs of several G2P, transliteration, etc., systems must be combined. In fact, the multiple many-to-many alignment models investigated in this work could act as a preprocessing step in this setup, since the alignment serves precisely to segment the strings into an equal number of segments/substructures. Of course, combining outputs with varying numbers of elements is also an issue in machine translation (e.g., Macherey and Och (2007), Heafield et al. (2009)), but, there, the problem is harder due to the potential non-monotonicities in the ordering of elements, which typically necessitates (additional) heuristics. One approach to constructing multiple alignments in that setting is progressive multiple alignment (Feng and Doolittle, 1987), in which a multiple (typically one-to-one) alignment is iteratively constructed from successive pairwise alignments (Bangalore et al., 2001). Matusov et al. (2006) apply word reordering and subsequent pairwise monotone one-to-one alignments for MT system combination.

Data and systems

Data
We conduct experiments on the General American (GA) variant of the Combilex data set (Richmond et al., 2009). This contains about 144,000 grapheme-phoneme pairs, examples of which are shown in Table 2. In our experiments, we split the data into two disjoint parts, one for testing (about 28,000 word pairs) and one for training/development (the remainder).

Systems
BASELINE Our baseline system is a linear-chain conditional random field model (CRF) (Lafferty et al., 2001) which we apply in the manner indicated in the introduction: after many-to-many aligning the training data as in Table 1, at training time, we use the CRF as a tagging model that is trained to label each input character subsequence with an output character subsequence. As features for the CRF, we use all n-grams of subsequences of $x$ that fit inside a window of size 5 centered around the current subsequence (context features). We also include linear-chain features which allow previously generated output character subsequences to influence current output character subsequences. In essence, our baseline model is a standard discriminative approach to G2P. It is, all in all, the same approach as described in Jiampojamarn et al. (2010), except that we do not include joint n-gram features. At test time, we first segment a new input string $x$ and then apply the CRF. For this, we train the segmentation module on the segmented $x$ sequences, as available from the aligned training data.

BASELINE+X As competitors for the baseline system, we introduce systems that rely on the predictions of one or several additional (black box/offline) systems. At training time, we first multiply many-to-many align the input string $x$, the predictions $\hat{y}^{(1)}, \dots, \hat{y}^{(M)}$ and the true transcription $y$ as illustrated in Table 3 (see Section 4.3 for details). Then, as for the baseline system, we train a CRF to label each input character subsequence with the corresponding output character subsequence. However, this time, the CRF has access to the subsequence suggestions (as the alignments indicate) produced by the offline systems. As features for the extended models, we additionally include context features for all predicted strings $\hat{y}^{(m)}$ (all n-grams in a window of size 3 centered around the current subsequence prediction). We also include a joint feature firing on the tuple of the current subsequence values of $x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)}$. To illustrate, when BASELINE+X tags position 2 in the (split up) input string in Table 3, it sees that its value is ch, that the previous input position contains s, that the next contains i, that the next two contain (i,z), that the prediction of the first system at position 2 is k, that the first system's next prediction is aI, and so forth. At test time, we first multiply many-to-many align $x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)}$, and then apply the enhanced CRF.
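The feature templates described above might be sketched as follows. The feature string format, the padding symbol and the function names are assumptions made for illustration; they are not the encoding used by any particular CRF toolkit.

```python
from typing import List, Sequence

PAD = "<pad>"  # assumed symbol for positions outside the string


def window_ngrams(segments: Sequence[str], t: int, window: int, prefix: str) -> List[str]:
    """All n-grams of consecutive segments inside a window centered at position t."""
    half = window // 2
    feats = []
    for start in range(-half, half + 1):
        for end in range(start, half + 1):
            gram = [segments[t + k] if 0 <= t + k < len(segments) else PAD
                    for k in range(start, end + 1)]
            feats.append(f"{prefix}[{start}:{end}]=" + "|".join(gram))
    return feats


def extract_features(x_segs: Sequence[str], supp_segs: Sequence[Sequence[str]],
                     t: int) -> List[str]:
    """Context features for x (window 5) and each supplemental string (window 3),
    plus one joint feature over the current segments of all strings."""
    feats = window_ngrams(x_segs, t, 5, "x")
    for m, y_hat in enumerate(supp_segs, start=1):
        feats += window_ngrams(y_hat, t, 3, f"yhat{m}")
    joint = [x_segs[t]] + [y_hat[t] for y_hat in supp_segs]
    feats.append("joint=" + "|".join(joint))
    return feats


# First two systems of Table 3, tagging position 2 (index 1) of s ch i z o.
x_segs = ["s", "ch", "i", "z", "o"]
system1 = ["s", "k", "aI", "z", "@U"]
system2 = ["s", "-", "aI", "z", "@U"]
features = extract_features(x_segs, [system1, system2], 1)
```

For this position the list contains, among others, `x[0:0]=ch`, `x[-1:-1]=s`, `x[1:2]=i|z`, `yhat1[0:0]=k` and `yhat1[1:1]=aI`, matching the walk-through in the text.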

Alignments
To induce multiple monotone many-to-many alignments of input strings, offline system predictions and output strings, we proceed in one of two manners.
Exact alignments Firstly, we specify $\mathrm{sim}_1$ in Eq. (2) as

$$\mathrm{sim}_1(x_i, \hat{y}^{(1)}_i, \dots, \hat{y}^{(M)}_i, y_i) = \sum_{m=1}^{M} \mathrm{psim}(x_i, \hat{y}^{(m)}_i) + \mathrm{psim}(x_i, y_i),$$

where psim is a pair-similarity function. The advantage of this specification is that the similarity of a tuple of subsequences is defined as the sum of pairwise similarity scores, which we can directly estimate from pairwise alignments of $(x, \hat{y}^{(m)})$ that an off-the-shelf pairwise aligner can produce (we use the Phonetisaurus aligner for this). We set psim(u, v) as the log-probability of observing the tuple (u, v) in the training data of pairwise aligned sequences. To illustrate, we define the similarity of (o,@U,@U,@,@U,@,@U) in the example in Table 3 as the pairwise similarity of (o,@U) (as inferred from pairwise alignments of $x$ strings and system 1 transcriptions) plus the pairwise similarity of (o,@U) (as inferred from pairwise alignments of $x$ strings and system 2 transcriptions), etc. At test time, we use the same procedure but drop the term $\mathrm{psim}(x_i, y_i)$ when inducing alignments. For our current purposes, we label the outlined modus as exact (alignment) modus.
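A minimal sketch of this specification is given below. It assumes that psim is estimated as the log relative frequency of a segment pair among all pairs in the pairwise alignments, and that unseen pairs receive a fixed penalty; both choices, the penalty value and all names are assumptions, and the toy alignments are invented.

```python
import math
from collections import Counter
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Pair = Tuple[str, str]


def estimate_psim(alignments: Iterable[List[Pair]]) -> Dict[Pair, float]:
    """Log relative frequency of each (u, v) segment pair in pairwise alignments."""
    counts = Counter(pair for alignment in alignments for pair in alignment)
    total = sum(counts.values())
    return {pair: math.log(c / total) for pair, c in counts.items()}


def make_sim1(psim_tables: Sequence[Dict[Pair, float]],
              unseen: float = -20.0) -> Callable[[Tuple[str, ...]], float]:
    """sim_1 as a sum of pairwise scores between the x segment and every other segment.

    psim_tables[m] scores pairs (x segment, segment of string m + 1). If fewer
    segments than tables are passed (e.g., no y at test time), zip() simply
    ignores the surplus tables, which drops the corresponding terms.
    """
    def sim1(segments: Tuple[str, ...]) -> float:
        x_seg = segments[0]
        return sum(table.get((x_seg, other), unseen)
                   for table, other in zip(psim_tables, segments[1:]))
    return sim1


# Invented pairwise alignments of x with two hypothetical systems.
table1 = estimate_psim([[("o", "@U"), ("z", "z")], [("o", "@U")]])
table2 = estimate_psim([[("o", "@")]])
sim1 = make_sim1([table1, table2])
value = sim1(("o", "@U", "@"))
```

The returned `sim1` has the signature expected by a unigram multiple aligner that scores tuples of segments.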
Approx. alignments Secondly, we derive the optimal multiple many-to-many alignment of the strings in question by choosing an alignment that satisfies the condition that (1) each individual string $x, \hat{y}^{(1)}, \dots, \hat{y}^{(M)}, y$ is optimally segmented (e.g., ph-oe-n-i-x rather than pho-eni-x, f-i-n-I-ks rather than f-inIk-s) subject to the global constraint that (2) the number of segments must agree across the strings to align. This constitutes a separable alignment model as discussed in Section 2, and thus has much lower runtime complexity than the first model. Segmentation models can be directly learned from the pairwise alignments that Phonetisaurus produces by focusing on either the segmented $x$ or $y$/$\hat{y}^{(m)}$ sequences; we choose to implement bigram individual segmentation models. This second model type may be considered an approximation of the first, since in a good alignment, we would not only expect individually good segmentations and agreement of segment numbers but also that matched-up subsegments are likely correspondences of each other, precisely as our first model type captures. Therefore, we shall call this alignment modus approximate (alignment) modus, for our present purposes.

Experiments
We now describe two sets of experiments. The first is a controlled experiment on the Combilex data set, where we can design our offline/black box systems ourselves and where the black box systems are trained on a similar distribution as the baseline and the extended baseline systems. In particular, the black box systems operate on the same output alphabet as the extended baseline systems, which constitutes an 'ideal' situation. In the second, we investigate how our extended baseline system performs in a 'real-world' scenario: we train a system on Combilex that has as supplemental information corresponding Wiktionary (and PTE, as explained below) transcriptions.
Throughout, we use word accuracy (WACC) as the accuracy measure for all our systems. Word accuracy is defined as the fraction of correctly transcribed strings among all transcribed strings in a test sample. WACC is a strict measure that penalizes even tiny deviations from the gold-standard transcriptions, but it has become standard in G2P.
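For concreteness, word accuracy as defined above can be computed as in the following sketch; the function name and the two toy transcriptions are illustrative assumptions.

```python
from typing import Sequence


def word_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted strings that exactly match their gold transcription."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("empty test sample")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# One exact match and one near miss: the near miss counts as fully wrong.
wacc = word_accuracy(["finIks", "skIts@"], ["finIks", "skIts@U"])
```

The near miss in the toy sample shows why WACC is strict: a single missing symbol makes the whole word count as an error.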

A controlled experiment
In our first set of experiments, we let our offline/black box systems be the Sequitur G2P modeling toolkit (Bisani and Ney, 2008) (S) and the Phonetisaurus modeling toolkit (Novak et al., 2012) (P). We train them on disjoint sets of 20,000 grapheme-to-phoneme Combilex string pairs each. The performance of these two systems, on the test set of size 28,000, is indicated in Table 4. Next, we train BASELINE on disjoint sets (disjoint from both the training sets of P and S) of size 2,000, 5,000, 10,000 and 20,000. Making BASELINE's training sets disjoint from the training sets of the offline systems is both realistic (since a black box system would typically follow a partially distinct distribution from one's own training set distribution) and also prevents the extended baseline systems from fully adapting to the predictions of either P or S, whose training set accuracy is an upward biased representation of their true accuracy.

        Phonetisaurus   Sequitur
WACC    72.12           71.70
Table 4: Word accuracy (in %) on the test data, for the two systems indicated.

As baseline extensions, we consider the systems BASELINE+P (+P), and BASELINE+P+S (+P+S).⁸ Results are shown in Figures 1 and 2. We see that conjoining the base system with the predictions of the offline Phonetisaurus and Sequitur models substantially increases the baseline WACC, especially in the case of little training data. In fact, WACC increases here by almost 100% when the baseline system is complemented by $\hat{y}^{(P)}$ and $\hat{y}^{(S)}$. As training set size increases, differences become less and less pronounced. Eventually, we would expect them to drop to zero, since beyond some training set size, the additional features may provide no new information.⁹ We also note that conjoining the two systems is more valuable than conjoining only one system, and, in Figure 2, that the models which are based on exact multiple alignments outperform the models based on approximate alignments, but not by a wide margin.

⁸ We omit BASELINE+S since it yielded similar results as BASELINE+P.
⁹ In fact, in follow-up work, we find that the additional information may also confuse the base system when training set sizes are large enough.

Concerning differences in alignments between the two alignment types, exact vs. approximate, an illustrative example where the approximate model fails and the exact model does not is ('false' alignment based on the approximate model indicated):

r   ee   n   t   e   r    e   d
r   i    E   n   t   @'   r   d
r   i    E   n   t   @'   r   d

This nicely captures the inability of the approximate model to account for correlations between the matched-up subsequences. That is, while the segmentations of the three shown sequences appear acceptable, a matching of graphemic t with phonemic n, etc., seems quite unlikely. Still, it is very promising to see that these differences in alignment quality translate into very small differences in overall string-to-string translation model performance, as Figure 2 outlines. Namely, differences in WACC are typically on the level of 1% or less (always in favor of the exact alignment model). This is a very important finding, as it indicates that string-to-string translation need not be (severely) negatively impacted by switching to the approximate alignment model, a tractable alternative to the exact models, which quickly become practically infeasible as the number of strings to align increases.

Real-world experiments
To test whether our approach may also succeed in a 'real-world setting', we use as offline/black box systems GA Wiktionary transcriptions of our input forms as well as transcriptions from PhotoTransEdit (PTE), a lexicon-based G2P system which offers both GA and RP (received pronunciation) transcriptions of English strings. We train and test on input strings for which both Combilex and PTE transcriptions are available, and for which both Combilex and Wiktionary transcriptions are available. Test set sizes are about 1,500 in the case of PTE and 3,500 in the case of Wiktionary. We only test here the performance of the exact alignment method, noting that, as before, approximate alignments produced slightly weaker results.
Clearly, Wiktionary and PTE differ from the Combilex data. First, both Wiktionary and PTE use different numbers of phonemic symbols than Combilex, as Table 5 illustrates. Such differences arise from the fact that, e.g., lengthening of vowels is indicated by two output letters in some data sets and only one in others. Also, phonemic transcription conventions differ, as becomes most strikingly evident in the case of RP vs. GA transcriptions; Table 6 illustrates. Finally, Wiktionary has many more phonetic symbols than the other datasets, a finding that we attribute to its crowd-sourced nature and lack of normalization. Despite these differences in phonemic annotation standards between Combilex, Wiktionary and PTE, we observe that conjoining input strings with predicted Wiktionary or PTE transcriptions via multiple alignments leads to very good improvements in WACC over only using the input string as information source. Indeed, as shown in

¹² To provide, for the interested reader, a comparison with Phonetisaurus and Sequitur: for the Wiktionary GA data, performance of Phonetisaurus is 41.80% (training set size 2,000), 55.70% (5,000) and 62.47% (10,000). Respective numbers for Sequitur are 40.58%, 54.84%, and 61.58%. On PTE, results are, similarly, slightly higher than our baseline, but substantially lower than the extended baseline.

Table 6: Multiple alignments of input string, predicted PTE transcription and true (Combilex) transcription. Differences may be due to alternative phonemic conventions (e.g., Combilex has a single phonemic character representing the sound tS) and/or due to differences in pronunciation in GA and RP, resp.

Conclusion
We have generalized the task description of string transduction to include supplemental information strings. Moreover, we have suggested multiple many-to-many alignments (and a subsequent standardly extended discriminative approach) for solving string transduction (here, G2P) in this generalized setup. We have shown that, in a real-world setting, our approach may significantly beat a standard discriminative baseline, e.g., when we add Wiktionary transcriptions or predictions of a rule-based system as additional information to the input strings. The appeal of this approach lies in the fact that almost any sort of external knowledge source may be integrated to improve the performance of a baseline system. For example, supplemental information strings may appear in the form of transliterations of an input string in other languages; they may be predictions of other G2P systems, whether carefully manually crafted or learnt from data; they might even appear in the form of phonetic transcriptions of the input string in other dialects or languages.
What distinguishes our solution to integrating supplemental information strings in string transduction settings from other research (e.g., Bhargava and Kondrak, 2011; Bhargava and Kondrak, 2012) is that rather than integrating systems on the global level of strings, we integrate them on the local level of smaller units, namely, substrings appropriate to the domain of application (e.g., in our context, phoneme/grapheme substructures). Both approaches may be considered complementary. Finally, another important contribution of our work is to outline an 'approximation algorithm' for inducing multiple many-to-many alignments of strings, which is otherwise an NP-hard problem for which (most likely) no efficient exact solutions exist, and to investigate its suitability for the problem task. In particular, we have seen that exact alignments lead to better overall model performance, but that the margin over the approximation is not wide.
The scope for future research on our modeling is huge: multiple many-to-many alignments may be useful in aligning cognates in linguistic research; they may be the first necessary step for many ensemble techniques in string transduction other than the one we have considered (Cortes et al., 2014); and they may allow, on a large scale, boosting G2P (transliteration, lemmatization, etc.) systems by integrating them with many traditional (or modern) knowledge resources such as rule- and dictionary-based lemmatizers, crowd-sourced phonetic transcriptions (e.g., based on Wiktionary), etc., with the outlook of significantly outperforming current state-of-the-art models which are based solely on input string information.
Finally, we note that we have thus far shown that supplemental information strings may be beneficial when training data is scarce overall and that improvements decrease with data size. Further investigating this relationship will be of importance. Moreover, it will be insightful to compare the exact and approximate alignment algorithms presented here with other (heuristic) alignment methods, such as iterative pairwise alignments as employed in machine translation, and to investigate how alignment quality of multiple strings impacts overall G2P performance in the setup of additional information strings.