Learning a Lexicon and Translation Model from Phoneme Lattices

Language documentation begins by gathering speech. Manual or automatic transcription at the word level is typically not possible because of the absence of an orthography or prior lexicon, and though manual phonemic transcription is possible, it is prohibitively slow. On the other hand, translations of the minority language into a major language are more easily acquired. We propose a method to harness such translations to improve automatic phoneme recognition. The method assumes no prior lexicon or translation model, instead learning them from phoneme lattices and translations of the speech being transcribed. Experiments demonstrate phoneme error rate improvements against two baselines and the model’s ability to learn useful bilingual lexical entries.


Introduction
Most of the world's languages are dying out and have little recorded data or linguistic documentation (Austin and Sallabank, 2011). It is important to adequately document languages while they are alive so that they may be investigated in the future. Language documentation traditionally involves one-on-one elicitation of speech from native speakers in order to produce lexicons and grammars that describe the language. However, this does not scale: linguists must first transcribe the speech phonemically, as most of these languages have no standardized orthography. This is a critical bottleneck since it takes a trained linguist about 1 hour to transcribe the phonemes of 1 minute of speech (Do et al., 2014).
Smartphone apps for rapid collection of bilingual data have been increasingly investigated (De Vries et al., 2011; De Vries et al., 2014; Reiman, 2010; Bird et al., 2014; Blachon et al., 2016). It is common for these apps to collect speech segments paired with spoken translations in another language, making spoken translations quicker to obtain than phonemic transcriptions.
We present a method to improve automatic phoneme transcription by harnessing such bilingual data to learn a lexicon and translation model directly from source phoneme lattices and their written target translations, assuming that the target side is a major language that can be efficiently transcribed. A Bayesian non-parametric model expressed with a weighted finite-state transducer (WFST) framework represents the joint distribution of source acoustic features, phonemes and latent source words given the target words. Sampling of alignments is used to learn source words and their target translations, which are then used to improve transcription of the source audio they were learnt from. Importantly, the model assumes no prior lexicon or translation model.
Experiments demonstrate that our method significantly reduces the phoneme error rate (PER) of transcriptions compared with a baseline recogniser and a similar model that harnesses only monolingual information, by up to 17% and 5% respectively. We also find that the model learns meaningful bilingual lexical items.

Model description
Our model extends the standard automatic speech recognition (ASR) problem by seeking the best phoneme transcription $\hat{\boldsymbol{\phi}}$ of an utterance in a joint probability distribution that incorporates acoustic features $\mathbf{x}$, phonemes $\boldsymbol{\phi}$, latent source words $\mathbf{f}$ and observed target transcriptions $\mathbf{e}$:

$$\hat{\boldsymbol{\phi}} = \arg\max_{\boldsymbol{\phi}, \mathbf{f}} P(\mathbf{x} \mid \boldsymbol{\phi})\, P(\boldsymbol{\phi} \mid \mathbf{f})\, P(\mathbf{f} \mid \mathbf{e}), \qquad (1)$$

assuming a Markov chain of conditional independence relationships (bold symbols denote utterances as opposed to tokens). Deviating from standard ASR, we replace language model probabilities with those of a translation model, and search for phonemes instead of words. Also, neither a lexicon nor a translation model is given in training.

Expression of the distribution using finite-state transducers
We use a WFST framework to express the factors of (1) since it offers computational tractability and simple inference in a clear, modular framework. Figure 1 uses a toy German-English error resolution example to illustrate the components of the framework: a phoneme lattice representing phoneme uncertainty according to $P(\mathbf{x} \mid \boldsymbol{\phi})$; a lexicon that transduces phoneme substrings $\phi_s$ of $\boldsymbol{\phi}$ to source tokens $f$ according to $P(\phi_s \mid f)$; and a lexical translation model representing $P(f \mid e)$ for each $e$ in the written translation. The composition of these components is also shown at the bottom of Figure 1, illustrating how would-be transcription errors can be resolved. This framework is reminiscent of the WFST framework used by Neubig et al. (2012) for lexicon and language model learning from monolingual data.
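To make the role of each component concrete, the following minimal sketch scores every path by brute force rather than by WFST composition. It is an illustration only: the lattice paths, lexicon entries, probabilities and all function names are hypothetical, each character stands for one phoneme, and each source word is scored against its single most probable target word, which is a simplification of the alignment model.

```python
def segmentations(phonemes, lexicon):
    """Yield every way to split a phoneme string into lexicon entries."""
    if not phonemes:
        yield []
        return
    for end in range(1, len(phonemes) + 1):
        chunk = phonemes[:end]
        for word, lex_p in lexicon.get(chunk, []):
            for rest in segmentations(phonemes[end:], lexicon):
                yield [(word, lex_p)] + rest


def best_transcription(lattice_paths, lexicon, translation_model, target_words):
    """Return (score, phonemes, words) maximising P(x|phi) P(phi|f) P(f|e)."""
    best = (0.0, None, None)
    for phonemes, acoustic_p in lattice_paths.items():
        for words in segmentations(phonemes, lexicon):
            score = acoustic_p
            for word, lex_p in words:
                # Simplification: align each source word to its best target word.
                trans_p = max(translation_model.get((word, e), 0.0)
                              for e in target_words)
                score *= lex_p * trans_p
            if score > best[0]:
                best = (score, phonemes, [w for w, _ in words])
    return best


# Hypothetical toy values: the recogniser alone prefers "maus".
toy_lattice = {"maus": 0.6, "haus": 0.4}
toy_lexicon = {"maus": [("Maus", 1.0)], "haus": [("Haus", 1.0)]}
toy_translation = {("Haus", "house"): 0.9, ("Maus", "house"): 0.05}

toy_best = best_transcription(toy_lattice, toy_lexicon, toy_translation, ["house"])
print(toy_best)
```

In this toy case the translation "house" overturns the recogniser's preference for "maus", which is the kind of error resolution the composed transducer is meant to achieve.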

Learning the lexicon and translation model
Because we do not have knowledge of the source language, we must learn the lexicon and translation model from the phoneme lattices and their written translation. We model lexical translation probabilities using a Dirichlet process. Let $A$ be both the transcription of each source utterance $\mathbf{f}$ and its word alignments to the translation $\mathbf{e}$ that generated them. The conditional posterior can be expressed as:

$$P(f \mid e; A) = \frac{c_A(f, e) + \alpha P_0(f)}{c_A(e) + \alpha}, \qquad (2)$$

where $c_A(f, e)$ is a count of how many times $f$ has aligned to $e$ in $A$ and $c_A(e)$ is a count of how many times $e$ has been aligned to; $P_0$ is a base distribution that influences how phonemes are clustered; and $\alpha$ determines the emphasis on the base distribution. In order to express the Dirichlet process using the WFST components, we take the union of the lexicon with a spelling model base distribution that consumes phonemes $\phi_i \dots \phi_j$ and produces a special unk token with probability $P_0(\phi_i \dots \phi_j)$. This unk token is consumed by a designated arc in the translation model WFST with probability $\frac{\alpha}{c_A(e) + \alpha}$, yielding a composed probability of $\frac{\alpha P_0(f)}{c_A(e) + \alpha}$. Other arcs in the translation model express the probability $\frac{c_A(f, e)}{c_A(e) + \alpha}$ of entries already in the lexicon. The sum of these two probabilities equates to (2).
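The Dirichlet process posterior for lexical translation can be computed directly from alignment counts. The sketch below is illustrative only: the function name, the toy alignments and the constant base probability of 0.01 are assumptions made for the example, not values from our experiments.

```python
from collections import Counter


def translation_prob(f, e, pair_counts, target_counts, alpha, base_prob):
    """P(f | e; A) = (c_A(f, e) + alpha * P_0(f)) / (c_A(e) + alpha)."""
    cached = pair_counts[(f, e)]           # c_A(f, e); 0 for unseen pairs
    return (cached + alpha * base_prob(f)) / (target_counts[e] + alpha)


# Hypothetical sampled alignments of source phoneme strings to target words.
toy_alignments = [("w2n", "one"), ("w2n", "one"), ("w2t", "one")]
pair_counts = Counter(toy_alignments)
target_counts = Counter(e for _, e in toy_alignments)


def constant_base(f):
    # Stand-in for the spelling model P_0 (assumed constant here).
    return 0.01


p_seen = translation_prob("w2n", "one", pair_counts, target_counts, 1.0, constant_base)
p_unseen = translation_prob("tu:", "one", pair_counts, target_counts, 1.0, constant_base)
print(p_seen, p_unseen)
```

A pair that has been aligned often keeps most of the mass, while an unseen pair receives only the base-distribution share, scaled by $\alpha$.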
As for the spelling model P_0, we consider three distributions and implement WFSTs to represent them: a geometric distribution, Geometric(γ), a Poisson distribution, Poisson(λ), and a 'shifted' geometric distribution, Shifted(α, γ). The shifted geometric distribution mitigates a shortcoming of the geometric distribution whereby words of length 1 have the highest probability. It does so by having another parameter α that specifies the probability of a word of length 1, with the remaining probability mass distributed geometrically. All phoneme types are treated the same in these distributions, with uniform probability.
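The three length distributions can be sketched as follows. The exact parameterisations are assumptions: γ is taken to be the per-phoneme stopping probability, the Poisson is left untruncated, and the inventory size of 40 phoneme types is hypothetical. The WFST implementations may differ in these details.

```python
import math


def geometric_length(n, gamma):
    """P(length = n), n >= 1, with stopping probability gamma (assumed form)."""
    return gamma * (1 - gamma) ** (n - 1)


def poisson_length(n, lam):
    """P(length = n) under an untruncated Poisson(lam)."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def shifted_length(n, alpha, gamma):
    """Length 1 has probability alpha; the remaining mass is geometric from 2."""
    if n == 1:
        return alpha
    return (1 - alpha) * gamma * (1 - gamma) ** (n - 2)


def spelling_prob(phonemes, length_prob, n_phoneme_types):
    """P_0 of a phoneme sequence: length prior times uniform phoneme choices."""
    n = len(phonemes)
    return length_prob(n) * (1.0 / n_phoneme_types) ** n


toy_word = ("w", "2", "n")
p0_toy = spelling_prob(toy_word, lambda n: shifted_length(n, 1e-5, 0.25), 40)
print(p0_toy)
```

Under the geometric form, length 1 is always the most probable length, whereas the shifted form lets the probability of length 1 be set independently and made very small.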

Inference
In order to determine the translation model parameters as described above, we require the alignments $A$. We sample these proportionally to their probability given the data and our prior, in effect integrating over all parameter configurations $T$:

$$P(A \mid \mathcal{X}; \alpha, P_0) = \int_T P(A \mid \mathcal{X}, T)\, P(T; \alpha, P_0)\, dT, \qquad (3)$$

where $\mathcal{X}$ is our dataset of source phoneme lattices paired with target sentences. This is achieved using blocked Gibbs sampling, with each utterance constituting one block. To sample from WFSTs, we use forward-filtering/backward-sampling (Scott, 2002; Neubig et al., 2012), creating forward probabilities using the forward algorithm for hidden Markov models before backward-sampling edges proportionally to the product of the forward probability and the edge weight. No Metropolis-Hastings rejection step was used.
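Forward-filtering/backward-sampling can be illustrated on a small acyclic lattice. This is a minimal sketch under stated assumptions: the edge-list representation, the function names and the toy weights are hypothetical, node identifiers are assumed to be topologically ordered (every edge goes from a lower to a higher id), and weights are plain probabilities rather than log-domain costs.

```python
import random
from collections import defaultdict


def forward_probs(edges, start):
    """Forward pass: total weight of all paths from `start` to each node."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    # Sorting by source id is valid because ids are topologically ordered.
    for src, dst, _label, weight in sorted(edges, key=lambda e: e[0]):
        fwd[dst] += fwd[src] * weight
    return fwd


def backward_sample(edges, fwd, start, final, rng):
    """Sample one path, walking back from `final` to `start`."""
    incoming = defaultdict(list)
    for edge in edges:
        incoming[edge[1]].append(edge)
    node, labels = final, []
    while node != start:
        candidates = incoming[node]
        # Each incoming edge is chosen proportionally to forward prob * weight.
        weights = [fwd[src] * w for src, _dst, _label, w in candidates]
        src, _dst, label, _w = rng.choices(candidates, weights=weights, k=1)[0]
        labels.append(label)
        node = src
    return labels[::-1]


# Hypothetical lattice with one uncertain phoneme: (src, dst, label, weight).
toy_edges = [(0, 1, "h", 1.0), (1, 2, "a", 0.7), (1, 2, "e", 0.3), (2, 3, "t", 1.0)]
toy_fwd = forward_probs(toy_edges, start=0)
toy_path = backward_sample(toy_edges, toy_fwd, start=0, final=3, rng=random.Random(0))
print(toy_path)
```

In the full sampler the lattice being sampled would be the composed transducer for one utterance, so that a sampled path fixes the phonemes, the segmentation and the alignments together.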

Experimental evaluation
We evaluate the lexicon and translation model by their ability to improve phoneme recognition, measuring phoneme error rate (PER).
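For reference, PER can be computed as the Levenshtein distance between hypothesis and reference phoneme sequences, normalised by reference length. The sketch below assumes corpus-level normalisation (total edits over total reference phonemes); the function names and toy sequences are illustrative and not part of our evaluation scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance with unit substitution, insertion and deletion costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edits divided by total reference phonemes (assumed normalisation)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


toy_refs = [["w", "2", "n"], ["h", "aU"]]
toy_hyps = [["w", "2", "t"], ["h", "aU"]]
toy_per = phoneme_error_rate(toy_refs, toy_hyps)
print(toy_per)
```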

Experimental setup
We used less than 10 hours of English-Japanese data from the BTEC corpus (Takezawa et al., 2002), comprising spoken utterances paired with textual translations. This allows us to assess the approach assuming quality acoustic models. We used acoustic models similar to Heck et al. (2015) to obtain source phoneme lattices. Gold phoneme transcriptions were obtained by transforming the text with pronunciation lexicons and, in the Japanese case, first segmenting the text into tokens using KyTea (Neubig et al., 2011).
We run experiments in both directions, English-Japanese and Japanese-English (en-ja and ja-en), and compare against three settings: the ASR 1-best path uninformed by the model (ASR); a monolingual version of our model that is identical except without conditioning on the target side (Mono); and the model applied using the source language sentence as the target (Oracle).
We tuned on the first 1,000 utterances (about 1 hour) of speech and trained on up to 9 hours of the remaining data. Only the oracle setup was used for tuning, with Geometric(0.01) (taking the form of a vague prior), Shifted($10^{-5}$, 0.25) and Poisson(7) performing best. Table 1 shows en-ja and ja-en results for all methods with the full training data. Figure 2 shows improvements of ja-en over both the ASR baseline and the Mono method as the training data increases, with translation modeling gaining an increasing advantage with more training data. Notably, English recognition gains less from using Japanese as the target side (en-ja) than the other way around, while the 'oracle' approach for Japanese recognition, which also uses Japanese as the target, underperforms ja-en. These observations suggest that using the Japanese target is less helpful, likely explained by the fine-grained morphological segmentation we used, making it harder for the model to relate source phonemes to target tokens.

Figure 2: Japanese phoneme error rates using the shifted geometric prior when training data is scaled up from 1-9 hours, averaged over 3 runs.

Results and Discussion
The vague geometric prior significantly underperforms the other priors. In the en-ja/vague case, the model actually underperforms its monolingual counterpart. The vague prior biases slightly towards fine-grained English source segmentation, with words of length 1 most common. In this case, fine-grained Japanese is also used as the target, which results in most lexical entries arising from uninformative alignments between single English phonemes and Japanese syllables, such as [t]⇔す. For similar reasons, the shifted geometric prior gains an advantage over Poisson, likely because of its ability to even further penalize single-phoneme lexical items, which regularly end up in all lexicons anyway due to their combinatorial advantage when sampling.
While many bilingual lexical entries are correct, such as [w2n]⇔一 ('one'), most are not. Some have segmentation errors, such as [li:z]⇔くださ ('please'); some are correctly segmented but misaligned to commonly co-occurring words, such as [w2t]⇔時 ('what' aligned to 'time'); others do not constitute individual words, but morphemes aligned to common Japanese syllables, such as [i:N]⇔く ('-ing'); others still align multi-word units correctly, such as [haUm2tS]⇔いくら ('how much'). Note though that entries such as those listed above capture information that may nevertheless help to reduce phoneme transcription errors.

Conclusion and Future Work
We have demonstrated that a translation model and lexicon can be learnt directly from phoneme lattices in order to improve phoneme transcription of those very lattices.
One of the appealing aspects of this modular framework is that there is much room for extension and improvement. For example, adaptor grammars could be used to encourage syllable segmentation (Johnson, 2008), or language model probabilities could be incorporated in addition to our translation model probabilities (Neubig et al., 2012).
We assume a good acoustic model with phoneme error rates between 20 and 25%. In a language documentation scenario, acoustic models for the low-resource source language will not exist. Future work should use a universal phoneme recognizer or an acoustic model of a similar language, thus making a step towards true generalizability.