Joint Word Segmentation and Phonetic Category Induction

We describe a model that jointly performs word segmentation and induces vowel categories from formant values. Vowel induction performance improves slightly over a baseline model that does not segment; segmentation performance decreases slightly from a baseline using entirely symbolic input. Our high joint performance in this idealized setting implies that the problems observed in unsupervised speech recognition reflect the phonetic variability of real speech sounds in context.


Introduction
In learning to speak their native language, a developing infant must acquire two related pieces of information: a set of lexical items (along with the contexts in which they are likely to occur), and a set of phonetic categories. For instance, an English-learning infant must learn that [i] and [I] are different segments, differentiating between words like beat and bit, while for a Spanish-learning infant, [i] and [I]-like tokens represent realizations of the same category. It is clear that these two tasks are intimately related, and that models of language acquisition must solve both together, but how?
This problem has inspired much recent work in low-resource speech recognition (Lee et al., 2015; Lee and Glass, 2012; Jansen and Church, 2011; Varadarajan et al., 2008), with impressive results. Nonetheless, many of these researchers conclude that their systems learn too many phonetic categories, a problem they attribute to the presence of contextual variants (allophones) of the different sounds. For instance, the [a] in dog is likely longer than the [a] in dock (Ladefoged and Johnson, 2010), but this difference is not phonologically meaningful in English: it cannot differentiate any pair of words on its own. Many unsupervised systems are claimed to erroneously learn these kinds of differences as categorical ones.
Here, we attempt to model the problem in a more controlled setting by extending work in cognitive modeling of language acquisition. We present a system which jointly acquires vowel categories and lexical items from a mixed symbolic/acoustic representation of the input. As is traditional in cognitive models of vowel acquisition, it uses a single-point formant representation of the vowel acoustics, and is tested on a simulated corpus in which vowel acoustics are unaffected by context. We find that, under these circumstances, vowel categories and lexical items can be learned jointly with relatively little decrease in accuracy from learning either alone. Thus, our results support the hypothesis that the more realistic problem is hard because of contextual variability. As a secondary point, we show that the results reflect problems with local minima in the popular framework of hierarchical Bayesian modeling.

Related work
This work aims to induce both a set of phonetic vowel categories and a lexical representation from unlabeled data. It extends the closely related model of Feldman et al. (2013a), which performs the same task, but with known word boundaries; this requirement is a significant limitation on the model's cognitive plausibility. Our model instead infers a latent word segmentation. Another extension uses semantic information to disambiguate words, but still assumes known word boundaries.
A few models learn a lexicon while categorizing all sounds, instead of just vowels. Lee et al. (2015) and Lee and Glass (2012) use hierarchical Bayesian models to induce word and subword units. These models are mathematically very similar to our own, differing primarily in their use of more complex acoustic representations and in inducing categories for all sounds rather than just vowels. Jansen and Church (2011) learn whole-word Markov models, then cluster their states into phone-like units using a spectral algorithm. Their system still learns multiple allophonic categories for most sounds.
In the segmentation literature, several previous systems learn lexical items from variable input (Elsner et al., 2013;Daland and Pierrehumbert, 2011;Rytting et al., 2010;Neubig et al., 2010;Fleck, 2008). However, these models use pre-processed representations of the acoustics (phonetic transcription or posterior probabilities from a phone recognizer) rather than inducing an acoustic category structure directly. Elsner et al. (2013) and Neubig et al. (2010) use Bayesian models and sampling schemes similar to those presented here.
Acquisition models like Elsner et al. (2013), Rytting et al. (2010) and Fleck (2008) are designed to handle phonological variability. In particular, they are designed to cope with words which have multiple transcribed pronunciations ([wan] and [want] for "want"); this kind of alternation can insert or delete whole segments, or change a vowel sound from one perceptual category to another. Such variability is common in spoken English (Pitt et al., 2005) and presents a challenge for speech recognition (McAllaster et al., 1998).
In contrast, the system presented here models phonetic variability within a single category. It uses an untranscribed, continuous-valued representation for vowel sounds, so that different tokens within a single category may differ from one another. But it does so within an idealized dataset which lacks phonological variants. Moreover, although the phonetic input to the system is variable, the variation is not predictable; tokens within the category differ at random, independently from their environment.
Several other models also learn phonetic categories from continuous input, either from real or idealized datasets, without learning a lexicon. Varadarajan et al. (2008) learn subword units by incrementally splitting an HMM model of the data to maximize likelihood. Badino et al. (2014) perform k-means clustering on the acoustic representation learned by an autoencoder. Cognitive models using formant values as input are common, many using mixture of Gaussians (Vallabha et al., 2007;de Boer and Kuhl, 2003). Because they lack a lexicon, these models have particular difficulty distinguishing meaningful from allophonic variability.

Dataset and model
Our dataset replicates the previous idealized setting for vowel category induction in cognitive modeling, but in a corpus of unsegmented utterances rather than a wordlist. We adapt a standard word segmentation corpus of child-directed speech (Brent, 1999), which consists of 8000 utterances from Bernstein-Ratner (1987), orthographically transcribed and then phonetically transcribed using a pronunciation dictionary.
We add simulated acoustics (without contextual variation) to each vowel in the Brent corpus. Following previous cognitive models of category induction (Feldman et al., 2013b), we use the vowel dataset given by Hillenbrand et al. (1995), which gives formants for English vowels read in the context h_d. We estimate a multivariate Gaussian distribution for each vowel, and, whenever a monophthongal vowel occurs in the Brent corpus, we replace it with a pair of formants (f1, f2) drawn from the appropriate Gaussian. The ARPABET diphthongs "oy, aw, ay, em, en", and all the consonants, retain their discrete values. The first three words of the dataset, orthographically "you want to", are rendered: y[380.53 1251.69] w[811.88 1431.96]n t[532.91 1094.14].
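This corpus construction can be sketched as follows. The per-vowel formant samples below are made-up stand-ins for the Hillenbrand et al. (1995) measurements (which cover many more speakers), and the function names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-vowel formant measurements (F1, F2 in Hz), standing in
# for the Hillenbrand et al. (1995) data.
hillenbrand = {
    "iy": np.array([[270.0, 2290.0], [310.0, 2790.0], [290.0, 2400.0]]),
    "ih": np.array([[390.0, 1990.0], [430.0, 2480.0], [400.0, 2100.0]]),
}

# Fit a multivariate Gaussian (mean, covariance) to each vowel category.
params = {v: (x.mean(axis=0), np.cov(x, rowvar=False))
          for v, x in hillenbrand.items()}

def simulate(transcript):
    """Replace each monophthong with a formant pair drawn from its Gaussian;
    consonants (and diphthongs) keep their discrete symbolic values."""
    out = []
    for seg in transcript:
        if seg in params:
            mu, sigma = params[seg]
            f1, f2 = rng.multivariate_normal(mu, sigma)
            out.append((round(f1, 2), round(f2, 2)))
        else:
            out.append(seg)
    return out

print(simulate(["y", "iy", "w", "ih", "n", "t"]))
```

Each run draws fresh formant values, so repeated tokens of the same vowel differ at random but not by context, matching the idealization described above.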

Model
Our model merges the Feldman et al. (2013a) vowel category learner with the Elsner et al. (2013) noisy-channel framework for word segmentation, which is in turn based on an earlier Bayesian segmentation model. In generative terms, it defines a sequential process for sampling a dataset. The observations are surface strings S, which are divided into (latent) words X_1 ... X_n. We denote the j-th character of word i as S_ij. When S_ij is a vowel, the observed value is a real-valued formant pair (f1, f2); when it is a consonant, it is observed directly.
1. Draw a distribution over vowel categories, pi_v ~ DP(alpha_v).
2. Draw parameters (mu_v, Sigma_v) for each vowel category from a Normal-Inverse-Wishart prior with parameters mu_0, n, Lambda and nu.
3. Draw a unigram distribution over word forms, G_0 ~ DP(alpha_0, CV(pi_v, p_c, p_stop)).
4. For each word type X, draw a bigram distribution, G_X ~ DP(alpha_1, G_0).
5. Sample word sequences, X_i ~ G_{X_{i-1}}.
6. Realize each vowel token in the surface string, S_ij ~ Normal(mu_{X_ij}, Sigma_{X_ij}).
The initial prior over word forms, CV(pi_v, p_c, p_stop), is the following: sample a word length >= 1 from Geom(p_stop); for each character in the word, choose to sample a consonant with probability p_c or a vowel otherwise; sample all consonants uniformly, and all vowels according to the (possibly infinite) probability vector pi_v. In practice, we integrate out pi_v, yielding a Chinese restaurant process in which the distribution over vowels in a new word depends on those used in already-seen words. Vowels which occur in many word types are more likely to recur (Goldwater et al., 2006; Teh et al., 2006).
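The CV prior over word forms can be sketched as follows. The consonant and vowel inventories here are illustrative, and a finite uniform vector stands in for the (possibly infinite) pi_v that the model actually integrates out:

```python
import random

random.seed(1)

CONSONANTS = list("ptkbdgmnszlrwh")            # simplified inventory (assumption)
VOWELS = ["iy", "ih", "eh", "ae", "aa", "ao", "uw", "uh", "ah", "er"]
PI_V = [1.0 / len(VOWELS)] * len(VOWELS)       # finite stand-in for pi_v

def sample_word(p_c=0.5, p_stop=0.5):
    """Sample a word form from the CV prior: geometric length >= 1, each
    slot a consonant with prob. p_c (uniform) or a vowel drawn from pi_v."""
    word = []
    while True:
        if random.random() < p_c:
            word.append(random.choice(CONSONANTS))
        else:
            word.append(random.choices(VOWELS, weights=PI_V)[0])
        if random.random() < p_stop:           # geometric stopping rule
            return word

print(sample_word())
```

With p_stop = .5, sampled words average two segments, roughly matching the short words typical of child-directed speech.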
The hyperparameters for the model are alpha_0 and alpha_1 (which control the size of the unigram and bigram vocabularies), alpha_v (which weakly affects the number of vowel categories), mu_0, n, Lambda and nu (which affect the average location and dispersion of vowel categories in formant space), and p_c and p_stop (which weakly affect the length and composition of words). We set alpha_0 and alpha_1 to their optimal values for word segmentation (3000 and 100) and alpha_v to .001. In practice, no value of alpha_v we tried would produce a useful number of vowels, so we fix the maximum number of vowels (non-probabilistically) to n_v; we explore a variety of values for this parameter below. The mean vector for the vowel category parameters is set to [500, 1500] and the inverse precision matrix to 500I, biasing vowel categories to be near the center of the vowel space and to have variances on the order of hundreds of hertz. We set the prior degrees of freedom nu to 2.001. Since nu can be interpreted as a pseudocount determining the prior strength, the prior influence is relatively weak for reasonably-sized vowel categories. We set p_c = .5 and p_stop = .5; based on previous work, we do not expect these parameters to be influential.
These hyperparameter values were mostly taken from previous work. The vowel inverse precision and degrees of freedom differ from those in Feldman et al. (2013a), since our approach requires us to sample from the prior, but the uninformative prior used there was too poor a fit for the data. We chose a variance with units on the order of the overall data variance, but did not tune it.

Inference
We conduct inference by Gibbs sampling, using three sampling moves: block sampling of the analyses of a single utterance, relabeling of the table for a lexical item, and resampling of the vowel category parameters mu_v and Sigma_v. We run 1000 iterations of utterance resampling, with table relabeling every 10 iterations. Following previous work, we integrate out the mixing weight distributions G_0, G_1 and pi_v, resulting in Chinese restaurant process distributions for unigrams, bigrams and vowel categories in the lexicon (Teh et al., 2006). Unlike Feldman et al. (2013a) and many other variants of the Infinite Mixture of Gaussians (Rasmussen, 1999), we do not integrate out mu_v and Sigma_v, since doing so would create long-distance dependencies between different tokens of the same vowel category within an utterance and thus complicate the implementation of a whole-utterance block sampling scheme.
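Integrating out a mixing distribution yields the standard Chinese restaurant process predictive probability, which can be sketched as follows; the concentration, base probability, and vowel counts below are illustrative, not values from our experiments:

```python
from collections import Counter

def crp_predictive(counts, alpha, item, base_prob):
    """P(next draw = item) under a CRP with concentration alpha:
    proportional to the item's count among previous draws, plus
    alpha * base_prob for the new-table (base distribution) path."""
    n = sum(counts.values())
    return (counts[item] + alpha * base_prob) / (n + alpha)

# Toy vowel usage counts across already-seen lexical types.
counts = Counter({"iy": 7, "ih": 2, "ae": 1})
p_seen = crp_predictive(counts, alpha=0.5, item="iy", base_prob=0.1)
p_new = crp_predictive(counts, alpha=0.5, item="uw", base_prob=0.1)
print(p_seen, p_new)
```

The rich-get-richer behavior (p_seen far exceeds p_new) is what makes frequently used vowels more likely to recur in new word types.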
To block sample the analyses of a single utterance, we use beam sampling (Van Gael et al., 2008;Huggins and Wood, 2014), an auxiliary-variable sampling scheme in which we encode the model as an (infeasibly large) finite-state transducer, then sample cutoff variables which restrict our algorithm to a finite subset of the transducer and sample a trajectory within it. We then use a Metropolis-Hastings acceptance test to correct for the discrepancy between our finite-state encoding and the actual model probability caused by repetitions of a lexical item within the same utterance.
Specifically, for each vowel s_ij, we sample a cutoff c_ij ~ U[0, P(s_ij | x_ij)]. This cutoff indicates the least probable category assignment we will permit for the surface symbol s_ij, and constrains us to consider only a finite number of vowels at each point; if there are not enough, we can instantiate unseen vowels by sampling their mu and Sigma from the prior. We then construct the lattice of possible word segmentations in which s_ij is allowed to correspond to any vowel in any lexical entry, as long as all the consonants match up and the vowel assignment density P(s_ij | x_ij) is greater than the cutoff. We then propose a new trajectory by sampling from this lattice. See Mochihashi et al. (2009) for details of the finite-state construction. Annealing is applied linearly, with the inverse temperature scaling from .1 to 1 over 800 iterations, then from 1.0 to 2.0 to encourage a MAP solution. The Gaussian densities for acoustic token emissions are annealed to inverse temperature .3, to keep them comparable to the LM probabilities (Bahl et al., 1980).
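The cutoff step can be sketched in one dimension as follows (the actual model uses two-dimensional formant pairs, and the category parameters here are hypothetical):

```python
import math
import random

random.seed(0)

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical 1-D vowel categories: mean F1 (Hz) and variance.
categories = {"iy": (300.0, 2500.0), "ih": (420.0, 2500.0), "eh": (580.0, 3000.0)}

def beam_candidates(f1, current):
    """Beam-sampling step for one vowel token: draw a cutoff
    c ~ U[0, P(f1 | current category)], then keep only the categories
    whose density for this token exceeds the cutoff."""
    mu, var = categories[current]
    cutoff = random.uniform(0.0, normal_pdf(f1, mu, var))
    return [v for v, (m, s) in categories.items() if normal_pdf(f1, m, s) > cutoff]

cands = beam_candidates(310.0, current="iy")
print(cands)
```

Because the cutoff is drawn strictly below the current assignment's density, the current category always survives, so the restricted lattice always contains the previous trajectory.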
As in Feldman et al. (2013a), we use a table relabeling move which changes the word type for a single table in the unigram Chinese restaurant process by changing one of the vowels. This recategorizes a large number of tokens which share the same type (though not necessarily all, since there may be multiple unigram tables for the same word type). The implementation is tricky because of the bigram dependencies between adjacent words, some of which may be tokens of the same lexical item. Nonetheless, this move is necessary because token-level sampling has insufficient mobility to change the representation of a whole word type: if the sampler has incorrectly assigned many tokens to the non-word [hAv], moving any single token to the correct [haev] will raise the transducer probability but also catastrophically lower the lexical probability by creating a singleton lexical item.
Finally, because mu_v and Sigma_v are explicitly represented rather than integrated out, their values must be resampled given the set of formant values associated with each vowel cluster. The use of a conjugate (Normal-Inverse-Wishart) prior makes this simple, applying equations 250-254 in Murphy (2007).
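The conjugate update can be sketched as follows. The prior mean, inverse precision, and degrees of freedom match the settings reported above; kappa0 = 1 is an assumption, since the strength parameter n is not reported, and the formant tokens are toy values:

```python
import numpy as np

def niw_posterior(X, mu0, kappa0, Lambda0, nu0):
    """Normal-Inverse-Wishart conjugate update (Murphy 2007, eqs. 250-254):
    given the formant tokens X currently assigned to one vowel category,
    return the posterior parameters from which (mu_v, Sigma_v) are resampled."""
    N = X.shape[0]
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                      # scatter matrix
    kappa_n = kappa0 + N
    nu_n = nu0 + N
    mu_n = (kappa0 * mu0 + N * xbar) / kappa_n
    d = (xbar - mu0).reshape(-1, 1)
    Lambda_n = Lambda0 + S + (kappa0 * N / kappa_n) * (d @ d.T)
    return mu_n, kappa_n, Lambda_n, nu_n

# Toy formant tokens assigned to one vowel cluster.
X = np.array([[310.0, 2200.0], [290.0, 2350.0], [330.0, 2280.0]])
mu_n, kappa_n, Lambda_n, nu_n = niw_posterior(
    X, np.array([500.0, 1500.0]), 1.0, 500 * np.eye(2), 2.001)
# Resampling then draws Sigma_v ~ InverseWishart(nu_n, Lambda_n) and
# mu_v ~ Normal(mu_n, Sigma_v / kappa_n).
```

Since the posterior parameters are closed-form, each category's parameters can be resampled in a single Gibbs step once its tokens are fixed.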

Results
Despite using multiple block moves, mobility is a severe issue for the sampler; the inference procedure fails to merge together redundant vowel categories even when doing so would raise the posterior probability significantly. We demonstrate this by running the sampler with various numbers of vowel categories n v . Posterior probabilities peak around the true value of 12, but models with extra categories always use the entire set.
With n_v set to 11 or 12 categories, quantitative performance is relatively good, although segmentation is not as good as the baseline segmenter without any acoustics. In fact, the system slightly outperforms the Feldman et al. (2013a) lexical-distributional model with gold-standard segmentation. Results are shown in Table 1.
Word tokens are correctly segmented (both boundaries correct) with an F-score of 67%, versus 74% for the purely symbolic segmenter; joint model scores are averaged over two sampler runs. Individual boundaries are detected with an F-score of 82%
versus 87%. We also evaluate the lexical items, checking whether words are correctly grouped as well as segmented (for example, whether tokens of "is" and "as" are separated). Feldman et al. (2013a) evaluates the lexicon by computing a pairwise F-score on tokens (positive class: clustered together). Under this metric, their highest lexicon score for English words is 93%. We compute this metric on the subset of words for which the segmentation system performs correctly (it is not clear how to count "misses" and "false alarms" for tokens which were mis-segmented). On this subset, this metric scores our system with n_v = 12 at 91%, which indicates that we correctly identify most of the correctly segmented items.

We evaluate our phonetic clustering by computing the same pairwise F-score on pairs of vowel tokens. Our score is 83%; the Feldman et al. (2013a) model scores 76%. We conjecture that the improvement results from the use of bigram context information to disambiguate between homophones. Confusion between vowels (attached as supplemental material) is mostly reasonable. We find cross-clusters for ah/ao, ey/ih, and uh/uw. The model's successful learning of the vowel categories demonstrates that the high performance of cognitive models in this domain is not due solely to their access to gold-standard word boundaries (see also Martin et al. (2013)). We believe that the idealized acoustic values (sampled from stationary Gaussians reflecting laboratory production) are critical in allowing these models to outperform those which use natural speech.
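The pairwise clustering F-score used in both evaluations can be sketched as follows; the gold and predicted labels are a toy example, not our actual output:

```python
from itertools import combinations

def pairwise_f1(gold, pred):
    """Pairwise clustering F-score. Positive class: 'the two tokens are
    clustered together'. Every token pair is compared under the gold
    and predicted labelings."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        g = gold[i] == gold[j]
        p = pred[i] == pred[j]
        if g and p:
            tp += 1
        elif p and not g:
            fp += 1
        elif g and not p:
            fn += 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy example: one "ih" token wrongly merged into the "iy" cluster.
gold = ["iy", "iy", "ih", "ih", "eh"]
pred = [0, 0, 0, 1, 2]
print(pairwise_f1(gold, pred))  # -> 0.4
```

Because the metric compares pairs rather than matching cluster labels, it needs no alignment between induced categories and gold vowels.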
Though solving the two tasks together is harder than tackling either alone, these results nonetheless demonstrate performance comparable to other models which must cope with variability while segmenting. Fleck (2008) reports only 44% segmentation scores on transcribed English text including phonological variability; the noisy channel model of Elsner et al. (2013) yields a segmentation token score of 67%. Besides generic task difficulty, we attribute the low scores to the model's inability to mix, which prevents it from merging similar vowel classes. Because table relabeling does not merge tables in the CRP hierarchy, even if it replaces an uncommon word with a more common one, the configurational probability does not change. Thus the model's sparsity preference cannot encourage such moves. The prior on vowel categories, DP(alpha_v), does encourage changes which reduce the number of lexical types using a rare vowel, but relabeling a table can rearrange at most a single sample from this prior distribution and is easily outweighed by the likelihood.
A hand analysis of one sampler run in which /I/ was split into two categories showed clear mixing problems. Many common words, such as "it" and "this", appeared as duplicate lexical entries (e.g. [I 1 t] and [I 2 t]). These presumably captured some chance variation within the category, but not an actual linguistic feature.
We suspect that this mobility problem is also a likely issue with models like Lee and Glass (2012) which use deep Bayesian hierarchies and relatively local inference moves. Since the problem occurs even in this idealized setting, we expect it to exacerbate the problems caused by contextual variability in more realistic experiments.
Some errors did result from the joint nature of the task itself. We looked for reanalyses involving both a mis-segmentation and a vowel category mistake. For instance, the model is capable of misanalyzing the word "milk" as "me" followed by the phonotactically implausible sequence "lk". Mistakes like these, in which the misanalysis creates a word, are relatively rare as a proportion of the total. The most common words created are "say", "and", "shoe", "it" and "a". More commonly, misanalyses of this type segment out single vowels or nonwords like [luk], [eN], and [mO]. Some such errors could be corrected by incorporating phonotactics into the model. In general, the error patterns are neither particularly interpretable nor cognitively very plausible. This stands in contrast to the effects on word boundary detection found in a model of phonological variation (Elsner et al., 2013).

Conclusion
The main result of our work is that joint word segmentation and vowel clustering is possible, with relatively high effectiveness, by merging models known to be successful in each setting independently. The finding that success of this kind is possible in an idealized setting reinforces an argument made in previous work: that much of the difficulty in category acquisition is due to contextual variation.
Both phonological and phonetic variability probably contribute to the difficulty of the real task. Phonological processes such as reduction create variant versions of words, splitting real lexical items and creating misleading minimal pairs. Phonetic processes like coarticulation and compensatory lengthening create predictable variation within a category, encouraging the model to split the category into allophones. In future work, we hope to quantify the contributions of these sources of error and work to address them explicitly within the same model.