Blind phoneme segmentation with temporal prediction errors

Phonemic segmentation of speech is a critical step of speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists in analyzing the error profile of a model trained to predict speech features frame-by-frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.


Introduction
One of the main difficulties of speech processing, as opposed to text processing, is the continuous, time-dependent nature of the signal. As a consequence, pre-segmentation of the speech signal into words or sub-word units such as phonemes or syllables is an essential first step of a variety of speech recognition tasks.
Segmentation into phonemes is useful for a number of applications (annotation of speech for the purpose of phonetic analysis, computation of speech rate, keyword spotting, etc.), and can be done in two ways. Supervised methods are based on an existing phoneme or word recognition system, which is used to decode the incoming speech into phonemes. Phoneme boundaries can then be extracted as a by-product of the alignment of the phoneme models with the speech. Unsupervised methods (also called blind segmentation) consist in finding phoneme boundaries using the acoustic signal only. Supervised methods depend on the training of acoustic and language models, which requires access to large amounts of linguistic resources (annotated speech, phonetic dictionary, text). Unsupervised methods do not require these resources and are therefore appropriate for so-called under-resourced languages, such as endangered languages, or languages without consistent orthographies.
* This work was done when the author was an intern at LSCP / ENS / EHESS / CNRS
We propose a blind phoneme segmentation method based on short term statistical properties of the speech signal. We designate peaks in the error curve of a model trained to predict speech frame by frame as potential boundaries. Three different models are tested. The first is an approximated Markov model of the transition probabilities between categorical speech features. We then replace it with a recurrent neural network operating on the same categorical features. Finally, a recurrent neural network is directly trained to predict the raw speech features. This last model is especially interesting in that it couples our statistical approach with more common spectral transition based methods (e.g. Dusan and Rabiner, 2006).
We first describe the various models used and the pre- and post-processing procedures, before presenting and discussing our results in the light of previous work.

Related work
Most previous work on blind phoneme segmentation (Esposito and Aversano, 2005; Estevan et al., 2007; Almpanidis and Kotropoulos, 2008; Rasanen et al., 2011; Khanagha et al., 2014; Hoang and Wang, 2015) has focused on the analysis of the rate of change in the spectral domain. The idea is to design robust acoustic features that are supposed to remain stable within a phoneme, and change when transitioning from one phoneme to the next. These algorithms then define a measure of change, which is used to detect phoneme boundaries.
Apart from this line of research, three main approaches have been explored. The first idea is to use short term statistical dependencies. In Räsänen (2014), the idea was to first discretize the signal using a clustering algorithm and then compute discrete sequence statistics, over which a threshold can be defined. This is the idea that we follow in the current paper. The second approach is to use dynamic programming methods inspired by text segmentation (Wilber, 1988), in order to derive an optimal segmentation (Qiao et al., 2008). In this line of research, however, the number of segments is assumed to be known in advance, so this cannot count as blind segmentation. The third approach consists in jointly segmenting and learning the acoustic models for phonemes (Kamper et al., 2015; Glass, 2003; Siu et al., 2013). These models are much more computationally involved than the other methods. Interestingly, they all use a simpler, blind segmentation as an initialization phase. Therefore, improving on pure blind segmentation could be useful for joint models as well.
The principal source of inspiration for our work comes from previous work by Elman (1990) and Christiansen et al. (1998). In the former, the author uses recurrent neural networks to train character-based language models on text and notices that "The error provides a good clue as to what the recurring sequences in the input are, and these correlate highly with words." (Elman, 1990). More precisely, the error tends to be higher at the beginning of new words than in the middle. In the latter, the authors use Elman recurrent neural networks to predict boundaries between words given the character sequence and phonological cues.
Our work uses the same idea, using prediction error as a cue for segmentation, but with two important changes: we apply it to speech instead of text, and we use it to segment in terms of phoneme units instead of word units.

Pre-processing
We used two kinds of speech features: 13-dimensional MFCCs (Davis and Mermelstein, 1980) (with 12 mel-cepstrum coefficients and 1 energy coefficient) and categorical one-hot vectors derived from MFCCs, inspired by Räsänen (2014). The latter are computed according to Räsänen (2014): K-means clustering is performed on a random subset of the MFCCs (10,000 frames were selected at random), with a target number of clusters of 8, then each MFCC frame is assigned to the closest centroid. Each frame is then represented by a cluster number $c \in \{1, \ldots, 8\}$, or alternatively by the corresponding one-hot vector of dimension 8. These hyper-parameters were chosen according to Räsänen (2014). Figure 1 allows for a visual comparison of the three signals (waveform, MFCC, categorical).
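The quantization step can be sketched as follows. This is a minimal illustration, assuming a plain Lloyd's K-means written with NumPy; the function names (`fit_kmeans`, `quantize`, `one_hot`), the number of iterations, the random initialisation and the random stand-in data are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def quantize(frames, centroids):
    """Index of the closest centroid (Euclidean distance) for every frame."""
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def fit_kmeans(frames, n_clusters=8, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns centroids of shape (n_clusters, dim)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with distinct frames drawn at random.
    centroids = frames[rng.choice(len(frames), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = quantize(frames, centroids)
        for k in range(n_clusters):
            members = frames[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

def one_hot(labels, n_clusters=8):
    """One-hot encoding of the cluster indices."""
    return np.eye(n_clusters)[labels]

# Stand-in for real MFCCs: random 13-dimensional frames (hypothetical data).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(12000, 13))
subset = mfcc[rng.choice(len(mfcc), 10000, replace=False)]

centroids = fit_kmeans(subset)       # clustering on the random subset only
labels = quantize(mfcc, centroids)   # every frame mapped to its closest centroid
codes = one_hot(labels)              # categorical features, shape (12000, 8)
```

Note that the cluster indices in this sketch are 0-based (0 to 7), whereas the text numbers the clusters from 1 to 8.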
The entire dataset is split between a training and a testing subset. A randomly selected subset of the training part is used as validation data to prevent overfitting.

Training phase
A frame-by-frame prediction model is then learned on the training set. The three models used are described below.

Pseudo-Markov model
When trying to predict the frame $x_t$ given the previous frames $x_0^{t-1} := x_{t-1}, \ldots, x_0$, a simplifying assumption is to model the transition probabilities with a Markov chain of higher order $K$, i.e. $p(x_t \mid x_0^{t-1}) = p(x_t \mid x_{t-K}^{t-1})$. Provided each frame is part of a finite alphabet, a finite (albeit exponential in $K$) number of transition probabilities must be learned.
However, as the order rises, the ratio between the size of the data and the number of transition probabilities being learned makes the exact calculation more difficult and less relevant.
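In order to circumvent this issue, we approximate the $K$-order Markov chain with the mean of the first-order Markov chains of the lag-transition probabilities $p(x_t \mid x_{t-i})$ for $1 \le i \le K$: $p(x_t \mid x_{t-K}^{t-1}) \approx \frac{1}{K} \sum_{i=1}^{K} p(x_t \mid x_{t-i})$. In practice, we chose $K = 6$, thus ensuring that the Markov model's attention span is of the same order of magnitude as the length of a phoneme.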
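The averaging of lag-transition probabilities described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names (`fit_lag_transitions`, `predict_next`), the add-one smoothing of the counts and the toy sequence are hypothetical choices made for the example, and the estimation procedure actually used may differ.

```python
import numpy as np

def fit_lag_transitions(sequences, n_symbols=8, max_lag=6, smoothing=1.0):
    """Estimate p(x_t = b | x_{t-i} = a) for each lag i = 1..max_lag.

    Returns an array of shape (max_lag, n_symbols, n_symbols); each row
    trans[i - 1, a] is a probability distribution over the next symbol b.
    The additive smoothing is an assumption of this sketch.
    """
    counts = np.full((max_lag, n_symbols, n_symbols), smoothing)
    for seq in sequences:
        seq = np.asarray(seq)
        for lag in range(1, max_lag + 1):
            # Pair every frame with the frame `lag` steps earlier.
            np.add.at(counts[lag - 1], (seq[:-lag], seq[lag:]), 1.0)
    return counts / counts.sum(axis=2, keepdims=True)

def predict_next(trans, history):
    """Mean of the lag-transition distributions selected by the last frames."""
    max_lag = trans.shape[0]
    lags = range(1, min(max_lag, len(history)) + 1)
    rows = [trans[lag - 1, history[-lag]] for lag in lags]
    return np.mean(rows, axis=0)

# Toy categorical sequence with 3 symbols (hypothetical data).
train = [[0, 0, 1, 1, 1, 2, 2, 0, 0, 1, 1, 2]]
trans = fit_lag_transitions(train, n_symbols=3, max_lag=2)

p = predict_next(trans, [0, 0, 1])   # predicted distribution for the next frame
error = -np.log(p[2])                # prediction error if the next frame is 2
```

Since each lag only requires a table of size `n_symbols` squared, the number of parameters grows linearly with K instead of exponentially.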
Compared to Räsänen (2014), this model only uses information from previous frames and as such is completely online.

Recurrent neural network on categorical features
Alternatively to Markov chains, the transition probability $p(x_t \mid x_0^{t-1})$ can be modeled by a recurrent neural network (RNN). RNNs can theoretically model temporal dependencies of indefinite order, hence their advantage over Markov chains for long sequence modeling.
Given a set of examples $\{(x_t, x_0^{t-1}) \mid t \in \{0, \ldots, t_{\max}\}\}$, the network's parameters are learned so that the error $E(x_t, \mathrm{RNN}(x_0^{t-1}))$ is minimized using back propagation through time (Werbos, 1990) and stochastic gradient descent or a variant thereof (we have found RMSProp (Tieleman and Hinton, 2012) to give the best results).
In our case, the network itself consists of two LSTM layers (Hochreiter and Schmidhuber, 1997) stacked on one another followed by a linear layer and a softmax. The input and output units both have dimension 8, whereas all other layers have the same hidden dimension of 40. Dropout (Srivastava et al., 2014) with probability 0.2 was used after each LSTM layer to prevent overfitting.
A pitfall of this method is the tendency of the network to predict the last frame it is fed. This is due to the fact that the sequences of categorical features extracted from speech contain a lot of constant sub-sequences of length 2.
As a consequence, around 80% of the data fed to the network consists of sub-sequences where $x_t = x_{t-1}$. Despite the fact that phone boundaries are somewhat correlated with changes of categories (around 65% of the time), this leads the network to a local minimum where it only tries to predict the same characters.
To mitigate this effect, examples where $x_t = x_{t-1}$ were removed with probability 0.8, so that the number of transitions was slightly skewed towards category transitions. The model still passed over all frames during training but the error was back-propagated for only 46% of them. This change led to substantial improvement.
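The sub-sampling of repeated frames can be pictured as a random mask over the training targets. The sketch below is only an illustration: the function name `loss_mask`, the seed handling and the synthetic label sequence are assumptions, and the way the mask is plugged into the training loop is left out.

```python
import numpy as np

def loss_mask(labels, drop_prob=0.8, seed=0):
    """Boolean mask over t = 1..T-1: True where the error is back-propagated.

    Frames that repeat the previous category (x_t == x_{t-1}) are dropped
    with probability `drop_prob`; category transitions are always kept.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    repeated = labels[1:] == labels[:-1]
    dropped = repeated & (rng.random(len(repeated)) < drop_prob)
    return ~dropped

# Synthetic categorical sequence made of runs of identical frames
# (hypothetical data, not TIMIT).
rng = np.random.default_rng(1)
labels = np.repeat(rng.integers(0, 8, size=200), 5)
mask = loss_mask(labels)
kept_fraction = mask.mean()   # share of frames contributing to the loss
```

The network would still read every frame as input; only the loss terms at positions where `mask` is False are ignored.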

Recurrent neural network on raw MFCCs
The recurrent neural network model can be adapted to raw speech features simply by changing the loss function from categorical cross-entropy to mean squared error, which is the direct translation from a categorical distribution to a Gaussian density: the Kullback-Leibler divergence between two $d$-dimensional normal distributions centered at $x$ and $y$ with the same scalar covariance matrix $\sigma^2 I$ is $\frac{1}{2\sigma^2} \|x - y\|_2^2$, which is proportional to the squared error.
We used the same architecture as in the categorical case, simply removing the softmax layer and decreasing the hidden dimension size to 20. In this case, no selection of the samples is needed since the sequences vary continuously.

Test phase
Each model is run on the test set and the prediction error is calculated at each time step. In each case this error corresponds, up to a scaling factor constant across the dataset, to the Kullback-Leibler divergence between the predicted and actual probability distributions for $x_t$ in the feature space.
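As a hedged illustration, per-frame errors consistent with the losses described in the training phase can be written as follows. The symbols $\hat{p}_t$ (predicted categorical distribution), $\hat{x}_t$ (predicted MFCC frame), $\delta_{x_t}$ (one-hot distribution of the observed frame) and $\sigma^2$ (shared variance) are notation introduced here for the example and are not taken from the original formula.

```latex
% Categorical models (pseudo-Markov model, RNN on categorical features):
% KL divergence between the one-hot distribution of the observed frame
% and the predicted distribution.
E_{\mathrm{cat}}(t)
  = \mathrm{KL}\left(\delta_{x_t} \,\|\, \hat{p}_t\right)
  = -\log \hat{p}_t(x_t)

% RNN on raw MFCCs: KL divergence between two Gaussians sharing the scalar
% covariance \sigma^2 I, centered at the observed and predicted frames.
E_{\mathrm{mfcc}}(t)
  = \mathrm{KL}\left(\mathcal{N}(x_t, \sigma^2 I) \,\|\, \mathcal{N}(\hat{x}_t, \sigma^2 I)\right)
  = \frac{1}{2\sigma^2} \left\| x_t - \hat{x}_t \right\|_2^2
```

In the second case the factor $1 / (2\sigma^2)$ plays the role of the scaling factor that is constant across the dataset.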
Since all three systems predict probabilities conditioned on the preceding frames, they cannot be expected to give meaningful results for the first frames of each utterance. To be consistent, the first 7 frames (70 ms) of the error signal for each utterance were set to 0. A peak detection procedure is then applied to the resulting error. As we are looking for sudden bursts in the prediction error, a local maximum is labeled as a potential boundary if and only if the difference between its value and that of the previous minimum is greater than a certain threshold δ.
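The peak detection rule can be sketched as below. This is a minimal reading of the procedure, assuming one error value per frame; the function name `detect_boundaries`, the handling of plateaus and the toy error curve are choices made for the example.

```python
import numpy as np

def detect_boundaries(error, delta, n_skip=7):
    """Frame indices of local maxima that rise more than `delta` above
    the preceding local minimum of the error curve."""
    err = np.asarray(error, dtype=float).copy()
    err[:n_skip] = 0.0                       # first frames are not meaningful
    boundaries = []
    last_min = err[0]
    for t in range(1, len(err) - 1):
        if err[t] <= err[t - 1] and err[t] < err[t + 1]:
            last_min = err[t]                # local minimum
        elif err[t] > err[t - 1] and err[t] >= err[t + 1]:
            if err[t] - last_min > delta:    # local maximum with a large rise
                boundaries.append(t)
    return boundaries

# Toy error curve (hypothetical values): 7 zeroed frames, then three peaks.
error = [0.0] * 7 + [0.1, 0.9, 0.2, 0.3, 0.25, 1.5, 0.4]

strict = detect_boundaries(error, delta=0.5)    # keeps only the two large peaks
loose = detect_boundaries(error, delta=0.05)    # also keeps the small peak
```

Lowering δ adds boundaries and therefore trades precision for recall, which is how a precision/recall curve can be traced from a single error signal.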

Dataset
We evaluated our methods on the TIMIT dataset (Fischer et al., 1986). The TIMIT dataset consists of 6300 utterances (∼ 5.4 hours) from 630 speakers spanning 8 dialects of the English language. The corpus was divided into a training and test set according to the standard split. The training set contains 4620 utterances (172,460 boundaries) and the test set 1680 (65,825 boundaries).

Evaluation
The performance evaluation of our system is based on precision (P), recall (R) and F-score, defined as the harmonic mean of precision and recall. A drawback of this metric is that high-recall, low-precision results, such as those produced by hypothesizing a boundary every 5 ms (P: 58%, R: 91%), yield a high F-score (70%).
Other metrics have been designed to tackle this issue. One such example is the R-value (Räsänen et al., 2009):
$$r_1 = \sqrt{(1 - R)^2 + OS^2}, \qquad r_2 = \frac{-OS + R - 1}{\sqrt{2}}, \qquad \text{R-value} = 1 - \frac{|r_1| + |r_2|}{2} \tag{3}$$
where $OS = \frac{R}{P} - 1$ is the over-segmentation measure. The R-value represents how close the segmentation is to the ideal point ($OS = 0$, $R = 1$) and to the $P = 1$ line in the $(R, OS)$ space. Further details can be found in Räsänen et al. (2009).
Determining whether a gold boundary is detected or not is a crucial part of the evaluation procedure. On our test set for instance, which contains 65,825 gold boundaries partitioned into 1,680 files, adding or removing one correctly detected boundary per utterance leads to a change of ± 2.5% in precision. This means that minor changes in the evaluation process (such as removing the trailing silence parts of each file, or removing the opening and closing boundaries) yield non-trivial variations in the end result.
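The F-score and the R-value of Räsänen et al. (2009) can be compared on the periodic baseline quoted earlier (P = 58%, R = 91%). The short sketch below uses the standard definitions; the function names are chosen for the example, and precision and recall are expressed as fractions rather than percentages.

```python
import math

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def r_value(precision, recall):
    """R-value: closeness to the ideal point (OS = 0, R = 1) and to the P = 1 line."""
    over_seg = recall / precision - 1            # over-segmentation OS
    r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    return 1 - (abs(r1) + abs(r2)) / 2

# Periodic baseline from the text: high recall, low precision.
baseline_f = f_score(0.58, 0.91)
baseline_r = r_value(0.58, 0.91)
```

On this baseline the F-score is about 0.71 while the R-value is about 0.48, which illustrates how the R-value penalizes over-segmentation more strongly.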
A common condition for a gold boundary to be considered as 'correctly detected' is to have a proposed boundary within a 20 ms distance on either side. Without any other specification, this means that a proposed boundary may be matched to several gold boundaries, provided these are within 40 ms from each other, leading to an increase of up to 4% F-score in some of our results (74%-78%). Unfortunately this point is seldom detailed in the literature.
We decided to use the procedure described in Räsänen et al. (2009) to match gold boundaries and hypothesized boundaries: overlapping tolerance windows are cropped at the midpoint between the two boundaries.
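The effect of cropping the tolerance windows can be sketched as follows. This is one possible reading of the matching rule, not the reference implementation: the function name `count_hits`, the half-open windows and the convention of counting hits per gold boundary are assumptions of the example. Times are in milliseconds.

```python
def count_hits(gold, hyp, tol=20, crop=True):
    """Number of gold boundaries with a hypothesized boundary in their window.

    Each gold boundary g owns the window [g - tol, g + tol). With crop=True,
    windows of neighbouring gold boundaries are cut at their midpoint, so a
    hypothesized boundary can match at most one gold boundary.
    """
    gold = sorted(gold)
    hits = 0
    for i, g in enumerate(gold):
        lo, hi = g - tol, g + tol
        if crop and i > 0:
            lo = max(lo, (gold[i - 1] + g) / 2)   # midpoint with previous boundary
        if crop and i + 1 < len(gold):
            hi = min(hi, (g + gold[i + 1]) / 2)   # midpoint with next boundary
        if any(lo <= h < hi for h in hyp):
            hits += 1
    return hits

# Two gold boundaries 30 ms apart and one isolated boundary (hypothetical data).
gold = [100, 130, 300]
hyp = [118, 310]

cropped_hits = count_hits(gold, hyp)             # 118 only matches the boundary at 130
full_hits = count_hits(gold, hyp, crop=False)    # 118 matches both 100 and 130

precision = cropped_hits / len(hyp)
recall = cropped_hits / len(gold)
```

With full windows the single hypothesis at 118 ms is credited to two gold boundaries, which is the kind of inflation the cropped windows are meant to avoid.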

Results
The current state of the art in blind phoneme segmentation on the TIMIT corpus is provided by Hoang and Wang (2015). It reaches 78.16% F-score and 81.11 R-value on the training part of the dataset, using an evaluation method similar to our own.
In Tables 1 and 2 we compare our best results to the previous statistical approach described in Räsänen (2014) and to the naive periodic boundaries segmentation (one boundary every 5 ms). Since Räsänen (2014) used an evaluation method allowing for tolerance windows to overlap, we provide our results with both evaluation methods (full windows and cropped windows) for the sake of consistency.
Another main difference with Räsänen (2014) is that its results are given on the core test set of TIMIT, whereas our results are given on the full test set.

Figure 2: Precision/recall curves for our various models when varying the peak detection threshold δ

Figure 2 provides an overview of the precision/recall scores when varying the peak detection threshold (and, in the case of periodic boundaries, the period). This gives some insight into the actual behavior of the various algorithms, especially in the high precision, low recall region where the RNN on actual MFCCs seems to outperform the methods based on discrete features. We provide Figure 3 as a qualitative assessment of the error profiles of all three algorithms on one specific utterance. Notably, the error profile of the Markov model contains distinct isolated peaks of similar height. As expected, the error curve is much noisier in the case of the RNN on MFCCs, due to the greater variability in the feature space.

Discussion
In terms of optimal F-score and R-value, the simple Markov model outperformed the previously published method using short term sequential statistics (Räsänen, 2014), as well as the recurrent neural networks. However, these optimal values may mask the differential behavior of these algorithms in different sections of the precision/recall curve.
In particular, it is interesting to notice that the neural network based model trained on the raw MFCCs gave very good results in the low recall, high precision domain. Indeed, the precision can reach 90% with a recall of 40%. Such a regime could be useful, for instance, if blind phoneme segmentation is used to help with word segmentation.
The higher precision of this neural network may be explained by the fact that it is sensitive both to the sequential statistical regularities of the signal and to its spectral variations: the error is also correlated with spectral changes, meaning that some peaks are associated with a high error because the Euclidean distance $\|x_{t+1} - x_t\|_2$ itself is large. This is why the height difference is much more significant in this case.
Although we only reported the best results, we also tested our approach with two other neural network architectures: a single vanilla RNN and a single LSTM cell. Neither architecture yielded significantly different results (∼ 1-2% F-score, mainly dropping precision). Similarly, different hidden dimensions were tested. In the extreme cases (very low, 8, or very high, 128), the output signal proved too noisy to be of any significance, yielding results comparable to naive periodic segmentation.
It is worth mentioning that our approach does not make any language-specific assumption, and as such similar results are to be expected on other languages. We leave the confirmation of this assumption to future work.
We have presented a lightweight blind phoneme segmentation method predicting boundaries at peaks of the prediction loss of transition probability models. The different models we tested produced satisfactory results while remaining computationally tractable, requiring only one pass over the data at test time.
Our recurrent neural network trained on speech features in particular hints at a way of combining both the statistical and spectral information into a single model.
From a machine learning point of view, we highlighted the use that can be made of side channel information (in this case the test error) in order to extract structure from raw data in an unsupervised setting.
Future work may involve exploring different RNN models, assessing the stability of these methods on simpler features such as raw spectrograms or waveforms, or exploring the representation of each frame in the hidden layers of the networks.