Cache-Augmented Latent Topic Language Models for Speech Retrieval

We aim to improve speech retrieval performance by augmenting traditional N-gram language models with different types of topic context. We present a latent topic model framework that treats documents as arising from an underlying topic sequence combined with a cache-based repetition model. We analyze the proposed model both for its ability to capture word repetition via the cache and for its suitability as a language model for speech recognition and retrieval. We show that the cache-augmented model captures intuitive repetition behavior and exhibits lower perplexity than standard LDA on held-out data in multiple languages. Lastly, we show that our joint model improves speech retrieval performance beyond N-grams or latent topics alone when applied to a term detection task in all languages considered.


Introduction
The availability of spoken digital media continues to expand at an astounding pace. According to YouTube's publicly released statistics, between August 2013 and February 2015 content upload rates have tripled from 100 to 300 hours of video per minute (YouTube, 2015). Yet the information content therein, while accessible via links, tags, or other user-supplied metadata, is largely inaccessible via content search within the speech.
Speech retrieval systems typically rely on Large Vocabulary Continuous Speech Recognition (LVCSR) to generate a lattice of word hypotheses for each document, indexed for fast search (Miller and others, 2007). However, for sites like YouTube, localized in over 60 languages (YouTube, 2015), the likelihood of high-accuracy speech recognition in most languages is quite low.
Our proposed solution is to focus on topic information in spoken language as a means of dealing with errorful speech recognition output in many languages. It has been repeatedly shown that a task like topic classification is robust to systems with high (40-60%) word error rates (Peskin, 1996; Wintrode, 2014b). We aim to leverage the topic signal's strength for retrieval in a high-volume, multilingual digital media processing environment.
The English word topic, defined as a particular 'subject of discourse' (Houghton-Mifflin, 1997), arises from the Greek root τόπος, meaning a physical 'place' or 'location'. However, the semantic concepts of a particular subject are not disjoint from the physical location of the words themselves.
The goal of this work is to jointly model, in a latent topic framework, two aspects of topic information: local context (repetition) and broad context (subject matter), which we previously treated in an ad hoc manner (Wintrode and Khudanpur, 2014). We show that in doing so we can achieve better word retrieval performance than language models with only N-gram context, on a diverse set of spoken languages.

Related Work
The use of both repetition and broad topic context has been exploited in a variety of ways by the speech recognition and retrieval communities. Cache-based or adaptive language models were some of the first approaches to incorporate information beyond a short N-gram history (where N is typically 3-4 words).
Cache-based models assume the probability of a word in a document d is influenced not only by the global frequency and N-gram context of that word, but also by the N-gram frequencies within d (or within a preceding cache of K words). Although most words are rare at the corpus level, when they do occur, they occur in bursts; thus a local estimate from the cache may be more reliable than the global estimate. Jelinek (1991) and Kuhn (1990) both successfully applied these types of models to speech recognition, and Rosenfeld (1994), using what he referred to as 'trigger pairs', also realized significant gains in WER. More recently, recurrent neural network language models (RNNLMs) have been introduced to capture more of these "long-term dependencies" (Mikolov et al., 2010). In terms of speech retrieval, recent efforts have looked at exploiting repeated keywords at search time, without directly modifying the recognizer (Chiu and Rudnicky, 2013; Wintrode, 2014a).
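As an illustration, the unigram case of such a cache interpolation can be sketched as follows (the function and weight names are our own; the cited cache LMs additionally condition on N-gram histories and may decay older cache entries):

```python
from collections import Counter

def cache_interpolated_prob(word, global_unigram, cache_words, lam=0.2):
    """Illustrative unigram cache LM: interpolate a global unigram
    estimate with a document-local estimate from the cache of
    recently observed words, with a fixed mixing weight lam."""
    if cache_words:
        cache_prob = Counter(cache_words)[word] / len(cache_words)
    else:
        cache_prob = 0.0
    return (1 - lam) * global_unigram.get(word, 0.0) + lam * cache_prob
```

Bursty words receive far more probability locally than their corpus-level frequency would suggest, which is the effect the cache is meant to capture.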
Work within the information retrieval (IR) community connects topicality with retrieval. Hearst and Plaunt (1993) reported that the "subtopic structuring" of documents can improve full-document retrieval. Topic models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001) have been used to augment the document-specific language model in probabilistic, language-model-based IR (Wei and Croft, 2006; Chen, 2009; Liu and Croft, 2004; Chemudugunta et al., 2007). In all these cases, topic information helped boost retrieval performance above baseline vector space or N-gram models.
Our proposed model closely resembles that of Chemudugunta et al. (2007), with our notions of broad and local context corresponding to their "general and specific" aspects. The unigram-cache case of our model should correspond to their "special words" model; however, we do not constrain our cache component to unigrams.
With respect to speech recognition, Florian and Yarowsky (1999) and Khudanpur and Wu (1999) use vector-space clustering techniques to approximate the topic content of documents and augment a baseline N-gram model with topic-specific N-gram counts. Clarkson and Robinson (1997) proposed a similar application of cache and mixture models, but demonstrated only small perplexity improvements. Similar approaches use latent topic models to infer a topic mixture of the test document (soft clustering), with significant recognition error reductions (Heidel et al., 2007; Hsu and Glass, 2006; Liu and Liu, 2008; Huang and Renals, 2008). Instead of interpolating with a traditional backoff model, Chien and Chueh (2011) use topic models, with and without a dynamic cache, to good effect as a class-based language model. We build on the cluster-oriented results, particularly Khudanpur and Wu (1999) and Wintrode and Khudanpur (2014), but within an explicit framework that jointly captures both types of topic information that many have leveraged individually.

Cache-augmented Topic Model
We propose a straightforward extension of the LDA topic model (Blei et al., 2003;Steyvers and Griffiths, 2007), allowing words to be generated either from a latent topic or from a document-level cache. At each word position we flip a biased coin. Based on the outcome we either generate a latent topic and then the observed word, or we pick a new word directly from the cache of already observed words. Thus we would jointly learn the underlying topics and the tendency towards repetition.
As with LDA, we assume each corpus is drawn from T latent topics. Each topic φ^(t) is a multinomial random variable the size of the vocabulary, and each document d has topic proportions θ^(d), where θ^(d)_t is the probability P(t|d).
We introduce two additional sets of variables, κ^(d) and k_{d,i}. The state k_{d,i} is a Bernoulli variable indicating whether a word w_{d,i} is drawn from the cache or from a latent topic; κ^(d) is the document-specific prior on the cache state k_{d,i}.
Algorithm 1 gives the generative process explicitly. We place a Beta prior on each κ^(d), the parameter of the Bernoulli variables k_{d,i}. As with the Dirichlet priors, this allows a straightforward formulation of the joint probability P(W, Z, K, Φ, Θ, κ), from which we derive the densities for Gibbs sampling. A plate diagram is provided in Figure 1, illustrating the dependence both on the latent variables and on the cache of previous observations. We implement our model as a collapsed Gibbs sampler, extending Java classes from the Mallet topic modeling toolkit (McCallum, 2002). We use the Gibbs sampler for parameter estimation (on training data) and inference (on held-out data), and we leverage Mallet's hyperparameter re-estimation (Wallach et al., 2009), which we apply to α, β, and ν.
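The generative process can be sketched as follows. This is a simplified illustration of Algorithm 1, not the Mallet implementation: the hyperparameter names a_nu and b_nu for the Beta prior and the uniform draw from the cache are our illustrative assumptions.

```python
import numpy as np

def generate_document(n_words, phi, alpha, a_nu, b_nu, rng):
    """Sketch of the cache-augmented generative process.
    phi: T x V matrix of topic-word multinomials phi^(t).
    For each document, draw topic proportions theta^(d) and a cache
    probability kappa^(d); at each position flip a kappa-biased coin
    to decide between the cache and a latent topic."""
    T, V = phi.shape
    theta = rng.dirichlet([alpha] * T)   # theta^(d): topic proportions
    kappa = rng.beta(a_nu, b_nu)         # kappa^(d): cache probability
    words, cache = [], []
    for _ in range(n_words):
        use_cache = rng.random() < kappa          # k_{d,i} ~ Bernoulli(kappa)
        if use_cache and cache:
            w = cache[rng.integers(len(cache))]   # repeat a cached word
        else:
            z = rng.choice(T, p=theta)            # draw latent topic z_{d,i}
            w = rng.choice(V, p=phi[z])           # then word w_{d,i}
        words.append(w)
        cache.append(w)
    return words
```

Inference inverts this process, jointly estimating the topics and each document's tendency toward repetition.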

Language Modeling
Our primary goal in constructing this model is to apply it to language models for speech recognition and retrieval. Given an LVCSR system with a standard N-gram language model (LM), we now describe how we incorporate the inferred topic and cache model parameters of a new document into the base LM for subsequent recognition tasks on that specific document.
We begin by estimating model parameters on a training corpus: topics φ^(t), cache proportions κ^(d), and the hyperparameters α, β, and ν (the Beta hyperparameter). In our experiments we restrict the training set to the LVCSR acoustic and language model training data. This restriction is imposed by the Babel task, not the model; other corpora or text resources should certainly be considered for other tasks.
To apply the model during KWS, we first decode a new audio document d with the base LM P_L and extract the most likely observed word sequence W for inference. The inference process gives us estimates of θ^(d) and κ^(d), which we then use to compute document-specific and cache-augmented language models.
From a language modeling perspective we treat the multinomials φ^(t) as unigram LMs and use the inferred topic proportions θ^(d) as a set of mixture weights. From these we compute the document-specific unigram model for d (Eqn. 1). This serves to capture what we have referred to as the broad topic context.
We incorporate both P_d and the cache P_c (local context) into the base model P_L using linear interpolation of probabilities; word histories are denoted h_i for brevity. For our experiments we first combine P_d with the N-gram model (Eqn. 2), then interpolate with the cache model to obtain a joint topic and cache language model (Eqn. 4).
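A minimal sketch of this combination follows. The topic mixture of Eqn. 1 is exactly as described; the interpolation weight lam for the topic model and the exact chaining of Eqns. 2-4 are our assumptions, since the equations themselves are not reproduced here.

```python
import numpy as np

def document_unigram(theta, phi):
    """Eqn. 1: mix the topic unigram LMs phi (T x V) by the inferred
    proportions theta (length T) to get the document model P_d."""
    return theta @ phi

def joint_topic_cache_lm(p_ngram, p_d, p_c, lam, kappa):
    """Illustrative linear interpolation: fold P_d into the base
    N-gram model with weight lam, then interpolate the cache model
    P_c with the inferred per-document weight kappa."""
    p_ld = (1 - lam) * p_ngram + lam * p_d   # topic-adapted model
    return (1 - kappa) * p_ld + kappa * p_c  # add local (cache) context
```

Because each step is a convex combination, the result remains a valid probability distribution over the vocabulary.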
We expect the inferred document cache probability κ^(d) to serve as a natural interpolation weight for combining the document-specific unigram model and the cache into P_dc. We consider alternatives to the per-document κ^(d) as part of the speech retrieval evaluation (Section 6) and show that our model's estimate is indeed effective.

Model Analysis
Before looking at the model in terms of retrieval performance (Section 6), we examine how our model captures the repetition in each corpus and how well it functions as a language model (cf. Eqn. 3) in terms of perplexity.
To focus on language models for speech retrieval in the limited-resource setting, we build and evaluate our model under the IARPA Babel Limited Language Pack (LP), No Target Audio Reuse (NTAR) condition (Harper, 2011). We selected the Tagalog, Vietnamese, Zulu, and Tamil corpora to expose our model to as diverse a set of languages as possible (in terms of morphology, phonology, language family, etc.), in line with the Babel program goals.
The Limited LP includes a 10-hour training set (audio and transcripts), which we use for building acoustic and language models; we estimate the parameters of our topic model from the same training data. The Babel corpora contain spontaneous conversational telephone speech, and without the constrained topic prompts of LDC's Fisher collections we would expect a sparse collection of topics. Yet for retrieval we are nonetheless able to leverage this information.
We estimate the parameters φ^(t), κ^(d), α, β, and ν on the training transcripts in each language, then use these parameters to infer θ^(d) (topic proportions) and κ^(d) (cache usage) for each document in the held-out set. We use the inferred κ^(d) and θ^(d) to perform the language model interpolation (Eqns. 3, 4). Moreover, the mean of the inferred κ^(d) values for a corpus ought to provide a snapshot of the amount of repetition within it.
Two trends emerge when we examine the mean of κ^(d) by language. First, as shown in Table 1, the more latent topics used, the lower the inferred κ values. Regardless of the absolute value, κ for Vietnamese is consistently higher than for the other languages. This fits our intuition, given that the Vietnamese transcripts use syllable-level word units, for which we would expect more repetition.
Second, we consider which words are drawn from the cache versus the topics during the inference process. Examining the final sampling state, we count how often each word in the vocabulary is drawn from the cache (i.e., where k_{d,i} = 1). This count is highly correlated (ρ > 0.95) with the corpus frequency of each word (cf. Figure 2). That is, cache states are assigned to the word types most likely to repeat.

Perplexity
While our measurements of cache usage correspond to intuition, our primary goal is to construct useful language models. After estimating parameters on the training corpora, we infer κ^(d) and θ^(d), then measure perplexity on the development set using the document-specific language models.
We compute perplexity on the topic unigram mixtures according to P_d and P_dc (Eqns. 1 & 3). Here we do not interpolate with the base N-gram LM, so as to compare only the unigram mixtures. Table 2 gives the perplexity for standard LDA (P_d only) and for our model with and without the cache added.
With respect to perplexity, interpolating with the cache (κLDA) provides a significant reduction for all languages and values of T. In general, perplexity decreases as the number of latent topics increases, excepting certain Zulu and Tamil models. For Tagalog and Vietnamese our cache-augmented model outperforms the standard LDA model in terms of perplexity. However, as we will see in the next section, the lowest-perplexity models are not necessarily the best in terms of retrieval performance.
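Perplexity here is the standard per-word measure, the exponentiated average negative log-probability of the held-out words under the model; a minimal sketch:

```python
import math

def perplexity(word_probs):
    """Per-word perplexity of a sequence, given the probability the
    model assigns to each word in it. Lower is better; a uniform
    model over V words has perplexity V."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```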

Speech Retrieval
We evaluate the utility of our topic language model for speech retrieval via the term detection, or keyword search (KWS), task. Term detection accuracy is the primary evaluation metric for the Babel program. We use the topic- and cache-augmented language models (Eqn. 4) to improve the speech recognition stage of the term detection pipeline, increasing overall search accuracy by 0.5 to 1.7% absolute over a typical N-gram language model.

The term detection task is this: given a corpus of audio documents and a list of terms (words or phrases), locate all occurrences of the terms in the audio. The resulting list of detections is scored with the Term-Weighted Value (TWV) metric, a cost-value trade-off between the miss probability, P(miss), and the false alarm probability, P(FA), averaged over all keywords (NIST, 2006). For comparison with previously published results, we score against the IARPA-supplied evaluation keywords.
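Given per-keyword miss and false-alarm probabilities, TWV can be sketched as follows (beta = 999.9 is the false-alarm weight used in the NIST/Babel keyword search evaluations; how P(miss) and P(FA) are estimated from the detection lists is specified in the evaluation plan and not reproduced here):

```python
def term_weighted_value(p_miss, p_fa, beta=999.9):
    """TWV = 1 - average over keywords of [P(miss) + beta * P(FA)].
    A perfect system scores 1.0; a system returning nothing scores 0.0;
    false alarms can drive the value negative."""
    assert len(p_miss) == len(p_fa)
    costs = [pm + beta * pf for pm, pf in zip(p_miss, p_fa)]
    return 1.0 - sum(costs) / len(costs)
```

The large beta means even a small false-alarm probability is heavily penalized, which is why re-scoring must sharpen keyword hypotheses without hallucinating new ones.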
We train acoustic and language models (LMs) on the 10-hour training set using the Kaldi toolkit (Povey and others, 2011), following the training recipe described in detail by Trmal et al. (2014). While Kaldi produces different flavors of acoustic models, we report results using the hybrid HMM-DNN (deep neural network) acoustic models, trained with a minimum phone error (MPE) criterion and based on PLP (perceptual linear prediction) features augmented with pitch. All results use 3-gram LMs with Good-Turing (Tagalog, Zulu, Tamil) or modified Kneser-Ney (Vietnamese) smoothing. This AM/LM combination (our baseline) has consistently demonstrated state-of-the-art single-system performance on the Babel task.
As described above, we estimate our model parameters φ^(t), κ^(d), α, β, and ν from the training transcripts. We decode the development corpus with the baseline models, then infer θ^(d) and κ^(d) from the first-pass output. In principle we then simply compute P_Ldc for each document, re-score the first-pass output, and search for keywords.
Practical considerations for cache language models include how large the cache should be and whether it should decay, discounting words further from the current position. In the Kaldi framework, speech is processed in segments (e.g., conversation turns), and the current tools do not allow the language model to vary dynamically within a segment. With that in mind, our KWS experiments construct a different language model P_Ldc for each segment, where P_c is computed from all segments in the current document except the one being processed.
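The per-segment cache construction amounts to a leave-one-segment-out count (an illustrative sketch of the counting step only; the actual experiments build a full language model per segment from these counts):

```python
from collections import Counter

def segment_cache_counts(segments):
    """For each segment of a document, gather cache word counts from
    all *other* segments, since the LM cannot change mid-segment.
    segments: list of token lists. Returns one Counter per segment."""
    total = Counter(w for seg in segments for w in seg)
    # Subtract each segment's own words from the document total.
    return [total - Counter(seg) for seg in segments]
```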

Results
We show, by re-scoring LVCSR output with a cache-augmented topic LM, that the document-specific topic (P_d) and cache (P_c) information together improve our overall KWS performance in each language, by up to 1.7% absolute. Figure 3 illustrates search accuracy (TWV) for each language under various settings of T. It also captures alternatives to using κ^(d) as an interpolation weight for the cached unigrams: to illustrate the contrast, we substituted the training mean κ_train for κ^(d) when computing P_Ldc (Eqn. 4). Except for Zulu, the inferred κ^(d) proved at least as effective as the training mean.

The effect of the number of latent topics T on search accuracy also varies by language, as does the overall effect of incorporating the cache in addition to latent topics (with vs. without the cache). For example, in Tagalog we observe most of the improvement over N-grams from the cache information, whereas in Tamil the cache provided no additional information over latent topics.
The search accuracies for the best systems from Figure 3 are shown in Table 3 with the corresponding choices of T. Effects on WER were mixed under the cache model: Zulu improved from 67.8 to 67.6% and Tagalog degraded from 60.8 to 61.1%, with Vietnamese and Tamil unchanged.

Conclusions and Future Work
In this initial effort at formulating a model combining latent topics with a cache-based language model, we believe we have presented a model that estimates informative and useful parameters from the data and supports improved speech retrieval performance. The results presented here reinforce the conclusion that topics and repetition, broad and local context, are complementary sources of information for speech language modeling tasks.

We hope to address two particular limitations of our model in the near future. First, all of our improvements are obtained by adding unigram probabilities to a 3-gram language model; we would naturally want to extend our model to explicitly capture the cache and topic behavior of N-grams.
Second, our models are restricted by the first-pass output of the LVCSR system: keywords absent from the first pass cannot be recalled by a re-scoring-only approach. An alternative would be to use our model to re-decode the audio, potentially realizing larger gains. Given that our re-scoring model worked well across four fundamentally different languages, we are optimistic that this would be the case.