A Dynamic Programming Algorithm for Computing N-gram Posteriors from Lattices

Efficient computation of n-gram posterior probabilities from lattices has applications in lattice-based minimum Bayes-risk decoding in statistical machine translation and the estimation of expected document frequencies from spoken corpora. In this paper, we present an algorithm for computing the posterior probabilities of all n-grams in a lattice and constructing a minimal deterministic weighted finite-state automaton associating each n-gram with its posterior for efficient storage and retrieval. Our algorithm builds upon the best known algorithm in the literature for computing n-gram posteriors from lattices and leverages the following observations to significantly improve the time and space requirements: i) the n-grams for which the posteriors will be computed typically comprise all n-grams in the lattice up to a certain length, ii) posterior is equivalent to expected count for an n-gram that does not repeat on any path, iii) there are efficient algorithms for computing n-gram expected counts from lattices. We present experimental results comparing our algorithm with the best known algorithm in the literature as well as a baseline algorithm based on weighted finite-state automata operations.


Introduction
Many complex speech and natural language processing (NLP) pipelines such as Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT) systems store alternative hypotheses produced at various stages of processing as weighted acyclic automata, also known as lattices. Each lattice stores a large number of hypotheses along with the raw system scores assigned to them. While the single-best hypothesis is typically what is desired at the end of the processing, it is often beneficial to consider a large number of weighted hypotheses at earlier stages of the pipeline to hedge against errors introduced by various subcomponents. Standard ASR and SMT techniques like discriminative training, rescoring with complex models and Minimum Bayes-Risk (MBR) decoding rely on lattices to represent intermediate system hypotheses that will be further processed to improve models or system output. For instance, lattice-based MBR decoding has been shown to give moderate yet consistent gains in performance over conventional MAP decoding in a number of speech and NLP applications including ASR (Goel and Byrne, 2000) and SMT (Tromble et al., 2008; Blackwood et al., 2010; de Gispert et al., 2013).
Most lattice-based techniques employed by speech and NLP systems make use of posterior quantities computed from probabilistic lattices. In this paper, we are interested in two such posterior quantities: i) n-gram expected count, the expected number of occurrences of a particular n-gram in a lattice, and ii) n-gram posterior probability, the total probability of accepting paths that include a particular n-gram. Expected counts have applications in the estimation of language model statistics from probabilistic input such as ASR lattices (Allauzen et al., 2003) and the estimation of term frequencies from spoken corpora, while posterior probabilities come up in MBR decoding of SMT lattices (Tromble et al., 2008), relevance ranking of spoken utterances and the estimation of document frequencies from spoken corpora (Karakos et al., 2011; Can and Narayanan, 2013).
The expected count c(x|A) of n-gram x given lattice A is defined as
$$ c(x|A) = \sum_{y \in \Sigma^*} \#_y(x) \, p(y|A) \tag{1} $$
where #_y(x) is the number of occurrences of n-gram x in hypothesis y and p(y|A) is the posterior probability of hypothesis y given lattice A. Similarly, the posterior probability p(x|A) of n-gram x given lattice A is defined as
$$ p(x|A) = \sum_{y \in \Sigma^*} 1_y(x) \, p(y|A) \tag{2} $$
where 1_y(x) is an indicator function taking the value 1 when hypothesis y includes n-gram x and 0 otherwise. While it is straightforward to compute these posterior quantities from weighted n-best lists by examining each hypothesis separately and keeping a separate accumulator for each observed n-gram type, it is infeasible to do the same with lattices due to the sheer number of hypotheses stored. There are efficient algorithms in the literature (Allauzen et al., 2003; Allauzen et al., 2004) for computing n-gram expected counts from weighted automata that rely on weighted finite-state transducer operations to reduce the computation to a sum over n-gram occurrences, eliminating the need for an explicit sum over accepting paths. The rather innocent looking difference between Equations 1 and 2, #_y(x) vs. 1_y(x), makes it hard to develop similar algorithms for computing n-gram posteriors from weighted automata since the summation of probabilities has to be carried out over paths rather than n-gram occurrences (Blackwood et al., 2010; de Gispert et al., 2013). The problem of computing n-gram posteriors from lattices has been addressed by a number of recent works (Tromble et al., 2008; Allauzen et al., 2010; Blackwood et al., 2010; de Gispert et al., 2013) in the context of lattice-based MBR for SMT. In these works, it has been reported that the time required for lattice MBR decoding is dominated by the time required for computing n-gram posteriors. Our interest in computing n-gram posteriors from lattices stems from its potential applications in spoken content retrieval (Chelba et al., 2008; Karakos et al., 2011; Can and Narayanan, 2013).
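To make the difference between Equations 1 and 2 concrete, the following minimal Python sketch computes both quantities from a weighted n-best list by examining each hypothesis separately, as described above. The function names and the toy 2-best list are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Yield every n-gram occurrence (as a tuple) in a token sequence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def nbest_statistics(nbest, n):
    """Return (expected_counts, posteriors) for all n-grams of order n.

    nbest: list of (tokens, probability) pairs; probabilities sum to 1.
    """
    counts = defaultdict(float)
    posteriors = defaultdict(float)
    for tokens, prob in nbest:
        seen = set()
        for x in ngrams(tokens, n):
            counts[x] += prob          # Eq. 1: every occurrence contributes
            if x not in seen:          # Eq. 2: each hypothesis contributes once
                posteriors[x] += prob
                seen.add(x)
    return dict(counts), dict(posteriors)


# Hypothetical toy 2-best list: "a" repeats in the first hypothesis.
nbest = [(["a", "b", "a"], 0.6), (["a", "b"], 0.4)]
counts, posts = nbest_statistics(nbest, 1)
```

In this toy example the unigram "a" repeats on one hypothesis, so its expected count (about 1.6) exceeds its posterior (1.0), whereas the two quantities coincide for the non-repeating unigram "b".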
Computation of document frequency statistics from spoken corpora relies on estimating n-gram posteriors from ASR lattices. In this context, a spoken document is simply a collection of ASR lattices. The n-grams of interest can be word, syllable, morph or phoneme sequences. Unlike in the case of lattice-based MBR for SMT, where the n-grams of interest are relatively short (typically up to 4-grams), the n-grams we are interested in are in many instances relatively long sequences of subword units.
In this paper, we present an efficient algorithm for computing the posterior probabilities of all n-grams in a lattice and constructing a minimal deterministic weighted finite-state automaton associating each n-gram with its posterior for efficient storage and retrieval. Our n-gram posterior computation algorithm builds upon the custom forward procedure described in (de Gispert et al., 2013) and introduces a number of refinements to significantly improve the time and space requirements:
• The custom forward procedure described in (de Gispert et al., 2013) computes unigram posteriors from an input lattice. Higher order n-gram posteriors are computed by first transducing the input lattice to an n-gram lattice using an order mapping transducer and then running the custom forward procedure on this higher order lattice. We reformulate the custom forward procedure as a dynamic programming algorithm that computes posteriors for successively longer n-grams and reuses the forward scores computed for the previous order. This reformulation subsumes the transduction of input lattices to n-gram lattices and obviates the need for constructing and applying order mapping transducers.
• Comparing Eq. 1 with Eq. 2, we can observe that posterior probability and expected count are equivalent for an n-gram that does not repeat on any path of the input lattice. The key idea behind our algorithm is to limit the costly posterior computation to only those n-grams that can potentially repeat on some path of the input lattice. We keep track of repeating n-grams of order n and use a simple impossibility argument to significantly reduce the number of n-grams of order n + 1 for which posterior computation will be performed. The posteriors for the remaining n-grams are replaced with expected counts. This filtering of n-grams introduces a slight bookkeeping overhead but in return dramatically reduces the runtime and memory requirements for long n-grams.
• We store the posteriors for n-grams that can potentially repeat on some path of the input lattice in a weighted prefix tree that we construct on the fly. Once that is done, we compute the expected counts for all n-grams in the input lattice and represent them as a minimal deterministic weighted finite-state automaton, known as a factor automaton (Allauzen et al., 2004), using the approach described in (Allauzen et al., 2004). Finally, we use general weighted automata algorithms to merge the weighted factor automaton representing expected counts with the weighted prefix tree representing posteriors to obtain a weighted factor automaton representing posteriors that can be used for efficient storage and retrieval.

Preliminaries
This section introduces the definitions and notation related to weighted finite state automata and transducers (Mohri, 2009).

Semirings
Definition 1 A semiring is a 5-tuple (K, ⊕, ⊗, 0, 1) where (K, ⊕, 0) is a commutative monoid, (K, ⊗, 1) is a monoid, ⊗ distributes over ⊕ and 0 is an annihilator for ⊗. Table 1 lists common semirings. In speech and language processing, two semirings are of particular importance. The log semiring is isomorphic to the probability semiring via the negative-log morphism and can be used to combine probabilities in the log domain. The tropical semiring provides the algebraic structure necessary for shortest-path algorithms and can be derived from the log semiring using the Viterbi approximation.
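The relation between the probability, log and tropical semirings can be illustrated with a few lines of Python. This is a minimal sketch using the standard negative-log convention; the function and variable names are ours and do not correspond to any particular library.

```python
import math

LOG_ZERO, LOG_ONE = math.inf, 0.0  # identities of ⊕ and ⊗ in the log semiring


def log_plus(a, b):
    """⊕ of the log semiring: -log(exp(-a) + exp(-b)), computed stably."""
    if math.isinf(a):
        return b
    if math.isinf(b):
        return a
    return min(a, b) - math.log1p(math.exp(-abs(a - b)))


def log_times(a, b):
    """⊗ of the log (and tropical) semiring: addition of negative logs."""
    return a + b


def tropical_plus(a, b):
    """⊕ of the tropical semiring: the Viterbi approximation of log_plus."""
    return min(a, b)


p, q = 0.2, 0.3
a, b = -math.log(p), -math.log(q)
sum_in_log_domain = log_plus(a, b)       # corresponds to -log(p + q)
product_in_log_domain = log_times(a, b)  # corresponds to -log(p * q)
viterbi_sum = tropical_plus(a, b)        # corresponds to -log(max(p, q))
```

The tropical ⊕ keeps only the better of the two alternatives, which is why it never falls below the log-semiring ⊕ in the negative-log domain.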

Weighted Finite-State Automata
Definition 2 A weighted finite-state automaton (WFSA) A over a semiring (K, ⊕, ⊗, 0, 1) is a 7-tuple A = (Σ, Q, I, F, E, λ, ρ) where: Σ is the finite input alphabet; Q is a finite set of states; I, F ⊆ Q are respectively the set of initial and final states; E ⊆ Q × (Σ ∪ {ε}) × K × Q is a finite set of arcs; λ : I → K, ρ : F → K are respectively the initial and final weight functions.
Given an arc e ∈ E, we denote by i[e] its input label, w[e] its weight, s[e] its source or origin state and t[e] its target or destination state. A path π = e_1 · · · e_k is an element of E* with consecutive arcs satisfying t[e_{i−1}] = s[e_i], i = 2, . . . , k. We extend s and t to paths by setting s[π] = s[e_1] and t[π] = t[e_k]. The labeling and the weight functions can also be extended to paths by defining
$$ i[\pi] = i[e_1] \cdots i[e_k], \qquad w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]. $$
We denote by Π(q, q′) the set of paths from q to q′ and by Π(q, x, q′) the set of paths from q to q′ with input string x ∈ Σ*. These definitions can also be extended to subsets S, S′ ⊆ Q, e.g.
$$ \Pi(S, x, S') = \bigcup_{q \in S,\, q' \in S'} \Pi(q, x, q'). $$
An accepting path in an automaton A is a path in Π(I, F). A string x is accepted by A if there exists an accepting path π labeled with x. A is deterministic if it has at most one initial state and at any state no two outgoing transitions share the same input label. The weight associated by an automaton A to a string x ∈ Σ* is given by
$$ [\![A]\!](x) = \bigoplus_{\pi \in \Pi(I, x, F)} \lambda(s[\pi]) \otimes w[\pi] \otimes \rho(t[\pi]) $$
and ⟦A⟧(x) is defined to be 0 when Π(I, x, F) = ∅.
A weighted automaton A defined over the probability semiring (R+, +, ×, 0, 1) is said to be probabilistic if for any state q ∈ Q, the sum of the weights of all cycles at q, ⊕_{π∈Π(q,q)} w[π], is well-defined and in R+ and Σ_{x∈Σ*} ⟦A⟧(x) = 1.
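As an illustration of these definitions, the following minimal Python sketch evaluates the weight that an ε-free automaton over the probability semiring assigns to a string. The arc-list representation, the function name string_weight and the toy automaton are assumptions made for this example only.

```python
def string_weight(arcs, initial, final, x):
    """Sum of λ(s[π]) · w[π] · ρ(t[π]) over accepting paths π labeled x.

    arcs:    list of (source, label, weight, target) tuples, no ε labels
    initial: dict mapping initial states to their initial weights
    final:   dict mapping final states to their final weights
    """
    # current[q] = total weight of paths from an initial state to q
    # that read the prefix of x consumed so far.
    current = dict(initial)
    for symbol in x:
        nxt = {}
        for source, label, weight, target in arcs:
            if label == symbol and source in current:
                nxt[target] = nxt.get(target, 0.0) + current[source] * weight
        current = nxt
    return sum(w * final[q] for q, w in current.items() if q in final)


# Hypothetical toy automaton accepting "ab" and "ac"; it is not deterministic.
arcs = [(0, "a", 0.6, 1), (0, "a", 0.4, 2),
        (1, "b", 1.0, 3), (2, "b", 0.5, 3), (2, "c", 0.5, 3)]
initial = {0: 1.0}
final = {3: 1.0}
weight_ab = string_weight(arcs, initial, final, "ab")
weight_ac = string_weight(arcs, initial, final, "ac")
```

The string "ab" is the label of two accepting paths whose weights are added together, and the weights of all accepted strings sum to 1, so this toy automaton is probabilistic in the sense defined above.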

N-gram Mapping Transducer
We denote by Φ_n the n-gram mapping transducer (Blackwood et al., 2010; de Gispert et al., 2013) of order n. This transducer maps label sequences to n-gram sequences of order n. Φ_n is similar in form to the weighted finite-state transducer representation of a backoff n-gram language model (Allauzen et al., 2003). We denote by A_n the n-gram lattice of order n obtained by composing lattice A with Φ_n, projecting the resulting transducer onto its output labels, i.e. n-grams, to obtain an automaton, removing ε-transitions, determinizing and minimizing (Mohri, 2009). A_n is a compact lattice of n-gram sequences of order n consistent with the labels and scores of lattice A. A_n typically has more states than A due to the association of distinct n-gram histories with states.

Factor Automata
Definition 3 Given two strings x, y ∈ Σ * , x is a factor (substring) of y if y = uxv for some u, v ∈ Σ * . More generally, x is a factor of a language L ⊆ Σ * if x is a factor of some string y ∈ L. The factor automaton S(y) of a string y is the minimal deterministic finite-state automaton recognizing exactly the set of factors of y. The factor automaton S(A) of an automaton A is the minimal deterministic finite-state automaton recognizing exactly the set of factors of A, that is the set of factors of the strings accepted by A.
A factor automaton is an efficient and compact data structure for representing a full index of a set of strings, i.e. an automaton. It can be used to determine if a string x is a factor in time linear in its length, O(|x|). By associating a weight with each factor, we can generalize the factor automaton structure to weighted automata and use it for efficient storage and retrieval of n-gram posteriors and expected counts.
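The linear-time lookup can be illustrated with a simple stand-in for a factor automaton. The sketch below builds a trie over all suffixes of a set of strings; such a trie is a deterministic automaton recognizing exactly the factors, but unlike a true factor automaton it is not minimal. The function names are hypothetical.

```python
def build_factor_trie(strings):
    """Build a nested-dict trie containing every suffix of every string.

    Every factor of a string is a prefix of one of its suffixes, so each
    factor labels a path starting at the root. The trie is deterministic
    but NOT minimal, unlike a true factor automaton.
    """
    root = {}
    for y in strings:
        for start in range(len(y)):
            node = root
            for symbol in y[start:]:
                node = node.setdefault(symbol, {})
    return root


def is_factor(trie, x):
    """Decide whether x is a factor, following one arc per symbol of x."""
    node = trie
    for symbol in x:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


trie = build_factor_trie(["abab", "abc"])
```

For example, is_factor(trie, "bab") follows three arcs and succeeds, while is_factor(trie, "ca") fails after the first arc. A weighted variant would additionally store a weight at each node.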

Computation of N-gram Posteriors
In this section we present an efficient algorithm based on the n-gram posterior computation algorithm described in (de Gispert et al., 2013) for computing the posterior probabilities of all n-grams in a lattice and constructing a weighted factor automaton for efficient storage and retrieval of these posteriors. We assume that the input lattice is an ε-free acyclic probabilistic automaton. If that is not the case, we can use general weighted automata ε-removal and weight-pushing algorithms (Mohri, 2009) to preprocess the input automaton.
Algorithm 1 reproduces the original algorithm of (de Gispert et al., 2013) in our notation. Each iteration of the outermost loop starting at line 1 computes posterior probabilities of all unigrams in the n-gram lattice A_n = (Σ_n, Q_n, I_n, F_n, E_n, λ_n, ρ_n), or equivalently all n-grams of order n in the lattice A. The inner loop starting at line 6 is essentially a custom forward procedure computing not only the standard forward probabilities α[q], the marginal probability of paths that lead to state q, but also the label specific forward probabilities α[q][x], the marginal probability of paths that lead to state q and include label x.
Just like in the case of the standard forward algorithm, visiting states in topological order ensures that the forward probabilities associated with a state have already been computed when that state is visited. At each state s, the algorithm examines each arc e = (s, x, w, q) and updates the forward probabilities for state q in accordance with the recursions in Equations 4 and 6 by propagating the forward probabilities computed for s (lines 8-12). Whenever a final state is processed, the posterior probability accumulator for each label observed on paths reaching that state is updated by multiplying the label specific forward probability by the final weight associated with that state and adding the resulting value to the accumulator (lines 13-15). It should be noted that this algorithm is a form of marginalization (de Gispert et al., 2013), rather than a counting procedure, due to the conditional on line 11. If that conditional were to be removed, this algorithm would compute n-gram expected counts instead of posterior probabilities.
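The custom forward procedure described above can be sketched in Python as follows. This is a minimal illustration for unigram labels over the probability semiring, written from the prose description; it is not the reference implementation, and its structure does not follow the line numbering of Algorithm 1. The representation (states numbered in topological order, arcs as tuples) and the toy lattice are assumptions.

```python
from collections import defaultdict


def label_posteriors(num_states, arcs, initial, final):
    """Posterior probability of each label in an ε-free acyclic lattice.

    States 0..num_states-1 are assumed to be in topological order.
    arcs: list of (s, x, w, q); initial/final: dicts of state weights.
    """
    alpha = defaultdict(float)                          # α[q]
    alpha_x = defaultdict(lambda: defaultdict(float))   # α[q][x]
    posterior = defaultdict(float)
    outgoing = defaultdict(list)
    for s, x, w, q in arcs:
        outgoing[s].append((x, w, q))
    for q, weight in initial.items():
        alpha[q] = weight
    for s in range(num_states):
        for x, w, q in outgoing[s]:
            alpha[q] += alpha[s] * w
            # Every path extended by this arc includes label x.
            alpha_x[q][x] += alpha[s] * w
            for y, a in alpha_x[s].items():
                if y != x:  # marginalization: skip paths already counted for x
                    alpha_x[q][y] += a * w
        if s in final:
            for x, a in alpha_x[s].items():
                posterior[x] += a * final[s]
    return dict(posterior)


# Hypothetical toy lattice with paths "ab" (0.2), "aba" (0.3) and "c" (0.5).
arcs = [(0, "a", 0.5, 1), (0, "c", 0.5, 3), (1, "b", 1.0, 2), (2, "a", 0.6, 3)]
posteriors = label_posteriors(4, arcs, {0: 1.0}, {2: 0.4, 3: 1.0})
```

On this toy lattice the label "a" has posterior 0.5 even though it occurs twice on the path "aba". Dropping the y != x test would add the repeated occurrence a second time and yield the expected count 0.8 instead.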
The key idea behind our algorithm is to restrict the computation of posteriors to only those n-grams that may potentially repeat on some path of the input lattice and exploit the equivalence of expected counts and posterior probabilities for the remaining n-grams. It is possible to extend Algorithm 1 to implement this restriction by keeping track of repeating n-grams of order n and replacing the output labels of appropriate arcs in Φ_{n+1} with ε labels. Alternatively, we can reformulate Algorithm 1 as in Algorithm 2. In this formulation we compute n-gram posteriors directly on the input lattice A without constructing the n-gram lattice A_n. We explicitly associate states in the original lattice with distinct n-gram histories, which is implicitly done in Algorithm 1 by constructing the n-gram lattice A_n. This explicit association lets us reuse forward probabilities computed at order n while computing the forward probabilities at order n + 1. Further, we can directly restrict the n-grams for which posterior computation will be performed.

In Algorithm 2, ά[n][q][h] represents the history specific forward probability of state q, the marginal probability of paths that lead to state q and include length n string h as a suffix. The algorithm also maintains an n-gram specific counterpart of this quantity for each state history pair: the marginal probability of paths that lead to state q, end with the given history, and include n-gram x as a substring. R[n] denotes the set of n-grams of order n that repeat on some path of A. The algorithm is initialized by setting R[0] = {ε}, i.e. the only repeating n-gram of order 0 is the empty string ε, and computing ά[0][q][ε] ≡ α[q] using the standard forward algorithm. Each iteration of the outermost loop starting at line 3 computes posterior probabilities of all n-grams of order n directly on the lattice A. At iteration n, we visit the states in topological order and examine each length n − 1 history g associated with s, the state we are in. For each history g, we go over the set of arcs leaving state s, construct the current n-gram x by concatenating g with the current arc label i (line 11), construct the length n − 1 history h of the target state q (line 12), and update the forward probabilities for the target state history pair (q, h) in accordance with the recursions in Equations 8 and 10 by propagating the forward probabilities computed for the state history pair (s, g) (lines 14-18). Whenever a final state is processed, the posterior probability accumulator for each n-gram of order n observed on paths reaching that state is updated by multiplying the n-gram specific forward probability by the final weight associated with that state and adding the resulting value to the accumulator (lines 21-24).
We track repeating n-grams of order n to restrict the costly posterior computation operation to only those n-grams of order n + 1 that can potentially repeat on some path of the input lattice. The conditional on line 17 checks if any of the n-grams observed on paths reaching state history pair (s, g) is the same as the current n-gram x, and if so adds it to the set of repeating n-grams. At each iteration n, we check if the current length n − 1 history g of the state we are in is in R[n − 1], the set of repeating n-grams of order n − 1 (line 9). If it is not, then no n-gram x = gi can repeat on some path of A since that would require g to repeat as well. If g is in R[n − 1], then for each arc e = (s, i, w, q) we check if the length n − 1 history h = g[1 : n − 1]i of the next state q is in R[n − 1] (line 13). If it is not, then the n-gram x = g[0]h can not repeat either.
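The impossibility argument can be sketched in a few lines of Python. The function repeating_ngrams below finds repeating n-grams by brute force over an explicit list of paths, which is only feasible for toy examples and is not how Algorithm 2 tracks them; may_repeat is the filter itself. Both names and the toy paths are hypothetical.

```python
def repeating_ngrams(paths, n):
    """Brute-force R[n]: n-grams that occur at least twice on some path."""
    repeating = set()
    for tokens in paths:
        seen = set()
        for i in range(len(tokens) - n + 1):
            x = tuple(tokens[i:i + n])
            if x in seen:
                repeating.add(x)
            seen.add(x)
    return repeating


def may_repeat(x, repeating_prev):
    """Necessary condition for an n-gram x to repeat on some path.

    If x occurs twice on a path, so do its prefix g = x[:-1] and its
    suffix h = x[1:], hence both must be in R[n - 1].
    """
    return x[:-1] in repeating_prev and x[1:] in repeating_prev


# Hypothetical toy set of lattice paths.
paths = [list("abab"), list("abc")]
R1 = repeating_ngrams(paths, 1)   # repeating unigrams
R2 = repeating_ngrams(paths, 2)   # repeating bigrams
```

Here the bigram ("b", "c") is filtered out because ("c",) never repeats, so its posterior can be replaced by its expected count. The bigram ("b", "a") passes the filter although it does not actually repeat: the test is necessary but not sufficient, which is why the surviving candidates still go through the full posterior computation.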
We keep the posteriors p(x|A) for n-grams that can potentially repeat on some path of the input lattice in a deterministic WFSA P that we construct on the fly. P is a prefix tree where each path π corresponds to an n-gram posterior, i.e. i[π] = x ⟹ w[π] = ρ(t[π]) = p(x|A). Once the computation of posteriors for possibly repeating n-grams is finished, we use the algorithm described in (Allauzen et al., 2004) to construct a weighted factor automaton C mapping all n-grams observed in A to their expected counts, i.e. ∀π in C, i[π] = x ⟹ w[π] = c(x|A). We use P and C to construct another weighted factor automaton P′ mapping all n-grams observed in A to their posterior probabilities, i.e. ∀π in P′, i[π] = x ⟹ w[π] = p(x|A). First we remove the n-grams accepted by P from C using the difference operation (Mohri, 2009), then take the union of the remaining automaton C′ and P, and finally optimize the result by removing ε-transitions, determinizing and minimizing: P′ = Min(Det(RmEps(C′ ⊕ P))).
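The effect of this merge can be mimicked with plain dictionaries standing in for the weighted automata. The sketch below is only an analogy under that assumption: it ignores the automata operations (difference, union, ε-removal, determinization, minimization) that make the actual construction compact, and the function name and toy values are hypothetical.

```python
def merge_posteriors(expected_counts, repeating_posteriors):
    """Combine expected counts with posteriors of possibly repeating n-grams.

    expected_counts:      dict n-gram -> c(x|A), for all n-grams in the lattice
    repeating_posteriors: dict n-gram -> p(x|A), for possibly repeating n-grams
    """
    # Difference: drop the n-grams whose posteriors were computed explicitly.
    remaining = {x: c for x, c in expected_counts.items()
                 if x not in repeating_posteriors}
    # Union: for the remaining n-grams the expected count IS the posterior.
    merged = dict(remaining)
    merged.update(repeating_posteriors)
    return merged


# Hypothetical toy values: only the unigram "a" can repeat.
expected_counts = {("a",): 1.6, ("b",): 1.0, ("a", "b"): 1.0}
repeating_posteriors = {("a",): 1.0}
posteriors = merge_posteriors(expected_counts, repeating_posteriors)
```

After the merge every n-gram maps to a valid posterior: the explicitly computed value for ("a",) replaces its expected count, and the other entries are carried over unchanged.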

Experiments and Discussion
In this section we provide experiments comparing the performance of Algorithm 2 with Algorithm 1 as well as a baseline algorithm based on the approach of (Tromble et al., 2008). All algorithms were implemented in C++ using the OpenFst Library (Allauzen et al., 2007). The Algorithm 1 implementation is a thin wrapper around the reference implementation. All experiments were conducted on the 88K ASR lattices (total size: #states + #arcs = 33M, disk size: 481MB) generated from the training subset of the IARPA Babel Turkish language pack, which includes 80 hours of conversational telephone speech. Lattices were generated with a speaker dependent DNN ASR system that was trained on the same data set using IBM's Attila toolkit (Soltau et al., 2010). All lattices were pruned to a logarithmic beam width of 5. Figure 1 gives a scatter plot of the posterior probability computation time vs. the number of lattice n-grams (up to 5-grams) where each point represents one of the 88K lattices in our data set. Similarly, Figure 2 gives a scatter plot of the maximum memory used by the program (maximum resident set size) during the computation of posteriors vs. the number of lattice n-grams (up to 5-grams). Algorithm 2 requires significantly fewer resources, particularly in the case of larger lattices with a large number of unique n-grams.
Table 2 (partial: the column headers giving the maximum n-gram length N and the first three rows are missing; the last column appears to be the case with no limit on N):
Algorithm 1 (sec) | 0.5 | 0.6 | 0.9 | 1.6 | 3.9 | 16 | 997 | -
Algorithm 2 (sec) | 0.7 | 0.8 | 0.9 | 1.1 | 1.2 | 1.3 | 1.7 | 1.0
Expected Count (sec) | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 1.0 | 0.5
To better understand the runtime characteristics of Algorithms 1 and 2, we conducted a small experiment where we randomly selected 100 lattices (total size: #states + #arcs = 81K, disk size: 1.2MB) from our data set and analyzed the relation between the runtime and the maximum n-gram length N. Table 2 gives a runtime comparison between the baseline posterior computation algorithm described in (Tromble et al., 2008), Algorithm 1, Algorithm 2 and the expected count computation algorithm of (Allauzen et al., 2004). The baseline method computes posteriors separately for each n-gram by intersecting the lattice with an automaton accepting only the paths including that n-gram and computing the total weight of the resulting automaton in the log semiring. Runtime complexities of the baseline method and Algorithm 1 are exponential in N due to the explicit enumeration of n-grams and we can clearly see this trend in the 3rd and 4th rows of Table 2. Algorithm 2 (5th row) takes advantage of the WFSA based expected count computation algorithm (6th row) to do most of the work for long n-grams, hence does not suffer from the same exponential growth. Notice the drops in the runtimes of Algorithm 2 and the WFSA based expected count computation algorithm when all n-grams are included in the computation regardless of their length. These drops are due to the expected count computation algorithm, which processes all n-grams simultaneously using WFSA operations. Limiting the maximum n-gram length requires pruning long n-grams, which in general can increase the sizes of intermediate WFSAs used in computation and result in longer runtimes as well as larger outputs.
When there is no limit on the maximum n-gram length, the output of Algorithm 2 is a weighted factor automaton mapping each factor to its posterior. Table 3 compares the construction and storage requirements for posterior factor automata with similar factor automata structures. We use the approach described in (Allauzen et al., 2004) for constructing both the unweighted and the expected count factor automata. We construct the unweighted factor automata by first removing the weights on the input lattices and then applying the determinization operation on the tropical semiring so that path weights are not added together. The storage requirements of the posterior factor automata produced by Algorithm 2 are similar to those of the expected count factor automata. Unweighted factor automata, on the other hand, are significantly more compact than their weighted counterparts even though they accept the same set of strings. This difference in size is due to accommodating path weights, which in general can significantly impact the effectiveness of automata determinization and minimization.

Related Work
Efficient computation of n-gram expected counts from weighted automata was first addressed in (Allauzen et al., 2003) in the context of estimating n-gram language model statistics from ASR lattices. Expected counts for all n-grams of interest observed in the input automaton are computed by composing the input with a simple counting transducer, projecting on the output side, and removing ε-transitions. The weight associated by the resulting WFSA to each n-gram it accepts is simply the expected count of that n-gram in the input automaton. Construction of such an automaton for all substrings (factors) of the input automaton was later explored in (Allauzen et al., 2004) in the context of building an index for spoken utterance retrieval (SUR) (Saraclar and Sproat, 2004). This is the approach used for constructing the weighted factor automaton C in Algorithm 2. While expected count works well in practice for ranking spoken utterances containing a query term, posterior probability is in theory a better metric for this task. The weighted factor automaton P′ produced by Algorithm 2 can be used to construct an SUR index weighted with posterior probabilities.
The problem of computing n-gram posteriors from lattices was first addressed in (Tromble et al., 2008) in the context of lattice-based MBR for SMT. This is the baseline approach used in our experiments and it consists of building a separate FSA for each n-gram of interest and intersecting this automaton with the input lattice to discard those paths that do not include that n-gram and summing up the weights of remaining paths. The fundamental shortcoming of this approach is that it requires separate intersection and shortest distance computations for each n-gram. This shortcoming was first tackled in (Allauzen et al., 2010) by introducing a counting transducer for simultaneous computation of posteriors for all n-grams of order n in a lattice. This transducer works well for unigrams since there is a relatively small number of unique unigrams in a lattice. However, it is less efficient for n-grams of higher orders. This inefficiency was later addressed in (Blackwood et al., 2010) by employing n-gram mapping transducers to transduce the input lattices to n-gram lattices of order n and computing unigram posteriors on the higher order lattices. Algorithm 1 was described in (de Gispert et al., 2013) as a fast alternative to counting transducers. It is a lattice specialization of a more general algorithm for computing n-gram posteriors from a hypergraph in a single inside pass (DeNero et al., 2010). While this algorithm works really well for relatively short n-grams, its time and space requirements scale exponentially with the maximum n-gram length. Algorithm 2 builds upon this algorithm by exploiting the equivalence of expected counts and posteriors for non-repeating n-grams and eliminating the costly posterior computation operation for most n-grams in the input lattice.

Conclusion
We have described an efficient algorithm for computing n-gram posteriors from an input lattice and constructing an efficient and compact data structure for storing and retrieving them. The runtime and memory requirements of the proposed algorithm grow linearly with the length of the n-grams as opposed to the exponential growth observed with the original algorithm we are building upon. This is achieved by limiting the posterior computation to only those n-grams that may repeat on some path of the input lattice and using the relatively cheaper expected count computation algorithm for the rest. This filtering of n-grams introduces a slight bookkeeping overhead over the baseline algorithm but in return dramatically reduces the runtime and memory requirements for long n-grams.