Online Infix Probability Computation for Probabilistic Finite Automata

Probabilistic finite automata (PFAs) are common statistical language models in natural language and speech processing. A typical task for PFAs is to compute the probability of all strings that match a query pattern. An important special case of this problem is computing the probability of a string appearing as a prefix, suffix, or infix. These problems find use in many natural language processing tasks such as word prediction and text error correction. Recently, we gave the first incremental algorithm to efficiently compute the infix probabilities of each prefix of a string (Cognetta et al., 2018). We develop an asymptotic improvement of that algorithm and solve the open problem of computing the infix probabilities of PFAs from streaming data, which is crucial when processing queries online and is the ultimate goal of the incremental approach.


Introduction
Weighted automata are a popular weighted language model in natural language processing. They have found use across the discipline both alone (Mohri et al., 2002) and in conjunction with more complicated language models (Ghazvininejad et al., 2016; Velikovich et al., 2018). As such, finding efficient algorithms for weighted automata has become an intensely studied topic (Allauzen and Mohri, 2009; Argueta and Chiang, 2018).
An important subclass of weighted automata are PFAs. Given a PFA, one important task is to calculate the probability of a phrase or pattern. Efficient algorithms exist for this problem when given a PFA or a probabilistic context-free grammar (PCFG) and a pattern that forms a regular language (Vidal et al., 2005a; Nederhof and Satta, 2011). One important special case of this problem is to compute the probability of all strings containing a given infix, which was first studied by Corazza et al. (1991). The problem was motivated by applications to phrase prediction and error correction. Several partial results were established with various restrictions on the statistical model or infix (Corazza et al., 1991; Fred, 2000; Nederhof and Satta, 2011). Later, Nederhof and Satta (2011) gave a general solution for PCFGs and proposed the problem of computing the infix probabilities of each prefix of a string incrementally: using the infix probability of w_1 w_2 ... w_k to speed up the calculation for w_1 w_2 ... w_k w_{k+1}.

* Now at Google Korea.
Recently, we gave an algorithm for this problem when the language model is a PFA, and suggested an open problem of online incremental infix probability calculation, where one is given a stream of characters instead of knowing the entire input string ahead of time (Cognetta et al., 2018). The online problem is of special practical importance as it is a more realistic setting than the offline problem. Not only do many speech processing tasks need to be performed "on the fly", but also many parsing algorithms can be improved by utilizing an online algorithm. For example, suppose one has calculated the infix probability of all prefixes of the phrase "...be or...", and later wishes to extend that phrase to "...be or not to be..." and retrieve all of the new infix probabilities. Instead of restarting the computation from the beginning, which would lead to redundant computation, an online method can be used to simply start from where the initial algorithm left off. As another example, suppose we have the phrase "...United States of...", and wish to extend it by a word while maximizing the resulting infix probability. An online algorithm can be used to try all extensions in the vocabulary before settling on "America", whereas naively applying an offline algorithm would require repeatedly computing already known values.

We first revisit our original incremental infix probability algorithm from (Cognetta et al., 2018) and improve the algorithm based on a careful reanalysis of the dynamic programming recurrence. Then, we develop an algorithm for the online incremental infix problem and demonstrate the practical effectiveness of the two new algorithms on a series of benchmark PFAs.

Preliminaries
We assume that the reader is familiar with the definition and basic properties of automata theory. For a thorough overview of PFAs, we suggest (Vidal et al., 2005a,b).
A PFA is specified by a tuple P = (Q, Σ, {M(c)}_{c∈Σ}, I, F), where Q is a set of states and Σ is an alphabet. The set {M(c)}_{c∈Σ} is a set of labeled |Q| × |Q| transition matrices: the element M(c)_{i,j} is the probability of transitioning from state q_i to q_j reading character c. Likewise, I is a 1 × |Q| initial probability vector and F is a |Q| × 1 final probability vector. PFAs have some conditions on their structure. Specifically, Σ_{i=1}^{|Q|} I_i = 1 and, for each state q_i, F_i + Σ_{c∈Σ} Σ_{j=1}^{|Q|} M(c)_{i,j} = 1. Finally, each state must be accessible and co-accessible. When these conditions are met, a PFA describes a probability distribution over Σ*. The probability of a string w = w_1 w_2 ... w_n is given as P(w) = I M(w_1) M(w_2) ... M(w_n) F. The matrix Σ_{w∈Σ*} M(w) = (1 − Σ_{c∈Σ} M(c))^{-1}, where 1 is the identity matrix, we denote M(Σ*), and note that I M(Σ*) F = 1.

The KMP automaton of w is a DFA with |w| + 1 states that accepts the language of strings ending with the first occurrence of w, and can be built in O(|w|) time (Knuth et al., 1977). By convention, the states of a KMP DFA are labeled from q_1 to q_{|w|+1}, with the transition between q_i and q_{i+1} corresponding to w_i. Figure 2 gives an example.
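To make the definitions concrete, here is a minimal NumPy sketch of a hypothetical two-state PFA together with the string-probability formula and M(Σ*); the numbers are our own and are chosen only to satisfy the PFA conditions above.

```python
import numpy as np

# Hypothetical two-state PFA over {a, b}; the entries are chosen so that
# I sums to 1 and, for each state i, F_i + sum_c sum_j M(c)_{i,j} = 1.
I = np.array([[1.0, 0.0]])                    # 1 x |Q| initial vector
F = np.array([[0.1], [0.3]])                  # |Q| x 1 final vector
M = {
    "a": np.array([[0.2, 0.4], [0.1, 0.2]]),  # M(c)_{i,j}: q_i -> q_j reading c
    "b": np.array([[0.3, 0.0], [0.2, 0.2]]),
}

def string_probability(I, M, F, w):
    """P(w) = I M(w_1) M(w_2) ... M(w_n) F."""
    v = I.copy()
    for c in w:
        v = v @ M[c]
    return float(v @ F)

# M(Sigma^*) = (1 - sum_c M(c))^{-1}, where 1 is the identity matrix.
M_star = np.linalg.inv(np.eye(2) - (M["a"] + M["b"]))
```

For this PFA, `float(I @ M_star @ F)` evaluates to 1, matching the identity I M(Σ*) F = 1 stated above.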

Incremental Infix Algorithm
We now review the method described in (Cognetta et al., 2018). The algorithm is based on state elimination for DFAs (Book et al., 1971). Given a DFA, we add two new states q_0 and q_{n+1}, where q_0 is connected by λ-transitions (λ is the empty string) to all initial states and all final states are connected to q_{n+1} by λ-transitions. We then perform a dynamic state elimination procedure to produce regular expressions α^k_{i,j} that describe the set of strings that, when read starting at state i, end at state j and never pass through a state with label higher than k. We use the recurrence α^k_{i,j} = α^{k−1}_{i,j} ∪ α^{k−1}_{i,k} (α^{k−1}_{k,k})* α^{k−1}_{k,j}, with the base case α^0_{i,j} being the transitions from q_i to q_j. This method forms a regular expression, stored in α^n_{0,n+1}, that describes the same language as the input DFA. Furthermore, this regular expression is unambiguous in that there is at most one way to match a string in the language to the regular expression (Book et al., 1971). We then described a mapping from regular expressions to expressions of transition matrices of a PFA (Table 1) and proved that evaluating the matrix formed by the mapping gives the probability of all strings matching the regular expression (Cognetta et al., 2018).
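Table 1 itself is not reproduced in this excerpt; the sketch below shows the standard interpretation of the three regular operations as matrix operations (the function names are ours): union becomes matrix addition, concatenation becomes matrix multiplication, and Kleene star becomes a matrix inverse. Correctness relies on the regular expressions being unambiguous, as noted above.

```python
import numpy as np

# Sketch of the regex-to-matrix mapping: for unambiguous regular expressions
# r and s with matrices M(r) and M(s),
#   M(r | s) = M(r) + M(s)          (union)
#   M(r s)   = M(r) M(s)            (concatenation)
#   M(r*)    = (1 - M(r))^{-1}      (Kleene star; 1 is the identity matrix)
def union(A, B):
    return A + B

def concat(A, B):
    return A @ B

def star(A):
    return np.linalg.inv(np.eye(A.shape[0]) - A)

# Hypothetical single-state PFA over {a, b}: stop with prob 0.2, read each
# character with prob 0.4. Then star(union(Ma, Mb)) = M((a|b)*) = M(Sigma^*),
# and I M(Sigma^*) F = 1.
Ma, Mb = np.array([[0.4]]), np.array([[0.4]])
I, F = np.array([[1.0]]), np.array([[0.2]])
total_mass = float(I @ star(union(Ma, Mb)) @ F)
```

Evaluating the regular expression for a language through these three operations yields the total probability of all strings matching it, which is the fact the incremental algorithm exploits.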
Table 1: A mapping from regular expressions to expressions of transition matrices.
The basic idea behind the incremental algorithm is the following: the KMP DFA describes the infix language of the input string w. When performing the state elimination procedure, the term α^k_{0,k+1} is the regular expression for the infix language of w_1 w_2 ... w_k. Further, the term α^{k+1}_{0,k+2} = α^k_{0,k+1} (α^k_{k+1,k+1})* α^k_{k+1,k+2} includes the term α^k_{0,k+1}, and so the result from each iteration can be used in the next. The algorithm then performs state elimination while interpreting each term as a matrix expression via the mapping in Table 1, yielding the infix probability of each prefix of w as it proceeds.
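For reference, the quantity being computed here, the probability of all strings containing a given w as an infix, can also be obtained non-incrementally: intersect the PFA with the first-occurrence KMP DFA and solve one linear system. The sketch below is our own baseline along those lines (it is not the paper's state-elimination algorithm) and is useful for checking an implementation on small examples.

```python
import numpy as np

def kmp_dfa(w, alphabet):
    """Transition function of the KMP DFA for w, with 0-based states 0..len(w);
    state len(w) accepts exactly the strings ending with the first occurrence
    of w (the text's q_1..q_{|w|+1} shifted down by one)."""
    n = len(w)
    delta = [dict() for _ in range(n + 1)]
    fail = [0] * (n + 1)
    for s in range(n + 1):
        for c in alphabet:
            if s < n and c == w[s]:
                delta[s][c] = s + 1       # forward transition on w_{s+1}
            elif s == 0:
                delta[s][c] = 0
            else:
                delta[s][c] = delta[fail[s]][c]  # back transition via failure link
        if 0 < s < n:
            fail[s + 1] = delta[fail[s]][w[s]]
    return delta

def infix_probability(I, M, F, w):
    """Probability of all strings containing the (nonempty) w as an infix:
    the mass of L(A) Sigma^*, where A is the first-occurrence KMP DFA."""
    alphabet = list(M)
    delta = kmp_dfa(w, alphabet)
    nQ, n = F.shape[0], len(w)
    d = nQ * n                      # PFA states paired with non-accepting KMP states
    T = np.zeros((d, d))            # steps that stay before the first occurrence
    B = np.zeros((d, nQ))           # steps that complete the first occurrence
    for s in range(n):
        for c in alphabet:
            t = delta[s][c]
            if t == n:
                B[s*nQ:(s+1)*nQ, :] += M[c]
            else:
                T[s*nQ:(s+1)*nQ, t*nQ:(t+1)*nQ] += M[c]
    start = np.zeros((1, d))
    start[:, :nQ] = I               # the KMP DFA starts in state 0
    reach = start @ np.linalg.inv(np.eye(d) - T) @ B   # mass at first occurrence
    M_star = np.linalg.inv(np.eye(nQ) - sum(M.values()))
    return float(reach @ M_star @ F)   # any suffix may follow the occurrence
```

For a single-state PFA over {a, b} that stops with probability 0.2 and reads each character with probability 0.4, the probability of containing the infix "a" is 1 − 0.2/0.6 = 2/3, which the sketch reproduces.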

Asymptotic Speedup
We now describe an asymptotic speedup for Algorithm 1 based on the following two lemmas.
Lemma 1. Computing α^n_{0,n+1} only requires knowledge of the terms of the form α^k_{i,j}, where i, j ≥ k + 1, or of the form α^k_{0,k+1}.

In other words, only the term α^k_{0,k+1} and the terms in the bottom-right k × k sub-table of α^k need to be considered at step k + 1. The new algorithm is faster than the previously known runtime of O(|Σ||w||Q_P|^2 + |w|^3 |Q|^m).† To implement this speedup, we change the iteration range in Line 11 of Algorithm 1 so that only the terms identified by Lemma 1 are updated.

† The constant m is such that n × n matrices can be multiplied or inverted in O(n^m) time. In practice, m is often ≈ 2.807 (Strassen, 1969).

Online Incremental Infix Calculation
We now consider the problem of determining the infix probabilities of strings given as a stream of characters. This is in contrast to the setting from Algorithm 1 and (Cognetta et al., 2018), in which the entire string was known ahead of time.
In this setting, we build the KMP automaton step by step (instead of all at once at the beginning), and then eliminate the most recent state to maintain our dynamic programming table. The key difficulty in this method is that when adding a new state, |Σ| − 1 back transitions (and 1 forward transition) are added to the DFA. The label and destination of each back transition cannot be predicted until a new character is added, the back transitions can go to any state up to the current one, and different configurations can arise depending on the newly added character. Together, these make correctly accounting for the paths that are generated at each step non-trivial.
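The step-by-step KMP construction itself can be sketched as follows (the class name is ours): each call to extend adds one state, one forward transition, and the back transitions inherited from the failure state, matching the |Σ| − 1 back transitions and 1 forward transition described above, in O(|Σ|) time per character.

```python
class OnlineKMP:
    """Builds the KMP DFA of a growing pattern one character at a time.
    States are 0-based: state s means 'the last s characters read match
    w_1 ... w_s' (the text's q_{s+1})."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        self.w = []
        self.delta = [{c: 0 for c in self.alphabet}]  # state 0: nothing matched
        self.fail = [0]                               # fail[0] is a placeholder

    def extend(self, c):
        """Append character c to the pattern."""
        s = len(self.w)
        # failure link of the new state; uses only rows that are already final
        self.fail.append(self.delta[self.fail[s]][c] if s > 0 else 0)
        self.delta[s][c] = s + 1                      # the one forward transition
        # the new last state's |Sigma| back transitions mirror its failure state
        self.delta.append(dict(self.delta[self.fail[s + 1]]))
        self.w.append(c)
```

Running the resulting transition table over a text locates occurrences of the pattern in the usual KMP fashion; reaching state len(w) means the pattern has just occurred.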
Lemma 4. The term α^k_{k+1,k+1} can be computed as ⋃_{c ∈ Σ−{w_k}} c (α^{k−1}_{δ(q_{k+1},c),k+1}).

The basic intuition of Lemma 4 is to concatenate the character from the backwards transition to the front of every string that brings state δ(q_i, c) to state q_{k+1}. When finding α^k_{i,k+1} where i ≤ k, the term can be computed as normal, and evaluating α^k_{k+1,k+1} takes O(|Σ||Q_P|^m) time.

Lemma 5. In the online setting, at each iteration k, only the (k + 1)-th column of table T needs to be evaluated.
In contrast to Lemma 1 in the offline setting, where only the elements in the (k + 1)-th column below index k need to be computed, all elements of the (k + 1)-th column need to be evaluated in the online setting. This is due to the sum in Lemma 4 being dependent on the terms α^{k−1}_{δ(q_{k+1},c),k}, because δ(q_{k+1}, c) can take on any value in [1, k]. Nevertheless, this leads to the following result.

Theorem 6. Given a stream of characters w = w_1 w_2 ..., the infix probability of each prefix of w can be computed online in O(|w|(|w| + |Σ|)|Q_P|^m) time.

(Algorithm 2, which extends the KMP DFA D with each new character and maintains a re-sizable table T, gives the pseudocode.)

Experimental Results
We now demonstrate the practical effectiveness of the improved and online algorithms. We generate a series of PFAs with varying state space and alphabet size. Because we store transition matrices as dense matrices and the algorithms depend only on |Q| and |Σ| (but not the number of transitions), the underlying structure of the PFA is unimportant. Thus, we can artificially generate the PFAs to control |Q| and |Σ| exactly. We consider PFAs with |Σ| ∈ {26, 100} and |Q| ∈ {500, 1500}. For each test, we use a random string of 10 characters and measure the time to perform each iteration of Algorithm 1, the asymptotic speedup described in Section 4, and Algorithm 2. We list the median of 10 trials for each iteration. The tests were implemented using Python 3.5 and NumPy and run on an Intel i7-6700 processor with 16 GB of RAM.
Table 2 contains the experimental results. Note that the asymptotic speedup and online algorithm outperform Algorithm 1 in every setting, which is in line with our theoretical analysis. Across all trials, each iteration of the improved algorithm speeds up while the online version slows down. These observations are not unexpected. The improved version only recomputes a k × k sub-table at iteration k and only requires O(|w| − k) multiplications. On the other hand, the online algorithm must perform O(k + |Σ|) multiplications at iteration k, so we expect the runtime to slowly increase. Unlike the online version, the number of operations per iteration of Algorithm 1 and the improved version does not depend on |Σ|, so their runtimes do not differ as |Σ| grows.
Consider the second use case for the online algorithm from Section 1, where we have a 500-state PFA with |Σ| = 26 and an input string of length 9, which we wish to extend while maximizing the resulting infix probability. We extrapolate from the timings in Table 2 and anticipate that finding the appropriate extension would take 26 × 0.656 ≈ 17.056 seconds using the faster offline algorithm. On the other hand, we expect the online method to take only 1.271 + 26 × 0.094 ≈ 3.715 seconds.
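The extrapolation above is plain arithmetic over the per-iteration timings quoted from Table 2 for this scenario; as a sanity check:

```python
# Timings (seconds) quoted above for the 500-state, |Sigma| = 26 scenario:
# 0.656 s for one offline iteration, 1.271 s to process the length-9 prefix
# online, and 0.094 s per online extension. One candidate per vocabulary entry.
offline_estimate = 26 * 0.656           # redo the final iteration per candidate
online_estimate = 1.271 + 26 * 0.094    # process the prefix once, then extend
```

The two estimates come out to roughly 17.056 and 3.715 seconds, matching the figures in the text.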

Conclusion
Building off of our previous work, we have considered the problem of incrementally computing the infix probabilities of each prefix of a given string. We provide an improved analysis of our incremental algorithm that leads to an asymptotic speedup. Furthermore, we solve the open problem of computing the infix probabilities of each prefix of a stream of characters. The problem of adapting this approach to higher-order statistical language models (such as PCFGs) remains open.

A Proofs
Lemma 1. Computing α^n_{0,n+1} only requires knowledge of the terms of the form α^k_{i,j}, where i, j ≥ k + 1, or of the form α^k_{0,k+1}.

Proof. This can be seen by expanding the term α^n_{0,n+1} = α^{n−1}_{0,n+1} ∪ α^{n−1}_{0,n} (α^{n−1}_{n,n})* α^{n−1}_{n,n+1}. The term α^{n−1}_{0,n+1} is always the empty set, as there is no path from state q_0 to q_{n+1} that does not go through state q_n in the KMP DFA. The terms α^{n−1}_{n,n} and α^{n−1}_{n,n+1} are of the first form, and recursively applying this expansion to the remaining term α^{n−1}_{0,n} proves the claim.
Lemma 2. Consider α^k_{i,j} where k + 1 ≤ i < j. Then α^k_{i,j} = α^{k−1}_{i,j}.

Proof. Let i = k + 1 + x and j = k + 1 + y, where x ≥ 0 and y > 0. Consider the expansion of the term α^k_{i,j} = α^{k−1}_{i,j} ∪ α^{k−1}_{i,k} (α^{k−1}_{k,k})* α^{k−1}_{k,j}. In the KMP DFA, state q_i has exactly one transition to state q_{i+1} and |Σ| − 1 transitions to lower (or equal) states. In other words, there is no path from a state of label i to a state with label at least i + 2 that does not go through state i + 1. Thus, α^{k−1}_{k,j} is empty and α^k_{i,j} = α^{k−1}_{i,j}.

Theorem 3. In Algorithm 1, the k-th iteration requires only O(|w|) matrix inversions and multiplications to update the dynamic programming table.
Proof. We use Lemmas 1 and 2. At iteration k of Algorithm 1, Lemma 1 states that we only need to update the lower-right k × k table, as that is all that is required to complete the (k + 1)-th iteration. Lemma 2 tells us that all of the terms in the lower-right k × k table except for the terms in the k-th column are the same as in the previous iteration. Thus, those terms can simply be copied, and the O(|w|) terms in the k-th column are updated normally, each requiring only a constant number of matrix multiplications and inversions.

Lemma 4. The term α^k_{k+1,k+1} can be computed as ⋃_{c ∈ Σ−{w_k}} c (α^{k−1}_{δ(q_{k+1},c),k+1}).

Proof. For simplicity, we assume there are no self loops in the KMP DFA except on the initial state. The case where there are can be handled similarly. Note that there can be at most one self loop not on the initial state of a KMP DFA. Such a self loop will be on the last state q_k for which w_k = w_{k−1} = ... = w_1.

First, we expand the term α^k_{k+1,k+1} = α^{k−1}_{k+1,k+1} ∪ α^{k−1}_{k+1,k} (α^{k−1}_{k,k})* α^{k−1}_{k,k+1}. Since we assume there are no self loops on states k or k + 1, we can simplify the expression to α^k_{k+1,k+1} = α^{k−1}_{k+1,k} α^{k−1}_{k,k+1}. The term α^{k−1}_{k,k+1} is whatever character is on the transition from state k to k + 1. On the other hand, α^{k−1}_{k+1,k} is the set of paths that take state k + 1 to state k without passing through states higher than k.
Lemma 5. In the online setting, at each iteration k, only the (k + 1)-th column of table T needs to be evaluated.
Proof. First, we know that α^k_{k+1,k+1} requires knowledge of each term in the k-th column of α^{k−1}. Further, expanding the term α^k_{i,k+1} shows that only terms in the k-th and (k + 1)-th columns of α^{k−1} are required for any of them. Elements in the (k + 1)-th column of α^{k−1} are equal to the transitions between state q_i and q_{k+1}, per Lemma 2. We then proceed by induction on k, and the claim follows.
Theorem 6. Given a stream of characters w = w_1 w_2 ..., the infix probability of each prefix of w can be computed online in O(|w|(|w| + |Σ|)|Q_P|^m) time.
At iteration k, we need only recompute the k-th column in the table. All but the k-th element in the column are computed using the normal recurrence; the remaining element is computed via Lemma 4.

Figure 1: An example PFA. Each state has an initial and final probability, and each transition has a label and transition probability.

Table 2: Timings from the experimental analysis of each algorithm. Alg 1 refers to Algorithm 1. "Faster" refers to the speedup described in Theorem 3. Online refers to Algorithm 2. All results are in seconds.