Maximum Likelihood Estimation of Factored Regular Deterministic Stochastic Languages

This paper proves that for every class C of stochastic languages defined with the co-emission product of finitely many probabilistic, deterministic finite-state acceptors (PDFA) and for every data sequence D of finitely many strings drawn i.i.d. from some stochastic language, the Maximum Likelihood Estimate of D with respect to C can be found efficiently by locally optimizing the parameter values. We show that a consequence of the co-emission product is that each PDFA behaves like an independent factor in a joint distribution. Thus, the likelihood function decomposes in a natural way. We also show that the negative log likelihood function is convex. These results are motivated by the study of Strictly k-Piecewise ($SP_k$) Stochastic Languages, which form a class of stochastic languages that is both linguistically motivated and naturally understood in terms of the co-emission product of certain PDFAs.


Introduction
Stochastic languages are probability distributions over all possible strings of finite length. A class C of stochastic languages is often defined parametrically: an assignment of values to the parameters uniquely determines some stochastic language L in C and thus the probabilities that L assigns to strings. An important learning criterion for a class of stochastic languages C is whether there is an algorithm which reliably returns a Maximum-Likelihood Estimate (MLE) of an observed data sample D. The MLE is the parameter values which maximize the probability of D with respect to C.
This paper focuses on regular deterministic stochastic languages. These are stochastic languages that can be defined with probabilistic, deterministic, finite-state acceptors (PDFAs).
The problem of finding the MLE, however, is not only about some single stochastic language L, but also about the class of stochastic languages that L belongs to. Each PDFA M naturally defines a class of stochastic languages $C_M$ because the transition probabilities in the PDFA provide a range of possible parameter values, as we explain in detail in section 2. In this case, it is well understood how to find the MLE of a sequence of strings drawn i.i.d. from L with respect to $C_M$ (Vidal et al., 2005a,b). This paper is concerned with finding the MLE for different classes of stochastic languages.
In particular, we consider the case where C is defined by the range of parametric values over finitely many PDFAs $A = \{M_1, \ldots, M_K\}$, whose co-emission product determines the probabilities each L ∈ C assigns to strings. Essentially, the co-emission product of these PDFAs factors the probabilities each L ∈ C assigns to strings. Each L is a complex joint distribution, and each PDFA $M_j$ represents a 'more basic' regular stochastic language whose parameter values independently contribute to L. At a high level, the problem we are considering is like those addressed with Bayesian networks and Markov random fields, where complex probability distributions decompose into simpler factors (Bishop, 2006; Koller and Friedman, 2009). We refer to the classes C we study in this paper as factored, regular, probabilistic, and deterministic (FRPD).
The main result is to show how the parameters of a FRPD class C can be efficiently updated to find those parameter values which maximize the likelihood of the observed sequences (Theorem 2). We also show directly that each negative log likelihood associated with each FRPD class C is convex (Theorem 3). Together these results imply that the efficient method we present for updating the parameter values will yield the MLE.
There are several reasons for being interested in such factored classes C. Perhaps the most important from our perspective is expressed by Koller and Friedman (p. 1134): "The ability to exploit structure in the distribution is the basis for providing a compact representation of high-dimensional . . . probability spaces." In our case, the size of the representation of the class given by $A = \{M_1, \ldots, M_K\}$ is simply the sum of the sizes of the $M_j$. In contrast, the representation of the class given by the co-emission product is in the worst case the product of the sizes of the $M_j$. One direct benefit of this is that the number of parameters is reduced, which makes it possible to estimate them more accurately with less data. Other advantages discussed by Koller and Friedman, such as modularity, we return to in the discussion in the conclusion.
There are also linguistic reasons to be interested in FRPD classes. The Strictly Piecewise (SP) class of languages encodes certain types of long-distance dependencies found in natural languages. For example, SP languages can express generalizations like "at most one b per string" and "no b may follow an a". Generalizations with this formal character are known to occur in the phonologies of the world's languages (Heinz, 2010a, 2014, 2018). As  explain, Strictly Piecewise languages are characterized by the intersection product of finitely many deterministic finite-state acceptors (DFA).  used this characterization and the co-emission product to define the class of Strictly Piecewise stochastic languages because they were interested in the learnability of long-distance dependencies in natural languages probabilistically. They presented a learning algorithm for a class of SP stochastic languages and claimed (p. 894) that it returns the MLE.
The results in this paper can be seen as providing a more general and more rigorous proof of their basic claim. Theorem 2 establishes how to update the parametric values which locally optimize the model of any FRPD class. Theorem 3 shows the negative log likelihood function of any FRPD class is convex, so any locally optimal set of parametric values for a sequence of data is also globally optimal. Furthermore, we prove these results in terms of the standard definition of co-emission product, and not the variant used in . (While the results here work for both, we only prove the standard case.) These general results make it possible to explore not only the learning of $SP_k$ stochastic languages, but also any finite combination of PDFAs that characterize different kinds of local and non-local dependencies which can be expressed with regular grammars. We return to this issue in the discussion.
To our knowledge, such results for FRPD classes have not been previously discussed in the literature. One reason for this is that much work on natural language processing uses probabilistic non-deterministic automata. These describe the same class of stochastic languages as Hidden Markov Models (HMMs) (Vidal et al., 2005a,b). Non-determinism can make a big difference when it comes to parsing and learning. In a deterministic model M, each string w can be associated with at most one path through M, whereas in a non-deterministic M, there can be infinitely many paths for w. This is one reason why methods used for learning HMMs are not guaranteed to return an MLE. Since the states are 'hidden', one uses methods like Expectation Maximization, which may converge to a local optimum that is not a global optimum (Jurafsky and Martin, 2008; Heinz et al., 2015).
On the other hand, we show that, by carefully choosing the class of stochastic languages C with respect to which the MLE is to be found, we can exploit the structure we assume to be present to guarantee that we find an MLE. This paper takes one step in establishing the theoretical soundness of this approach.
Finally, one reviewer commented that these results may follow from fundamental theorems in the literature on probabilistic graphical models (Koller and Friedman, 2009). Regardless of whether this is true, the correctness of the proofs here stands. Also, the general results of Bayesian networks and Markov random fields say nothing about the concrete form of the algorithm for obtaining the MLE with respect to an FRPD class C given data D, nor about its time complexity. Malouf (2002) makes a similar point, writing "While all parameter estimation algorithms we will consider take the same general form, the method for computing the updates . . . differs substantially." Nonetheless, how probabilistic graphical models relate to this line of research ought to be made clear.
The remainder of the paper is organized as follows. In section 2 we review languages, stochastic languages, deterministic finite-state acceptors and probabilistic versions thereof, the intersection and co-emission products, and the statement of the learning problem. Before presenting our main results, section 3 defines Strictly Piecewise (stochastic) languages, which provide a running example to illustrate the main results, which are presented in section 4. The computational complexity of the updates is analyzed in section 5, and section 6 concludes.

Sets of Strings
Σ denotes a finite set of symbols and $\Sigma^k$, $\Sigma^{\leq k}$, and $\Sigma^*$ denote all strings over this alphabet of length k, of length less than or equal to k, and of any finite length, respectively. λ denotes the empty string. The length of a string w is written |w|. The prefixes of a string w are $\mathrm{Pref}(w) = \{v \mid \exists u \in \Sigma^*, vu = w\}$. A string $w = \sigma_1 \cdots \sigma_n$ is a subsequence of a string v if and only if $v \in \Sigma^* \sigma_1 \Sigma^* \cdots \Sigma^* \sigma_n \Sigma^*$, in which case we write $w \sqsubseteq v$.
A language L is a subset of $\Sigma^*$. The complement of a language L, denoted $\overline{L}$, is $\Sigma^* \setminus L$. The shuffle ideal of w is the language of all strings containing w as a subsequence:
$$SI(w) = \{v \in \Sigma^* \mid w \sqsubseteq v\}.$$
A stochastic language L is a probability distribution over $\Sigma^*$. The probability P of word w with respect to L is written $P_L(w) = p$. Thus, all stochastic languages L satisfy $\sum_{w \in \Sigma^*} P_L(w) = 1$.
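The subsequence relation and shuffle-ideal membership can be checked with a single left-to-right scan. The following is a minimal sketch; the function names `is_subsequence` and `in_shuffle_ideal` are illustrative assumptions and do not come from the paper.

```python
def is_subsequence(w: str, v: str) -> bool:
    """Return True iff w is a subsequence of v (greedy left-to-right match)."""
    remaining = iter(v)
    # `sym in remaining` consumes the iterator up to and including the match,
    # so successive symbols of w must occur in order within v.
    return all(sym in remaining for sym in w)


def in_shuffle_ideal(w: str, v: str) -> bool:
    """Return True iff v belongs to SI(w), i.e. v contains w as a subsequence."""
    return is_subsequence(w, v)


print(in_shuffle_ideal("bb", "abab"))  # "abab" contains two b's in order
print(in_shuffle_ideal("bb", "aba"))   # only one b, so "aba" is outside SI(bb)
```

The empty string is a subsequence of every string, so SI(λ) is all of Σ* under this sketch.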

Probabilistic Deterministic Finite-state Acceptors
A Deterministic Finite-state Acceptor (DFA) is a tuple $M = \langle Q, \Sigma, q_0, \delta, F \rangle$ where Q is the state set, Σ is the alphabet, $q_0$ is the start state, δ is a deterministic transition function with domain Q × Σ and codomain Q, and F is the set of accepting states. Let $\delta^* : Q \times \Sigma^* \to Q$ be the (partial) path function of M. When discussing partial functions, the notation ↑ and ↓ indicates that the function is not defined, respectively, is defined, for particular arguments. Thus $\delta^*(q, w)$ is the (unique) state reachable from state q via the sequence w, if any, or $\delta^*(q, w)\uparrow$ otherwise. The language recognized by a DFA M is
$$L(M) = \{w \in \Sigma^* \mid \delta^*(q_0, w)\downarrow \in F\}.$$
A Probabilistic Deterministic Finite-state Acceptor (PDFA) is a tuple $M = \langle Q, \Sigma, q_0, \delta, F, T \rangle$ where Q, Σ, $q_0$, and δ are the same as with DFA, and F and T are partial functions representing the final-state and transition probabilities. In particular, $T : Q \times \Sigma \to \mathbb{R}^+$ and $F : Q \to \mathbb{R}^+$ such that for all $q \in Q$,
$$F(q) + \sum_{\sigma \in \Sigma} T(q, \sigma) = 1. \quad (1)$$
The probability a PDFA assigns to w is obtained by multiplying the transition probabilities along w's path, if it exists, with the final probability, and is zero otherwise.
A probability distribution is regular deterministic iff there is a PDFA which generates it. We sometimes write M(w) instead of $P_{L(M)}(w)$.
The structural components of a PDFA M are its states Q, its alphabet Σ, its transitions δ, and its initial state q 0 . By structure of a PDFA, we mean its structural components. The structure of each PDFA M defines a class of stochastic languages given by the possible instantiations of T and F satisfying Equation 1. These distributions have at most |Q|· (|Σ| + 1) independent parameters (since for each state there are |Σ| possible transitions plus the possibility of finality.)
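As an illustration of how a PDFA assigns a probability to a string, the sketch below multiplies the transition probabilities along the unique path and then the final probability. The dictionary encoding of δ, T, and F and the one-state example machine are assumptions made here for illustration only.

```python
from typing import Dict, Tuple

Delta = Dict[Tuple[str, str], str]      # (state, symbol) -> next state
Weights = Dict[Tuple[str, str], float]  # (state, symbol) -> transition probability


def pdfa_probability(word: str, q0: str, delta: Delta, T: Weights,
                     F: Dict[str, float]) -> float:
    """Product of transition probabilities along the path of `word`, times
    the final probability of the last state; 0.0 if the path does not exist."""
    q, p = q0, 1.0
    for sym in word:
        if (q, sym) not in delta:
            return 0.0
        p *= T.get((q, sym), 0.0)
        q = delta[(q, sym)]
    return p * F.get(q, 0.0)


# Hypothetical one-state PDFA over {a, b}: F(q) + T(q, a) + T(q, b) = 1.
delta = {("q", "a"): "q", ("q", "b"): "q"}
T = {("q", "a"): 0.5, ("q", "b"): 0.25}
F = {"q": 0.25}
print(pdfa_probability("ab", "q", delta, T, F))  # 0.5 * 0.25 * 0.25 = 0.03125
```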

The co-emission product
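The intersection product of K DFAs $M_1, \ldots, M_K$ is given by the standard construction over the state space $Q_1 \times \cdots \times Q_K$ (Hopcroft et al., 2001). We write $\bigotimes_{1 \leq j \leq K} M_j$ for this product. The co-emission product of K PDFAs $M_1, \ldots, M_K$ is also given by a construction over the state space $Q_1 \times \cdots \times Q_K$. The probability that σ is co-emitted from $\langle q_1, \ldots, q_K \rangle$ in $Q_1 \times \cdots \times Q_K$ is the product of the probabilities of its emission at each $q_j \in Q_j$. Let
$$CoT(\sigma, \langle q_1 \ldots q_K \rangle) = \prod_{j=1}^{K} T_j(q_j, \sigma).$$
Similarly, the probability that a word simultaneously ends at each $q_j \in Q_j$ is
$$CoF(\langle q_1 \ldots q_K \rangle) = \prod_{j=1}^{K} F_j(q_j).$$
Let
$$Z(\langle q_1 \ldots q_K \rangle) = CoF(\langle q_1 \ldots q_K \rangle) + \sum_{\sigma \in \Sigma} CoT(\sigma, \langle q_1 \ldots q_K \rangle)$$
be the normalization term. Next we define the co-emission product.
Definition 1 (Co-emission product) Let $A = \{M_1, \ldots, M_K\}$ be a set of PDFAs over a common alphabet Σ. The co-emission product of A is the PDFA $\bigotimes A = \langle Q, \Sigma, q_0, \delta, F, T \rangle$ where
1. Q, $q_0$, and δ are defined as with DFA product; and
2. For all q ∈ Q and σ ∈ Σ:
$$T(q, \sigma) = \frac{CoT(\sigma, q)}{Z(q)}, \qquad F(q) = \frac{CoF(q)}{Z(q)}.$$
In other words, the numerators of T and F are defined to be the co-emission probabilities and division by Z ensures that the co-emission product $\bigotimes A$ defines a well-formed probability distribution over $\Sigma^*$.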
Observe that A also defines a class of stochastic languages by the possible instantiations of T j and F j for each M j ∈ A. The structural components of A are the structural components of each M j ∈ A. By structure of A, we mean its structural components. The structure of A defines a class of stochastic languages given by the possible instantiations of T j and F j satisfying Equation 1 for each M j ∈ A.
If $\bigotimes A = M$ then the class of stochastic languages induced by the structure of A is a subset of the class of stochastic languages obtained with the structure of the PDFA M. This is another way of saying that a factorized model may have fewer parameters and so the class of stochastic languages it represents can become smaller.
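The normalization in the co-emission product can be made concrete at a single joint state. The sketch below computes the product machine's T and F at one tuple of factor states; the dictionary layout and the two hypothetical one-state factors are assumptions for illustration.

```python
from math import prod


def coemission_at(states, Ts, Fs, alphabet):
    """Return (T, F) of the co-emission product at the joint state `states`.

    Ts[j][(q, s)] and Fs[j][q] are the transition and final probabilities
    of the j-th factor PDFA.
    """
    cot = {s: prod(T[(q, s)] for T, q in zip(Ts, states)) for s in alphabet}
    cof = prod(F[q] for F, q in zip(Fs, states))
    z = cof + sum(cot.values())  # normalization term Z
    return {s: v / z for s, v in cot.items()}, cof / z


# Two hypothetical one-state factors over {a, b}.
Ts = [{("p", "a"): 0.5, ("p", "b"): 0.25}, {("r", "a"): 0.25, ("r", "b"): 0.5}]
Fs = [{"p": 0.25}, {"r": 0.25}]
T, F = coemission_at(("p", "r"), Ts, Fs, ["a", "b"])
print(T, F)  # normalized so that F + T[a] + T[b] = 1
```

In this example CoT is 0.125 for both symbols and CoF is 0.0625, so Z = 0.3125 and the product machine emits a and b with probability 0.4 each and stops with probability 0.2.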

Statement of the Learning Problem
Let D be a finite sequence of |D| examples drawn i.i.d. from a stochastic language L. It follows that $P_L(D) = \prod_{w \in D} P_L(w)$.
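Let $A = \{M_1, \ldots, M_K\}$ be a set of PDFAs and let $C_A$ denote the FRPD class of stochastic languages induced by the structure of A. The likelihood of D w.r.t. $C_A$ is determined by the parameters (the $T_j$ and $F_j$ functions for each $M_j \in A$). Let us group these parameters under the symbol Θ. Each Θ identifies some stochastic language $L_\Theta \in C_A$, and the likelihood of D given Θ is $\mathrm{lhd}(D \mid \Theta) = \prod_{w \in D} P_{L_\Theta}(w)$. The learning problem is to find the MLE
$$\hat{\Theta} = \arg\max_{\Theta} \mathrm{lhd}(D \mid \Theta),$$
where Θ under the arg max ranges over all possible parameter values of A.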
When |A| = 1 the problem has a known solution. As mentioned, a single PDFA M defines a class of stochastic languages given by possible parameter values of M. In this case, it is well known how to find $\hat{\Theta}$. Essentially, each transition probability T(q, σ) equals the relative frequency with which symbol σ is emitted at state q (Vidal et al., 2005a,b). In this paper, we solve this problem when |A| > 1.

Strictly k-Piecewise stochastic languages
In this section, we introduce the Strictly k-Piecewise stochastic languages, which serve as a running example of a FRPD class in the remainder of the paper.  define and provide multiple characterizations of Strictly Piecewise (SP) languages. We review the most relevant ones for this paper here. SP languages are exactly those formal languages that are closed under subsequence. Rogers et al. (2010, p. 260) prove that every SP language L can be associated with a finite set of strings S such that L is the intersection of the complements of the shuffle ideals of S.
The SP languages are parameterized by a value k ∈ N. This number corresponds to the length of the longest string in S. For each SP language L, if there is a set S whose longest string has length k, then L belongs to the $SP_k$ class of languages.
If k is known a priori then the $SP_k$ languages are both PAC-learnable and identifiable in the limit in polynomial time and data (Heinz, 2010b; Heinz et al., 2012).¹ Theorem 1 allows one to construct concrete computational models for SP languages with DFA. For any nonempty string $w = \sigma_1 \cdots \sigma_n$, $SI(w) = L(M_w)$ where $M_w$ is defined as follows. The states are the prefixes of w, the start state is λ, and the final state is w. For all prefixes p of w and σ ∈ Σ, let δ(p, σ) = pσ whenever pσ is a prefix of w and p otherwise. Figure 1 gives an example of such a DFA, $M_{abba}$.
The complement $\overline{SI(w)}$ is essentially obtained from $M_w$ by removing its maximal state and making every state final. In other words, if w = va then $\overline{SI(w)}$ can be recognized by an automaton where the states are the prefixes of v, the start state is λ, and each state is a final state. For all prefixes p of v and σ ∈ Σ, δ(p, σ) = pσ whenever pσ is a prefix of v. When pσ is not a prefix of v and pσ ≠ w, δ(p, σ) = p. Finally, δ(v, a) is not defined. We denote such a DFA as $\overline{M_w}$. Figure 2 shows the DFA $\overline{M_{abba}}$, which recognizes the complement of SI(abba). Both $M_w$ and the DFA recognizing its complement are minimal.
It follows that for any L ∈ SP, one can construct a DFA recognizing L by taking the product of the complements of the shuffle ideals of the strings in S.
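Membership in an SP language can be decided by running the string on each factor automaton separately rather than on the product. The sketch below simulates the DFA for the complement of SI(w) directly from w; the function names and the example forbidden sets are illustrative assumptions.

```python
def avoids_subsequence(w: str, s: str) -> bool:
    """Run s on the DFA for the complement of SI(w), for nonempty w.

    States are prefixes of v, where w = v + a. Returns False as soon as the
    undefined transition delta(v, a) is needed, i.e. when s contains w.
    """
    v, a = w[:-1], w[-1]
    p = ""  # current state, always a prefix of v
    for sym in s:
        if p == v and sym == a:
            return False          # delta(v, a) is undefined
        if v.startswith(p + sym):
            p += sym              # advance along v
        # otherwise the automaton self-loops at p
    return True


def in_sp_language(forbidden, s: str) -> bool:
    """s is in the SP language iff every factor automaton accepts it."""
    return all(avoids_subsequence(w, s) for w in forbidden)


print(in_sp_language(["ab"], "bba"))   # no b follows an a
print(in_sp_language(["bb"], "abab"))  # violates "at most one b per string"
```

Each factor is traversed in time linear in |s|, so the total cost grows with the sum of the factor sizes rather than with the size of the product automaton.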
Note the size of $M_1, \ldots, M_K$ is $\sum_{1 \leq j \leq K} |M_j|$ whereas the size of $M = \bigotimes_{1 \leq j \leq K} M_j$ is in the worst case $\prod_{1 \leq j \leq K} |M_j|$. Therefore, to decide whether a string w belongs to some SP language L, it may be preferable to run w on each $M_j$ instead of on M to avoid the potentially large increase in the state space. See  for additional discussion of this point.  use the fact that SP languages are the intersection of the complements of shuffle ideals to define their stochastic counterpart. They define stochastic versions of $\overline{M_w}$ (Figure 2), which they call w-subsequence-distinguishing PDFA.
¹ Also, SP languages suggest a different representation for strings , which informs machine learning in other ways. The SPiCE competition (Balle et al., 2016), in which machine learning models competed to best predict the next symbol in natural and artificial sequences, was won by Shibata and Heinz (2016), who integrated SP-style representations into a neural network.
Definition 2 (Subsequence-distinguishing PDFA) Let $w \in \Sigma^{k-1}$ and $w = \sigma_1 \cdots \sigma_{k-1}$. $M_w = \langle Q, \Sigma, q_0, \delta, F, T \rangle$ is a w-subsequence-distinguishing PDFA (w-SD-PDFA) iff F and T satisfy Equation 1 and δ(u, σ) = uσ whenever uσ ∈ Pref(w) and u otherwise.
Apart from the stochastic components T and F, the w-subsequence-distinguishing PDFA differs from $\overline{M_w}$ in one key way. Suppose w = va. Then δ(v, a) = v in the w-subsequence-distinguishing PDFA, rather than being undefined as was the case with $\overline{M_w}$. This transition exists and may have a nonzero probability.
A set A of PDFAs is a k-set of SD-PDFAs iff, for each $w \in \Sigma^{\leq k-1}$, it contains exactly one w-SD-PDFA. For example, let Σ = {a, b} and consider the 2-set of SD-PDFAs shown in Figure 3. There are three SD-PDFAs in this set corresponding to $M_\lambda$, $M_a$, and $M_b$.  define $SP_k$ stochastic languages as a product of a k-set of SD-PDFAs. Specifically, they adapt the notion of co-emission probability (Vidal et al., 2005a).  actually use what they call the positive co-emission product, which restricts the standard co-emission probability to particular circumstances.
In this work, we define SP stochastic languages with the standard definition of co-emission probability used to define products of PDFA as in Definition 1 (Vidal et al., 2005a).
Definition 3 (SP Stochastic Languages) A probability distribution P over $\Sigma^*$ is an SP stochastic language iff there exists a k-set of SD-PDFAs A, whose co-emission product is $M = \bigotimes A$, such that for all $w \in \Sigma^*$, it is the case that P(w) = M(w).
It follows immediately from this definition that the class of SP stochastic languages is an FRPD class. In this case, the parameters of such a distribution are the T and F values on each w-subsequence-distinguishing PDFA in the k-set. In the example in Figure 3, there are thus 15 parameters of the model, 10 of which are free. This is because there are three actions associated with each state (a, b, and finality); there are five states; but since the probabilities must add to one only two parameters per state are free. More generally, a k-set of SD-PDFAs A has $|\Sigma| \cdot \sum_{M_j \in A} |Q_j|$ free parameters.
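The parameter count can be reproduced by enumerating the state sets of a k-set of SD-PDFAs. This is a small sketch under the assumption that the states of the w-SD-PDFA are exactly the prefixes of w; the function names are hypothetical.

```python
from itertools import product


def k_set_states(alphabet: str, k: int) -> dict:
    """Map each w with |w| <= k-1 to the state set Pref(w) of its w-SD-PDFA."""
    machines = {}
    for n in range(k):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            machines[w] = [w[:i] for i in range(len(w) + 1)]
    return machines


def parameter_counts(alphabet: str, k: int):
    """Return (number of states, total parameters, free parameters)."""
    n_states = sum(len(states) for states in k_set_states(alphabet, k).values())
    total = n_states * (len(alphabet) + 1)  # one action per symbol, plus finality
    free = n_states * len(alphabet)         # one constraint per state
    return n_states, total, free


print(parameter_counts("ab", 2))  # (5, 15, 10)
```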

Main Theorem for MLE of FRPD classes
We provide our main results here, using the 2-set of SD-PDFAs shown in Figure 3 as an illustrative example.

The Co-emission Probability Given a Prefix
It is useful to consider the co-emission probability of the symbol σ given the prefix σ 1 · · · σ i−1 , which we denote Coemit(σ, i). It follows from Definitions 1 and 3 that this value is the normalized product of the path through A given by the prefix σ 1 · · · σ i−1 .
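Formally, let $M_1 = \langle Q_1, \Sigma, q_{01}, \delta_1, F_1, T_1 \rangle, \ldots, M_K = \langle Q_K, \Sigma, q_{0K}, \delta_K, F_K, T_K \rangle$ be exactly those PDFAs in A. Suppose that $w = \sigma_1 \cdots \sigma_N$, where $\sigma_i \in \Sigma$ for all 1 ≤ i ≤ N. Let q(j, i) denote the state in $Q_j$ that is reached after $M_j$ reads the prefix $\sigma_1 \cdots \sigma_{i-1}$. If i = 1 then q(j, i) represents the initial state of $M_j$. Then it follows from Definition 1 that the probability that a symbol σ is emitted after the product machine $\bigotimes_{1 \leq j \leq K} M_j$ reads the prefix $\sigma_1 \cdots \sigma_{i-1}$ is the following:
$$\mathrm{Coemit}(\sigma, i) = \frac{\prod_{j=1}^{K} T_j(q(j,i), \sigma)}{\prod_{j=1}^{K} F_j(q(j,i)) + \sum_{\sigma' \in \Sigma} \prod_{j=1}^{K} T_j(q(j,i), \sigma')}$$
To simplify the notation and analysis, we assume that there is an end marker $\ltimes \in \Sigma$ which uniquely occurs at the end of words. This lets us replace $F_j(q)$ with $T_j(q, \ltimes)$. Then Coemit(σ, i) is simply written as
$$\mathrm{Coemit}(\sigma, i) = \frac{\prod_{j=1}^{K} T_j(q(j,i), \sigma)}{\sum_{\sigma' \in \Sigma} \prod_{j=1}^{K} T_j(q(j,i), \sigma')} \quad (5)$$
The probability that the machine $\bigotimes_{1 \leq j \leq K} M_j$ accepts w is obtained by taking the product of the co-emission probabilities for all i:
$$P(w\ltimes) = \prod_{i=1}^{N+1} \mathrm{Coemit}(\sigma_i, i) \quad (6)$$
where $\sigma_{N+1} = \ltimes$.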
Since we are concerned with the co-emission probability, which is a ratio, it is noteworthy that it does not matter whether the sum $\sum_{\sigma' \in \Sigma} T_j(q, \sigma')$ is 1. The ratio Coemit(σ, i), and thus $P(w\ltimes)$, is invariant with respect to the scale of $T_j(q, \sigma')$ and hence to the sum $\sum_{\sigma' \in \Sigma} T_j(q, \sigma')$. Writing this sum as z(j, q), this can easily be confirmed from the fact that multiplying both the denominator and the numerator by 1/z(j, q) does not change the value of Coemit(σ, i) while normalizing $T_j(q, \cdot)$. Thus, we can relax the condition in Equation 1 when discussing co-emission probabilities. The only condition that needs to be satisfied with respect to the transitions is that $T_j(q, \sigma') \geq 0$ for all j, q, σ'. Note that relaxing this condition does not affect the number of free parameters. This is because the numerical values associated with the transitions, once normalized, will always sum to 1. In the following, we assume this relaxed condition.
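The scale invariance described above can be checked numerically. The sketch below evaluates the probability of a string as a product of co-emission probabilities for the 2-set over {a, b}, then rescales the weights at one state of one factor. The symbol `"#"` standing in for the end marker, the helper names, and the uniform starting weights are all assumptions for illustration.

```python
from math import prod

END = "#"  # assumed stand-in for the end marker


def sd_step(w, u, sym):
    """Transition of the w-SD-PDFA: extend the prefix if possible, else stay."""
    return u + sym if w.startswith(u + sym) else u


def coemit(Ts, states, symbols):
    """Co-emission distribution over `symbols` at the joint state."""
    weights = {s: prod(T[(q, s)] for T, q in zip(Ts, states)) for s in symbols}
    z = sum(weights.values())
    return {s: v / z for s, v in weights.items()}


def string_probability(ws, Ts, word, symbols):
    """P(word followed by END) as a product of co-emission probabilities."""
    states = [""] * len(ws)
    p = 1.0
    for sym in list(word) + [END]:
        p *= coemit(Ts, states, symbols)[sym]
        states = [sd_step(w, q, sym) for w, q in zip(ws, states)]
    return p


def uniform_weights(w, symbols):
    """Weight 1.0 for every (state, symbol) pair of the w-SD-PDFA."""
    return {(w[:i], s): 1.0 for i in range(len(w) + 1) for s in symbols}


symbols = ["a", "b", END]
ws = ["", "a", "b"]  # the 2-set over {a, b}
Ts = [uniform_weights(w, symbols) for w in ws]
p = string_probability(ws, Ts, "ab", symbols)

# Rescale every weight at state "a" of the factor for w = "a" by 7.
scaled = [dict(T) for T in Ts]
for s in symbols:
    scaled[1][("a", s)] *= 7.0
p_scaled = string_probability(ws, scaled, "ab", symbols)
print(p, p_scaled)  # both equal (1/3)**3 up to rounding
```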

Frequency and Empirical Mean of Co-emission Probability
Before describing the main theorem, we define two terms; the frequency of an emission and the empirical mean of a co-emission probability, which play important roles in estimating transition probabilities for product machines.
Definition 4 (Frequency of Emission) For a given w, we define the frequency of σ at $q \in Q_j$ as follows. Let
• $m_w(M_j, q, \sigma) \in \mathbb{Z}^+$ denote how many times σ is emitted at the state q while the machine $M_j$ emits w.
• $n_w(M_j, q) \in \mathbb{Z}^+$ denote how many times the state q is visited while the machine $M_j$ emits w.
Then
$$\mathrm{freq}_w(\sigma \mid M_j, q) = \frac{m_w(M_j, q, \sigma)}{n_w(M_j, q)}. \quad (7)$$
So $\mathrm{freq}_w(\sigma \mid M_j, q)$ represents the relative frequency with which $M_j$ emits σ at q during emission of w. These concepts can be lifted to a sequence of strings D drawn i.i.d. from some stochastic language. Let
$$m_D(M_j, q, \sigma) = \sum_{w \in D} m_w(M_j, q, \sigma), \qquad n_D(M_j, q) = \sum_{w \in D} n_w(M_j, q).$$
It follows that
$$\mathrm{freq}_D(\sigma \mid M_j, q) = \frac{m_D(M_j, q, \sigma)}{n_D(M_j, q)}.$$
So $\mathrm{freq}_D(\sigma \mid M_j, q)$ represents the relative frequency with which $M_j$ emits σ at q during emission of D.
As an example, consider the 2-set of PDFAs in Figure 3 and consider the sample data $D = \langle abb\ltimes, bbb\ltimes \rangle$. Figure 4 shows the paths of these strings through each SD-PDFA. Figure 5 shows some of the frequency computations.
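The frequency counts can be obtained by tracing each factor machine over the data independently. The sketch below does this for SD-PDFAs; the sample `["abb", "bbb"]` is an assumption chosen so that the counts agree with the frequency values 1/8, 5/8, 1/5, 3/5, 0/3 and 2/3 reported for Figure 5, and `"#"` is an assumed stand-in for the end marker.

```python
from collections import Counter
from fractions import Fraction

END = "#"  # assumed stand-in for the end marker


def sd_step(w, u, sym):
    """Transition of the w-SD-PDFA: extend the prefix if possible, else stay."""
    return u + sym if w.startswith(u + sym) else u


def emission_counts(w, data):
    """Return (m, n): emission counts m[(q, sym)] and visit counts n[q]
    of the w-SD-PDFA over all strings in `data`."""
    m, n = Counter(), Counter()
    for word in data:
        q = ""  # each string starts at the initial state lambda
        for sym in list(word) + [END]:
            m[(q, sym)] += 1
            n[q] += 1
            q = sd_step(w, q, sym)
    return m, n


def freq(w, data, q, sym):
    """Relative frequency with which the w-SD-PDFA emits sym at state q."""
    m, n = emission_counts(w, data)
    return Fraction(m[(q, sym)], n[q])


data = ["abb", "bbb"]
print(freq("", data, "", "a"), freq("", data, "", "b"))      # 1/8 5/8
print(freq("a", data, "", "a"), freq("a", data, "", "b"))    # 1/5 3/5
print(freq("a", data, "a", "a"), freq("a", data, "a", "b"))  # 0 2/3
```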
If K = 1, i.e., the product machine consists of one PDFA, then $\mathrm{freq}_w(\sigma \mid M_1, q)$ is the MLE of $T_1(q, \sigma)$ (Vidal et al., 2005a,b). If K ≥ 2, however, the probability of the emission, which equals the co-emission probability, varies with the states that the other machines currently occupy. Thus $\mathrm{freq}_w(\sigma \mid M_j, q)$, as a random variable, is not independent of the other machines' states. This motivates the following definition.

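Definition 5 (Empirical Mean) Let
$$\mathrm{sumCoemit}_w(\sigma, M_j, q) = \sum_{1 \leq i \leq N+1,\; q(j,i) = q} \mathrm{Coemit}(\sigma, i).$$
The empirical mean of a co-emission probability is defined as follows:
$$\mathrm{Coemit}_w(\sigma \mid M_j, q) = \frac{\mathrm{sumCoemit}_w(\sigma, M_j, q)}{n_w(M_j, q)}, \quad (8)$$
i.e., the sample average of the co-emission probability when $q \in Q_j$ is visited.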
When a state in $M_j$ is visited more than once while emitting w, it does not imply that some other state in $M_h$ is also visited more than once. In other words, if there are positions $i \neq \ell$ such that $q(j, i) = q(j, \ell)$ then it does not have to follow that $q(h, i) = q(h, \ell)$ for another machine $M_h$. Thus, even when $M_j$ and the value of q(j, i) are fixed, Coemit(σ, i) fluctuates. The empirical mean is the average taken over such fluctuating co-emission probabilities.
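The fluctuation of the co-emission probability at a fixed state of one factor can be seen in a small computation. The sketch below accumulates the co-emission probabilities at every visit to each state of each factor and averages them; the weights (in particular the value 4.0), the word `"abb"`, and the `"#"` end marker are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import prod

END = "#"  # assumed stand-in for the end marker


def sd_step(w, u, sym):
    """Transition of the w-SD-PDFA: extend the prefix if possible, else stay."""
    return u + sym if w.startswith(u + sym) else u


def empirical_means(ws, Ts, word, symbols):
    """Average co-emission probability of each symbol over the visits to
    each state q of each factor j, keyed by (j, q, symbol)."""
    total = defaultdict(float)  # (j, q, s) -> sum of co-emission probabilities
    visits = Counter()          # (j, q)    -> number of visits
    states = [""] * len(ws)
    for sym in list(word) + [END]:
        weights = {s: prod(T[(q, s)] for T, q in zip(Ts, states)) for s in symbols}
        z = sum(weights.values())
        for j, q in enumerate(states):
            visits[(j, q)] += 1
            for s in symbols:
                total[(j, q, s)] += weights[s] / z
        states = [sd_step(w, q, sym) for w, q in zip(ws, states)]
    return {(j, q, s): v / visits[(j, q)] for (j, q, s), v in total.items()}


symbols = ["a", "b", END]
ws = ["", "a", "b"]
Ts = [{(w[:i], s): 1.0 for i in range(len(w) + 1) for s in symbols} for w in ws]
Ts[1][("a", "b")] = 4.0  # hypothetical: after an a, the factor for "a" favours b
means = empirical_means(ws, Ts, "abb", symbols)
print(means[(2, "", "b")])  # average of 1/3 and 2/3 at state lambda of the "b" factor
```

Here the factor for w = "b" visits its state λ twice, and the co-emission probability of b is 1/3 on the first visit and 2/3 on the second because the factor for w = "a" has moved in between, so the empirical mean is 1/2.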

Main Theorem and Convexity
Theorems 2 and 3 are our main results. We simplify the proofs by assuming that D consists of a single sentence. That is, in both theorems, we consider $D = \{w\ltimes\}$. We can do this without loss of generality because any finite sequence of strings D drawn i.i.d. from a stochastic language can be converted into a single sentence without changing the probability of its production.
[Figure 5: Some of the frequency computations for the sample D. $\mathrm{freq}_D(a \mid M_\lambda, \lambda) = 1/8$, $\mathrm{freq}_D(b \mid M_\lambda, \lambda) = 5/8$, $\mathrm{freq}_D(a \mid M_a, \lambda) = 1/5$, $\mathrm{freq}_D(b \mid M_a, \lambda) = 3/5$, $\mathrm{freq}_D(a \mid M_a, a) = 0/3$, $\mathrm{freq}_D(b \mid M_a, a) = 2/3$.]
To see why, we can adjust the transition function of each PDFA $M_j$ so that $\delta_j(q, \ltimes) = q_{0j}$ for each $q \in Q_j$. In other words, once $\ltimes$ is emitted, the machines reset to their start states. Then for any $D = \{w_1\ltimes, \cdots, w_k\ltimes\}$, we have P(D) = P(concat(D)) where $\mathrm{concat}(D) = w_1 \ltimes w_2 \ltimes \cdots w_k \ltimes$. Thus, $w\ltimes$ in both theorems can be understood as concat(D).
Theorem 2 Suppose that $P(w\ltimes)$ is defined as in Equation 6 for a product machine $\bigotimes_{1 \leq j \leq K} M_j$ and a word w. Then, $\partial P(w\ltimes) / \partial T_j = 0$ holds for all j if and only if the following equation is satisfied for all 1 ≤ j ≤ K:
$$\mathrm{freq}_w(\sigma \mid M_j, q) = \mathrm{Coemit}_w(\sigma \mid M_j, q).$$
From Theorem 3, it will then follow that $T_1, \ldots, T_K$ are the MLE.
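Proof By taking the log of Eq. 6, we have
$$\log P(w\ltimes) = \sum_{i=1}^{N+1} \Big( \sum_{j=1}^{K} \log T_j(q(j,i), \sigma_i) - \log \sum_{\sigma' \in \Sigma} \prod_{j=1}^{K} T_j(q(j,i), \sigma') \Big).$$
We differentiate this by a log emission probability $\log T_h(q, \sigma)$ and write the derivative as A − B, where
$$A = \frac{\partial}{\partial \log T_h(q, \sigma)} \sum_{i=1}^{N+1} \sum_{j=1}^{K} \log T_j(q(j,i), \sigma_i), \qquad B = \frac{\partial}{\partial \log T_h(q, \sigma)} \sum_{i=1}^{N+1} \log \sum_{\sigma' \in \Sigma} \prod_{j=1}^{K} T_j(q(j,i), \sigma').$$
The derivative vanishes exactly when A = B.
First, we calculate A. Since
$$\frac{\partial \log T_j(q(j,i), \sigma_i)}{\partial \log T_h(q, \sigma)} = I[\, j = h \wedge q(j,i) = q \wedge \sigma_i = \sigma \,],$$
we have
$$A = \sum_{i=1}^{N+1} I[\, q(h,i) = q \wedge \sigma_i = \sigma \,] = m_w(M_h, q, \sigma), \quad (9)$$
where I[ · ] denotes the indicator function and $m_w(M_h, q, \sigma)$ is defined as in Definition 4.
Second, we calculate B (Figure 6: Initial calculation of B in the proof of Theorem 2):
$$B = \sum_{i=1}^{N+1} \frac{1}{\sum_{\sigma'' \in \Sigma} \prod_{j} T_j(q(j,i), \sigma'')} \sum_{\sigma' \in \Sigma} \frac{\partial \prod_{j} T_j(q(j,i), \sigma')}{\partial \log T_h(q, \sigma)} = \sum_{i=1}^{N+1} \sum_{\sigma' \in \Sigma} \frac{\prod_{j} T_j(q(j,i), \sigma')}{\sum_{\sigma'' \in \Sigma} \prod_{j} T_j(q(j,i), \sigma'')} \, I[\, q(h,i) = q \wedge \sigma' = \sigma \,].$$
In the last expression, the fraction is the co-emission probability Coemit(σ', i) by Equation 5, and the indicator function equals 1 exactly when q(h, i) = q and σ' = σ. We conclude that
$$B = \sum_{1 \leq i \leq N+1,\; q(h,i) = q} \mathrm{Coemit}(\sigma, i). \quad (10)$$
By plugging our calculations of A (Eq. 9) and B (Eq. 10) into A = B and dividing both sides by $n_w(M_h, q)$, we obtain the result $\mathrm{freq}_w(\sigma \mid M_h, q) = \mathrm{Coemit}_w(\sigma \mid M_h, q)$ from the definitions of the relative frequency of an emission (Eq. 7) and the empirical mean of a co-emission probability (Eq. 8). This concludes the proof of Theorem 2.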
Next we prove that maximizing P (w) is a convex optimization problem to ensure that the solution is the maximum point.
Following Boyd and Vandenberghe (2004), a set of points C in $\mathbb{R}^n$ is convex if the line segment between any two points in C also lies in C. Formally, C is convex provided for any $x_1, x_2 \in C$ and any t with 0 ≤ t ≤ 1, we have $t x_1 + (1 - t) x_2 \in C$. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if the domain of f is a convex set and if for all x, y in the domain of f, and t with 0 ≤ t ≤ 1, we have
$$f(t x + (1 - t) y) \leq t f(x) + (1 - t) f(y).$$
Recall from section 2.4 that the likelihood of a sequence of data D with respect to a stochastic language L belonging to a class with parameters Θ is $\mathrm{lhd}(D \mid \Theta) = \prod_{w \in D} P_L(w)$. The likelihood function is a function $f : \mathbb{R}^n \to \mathbb{R}$ where n is the number of parameters |Θ|.
Let $\tau_{j,q,\sigma}$ denote $\log T_j(q, \sigma)$; i.e. the log of some parameter in Θ. There are $n = |\Sigma| \sum_{j=1}^{K} |Q_j|$ parameters in Θ since σ ∈ Σ, 1 ≤ j ≤ K, and $q \in Q_j$. This τ can be thought of as a vector in $\mathbb{R}^n$.
The problem of maximizing $P(w\ltimes)$ is the same as minimizing $-\log P(w\ltimes)$ as a function of τ. We show that $\log P(w\ltimes)$ is concave with respect to $\log T_j(q, \sigma)$ (Theorem 3). It then follows that the solution shown in Theorem 2 is a global maximum.
Theorem 3 $\log P(w\ltimes)$ is concave with respect to $\tau \in \mathbb{R}^n$.
Generally speaking, a composition $f(x) = h(g_1(x), \cdots, g_k(x))$ obeys the following rule: f is convex if h is convex, h is non-decreasing in each argument, and each $g_i$ is convex (see vector composition in Boyd and Vandenberghe, 2004, section 3.2.4). Furthermore, it is known that $\log \sum \exp(\cdot)$ is convex (see section 3.1.5), and $\log \sum \exp(\cdot)$ is non-decreasing in each argument since both exp(·) and log(·) are non-decreasing. In addition, $g_a(\cdot)$ is both convex and concave since every linear function is so by definition (see section 3.1.1). Thus, $\log \sum_a \exp(g_a(\cdot))$ is convex, and $-\log \sum_a \exp(g_a(\cdot))$ is concave.
Finally, from the fact that a non-negative weighted sum preserves convexity and concavity (Boyd and Vandenberghe, 2004, section 3.2.1), $\log P(w\ltimes)$ is concave. It follows that the negative log of $P(w\ltimes)$ is convex.
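To make the composition argument concrete, the log probability can be written out in terms of τ, assuming the end-marker convention and the definition $\tau_{j,q,\sigma} = \log T_j(q, \sigma)$ used above. The indexed name $g_{i,\sigma'}$ is notation introduced here for illustration and plays the role of the linear functions $g_a$ in the argument.

```latex
\log P(w\ltimes)
  = \sum_{i=1}^{N+1} \Big( \underbrace{\sum_{j=1}^{K} \tau_{j,\,q(j,i),\,\sigma_i}}_{\text{linear in } \tau}
    \;-\; \underbrace{\log \sum_{\sigma' \in \Sigma} \exp\big(g_{i,\sigma'}(\tau)\big)}_{\text{convex in } \tau} \Big),
\qquad
g_{i,\sigma'}(\tau) = \sum_{j=1}^{K} \tau_{j,\,q(j,i),\,\sigma'} .
```

Each summand is a linear term minus a log-sum-exp of linear functions, hence concave, and the sum over positions i preserves concavity.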
It is worth pointing out that establishing concavity does not mean the solution is unique. In fact, the solutions can be a set of points. An example FRPD class illustrating this is one which contains two PDFAs $M_1$ and $M_2$ with the same structure. For example, suppose each had exactly one state with self-loop transitions for every symbol in Σ. The co-emission product $M_1 \otimes M_2$ does not uniquely factorize, though the above theorem establishes its convexity.
Of course it is also of interest to know when the solution is unique. In this case, we have to show the negative log probability is strictly convex except for multiplying the emission probability by a constant. We leave this as an area of future research.

Optimization and Time Complexity
In this section, we discuss the time complexity and also how to optimize. From the proof of Theorem 2, we have the following fact immediately.
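Corollary 1 The update equation for maximization of $\log P(w\ltimes)$ is represented as:
$$\log T_j(q, \sigma) \leftarrow \log T_j(q, \sigma) + \eta \, n_w(M_j, q) \big( \mathrm{freq}_w(\sigma \mid M_j, q) - \mathrm{Coemit}_w(\sigma \mid M_j, q) \big)$$
if the simplest gradient method is applied, where η is the step size. The time complexity for each update is O(NK|Σ|).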
The time complexities for $\mathrm{freq}_w(\sigma \mid M_j, q)$ and $\mathrm{Coemit}_w(\sigma \mid M_j, q)$ are given in Lemma 1 and Lemma 2.
The time complexity for $\mathrm{Coemit}_w(\sigma \mid M_j, q)$ is a little higher than that of $\mathrm{freq}_w(\sigma \mid M_j, q)$.
Lemma 1 For all $M_j$ and $q \in Q_j$, $\mathrm{freq}_w(\sigma \mid M_j, q)$ is computed in time O(NK).
Proof We trace all machines while they are emitting $\sigma_1, \cdots, \sigma_N$. Suppose that the machines are at q(1, i), · · · , q(K, i) after $\sigma_1, \cdots, \sigma_{i-1}$ are emitted sequentially. For each step i, for all machines $M_j$, we have to update the count for the pair of q(j, i) and $\sigma_i$, in order to calculate $m_w(M_j, q, \sigma)$. So the computational cost for each step i is O(K), and the total cost over all steps is O(NK).
Lemma 2 For all $M_j$ and $q \in Q_j$, $\mathrm{Coemit}_w(\sigma \mid M_j, q)$ is computed in time O(NK|Σ|).
Proof We trace all machines while they are emitting $\sigma_1, \cdots, \sigma_N$. Suppose that the machines are at q(1, i), · · · , q(K, i) after $\sigma_1, \cdots, \sigma_{i-1}$ are emitted sequentially. The critical part is calculating $\mathrm{sumCoemit}_w(\sigma, M_j, q)$. For each step i, we have to update emission probabilities for all pairs of $M_j$ and σ ∈ Σ. This update takes time O(K|Σ|). Thus, the time complexity for calculating $\mathrm{sumCoemit}_w(\sigma, M_j, q)$ is O(NK|Σ|).
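The gradient update and its per-pass cost can be sketched as follows for the 2-set of SD-PDFAs over {a, b}. This is an illustration under stated assumptions, not the authors' implementation: the parameters are stored as τ = log T in a dictionary keyed by (factor index, state, symbol), `"#"` stands in for the end marker, and the sample, step size, and number of steps are arbitrary choices.

```python
from collections import defaultdict
from math import exp, log

END = "#"  # assumed stand-in for the end marker


def sd_step(w, u, sym):
    """Transition of the w-SD-PDFA: extend the prefix if possible, else stay."""
    return u + sym if w.startswith(u + sym) else u


def log_likelihood_and_gradient(ws, tau, data, symbols):
    """One pass over the data, costing O(N K |Sigma|): returns log P and its
    gradient with respect to tau[(j, q, s)] = log T_j(q, s)."""
    grad = defaultdict(float)
    logp = 0.0
    for word in data:
        states = [""] * len(ws)  # every factor restarts at lambda
        for sym in list(word) + [END]:
            score = {s: sum(tau[(j, q, s)] for j, q in enumerate(states))
                     for s in symbols}
            z = sum(exp(v) for v in score.values())
            logp += score[sym] - log(z)
            for j, q in enumerate(states):
                grad[(j, q, sym)] += 1.0                  # emission count term
                for s in symbols:
                    grad[(j, q, s)] -= exp(score[s]) / z  # co-emission term
            states = [sd_step(w, q, sym) for w, q in zip(ws, states)]
    return logp, grad


def gradient_ascent(ws, data, symbols, eta=0.05, steps=300):
    """Plain gradient ascent on log P, starting from tau = 0 everywhere."""
    tau = defaultdict(float)
    history = []
    for _ in range(steps):
        logp, grad = log_likelihood_and_gradient(ws, tau, data, symbols)
        history.append(logp)
        for key, g in grad.items():
            tau[key] += eta * g
    return tau, history


symbols = ["a", "b", END]
tau, history = gradient_ascent(["", "a", "b"], ["abb", "bbb"], symbols)
print(history[0], history[-1])  # the log likelihood increases from 8 * log(1/3)
```

The gradient accumulated for each (j, q, σ) is the emission count minus the summed co-emission probability, so it vanishes exactly when the relative frequency equals the empirical mean of the co-emission probability.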

Conclusion
The negative log likelihood function associated with an FRPD class C is convex, and it is possible to efficiently find an MLE of any sequence of data generated i.i.d. with respect to C. Essentially, the parameters of the model are found by running the corpus through each of the individual factor PDFAs and calculating the relative frequencies. While this was the approach adopted by  for SP stochastic languages, we have generalized it to sets of finitely many PDFAs.
There are several directions for future research, both theoretical and applied. On the theoretical side, one clear avenue is to better understand these results in terms of probabilistic graphical models (Koller and Friedman, 2009). As a reviewer pointed out, the application of those methods to formal language theory and grammatical inference (de la Higuera, 2010) appears fruitful.
On the applied side, there are several different opportunities. One area of interest is language modeling. The results here permit a modular approach to constructing language models, where certain primitive factors are included or excluded. For example, we expect that language models which incorporate both n-gram models (Jurafsky and Martin, 2008) (which cannot describe long-distance dependencies) and SP stochastic languages (which can describe some kinds of long-distance dependencies) will have lower perplexity, a hypothesis under current investigation. More generally, researchers can use aspects of the subregular hierarchies of languages (Thomas, 1997) to identify a range of 'primitive factors' whose DFA models can form the basis of various FRPD classes. Finally, we are also interested in extending these results to weighted deterministic automata for computing regular relations (Beros and de la Higuera, 2016) or elements of other monoids (Gerdjikov, 2018).