Extracting Forbidden Factors from Regular Stringsets


We introduce algorithms that, given a Finite-State Automaton, compute a set of forbidden words, units, initial factors, free factors and final factors that define a Strictly Local (SL) approximation of the stringset recognized by the FSA, along with a minimal DFA that recognizes the residue set: the set of strings in the approximation that are not in the stringset recognized by the FSA. If the FSA recognizes an SL stringset, then the approximation is exact (otherwise it overgenerates).
We have applied these tools to the 106 lects that have associated DFAs in the StressTyp2 database, a wide-coverage corpus of stress patterns that are attested in human languages. The results include a large number of strictly local constraints that have not been included in prior work categorizing these patterns with respect to the Local and Piecewise Sub-Regular hierarchies of Rogers et al. (2012), although, of course, they do not contradict the central result of that work, which establishes an upper bound on their complexity that includes strictly local constraints.

Introduction
A stringset L is Strictly k-Local if and only if (iff) it is completely determined by its k-factors: the substrings of length at most k that occur in strings ⋊·w·⋉ for w ∈ L. (The '⋊' and '⋉' are endmarkers.) That is to say, L contains all and only the strings that are generated by the substring relation from that set of k-factors. The class of stringsets that are Strictly k-local for some k is known as SL. This is at the bottom of the local side of a collection of classes of stringsets, all strict subclasses of the class of Regular stringsets, which are hierarchically related and are characterized by finite sets of either substrings (the Local Hierarchy) or subsequences (the Piecewise Hierarchy) or by combinations of the two. In Rogers et al. (2012) we argue that these hierarchies provide a robust notion of cognitive complexity for constraints on strings.
The long-term project of our group is to characterize all of the stress patterns collected in Goedemans et al. (2015), a wide-coverage database of stress patterns occurring in human languages, with respect to this hierarchy. In Edlefsen et al. (2008), we established that roughly 75% of these patterns are $\mathrm{SL}_k$ for k ≤ 6 and that half are $\mathrm{SL}_k$ for k ≤ 3. Subsequently, we derived a set of "primitive" constraints sufficient to define all of the patterns by co-occurrence and classified them into abstract categories (Fero et al., 2014). Most of these constraints were, in fact, SL, and it turned out that all of the patterns could be defined by co-occurrence of constraints at the bottom two levels of the hierarchies. This is significant, since at these levels it is possible to determine whether a string satisfies a constraint solely on the basis of the information that is explicitly contained in the string, without inferring any additional structure. Recent work by Heinz and his co-workers (Heinz, forthcoming; Heinz, 2010; Chandlee, 2014; Jardine, 2016) suggests that much of phonology may be characterizable by correspondingly simple sets of structures or functions.
The work on primitive constraints, however, did not include any of the factors from the SL stringsets because the algorithm for determining if a given Finite State Automaton (FSA) recognizes an SL stringset, and determining k if it does, does not yield the set of k-factors that define the stringset. We resolve that problem in this work.
In Section 2 we introduce our notation and basic formal definitions. In Section 3 we formally define Strictly Local stringsets and discuss their formal properties. In Section 4 we distinguish five types of forbidden factors: factors in the complement of the set of factors that generate the stringset. In Section 5 we develop our algorithms for extracting those factors given a Finite State Automaton. In Section 7 we extend these algorithms in a way that allows them to be used to partition non-SL stringsets in a way that provides a set of SL constraints that approximates it (to varying degrees of closeness) and an automaton that captures the non-SL aspects of the stringset. We close with thoughts about where these results lead.

Formal Preliminaries
A finite state automaton (FSA) is an edge-labeled directed graph with distinguished vertices that we will represent by a five-tuple $\langle \Sigma, Q, \delta, I, F \rangle$ where $\Sigma$ is the alphabet of the language of the automaton, $Q$ is the set of states, $\delta \subseteq (\Sigma \times Q \times Q)$ is a transition relation where $\langle \sigma, q_1, q_2 \rangle \in \delta$ iff there is an edge labeled $\sigma$ from $q_1$ to $q_2$, $I$ is the set of initial states, and $F$ is the set of accepting states. Let $A = \langle \Sigma, Q, \delta, I, F \rangle$. Let $w = \sigma_1 \sigma_2 \cdots \sigma_{n-1} \in \Sigma^*$ be a string and let $q_1, q_n \in Q$. Then there is a path $q_1 \overset{w}{\leadsto} q_n$ iff there exists some sequence of edges $\langle \langle \sigma_i, q_i, q_{i+1} \rangle \in \delta \mid 0 < i < n \rangle$. This is an accepting path on $w$ if $q_n$ is in $F$, else it is a non-accepting path.
The automaton A is total iff for every symbol σ ∈ Σ and for every state q ∈ Q, there exists some q′ such that $\langle \sigma, q, q' \rangle \in \delta$. It is (partial) functional iff δ is functional in its first two places. That is, given a state q ∈ Q and a symbol σ ∈ Σ, there is at most one q′ ∈ Q such that $\langle \sigma, q, q' \rangle \in \delta$.
An FSA is (fully) deterministic (a proper DFA) iff it has exactly one initial state and it is both total and functional. We also consider trim functional automata to be deterministic, where A is trim iff for all states q ∈ Q there is some accepting path from q.
An automaton is minimal iff it is deterministic and no two states are Nerode-equivalent. Further, it is normalized iff it is both minimal and trim.
Given a string w, the factors of w are those v that are substrings of w (notation: $v \sqsubseteq w$). If k is the length of v (notation: |v| = k) then v is a k-factor of w.
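The powerset graph of the automaton A, $\mathrm{PSG}(A) = \langle V, E \rangle$, is another edge-labeled directed graph where:

$$V = \mathcal{P}(Q), \qquad E = \{ \langle \sigma, S_1, S_2 \rangle \mid \sigma \in \Sigma,\ S_1 \in V,\ S_2 = \{ q_2 \mid \langle \sigma, q_1, q_2 \rangle \in \delta \text{ for some } q_1 \in S_1 \} \}$$

Often we are interested only in the subgraph of this generated from a given set of initial states.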
Lemma 1 If A is deterministic, then the sizes of the sets along any path in PSG(A) are monotonically non-increasing.
This is because, if A is deterministic, δ maps each state in the source vertex $S_1$ of an edge to at most one state in its target vertex $S_2$.
Corollary 1 All sets in any cycle are equal in size.
Corollary 2 All in-edges to Q and all out-edges from ∅ are self-edges.
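To make the construction concrete, here is a minimal Python sketch of the subgraph of PSG(A) generated from a given set of states. It is an illustration under stated assumptions rather than the implementation described in this paper: the transition relation is assumed to be functional and is stored as a dictionary keyed by (state, symbol), and the names `powerset_graph`, `DELTA` and the two-state example automaton are hypothetical.

```python
def powerset_graph(delta, alphabet, start):
    """Edges of the subgraph of PSG(A) generated from the vertex `start`.

    `delta` maps (state, symbol) to the unique next state, i.e. it is a
    functional (possibly partial) transition relation. The result maps
    (source_vertex, symbol) to target_vertex, with vertices as frozensets.
    """
    start = frozenset(start)
    edges, todo, seen = {}, [start], {start}
    while todo:
        source = todo.pop()
        for symbol in alphabet:
            # Image of the source vertex under the symbol.
            target = frozenset(
                delta[(q, symbol)] for q in source if (q, symbol) in delta
            )
            edges[(source, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return edges


# Hypothetical example: trim DFA over {a, b} for "no two adjacent b's".
# State 0: last symbol was not b; state 1: last symbol was b.
DELTA = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0}
PSG = powerset_graph(DELTA, "ab", {0, 1})

# Lemma 1 on this example: vertex sizes never increase along an edge.
assert all(len(target) <= len(source) for (source, _), target in PSG.items())
```

In this example the path labeled bb leads from Q = {0, 1} through {1} to ∅, and both out-edges of ∅ are self-edges, as Corollary 2 requires.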

Strictly Local Stringsets
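L is Strictly k-Local ($L \in \mathrm{SL}_k$) iff it is completely characterized by its k-factors. Let Σ be the alphabet of L and define $F_k(\Sigma) = \{ v \in \Sigma^* \mid |v| = k \}$ and $F_{\le k}(\Sigma) = \bigcup_{1 \le i \le k} [F_i(\Sigma)]$. For any string $w \in \Sigma^*$, the k-factors of w are

$$F_k(w) = \{ v \in F_k(\Sigma) \mid w = u \cdot v \cdot x \text{ for some } u, x \in \Sigma^* \}.$$

Similarly for $F_{\le k}(w)$. This lifts to sets of strings in the obvious way.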
Let $G \subseteq F_{\le k}(\{\rtimes\} \cdot \Sigma^* \cdot \{\ltimes\})$ be the set of permitted factors in L. Then the stringset generated by G is

$$L(G) = \{ w \in \Sigma^* \mid F_{\le k}(\rtimes \cdot w \cdot \ltimes) \subseteq G \}.$$

Since Σ is assumed to be finite, $F_{\le k}(\Sigma)$ is also finite, and an $\mathrm{SL}_k$ language can equivalently be defined in terms of its forbidden factors: $\overline{G} = F_{\le k}(\Sigma) - G$. This is more natural in many applications, including many linguistic ones (as in "no pair of unstressed syllables occur adjacently").
A stringset is said to be SL if it is $\mathrm{SL}_k$ for some finite k.
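The definition by forbidden factors can be read directly as a membership test. The sketch below is a hedged illustration, not part of the paper's toolkit: it assumes each alphabet symbol is a single character, and the names `factors_up_to`, `licensed` and the example set `FORBIDDEN` are hypothetical.

```python
LEFT, RIGHT = "⋊", "⋉"  # endmarkers; assumed not to occur in the alphabet


def factors_up_to(word, k):
    """All substrings of `word` of length 1..k, i.e. the set F_{<=k}(word)."""
    return {
        word[i:i + n]
        for n in range(1, k + 1)
        for i in range(len(word) - n + 1)
    }


def licensed(word, forbidden, k):
    """True iff the end-marked word contains no forbidden factor of length <= k."""
    return factors_up_to(LEFT + word + RIGHT, k).isdisjoint(forbidden)


# Hypothetical SL_2 description: a free forbidden factor "bb"
# and an initial forbidden factor "⋊b".
FORBIDDEN = {"bb", LEFT + "b"}
```

With this description, `licensed("abab", FORBIDDEN, 2)` holds, while `"abba"` is rejected because it contains bb and `"ba"` is rejected because it begins with b.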
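The following proposition characterizes $\mathrm{SL}_k$.
Proposition 1 (Suffix Substitution Closure) (SSC) A stringset L is $\mathrm{SL}_k$ iff whenever there is a string x of length k − 1 and strings $u_1, v_1, u_2, v_2$ such that $u_1 \cdot x \cdot v_1 \in L$ and $u_2 \cdot x \cdot v_2 \in L$, then $u_1 \cdot x \cdot v_2 \in L$.
This is because if a symbol σ can follow x in some string of L(A) then x · σ is a permitted factor and σ can follow x in any string of L(A).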
One consequence of this is that if $L(A) \in \mathrm{SL}_k$ and A is deterministic, then for each length k − 1 string x, all states in the set $\{ q \in Q \mid q_0 \overset{u \cdot x}{\leadsto} q \text{ for some } u \in \Sigma^* \}$, where $q_0$ is the initial state, are Nerode-equivalent. If A is minimal as well, then all paths that end with the same (k − 1)-factor lead to the same state. The computations of the automaton synchronize after at most k − 1 steps. This is the basis of the algorithm used by Edlefsen et al. (2008) [footnote 2] to determine if a given A recognizes an SL stringset and, if it does, to find the parameter k.
Proposition 2 Suppose A is a normalized DFA. Then L(A) ∈ SL k iff every path from Q in PSG(A) that is of length k − 1 leads to a singleton vertex. If that is the case, then k is one plus the length of the longest path from Q to a singleton (that does not include other singletons). If there is no such longest path (i.e., there is an infinite path) then there is some cycle of non-singleton vertices, L(A) does not satisfy SSC for any k and it is not SL.
In practice, it is not necessary to build even just the subgraph of PSG(A) generated by Q. All that one needs for a counter-example to SSC is a single pair of strings in which SSC fails. So it suffices to just explore the subgraph of PSG(A) that is generated by doubleton subsets of Q. The size of this subgraph is only $\Theta(\mathrm{card}(Q)^2)$, in contrast to the subgraph generated by Q, which is $\Theta(2^{\mathrm{card}(Q)})$.
The following is an immediate consequence of this proposition.
Lemma 2 If A is a normalized DFA and $L(A) \in \mathrm{SL}_k$ then all cycles in PSG(A) are cycles of singletons.
[Footnote 2] The pair-graph algorithm was first published in Caron (2000).
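Proposition 2 suggests a direct, if exponential, test. The following Python sketch explores PSG(A) from Q and returns the parameter k, or `None` when it meets a cycle of non-singleton vertices. It is only an illustration under assumptions: the input is taken to be a normalized DFA given as a dictionary keyed by (state, symbol); vertices with at most one state (singletons and ∅) are both treated as end points of the search, which is our reading of the proposition for trim automata; and the name `sl_parameter` is hypothetical. It is not the pair-graph algorithm mentioned above.

```python
def sl_parameter(delta, alphabet, states):
    """Return k if the normalized DFA recognizes an SL_k stringset, else None."""

    def step(vertex, symbol):
        return frozenset(
            delta[(q, symbol)] for q in vertex if (q, symbol) in delta
        )

    depth = {}          # longest distance from a vertex to one with <= 1 state
    on_stack = set()    # vertices on the current depth-first search path

    def longest(vertex):
        if len(vertex) <= 1:
            return 0
        if vertex in on_stack:
            # A cycle of non-singleton vertices: SSC fails for every k.
            raise ValueError("non-singleton cycle")
        if vertex not in depth:
            on_stack.add(vertex)
            depth[vertex] = 1 + max(longest(step(vertex, a)) for a in alphabet)
            on_stack.discard(vertex)
        return depth[vertex]

    try:
        return 1 + longest(frozenset(states))
    except ValueError:
        return None


# Hypothetical examples (all trim and minimal):
NO_BB = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0}                 # no "bb"
NO_BBB = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0,
          (1, "b"): 2, (2, "a"): 0}                              # no "bbb"
EVEN_B = {(0, "b"): 1, (1, "b"): 0}                              # even number of b's
```

On these examples the sketch reports k = 2 and k = 3 for the first two automata and `None` for the third, whose only out-edge from Q is a self-edge on a non-singleton vertex.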

Classes of Forbidden Factors
Factors may or may not include either a left-end marker at the beginning or a right-end marker at the end. In the case that a factor contains neither, it can occur anywhere in a string (including, possibly, at the beginning or end) and we say that it is a free factor or, if forbidden, a free forbidden factor. If the length of a free forbidden factor is one, then it has a somewhat different status than free forbidden factors of greater length; it is, in essence, a restriction to the alphabet. We will refer to these as forbidden units. If the first symbol of a forbidden factor is '⋊', then it can only occur at the left end of the word; this is an initial forbidden factor. If the last symbol is '⋉', then it can only occur at the right end of the word; it is a final forbidden factor. Note that the length of the string that these anchored factors match is k − 1. An $\mathrm{SL}_k$ definition can restrict length k − 1 prefixes and suffixes but not, in general, length k prefixes and suffixes. Finally, if a factor contains both end-markers it is a forbidden word, where the word it forbids is actually of length k − 2.

Free Forbidden Factors
Suppose A is a DFA. A factor w is a free forbidden factor of L(A) iff there is no path in the transition graph of A from $q_0$ to an accepting state that includes w as a substring. If A is normalized, this will be the case iff there is no path at all that is labeled w from any state of A, as all such paths would necessarily lead to the sink state which has been trimmed. Thus, in PSG(A) the path from Q that is labeled w leads to ∅. Again, the converse holds.
So all labels of paths from Q to ∅ in PSG(A) are free forbidden factors of L(A); moreover, that set includes all free forbidden factors of L(A). Since in general PSG(A) may include cycles, and even in the case that L(A) is SL it may include cycles of singleton vertices, in general this set of paths will be infinite. (In fact, since PSG(A) invariably includes a trivial cycle on ∅ for each σ ∈ Σ, it will always be infinite.) The paths including trivial cycles on ∅ are labeled with strings in w · Σ*, where w is a free forbidden factor. We are interested in the set of paths that are minimal in the sense that the label of the path does not include the label of any other such path as a substring.
Note that, by Corollary 2, any such path that includes an in-edge to Q or an out-edge from ∅ includes another path from Q to ∅ that is strictly shorter. Thus none of those paths are minimal free forbidden factors. Note, also, that if L(A) ∈ SL, then there are no cycles on Q, although there will always be trivial cycles on ∅ for each σ ∈ Σ.
The next two lemmas establish that if L(A) is SL then there is some bound such that all cyclic paths from Q to ∅ in PSG(A) with length greater than that bound will be labeled with a string that includes, as a suffix, the label of an acyclic path from Q to ∅. Thus the set of minimal free forbidden factors of L(A) is just the set of labels from paths from Q to ∅ in PSG(A) that do not include the label of any other such path as a suffix and that do not include self-edges on ∅. This allows us to collect forbidden factors with a breadth-first bottom-up traversal of PSG(A).

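Lemma 3 If v and w label acyclic paths from Q to ∅ in PSG(A) and v is a factor of w, then $w = uv$ for some $u \in \Sigma^*$.
Proof: Since v is a factor of w, $w = u \cdot v \cdot x$ for some $u, x \in \Sigma^*$. The path labeled u from Q ends at some subset of Q and v labels a path from Q to ∅, so the path labeled $u \cdot v$ from Q also ends at ∅. Hence x is either ε or the path it labels is a self-loop on ∅, contradicting the assumption of acyclicity. ⊣
Lemma 4 If a path from Q to ∅ in PSG(A), with L(A) ∈ SL, includes a cycle other than a trivial cycle on Q or ∅, then there is a finite bound on the number of times the cycle can be taken before the label of the path includes the label of an acyclic path from Q to ∅ as a suffix.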
Proof: Since L(A) is SL, any cycle must be a cycle of singletons. Suppose, then, that there is a path:

$$Q \overset{u}{\leadsto} \{q_0\} \overset{v}{\leadsto} \{q_1\} \overset{w}{\leadsto} \{q_0\} \overset{x}{\leadsto} \emptyset$$

where, possibly, v may be a prefix of x. Since $q_0, q_1 \in Q$ there must be a path:

$$Q = S_0 \overset{v}{\leadsto} S_1 \overset{w}{\leadsto} S_2 \overset{v}{\leadsto} S_3 \overset{w}{\leadsto} \cdots$$

where $q_0 \in S_{2i}$ and $q_1 \in S_{2i+1}$ for $i \ge 0$. Since there are no cycles of non-singletons, by Lemma 1 the sequence of $S_i$s must ultimately be decreasing in size. Thus, for some n it resolves to:

$$Q \overset{(vw)^n}{\leadsto} \{q_0\} \overset{x}{\leadsto} \emptyset$$

So $(vw)^n x$ labels a path from Q to ∅ and will be a suffix of all paths Q to ∅ that take the $\{q_0\} \leadsto \{q_1\}$ cycle at least 2n times. ⊣
Theorem 1 If L(A) ∈ SL then a string w is a free forbidden factor of L(A) iff it labels a path in PSG(A) from Q to ∅. It is minimal if that path does not include any cycles other than cycles of singletons and w does not include the label of any other such path as a suffix.
Note that if L(A) ∈ SL then the only cycles of non-singletons will be trivial cycles on ∅. Labels of paths including these will include some free forbidden factor as a prefix and are, thus, not minimal. Paths including cycles of singletons are necessary since none of the paths labeled $u(vw)^i x$ as in the proof of Lemma 4 is labeled with a factor of any of the others; they are minimal with respect to each other. It is only the label of the acyclic path that subsumes the labels of further iterations.

Final Forbidden Factors
Suppose A is a DFA. A factor w is a final forbidden factor of L(A) iff there is no path from $q_0$ to an accepting state in the transition graph of A that includes w as a suffix but there is some path from $q_0$ to an accepting state that includes w as a proper substring. (If there is no such accepting path, then w is a free forbidden factor.) If A is normalized then w is a final forbidden factor iff all paths labeled w from any state in Q end at a non-accepting state and there is some such path. This will be the case iff the path from Q in PSG(A) labeled w ends at a non-empty vertex that is disjoint from F. This is because if v is a free forbidden factor of L(A) then the path labeled v from Q in PSG(A) leads to ∅ and, hence, the path labeled v from any vertex of PSG(A) leads to ∅ as well.
Note that a final forbidden factor may include another as a suffix. (It is irrelevant whether it includes a final forbidden factor as a non-suffix, since final forbidden factors are, by definition, only relevant as suffixes.)
Theorem 2 If a path from Q to a non-empty vertex disjoint from F in PSG(A), with L(A) ∈ SL, includes a cycle other than a trivial cycle on Q, then there is a finite bound on the number of times the cycle can be taken before the label of the path includes the label of an acyclic path from Q to a non-empty vertex disjoint from F as a suffix.
The proof is essentially the same as the proof of Lemma 4.
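The PSG characterization of final forbidden factors can be checked for a single candidate string without building the graph. The Python sketch below follows the path labeled w from Q and tests whether it ends at a non-empty vertex disjoint from F. It assumes a normalized DFA stored as a dictionary keyed by (state, symbol); the function names and the example automaton are hypothetical.

```python
def run_from_all(delta, states, word):
    """Vertex of PSG(A) reached from Q = `states` along the path labeled `word`."""
    current = frozenset(states)
    for symbol in word:
        current = frozenset(
            delta[(q, symbol)] for q in current if (q, symbol) in delta
        )
    return current


def is_final_forbidden(delta, states, accepting, word):
    """True iff the path from Q labeled `word` ends at a non-empty vertex
    that is disjoint from the accepting states."""
    end = run_from_all(delta, states, word)
    return bool(end) and end.isdisjoint(accepting)


# Hypothetical example: "no two adjacent b's, and the string may not end in b".
DELTA = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0}
STATES, ACCEPTING = {0, 1}, {0}
```

Here b is a final forbidden factor, a is not, and bb is not either: its path reaches ∅, so it is a free forbidden factor instead.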

Initial Forbidden Factors
Suppose A is a DFA. A string w is an initial forbidden factor of L(A) iff $w = v^R$ (v reversed) for some v that is a final forbidden factor of $L(A^R)$, where $A^R$ is the DFA that recognizes the reversal of L(A).

Forbidden Words
Suppose A is a DFA and $L(A) \in \mathrm{SL}_k$. Then w is a forbidden word of L(A) iff it labels a path of length less than or equal to k − 2 (allowing for the two end-markers) that leads from $q_0$ to a state in Q − F.

Algorithms
Theorem 1 guarantees that if we do a breadth-first bottom-up traversal of PSG(A) then we will discover each minimal forbidden factor before we discover any path whose label includes it as a proper suffix. Expanding the frontier of the search in discrete stages, every (reverse) path from ∅ to Q found in the k-th stage will be a minimal forbidden k-factor.
There may be more than one such path so we do need to avoid gathering more than one instance of the factor. In general, there will be open paths (not reaching Q) that are labeled with the same factor. Extended to Q, they would include the factor as a proper suffix. So we exclude these from the frontier for the next stage.
We structure the bottom-up traversal of PSG(A) as a top-down traversal of $\mathrm{PSG}^R(A)$, in which each of the edges of PSG(A) is reversed. For convenience (and convergence) we trim self-edges on ∅ and Q while reversing the graph. Since we are traversing bottom-up, we actually find $w^R$ of each factor w, but we gather these in a list structure, inserting at the head, which reverses the factor again as we construct it.
For the purposes of the algorithm, a Path in an edge-labeled graph $\langle V, E \rangle$, as a computational structure, is a 3-Tuple $\langle v, S, w \rangle$, where v ∈ V is the final vertex of the path, S ⊆ V is the (unordered) set of vertices along the path and w ∈ Σ* is the sequence of labels of the edges in the path, in reverse order. A Frontier is a set of paths. Forbidden factors are gathered in stages, with Stage i expanding $\mathrm{Frontier}_{i-1}$ to $\mathrm{Frontier}_i$, gathering the set $\mathrm{FF}_i$ of all minimal forbidden i-Factors in the process.
The initial frontier $\mathrm{Frontier}_0$ for finding free forbidden factors includes just the trivial (0-length) path from ∅. For finding final forbidden factors $\mathrm{Frontier}_0$ includes the trivial path from each vertex that is a subset of Q − F.
Theorem 1 guarantees that, if we eliminate paths labeled with a forbidden i-Factor from $\mathrm{Frontier}_i$, the search will converge after finitely many iterations, k, with $\mathrm{Frontier}_k$ empty. (Note it is an empty set of Paths, not a set including a path ending at ∅.) The set of minimal free forbidden factors will be the union of the sets of factors gathered at stages 2 through k, where $L(A) \in \mathrm{SL}_k$. (Forbidden 1-factors are not included, since they are forbidden units.) The search for final forbidden factors will terminate after k − 1 iterations, with the minimal k-final forbidden factors including the right-end marker.
Pseudo-code for the algorithms is given in Figures 1 and 2.
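As a complement to the pseudo-code, the staged traversal for free forbidden factors can be sketched in Python as follows. This is a simplified illustration, not the implementation reported here: it omits the vertex set S carried by each Path (and with it the Extensions function), relying instead on an assumed depth bound `max_len` to guarantee termination on non-SL input; it assumes single-character symbols and a normalized DFA stored as a dictionary keyed by (state, symbol); and the name `free_forbidden_factors` is hypothetical.

```python
def free_forbidden_factors(delta, alphabet, states, max_len):
    """Staged reverse traversal of PSG(A) from the empty vertex back to Q.

    Returns {i: set of labels found at stage i}. Stage 1 holds the
    forbidden 1-factors (forbidden units).
    """
    Q, empty = frozenset(states), frozenset()

    # Build reversed edges of the PSG subgraph generated by Q,
    # trimming self-edges on Q and on the empty vertex.
    reverse, todo, seen = {}, [Q], {Q}
    while todo:
        source = todo.pop()
        for a in alphabet:
            target = frozenset(
                delta[(q, a)] for q in source if (q, a) in delta
            )
            if not (source == target and source in (Q, empty)):
                reverse.setdefault(target, []).append((a, source))
            if target not in seen:
                seen.add(target)
                todo.append(target)

    found = {}
    frontier = {(empty, "")}              # Frontier_0: trivial path from the empty vertex
    for stage in range(1, max_len + 1):
        next_frontier, factors = set(), set()
        for vertex, label in frontier:
            for a, previous in reverse.get(vertex, []):
                extended = a + label      # prepend: the path grows backward
                if extended in factors:
                    continue              # already found to be a forbidden factor
                if previous == Q:
                    # Goal reached: drop open paths carrying the same label.
                    next_frontier = {
                        p for p in next_frontier if p[1] != extended
                    }
                    factors.add(extended)
                else:
                    next_frontier.add((previous, extended))
        if factors:
            found[stage] = factors
        frontier = next_frontier
        if not frontier:
            break
    return found


# Hypothetical example: trim DFA for "no two adjacent b's".
NO_BB = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0}
```

For the example automaton over the alphabet {a, b} the only factor gathered is bb, at stage 2. Run over the larger alphabet {a, b, c}, the symbol c, which labels no edge, is gathered at stage 1.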

Forbidden Words for SL Stringsets
If $L(A) \in \mathrm{SL}_k$ and A is deterministic, then the words it forbids are just the labels of paths of length k − 2 (to allow for the endmarkers) from the (single) initial state to a state in Q − F. These can be gathered by doing a bounded traversal of A.
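A bounded traversal of this kind might look like the following Python sketch. The bound `max_len` is an assumed parameter (for an $\mathrm{SL}_k$ stringset one would pass k − 2), paths up to and including that length are collected, and the dictionary encoding of the automaton and the name `forbidden_words` are hypothetical.

```python
def forbidden_words(delta, alphabet, start, accepting, max_len):
    """Labels of paths of length <= max_len from `start` to a non-accepting state."""
    found = set()
    layer = [(start, "")]                  # paths of the current length
    for _ in range(max_len + 1):
        next_layer = []
        for state, word in layer:
            if state not in accepting:
                found.add(word)
            for symbol in alphabet:
                if (state, symbol) in delta:
                    next_layer.append((delta[(state, symbol)], word + symbol))
        layer = next_layer
    return found


# Hypothetical example: the stringset Σ+ over {a, b}, which is SL_2.
# State 0 (initial, non-accepting) has read nothing; state 1 is accepting.
SIGMA_PLUS = {(0, "a"): 1, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}
```

With k = 2 the bound is 0 and the only forbidden word is the empty string, corresponding to the factor ⋊⋉.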

Forbidden Units
If A is normalized (minimal and trim), the forbidden units of L(A) are just the symbols of Σ that do not label any edge in δ. In PSG(A) these will label edges Q to ∅ and will be gathered in Stage 1 while gathering free forbidden factors. But these may not be the only forbidden units of interest. In many applications there will be an alphabet that includes all symbols that occur in any of a collection of stringsets and the subset of that alphabet that is not included in the alphabet of the FSA will also be significant. This is the case in most linguistic applications, for example (as in "this lect forbids unstressed heavy syllables").
In those applications we need to include the difference between some default alphabet and the set of symbols that label edges in A. Since we are building PSG(A) anyway, the simplest way of doing this is to just take the difference between the default alphabet and the labels of the out-edges from Q. If we union that with the labels of the subset of those edges that lead to ∅ we get the free forbidden 1-factors as well. We can avoid gathering the latter in both the set of free forbidden factors and the set of forbidden units by not including the forbidden factors gathered in Stage 1. (Or, in order to simplify the code, by removing them from the set of free forbidden factors.)
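For a normalized automaton the computation reduces to a set difference, as in this small Python sketch. It works from the transition relation directly rather than from PSG(A), which is a simplification on our part; the name `forbidden_units` and the example alphabet are hypothetical.

```python
def forbidden_units(delta, default_alphabet):
    """Symbols of the default alphabet that label no edge of a trim automaton.

    `delta` maps (state, symbol) to the next state.
    """
    used = {symbol for (_state, symbol) in delta}
    return set(default_alphabet) - used


# Hypothetical example: the automaton uses only a and b,
# but the default alphabet shared across the collection also has c.
NO_BB = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0}
```

Here `forbidden_units(NO_BB, "abc")` yields the single unit c.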

Forbidden Factors of non-SL Stringsets
Every non-SL stringset can be fully defined by the conjunction of a set of SL constraints (possibly trivial: Σ* and ∅ are $\mathrm{SL}_1$ and Σ+ is $\mathrm{SL}_2$) along with a set of properly non-SL constraints. In applications that are exploring constraints across a collection of stringsets, most linguistic applications for instance, these SL constraints are significant. We would like to be able to factor the constraints so that the non-SL constraints capture, to the extent possible, just the non-strictly-local aspects of the patterns. The problem isn't finding factors that characterize the stringset; the problem is that there are too many of them. Σ* − L(A), augmented with left and right endmarkers, is a set of forbidden factors that characterizes L(A) exactly. It is, of course, in general infinite and necessarily so if L(A) is not SL.
The algorithms for SL stringsets are still partially correct for non-SL stringsets. The problem is that if L(A) is non-SL then there will be non-singleton cycles (in addition to those on ∅) and the traversal will not terminate.
These non-singleton cycles actually localize the reason that the stringset is not SL. They capture circumstances under which the automaton never synchronizes; they identify places in which SSC (Proposition 1) fails for L(A).
As with the set of forbidden words, the set of labels of the paths in PSG(A) that include non-singleton cycles are all legitimate forbidden factors of L(A), but again there are infinitely many of them. The stringset they define is what we would like to isolate as the non-SL fragment of L(A).
It is tempting to try modifying the traversal so it follows only singleton cycles. But, unfortunately, if there are non-singleton cycles the chain of the proof of Lemma 4 may be infinite, so there is no guarantee of termination even when following only singleton cycles.
Another approach would be to modify A, working backward from PSG(A), in a way that would eliminate the non-singleton cycles. We have not really pursued this idea, but our sense is that it is likely to fail for the same reason as simply not following non-singleton cycles fails.
In any case, we are looking for a set of forbidden factors that approximates L(A). Since none of our algorithms introduces constraints that are not manifest in the automaton, the approximation will overgenerate. The issue is how close we need it to be.

SL Approximations
First of all, as we noted above, Σ* is an SL approximation of every stringset over Σ. But it's a particularly licentious one. Another possibility is to only gather the forbidden factors that label non-cyclic paths in PSG(A). This will miss many forbidden factors that may well be significant: all those factors labeling paths with singleton cycles that would have eventually been subsumed if there were no non-singleton cycles. On the other hand, it gives the smallest set of forbidden factors that comprise a reasonable approximation of L(A).
Another way of bounding the traversal is to note that no acyclic path from Q to ∅ in PSG(A) can be of length $2^{\mathrm{card}(Q)} - 2$ or longer. But the set of factors gathered by a traversal with this bound, although arguably the largest justifiable set of forbidden factors, is almost certainly unreasonably large.
SL approximations that are too large are misleading both in terms of the apparent complexity of the SL aspects of the constraints and in terms of their non-SL aspects, which will appear to need to include many exceptions in order to account for the strings excluded by the SL approximation. When the SL approximation overestimates, the non-SL residue undergeneralizes.
In some applications, there may be a theoretically justified bound on how long the relevant factors are, that is, on how many times a cycle should be followed in the traversal. As we noted in the introduction, all of the SL stress patterns in StressTyp2 are $\mathrm{SL}_k$ for k ≤ 6. Thus one may well be justified in limiting the SL fragment to factors of length no more than six. Even assuming the bound is well-justified, this is still likely to generate too close an approximation. Forbidden factors that should properly be captured by the non-SL constraints, that involve non-singleton cycles that are not needed to terminate the traversal of singleton cycles, will be included. If the goal is to explore the nature of the constraints across a collection of stringsets these will likely be misleading, particularly since half of the patterns in StressTyp2 are $\mathrm{SL}_3$ (or less; SL is an inclusive hierarchy in k).
It is straightforward to modify the algorithms given above for either of these approaches. Cycles can be completely excluded by modifying the definition of Extensions. Limits on the size of the factor are just depth limits on the traversal. It is also straightforward to combine these, only following singleton cycles and only doing it up to a depth limit. To bound the search for forbidden words we first compute the sets of forbidden initial, free and final factors and then bound the depth to max(|frFF| − 2, |inFF| − 1, |fiFF| − 1), where |frFF|, |inFF| and |fiFF| are the maximum widths of the free, initial and final factors, respectively.
As our goal in developing these algorithms is to provide tools that phonologists can use productively in exploring systems of phonotactic constraints, the third approach to bounding the traversal seems most useful, although we have currently only implemented the acyclic path approach.

Residue Automata
When the algorithms are run on automata that recognize non-SL stringsets the result is a set of forbidden factors for the approximated stringset. We are just as interested in the characteristics of the stringset that these forbidden factors miss. Most work on approximating stringsets with stringsets in a weaker complexity class has focused on approximating CFLs with regular stringsets (Nederhof (2000) includes a good survey) or Tree-Adjoining Stringsets (TALs) with CFLs (Schabes and Waters, 1993;Rogers, 1994). Whenever the class of stringsets that is being approximated includes CFLs the (symmetric) difference between the approximation and the target will not be a decidable set. Consequently, there is little that can be determined about that difference.
We have the advantage that all of our stringsets are regular and so the difference is not only decidable but an automaton recognizing it is effectively constructible. Moreover, in this case, we know that every string excluded by the approximation is necessarily excluded by the target. The approximation never undergenerates. To isolate the non-SL characteristics of the target we construct an automaton that recognizes exactly the set of strings that are overgenerated by the SL approximation.
Using well-known algorithms for combining automata, it is straightforward to construct an automaton $A_{\mathrm{FF}}$ that recognizes the set of strings licensed by the set of forbidden factors. One starts with deterministic automata that recognize each of the given factors, complements them and then builds the automaton that recognizes the intersection of those complements. It is then straightforward to construct $A_{\mathrm{res}}$, the residue automaton, which recognizes exactly $L(A_{\mathrm{FF}}) - L(A)$. This residue automaton captures exactly the non-SL aspects of L(A), up to the degree to which the forbidden factors approximate the SL aspects of L(A).
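The residue set itself is easy to inspect on small examples. The Python sketch below does not build the residue automaton; as a deliberately simpler stand-in, it enumerates by brute force the strings up to an assumed length bound that are licensed by the forbidden factors but rejected by A. Symbols are assumed to be single characters, and all names and the example automaton are hypothetical.

```python
from itertools import product

LEFT, RIGHT = "⋊", "⋉"  # endmarkers; assumed not to occur in the alphabet


def accepts(delta, start, accepting, word):
    """Run a (possibly partial) DFA on `word`."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in accepting


def licensed(word, forbidden):
    """True iff no forbidden factor occurs in the end-marked word."""
    marked = LEFT + word + RIGHT
    return not any(factor in marked for factor in forbidden)


def residue(delta, alphabet, start, accepting, forbidden, max_len):
    """Strings of length <= max_len in the SL approximation but not in L(A)."""
    result = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            word = "".join(symbols)
            if licensed(word, forbidden) and not accepts(
                delta, start, accepting, word
            ):
                result.append(word)
    return result


# Hypothetical non-SL example: "at least one b" over {a, b}.
# Its only forbidden factor here is the forbidden word ⋊⋉ (the empty string).
SOME_B = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}
```

For this example the residue up to length 2 consists of a and aa, the non-empty strings that the SL approximation admits even though they contain no b.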

Results and Prospectus
We have designed and implemented algorithms that, given a Finite-State Automaton, compute a set of forbidden words, units, initial factors, free factors and final factors that define an SL approximation of the stringset recognized by the FSA, along with a minimal DFA that recognizes the residue set: the set of strings in the approximation that are not in the stringset recognized by the FSA. If the FSA recognizes a stringset that is SL, then the approximation is exact.
As we explain in Section 7.1, the closeness of the approximation is a parameter that may be varied depending on the application. As we have implemented it, we obtain the smallest set of factors that is arguably a reasonable approximation.
We have also implemented an algorithm that collects the union of the forbidden factors of each type from a collection of these results, although we don't present it here, the algorithm being obvious.
We have applied these tools to the 106 lects that have associated DFAs in the StressTyp2 database. For the individual lects the maximum number of forbidden words is 20. Since the size of our default alphabet is 15 (five degrees of weight and three degrees of stress) and some lects have only one weight and two levels of stress, the maximum number of forbidden units is 13. The maximum number of forbidden initial factors is 15. The maximum number of forbidden free and final factors is 386 and 117, respectively, but these are all due to Pirahã, an outlier. Without Pirahã they are 185 and 32, respectively.
For the union factor types, there are 14 distinct forbidden units (only unstressed light syllables occur in every lect), 44 distinct forbidden words, 35 distinct forbidden initial factors, 904 distinct forbidden free factors and 230 distinct forbidden final factors. The maximum width of forbidden words, initial factors and free factors is 5. The maximum width of forbidden final factors is 6, due to a single lect (Içuã Tupi) which is also the only example of a properly $\mathrm{SL}_6$ stringset, the other SL patterns all being $\mathrm{SL}_4$ or less.
That is still a lot of factors, too many to draw much insight from. But these are all in ground form, with each syllable type represented by a distinct alphabet symbol. In future work we plan to adapt the alphabet type to be tuples of features or perhaps non-re-entrant feature structures (adding full feature structures we will leave for others), which will provide opportunities to generalize across those features. We know, just from the phonology, that this will reduce the total number of exemplars significantly.
The algorithms we have presented here are asymptotically exponential-time in the size of the automaton, but that is actually optimal for algorithms that construct sets of ground factors: the worst case size of the set of factors of the stringset of an automaton with card(Q) states is $\Omega(\mathrm{card}(\Sigma)^{\mathrm{card}(Q)})$. Nevertheless these algorithms are actually quite effective in practice. We have incorporated them into a Haskell workbench for manipulating automata with a particular focus on logical descriptions of sub-regular constraints. With only minimal optimization the algorithm computes the forbidden factors and the residue automaton for all 106 lects in our corpus in less than an hour, which is practical as it stands, but can be improved significantly. The asymptotic bound is due to the potential size of the powerset graph as well as the potential size of the set of factors. These are not, however, the dominant factor in the practical performance. Rather it is the time it takes to generate a minimal DFA from the forbidden factors. This is an easy target for optimization; the intersection step, a critical path in the construction, can be done in time logarithmic in the number of factors, for example. There are many other easy opportunities for optimization and Haskell provides a particularly powerful platform for implementing them.
STAGE i
Given: Front_{i−1}, the frontier of the search, a set of Paths ⟨v, S, w⟩
Where: v ∈ V is the final vertex,
       S ⊆ V is the (unordered) set of vertices in the path,
       w ∈ Σ* is the sequence of labels of the edges in the path, in reverse order
       Goals ⊆ V is the set of goal vertices
       Extensions is a function taking a Path to its qualified extensions
Construct: Front_i, FF_i

ForEach Path ∈ Front_{i−1}
  ForEach ⟨v′, S ∪ {v}, σ · w⟩ ∈ Extensions(Path)
    if σ · w ∉ FF_i                                        ▷ σ · w has not already been found to be an i-FF
    then if v′ ∈ Goals                                     ▷ σ · w is an i-FF
         then Front_i ← Front_i − {⟨−, −, σ · w⟩ ∈ Front_i}    ▷ Remove any paths labeled with this factor from Front_i
              FF_i ← FF_i ∪ {σ · w}                        ▷ Add σ · w to FF_i
         else Front_i ← Front_i ∪ {⟨v′, S ∪ {v}, σ · w⟩}   ▷ Add extension to Front_i
End of STAGE i.