Efficient learning of Output Tier-based Strictly 2-Local functions

This paper characterizes the Output Tier-based Strictly k-Local (OTSL_k) class of string-to-string functions, which are relevant for modeling long-distance phonological processes as input-output maps. After showing that any OTSL_k function can be learned when k and the tier are given, we present a new algorithm that induces the tier itself when k = 2 and provably learns any total OTSL_2 function in polynomial time and data, the first such learner for any class of tier-based functions.


Introduction
In this paper, we investigate the class of Output Tier-based Strictly k-Local (OTSL k ) functions. In terms of finite state transducers, OTSL k functions are those for which the output symbol(s) to be written at each timestep depends on the k − 1 most recent symbols on the output tape that belong to the relevant 'tier' (a subset of the output alphabet; Heinz et al., 2011), without regard for any nontier symbols that might have been written between them or after them. We show that they are learnable when the contents of the tier are provided as input to the learner, and introduce an algorithm that provably and efficiently learns any total OTSL function when k = 2.
Recent research investigating the computational properties of phonological patterns observed in natural language has shown that many attested processes can be characterized as Strictly Local (SL) functions (Chandlee, 2014; Chandlee et al., 2015). That is, the output at any given timestep is dependent on the previous k − 1 symbols from either the input string (Input Strictly k-Local; ISL_k) or the output string (Output Strictly k-Local; OSL_k). Multiple characterizations of these classes exist and their properties are well-understood. One important distinction between the two is that non-iterative processes are ISL, whereas processes that apply iteratively to multiple targets are OSL. Moreover, efficient learning algorithms exist for both the ISL and OSL functions. The OSL Function Inference Algorithm (OSLFIA; Chandlee et al., 2015) is of particular importance to this paper, as we will show that many of their theoretical results can be generalized to OTSL functions in a natural way.
Long-distance phonological processes, for which a potentially unbounded number of segments may intervene between the trigger and target without being affected in any way, are neither ISL nor OSL for any value of k. For example, Samala has a long-distance process of sibilant harmony in which an underlying /s/ surfaces as [S] if another [S] appears anywhere later in the word. This is seen when, e.g., the perfective suffix /-waS/ is added to a root containing /s/, as in /has-xintila-waS/→[haSxintilawaS] 'his former gentile name' (Applegate, 1972). This process can be understood as applying iteratively to multiple targets, as in /s-lu-sisin-waS/→[SluSiSinwaS] 'It is all grown awry'. Indeed, it seems that the vast majority of attested long-distance processes are enforced iteratively (Kaplan, 2008;Hansson, 2010). As such, we focus this paper on OTSL k functions in particular, which generalize the OSL k class in a way that allows us to model these kinds of long-distance processes. We note that ITSL k functions can be characterized in a similar way and that the learning strategy outlined below could likely be extended to total ITSL 2 functions.
While the notion of a tier has long been incorporated into phonological theory (e.g., Clements, 1980; Goldsmith, 1990; Odden, 1994; Heinz et al., 2011; McMullin, 2016), the range of possible tiers is typically assumed to be available to the learner a priori. Each possible tier could, for example, be defined in terms of feature specifications or natural classes of segments (e.g., Hayes and Wilson, 2008). Though algorithms have been developed for inducing a relevant tier from a sample of positive training data, their success is limited to phonotactic co-occurrence restrictions. This is true both for constraint-based maximum entropy learners (Gouskova and Gallagher, 2019) as well as for algorithms that learn grammars for Tier-based Strictly Local formal languages (Jardine and Heinz, 2016; Jardine and McMullin, 2017). To our knowledge, the algorithm presented below, which we call the Output Tier-based Strictly 2-Local Function Inference Algorithm (OTSL2FIA), is the first algorithm that learns the relevant tier for transformations of underlying representations (strings of input segments) to surface forms (strings of output segments).
The remainder of this paper is organized as follows. Notation and relevant concepts are presented in Section 2. In Section 3, we define the OTSL functions and characterize them in terms of finite state transducers. In Section 4, we highlight several important properties of OTSL 2 functions in particular that can be taken advantage of during learning. All aspects of the learning algorithm, along with the theoretical learning results, are described in Section 5. Section 6 discusses how OTSL 2 functions can model various phonological processes and identifies several avenues for future research. Section 7 concludes.

Strings and sets
Given a set S, we write card(S) to denote its cardinality. For a string w made of symbols from some alphabet Σ, |w| denotes the length of the string. We write Σ* to denote all possible strings made from the alphabet Σ, while Σ^n denotes all possible strings made from that alphabet with a length of n, and Σ^{≤n} denotes all such strings with a length up to n. The unique string of length 0 (the empty string) is written as λ. Given two strings u and v, we write u · v to denote their concatenation, but often shorten this to uv when context permits. We write fac_k(w) to denote all the contiguous substrings of length k (the k-factors) contained in a string w.
We assume a fixed but arbitrary total order ≺ over the letters of Σ, an order which we extend to all strings in Σ* by defining the length-lexicographical order (Oncina et al., 1993; Chandlee et al., 2015) as follows. String w_1 occurs length-lexicographically before w_2 (written as w_1 ⊲ w_2) when |w_1| < |w_2| or, if |w_1| = |w_2|, when a_i ≺ b_i, where a_i is the i-th letter in w_1, b_i is the i-th letter in w_2, and i is the first position on which w_1 and w_2 differ. For example, given Σ = {a, b} where a ≺ b, we have λ ⊲ a ⊲ b ⊲ aa ⊲ ab ⊲ ba ⊲ bb ⊲ aaa and so on.
A prefix of some string w is any string u such that w = ux and x ∈ Σ*. Similarly, a suffix of some string w is any string u such that w = xu and x ∈ Σ*. Note that any string is a prefix and suffix of itself, and that λ is a prefix and suffix of every string. When |w| ≥ n, Pref^n(w) and Suff^n(w) denote the unique prefix and suffix of w with a length of n; when |w| < n, they simply denote w itself. We write Pref*(w) to denote the set of all prefixes of w. Also, Suff^n(Suff^n(w_1) · w_2) = Suff^n(w_1 · w_2). Given a string w, one of its prefixes p, and one of its suffixes s, we write p^{-1} · w to represent the string w without that prefix p and write w · s^{-1} to represent the string w without that suffix s. For example, a^{-1} · aba = ba and aba · a^{-1} = ab. Finally, given a set of strings S, we write lcp(S) to denote the longest common prefix of S, which is the string u such that u is a prefix of every w ∈ S, and there exists no other string v such that |v| > |u| and v is also a prefix of every w ∈ S.
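To make the notation concrete, the following Python sketch (ours, not part of the original paper) implements the basic string operations defined above: k-factors, a length-lexicographic sort key, prefixes, and the longest common prefix.

```python
def factors_k(w: str, k: int) -> set:
    """fac_k(w): all contiguous substrings of w of length exactly k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def length_lex_key(w: str):
    """Sort key for the length-lexicographic order: shorter strings come
    first, and ties are broken by the (assumed alphabetical) order on symbols."""
    return (len(w), w)

def prefixes(w: str) -> list:
    """Pref*(w): every prefix of w, from the empty string up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]

def lcp(strings) -> str:
    """Longest common prefix of a non-empty collection of strings."""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest

# Examples matching the text: fac_2("abab") = {ab, ba}; λ ⊲ a ⊲ ba ⊲ bb ...
assert factors_k("abab", 2) == {"ab", "ba"}
assert sorted(["ba", "a", "bb", ""], key=length_lex_key) == ["", "a", "ba", "bb"]
assert lcp({"abc", "abd"}) == "ab"
```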

Functions and transducers
This paper deals exclusively with string-to-string functions, relations that pair every w ∈ Σ* with at most one y ∈ ∆*, where Σ and ∆ are the input alphabet and output alphabet respectively. The input language and output language of such a function are its preimage and image, respectively. An important concept is that of the tails of an input string w with respect to a function f.

Definition 1. (Tails) Given a function f and an input string w ∈ Σ*, tails_f(w) = {(y, v) : f(wy) = u · v, where u = lcp(f(wΣ*))}.
In words, tails_f(w) pairs every possible string y ∈ Σ* with the portion of f(wy) that is directly attributable to y. That is, it describes the effect that w has on the output of any subsequent string of input symbols. When tails_f(w_1) = tails_f(w_2) we say that w_1 and w_2 are tail-equivalent with respect to f.
A related concept to tails and tail-equivalency is the contribution of a symbol a ∈ Σ relative to a string w ∈ Σ * with respect to a function f .
Definition 2. (Contribution) Given a function f, a string w ∈ Σ*, and a symbol a ∈ Σ, the contribution of a relative to w is cont_f(a, w) = u^{-1} · lcp(f(waΣ*)), where u = lcp(f(wΣ*)).

In words, for an input string x that has the prefix wa, the contribution of the a in wa is the portion of f(x) that is uniquely and directly attributable to that instance of a.
The Output Tier-based Strictly Local functions that will be introduced below are a proper subclass of the subsequential functions. Oncina and García (1991) show that when a function is subsequential, tail-equivalency will partition Σ* into finitely many blocks, allowing us to construct a finite-state transducer that computes f. In this paper we use delimited subsequential finite state transducers (DSFSTs; see Jardine et al., 2014) to characterize the class of Output Strictly Local (OSL) functions. The following definition is drawn directly from Chandlee et al. (2015).
Definition 3. A delimited subsequential finite state transducer (DSFST) is a 6-tuple ⟨Q, q_0, q_f, Σ, ∆, δ⟩ where Q is a finite set of states, q_0 ∈ Q is the unique initial state, q_f ∈ Q is the unique final state, Σ is the finite input alphabet, ∆ is the finite output alphabet, and δ ⊆ Q × (Σ ∪ {⋊, ⋉}) × ∆* × Q is the transition function (where ⋊ ∉ Σ indicates the start of the input and ⋉ ∉ Σ indicates the end of the input), and the following hold:
1. if (q, a, u, q′) ∈ δ then q ≠ q_f and q′ ≠ q_0
2. if (q, a, u, q_f) ∈ δ then a = ⋉ and q ≠ q_0
3. if (q_0, a, u, q′) ∈ δ then a = ⋊, and if (q, ⋊, u, q′) ∈ δ then q = q_0
4. if (q, a, u, q′), (q, a, u′, q″) ∈ δ then q′ = q″ and u = u′
Each transition (q, a, u, q′) ∈ δ can be seen as an instruction to append u to the end of the output tape and to move to state q′ upon reading a while in state q. This transition function may be partial, and its recursive extension δ* is the smallest set containing δ that is closed under the following conditions: (q, λ, λ, q) ∈ δ*, and (q, w, u, q′), (q′, a, v, q″) ∈ δ* ⇒ (q, wa, uv, q″) ∈ δ*. The initial state of a DSFST has no incoming transitions and has exactly one outgoing transition, which will be for the input ⋊ and does not land in the final state. Furthermore, the final state of a DSFST has no outgoing transitions, and every transition into the final state is for the input ⋉. DSFSTs are also deterministic on the input, such that each state has at most one outgoing transition per input symbol.
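As an illustration of Definition 3 (ours, not from the paper), a DSFST can be stored with δ as a dictionary keyed by (state, input symbol), which encodes determinism on the input directly; the recursive extension δ* then amounts to folding over ⋊w⋉.

```python
START, END = "⋊", "⋉"   # input delimiters, assumed not to be in Σ

def run_dsfst(delta: dict, q0, qf, w: str):
    """Apply the recursive extension δ* to ⋊w⋉ and return the output string,
    or None if some transition is missing (the transition function is partial)."""
    state, out = q0, []
    for a in START + w + END:
        if (state, a) not in delta:
            return None
        u, state = delta[(state, a)]   # append u to the output tape, move to the next state
        out.append(u)
    return "".join(out) if state == qf else None

# Toy example: a DSFST computing the identity function over Σ = ∆ = {a, b}.
delta = {("q0", START): ("", "q"), ("q", END): ("", "qf")}
for a in "ab":
    delta[("q", a)] = (a, "q")

assert run_dsfst(delta, "q0", "qf", "abba") == "abba"
```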
The size of a DSFST T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ is |T| = card(Q) + card(δ) + Σ_{(q,a,u,q′)∈δ} |u|, and the relation defined by a DSFST is R(T) = {(w, u) ∈ Σ* × ∆* : (q_0, ⋊w⋉, u, q_f) ∈ δ*}. The DSFSTs we will use below have a special property known as onwardness, which informally means that the writing of the output is never delayed. The following formal definition of onwardness and a related lemma are borrowed from Chandlee et al. (2015).

Definition 4. (Onwardness) A DSFST T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ is onward if for every state q ∈ Q − {q_0, q_f}, lcp({u : (q, a, u, q′) ∈ δ}) = λ.
Below we will frequently make reference to the length-lexicographically earliest input string that can lead to a state q in a given transducer T, which we will denote as w_q. A formal definition is provided here for reference.

Definition 5. Given a DSFST T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ and a state q ∈ Q, w_q is the length-lexicographically earliest string w ∈ Σ* such that (q_0, ⋊w, u, q) ∈ δ* for some u ∈ ∆*.
A distinction that will be important throughout the rest of this paper is that between the writing that occurs in a DSFST as it is reading letters from Σ, and the writing that occurs at the very end (when the DSFST reads ). To make this distinction, Chandlee et al. (2015) defined the prefix function f p associated with a subsequential function f as follows.
Definition 6. (Prefix function) Given a subsequential function f, its associated prefix function f_p is such that f_p(w) = lcp(f(wΣ*)).
Remark 1. Given a subsequential function f, some a ∈ Σ, and some input string w ∈ Σ*, cont_f(a, w) = f_p(w)^{-1} · f_p(wa); similarly, the contribution of the end marker is cont_f(⋉, w) = f_p(w)^{-1} · f(w).
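As a simple (hypothetical) illustration of these notions, let Σ = ∆ = {a, b} and let f be the total function that copies its input and appends a single a, so that f(w) = wa. Then f_p(w) = lcp(f(wΣ*)) = lcp({wya : y ∈ Σ*}) = w, and consequently cont_f(x, w) = f_p(w)^{-1} · f_p(wx) = x for every x ∈ Σ, while cont_f(⋉, w) = f_p(w)^{-1} · f(w) = a.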

Strict locality and tiers
Chandlee (2014) and Chandlee et al. (2015) originally introduced the Input Strictly Local (ISL) and Output Strictly Local (OSL) functions, both of which generalize Strictly Local (SL) stringsets to functions based on one of the defining properties of SL languages, the Suffix Substitution Closure (Rogers and Pullum, 2011). The definitions of the ISL and OSL functions exploit a corollary of this defining property, which Chandlee et al. (2015) call Suffix-defined Residuals: a stringset L is SL_k if and only if, for all w_1, w_2 ∈ Σ*, whenever Suff^{k−1}(w_1) = Suff^{k−1}(w_2), then w_1 and w_2 have the same residuals (tails) with respect to L. For reasons of space, we only discuss the OSL functions below.

Definition 7. (Output Strictly Local functions) A function f is Output Strictly k-Local (OSL_k) if for all w_1, w_2 ∈ Σ*, whenever Suff^{k−1}(f_p(w_1)) = Suff^{k−1}(f_p(w_2)), then tails_f(w_1) = tails_f(w_2).
Chandlee (2014) and Chandlee et al. (2015) show that most iterative phonological processes can be modelled with an OSL function, with an important exception being long-distance iterative processes like consonant harmony. This is parallel to the fact that long-distance phonotactics cannot be represented with an SL stringset, which motivated Heinz et al. (2011) to define the Tier-based Strictly Local (TSL) languages: stringsets that are SL after an erasure function has applied, masking all symbols that are irrelevant to the restrictions that the language places on its strings.
Definition 8. (Erasure function) Given an alphabet Σ, a tier Θ ⊆ Σ, and a string w = a_1 ... a_n, Erase_Θ(w) = u_1 · u_2 · ... · u_n, where u_i = a_i if a_i ∈ Θ and u_i = λ otherwise.

Informally, Erase_Θ(w) returns the string w with all non-tier elements removed. For convenience, we will write Suff^n_Θ(w) to mean Suff^n(Erase_Θ(w)) in what follows.
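A minimal Python sketch (ours) of the erasure function and of the tier-relative suffix notation Suff^n_Θ used throughout the rest of the paper:

```python
def erase(w: str, tier: set) -> str:
    """Erase_Θ(w): delete every symbol of w that is not on the tier Θ."""
    return "".join(a for a in w if a in tier)

def suff_tier(w: str, tier: set, n: int) -> str:
    """Suff^n_Θ(w) = Suff^n(Erase_Θ(w)): the last n tier symbols of w,
    or the whole projection if fewer than n tier symbols remain."""
    projected = erase(w, tier)
    return projected[-n:] if n > 0 else ""

# With the sibilant tier Θ = {s, S} from the Samala example:
assert erase("haSxintilawaS", {"s", "S"}) == "SS"
assert suff_tier("haSxintilawa", {"s", "S"}, 1) == "S"
```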
Definition 9. (Tier-based Strictly Local languages) A language L is Tier-based Strictly k-Local (TSL_k) if there is a tier Θ ⊆ Σ and a subset S ⊆ fac_k(⋊Θ*⋉) such that a string w belongs to L if and only if fac_k(⋊ · Erase_Θ(w) · ⋉) ⊆ S.

Output Tier-based Strictly Local functions and transducers
In this section, we define the OTSL functions, which generalize the TSL stringsets to functions in the same way that the OSL functions generalize SL stringsets to functions (see Chandlee, 2014;Chandlee et al., 2015).
The OTSL class properly contains the OSL functions, since every OSL k function can be described as an OTSL k function whose tier is equal to the entire output alphabet. Note that it is possible for a single OTSL function to be described with more than one tier. For example, the identity function (where Σ = ∆ and f (w) = w) can be described with any subset of ∆ as its tier. We use the term k-tier to describe a tier Θ for which f is OTSL k .
Like the OSL k functions, the OTSL k functions can be characterized in automata-theoretic terms. First, we define OTSL k finite state transducers as follows.
Definition 11. An onward DSFST T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ is OTSL_k for the tier Θ ⊆ ∆ if:

Lemmas 2 and 3, together with Theorem 2, show that the OTSL_k functions and the functions represented by OTSL_k transducers exactly correspond.
Lemma 2. Let T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ be an OTSL_k transducer for the tier Θ. The following holds:

Lemma 3. Any OTSL_k transducer corresponds to an OTSL_k function.
Theorem 2. Given an OTSL_k function f and one of its k-tiers Θ, the DSFST T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ defined as follows computes f:

We note that these are trivial extensions of Lemmas 3 and 4 and Theorem 2 in Chandlee et al. (2015). Indeed, only two minor changes are necessary for this generalization to OTSL_k functions. First, each instance of Suff^{k−1} must be replaced with Suff^{k−1}_Θ. Second, in order to account for the fact that non-tier elements may come between relevant tier elements, certain references to a string q = t_1 t_2 ... t_{k−1} must be rewritten as t_1 v_1 t_2 v_2 ... t_{k−1} v_{k−1}, with each v_i ∈ (∆ − Θ)*. As the proofs are otherwise identical in structure to those found in Chandlee et al. (2015), we do not provide them here.
It is therefore the case that any OTSL k function can be represented by an OTSL k transducer. Informally, this will be an onward DSFST in which the non-initial and non-final states represent the most recent k − 1 tier symbols written thus far, meaning that this is the only information that will dictate what the DSFST writes upon reading the next input symbol.
As an example, Figure 1 presents an OTSL_2 transducer that models the unbounded sibilant harmony in Samala from Section 1. Note that in order to achieve the regressive directionality of the process, we assume that this transducer reads input strings from right-to-left (following, e.g., Heinz and Lai, 2013; Chandlee et al., 2015). Directionality will be further discussed in Section 6.

Figure 1: An OTSL_2 transducer that models unbounded sibilant harmony, where ? represents any symbol that is not s or S.
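The following Python sketch (ours, deliberately simplified) shows one way the transducer of Figure 1 can be realized: states are labelled by the most recent tier symbol written (λ, s, or S), the tier is Θ = {s, S}, and the regressive directionality is obtained by reversing the input before processing and the output afterwards. Only the s → S mapping described in the text is encoded; the remaining transitions of the figure are abstracted away.

```python
TIER = {"s", "S"}

def samala_harmony(word: str) -> str:
    """OTSL_2-style sketch of regressive sibilant harmony: an input s is
    written as S whenever the most recently written tier symbol (reading
    right-to-left) is S; every other symbol is copied faithfully."""
    state = ""                          # last tier symbol written so far ("" plays the role of λ)
    out = []
    for a in reversed(word):            # right-to-left reading of the input
        b = "S" if (a == "s" and state == "S") else a
        out.append(b)
        if b in TIER:                   # the state tracks the *output* tier symbol
            state = b
    return "".join(reversed(out))       # restore surface order

# The forms cited from Applegate (1972):
assert samala_harmony("hasxintilawaS") == "haSxintilawaS"
assert samala_harmony("slusisinwaS") == "SluSiSinwaS"
```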

Useful properties of OTSL 2 functions
The main goal of this paper is to demonstrate how OTSL functions can be learned from positive data, even without prior knowledge of the tier itself. We note that the tier-induction strategy adopted below relies on certain properties that hold when k = 2, but not necessarily for greater values of k. These are outlined below. First, when an OTSL_2 function f can be described with more than one 2-tier, the union of any two or more such 2-tiers is also a 2-tier for f.

Lemma 4. Given an OTSL_2 function f and any two 2-tiers Θ_1 and Θ_2 for f, the union Θ_1 ∪ Θ_2 is also a 2-tier for f.
It is this property that allows us to identify a unique target tier for an OTSL 2 function, which the algorithm can find by flagging and removing elements of ∆ from its hypothesis when evidence is found that they cannot be on a relevant tier. We define this canonical 2-tier as follows.
Definition 12. (Canonical 2-tier) Given an OTSL_2 function f, Θ is the canonical 2-tier for f iff there is no other 2-tier Ω ⊆ ∆ for f such that card(Ω) ≥ card(Θ).

Remark 2. Given an OTSL_2 function f, its canonical 2-tier Θ is a superset of any 2-tier for f. (This follows immediately from Lemma 4.)

There is therefore a unique canonical 2-tier (i.e., the largest one) for each OTSL_2 function. Interestingly, this can be exploited during the learning process, since it leads to the following useful property of OTSL_2 functions.
Lemma 5. Let f be an OTSL_2 function with canonical 2-tier Θ, and let Ω ⊆ ∆ be any strict superset of Θ. Then there exist some a ∈ (Ω − Θ), some w_1, w_2 ∈ Σ*, and some x ∈ Σ ∪ {⋉} such that Suff^1_Ω(f_p(w_1)) = Suff^1_Ω(f_p(w_2)) = a but cont_f(x, w_1) ≠ cont_f(x, w_2).

Proof. By contradiction. Suppose that the lemma is false. This means that ∀a ∈ (Ω − Θ), ∀w_1, w_2 ∈ Σ*, and ∀x ∈ Σ ∪ {⋉}, we have Suff^1_Ω(f_p(w_1)) = Suff^1_Ω(f_p(w_2)) = a ⇒ cont_f(x, w_1) = cont_f(x, w_2). Now, since Θ is a 2-tier for f, it is also the case that ∀b ∈ Θ, ∀w_1, w_2 ∈ Σ*, and ∀x ∈ Σ ∪ {⋉}, we have Suff^1_Ω(f_p(w_1)) = Suff^1_Ω(f_p(w_2)) = b ⇒ cont_f(x, w_1) = cont_f(x, w_2). Together these imply that ∀c ∈ Ω, ∀w_1, w_2 ∈ Σ*, and ∀x ∈ Σ ∪ {⋉}, we have Suff^1_Ω(f_p(w_1)) = Suff^1_Ω(f_p(w_2)) = c ⇒ cont_f(x, w_1) = cont_f(x, w_2), and hence that Suff^1_Ω(f_p(w_1)) = Suff^1_Ω(f_p(w_2)) ⇒ cont_f(y, w_1) = cont_f(y, w_2) for all y ∈ Σ ∪ {⋉}. This applies recursively, giving us Suff^1_Ω(f_p(w_1)) = Suff^1_Ω(f_p(w_2)) ⇒ tails_f(w_1) = tails_f(w_2), which means that Ω is a 2-tier for f. However, card(Ω) > card(Θ), contradicting the fact that Θ is the canonical 2-tier for f.

Importantly, it follows from Lemma 5 that for any set Ω which is a strict superset of Θ (the canonical 2-tier), we will always be able to find evidence that some member of Ω could not be a member of any 2-tier for f. It is this property of OTSL_2 functions that our algorithm makes use of to determine which output symbols are in Θ. Once again, when k > 2, this property does not necessarily hold (for example, if ∆ = {a, b, c}, there could be an OTSL_3 function for which Θ_1 = {a, b} and Θ_2 = {a, c} are both 3-tiers, but Ω = {a, b, c} is not). Accordingly, we restrict ourselves to k = 2 when discussing the learning of OTSL functions without prior knowledge of the tier. While OTSL_2 functions seem sufficient for modelling a wide range of long-distance phonological processes, we discuss certain exceptions in Section 6.

Learning paradigm
We adopt the criterion for successful learning that requires exact identification in the limit from positive data (Gold, 1967), with polynomial bounds on time and data (de la Higuera, 1997). We first define what it means for a class of functions to be represented by a class of representations.
Definition 13. A class T of functions is represented by a class R of representations if every r ∈ R is of finite size and there is a total and surjective naming function L : R → T such that L(r) = t if and only if for all w in the preimage of t, r(w) = t(w), where r(w) is the output produced by r given the input w.
The notions of a sample and a learning algorithm are defined as follows.
Definition 14. (Sample) A sample S for a function t ∈ T is a finite set of data consistent with t, that is to say, if (w, u) ∈ S then t(w) = u. The size of a sample is the sum of the lengths of the strings it is composed of: |S| = Σ_{(w,u)∈S} (|w| + |u|).
Definition 15. (Learning algorithm) A (T, R)-learning algorithm A is a program that takes as input a sample for a function t ∈ T and outputs a representation from R.
The paradigm relies on the notion of a characteristic sample, adapted here for functions as in Chandlee et al. (2015).
Definition 16. (Characteristic sample) For a (T, R)-learning algorithm A, a sample CS is a characteristic sample of a function t ∈ T if for all samples S ⊇ CS, A returns a representation r such that L(r) = t.
The learning paradigm can now be defined as follows.
Definition 17. (Identification in polynomial time and data) A class T of functions is identifiable in polynomial time and data if there exists a (T, R)-learning algorithm A and two polynomials p() and q() such that:
1. For any sample S of size m for t ∈ T, A returns a hypothesis r ∈ R in O(p(m)) time.
2. For every t ∈ T, there exists a characteristic sample of A for t whose size is at most O(q(|r|)), where r ∈ R is such that L(r) = t.

Learning when the tier is given
Prior to describing the approach we take to inducing the contents of a tier when k = 2, we note that learning any OTSL_k function from positive data is relatively straightforward if the value of k and the tier Θ are known beforehand. In particular, although the OSLFIA presented in Chandlee et al. (2015) was designed only to learn OSL functions, it turns out that a minor modification allows us to extend their result to OTSL functions, so long as k and Θ are known beforehand. We summarize how this can be done below.

In its original form, the OSLFIA inevitably fails to learn any OTSL function that is not itself OSL (i.e., where Θ ≠ ∆). Specifically, since the algorithm labels each landing state of a transition with the (k − 1)-suffix of its associated output, it will always incorrectly determine the landing state of one or more transitions when there is a long-distance dependency. Moreover, the exact way in which the resulting OSL transducer differs from the target OTSL transducer is somewhat unpredictable. As such, there does not seem to be a general approach for transforming the OSLFIA's output into an appropriate OTSL_k transducer.
In cases where Θ is known beforehand, however, we can circumvent this issue by simply specifying that non-tier elements should be skipped over when labelling a state. In doing so, the algorithm will be able to find all of the necessary states as well as the correct landing state for each transition in the target OTSL_k transducer. This modification of the OSLFIA is incorporated into the function build_fst, which is detailed in Algorithm 1. While this constitutes one important aspect of learning OTSL_k functions, it is nonetheless a major challenge to determine the actual contents of Θ without a priori knowledge. Although inducing a k-tier for any value of k remains an open problem, in the following section we describe how this can be done when k = 2.
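The modification amounts to labelling each landing state with the last k − 1 tier symbols of the output written so far, rather than its last k − 1 symbols outright. A minimal sketch of just that labelling step (ours; the rest of build_fst constructs states, transitions, and outputs as in the OSLFIA):

```python
def erase(w: str, tier: set) -> str:
    """Erase_Θ(w): keep only the symbols of w that are on the tier."""
    return "".join(a for a in w if a in tier)

def state_label(output_so_far: str, tier: set, k: int) -> str:
    """Label of the landing state after writing output_so_far:
    Suff^{k-1}_Θ of the output, i.e. the OSLFIA's Suff^{k-1} computed
    after skipping over non-tier symbols."""
    projected = erase(output_so_far, tier)
    return projected[-(k - 1):] if k > 1 else ""

# With k = 2 and Θ = {s, S}, writing "haS" and "haSxintila" lands in the same
# state, labelled "S"; the unmodified OSLFIA would label them "S" and "a".
assert state_label("haS", {"s", "S"}, 2) == "S"
assert state_label("haSxintila", {"s", "S"}, 2) == "S"
```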

Learning the contents of a 2-tier
Having shown that the OSLFIA can be modified to learn an OTSL_k function f once Θ (the tier) is known, we now describe our approach to inducing Θ itself when k = 2. After this is done, Θ can simply be fed into the build_fst function in order to produce an OTSL_2 transducer that represents f.

Algorithm 1: Building an OTSL_k transducer when given Θ (function build_fst(S, Θ, k)).

The first step toward learning the contents of a 2-tier is to gain as much information as possible about the prefix function f_p corresponding to f, based only on the evidence provided in the training sample. To do this, the function estimate_fp, shown in Algorithm 2, goes through every string x that is the prefix of at least one input string in the training data, and for every a ∈ Σ, it checks whether xa is also a prefix of some input string. If this is the case for every a ∈ Σ, there is enough information to determine f_p(x). The function estimate_fp will then add the pair (x, z) to the set P, where z is the longest common prefix of f(w) for all (w, f(w)) ∈ S such that x is a prefix of w. We note that this z will be equal to f_p(x) provided that the training data come from a subsequential function, and so this technique may be useful for learning other types of functions as well.

Algorithm 2: Prefix function estimation (function estimate_fp(S)).
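A minimal Python sketch of this estimation step (ours), under our reading of the description above: a prefix x contributes a pair to P only when every one-symbol extension xa is also attested as an input prefix in the sample.

```python
def lcp(strings) -> str:
    """Longest common prefix of a non-empty collection of strings."""
    shortest = min(strings, key=len)
    for i, ch in enumerate(shortest):
        if any(s[i] != ch for s in strings):
            return shortest[:i]
    return shortest

def estimate_fp(sample: dict, sigma: set) -> dict:
    """Estimate the prefix function f_p from a sample {input: output}.
    Returns P as a dict mapping an input prefix x to the lcp of f(w)
    over all sampled inputs w that have x as a prefix."""
    input_prefixes = {w[:i] for w in sample for i in range(len(w) + 1)}
    P = {}
    for x in input_prefixes:
        # Keep x only if xa is attested as an input prefix for every a in Σ.
        if all(x + a in input_prefixes for a in sigma):
            P[x] = lcp([u for w, u in sample.items() if w.startswith(x)])
    return P

# Toy usage (hypothetical data): Σ = {a, b}, and f appends a single "a".
S = {"": "a", "a": "aa", "b": "ba", "aa": "aaa", "ab": "aba", "ba": "baa", "bb": "bba"}
P = estimate_fp(S, {"a", "b"})
assert P[""] == "" and P["a"] == "a" and P["b"] == "b"
```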
We further note, however, that by using this strategy, estimate_fp is only guaranteed to produce all the pairs (x, f_p(x)) necessary to discover the tier for total functions. This is because, when there is no pair of the shape (pax, f(pax)) in the training data, estimate_fp does not know whether this is accidental (i.e., due to the finite nature of the training data) or because the function is undefined for all inputs of the shape pax. While the ability to accommodate partial functions would have practical applications for learning from natural language data, at present we leave the task of extending estimate_fp in this way to future research.
The full learning algorithm, which we call the OTSL_2 Function Inference Algorithm (OTSL2FIA), is shown in Algorithm 3. We assume that Σ and ∆ are fixed and not part of the input to the learning problem (and that k = 2). Given a finite sample of training data, it first estimates the relevant prefix function with the set P, as described above, and begins with the hypothesis that Θ = ∆ (i.e., that all members of the output alphabet are on the target tier). Then, for each a ∈ Θ, it looks through P for any evidence that a needs to be removed from Θ. To do this, it builds an auxiliary set Match that contains every (p, f_p(p)) ∈ P for which Suff^1_Θ(f_p(p)) = a under the current hypothesis for Θ. For each x ∈ Σ ∪ {⋉}, it then checks whether cont_f(x, p) is the same for all (p, f_p(p)) ∈ Match. If this is the case, a is added to the set Keep. However, if there is more than one value found for the contribution of some x ∈ Σ ∪ {⋉}, it will instead remove a from Θ, since it cannot possibly be a member of the target 2-tier. If at any point some symbol gets removed from Θ, the set Keep is immediately emptied. This portion of the algorithm will run until every a in the current hypothesis for Θ gets added to the set Keep, in which case it knows it has found the canonical 2-tier of the target function.
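Under the same assumptions as the sketches above, the tier-induction loop can be rendered roughly as follows (ours; P is the output of estimate_fp, sample maps inputs to outputs, and END is a stand-in for the end marker ⋉). A symbol a is discarded as soon as two prefixes whose tier suffix under the current hypothesis is a are found to disagree on the contribution of some x ∈ Σ ∪ {⋉}.

```python
END = "⋉"   # stand-in for the end-of-input marker

def erase(w: str, tier: set) -> str:
    """Erase_Θ(w): keep only the tier symbols of w."""
    return "".join(c for c in w if c in tier)

def induce_tier(P: dict, sample: dict, sigma: set, delta: set) -> set:
    """Sketch of the OTSL2FIA tier-induction step: start from Θ = ∆ and remove
    a symbol whenever two prefixes sharing it as their current tier suffix
    disagree on the contribution of some x ∈ Σ ∪ {⋉}."""
    tier, keep = set(delta), set()
    while keep != tier:
        for a in sorted(tier - keep):
            # Every estimated pair whose last tier symbol (under the current Θ) is a.
            match = [(w, u) for w, u in P.items() if erase(u, tier)[-1:] == a]
            removed = False
            for x in sorted(sigma) + [END]:
                contributions = set()
                for w, u in match:
                    if x == END and w in sample:
                        contributions.add(sample[w][len(u):])   # u^{-1} · f(w)
                    elif x != END and w + x in P:
                        contributions.add(P[w + x][len(u):])    # u^{-1} · f_p(wx)
                if len(contributions) > 1:      # conflicting evidence: a cannot be on Θ
                    tier.discard(a)
                    keep = set()                # restart the bookkeeping from scratch
                    removed = True
                    break
            if removed:
                break
            keep.add(a)
    return tier
```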
Once the OTSL2FIA converges on the canonical 2-tier Θ, the final step is simply to feed Θ and the sample S into the function build_fst shown above in Algorithm 1 (further specifying that k = 2). Under the assumption that the training sample contains the appropriate evidence, as described in the following section, this will produce an OTSL_2 transducer which represents the target OTSL_2 function.

Theoretical results
Here we establish several theoretical results, which culminate in the theorem that the OTSL2FIA identifies the class of total OTSL 2 functions in polynomial time and data.
In what follows, we let f be the target OTSL_2 function, Θ′ be its canonical 2-tier, and T′ = ⟨Q′, q′_0, q′_f, Σ, ∆, δ′⟩ be its target transducer as defined by Theorem 2. We furthermore let Θ be the OTSL2FIA's final tier hypothesis, and T = ⟨Q, q_0, q_f, Σ, ∆, δ⟩ be the transducer that it constructs on the input sample.

Lemma 6. Given a sample S for f, the OTSL2FIA returns a hypothesis in time polynomial in the size of S.

Proof. Let n = Σ_{(w,u)∈S} |w|, m = max{|u| : (w, u) ∈ S}, p = max{|w| : (w, u) ∈ S}, and s = card(S). We note that these are all linear in the size of the sample.

The OTSL2FIA starts by calling estimate_fp. This function first determines all of the input prefixes present in S, which takes n steps. Then estimate_fp checks, for each prefix x and all a ∈ Σ, whether xa is also an input prefix in S. There are at most sm prefixes in S, so this takes at most card(Σ) · (sm) · n steps. Finally, for a subset of the input prefixes, estimate_fp determines lcp({u : (w, u) ∈ S such that x ∈ Pref*(w)}), which with an appropriate data structure (for instance a prefix tree) can be done in nm steps. The overall computation time of estimate_fp is thus O(n + (sm)n + (sm)(nm)), which is quartic in the size of the learning sample.
The portion of the OTSL2FIA that determines the tier is now run. After i elements have been removed from Θ, the combined while/for loop can run up to card(∆) − i times, and can only remove up to card(∆) items, so the loop will be used fewer than card(∆)² times, which is a constant. This main loop first gathers all (w, u) ∈ P that meet a certain criterion into the set Match, which can be done in card(P) · m = (sm)m = sm² steps. Next, the main loop enters a for loop that is used card(Σ) times (a constant) and which attempts to calculate the contribution of σ ∈ Σ using each (w, u) ∈ Match if it can find (wσ, v) ∈ P. We note that card(Match) will be at most sm, that finding (wσ, v) ∈ P takes at most smp steps, and that calculating the contribution takes at most m steps. The main loop then attempts to calculate the contribution of ⋉ using each (w, u) ∈ Match if it can find (w, v) ∈ S. We note that finding (w, v) ∈ S takes at most n steps, and that calculating the contribution takes at most m steps. The overall computation time of this portion of the algorithm is thus O(sm² + sm(smp + m) + sm(n + m)), which is quintic in the size of the learning sample.
Finally, the OTSL2FIA feeds Θ and S to the function build_fst. As noted above, this function incorporates a simple modification to the state-labelling process in Chandlee et al.'s (2015) OSLFIA. While this change allows it to build an OTSL transducer once the tier is known, it does not affect computation time. This final step of the OTSL2FIA therefore runs in time quadratic in the size of the learning sample (for OSLFIA time complexity proofs, see Chandlee et al., 2015). Since each portion of the OTSL2FIA runs in time polynomial in the size of the sample, with the highest complexity being quintic, the overall computation time of the algorithm is therefore polynomial in the size of the learning sample.
The remaining lemmas of this section will show that for each total OTSL 2 function f , there is a finite kernel of data consistent with f that is a characteristic sample for OTSL2FIA, which we call an OTSL2FIA seed.
Definition 18. (Seed) Given T′, a sample S contains a seed if:
1. For all q ∈ Q′, (w_q, f(w_q)) ∈ S.
2. For all (q, a, u, q′) ∈ δ′ such that q′ ≠ q_f and a ∈ {⋊} ∪ Σ, and for all pairs b, c ∈ Σ:

Lemma 7. If a learning sample S contains a seed, then the OTSL2FIA can determine cont_f(x, w) for all w ∈ Σ* and all x ∈ Σ ∪ {⋉}.
total function. For each pair, we have |w| ≤ card(Q′) + 3 and |f(w)| ≤ Σ_{(q,a,u,q′)∈δ′} |u| + 3m′, where m′ = max{|u| : (q, a, u, q′) ∈ δ′}. With this last quantity denoted y′, we note that y′ = O(|T′|). The overall length of the inputs in the portion of the seed contributed by item 2 is therefore in O((3 · card(δ′))(card(Q′) + 3)) = O(card(δ′) · card(Q′) + card(δ′)), and the overall length of the outputs in the portion of the seed contributed by item 2 is in O(3 · card(δ′) · y′). Both of these are quadratic in the size of T′. Altogether, then, the size of the seed is quadratic in the size of the target transducer.
Theorem 3. The OTSL2FIA identifies the class of total OTSL_2 functions in polynomial time and data.

Discussion
The OTSL functions introduced in this paper are capable of modelling many of the attested long-distance phonological processes. These processes can be assimilatory like sibilant harmony in Samala (see Section 1), but can also be dissimilatory. For example, Georgian exhibits a pattern of liquid dissimilation, in which /r/ surfaces as [l] when preceded at any distance by another [r] (e.g., /aprik'-uri/ → [aprik'uli] 'African'; Odden, 1994). Interestingly, the dissimilation does not occur if there is an intervening [l] (e.g., /kartl-uri/ → [kartluri] 'Kartvelian'). The OTSL functions are fully capable of representing such blocking effects, as shown in Figure 2. To avoid cluttering the figure, we omit the final state and all of its incoming transitions (which would be labelled ⋉:λ).
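To make the blocking effect concrete, here is a Python sketch in the same style as the Samala example (ours, reduced to the two facts described in the text): the tier is {r, l}, the state is the most recent tier symbol written, and, since the process is progressive (see the next paragraph), the input is read from left to right.

```python
TIER = {"r", "l"}

def georgian_dissimilation(word: str) -> str:
    """OTSL_2-style sketch of Georgian liquid dissimilation: an input r is
    written as l when the most recently written tier symbol is r. Because the
    state tracks the output, a written l (underlying or derived) blocks
    dissimilation of any later r."""
    state = ""                  # last tier symbol written so far (λ)
    out = []
    for a in word:              # left-to-right reading (progressive process)
        b = "l" if (a == "r" and state == "r") else a
        out.append(b)
        if b in TIER:
            state = b
    return "".join(out)

# The forms cited from Odden (1994):
assert georgian_dissimilation("aprik'uri") == "aprik'uli"   # dissimilation applies
assert georgian_dissimilation("kartluri") == "kartluri"     # blocked by the intervening l
```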
It is worth pointing out that the processes in Samala and Georgian apply in opposite directions. In Samala, the trigger is the rightmost sibilant, whereas in Georgian it is the leftmost liquid. This distinction can be captured by assuming that input strings are read from left-to-right in the Georgian case (i.e., the process is progressive), but from right-to-left in the Samala case (i.e., the process is regressive). The direction of reading, then, divides the OTSL functions into two overlapping but distinct classes which we call L-OTSL (which read from the left) and R-OTSL (which read from the right), following Heinz and Lai (2013) and Chandlee et al. (2015) who make the same distinction for the subsequential and OSL functions, respectively.
As mentioned above, the OTSL2FIA outlined in Section 5 only succeeds in learning total functions and is designed specifically to learn OTSL_2 functions. The algorithm exploits the fact that the largest possible 2-tier for an OTSL_2 function f is a superset of every other 2-tier for f, and will accordingly never run the risk of removing an element that would need to be subsequently re-added to the tier. However, it is not clear that this strategy will succeed for higher values of k, which may be needed to model certain types of patterns. For example, a reviewer raises the complex case of retroflexion harmony targeting /n/ in Sanskrit (also known as nati) as one such pattern. A formal analysis provided by Graf and Mayer (2018) uses a class of stringsets that they call Input-Output Tier-based Strictly Local (IO-TSL). IO-TSL formal languages are like TSL languages except that input symbols are projected onto the tier based on (i) the surrounding context of input symbols and (ii) the symbols that precede it on the tier that has been projected so far. Under this analysis, Sanskrit n-retroflexion requires k = 3 on the projected tier.
Finally, while the OTSL class can model long-distance processes, it can only do so when no more than a single tier is required. That is, a language that simultaneously exhibits patterns of, e.g., sibilant harmony and liquid dissimilation would not be OTSL for any value of k. Further exploration of these issues will allow us to better understand the computational properties of phonological transformations and to establish a boundary of complexity that is both necessary and sufficient for capturing the full range of possible phonological systems.

Conclusion
This paper has provided both a language-theoretic and an automata-theoretic characterization of the OTSL class of functions, which is relevant for modelling long-distance phonological processes as string-to-string transformations. We further demonstrated that by generalizing previous research on OSL functions to the OTSL class, any OTSL k function can be learned once the tier is known. Finally, we introduced an algorithm for efficiently learning any total OTSL 2 function from positive data, even when a relevant tier is not given a priori. To our knowledge, this is the first algorithm to accomplish this for input-output mappings rather than phonotactics. In future research, we aim to extend this result in multiple ways: to partial functions, to any value of k, and to processes requiring multiple tiers.