Implicational Universals in Stochastic Constraint-Based Phonology

This paper focuses on the most basic implicational universals in phonological theory, called T-orders after Anttila and Andrus (2006). It shows that the T-orders predicted by stochastic (and partial order) Optimality Theory coincide with those predicted by categorical OT. Analogously, the T-orders predicted by stochastic Harmonic Grammar coincide with those predicted by categorical HG. In other words, these stochastic constraint-based frameworks do not tamper with the typological structure induced by the original categorical frameworks.

More recently, phonology has extended its empirical coverage from categorical alternations to patterns of phonologically conditioned variation and gradient phonological (or phonotactic) judgements (see for instance Anttila, 2012 and Pater, 2011). This extension of the empirical coverage has required a corresponding extension of the theoretical framework. A phonological grammar can no longer be construed as a categorical function from underlying representations (URs) to surface representations (SRs). Instead, it must be construed as a function from URs to probability distributions over the entire set of SRs. Constraint-based implementations of this stochastic theory include partial order OT (Anttila, 1997b), stochastic OT (SOT; Boersma, 1997, 1998), and stochastic HG (SHG; Boersma and Pater, 2016), recalled below in sections 4 and 5. Another framework explored in the recent literature on probabilistic constraint-based phonology is MaxEnt (ME; Goldwater and Johnson, 2003; Hayes and Wilson, 2008). Its T-orders are discussed in a companion paper (Anttila and Magri, 2018).
How can we investigate and understand the typological structure encoded by a probabilistic phonological framework? In the case of a categorical framework such as OT or HG, the predicted typological structure can be investigated directly by exhaustively listing all the grammars predicted for certain constraint and candidate sets. That is possible because the predicted typology of grammars is usually finite. The situation is rather different for probabilistic frameworks: the predicted typology always consists of an infinite number of probability distributions which therefore cannot be exhaustively listed and directly inspected. A more indirect strategy is needed to chart the predicted typological structure.
A natural indirect strategy that gets around the problem raised by an infinite typology is to enumerate, not the individual languages in the typology, but the set of implicational universals predicted by the typology. An implicational universal is an implication $P \xrightarrow{T} \hat{P}$ which holds of a given typology $T$ whenever every language in the typology that satisfies the antecedent property $P$ also satisfies the consequent property $\hat{P}$ (Greenberg, 1963). Since implicational universals take into account every language in the typology, they chart the boundaries and measure the richness of the typological structure predicted by $T$.
Which antecedent and consequent properties $P$ and $\hat{P}$ should we focus on? To start from the simplest case, let us consider a typology $T$ of categorical phonological grammars, construed traditionally as mappings from URs to SRs. Within this categorical framework, the simplest, most basic, most atomic antecedent property $P$ is the property of mapping a certain specific UR $x$ to a certain specific SR $y$. Analogously, the simplest consequent property $\hat{P}$ is the property of mapping a certain specific UR $\hat{x}$ to a certain specific SR $\hat{y}$. We thus focus on the following class of implications:
Definition 1 The implicational universal $(x, y) \xrightarrow{T} (\hat{x}, \hat{y})$ holds relative to a categorical typology $T$ provided each grammar in $T$ which succeeds at the antecedent mapping (i.e., it maps the UR $x$ to the SR $y$) also succeeds at the consequent mapping (i.e., it maps the UR $\hat{x}$ to the SR $\hat{y}$). □
The relation $\xrightarrow{T}$ thus defined over mappings is a partial order (under mild additional assumptions). It is called the T-order induced by the typology $T$ (Anttila and Andrus, 2006). For example, any dialect of English that deletes t/d at the end of a coda cluster before a vowel also deletes it before a consonant (Guy, 1991; Kiparsky, 1993; Coetzee, 2004). The implication (/cost.us/, [cos.us]) $\xrightarrow{T}$ (/cost.me/, [cos.me]) thus holds relative to the typology $T$ of English dialects.
Implicational universals can also be statistical. For instance, in dialects of English where t/d deletion applies variably, deletion has been found to be more frequent before consonants than before vowels. To model these frequency effects, we need to consider a typology $T$ of probabilistic phonological grammars, construed as functions from URs to probability distributions over SRs. We propose to extend the notion of T-orders from the categorical to the probabilistic setting as follows:
Definition 2 The implicational universal $(x, y) \xrightarrow{T} (\hat{x}, \hat{y})$ holds relative to a probabilistic typology $T$ provided each grammar in $T$ assigns a probability to the consequent mapping $(\hat{x}, \hat{y})$ which is at least as large as the probability it assigns to the antecedent mapping $(x, y)$. □
To illustrate, the implication (/cost.us/, [cos.us]) $\xrightarrow{T}$ (/cost.me/, [cos.me]) also holds relative to the typology $T$ of English dialects with variable deletion because the probability of the consequent (/cost.me/, [cos.me]) (i.e., the frequency of deletion before a consonant) in any dialect is at least as large as the probability of the antecedent (/cost.us/, [cos.us]) (i.e., the frequency of deletion before a vowel).
The original categorical definition 1 of T-orders is a special case of the probabilistic definition 2. In fact, suppose that a categorical grammar succeeds on the antecedent mapping $(x, y)$. That grammar construed probabilistically thus assigns probability 1 to the antecedent mapping. Definition 2 then requires that grammar to also assign probability 1 to the consequent mapping $(\hat{x}, \hat{y})$. In other words, the grammar construed categorically succeeds on the consequent mapping, as required by the original definition 1 of categorical T-orders.
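As a minimal sketch of definitions 1 and 2, the following Python snippet checks an implication against a small typology. The three categorical dialects, the deletion rates of the probabilistic dialects, and all function names are made up for illustration; they are not data or code from the cited studies.

```python
# Hypothetical toy typologies; UR/SR labels follow the t/d deletion example.
UR_US, UR_ME = "/cost us/", "/cost me/"

# Categorical grammars: each maps a UR to a single SR.
categorical = [
    {UR_US: "[cost.us]", UR_ME: "[cost.me]"},  # no deletion
    {UR_US: "[cost.us]", UR_ME: "[cos.me]"},   # deletion before consonants only
    {UR_US: "[cos.us]", UR_ME: "[cos.me]"},    # deletion in both contexts
]

# Probabilistic grammars: each maps a UR to a distribution over SRs.
# The rates below are invented for illustration.
probabilistic = [
    {UR_US: {"[cos.us]": 0.3, "[cost.us]": 0.7},
     UR_ME: {"[cos.me]": 0.6, "[cost.me]": 0.4}},
    {UR_US: {"[cos.us]": 0.1, "[cost.us]": 0.9},
     UR_ME: {"[cos.me]": 0.1, "[cost.me]": 0.9}},
]

def holds_categorically(typology, antecedent, consequent):
    """Definition 1: grammars succeeding on the antecedent succeed on the consequent."""
    (x, y), (x_hat, y_hat) = antecedent, consequent
    return all(g[x_hat] == y_hat for g in typology if g[x] == y)

def holds_probabilistically(typology, antecedent, consequent):
    """Definition 2: P(consequent) >= P(antecedent) in every grammar."""
    (x, y), (x_hat, y_hat) = antecedent, consequent
    return all(g[x_hat].get(y_hat, 0.0) >= g[x].get(y, 0.0) for g in typology)

def as_probabilistic(grammar):
    """Construe a categorical grammar as putting probability 1 on its output."""
    return {x: {y: 1.0} for x, y in grammar.items()}

before_vowel = (UR_US, "[cos.us]")
before_consonant = (UR_ME, "[cos.me]")

print(holds_categorically(categorical, before_vowel, before_consonant))
print(holds_categorically(categorical, before_consonant, before_vowel))
print(holds_probabilistically(probabilistic, before_vowel, before_consonant))
```

Converting the categorical grammars with `as_probabilistic` and re-running the probabilistic check gives the same verdict as the categorical check on this toy typology, in line with the special-case relationship between the two definitions.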
T-orders are defined at the level of mappings from URs to SRs. They thus allow for cross-framework comparisons, even bridging across categorical and probabilistic frameworks. This paper (together with the companion paper Anttila and Magri, 2018) thus uses T-orders to compare the probabilistic implementations of constraint-based phonology with the original categorical implementations.
The main result reported in this paper is that the T-orders predicted by stochastic OT (and by partial order OT) coincide with those predicted by categorical OT, no matter what the candidate and constraint sets look like, as shown in section 4. Analogously, the T-orders predicted by stochastic HG coincide with those predicted by categorical HG, as shown in section 5. In other words, these stochastic frameworks do not tamper with the typological structure induced by the original categorical frameworks, at least when that structure is measured in terms of T-orders. These specific results about OT and HG are derived as a special case of a more general result on stochastic typologies, developed in sections 2 and 3.
As discussed in a companion paper (Anttila and Magri, 2018), the situation is very different for ME. Both ME and stochastic HG can be construed as probabilistic variants of categorical HG. Stochastic and categorical HG share the same T-orders. The ME T-orders instead obey a rather different underlying convex geometry and turn out to be much sparser. In other words, ME yields a much richer probabilistic extension of categorical HG than stochastic HG does. Section 6 concludes the paper by discussing these results in the context of the recent literature on probabilistic constraint-based phonology.

Categorical and stochastic phonology
We assume a relation Gen which pairs each UR x with a set Gen(x) of candidate SRs. As recalled above, a categorical phonological grammar G takes a UR x and selects a corresponding SR y = G(x) from the candidate set Gen(x). A stochastic phonological grammar G instead takes a UR x and returns a probability distribution G(·| x) over Gen(x) which assigns a probability G(y| x) to each candidate SR y in Gen(x). This section illustrates a general method to leverage a given typology T of categorical grammars into a typology of stochastic grammars. Sections 4 and 5 will then show that various stochastic frameworks in the recent constraint-based literature (such as partial order OT, stochastic OT, and stochastic HG) all fit within this general scheme.
Following common practice in constraint-based phonology, we assume that the categorical typology $T$ only contains a finite number of grammars. We consider a probability mass function $p$ over $T$. Thus, $p$ assigns to each categorical grammar $G$ in $T$ a nonnegative probability mass $p(G) \geq 0$ and these masses sum up to 1, namely $\sum_{G \in T} p(G) = 1$. We can then define the stochastic grammar $G_p$ corresponding to the probability mass function $p$ as the function which takes a UR $x$ and returns the probability distribution $G_p(\cdot \mid x)$ over the candidate set $Gen(x)$ defined as in (1).
$$G_p(y \mid x) = \sum_{G \in T :\, G(x) = y} p(G) \tag{1}$$
It says that the probability $G_p(y \mid x)$ that the UR $x$ is mapped to the SR $y$ is the probability mass allocated by $p$ to the region $\{G \in T \mid G(x) = y\}$ of the typology $T$ consisting of those categorical grammars which succeed on the mapping $(x, y)$.
We assume next that each categorical grammar in the typology $T$ returns a unique SR $y$ for each UR $x$. This assumption suffices to ensure that $G_p$ is indeed a probability distribution, namely that the sum of the probabilities $G_p(y \mid x)$ over the candidates $y$ in $Gen(x)$ is equal to 1, as shown in (2).
$$\sum_{y \in Gen(x)} G_p(y \mid x) \overset{(a)}{=} \sum_{y \in Gen(x)} \; \sum_{G \in T :\, G(x) = y} p(G) \overset{(b)}{=} \sum_{G \in T} p(G) \overset{(c)}{=} 1 \tag{2}$$
In step (2a), we have used the definition (1) of $G_p(y \mid x)$. In step (2b), we have used the fact that every grammar in $T$ maps $x$ to a unique SR $y$, so that the sets $\{G \in T \mid G(x) = y\}$ partition the typology $T$ into disjoint sets as $y$ spans the candidate set $Gen(x)$. In step (2c), we have used the fact that $p$ is a probability mass function over $T$ and thus adds up to 1.
A family $P$ of probability mass functions $p_1, p_2, \ldots$ over the finite categorical typology $T$ thus induces a typology $\{G_{p_1}, G_{p_2}, \ldots\}$ of stochastic grammars. It is called the stochastic typology corresponding to the categorical typology $T$ and the probability family $P$, and it is denoted by $T_P$. We denote by $\xrightarrow{T}$ the T-order relative to the categorical typology $T$ in the sense of definition 1 and by $\xrightarrow{T_P}$ the T-order relative to the stochastic typology $T_P$ in the sense of definition 2. We want to investigate the relationship between these categorical and stochastic T-orders.
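The construction of a stochastic grammar from a probability mass function over a finite categorical typology can be sketched in a few lines of Python. The three toy grammars and the masses 1/2, 1/3, 1/6 are assumptions chosen for illustration, and `stochastic_grammar` is a hypothetical helper name.

```python
from fractions import Fraction

# Hypothetical finite categorical typology: each grammar maps a UR to one SR.
typology = [
    {"/cost us/": "[cost.us]", "/cost me/": "[cost.me]"},
    {"/cost us/": "[cost.us]", "/cost me/": "[cos.me]"},
    {"/cost us/": "[cos.us]", "/cost me/": "[cos.me]"},
]

# A probability mass function p over the typology (illustrative values).
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

def stochastic_grammar(typology, p, x):
    """Return G_p(. | x): each SR y gets the total mass of the grammars G with G(x) = y."""
    distribution = {}
    for grammar, mass in zip(typology, p):
        y = grammar[x]
        distribution[y] = distribution.get(y, Fraction(0)) + mass
    return distribution

for x in ("/cost us/", "/cost me/"):
    dist = stochastic_grammar(typology, p, x)
    # The masses sum to 1 because each grammar contributes to exactly one SR.
    print(x, dist, sum(dist.values()))
```

Exact fractions are used so that the check that the probabilities add up to 1 is not blurred by floating-point rounding.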

Relationship between categorical and stochastic T-orders
Let us suppose that the implication $(x, y) \xrightarrow{T} (\hat{x}, \hat{y})$ holds between an antecedent mapping $(x, y)$ and a consequent mapping $(\hat{x}, \hat{y})$ relative to a categorical typology $T$. By definition 1, this means that every categorical grammar $G$ in the typology $T$ that maps the antecedent UR $x$ to the antecedent SR $y$ (namely, $G(x) = y$) also maps the consequent UR $\hat{x}$ to the consequent SR $\hat{y}$ (namely, $G(\hat{x}) = \hat{y}$), yielding the inclusion (3).
$$\{G \in T \mid G(x) = y\} \subseteq \{G \in T \mid G(\hat{x}) = \hat{y}\} \tag{3}$$
By (1), this inclusion (3) entails that the probability assigned by $G_p$ to the consequent mapping $(\hat{x}, \hat{y})$ is at least as large as the probability assigned to the antecedent mapping $(x, y)$, as stated in (4). This entailment follows from the sheer fact that probabilities are monotonic relative to set inclusion. The entailment from the inclusion (3) to the inequality (4) thus holds under no assumptions whatsoever on the probability mass function $p$ used to define the stochastic grammar $G_p$.
$$\underbrace{G_p(y \mid x)}_{\text{probability of the antecedent mapping}} \leq \underbrace{G_p(\hat{y} \mid \hat{x})}_{\text{probability of the consequent mapping}} \tag{4}$$
The latter inequality (4) finally says that the implication $(x, y) \xrightarrow{T_P} (\hat{x}, \hat{y})$ holds also relative to the stochastic typology $T_P$ in the sense of definition 2. In conclusion, a categorical T-order always entails the corresponding stochastic T-order, no matter the shape of the family $P$ of probability mass functions used to derive the stochastic typology $T_P$ from the categorical typology $T$.
We now turn to the reverse entailment. Suppose that an implication $(x, y) \xrightarrow{T_P} (\hat{x}, \hat{y})$ holds between an antecedent mapping $(x, y)$ and a consequent mapping $(\hat{x}, \hat{y})$ relative to the stochastic typology $T_P$. By definition 2, this means in turn that the inequality (4) holds between the probabilities $G_p(y \mid x)$ and $G_p(\hat{y} \mid \hat{x})$ of the antecedent and the consequent mappings relative to any probability mass function $p$ in the family $P$. Suppose by contradiction that the corresponding implication $(x, y) \xrightarrow{T} (\hat{x}, \hat{y})$ relative to the original categorical typology $T$ instead fails. By definition 1, this means that the set inclusion (3) fails because there exists some grammar $G_0$ with the properties in (5): $G_0$ succeeds on the antecedent mapping, namely it maps $x$ to $y$; but $G_0$ fails on the consequent mapping, namely it maps $\hat{x}$ to some loser candidate $z$ different from the intended winner candidate $\hat{y}$.
$$G_0(x) = y, \qquad G_0(\hat{x}) = z \neq \hat{y} \tag{5}$$
We would like to derive a contradiction from the assumption (4) that the stochastic implication $(x, y) \xrightarrow{T_P} (\hat{x}, \hat{y})$ holds and the assumption (5) that the categorical implication $(x, y) \xrightarrow{T} (\hat{x}, \hat{y})$ fails. Yet, no contradiction arises in the general case. Indeed, suppose that the probability mass functions in the family $P$ all happen to assign zero (or tiny) probability mass to this grammar $G_0$ which flouts the categorical implication because of (5). This problematic grammar $G_0$ thus has no (or only a tiny) effect on the total probabilities $G_p(y \mid x)$ and $G_p(\hat{y} \mid \hat{x})$ of the two mappings $(x, y)$ and $(\hat{x}, \hat{y})$. The probability inequality (4) is therefore not necessarily compromised by the offending behavior (5) of $G_0$, as long as the other grammars in the typology comply.
In order to derive a contradiction from these two conditions (4) and (5), we need to make some assumptions on the family $P$ of probability mass functions. Indeed, the problem just discussed arises when every probability mass function $p$ in $P$ assigns zero (or tiny) probability to the problematic grammar $G_0$. We need to rule out this scenario. We propose to achieve that through the assumption that the family $P$ satisfies the following definition.
Definition 3 The family $P$ of probability mass functions over the finite categorical typology $T$ is sufficiently rich provided that, for every categorical grammar $G$ in $T$ and for any two URs $x$ and $\hat{x}$, the following inequalities (6) hold for some probability mass function $p$ in $P$. □
$$G_p(G(x) \mid x) > \tfrac{1}{2}, \qquad G_p(G(\hat{x}) \mid \hat{x}) > \tfrac{1}{2} \tag{6}$$
Here is the intuition behind this definition. Suppose that for every categorical grammar $G$, the family $P$ contains a probability mass function $p$ which assigns all the probability mass to that grammar $G$. By (1), the corresponding stochastic grammar $G_p$ assigns probability 1 to the mappings enforced by $G$, as stated in (7).
$$G_p(G(x) \mid x) = 1 \quad \text{for every UR } x \tag{7}$$
In other words, the stochastic grammar $G_p$ "coincides" with the categorical grammar $G$ and the stochastic typology $T_P$ thus "contains" or "extends" the original categorical typology $T$. In this special case, we obviously expect the stochastic implication $(x, y) \xrightarrow{T_P} (\hat{x}, \hat{y})$ to entail the categorical implication $(x, y) \xrightarrow{T} (\hat{x}, \hat{y})$, as desired. Condition (6) required by definition 3 is a weaker version of the latter condition (7). First, it is weaker because the requirement $G_p(G(x) \mid x) = 1$ is replaced with the weaker requirement $G_p(G(x) \mid x) > 1/2$: the probability assigned to the mappings enforced by $G$ need not be 1, as long as it is large enough, namely larger than 1/2. Second, this requirement $G_p(G(x) \mid x) > 1/2$ need not be satisfied by a single mass function $p$ for all URs: it suffices to look at just two URs at a time.
If the family $P$ is sufficiently rich in the sense of definition 3, the two conditions (4) and (5) are indeed contradictory. In fact, the richness condition (6), applied to the grammar $G_0$ in (5) and the two URs $x$ and $\hat{x}$, ensures that $P$ contains a probability mass function $p_0$ such that the corresponding stochastic grammar $G_{p_0}$ maps $x$ to $y$ with probability larger than 1/2 and maps $\hat{x}$ to $z$ with probability larger than 1/2. The latter fact means in turn that $G_{p_0}$ maps $\hat{x}$ to $\hat{y}$ with probability smaller than 1/2, because the probabilities of the various candidates $\hat{y}, z, \ldots$ in $Gen(\hat{x})$ must add up to 1. In conclusion, we have obtained $G_{p_0}(y \mid x) > 1/2$ and $G_{p_0}(\hat{y} \mid \hat{x}) < 1/2$, in blatant contradiction of (4).
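The role of richness can be illustrated with a two-grammar toy typology, sketched below in Python under invented assumptions: the grammars `G0` and `G1`, the labels of URs and SRs, and the two families of mass functions are all hypothetical. The categorical implication from (x, y) to (x_hat, y_hat) fails because of `G0`, yet the stochastic implication survives as long as every mass function starves `G0`.

```python
# G0 succeeds on the antecedent (x -> y) but fails on the consequent.
G0 = {"x": "y", "x_hat": "z"}
G1 = {"x": "w", "x_hat": "y_hat"}
typology = [G0, G1]

def mapping_probability(typology, p, ur, sr):
    """Total mass of the grammars that map ur to sr."""
    return sum(mass for g, mass in zip(typology, p) if g[ur] == sr)

def stochastic_implication(typology, family, antecedent, consequent):
    """Definition 2 over the stochastic typology induced by `family`."""
    (x, y), (x_hat, y_hat) = antecedent, consequent
    return all(
        mapping_probability(typology, p, x_hat, y_hat)
        >= mapping_probability(typology, p, x, y)
        for p in family
    )

antecedent, consequent = ("x", "y"), ("x_hat", "y_hat")

# Every mass function gives G0 zero or tiny mass: not sufficiently rich.
poor_family = [(0.0, 1.0), (0.1, 0.9)]
# Some mass function gives G0 more than half of the mass: sufficiently rich.
rich_family = [(0.6, 0.4), (0.1, 0.9)]

print(stochastic_implication(typology, poor_family, antecedent, consequent))
print(stochastic_implication(typology, rich_family, antecedent, consequent))
```

With the rich family, the mass function (0.6, 0.4) gives the antecedent probability 0.6 and the consequent probability 0.4, so the stochastic implication fails exactly as the categorical one does.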
The preceding reasoning is summarized in the following proposition 1, which says that the T-order relative to a categorical typology $T$ and the T-order relative to the corresponding stochastic typology $T_P$ coincide, no matter what the family $P$ of probability mass functions looks like, as long as it is sufficiently rich, in the sense of definition 3. Identity of T-orders holds even when the family $P$ is infinite, so that the stochastic typology $T_P$ contains an infinite number of stochastic grammars, while the categorical typology $T$ contains only a finite number of grammars.
Proposition 1 Consider a finite typology $T$ of categorical grammars and a family $P$ of probability mass functions on $T$. Let $T_P$ be the typology of the corresponding stochastic grammars, defined through (1). If $P$ is sufficiently rich in the sense of definition 3, the T-order $\xrightarrow{T}$ relative to the categorical typology $T$ and the T-order $\xrightarrow{T_P}$ relative to the stochastic typology $T_P$ coincide. □
In the rest of the paper, we apply this result to various categorical and stochastic frameworks for constraint-based phonology.

Categorical OT, partial order OT, and stochastic OT induce the same T-orders
In this section, we focus on categorical and stochastic OT. We assume a set of $n$ constraints $C_1, \ldots, C_k, \ldots, C_n$ and some candidacy relation $Gen$. We recall that a constraint $C_k$ prefers a mapping $(x, y)$ to another mapping $(x, z)$ provided $C_k$ assigns fewer violations to the former than to the latter, namely $C_k(x, y) < C_k(x, z)$. A constraint ranking $\gg$ is an arbitrary linear order over the constraint set. A constraint ranking $\gg$ prefers a mapping $(x, y)$ to another mapping $(x, z)$ provided the highest $\gg$-ranked constraint which distinguishes between the two mappings $(x, y)$ and $(x, z)$ prefers $(x, y)$. The categorical OT grammar corresponding to a ranking $\gg$ maps a UR $x$ to that SR $y$ such that $\gg$ prefers the mapping $(x, y)$ to the mapping $(x, z)$ corresponding to any other candidate $z$ in $Gen(x)$ (Prince and Smolensky, 2004). We denote by $\xrightarrow{\mathrm{OT}}$ the T-order corresponding to the typology $T$ of the categorical OT grammars corresponding to all constraint rankings, in the sense of definition 1.
To illustrate, consider the following three constraints (from Kiparsky, 1993) for the process of t/d deletion mentioned in section 1: $C_1$ = SYLLABLEWELLFORMEDNESS (SWF) penalizes codas and tautosyllabic consonant clusters; $C_2$ = ALIGN penalizes resyllabification across word boundaries; and $C_3$ = MAX penalizes segment deletion. Suppose that the UR /cost us/ comes with three candidate SRs, among them [cost.us] and the deletion candidate [cos.us]. The implication (/cost.us/, [cos.us]) $\xrightarrow{\mathrm{OT}}$ (/cost.me/, [cos.me]) holds relative to the OT typology generated by constraints $C_1, C_2, C_3$ in the sense of definition 1: every ranking of the three constraints which succeeds on the antecedent mapping (/cost.us/, [cos.us]) also succeeds on the consequent mapping (/cost.me/, [cos.me]). In other words, t/d deletion before a vowel entails deletion before a consonant.
We now turn to the stochastic counterpart of this categorical framework. A ranking vector $\theta = (\theta_1, \ldots, \theta_k, \ldots, \theta_n) \in \mathbb{R}^n$ assigns a numerical ranking value $\theta_k$ to each constraint $C_k$. The stochastic ranking vector $\theta + \epsilon = (\theta_1 + \epsilon_1, \ldots, \theta_n + \epsilon_n)$ is obtained by adding to the ranking values $\theta_1, \ldots, \theta_n$ some numbers $\epsilon_1, \ldots, \epsilon_n$ sampled independently from each other according to some distribution $D$ on $\mathbb{R}$. If the distribution $D$ is continuous, the probability that two stochastic ranking values $\theta_h + \epsilon_h$ and $\theta_k + \epsilon_k$ coincide is equal to zero. The stochastic ranking vector $\theta + \epsilon$ thus describes the unique ranking $\gg_{\theta+\epsilon}$ which respects the relative size of the stochastic ranking values: a constraint $C_h$ is ranked above a constraint $C_k$ according to $\gg_{\theta+\epsilon}$ (namely, $C_h \gg_{\theta+\epsilon} C_k$) if and only if the stochastic ranking value of the former is larger than that of the latter (namely, $\theta_h + \epsilon_h > \theta_k + \epsilon_k$). A ranking vector $\theta$ thus induces the probability mass function $p^D_\theta$ defined in (8).
$$p^D_\theta(G) = \Pr_{\epsilon_1, \ldots, \epsilon_n \sim D}\big[\text{the OT grammar corresponding to the ranking } \gg_{\theta+\epsilon} \text{ is indeed } G\big] \tag{8}$$
The typology of stochastic grammars $T_P$ obtained as in section 2 from the categorical OT typology $T$ and the family $P = \{p^D_\theta \mid \theta \in \mathbb{R}^n\}$ of probability mass functions $p^D_\theta$ corresponding to all ranking vectors $\theta$ is called stochastic OT (SOT; Boersma, 1997, 1998). We denote by $\xrightarrow{\mathrm{SOT}}$ the T-orders corresponding to SOT in the sense of definition 2. What is the typological structure encoded by SOT's T-orders? Given that the original OT typology is finite (because there are only a finite number of constraint rankings) while the SOT typology is infinite (it contains an infinite number of grammars which assign different probabilities), how much of OT's typological structure is preserved in SOT? These questions are crucial for phonological theory but technically nontrivial.
To illustrate, figure 1 plots the SOT probability of the mappings (/cost.us/, [cos.us]) and (/cost.me/, [cos.me]) relative to the three constraints $C_1, C_2, C_3$ listed above as a function of the ranking value $\theta_1$ of constraint $C_1$ (horizontal axis) and the ranking value $\theta_2$ of constraint $C_2$ (vertical axis) for three choices of the ranking value $\theta_3$ of constraint $C_3$. These plots suggest that the implication (/cost.us/, [cos.us]) $\xrightarrow{\mathrm{SOT}}$ (/cost.me/, [cos.me]) holds in SOT: the probability of the consequent (/cost.me/, [cos.me]) (plotted in the bottom row) seems to be always larger than the probability of the antecedent (/cost.us/, [cos.us]) (plotted in the top row). But how can this conjecture be checked, given that SOT probabilities seem not to admit a closed-form expression?
The result obtained in section 3 provides a straightforward solution to this problem. Suppose that there exists a positive constant $\Delta$ large enough that the distribution $D$ concentrates most of the probability mass on the interval $[-\Delta, +\Delta]$, as stated in (9). This assumption holds in particular when $D$ has a bounded support or is defined through a density (such as a Gaussian, as assumed in Boersma, 1997, 1998). For any constraint ranking $\gg$, consider a ranking vector $\theta$ such that the top $\gg$-ranked constraint has the largest ranking value; the second top-ranked constraint has the second largest ranking value; and so on. Furthermore, assume that these ranking values are spaced apart by more than $2\Delta$, so that noise of size at most $\Delta$ on each ranking value cannot reverse the order of any two constraints. Since the numbers $\epsilon_1, \ldots, \epsilon_n$ are all bounded between $-\Delta$ and $+\Delta$ with probability at least 1/2 and since the ranking values are spaced apart by more than $2\Delta$, the constraint ranking $\gg_{\theta+\epsilon}$ corresponding to the stochastic ranking vector $\theta + \epsilon$ coincides with the original ranking $\gg$ with probability at least 1/2. In other words, the probability mass function $p^D_\theta$ corresponding to this ranking vector $\theta$ according to (8) assigns more than half of the probability mass to the OT grammar corresponding to the ranking $\gg$. The family $P = \{p^D_\theta \mid \theta \in \mathbb{R}^n\}$ is therefore sufficiently rich in the sense of definition 3. Proposition 1 thus yields the following corollary.
Corollary 1 Under the mild assumption (9) on the distribution $D$, the T-order $\xrightarrow{\mathrm{SOT}}$ relative to SOT is identical to the T-order $\xrightarrow{\mathrm{OT}}$ relative to categorical OT for any constraint and candidate set. □
In conclusion, despite the SOT typology being infinite, SOT induces the same typological structure as categorical OT, at least when typological structure is measured in terms of T-orders. Furthermore, the technical problem of computing T-orders relative to SOT is reduced to the much easier problem of computing T-orders relative to categorical OT, which indeed admits an efficient solution (Magri, 2018a).
This result extends to partial order OT (Anttila, 1997a), as the latter is a special case of SOT.
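Since SOT probabilities do not seem to admit a closed form, they can be estimated by sampling. The Python sketch below does this for a single UR with two candidates and two constraints; the violation profiles, the Gaussian noise with standard deviation `sigma`, and the function names are assumptions made for illustration, not the setup behind figure 1.

```python
import random

# Hypothetical violation profiles for one UR: (markedness, faithfulness).
CANDIDATES = {"faithful": (1, 0), "deleted": (0, 1)}

def ot_winner(ranking, candidates):
    """Winner under a ranking (constraint indices, top-ranked first):
    the candidate with the lexicographically smallest violation profile."""
    return min(candidates, key=lambda sr: tuple(candidates[sr][k] for k in ranking))

def sample_ranking(theta, sigma, rng):
    """Add independent Gaussian noise to the ranking values and sort the constraints."""
    noisy = [value + rng.gauss(0.0, sigma) for value in theta]
    return sorted(range(len(theta)), key=lambda k: noisy[k], reverse=True)

def sot_probability(theta, target, sigma=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of the probability that `target` wins under theta."""
    rng = random.Random(seed)
    hits = sum(
        ot_winner(sample_ranking(theta, sigma, rng), CANDIDATES) == target
        for _ in range(trials)
    )
    return hits / trials

# Equal ranking values: the two rankings are about equally likely.
print(sot_probability((0.0, 0.0), "deleted"))
# Widely spaced ranking values: the noise almost never reverses the ranking.
print(sot_probability((10.0, 0.0), "deleted"))
```

The second call mirrors the richness argument: once the ranking values are spaced far apart relative to the noise, nearly all of the probability mass falls on the single categorical grammar picked out by the ranking.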

Categorical HG and stochastic HG induce the same T-orders
This section shows that completely analogous considerations hold for HG. A weight vector $w = (w_1, \ldots, w_k, \ldots, w_n) \in \mathbb{R}^n_+$ assigns a nonnegative weight $w_k \geq 0$ to each constraint $C_k$. The $w$-harmony of a mapping $(x, y)$ is the weighted sum of the constraint violations multiplied by $-1$, namely $-\sum_{k=1}^{n} w_k C_k(x, y)$. Because of the minus sign, mappings with a large harmony have few constraint violations. The categorical HG grammar corresponding to a weight vector $w$ maps a UR $x$ to the surface form $y$ such that the mapping $(x, y)$ has a larger $w$-harmony than the mapping $(x, z)$ corresponding to any other candidate $z$ in $Gen(x)$ (Legendre et al., 1990; Smolensky and Legendre, 2006). We denote by $\xrightarrow{\mathrm{HG}}$ the T-order corresponding to the typology $T$ of the categorical HG grammars corresponding to all nonnegative weight vectors, in the sense of definition 1.
To illustrate, it is easy to verify that the implication (/cost.us/, [cos.us]) $\xrightarrow{\mathrm{HG}}$ (/cost.me/, [cos.me]) considered above holds also relative to the HG typology in the sense of definition 1: every weighting of the three constraints which succeeds on the antecedent mapping (/cost.us/, [cos.us]) also succeeds on the consequent mapping (/cost.me/, [cos.me]). In general, the HG typology is a proper superset of the OT typology (when the set of URs is finite). The HG T-order is therefore a subset of the corresponding OT T-order.
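The difference between weighting and ranking can be made concrete with a short Python sketch. The two candidates and their violation profiles over three constraints are hypothetical, as are the helper names; the example is only meant to show how harmony is computed and how lower-weighted constraints can add up in HG while they cannot in OT.

```python
def harmony(weights, violations):
    """w-harmony: minus the weighted sum of the constraint violations."""
    return -sum(w * c for w, c in zip(weights, violations))

def hg_winner(weights, candidates):
    """HG winner: the candidate with the largest harmony."""
    return max(candidates, key=lambda sr: harmony(weights, candidates[sr]))

def ot_winner(ranking, candidates):
    """OT winner under a ranking given as constraint indices, top-ranked first."""
    return min(candidates, key=lambda sr: tuple(candidates[sr][k] for k in ranking))

# Hypothetical violation profiles over three constraints.
candidates = {"a": (1, 0, 0), "b": (0, 1, 1)}

# The first constraint has the largest weight, yet the other two gang up on "b".
weights = (3.0, 2.0, 2.0)
print(harmony(weights, candidates["a"]), harmony(weights, candidates["b"]))
print(hg_winner(weights, candidates))

# Ranking the first constraint on top makes it decisive in OT.
print(ot_winner((0, 1, 2), candidates))
```

Under these weights candidate "a" has harmony -3 and "b" has harmony -4, so HG selects "a", whereas the OT ranking with the first constraint on top selects "b".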
We now turn to the stochastic counterpart of this categorical framework. The stochastic weight vector $w + \epsilon = (w_1 + \epsilon_1, \ldots, w_n + \epsilon_n)$ is obtained by adding to the weights $w_1, \ldots, w_n$ some numbers $\epsilon_1, \ldots, \epsilon_n$ sampled independently from each other according to some distribution $D$ on $\mathbb{R}$. A weight vector $w$ induces the corresponding probability mass function $p^D_w$ on the categorical HG typology $T$ defined in (10).
$$p^D_w(G) = \Pr_{\epsilon_1, \ldots, \epsilon_n \sim D}\big[\text{the HG grammar corresponding to the weight vector } w + \epsilon \text{ is } G\big] \tag{10}$$
Obviously, this definition yields a probability mass, namely the sum of the masses $p^D_w(G)$ over all the categorical grammars $G$ in the HG typology $T$ is equal to 1. The typology of stochastic grammars $T_P$ obtained as in section 2 from the categorical HG typology $T$ and the family $P = \{p^D_w \mid w \in \mathbb{R}^n_+\}$ of probability mass functions $p^D_w$ corresponding to all nonnegative weight vectors $w$ is called stochastic HG (SHG; Boersma and Pater, 2016). We denote by $\xrightarrow{\mathrm{SHG}}$ the T-orders corresponding to SHG in the sense of definition 2.
To illustrate, figure 2 plots the SHG probability of the mappings (/cost.us/, [cos.us]) and (/cost.me/, [cos.me]) as a function of the weights of the three constraints $C_1, C_2, C_3$ listed above. These plots suggest that the implication (/cost.us/, [cos.us]) $\xrightarrow{\mathrm{SHG}}$ (/cost.me/, [cos.me]) holds in SHG as well: the probability of the consequent (/cost.me/, [cos.me]) (plotted in the bottom row) seems to be always larger than the probability of the antecedent (/cost.us/, [cos.us]) (plotted in the top row). The result obtained in section 3 makes sense of this observation, as follows. We consider two URs $x$ and $\hat{x}$. We assume that $x$ comes with a finite number $m + 1$ of candidates $y, z_1, \ldots, z_m$ and that $\hat{x}$ comes with a finite number $\hat{m} + 1$ of candidates $\hat{y}, \hat{z}_1, \ldots, \hat{z}_{\hat{m}}$. This assumption is nonrestrictive. In fact, each UR admits only a finite number of optima in HG (Magri, 2018b). Candidate sets can thus be assumed to be finite without loss of generality. We consider a categorical HG grammar in the typology $T$ and assume that it maps $x$ and $\hat{x}$ to $y$ and $\hat{y}$, respectively. This means that any weight vector $w = (w_1, \ldots, w_n)$ corresponding to this HG grammar assigns a larger harmony to the winner mappings $(x, y)$ and $(\hat{x}, \hat{y})$ than to any of the loser mappings $(x, z_i)$ and $(\hat{x}, \hat{z}_j)$ respectively, as stated in (11).
Let $B$ be an upper bound on the constraint violation differences, so that $|C(x, z_i) - C(x, y)| \leq B$ and $|C(\hat{x}, \hat{z}_j) - C(\hat{x}, \hat{y})| \leq B$ for every constraint $C$ and for every $i = 1, \ldots, m$ and $j = 1, \ldots, \hat{m}$. Suppose again that there exists a positive constant $\Delta$ large enough that the distribution $D$ concentrates most of the probability mass on $[-\Delta, +\Delta]$, in the sense that it satisfies the inequality (9). We consider the weight vector $\lambda w = (\lambda w_1, \ldots, \lambda w_n)$ obtained by rescaling the weight vector $w$ by a positive scalar $\lambda > 0$ sufficiently large, in the sense of (12).
Whenever $\epsilon = (\epsilon_1, \ldots, \epsilon_n) \in [-\Delta, +\Delta]^n$, the HG grammar corresponding to the stochastic rescaled weight vector $\lambda w + \epsilon$ maps the UR $x$ to the SR $y$, as shown in (13). An analogous reasoning shows that it also maps $\hat{x}$ to $\hat{y}$. In step (13a), we have used the definition (11) of $\xi$. In step (13b), we have lower bounded $C(x, z_i) - C(x, y)$ with $-B$. In step (13c), we have used the definition (12) of $\lambda$.
The intuition behind this reasoning (13) is as follows. The rescaled weight vector $\lambda w$ generates the same HG grammar as the weight vector $w$. If $\lambda$ is large, the nonzero weights of the rescaled vector $\lambda w$ are very large (in absolute value). On the other hand, the stochastic values $\epsilon_1, \ldots, \epsilon_n$ are instead small (because bounded between $-\Delta$ and $+\Delta$) and therefore negligible relative to the rescaled weights. The original weight vector $w$ and the stochastic rescaled vector $\lambda w + \epsilon$ thus generate the same HG grammar.
Figure 3: Solid arrows are entailments that hold in OT, HG, SOT, SHG, and ME; dotted arrows are entailments that fail in ME.
In conclusion, the SHG grammar $G_{p^D_{\lambda w}}$ corresponding to the probability mass function $p^D_{\lambda w}$ (corresponding to the rescaled weight vector $\lambda w$) satisfies the inequalities $G_{p^D_{\lambda w}}(y \mid x) > 1/2$ and $G_{p^D_{\lambda w}}(\hat{y} \mid \hat{x}) > 1/2$. This shows that the family $P = \{p^D_w \mid w \in \mathbb{R}^n_+\}$ of probability masses is sufficiently rich in the sense of definition 3. Proposition 1 thus yields the following corollary.
Corollary 2 Under the mild assumption (9) on the distribution $D$, the T-order $\xrightarrow{\mathrm{SHG}}$ relative to SHG is identical to the T-order $\xrightarrow{\mathrm{HG}}$ relative to categorical HG for any constraint set and any candidate set which assigns a finite number of candidates to each UR (while the number of URs can be infinite). □
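The rescaling idea can be checked numerically on a toy case. In the Python sketch below, the two candidates, the weight vector, and the noise bound are invented for illustration. Because the harmony difference between two candidates is linear in the noise, it is enough to test the corners of the box of admissible noise vectors. The scale 3 used here exceeds n·B·Δ/ξ = 2 (with n = 2 constraints, B = 1, Δ = 1, and harmony margin ξ = 1), a sufficient bound derived for this sketch that may differ from the exact condition (12).

```python
from itertools import product

def hg_winner(weights, candidates):
    """HG winner: the candidate with the largest harmony under `weights`."""
    def harmony(sr):
        return -sum(w * c for w, c in zip(weights, candidates[sr]))
    return max(candidates, key=harmony)

# Hypothetical violation profiles over two constraints.
candidates = {"y": (0, 1), "z": (1, 0)}
w = (2.0, 1.0)   # under w, "y" beats "z" with harmony margin 1
delta = 1.0      # each noise component is assumed to lie in [-delta, +delta]

def stable_under_noise(scale):
    """True if "y" still wins under scale * w + eps at every corner of the noise box."""
    corners = product((-delta, delta), repeat=len(w))
    return all(
        hg_winner([scale * wk + e for wk, e in zip(w, eps)], candidates) == "y"
        for eps in corners
    )

print(hg_winner(w, candidates))
print(stable_under_noise(1.0))  # unscaled weights: some noise vector flips the winner
print(stable_under_noise(3.0))  # rescaled weights: the winner survives all bounded noise
```

With the unscaled weights, the noise vector (-1, +1) turns the weights into (1, 2) and hands the win to "z"; after rescaling by 3, the smallest possible harmony margin is still positive.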

Conclusions
Phonology has traditionally focused on patterns of categorical alternations modeled within categorical frameworks such as OT and HG. More recently, phonology has extended its empirical coverage to quantitative data such as gradient judgments and patterns of variation. This move has required a parallel extension from categorical to stochastic frameworks, such as partial order OT, stochastic OT, and stochastic HG. These stochastic frameworks are "extensions" of the original categorical frameworks in the sense discussed in section 3. One might thus expect the stochastic frameworks to be typologically less restrictive than the original categorical frameworks. This paper has shown that this is not the case, at least when typological restrictiveness is measured in terms of the most basic implicational universals, namely T-orders. Indeed, the T-orders induced by partial order and stochastic OT coincide with those induced by categorical OT. Analogously, the T-orders induced by stochastic HG coincide with those induced by categorical HG.
As discussed in a companion paper (Anttila and Magri, 2018), the situation is very different in ME. To illustrate, consider the basic syllable system of Prince and Smolensky (2004). The set of forms consists of the four syllable types CV, CVC, V, and VC. Each of them is a candidate for each of the others. The constraint set consists of the four constraints ONSET, NOCODA, MAX, and DEP. The HG and OT T-orders coincide and consist of 16 entailments with a feasible antecedent, plotted in figure 3. These entailments extend to SOT and SHG, by virtue of corollaries 1 and 2 obtained above. ME instead misses the eight dotted entailments. Of the eight entailments which do survive in ME, seven are such that the antecedent and the consequent surface forms coincide; the eighth is the entailment (VC, VC) → (CV, CV), which is a quirk due to the fact that VC is the most marked syllable type. This restriction to entailments whose antecedent and consequent surface forms coincide is not phonologically plausible. Anttila and Magri (2018) conclude that the ME formalism imposes typological restrictions at odds with phonological intuition.
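The categorical OT T-order for the basic syllable system can be computed by brute force over all 24 rankings, as in the Python sketch below. The violation profiles are a simplified encoding assumed for this sketch (each form is reduced to whether it has an onset and a coda, and MAX and DEP count deleted and inserted consonants); they are not taken from the source, and the names are hypothetical.

```python
from itertools import permutations

# Each syllable type encoded as (has onset, has coda); assumed encoding.
FORMS = {"CV": (1, 0), "CVC": (1, 1), "V": (0, 0), "VC": (0, 1)}

def violations(ur, sr):
    """Violations of (ONSET, NOCODA, MAX, DEP) for the mapping (ur, sr)."""
    (on, co), (on2, co2) = FORMS[ur], FORMS[sr]
    onset = 1 - on2                                   # SR lacks an onset
    nocoda = co2                                      # SR has a coda
    max_ = max(on - on2, 0) + max(co - co2, 0)        # deleted consonants
    dep = max(on2 - on, 0) + max(co2 - co, 0)         # inserted consonants
    return (onset, nocoda, max_, dep)

def ot_grammar(ranking):
    """Map every UR to its OT winner under a ranking (constraint indices, top first)."""
    return {
        ur: min(FORMS, key=lambda sr: tuple(violations(ur, sr)[k] for k in ranking))
        for ur in FORMS
    }

typology = [ot_grammar(ranking) for ranking in permutations(range(4))]

# Feasible mappings: those enforced by at least one grammar in the typology.
feasible = sorted({(ur, grammar[ur]) for grammar in typology for ur in FORMS})

def entails(antecedent, consequent):
    """Definition 1 over the enumerated OT typology."""
    (x, y), (x_hat, y_hat) = antecedent, consequent
    return all(g[x_hat] == y_hat for g in typology if g[x] == y)

t_order = [(a, c) for a in feasible for c in feasible if a != c and entails(a, c)]
print(len(feasible), len(t_order))
```

Under this encoding the enumeration finds 9 feasible mappings and 16 entailments between distinct mappings, in agreement with the count reported in the text, and the list includes the entailment from (VC, VC) to (CV, CV).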