Equiprobable mappings in weighted constraint grammars

We show that MaxEnt is so rich that it can distinguish between any two different mappings: there always exists a nonnegative weight vector which assigns them different MaxEnt probabilities. Stochastic HG instead does admit equiprobable mappings and we give a complete formal characterization of them.


Introduction
This paper compares two frameworks for probabilistic constraint-based phonology: Stochastic Harmonic Grammar (SHG; Boersma and Pater, 2016)[1] and Maximum Entropy (ME; Goldwater and Johnson, 2003; Hayes and Wilson, 2008). Recent literature has documented a few realistic quantitative patterns which seem to admit a better fit in ME than in SHG (Smith and Pater, 2017; Zuraw and Hayes, 2017; Hayes, 2017). These findings suggest that ME is a richer probabilistic framework than SHG (relative to the same constraint set). But how much richer? Can these anecdotal observations reported in the literature be systematized into a principled formal comparison between SHG and ME probabilistic typologies? This paper is part of a larger project trying to address this question. In particular, this paper compares ME and SHG from the perspective of their equiprobable mappings, that is, phonological mappings which are always assigned the same probability and are therefore phonologically equivalent despite being distinguished by the constraint set.
[1] Boersma and Pater (2016) actually use the term "noisy HG" instead of "stochastic HG". We prefer "stochastic HG" to stress the complete analogy with Boersma's (1997; 1998) earlier framework of stochastic OT. Furthermore, we prefer to use "stochastic" to describe a property of the framework, reserving "noisy" to describe a property of the learning scenario (as opposed to noise-free).
Section 2 motivates this notion of equiprobability within phonological theory. Section 4 then shows that the ME typology is so rich that it admits no equiprobable mappings: for any two mappings distinguished by the constraints, there exists an ME grammar that distinguishes between them, namely assigns them different probabilities. This typological richness is peculiar to ME and does not extend to other implementations of probabilistic constraint-based phonology such as SHG. Indeed, Section 5 shows that the equiprobable SHG mappings are exactly those mappings which are indistinguishable by categorical Harmonic Grammars (HG; Legendre et al., 1990a,b; Smolensky and Legendre, 2006) and thus provides a complete characterization of SHG equiprobability.
These formal results are presented informally. A detailed proof of the ME result is provided in a final appendix. The proof of the SHG result is analogous and it is omitted for reasons of space (see the longer version of this paper available on the authors' websites). Our discussion rests on some earlier results on uniform SHG and ME probability inequalities from Anttila and Magri (2018), recalled in Section 3.
Is the richness of ME relative to SHG typologies an empirical advantage or a case of unmotivated overgeneration? Section 6 provides some preliminary evidence that the latter might be the case, by looking at the case of Finnish stress. We compute SHG equiprobable mappings using the formal characterization obtained in Section 5. We show that a large corpus of Finnish provides preliminary empirical support for these mappings indeed being equiprobable. Finally, we show that ME breaks up these equiprobabilities in a way that is phonologically counterintuitive.

Equiprobability
A typical phonological process applies uniformly to all forms that share some relevant property, but ignores the irrelevant ways in which they differ. For example, in Latin, stress targets heavy syllables, but ignores vowel quality; in English, aspiration targets voiceless stops, but ignores place of articulation; in Finnish, vowel harmony targets [±back], but ignores the number of syllables. This means that words with the same distribution of heavy and light syllables are stressed alike; voiceless stops are aspirated alike; and words of any length harmonize alike. These phonological equivalences are a key property of phonological systems.
Derivational phonology captures these equivalences straightforwardly: phonological rules are allowed to refer to only the shared property that defines a natural class, ignoring everything else. To illustrate, the Finnish vowel harmony rule can be simply written as V → [αback] / V[αback] C₀ __. This rule directly encodes the fact that harmony targets [±back] but ignores any other properties such as, say, the number of syllables. Thus, the monosyllabic /maa/ 'country' and the disyllabic /kaava/ 'formula' trigger back harmony on the suffix /-nä/ 'ESSIVE' in exactly the same way. In other words, they are equivalent for vowel harmony.
The situation is prima facie less obvious in constraint-based phonology. A candidate may contain multiple constraint violations, some relevant, some irrelevant, but all simultaneously visible and potentially interacting. Yet, categorical implementations of constraint-based phonology are well known to readily predict these desired phonological equivalences. To illustrate, consider an HG grammar for Finnish vowel harmony based on the constraints in Table 1, from Ringen and Heinämäki (1999). The back harmony mappings /maa-nä/ → [maana] and /kaava-nä/ → [kaavana] can be shown to be HG equivalent: no matter the weighting, no HG grammar succeeds on one but fails on the other.
How should phonological equivalence be extended from the categorical to the probabilistic setting? We submit that equiprobability provides an answer to this question. Recall that a probabilistic phonological grammar is a function which assigns to each underlying representation (UR) $x$ a probability distribution $P(y \mid x)$ over the corresponding set of candidate surface representations (SRs) $y$. We consider two mappings $(x, y)$ and $(\hat{x}, \hat{y})$ of the two URs $x, \hat{x}$ to the two SRs $y, \hat{y}$. We say that these two mappings are (uniformly) equiprobable provided there is no probabilistic grammar in the typology considered which assigns a different probability to those two mappings, namely such that $P(y \mid x) \neq P(\hat{y} \mid \hat{x})$.
To illustrate, the equivalence between the two mappings /maa-nä/ → [maana] and /kaava-nä/ → [kaavana] is captured in a probabilistic setting through the requirement that their probabilities P([maana] | /maa-nä/) and P([kaavana] | /kaava-nä/) always coincide. In other words, the probability of vowel harmony does not depend on the number of syllables.[2] As we will see in Section 5, two mappings are equivalent according to categorical HG if and only if they are equiprobable in SHG. This result suggests that equiprobability is indeed the right extension of the notion of phonological equivalence from the categorical to the probabilistic setting. Surprisingly, we will see in Section 4 that ME instead allows for no equiprobable mappings and thus fails to capture the notion of phonological equivalence.
[2] Note that this is quite different from the well-known case of Hungarian vowel harmony where suffixes show different degrees of back-front variation after stems with both back and neutral vowels depending on the number of neutral vowels; see, e.g., Hayes and Londe (2006), Hayes et al. (2009), and Zymet (2015). In our Finnish example, all the stem vowels are unambiguously back, yet our Proposition 1 below says that ME fails to guarantee that the suffix harmony is invariably back.

Formal background
Our characterization of ME and SHG equiprobability in Sections 4-5 rests on some results from Anttila and Magri (2018; A&M) recalled here.

HG
A weight vector $w = (w_1, \dots, w_n)$ assigns nonnegative weights $w_1, \dots, w_n \ge 0$ to $n$ underlying phonological constraints $C_1, \dots, C_n$. The phonological quality of a phonological mapping $(x, y)$ of a UR $x$ and a candidate SR $y$ is quantified by its harmony $H_w(x, y)$. This quantity is defined as the weighted sum of the constraint violations multiplied by $-1$, namely $H_w(x, y) = -\sum_{k=1}^{n} w_k C_k(x, y)$. Mappings with large harmony have small constraint violations. The HG grammar corresponding to a weight vector $w$ maps a UR $x$ to the candidate SR $y$ such that the mapping $(x, y)$ has a larger harmony than the mapping $(x, z)$ corresponding to any other candidate $z$ of $x$. In this case, we say that $y$ is the winner while any other candidate $z$ is a loser.
HG thus has an intrinsic comparative nature: absolute numbers of violations are irrelevant; what matters is only the comparison between the violations of the loser and those of the winner. To bring out this intuition, we define the difference vector $C(x, y, z)$ for a UR $x$, an intended winner candidate $y$, and an intended loser candidate $z$ as in (1). This vector has a component for each constraint $C_k$, defined as the number $C_k(x, z)$ of violations assigned by $C_k$ to the loser mapping $(x, z)$ minus the number $C_k(x, y)$ of violations assigned to the winner mapping $(x, y)$.
$$C(x, y, z) = \big(C_1(x, z) - C_1(x, y),\ \dots,\ C_n(x, z) - C_n(x, y)\big) \tag{1}$$
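As a minimal sketch of the definitions above, assuming violation profiles are given as lists of integer counts, harmony, the HG winner, and the difference vector in (1) could be computed as follows. The tableau and the function names are hypothetical illustrations and are not taken from A&M.

```python
from typing import Dict, List, Sequence


def harmony(weights: Sequence[float], violations: Sequence[int]) -> float:
    """Harmony H_w(x, y): minus the weighted sum of constraint violations."""
    return -sum(w * c for w, c in zip(weights, violations))


def difference_vector(winner: Sequence[int], loser: Sequence[int]) -> List[int]:
    """Difference vector C(x, y, z): loser violations minus winner violations."""
    return [l - w for w, l in zip(winner, loser)]


def hg_winner(weights: Sequence[float], candidates: Dict[str, Sequence[int]]) -> str:
    """Candidate with the largest harmony (ties are not handled specially)."""
    return max(candidates, key=lambda name: harmony(weights, candidates[name]))


# Hypothetical tableau: violation counts of two candidates on two constraints.
tableau = {"y": [0, 1], "z": [2, 0]}

c = difference_vector(tableau["y"], tableau["z"])
winner_equal = hg_winner([1.0, 1.0], tableau)   # harmonies: y = -1, z = -2
winner_skewed = hg_winner([1.0, 3.0], tableau)  # harmonies: y = -3, z = -2
```

In this toy case the candidate y wins exactly when the weighted sum of the components of the difference vector is positive, which is the comparative view of HG described above.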
SHG and ME are two probabilistic extensions of this underlying categorical HG model.

SHG
The SHG probability $P^{\mathrm{SHG}}_w(y \mid x)$ that a UR $x$ is mapped to a SR $y$ according to the weight vector $w$ is the probability of sampling $n$ numbers $\epsilon = (\epsilon_1, \dots, \epsilon_n)$ independently according to a distribution $D$ in such a way that the HG grammar corresponding to the weight vector $w + \epsilon = (w_1 + \epsilon_1, \dots, w_n + \epsilon_n)$ indeed maps $x$ to $y$. A&M prove the following Lemma 1 about uniform probability inequalities in SHG, namely inequalities which hold for every choice of the weight vector.
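The definition of SHG probability can be approximated by simulation. The sketch below is only an illustration under two assumptions that the text does not make: the distribution D is taken to be a zero-mean Gaussian, and perturbed weights that turn negative are used as they are. The function name, the sample size, and the toy violation profiles are hypothetical.

```python
import random
from typing import Sequence


def shg_probability(
    weights: Sequence[float],
    winner: Sequence[int],
    losers: Sequence[Sequence[int]],
    n_samples: int = 20000,
    sigma: float = 1.0,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the SHG probability of the winner.

    Assumption: the noise distribution D is Gaussian with mean 0 and
    standard deviation sigma; the text leaves D unspecified.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        noisy = [w + rng.gauss(0.0, sigma) for w in weights]
        # The winner beats a loser under the perturbed weights iff the
        # perturbed weights give a positive weighted sum of C(x, y, z).
        if all(
            sum(nw * (lz - wy) for nw, wy, lz in zip(noisy, winner, loser)) > 0
            for loser in losers
        ):
            wins += 1
    return wins / n_samples


# Hypothetical mapping with one loser and difference vector (2, -1).
p = shg_probability([1.0, 1.0], winner=[0, 1], losers=[[2, 0]])
```

The estimate depends on the choice of D and on the number of samples, so it should be read as an approximation of the quantity defined above rather than as its exact value.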
Lemma 1. Consider two mappings $(x, y)$ and $(\hat{x}, \hat{y})$. Assume that the UR $x$ comes with only a finite number $m$ of loser candidates $z_1, \dots, z_m$ (besides the winner candidate $y$) and that the mapping $(x, y)$ is possible in HG (namely, $y$ beats the losers $z_1, \dots, z_m$ relative to some nonnegative weight vector). The SHG probability inequality $P^{\mathrm{SHG}}_w(y \mid x) \le P^{\mathrm{SHG}}_w(\hat{y} \mid \hat{x})$ holds uniformly for every choice of the nonnegative weight vector $w$ if and only if for every loser candidate $\hat{z}$ of the UR $\hat{x}$, there exist $m$ nonnegative coefficients $\lambda_1, \dots, \lambda_m \ge 0$ (one for each loser candidate $z_1, \dots, z_m$ of the UR $x$) such that
$$C(\hat{x}, \hat{y}, \hat{z}) \ge \sum_{i=1}^{m} \lambda_i \, C(x, y, z_i) \tag{2}$$
namely the difference vector $C(\hat{x}, \hat{y}, \hat{z})$ is at least as large (constraint by constraint) as the sum of the difference vectors $C(x, y, z_i)$ each rescaled by a corresponding nonnegative coefficient $\lambda_i$.[3] □
[3] The two assumptions made by the lemma (that the UR $x$ comes with only a finite number of losers and that the mapping $(x, y)$ is possible in HG) are non-restrictive. In fact, if a mapping $(x, y)$ is impossible in HG, then its SHG probability $P^{\mathrm{SHG}}_w(y \mid x)$ can be shown to be equal to zero for every choice of the nonnegative weight vector $w$. The probability inequality $P^{\mathrm{SHG}}_w(y \mid x) \le P^{\mathrm{SHG}}_w(\hat{y} \mid \hat{x})$ thus holds uniformly, because its left hand side is always equal to zero. The assumption made by the lemma that the mapping $(x, y)$ is possible in HG is therefore non-restrictive. Furthermore, HG has the property that only a finite number of candidates of any given UR win according to some weights (Magri, 2019). All other candidates are redundant because they are impossible no matter how the weights are chosen. Since HG impossible mappings have zero SHG probability, the candidate set of any underlying form can always be assumed to be finite without loss of generality in SHG. The assumption made by the lemma that the UR $x$ comes with only a finite number of losers is therefore non-restrictive.
Lemma 1 admits the following geometric interpretation, which will be used below. Suppose there are only $n = 2$ constraints and $m = 4$ losers $z_i$. The difference vectors $C(x, y, z_i)$ which appear on the right hand side of (2) can therefore be represented as the four black dots in Fig. 1. The region $\{\sum_{i=1}^{m} \lambda_i C(x, y, z_i) : \lambda_1, \dots, \lambda_m \ge 0\}$ is the convex cone generated by these four difference vectors $C(x, y, z_i)$, depicted in dark gray in Fig. 1a. The region in light gray singles out the points which are at least as large as some point in this cone. Condition (2) thus says that the difference vector $C(\hat{x}, \hat{y}, \hat{z})$ belongs to this light gray region.

ME
The ME probability $P^{\mathrm{ME}}_w(y \mid x)$ that a UR $x$ is mapped to a SR $y$ according to a nonnegative weight vector $w$ is the exponential of the harmony $H_w(x, y)$ of that mapping, normalized through a constant $Z = Z(w, x)$, namely $P^{\mathrm{ME}}_w(y \mid x) = e^{H_w(x, y)} / Z$. A&M show that also in ME uniform probability inequalities can be characterized in terms of difference vectors, as stated by Lemma 2 below. This ME Lemma is analogous to the SHG Lemma 1 above, but for two differences. The first difference is that condition (2) is only necessary in ME while it is also sufficient in SHG. The second difference is that ME requires the normalization condition (3) on the coefficients $\lambda_i$.
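The ME definition can be sketched in a few lines, assuming that the normalization constant Z sums the exponentiated harmonies over a finite candidate set. The function name and the toy tableau are hypothetical; subtracting the largest harmony before exponentiating is only a numerical precaution and does not change the resulting probabilities.

```python
import math
from typing import Dict, Sequence


def me_probabilities(
    weights: Sequence[float], candidates: Dict[str, Sequence[int]]
) -> Dict[str, float]:
    """ME probability of every candidate: exp(harmony) divided by Z."""
    harmonies = {
        name: -sum(w * c for w, c in zip(weights, violations))
        for name, violations in candidates.items()
    }
    top = max(harmonies.values())  # shift for numerical stability
    exps = {name: math.exp(h - top) for name, h in harmonies.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}


# Hypothetical tableau: violation counts of two candidates on two constraints.
probs = me_probabilities([1.0, 1.0], {"y": [0, 1], "z": [2, 0]})
```

With a single loser, the probability of the winner depends on the weights only through the weighted sum of the components of the difference vector, which is why difference vectors are the natural objects in the lemmas.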
Lemma 2. Consider two mappings $(x, y)$ and $(\hat{x}, \hat{y})$. Assume that the UR $x$ comes with a finite number $m$ of loser candidates $z_1, \dots, z_m$ (besides the winner candidate $y$). If the ME probability inequality $P^{\mathrm{ME}}_w(y \mid x) \le P^{\mathrm{ME}}_w(\hat{y} \mid \hat{x})$ holds uniformly for every choice of the nonnegative weight vector $w$, then for every loser candidate $\hat{z}$ of the UR $\hat{x}$, there exist $m$ nonnegative coefficients $\lambda_1, \dots, \lambda_m \ge 0$ (one for each loser candidate $z_1, \dots, z_m$ of the UR $x$) which add up to 1,
$$\sum_{i=1}^{m} \lambda_i = 1 \tag{3}$$
and furthermore satisfy condition (2). □
The normalization condition (3) admits the following geometric interpretation. As seen above, the region $\{\sum_{i=1}^{m} \lambda_i C(x, y, z_i) : \lambda_1, \dots, \lambda_m \ge 0\}$ is the convex cone generated by the difference vectors $C(x, y, z_i)$, represented by the dark gray region in Fig. 1a. The smaller region $\{\sum_{i=1}^{m} \lambda_i C(x, y, z_i) : \lambda_1, \dots, \lambda_m \ge 0, \ \sum_{i=1}^{m} \lambda_i = 1\}$, obtained by imposing the normalization condition (3) on the coefficients $\lambda_i$, is instead the convex hull generated by the difference vectors $C(x, y, z_i)$, represented by the smaller dark gray region in Fig. 1b. The effect of the normalization condition (3) is thus to shrink from the larger convex cone to the smaller convex hull. Finally, the region in light gray in Fig. 1b singles out the points which are at least as large as some point in this convex hull. Lemma 2 thus requires the difference vector $C(\hat{x}, \hat{y}, \hat{z})$ to belong to this light gray region.
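The difference between the cone-based region of Lemma 1 and the hull-based region of Lemma 2 can be illustrated with a small checker. The sketch below only verifies a given choice of coefficients against condition (2), optionally together with the normalization condition (3); it does not search for coefficients in general. The function name, the two difference vectors, and the test point are hypothetical.

```python
from typing import Sequence


def satisfies_condition_2(
    c_hat: Sequence[float],
    diff_vectors: Sequence[Sequence[float]],
    lambdas: Sequence[float],
    normalized: bool = False,
    tol: float = 1e-9,
) -> bool:
    """Check condition (2) for given coefficients; with normalized=True,
    also require the coefficients to add up to 1 as in condition (3)."""
    if any(lam < -tol for lam in lambdas):
        return False
    if normalized and abs(sum(lambdas) - 1.0) > tol:
        return False
    combo = [
        sum(lam * c[k] for lam, c in zip(lambdas, diff_vectors))
        for k in range(len(c_hat))
    ]
    return all(ch >= cb - tol for ch, cb in zip(c_hat, combo))


# Hypothetical difference vectors of two losers, and a hypothetical test point.
D = [[2, -1], [-1, 2]]
c_hat = [0.2, 0.2]

# Coefficients (0.2, 0.2) certify membership in the cone-based region.
in_cone_region = satisfies_condition_2(c_hat, D, [0.2, 0.2])

# With two losers, normalized coefficients are (t, 1 - t); a coarse grid
# finds no certificate for the hull-based region.
grid = [i / 100 for i in range(101)]
in_hull_region = any(
    satisfies_condition_2(c_hat, D, [t, 1 - t], normalized=True) for t in grid
)
```

In this toy configuration the point lies in the light gray region built from the cone but not in the one built from the hull, which is the shrinking effect of the normalization condition described above.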

ME has no equiprobable mappings
Lemmas 1 and 2 say that ME differs from SHG because of the normalization condition (3). This apparently small technical difference has substantial phonological implications. Indeed, this Section shows that the normalization condition (3) makes the ME typology so rich that it can distinguish between any two mappings. In other words, equiprobability is impossible in ME. The reasoning is presented here informally, split up into three steps formalized in the final appendix.
Step 1. Let us suppose that the two mappings $(x, y)$ and $(\hat{x}, \hat{y})$ are equiprobable in ME, namely that the ME probability identity $P^{\mathrm{ME}}_w(y \mid x) = P^{\mathrm{ME}}_w(\hat{y} \mid \hat{x})$ holds for every choice of the nonnegative weight vector $w$. Let $z_1, \dots, z_m$ be the loser candidates of the UR $x$. They define a light gray region as in Fig. 1b, namely the region of points which are at least as large as the points in the convex hull generated by the difference vectors $C(x, y, z_i)$. Let us denote this light gray region as $\mathrm{LGR}_{\mathrm{ME}}(z_1, \dots, z_m)$. Analogously, let $\hat{z}_1, \dots, \hat{z}_{\hat{m}}$ be the loser candidates of the other UR $\hat{x}$. They as well define the light gray region of points which are at least as large as the points in the convex hull generated by the difference vectors $C(\hat{x}, \hat{y}, \hat{z}_j)$. Let us denote this light gray region as $\mathrm{LGR}_{\mathrm{ME}}(\hat{z}_1, \dots, \hat{z}_{\hat{m}})$.
The probability identity $P^{\mathrm{ME}}_w(y \mid x) = P^{\mathrm{ME}}_w(\hat{y} \mid \hat{x})$ is equivalent to the two opposite inequalities $P^{\mathrm{ME}}_w(y \mid x) \le P^{\mathrm{ME}}_w(\hat{y} \mid \hat{x})$ and $P^{\mathrm{ME}}_w(y \mid x) \ge P^{\mathrm{ME}}_w(\hat{y} \mid \hat{x})$. By Lemma 2 above, the former inequality requires each difference vector $C(\hat{x}, \hat{y}, \hat{z}_j)$ to belong to $\mathrm{LGR}_{\mathrm{ME}}(z_1, \dots, z_m)$. And the latter inequality requires each difference vector $C(x, y, z_i)$ to belong to $\mathrm{LGR}_{\mathrm{ME}}(\hat{z}_1, \dots, \hat{z}_{\hat{m}})$. A simple convexity argument deduces from these two facts the identity $\mathrm{LGR}_{\mathrm{ME}}(z_1, \dots, z_m) = \mathrm{LGR}_{\mathrm{ME}}(\hat{z}_1, \dots, \hat{z}_{\hat{m}})$ between the two light gray regions.
Step 2. To proceed, let us suppose for concreteness that $m = 4$ and that the light gray region $\mathrm{LGR}_{\mathrm{ME}}(z_1, z_2, z_3, z_4)$ is the one plotted in light gray in Fig. 1b. The difference vectors corresponding to the two losers $z_1$ and $z_2$ are extreme points (or vertices) of this light gray region, in the sense that they crucially contribute to shape it: if these two points were shifted even slightly in any direction, the corresponding light gray region would change. The identity between the two light gray regions established in step 1 thus entails that the two light gray regions share the same set of extreme points. In conclusion, the two difference vectors corresponding to losers $z_1$ and $z_2$ which are extreme points of the light gray region in Fig. 1b must be shared by the two equiprobable mappings considered. Since these difference vectors are shared by the two equiprobable mappings, they can be "peeled off" the two sides of the ME probability identity.
Figure 2: Steps 1-2 for the remaining losers $z_3$ and $z_4$.
Step 3 We are thus left with the difference vectors corresponding to the other two losers z 3 and z 4 in Fig. 1b. These latter two vectors are not extreme points of the original light gray region but rather sit in the interior of the light gray region. Indeed, they can be shifted around without affecting the shape of the light gray region. Yet, once the two losers z 1 and z 2 have been "peeled off" at step 2, we can repeat the reasoning in steps 1 and 2 ignoring the two losers z 1 and z 2 and instead considering only the other two losers z 3 and z 4 .
Thus, we construct the convex hull of the difference vectors corresponding to just these two remaining losers z 3 and z 4 . This convex hull is the segment which connects the two corresponding dots. Next, we construct the light gray region of points which are at least as large as some point in that segment, as depicted in Fig. 2. Now the difference vectors corresponding to the two losers z 3 and z 4 are extreme points of the new light gray region. We can therefore repeat the reasoning in steps 1-2 and conclude that these two difference vectors as well must be shared by the two equiprobable mappings considered. And so on.
The reasoning informally sketched above leads to the following Proposition 1, which is the first main result of this paper. It says that two mappings are equiprobable in ME if and only if they share all difference vectors. This entails in particular that the two mappings must have the same number of loser candidates. In other words, the ME typology is so rich that the only case where ME fails to come up with at least one weight vector which assigns different probabilities to the two mappings $(x, y)$ and $(\hat{x}, \hat{y})$ is when the two mappings are the same mapping, in the sense that they are indistinguishable by the constraints, as they have the same difference vectors.[4]
Proposition 1. Two mappings $(x, y)$ and $(\hat{x}, \hat{y})$ are equiprobable in ME if and only if the corresponding sets of difference vectors coincide. □
[4] To illustrate, suppose that the constraint set only consists of the two constraints NOVOICEDOBSTRUENT and IDENT(voice). The mappings $(x, y)$ = (/mab/, [map]) and $(\hat{x}, \hat{y})$ = (/bam/, [pam]) will always have the same ME probability, because they and their losers have the same constraint violation profiles.

SHG allows for equiprobable mappings
The preceding section has shown that ME is so rich that it can distinguish between any two different mappings. Crucially, this typological richness is peculiar to ME, not intrinsic to probabilistic constraint-based phonology. In this section, we illustrate this point with the case of SHG. As in the preceding section, the discussion is kept informal. The formalization rests on the same convex geometric tools used for ME in the final appendix. The details are omitted here for reasons of space (see the longer version of this paper available on the authors' website). Let us consider two mappings $(x, y)$ and $(\hat{x}, \hat{y})$. Again, let $z_1, \dots, z_m$ be the loser candidates of the UR $x$. They define a light gray region as in Fig. 1a, namely the region of points which are at least as large as the points in the convex cone generated by the difference vectors $C(x, y, z_i)$. Let us denote this light gray region as $\mathrm{LGR}_{\mathrm{SHG}}(z_1, \dots, z_m)$. This region is different from (and larger than) the light gray region $\mathrm{LGR}_{\mathrm{ME}}(z_1, \dots, z_m)$ considered above for ME, because the latter ME region is restricted through the normalization condition (3) and therefore defined in terms of convex hulls rather than convex cones. Analogously, let $\hat{z}_1, \dots, \hat{z}_{\hat{m}}$ be the loser candidates of the other UR $\hat{x}$ and let $\mathrm{LGR}_{\mathrm{SHG}}(\hat{z}_1, \dots, \hat{z}_{\hat{m}})$ be the corresponding SHG light gray region.
Again as in the case of ME, Lemma 1 says that the uniform SHG probability identity $P^{\mathrm{SHG}}_w(y \mid x) = P^{\mathrm{SHG}}_w(\hat{y} \mid \hat{x})$ entails that the two SHG light gray regions coincide, namely that $\mathrm{LGR}_{\mathrm{SHG}}(z_1, \dots, z_m) = \mathrm{LGR}_{\mathrm{SHG}}(\hat{z}_1, \dots, \hat{z}_{\hat{m}})$. Yet, these SHG light gray regions have different geometric properties than the ME light gray regions. As a result, in the case of SHG the identity between the two light gray regions tells us much less about the difference vectors that generate them than in the case of ME.
To see that concretely, let us consider for instance the SHG light gray region in Fig. 1a. The loser candidates $z_2$, $z_3$ and $z_4$ have difference vectors which sit in the interior of this light gray region. These losers thus contribute nothing to shape the light gray region: their difference vectors can be shifted around without affecting the shape of the region. Identity of the light gray regions thus tells us nothing about identity of these difference vectors which sit in the interior.
Interestingly, the loser candidates whose difference vectors sit in the interior of the SHG light gray region can be characterized phonologically as those losers which are HG redundant given the rest of the losers, in the sense that, for every nonnegative weight vector $w$, if the HG harmony of the winner $y$ is larger than that of the nonredundant losers, then it is in particular larger than the harmony of the redundant losers. In other words, these redundant losers carry no interesting phonological content as they do not in any way affect the weight vectors consistent with the mapping $(x, y)$.
The case of the loser $z_1$ in Fig. 1a is instead different. Its difference vector sits on the border of the light gray region and therefore contributes to its shape. Yet, its position is not completely determined by the shape of the region. In fact, the shape of the region is not affected if this difference vector is slid closer to or further away from the origin. Equivalently, the shape of the region is not affected if the difference vector corresponding to the nonredundant loser $z_1$ is rescaled by a nonnegative constant $\lambda \ge 0$. This means that the identity of the two SHG light gray regions does not entail identity of the difference vectors which generate them, not even for those difference vectors which sit on the boundary of the regions and therefore correspond to nonredundant losers. The identity of the two SHG light gray regions only entails that the difference vectors of the nonredundant losers are rescalings of one another. This informal reasoning leads to the following Proposition 2, which is our second main result.
Proposition 2. Two mappings $(x, y)$ and $(\hat{x}, \hat{y})$ are equiprobable in SHG if and only if each nonredundant difference vector $C(x, y, z_i)$ is a rescaling of some nonredundant difference vector $C(\hat{x}, \hat{y}, \hat{z}_j)$, namely $C(x, y, z_i) = \lambda C(\hat{x}, \hat{y}, \hat{z}_j)$ for some $\lambda \ge 0$; analogously, each nonredundant difference vector $C(\hat{x}, \hat{y}, \hat{z}_j)$ is a rescaling of some nonredundant difference vector $C(x, y, z_i)$. □
Interestingly, this characterization of SHG equiprobability coincides with the characterization of equivalence in categorical HG obtained by A&M. We conclude that two mappings are equiprobable in SHG (namely are always assigned the same probability) if and only if they are equivalent in categorical HG (namely no HG grammar succeeds on one but fails on the other).
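The contrast between the two propositions can be illustrated on the simplest possible case, two mappings with a single loser each whose difference vectors are rescalings of one another. The sketch below is a toy illustration under that assumption and is not a general test of either proposition. The helper names and the vectors are hypothetical, and the single-loser ME formula follows from the ME definition with two candidates.

```python
import math
from typing import Sequence


def dot(w: Sequence[float], c: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(w, c))


def is_rescaling(c: Sequence[float], c_hat: Sequence[float], tol: float = 1e-9) -> bool:
    """True if c = lam * c_hat for some lam >= 0."""
    pivot = next((k for k, v in enumerate(c_hat) if abs(v) > tol), None)
    if pivot is None:
        return all(abs(v) <= tol for v in c)
    lam = c[pivot] / c_hat[pivot]
    return lam >= 0 and all(abs(a - lam * b) <= tol for a, b in zip(c, c_hat))


def hg_succeeds(w: Sequence[float], c: Sequence[float]) -> bool:
    """With one loser, the winner wins iff the weighted sum of c is positive."""
    return dot(w, c) > 0


def me_probability_single_loser(w: Sequence[float], c: Sequence[float]) -> float:
    """ME probability of the winner when there is exactly one loser."""
    return 1.0 / (1.0 + math.exp(-dot(w, c)))


# Hypothetical difference vectors: c_hat is c rescaled by 2.
c, c_hat = [2, -1], [4, -2]

# Categorical HG treats the two mappings alike on a grid of weight vectors.
weight_grid = [(a / 2, b / 2) for a in range(7) for b in range(7)]
same_hg = all(hg_succeeds(w, c) == hg_succeeds(w, c_hat) for w in weight_grid)

# ME assigns them different probabilities at the weights (1, 1).
p = me_probability_single_loser([1.0, 1.0], c)
p_hat = me_probability_single_loser([1.0, 1.0], c_hat)
```

Since rescaling a difference vector by a positive constant never changes the sign of its weighted sum, every perturbed weight vector sampled in SHG treats the two mappings alike, whereas the ME probabilities depend on the magnitude of that sum.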

Equiprobability in Finnish stress
This section brings the preceding formal results to bear on Finnish word stress.
However, the skipping clause turns out to be a coarse approximation of the actual facts. Skipping is sometimes optional and we find variable stress in cases like pró.fes.so.rìl.la ∼ pró.fes.sò.ril.la 'professor-ADE', where the basic rule fails on the second variant. This optional pattern turns out to depend on two additional conditions that affect the outcome in a gradient manner (Anttila, 2012): (a) low vowels (/a, ä, o, ö/) attract stress and high vowels (/e, i, u, y/) repel stress; (b) stress is avoided next to a heavy syllable. In addition to native speaker intuitions about syllable prominence, empirical support for these soft conditions can be obtained from the optional rule of Stop Deletion (Keyser and Kiparsky, 1984), which deletes singleton stops in extrametrical syllables (Anttila, 2012). In particular, the /t/ in the partitive suffix /-tA/ is variably deleted or retained. Given the input /professori-i-tA/ 'professor-PL-PAR' we have two possible foot structures: (pró.fes.so)(rèi.ta), where /t/ falls inside a foot and is retained, vs. (pró.fes)(sò.re)ja, where /t/ falls outside a foot and is deleted. The metrical free variation is thus reflected in segmental free variation. This provides a valuable diagnostic for foot structure, especially because the segmental variation is present even in the written standard language readily available in large quantities. The constraints necessary for deriving the foot structure in Finnish nouns are shown in Table 2. These constraints were applied to 48 types of partitive plural nouns, systematically varying stem length, syllable weight, and vowel sonority. All in all, the test set contains 4 types of three-syllable stems, 12 types of four-syllable stems, and 32 types of five-syllable stems (stem types are briefly denoted as "(a), (b), . . . " in what follows).
Table 4: SHG's two red blocks are split into two chains of uniform inequalities in ME depending on the location of secondary stress feet.

SHG
We computed the uniform probability inequalities predicted by SHG for this Finnish stress test case using CoGeTo (Magri and Anttila, 2019), a suite of tools for studying constraint-based typologies of categorical and probabilistic phonological grammars based on their underlying rich convex geometry. The key observation is that SHG predicts seven blocks of equiprobable mappings, shown in Table 3. These blocks are furthermore organized into two chains of uniform probability inequalities. The predicted probabilities increase from left to right. The symbol "≤" between two boxes means that the candidates in the box on the left are predicted to have a probability at most as large as the candidates in the box on the right.
To evaluate the empirical accuracy of the equiprobabilities predicted by SHG, we examined Finnish /t/-deletion in a corpus of approximately 9 million nouns (tokens) harvested from Finnish internet sites on April 12, 2005. The percentages reported in Table 3 represent the token frequency of /t/-retention vs. /t/-deletion variants for each phonologically distinct stem type. The corpus data are consistent with the equiprobability prediction in five out of the seven blocks, namely those in black. These blocks turn out to be empirically nearly categorical, with almost all stems undergoing either /t/-deletion or /t/-retention, consistently with the equiprobability prediction. However, the two red blocks in Table 3 bundle together the stem types (c)-(f) despite their showing rather different empirical frequencies, providing prima facie evidence against SHG's equiprobability prediction. The stem types are illustrated by /symposiumi/ 'symposium', /polyamidi/ 'polyamide', /liirumlaarumi/ 'nonsense', and /inkunaabeli/ 'incunable'. The stems differ in the weight and quality of the preantepenultimate and antepenultimate syllables (heavy vs. light, [+low] vs. [−low]), which results in constraint violation differences, yet HG predicts that all four should undergo /t/-deletion/retention at identical rates. In order to reconcile SHG's equiprobability predictions with corpus frequencies, we make the following observations. First, the difference between types (d) /liirumlaarumi/ and (f) /inkunaabeli/ is not statistically significant (χ² = 2.9849, df = 1, p = 0.08404). Second, type (c) contains only two stems: /symposiumi/ 'symposium' and /imperiumi/ 'empire', both potentially syllabifiable as four-syllable stems, e.g., im.pe.ri.u.mi ∼ im.pe.riu.mi (Anttila and Shapiro, 2017), which is consistent with their unexpectedly high /t/-deletion rate.
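As a quick consistency check on the reported test, the p-value can be recomputed from the chi-square statistic alone, assuming an ordinary chi-square distribution with one degree of freedom. The underlying counts are not given here, so the sketch below only converts the reported statistic into a tail probability; the function name is hypothetical.

```python
import math


def chi2_sf_df1(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    With df = 1 the statistic is the square of a standard normal variable,
    so P(X > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


p_value = chi2_sf_df1(2.9849)  # statistic reported in the text
```

The resulting value agrees with the reported p = 0.08404 up to rounding, above the conventional 0.05 threshold.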
This leaves us with type (e) /polyamidi/ 'polyamide' (N = 69), again with an unexpectedly high deletion rate for which we have no plausible explanation. We conclude that by and large our Finnish corpus data support SHG's equiprobability predictions.

ME
One might wonder whether ME, with its ability to make fine-grained distinctions, might actually offer a more principled solution to the difficulties just discussed. This turns out not to be the case. On the retention side, ME predicts the uniform probability inequalities in the top row of Table 4. For example, the retention probability of /polyamidi/ is predicted to be at most as high as that of /liirumlaarumi/, no matter the choice of the weight vector. That seems initially promising: these inequalities are in fact exactly what we observe in the data. Puzzlingly, on the deletion side, ME reverses the probabilities, yielding the uniform probability inequalities in the bottom row of Table 4. For example, the deletion probability of /polyamidi/ is predicted to be at most as high as that of /liirumlaarumi/. This is exactly the opposite of what we observe in the data. We submit there is simply no way to reconcile ME's predictions with the corpus data. Such counterintuitive probability reversals appear in other blocks as well.

Summary and conclusions
We have shown that ME predicts typologies so rich that ME grammars can distinguish between any two different mappings and therefore admit no equiprobable mappings (Proposition 1). This richness does not extend to other implementations of probabilistic constraint-based phonology, such as SHG (Proposition 2), revealing a fundamental difference between the two frameworks.
We have then applied these results to the test case of Finnish word stress. Our corpus data provide preliminary evidence in favor of SHG's equiprobability predictions. In the two blocks where SHG appeared to run into problems, ME did not help refine the analysis empirically, but instead split the SHG equiprobable stem types apart in a counterintuitive fashion. Our study thus provides some preliminary empirical support in favor of SHG, which permits equiprobable mappings, against ME, which does not.

[...] probability inequality $P^{\mathrm{ME}}_w(\hat{x}, \hat{y}) \le P^{\mathrm{ME}}_w(x, y)$ requires the reverse inclusion $\mathrm{conv}(c_1, \dots, c_m) + \mathbb{R}^n_+ \subseteq \mathrm{conv}(\hat{c}_1, \dots, \hat{c}_{\hat{m}}) + \mathbb{R}^n_+$, yielding (5).
$$\underbrace{\mathrm{conv}(c_1, \dots, c_m) + \mathbb{R}^n_+}_{P} = \underbrace{\mathrm{conv}(\hat{c}_1, \dots, \hat{c}_{\hat{m}}) + \mathbb{R}^n_+}_{\hat{P}} \tag{5}$$
Step 2. This identity (5) says in particular that the two sets $P$ and $\hat{P}$ on its left and right hand side have the same set of extreme points, namely $\mathrm{ext}(P) = \mathrm{ext}(\hat{P})$. The set $\mathrm{ext}(P)$ of extreme points of the set $P$ is nonempty. In fact, a set which is closed, convex, nonempty, and does not contain a line admits at least one extreme point (Bertsekas, 2009, Proposition 2.1.2). Indeed, $P$ is closed, because $\mathrm{conv}(c_1, \dots, c_m)$ is compact, $\mathbb{R}^n_+$ is closed, and the sum of a compact set with a closed set is closed (Bertsekas, 2009, Section 1.3). Furthermore, $P$ is convex, because $\mathrm{conv}(c_1, \dots, c_m)$ and $\mathbb{R}^n_+$ are both convex and the sum of two convex sets is convex. Finally, $P$ is obviously nonempty and it does not contain a line.
The set $\mathrm{ext}(P)$ of extreme points of the set $P$ is a subset of the set of difference vectors $\{c_1, \dots, c_m\}$. In fact, the set of extreme points of the finitely generated polyhedron $\mathrm{conv}(c_1, \dots, c_m)$ is a subset of $\{c_1, \dots, c_m\}$ (by the Krein-Milman theorem). The set of extreme points of the pointed cone $\mathbb{R}^n_+$ only consists of the zero vector $0$. And the set $\mathrm{ext}(A + B)$ of extreme points of the vector sum $A + B$ of any two polyhedra $A$ and $B$ is a subset of the vector sum $\mathrm{ext}(A) + \mathrm{ext}(B)$ of the two sets $\mathrm{ext}(A)$ and $\mathrm{ext}(B)$ of extreme points of $A$ and $B$, namely $\mathrm{ext}(A + B) \subseteq \mathrm{ext}(A) + \mathrm{ext}(B)$ (Bertsimas and Tsitsiklis, 1997, exercise 2.22). Analogously, the set $\mathrm{ext}(\hat{P})$ of extreme points of the set $\hat{P}$ is a nonempty subset of the set $\{\hat{c}_1, \dots, \hat{c}_{\hat{m}}\}$.
Step 3. The terms on the left and the right hand side of the ME probability identity (4) which correspond to the shared difference vectors in $\Omega$ cancel out. The ME probability identity thus reduces to $\sum_{i=h+1}^{m} e^{-w^{T} c_i} = \sum_{j=h+1}^{\hat{m}} e^{-w^{T} \hat{c}_j}$, where the sums start at $h + 1$ rather than at 1. The claim follows by iterating the reasoning above, starting from the latter simplified ME probability identity.