How to Choose Successful Losers in Error-Driven Phonotactic Learning

An error-driven phonotactic learner is trained on a stream of licit phonological forms. Each piece of training data counts as a winner in the sense of Optimality Theory. In order to test its current grammar, the learner needs to compare the current winner with a properly chosen loser. This paper advocates a new subroutine for the choice of the loser, based on the idea of minimizing the “distance” from the given winner.

1 Error-driven phonotactic learning and the problem of the choice of the loser

Phonotactics is knowledge of the distinction between licit and illicit phonological forms (Chomsky and Halle, 1965). We adopt the model of phonotactics developed within Optimality Theory (Prince and Smolensky, 2004), briefly reviewed here. A candidate set is a collection of pairs (x, y): the first element x is called the underlying form; the second element y is called the candidate surface form. A constraint assigns to each candidate pair (x, y) a non-negative number of violations, which measures how the mapping of the underlying form x to the surface form y deviates from the ideal relative to the specific perspective valued by that constraint. Constraints come in two types. Faithfulness constraints punish a candidate pair (x, y) based on the discrepancy between the underlying form x and the surface form y. For instance, the faithfulness constraint IDENT[VOICE] is violated by a candidate pair of segments differing in voicing. Markedness constraints punish a candidate pair (x, y) based on the ill-formedness of the surface form y. For instance, the markedness constraint NOVOICEOBS is violated by a candidate pair of segments whose surface segment is a voiced obstruent. A constraint ranking ≻ is a linear order over the constraint set. The grammar G corresponding to the ranking ≻ takes an underlying form x and returns a candidate (x, y) which ≻-beats any other candidate (x, z) whose underlying form is x and whose surface form z is different from y, in the sense of condition (1). The candidate (x, y) is then called the winner, while (x, z) is called a loser. Losers are stricken out as a mnemonic.
(1) There exists a constraint which is winner-preferring (i.e., assigns fewer violations to the winner candidate (x, y) than to the loser candidate (x, z)) and which is ≻-ranked above every constraint which is loser-preferring (i.e., assigns more violations to the winner candidate (x, y) than to the loser candidate (x, z)).
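As an illustration, condition (1) can be checked mechanically: scanning the constraints from the top of the ranking, the first constraint that distinguishes the two candidates decides the comparison. The following sketch is one way to implement it; the constraint names and violation profiles in the example are hypothetical.

```python
def beats(winner_viols, loser_viols, ranking):
    """Condition (1): the winner beats the loser iff the highest-ranked
    constraint distinguishing the two candidates prefers the winner.

    winner_viols, loser_viols: dicts from constraint name to violations;
    ranking: list of constraint names, highest-ranked first.
    """
    for c in ranking:  # scan from the top of the ranking down
        if winner_viols[c] < loser_viols[c]:
            return True   # a winner-preferring constraint decides first
        if winner_viols[c] > loser_viols[c]:
            return False  # a loser-preferring constraint decides first
    return False          # no constraint distinguishes the two candidates

# Hypothetical profiles: the faithful winner violates NoVoiceObs,
# while the devoiced loser violates Ident[Voice].
w = {"NoVoiceObs": 1, "Ident[Voice]": 0}
l = {"NoVoiceObs": 0, "Ident[Voice]": 1}
print(beats(w, l, ["Ident[Voice]", "NoVoiceObs"]))  # True
print(beats(w, l, ["NoVoiceObs", "Ident[Voice]"]))  # False
```

The early return on the first distinguishing constraint is exactly the existential-plus-domination structure of (1): a winner-preferrer found before any loser-preferrer is a winner-preferrer ranked above all loser-preferrers.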
A surface form y is phonotactically licit according to the grammar G provided there exists some underlying form x which is mapped to y by that grammar. An error-driven phonotactic learner is trained on a sequence of phonological forms, all phonotactically licit according to a target grammar, and it tries to infer that grammar as follows. It maintains a current hypothesis of the target grammar, which is initialized to a most restrictive grammar, namely one which deems illicit as many forms as possible. The current grammar is then slightly updated in the direction of a looser phonotactics whenever it incorrectly predicts the current piece of data to be illicit. This learning scheme is formalized in OT as the error-driven ranking algorithm (EDRA) outlined in the following pseudo-code and detailed below (Tesar and Smolensky, 1998; Boersma, 1998).

1: initialize the current ranking vector θ
2: repeat
3:   receive a licit surface form y, the current winner form
4:   reconstruct a corresponding underlying form x
5:   choose a loser surface form z
6:   if the winner (x, y) does not beat the loser (x, z) according to the ranking vector θ:
7:     update the current ranking vector θ
8: until no more errors are made at line 6

The EDRA knows the underlying constraint set C_1, ..., C_n. The current constraint ranking is represented by assigning to each constraint C_k a numerical ranking value θ_k, with the understanding that high ranking values correspond to high-ranked constraints. The ranking values are collected into a ranking vector θ = (θ_1, ..., θ_n). At line 1, the ranking values of the faithfulness and the markedness constraints are initialized to 0 and to a large positive constant θ > 0, respectively. Thus, the markedness constraints start out above the faithfulness ones, yielding a grammar which is phonotactically maximally restrictive (Smolensky, 1996).
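To make the loop concrete, here is a minimal self-contained sketch of an EDRA run on a hypothetical toy system: one markedness constraint (NoVoiceObs) and one faithfulness constraint (Ident[Voice]), two surface forms "ta" and "da" both licit in the target, the identity map at line 4, and the rival form as the loser at line 5. All names and violation profiles are assumptions for illustration; the update at line 7 follows the re-ranking rule discussed further below.

```python
FORMS = ["ta", "da"]

def violations(x, y):
    """Hypothetical violation profiles for the toy candidate (x, y)."""
    return {
        "NoVoiceObs": 1 if y == "da" else 0,    # markedness: no voiced obstruent
        "Ident[Voice]": 1 if x != y else 0,     # faithfulness: preserve voicing
    }

# line 1: markedness starts high, faithfulness starts at 0
theta = {"NoVoiceObs": 10.0, "Ident[Voice]": 0.0}

for step in range(100):                          # line 2: repeat
    y = FORMS[step % 2]                          # line 3: a licit surface form
    x = y                                        # line 4: identity map (idempotency)
    z = "ta" if y == "da" else "da"              # line 5: the rival form as loser
    vw, vl = violations(x, y), violations(x, z)
    W = [c for c in theta if vw[c] < vl[c]]      # winner-preferring constraints
    L = [c for c in theta if vw[c] > vl[c]]      # loser-preferring constraints
    # line 6: some winner-preferrer must outrank every loser-preferrer
    if L and (not W or max(theta[c] for c in W) <= max(theta[c] for c in L)):
        top_w = max(theta[c] for c in W) if W else None   # line 7: update
        for c in W:
            theta[c] += 1 / (len(W) + 1)         # calibrated promotion
        for c in L:
            if top_w is None or theta[c] >= top_w:
                theta[c] -= 1                    # demote undominated loser-preferrers

# Ident[Voice] has risen above NoVoiceObs: both [ta] and [da] are now licit.
print(theta["Ident[Voice]"] > theta["NoVoiceObs"])  # True
```

In this toy run, only the winner (/da/, [da]) against the loser (/da/, [ta]) triggers updates; after seven of them the faithfulness constraint overtakes the markedness constraint and the errors stop.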
At line 3, the EDRA is fed a piece of training data, consisting of a surface form y licit according to the target constraint ranking. No assumptions are made on the sequence of training data (e.g., no assumptions are made on the frequency with which various licit forms are fed to the learner). At line 4, the EDRA needs to reconstruct an underlying form x corresponding to the current winner surface form y. A common choice is to set x identical to y, under the assumption that the underlying OT typology is idempotent, namely it maps every phonotactically licit form faithfully into itself (Magri, 2015b). The proper definition of the subroutine for the choice of the loser form z at line 5 is the topic of this paper and will thus be discussed in detail below.
At line 6, the EDRA checks whether the current ranking vector θ satisfies condition (2), where W and L are the sets of winner- and loser-preferring constraints relative to the intended winner and loser candidates (x, y) and (x, z).
If condition (2) holds, any ranking ≻ which respects the current ranking vector θ (in the sense that C_h ≻ C_k whenever θ_h > θ_k) satisfies the OT condition (1), namely succeeds at making the intended winner (x, y) beat the intended loser (x, z) (Boersma, 2009). In this case, the learner has nothing to learn from the comparison between the current winner and loser forms.
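The display of condition (2) is not reproduced in this copy; in its standard form (assumed here), it requires some winner-preferring constraint to have a strictly larger ranking value than every loser-preferring constraint, which guarantees that every ranking respecting θ satisfies (1). A minimal sketch, with hypothetical constraint names:

```python
def consistent(W, L, theta):
    """Condition (2) in its standard form (an assumption, since the original
    display is not reproduced here): some winner-preferring constraint must
    have a strictly larger ranking value than every loser-preferring one.

    W, L: sets of winner-/loser-preferring constraint names;
    theta: dict from constraint name to current ranking value.
    """
    if not L:
        return True   # no constraint prefers the loser: nothing to learn
    if not W:
        return False  # no constraint can ever rescue the winner
    return max(theta[c] for c in W) > max(theta[c] for c in L)

theta = {"NoVoiceObs": 10.0, "Ident[Voice]": 0.0}
print(consistent({"Ident[Voice]"}, {"NoVoiceObs"}, theta))  # False: update needed
```

The strict inequality matters: two constraints with equal ranking values can be linearized in either order, so a tie does not guarantee that every ranking respecting θ satisfies (1).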
Failure of condition (2) instead suggests that the current ranking values of the loser-preferring (winner-preferring) constraints are too large (too small, respectively) and thus need to be updated at line 7. We assume the re-ranking rule (3) (Tesar and Smolensky, 1998; Boersma, 1998; Magri, 2012).
(3) a. Increase the ranking values of the w winner-preferring constraints by 1/(w+1);
b. decrease the ranking values of the undominated loser-preferring constraints by 1.

Each winner-preferring constraint is promoted by 1/(w+1), where w is the total number of winner-preferring constraints. The loser-preferring constraints are demoted by 1. Only those loser-preferring constraints that really need to be demoted are indeed demoted, namely those which are not currently ranked underneath a winner-preferring constraint and are therefore called undominated.
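The re-ranking rule (3) can be sketched as follows. The test for undominatedness compares each loser-preferrer against the highest winner-preferrer before the promotion, which is one reasonable reading of "not currently ranked underneath a winner-preferring constraint"; the constraint names in the example are hypothetical.

```python
def rerank(W, L, theta):
    """Re-ranking rule (3): promote every winner-preferrer by 1/(w+1) and
    demote every undominated loser-preferrer by 1 (mutates theta in place)."""
    w = len(W)
    top_w = max(theta[c] for c in W)      # highest winner-preferrer, pre-update
    for c in W:
        theta[c] += 1 / (w + 1)           # (3a): calibrated promotion
    for c in L:
        if theta[c] >= top_w:             # undominated loser-preferrer
            theta[c] -= 1                 # (3b): demotion

theta = {"M1": 10.0, "M2": 10.0, "F": 0.0}
rerank(W={"F"}, L={"M1"}, theta=theta)
print(theta)  # {'M1': 9.0, 'M2': 10.0, 'F': 0.5}
```

Note that M2, which does not distinguish the winner from the loser, is left untouched: only the constraints sorted into W and L by the current comparison are re-ranked.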
The only implementation detail which has been left open in this outline of the EDRA model is the proper definition of the subroutine for the choice of the loser at line 5. This is the topic of this paper.
2 Two test cases to evaluate subroutines for the choice of the current loser

The EDRA model is guaranteed to converge (under the assumption that the target grammar is idempotent): after a finite (small) number of iterations, it is not possible to sample from the set of target licit forms any surface form y which would force the learner to make an update in the if-loop at lines 6-8. Convergence holds irrespective of the subroutine for the choice of the current loser used at line 5. Suppose now that this subroutine satisfies the basic condition (4). This condition says that the EDRA never wastes data (Tesar and Smolensky, 1998): if there is an opportunity to learn something from the current winner (i.e., if there exists at least one loser which is able to trigger an update), the EDRA will not "waste" that opportunity (i.e., the chosen loser indeed triggers an update).
(4) The subroutine for the choice of the loser returns a loser which triggers an update at line 7, whenever such a loser exists.

This condition (4) ensures that, if a surface form y is licit according to the target grammar the EDRA has been trained on, then it is also licit according to any ranking ≻ which respects the final ranking vector θ^fin entertained by the EDRA at convergence (in the sense that C_h ≻ C_k whenever θ^fin_h > θ^fin_k). In other words, the EDRA succeeds at half of the learning problem: it has learned to recognize licit forms as such. The ranking learned by the EDRA could nonetheless deem licit too many forms. In other words, it could describe a phonotactics which, although consistent with the target one, is not sufficiently restrictive. Are there guarantees that the EDRA also learns to recognize illicit forms as such?
Consider a phonotactic pattern with the following property: there exists a subset of the markedness constraints which punish all and only the illicit forms. This phonotactic pattern can thus be analyzed in terms of a constraint ranking such as (5): the designated subset of markedness constraints holds sway at the top, while the remaining markedness constraints are silent at the bottom. The relative ranking of the faithfulness constraints sandwiched in between is irrelevant. We therefore refer to these phonotactic patterns as F-irrelevant. For instance, suppose that the constraint set contains a markedness constraint against voiced velar obstruents and a markedness constraint against dorsal fricatives. The velar inventory [g k G x], which only admits the voiceless velar stop (illicit segments are stricken out), follows by just letting those two markedness constraints hold sway at the top. The EDRA model described in section 1 has been shown to be restrictive when the target phonotactics is F-irrelevant (Magri, 2013a; Magri, 2014a; Magri, 2015c; Magri, 2015a). In other words, it succeeds at learning the target phonotactics. This success holds irrespective of the details of the phonological analysis (e.g., the content of the markedness and faithfulness constraints or any other properties of the target ranking, besides its being F-irrelevant). It also holds irrespective of how the current loser is chosen at line 5 of the pseudo-code, as long as condition (4) is respected. In order to tackle the issue of the proper definition of the subroutine for the choice of the loser at line 5, we thus need to look at the behavior of the EDRA model on phonotactic patterns which are not F-irrelevant. Let's briefly recall two examples of such phonotactic patterns (Magri and Kager, 2015).
Voicing is especially effortful at the velar place: because of the small oral volume behind the velar constriction, the supra-glottal pressure quickly equalizes with the sub-glottal pressure, hindering vocal-cord vibration (Ohala, 1983). Many attested velar inventories comply with phonetic markedness, namely have voiceless stops or fricatives without the voiced ones. Yet, UPSID (Maddieson, 1984) documents two inventories [g k G x] and [g k G x] (the first lacking the voiced stop and the voiceless fricative, the second lacking only the voiceless fricative) which are phonetically counterintuitive, as they admit the voiced fricative to the exclusion of the voiceless one. If we could posit a markedness constraint which punishes [x] to the exclusion of [G], these inventories could be generated by letting that markedness constraint hold sway at the top of the ranking (5). Yet, such a markedness constraint would be incompatible with the grounding hypothesis (Hayes and Steriade, 2004): [x] is not any worse than [G] from any phonetic perspective. Fortunately, the desired inventories can be generated in compliance with the grounding hypothesis whenever /G/ is harder to neutralize than /x/, so that the former surfaces to the exclusion of the latter. This neutralization pattern requires some faithfulness constraints (those which preserve /G/ from neutralizing) to be ranked above some other faithfulness constraints (those violated by the neutralization of /x/). In conclusion, the grounding hypothesis forces us to posit a crucial relative ranking among the faithfulness constraints. The learnability guarantees recalled above for F-irrelevant target rankings (5) thus do not apply in these cases.
Let's look closer at the inventory [g k G x], which lacks the voiced stop and the voiceless fricative. We analyze this inventory as follows: /x/ can be neutralized to [k] while preserving voicing, whereas /G/ cannot be neutralized while preserving voicing, because [g] is independently ruled out by a dedicated constraint. This intuition can be cashed out as follows. We assume velar place impermeability: only velar obstruents are candidates of the velar obstruents, as stated in (6a). The ranking (6b) then yields the target inventory [g k G x].
The inventory [g k G x] only lacks the voiceless fricative and thus differs from the inventory considered above only because the voiced velar stop [g] is now licit. We analyze this inventory as follows: /x/ can be neutralized to [h] preserving voicing, while /G/ cannot be neutralized while preserving voicing, because [H] is independently ruled out by a dedicated constraint. This intuition can be cashed out as follows. We assume place impermeability apart from the velar/glottal border: only the velar and glottal obstruents are candidates of the velar and glottal obstruents, as stated in (7a). The ranking (7b) then generates the target inventory [g k G x].
From the perspective of the phonological analysis, the assumption (6a) of velar place impermeability and the assumption (7a) of place impermeability apart from the velar/glottal border are not restrictive: they can be reinterpreted as the assumption that the faithfulness constraints for place features are high ranked. Yet, this interpretation introduces additional relative ranking conditions among the faithfulness constraints which would need to be carefully considered in the learnability analyses. To start from the simplest case, the learnability analyses developed in the rest of the paper explicitly adopt the restrictive assumptions (6a) and (7a) on candidacy.
The only detail in the description of the EDRA model which has been left open in section 1 concerns the proper definition of the subroutine for the choice of the loser at line 5. The rest of this paper tackles this issue using the test cases of the two velar inventories just discussed. According to the classical subroutine for the choice of the loser, the learner chooses as the current loser the candidate predicted to win by the current ranking values, more precisely, by an arbitrary ranking consistent with the current ranking values (Tesar and Smolensky, 1998; Magri, 2013b). This classical subroutine for the choice of the loser satisfies condition (4), namely it never wastes data: if there exists at least one loser which is able to trigger an update, the chosen loser can be shown to indeed trigger an update. Unfortunately, the classical subroutine for the choice of the loser leads to trouble when the EDRA model is trained on the inventory [g k G x]. Here is why. The comparison between the winner mapping (/G/, [G]) and the three loser mappings (/G/, [g]), (/G/, [k]), and (/G/, [x]) sorts the constraints into winner-/loser-preferring as represented in (8) with Elementary Ranking Conditions (Prince, 2002).
The loser [g] will (almost) never be chosen, because the constraint NOVOICEDSTOP is winner-preferring in the corresponding first ERC in (8), starts ranked at the top, and is never demoted (because it is never loser-preferring). The choice of the current loser thus effectively boils down to [k] and [x]. The markedness constraints start out above the faithfulness constraints, and that ranking configuration is preserved for a large initial portion of the run. Throughout that portion, the choice of the current loser is thus completely determined by the markedness constraints. Since [k] is unmarked, the classical subroutine for the choice of the loser always chooses [k]. Unfortunately, the corresponding second ERC in (8) does not lead to the target ranking (6b). We thus replace the classical subroutine with the new subroutine described by the following pseudo-code. Here, we consider an arbitrary underlying form x (while the EDRA model always chooses x equal to the winner form y). For a related proposal, see Riggle (2004).

Require: a current winner candidate (x, y):
1: construct the ERC matrix corresponding to the comparisons (x, y) ∼ (x, z) for all possible loser candidates z for the underlying form x;
2: split any ERC with multiple L's into multiple ERCs with a single L;
3: determine the smallest number n̄ such that there exists an ERC with n̄ winner-preferrers which is inconsistent with the current ranking values;
4: pick at random among the inconsistent ERCs with n̄ winner-preferrers.

Three remarks are in order. First,
the new subroutine satisfies condition (4): if there exists a loser which is able to prompt an update, the new subroutine will return one such loser, as it only searches among losers whose corresponding ERC is inconsistent with the current ranking values. Second, the new subroutine chooses among such losers the one(s) which minimize the "difference" between the winner and the loser, as measured in terms of the number of winner-preferring constraints which distinguish between them. Third, the new subroutine is computationally less expensive than the classical one, because it circumvents the computation of the predicted optimal candidate. When the new subroutine for the choice of the loser is deployed on the surface form y = [G] corresponding to the ERC block (8), it prevents the EDRA from choosing the loser [g], because the corresponding first ERC is already consistent with the current ranking values (through the high-ranked winner-preferring markedness constraint NOVOICEDSTOP). And it also prevents it from choosing the loser [k], because the corresponding second ERC has "too many" W's. The EDRA is thus biased towards choosing the loser [x]. The corresponding third ERC promotes IDENT[VOICE] but not IDENT[CONT], leading to the target ranking (6b). We thus obtain the following result.

Theorem 1 When trained on an arbitrary sequence of data sampled from the velar inventory [g k G x], the EDRA model with the new subroutine for the choice of the loser succeeds at learning the target ranking (6b).
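The new subroutine can be sketched as follows. The example reuses the y = [G] scenario of (8); the constraint names and ERC profiles are hypothetical reconstructions in the spirit of the discussion above, not the paper's actual tableau.

```python
import random

def choose_loser(ercs, theta):
    """New subroutine: among the ERCs inconsistent with the current ranking
    values, pick at random one with the fewest winner-preferrers.

    ercs: list of (loser_form, W, L) triples, one per candidate loser, with
    multiple-L ERCs assumed to have been split beforehand;
    theta: dict from constraint name to current ranking value.
    """
    def consistent(W, L):
        if not L:
            return True
        if not W:
            return False
        return max(theta[c] for c in W) > max(theta[c] for c in L)

    bad = [(z, W, L) for (z, W, L) in ercs if not consistent(W, L)]
    if not bad:
        return None                               # no loser triggers an update
    n_bar = min(len(W) for (_, W, _) in bad)      # smallest number of W's
    return random.choice([z for (z, W, _) in bad if len(W) == n_bar])

# Hypothetical ERC profiles for the winner (/G/, [G]), in the spirit of (8):
theta = {"NoVoicedStop": 100.0, "NoVoiObs": 99.0,
         "Ident[Voice]": 0.0, "Ident[Cont]": 0.0}
ercs = [
    ("g", {"NoVoicedStop", "Ident[Cont]"}, {"NoVoiObs"}),  # consistent: skipped
    ("k", {"Ident[Voice]", "Ident[Cont]"}, {"NoVoiObs"}),  # inconsistent, 2 W's
    ("x", {"Ident[Voice]"}, {"NoVoiObs"}),                 # inconsistent, 1 W
]
print(choose_loser(ercs, theta))  # "x": the closest loser that still teaches
```

The [g] row is filtered out because its ERC is already consistent (through the high-ranked NoVoicedStop), and [k] loses out to [x] because its ERC has two winner-preferrers instead of one, mirroring the bias described in the text.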

4 How to analyze the new subroutine
The preceding section has motivated a new subroutine for the choice of the current loser. This section highlights some formal properties of this new subroutine which turn out to be useful in the analysis of EDRA's restrictiveness. For concreteness, we focus on the velar inventory [g k G x], with the analysis in (7) and the corresponding ERC matrix (9).
Because of the new subroutine for the choice of the loser, the original ERC matrix (9) can effectively be simplified block-by-block as in (10). Let S be the first time when the current ranking vector entertained by the EDRA becomes consistent with either ERC 1 or ERC 6 in (10); convergence ensures that such a time S exists. Because of the new subroutine for the choice of the current loser, ERC 2 cannot trigger any update before time S, because ERC 2 belongs to the same block as ERC 1, because ERC 2 has more W's than ERC 1, and because the current ranking vector is never consistent with ERC 1 before time S. Analogously, ERCs 3, 4, 5, and 7 cannot trigger any update before time S. In other words, the run up to time S is determined by ERCs 1, 6, 8, and 9 alone. Consider next ERC 7. Since it has more W's than the other ERCs 3-6 in the same block, it cannot trigger any update until the current ranking values have become consistent with the other ERCs 3-6. In other words, if ERC 7 triggers updates at all in the run considered, it will start triggering updates only late in the run. Thus, let the time T be defined as follows. If ERC 7 triggers at least one update in the run considered, T is the smallest time such that ERC 7 triggers an update between times T and T + 1; if ERC 7 triggers no updates in the run considered, T is the final time of the run. Of course, S ≤ T (before time S, the current ranking values are still inconsistent with ERC 6 and the EDRA is thus forbidden to consider ERC 7, which has more W's). In conclusion, a generic run of the EDRA model on the current test case can be split into three stages, as in (11). This reasoning illustrates another formal property of the new subroutine for the choice of the loser: it makes the various ERCs enter the scene in stages, ordered by their complexity, namely by the number of winner-preferring constraints they contain.
This means in turn that the analysis of a generic run can be split into different stages, with an increasing number of ERCs active at each stage. This turns out to be very useful for the analysis of EDRA's restrictiveness. In fact, establishing restrictiveness requires a characterization of the final ranking vector entertained by the EDRA at convergence. To obtain that characterization, we start from the initial stage and work towards the end. For each stage, we characterize the ranking vectors the learner can end up with at the end of that stage. Obviously, we have to do that for each ranking vector the learner could end up with at the end of the preceding stage. This logic is illustrated in (12). Suppose that the analysis of the first stage (between the beginning of the run and time S) concludes that the EDRA can end up with one of two ranking vectors θ^S_1 and θ^S_2 at the time S when that stage ends. The analysis of the second stage (between times S and T) will then have to be repeated twice, once for each of the two ranking vectors viable at time S. And so on. These considerations suggest that we aim for particularly tight analyses of the ranking vectors entertained at the end of the initial stages, in order to avoid a combinatorial explosion of the analyses required at later stages. Of course, tight analyses are readily possible when only a few training ERCs can trigger updates and thus mold the current ranking vector. As we increase the number of training ERCs which trigger updates, the analysis becomes more involved, and the characterization of the stage-final ranking vectors becomes looser. As illustrated in (11), the new subroutine for the choice of the loser thus comes in very handy for the analysis of restrictiveness, as it ensures that the EDRA is trained on few ERCs at the beginning of the run, with additional ERCs entering the scene only at later stages.
The final appendix makes these considerations concrete through a detailed analysis of the behavior of the EDRA model with the new subroutine for the choice of the loser trained on the ERC matrix (10) corresponding to the inventory [g k G x P H h]. The resulting analysis establishes the following result.
Theorem 2 When trained on an arbitrary sequence of data sampled from the glottal/velar inventory [g k G x P H h], the EDRA model with the new subroutine for the choice of the loser succeeds at learning the ranking (7b).

Conclusion
This paper has motivated a new subroutine for the choice of the current loser in phonotactic error-driven learning. Informally, the new subroutine chooses a loser which is as similar as possible to the intended winner, while being able to trigger an update. Similarity is measured in terms of the number of winner-preferring constraints in the corresponding ERC. Crucially, this new subroutine allows the various training ERCs to become active in stages, ordered by their complexity, measured in terms of the number of winner-preferring constraints. This allows for careful restrictiveness guarantees, such as the one provided by theorem 2. The proof of the theorem illustrates a number of techniques for the restrictiveness analysis of the EDRA model with the new subroutine for the choice of the loser.

Proof. The proof is by induction on time t. The inequality (13) trivially holds at the initial time t = 0, because of the choice of the initial ranking values θ^{t=0}_{*[h]} = θ > 0 and θ^{t=0}_{ID[DOR]} = 0. Assume that the inequality holds at time t and let me show that it then holds at time t + 1 as well. If the update between times t and t + 1 has been triggered by the ERCs 1 through 6, then the inequality holds at time t + 1 because it held at time t and the two constraints *[h] and IDENT[DOR] have not been re-ranked between times t and t + 1. If the update between times t and t + 1 has been triggered by the ERC 7, then the inequality holds at time t + 1 because it held at time t and both constraints *[h] and IDENT[DOR] have been promoted by the same amount between times t and t + 1. Finally, if the update between times t and t + 1 has been triggered by the ERCs 8 or 9, then the inequality holds at time t + 1 because of the following chain of inequalities.
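The chain of inequalities (14) is not reproduced in this copy; based on the three steps described just below (demotion by 1, the pre-update comparison, promotion by 1/4), it can plausibly be reconstructed as:

```latex
\theta^{t+1}_{*[\mathrm{h}]}
  \overset{(a)}{=} \theta^{t}_{*[\mathrm{h}]} - 1
  \overset{(b)}{\geq} \theta^{t}_{\mathrm{ID[DOR]}} - 1
  \overset{(c)}{=} \theta^{t+1}_{\mathrm{ID[DOR]}} - \tfrac{1}{4} - 1
```

so that θ^{t+1}_{*[h]} ≥ θ^{t+1}_{ID[DOR]} − 5/4, which is presumably the shape of the invariant (13) being maintained by the induction.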
At step (14a), I have used the fact that the update by ERCs 8 or 9 between times t and t + 1 has demoted the constraint *[h] by 1, according to the re-ranking rule (3b). At step (14b), I have used the fact that, in order for ERCs 8 or 9 to have been able to trigger an update between times t and t + 1, the current ranking value θ^t_{*[h]} of the loser-preferring constraint *[h] must have been larger than or equal to the ranking value θ^t_{ID[DOR]} of the winner-preferring constraint IDENT[DOR]. Finally, at step (14c), I have used the fact that the update by ERCs 8 or 9 between times t and t + 1 has promoted the constraint IDENT[DOR] by 1/4, as that ERC has w = 3 winner-preferring constraints and the re-ranking rule (3a) sets the promotion amount equal to 1/(w+1). If ERCs 8 and 9 were to trigger lots of updates, IDENT[DOR] would be promoted a lot and *[h] would be demoted a lot. In the end, *[h] would thus find itself underneath IDENT[DOR], separated by a large distance. But lemma 1 says that is impossible. Hence, ERCs 8 and 9 can never trigger too many updates, as stated by lemma 2.
Lemma 2 The numbers α^t_8 and α^t_9 of updates triggered by ERCs 8 and 9 up to an arbitrary time t in an arbitrary run can be bounded as follows, where θ is the initial ranking value of the markedness constraints.

Proof. The ranking values θ^t_{*[h]} and θ^t_{ID[DOR]} of the constraints *[h] and IDENT[DOR] at an arbitrary time t can be expressed as follows in terms of the numbers of updates α^t_7, α^t_8, α^t_9 triggered by the ERCs 7, 8, and 9 up to time t.

A.2 Analysis up to time T
Recall from section 4 that time T is the smallest time such that ERC 7 triggers an update between times T and T + 1 (or the time when the run ends, in case ERC 7 triggers no updates). The constraint IDENT[DOR] is only promoted by ERCs 8 and 9 up to time T (ERC 7 triggers no updates before time T). Since these two ERCs cannot trigger too many updates by lemma 2, IDENT[DOR] cannot rise too high up to time T, as stated by the following lemma.

Proof. The faithfulness constraint IDENT[DOR] is only promoted by ERCs 8 and 9 up to time T (ERC 7 triggers no updates before time T). The ranking value of IDENT[DOR] at an arbitrary time t ≤ T can then be expressed as follows in terms of the numbers α^t_8, α^t_9 of updates triggered by ERCs 8 and 9 up to time t.
The following lemma says that the markedness constraint NODORFRIC cannot have dropped too much before time T. This follows from the fact that only ERCs 3 and 5 demote NODORFRIC up to time T (ERC 7 has not triggered any update yet). In order for NODORFRIC to have been demoted a long way, these two ERCs must have triggered many updates. Yet, the faithfulness constraint IDENT[CONT] is winner-preferring in both ERCs and is thus promoted by each update they trigger. These ERCs thus cannot trigger too many updates, because they cannot demote NODORFRIC a long way underneath IDENT[CONT].
Lemma 4 The ranking value of the markedness constraint NODORFRIC satisfies θ^t_{NODORFRIC} > θ/5 + 1/4 at any time t ≤ T.

Proof. Suppose by contradiction that the claim is false. This means that there exists some time t < T such that the markedness constraint NODORFRIC is demoted between times t − 1 and t and its ranking value at time t is smaller than or equal to the forbidden threshold θ/5 + 1/4. Since constraints are demoted by 1, its ranking value at the time t − 1 preceding the update must have been already smaller than or equal to θ/5 + 1/4 + 1, as stated in (20a). Only ERCs 3 and 5 can have triggered this demotion (ERC 7 triggers no updates before time T). Crucially, the constraint IDENT[CONT] is winner-preferring relative to both ERCs 3 and 5. In order for either ERC 3 or 5 to have been able to demote NODORFRIC between times t − 1 and t, the ranking value of the winner-preferring constraint IDENT[CONT] at time t − 1 must thus have been smaller than or equal to the ranking value of the loser-preferring constraint NODORFRIC at time t − 1, as stated in (20b). The rest of the proof derives a contradiction from these two inequalities (20).
The ranking value of the markedness constraint NODORFRIC at time t − 1 can be lower bounded as θ^{t−1}_{NODORFRIC} ≥ θ − α^{t−1}_3 − α^{t−1}_5, by only considering the contribution of the ERCs 3 and 5 which demote it, while ignoring the contribution of the ERCs 2 and 9 which promote it. Plugging this bound into (20a) yields the following bound on the number of updates triggered by ERCs 3 and 5 up to time t − 1.
The ranking value of the faithfulness constraint IDENT[CONT] at time t − 1 can be lower bounded as θ^{t−1}_{ID[CONT]} ≥ (1/3) α^{t−1}_3 + (1/3) α^{t−1}_5, by only considering the contribution of ERCs 3 and 5. Using the bound (21) on the number of updates triggered by ERCs 3 and 5, we obtain the following lower bound on the ranking value of IDENT[CONT].
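The displays (21) and (22) are not reproduced in this copy; from the bounds just derived, they can plausibly be reconstructed as:

```latex
\alpha^{t-1}_3 + \alpha^{t-1}_5
  \;\geq\; \theta - \frac{\theta}{5} - \frac{5}{4}
  \;=\; \frac{4}{5}\,\theta - \frac{5}{4}
\tag{21}
```

```latex
\theta^{t-1}_{\mathrm{ID[CONT]}}
  \;\geq\; \frac{1}{3}\Big(\alpha^{t-1}_3 + \alpha^{t-1}_5\Big)
  \;\geq\; \frac{1}{3}\Big(\frac{4}{5}\,\theta - \frac{5}{4}\Big)
  \;=\; \frac{4}{15}\,\theta - \frac{5}{12}
\tag{22}
```

Since 4/15 > 3/15 = 1/5, the lower bound (22) eventually exceeds the threshold θ/5 + 5/4 of (20a) (namely, as soon as θ > 25), so that (20b) cannot hold for θ large.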
The inequalities (20a), (20b), and (22) are contradictory (provided θ is large), because they require the ranking value θ^{t−1}_{NODORFRIC} to be smaller than the threshold in (20a) and yet, through (20b), at least as large as the lower bound (22).

A.3 Analysis up to time S

Recall that time S is the first time when the current ranking vector becomes consistent with either ERC 1 or ERC 6. Because of the new subroutine for the choice of the loser, the run up to time S is determined by ERCs 1, 6, 8, and 9, as noted above. Since the faithfulness constraint IDENT[VOICE] is the only winner-preferring constraint in both ERCs 1 and 6, it must rise a long way in order for the current ranking vector to become consistent with either ERC 1 or ERC 6 at time S, as stated by lemma 5.

Lemma 5 The ranking value of the faithfulness constraint IDENT[VOICE] satisfies the following inequality at time S:

Proof. For concreteness, suppose it is ERC 1 which becomes consistent with the current ranking vector at time S (the reasoning is identical if it is ERC 6 instead). This means that the ranking value of the winner-preferring constraint IDENT[VOICE] is larger than the ranking value of the loser-preferring constraint NOVOICEDSTOP at time S, as stated by the following inequality.
At step (27a), I have used the expression (25a) of the ranking value of IDENT[VOICE]. At step (27b), I have used the bound (26) on α^S_1. At step (27c), I have lower bounded by dropping the contribution of α^S_6, which is crucially multiplied by a positive coefficient.

A.4 Analysis after time S
The faithfulness constraint IDENT[DOR] is only promoted by ERCs 7, 8, and 9. The latter two ERCs 8 and 9 promote IDENT[DOR] but not IDENT[VOICE]. Yet, they can only trigger a few updates by lemma 2, and thus cannot give a substantial advantage to the former constraint over the latter. Furthermore, ERC 7 promotes both IDENT[DOR] and IDENT[VOICE], and thus does not give the former any advantage over the latter. The following lemma thus concludes that IDENT[DOR] will never be able to surpass IDENT[VOICE], which already sits high at time S by lemma 5.

Lemma 6
The ranking values of the faithfulness constraints IDENT[VOICE] and IDENT[DOR] satisfy the following inequality at any time t ≥ S:

Proof. Suppose by contradiction that (28) fails at some time t ≥ S, as stated in (29).
From now on, let α^{S,t}_i denote the number of updates triggered by the i-th ERC between times S and t, so that α^t_i = α^S_i + α^{S,t}_i. The ranking value of the faithfulness constraint IDENT[DOR] at time t can be expressed as in (30). At step (30a), I have used the fact that this constraint is promoted only by ERCs 7, 8, and 9. At step (30b), I have used the fact that ERC 7 triggers no updates before time T and thus also no updates before time S (because S ≤ T), so that α^S_7 = 0 and thus α^t_7 = α^{S,t}_7.

The ranking value of the faithfulness constraint IDENT[VOICE] at time t can be expressed as in (31). At step (31a), I have expressed the ranking value at time t ≥ S as the ranking value at time S plus the increment in the ranking value due to the promotions between times S and t. At step (31b), I have lower bounded the ranking value of IDENT[VOICE] at time S using (23).

A.5 An auxiliary result
The next step in the analysis (namely, the proof of lemma 7 below) rests on theorem 3 (Magri, 2014b).
Theorem 3 Consider an arbitrary run of the EDRA with the re-ranking rule (3). Assume that each training ERC has a unique L. Focus on a specific training ERC, say the ıth one. Let C_ℓ be its unique loser-preferring constraint and let C_h be one of its (possibly many) winner-preferring constraints, as in (32). Define the coefficient δ_i as in (33), which pairs each possible pattern of the ith ERC in the columns of C_ℓ and C_h with the corresponding amount of disruption. The number of updates α_ı triggered by the ıth input ERC is either null or else bounded as in (34), where θ^init_ℓ and θ^init_h are the initial ranking values of the two constraints C_ℓ and C_h; α_i is the number of updates triggered by the ith training ERC; w_i is the number of its winner-preferring constraints; and the sum in (34c) runs over all training ERCs.
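The quantities in theorem 3 can be made concrete with a short sketch. I assume the calibrated promotion amount 1/(w+1), which the discussion of (33) below attributes to an update, and a demotion amount of 1 for the unique loser-preferring constraint; both are my assumptions about the shape of rule (3), not a verbatim transcription of it, and the 'demoting C_h' case of the disruption coefficient is likewise my assumption.

```python
def update(theta, erc):
    """One error-driven update on an inconsistent ERC: promote each of
    the w winner-preferring constraints by 1/(w+1) and demote the unique
    loser-preferring constraint by 1 (assumed calibrated rule)."""
    winners = [c for c, m in erc.items() if m == 'W']
    new = dict(theta)
    for c in winners:
        new[c] += 1.0 / (len(winners) + 1)
    for c, m in erc.items():
        if m == 'L':
            new[c] -= 1.0
    return new

def disruption(erc, c_ell, c_h):
    """Disruption delta_i of the configuration C_h >> C_ell caused by one
    update on the ith ERC. The case where the ERC promotes C_ell follows
    the text's example: delta_i = 1/(w_i + 1)."""
    w = sum(1 for m in erc.values() if m == 'W')
    if erc.get(c_ell) == 'W':
        return 1.0 / (w + 1)      # C_ell is pushed up by the update
    if erc.get(c_h) == 'L':
        return 1.0                # C_h is pushed down (my assumption)
    return 0.0

# An ERC like the top one in (33): C_ell among its w = 2 winner-preferrers.
erc_i = {'C_ell': 'W', 'C_other': 'W', 'C_x': 'L'}
print(round(disruption(erc_i, 'C_ell', 'C_h'), 3))   # 0.333
```

With w_i = 2, one update on this ERC promotes C_ell by 1/3, eroding the gap θ_{C_h} − θ_{C_ell} by exactly δ_i = 1/3, which is the quantity summed in (34c).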
Here is the intuitive idea. Suppose that the initial ranking value θ^init_ℓ of the loser-preferrer C_ℓ is larger than the initial ranking value θ^init_h of the winner-preferrer C_h. A certain number of updates by the ıth ERC are then justified just to compensate for this bad choice of initial ranking values, as quantified by the term (34a). At that point, the two constraints could in principle have exactly the same ranking value. One additional update is justified in order to bring the winner-preferring constraint C_h above the loser-preferrer C_ℓ, yielding the term (34b). Further updates by the ıth ERC are only justified if this ranking configuration C_h ≫ C_ℓ is disrupted by updates triggered by other training ERCs, as quantified by the term (34c). This term sums the number of updates α_i triggered by the generic ith training ERC, multiplied by the "amount of disruption" δ_i caused by that ERC to the ranking configuration C_h ≫ C_ℓ. For instance, suppose that the ith training ERC looks like the top ERC listed in (33). The amount of disruption caused by that ERC is δ_i = 1/(w_i + 1), because that ERC disrupts the ranking configuration C_h ≫ C_ℓ by promoting C_ℓ by 1/(w_i + 1).

A.6 Analysis after time T

Lemma 7 says that IDENT[CON] is always ranked above IDENT[DOR] after time T, with a sufficient distance between the two faithfulness constraints (at least 2). The proof of this lemma is more involved than the proofs of the preceding lemmas. The difficulty is due to the fact that ERC 2 promotes IDENT[CON] to the exclusion of IDENT[DOR], while ERC 7 promotes IDENT[DOR] to the exclusion of IDENT[CON]. In order to compare the ranking values of these two faithfulness constraints, we thus need some connection between the numbers of updates α^t_2 and α^t_7 triggered by the two ERCs 2 and 7. What allows this connection to be established is the fact that the constraint NODORFRIC is winner-preferring in ERC 2 but loser-preferring in ERC 7.
Since ERC 2 thus promotes the constraint NODORFRIC, which ERC 7 tries to demote, updates by ERC 2 "buy" extra updates by ERC 7. If ERC 2 happens to trigger few updates (and thus contributes little to the height of IDENT[CON]), then it buys only a few updates by ERC 7 (which therefore contributes little to the height of IDENT[DOR]). Theorem 3 is used to formalize this intuition, yielding the link between α^t_2 and α^t_7 in (38).

Lemma 7 Suppose that, in the run considered, ERC 7 triggers at least one update. The ranking values of the faithfulness constraints IDENT[CON] and IDENT[DOR] satisfy the following inequality at any time t ≥ T:

(35) θ^t_{ID[CON]} ≥ θ^t_{ID[DOR]} + 2
Proof. Since ERC 7 triggers an update between times T and T + 1, the loser-preferring constraint NODORFRIC cannot be underneath the winner-preferring constraint IDENT[VOICE] at time T, as stated in (36a). Furthermore, the current ranking vector at time T must be consistent with ERC 5 (otherwise, the algorithm would have chosen ERC 5 instead of ERC 7, since the former has fewer W's). This means that the loser-preferring constraint NODORFRIC is already underneath IDENT[CON] at time T, as stated in (36b).
(36) a. θ^T_{NODORFRIC} ≥ θ^T_{ID[VOI]}
     b. θ^T_{ID[CON]} > θ^T_{NODORFRIC}
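The selection criterion invoked in this proof (prefer the inconsistent ERC with the fewest W's) can be sketched as follows. This is my reading of the tie-breaking used here; the encoding, function names, and the toy W/L patterns below are illustrative, not the paper's actual ERCs 5 and 7.

```python
def consistent(theta, erc):
    """Some 'W' constraint strictly outranks every 'L' constraint."""
    winners = [theta[c] for c, m in erc.items() if m == 'W']
    losers  = [theta[c] for c, m in erc.items() if m == 'L']
    return not losers or (bool(winners) and max(winners) > max(losers))

def choose_erc(theta, ercs):
    """Among the ERCs inconsistent with the current ranking vector,
    return one with the fewest W's (ties broken by list order)."""
    bad = [e for e in ercs if not consistent(theta, e)]
    if not bad:
        return None
    return min(bad, key=lambda e: sum(1 for m in e.values() if m == 'W'))

# A one-W ERC is preferred over a two-W ERC when both are inconsistent.
theta = {'NODORFRIC': 5.0, 'ID[CON]': 1.0, 'ID[DOR]': 1.0, 'ID[VOI]': 1.0}
erc5  = {'ID[CON]': 'W', 'NODORFRIC': 'L'}                  # one W
erc7  = {'ID[DOR]': 'W', 'ID[VOI]': 'W', 'NODORFRIC': 'L'}  # two W's
print(choose_erc(theta, [erc7, erc5]) is erc5)   # True
```

Conversely, if the one-W ERC is already consistent (as ERC 5 is at time T), only the two-W ERC remains available, which is what licenses the inference to (36b).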
The chain of inequalities in (37) thus holds:

(37) θ^T_{ID[CON]} ≥ θ^T_{ID[VOI]} ≥ θ^S_{ID[VOI]} ≥ (1/3) θ

In step (37a), I have used (36). In step (37b), I have used the fact that T ≥ S and that the ranking values of the faithfulness constraints can only grow over time (because they are never demoted). In step (37c), I have used the inequality (23).

Since ERC 7 triggers no updates before time T, the number α^t_7 of updates it has triggered up to time t equals the number α^{T,t}_7 of updates it has triggered between times T and t, as stated in (38a). Applying theorem 3 to ERC 7, pivoting on its winner-preferring constraint IDENT[DOR] and taking time T as the initial time, yields the inequality (38b). In step (38c), I have used (36b) together with the fact that θ^T_{ID[DOR]} ≥ 0.