Constraint-based Learning of Phonological Processes

Phonological processes are context-dependent sound changes in natural languages. We present an unsupervised approach to learning human-readable descriptions of phonological processes from collections of related utterances. Our approach builds upon a technique from the programming languages community called *constraint-based program synthesis*. We contribute a novel encoding of the learning problem into Boolean Satisfiability constraints, which enables both data efficiency and fast inference. We evaluate our system on textbook phonology problems and datasets from the literature, and show that it achieves high accuracy at interactive speeds.


Introduction
Phonological processes govern the way speech sounds in natural languages change depending on the context. For example, in English verbs, the past tense suffix /d/ turns into [t] after voiceless consonants (so the word "zipped" is pronounced [zIpt], while "begged" is pronounced [bEgd]). Linguists routinely face the task of inferring phonological processes by observing and contrasting surface forms (pronunciations) of morphologically related words. To aid linguists with this task, we consider the problem of learning phonological processes automatically from collections of related surface forms.
This problem setting imposes four core requirements, which guide the design of our approach:
1. Inference results must be fully interpretable: our goal is to explain phonological processes exhibited by the data, not merely predict pronunciations of unseen words. Hence, our model takes the form of discrete, conditional rewrite rules from rule-based phonology (Chomsky and Halle, 1968).
2. Inference must be unsupervised: phonological processes are formalized as transformations from (latent) underlying forms to surface forms (rather than between surface forms).
3. Inference must be data-efficient: typically only a handful of data points are available.
4. Inference must be fast: we envision linguists using our system interactively, tweaking the data and being able to see the inferred rules within minutes.
Recently, program synthesis has emerged as a promising approach to interpretable and data-efficient learning (Ellis et al., 2015; Singh et al., 2017; Verma et al., 2018; Ellis et al., 2018). In program synthesis, models are represented as programs in a domain-specific language (DSL), which allows domain experts to impose a strong prior by designing an appropriate DSL. Program synthesis uses powerful constraint solvers to perform combinatorial optimization and find the least-cost program in the DSL that fits the data. Program synthesis has previously been used to tackle the problem of phonological rule learning (Ellis et al., 2015); however, that work uses global inference, which scales poorly and hence does not satisfy requirement 4 (their system takes an hour on average to solve a phonology textbook problem).
In this work, we propose a novel inference technique that satisfies all four core requirements. Our key insight is that the problem of learning conditional rewrite rules can be decomposed into three steps: inference of the latent underlying forms, learning the changes (rewrites), and learning the conditions. Moreover, each of these problems can be encoded as a constrained optimization problem that can be solved efficiently by modern satisfiability modulo theories (SMT) solvers (de Moura and Bjørner, 2008). Both the decomposition and the encoding into constraints are contributions of this work. We implement this approach in a system called SYPHON and show that it is capable of generating accurate phonological rules in under a minute and from just 5-30 data points.

Background and Problem Definition
In this section, we illustrate phonological processes and the problem of phonological rule induction using our running example of English verbs.

Rule-Based Phonology
Phonological features. Phones (speech sounds) are described using a feature system that groups similar-sounding phones together. For instance, voiced consonants ( . However, some phones may be uniquely identified by several feature vectors, and not all feature vectors correspond to phones (the feature system is redundant). For example, the feature vector [+low +high] does not correspond to any phone, as no phone can have both a raised and a lowered tongue body.
Phonological rules. In rule-based phonology, a phonological process is formalized as a conditional rewrite rule that transforms an underlying form of a word (roughly, the unique stored form of the word) into its surface form (the word as it is intended to be pronounced). In our English past tense example, the underlying form /zIpd/, formed by concatenating the stem /zIp/ and the past tense suffix /d/, is transformed into the surface form [zIpt] by a rule that makes an obstruent voiceless when it occurs after a voiceless obstruent:

$$[-\text{sonorant}] \to [-\text{voice}] \;/\; [-\text{voice}\ -\text{sonorant}]\ \_\_$$

In general, phonological rules have the form A → B / L _ R, where all of A, B, L, and R are feature vectors. The rule means that any phone that matches A and occurs between two phones that match L and R, respectively, will be rewritten to match B (leaving the features not mentioned by B intact). A is called the target of the rule, B is called the structural change, and L and R are the left and the right contexts. In the example above, the right context is omitted, because it is irrelevant to the rule's application; formally, A, L, and R may each be empty feature vectors, which are defined to match any phone.
Hereafter, we refer to the sequence LAR of the target and the context as the condition of the rule. If the condition is empty, the rule applies unconditionally. In addition to + and −, the values of features in the condition of the rule may be variables, which enforce that features have the same value in different parts of the condition. For example, the condition [αconsonant] _ [αconsonant] describes a rule that applies between two consonants or between two vowels, but not between a consonant and a vowel.
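To make the semantics of A → B / L _ R concrete, the following is a minimal sketch of rule application over a toy inventory. The six phones, the four features, and the names `PHONES`, `matches`, `rewrite`, and `apply_rule` are assumptions made for illustration, not part of SYPHON; feature variables such as α are not modeled.

```python
# Toy inventory (an assumption for illustration): phone -> feature values.
PHONES = {
    "z": {"voice": True,  "sonorant": False, "continuant": True,  "coronal": True},
    "s": {"voice": False, "sonorant": False, "continuant": True,  "coronal": True},
    "d": {"voice": True,  "sonorant": False, "continuant": False, "coronal": True},
    "t": {"voice": False, "sonorant": False, "continuant": False, "coronal": True},
    "p": {"voice": False, "sonorant": False, "continuant": False, "coronal": False},
    "I": {"voice": True,  "sonorant": True,  "continuant": True,  "coronal": False},
}


def matches(vector, phone):
    """True if `phone` has every feature value listed in `vector`.

    An empty vector matches anything, including a missing neighbor (None).
    """
    if phone is None:
        return not vector
    return all(PHONES[phone][f] == v for f, v in vector.items())


def rewrite(phone, change):
    """Set the features named in `change`, leaving all other features intact."""
    features = {**PHONES[phone], **change}
    for name, candidate in PHONES.items():
        if candidate == features:
            return name
    raise ValueError("no phone in the toy inventory has these features")


def apply_rule(rule, word):
    """Apply A -> B / L _ R at every position of the underlying form."""
    target, change, left, right = rule
    surface = []
    for i, phone in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        if matches(target, phone) and matches(left, prev) and matches(right, nxt):
            phone = rewrite(phone, change)
        surface.append(phone)
    return "".join(surface)


# [-sonorant] -> [-voice] / [-voice -sonorant] _   (right context empty)
DEVOICE = ({"sonorant": False}, {"voice": False},
           {"voice": False, "sonorant": False}, {})

print(apply_rule(DEVOICE, "zIpd"))  # zIpt
print(apply_rule(DEVOICE, "zIpz"))  # zIps
print(apply_rule(DEVOICE, "dIdz"))  # dIdz (toy string; rule does not apply)
```

Conditions are checked against the underlying form rather than the partially rewritten word, which is one possible reading of simultaneous rule application.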

Problem Definition
The input to our problem is a matrix of surface forms, such as the one shown in Fig. 1, left. These forms are arranged into rows, corresponding to different stems, and columns, corresponding to different inflections (in this case, the third-person singular and past tense of English verbs). In the interest of space, we only show four rows from this data set, but a typical input in a phonology textbook problem is only slightly larger and ranges from 5 to 30 rows.
Given these data, our task is to infer the latent underlying forms for each of the words in the input such that the resulting matrix of underlying forms factorizes into stems and suffixes, and to learn a sequence of phonological rules which, when applied elementwise to the matrix of underlying forms, reproduces the matrix of surface forms.
This learned sequence of phonological rules is generative in the following sense: given the underlying form for a new word, such as /aeskz/, we can deterministically apply these rules to generate the surface form of that word, [aesks]. We use this property to evaluate the accuracy of the rule set we learned by holding out a portion of the words from the data, and then applying the rule to the underlying forms of those words, which were determined through phonological research.

Phonological Intuition
The design of our system is informed by how linguists solve the problem of phonological rule induction. When a phonologist analyzes these data, they begin by positing underlying forms that are likely to result in the simplest set of rules. For example, they observe that the substring shared in each row is most likely the stem, which surfaces without change; the underlying suffix in the first column in Fig. 1 is likely /z/, which sometimes surfaces as [s] and other times as [@z]; and similarly, the underlying suffix in the second column is likely /d/, which can change to [t] or [@d]. The choice of /z/ and /d/ as the underlying suffixes is preferred to, say, /s/ and /t/, because this choice lets us explain all the observed data using only three edits: /z/ → [s], /d/ → [t], and ∅ → [@].
Figure 1: The general structure of the problem, shown concretely for English verbs.
The next step is to merge and generalize individual edits: the first two edits are both devoicing an obstruent, so they can be merged into [−sonorant] → [−voice], while the last edit is an insertion and cannot be generalized.
The final step of the analysis is to infer the conditions under which each of the two structural changes occurs. By contrasting examples in the first column, we infer that the insertion happens when the suffix /z/ occurs after a strident (like /s/ in /mIs/); otherwise, /z/ and /d/ are devoiced whenever they occur after a voiceless obstruent (like /p/ in /zIp/). The full data set can be explained using the two rules in Fig. 1, right. Note that in order to capture the data in both columns, the insertion rule says that [@] is inserted whenever the stridency of the left and right context matches. Note also that in this case the order of rules matters: for words like /mIsz/, insertion is applied first, which prevents the devoicing rule from applying.

Learning Phonological Rules
As illustrated in Fig. 1, the input to our learning problem is a matrix of surface forms $X_{ij}$ with $I$ rows and $J$ columns. The goal is to learn a discrete rule set $\mathcal{R}$, while jointly inferring the latent set of $I$ stems $S_i$ and $J$ affixes $A_j$.
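Hypothesis space. The hypothesis space for $\mathcal{R}$ can be formalized as a context-free grammar:

$$\begin{aligned} \mathcal{R} &::= R^{*} \\ R &::= C \to C \,/\, C \,\_\, C \\ C &::= [\,(V\,F)^{*}\,] \\ V &::= + \mid - \mid \alpha \\ F &::= \text{voice} \mid \text{sonorant} \mid \dots \end{aligned} \tag{1}$$

According to this grammar, $\mathcal{R}$ is a sequence of rules $R$; each $R$ is defined in terms of four feature vectors $C$; each feature vector is a sequence of pairs of feature values $V$ and feature names $F$.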
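Rewriting. We use $C_R$ and $B_R$ to denote the condition and structural change of a rule $R$, respectively. A feature vector $C$ can be interpreted as a Boolean formula that holds of a phone $a$ if $a$ possesses all features in $C$; we denote by $|C|$ the number of models of this formula, i.e. phones in the inventory $\Phi$ that satisfy $C$. Similarly, $C_R$ is a Boolean formula over trigrams of phones. A rewrite of a trigram $abc$ by rule $R$ is defined as:

$$R(abc) = \begin{cases} B_R(b) & \text{if } C_R(abc) \\ b & \text{otherwise} \end{cases}$$

where $B_R(b)$ is the phone obtained from $b$ by setting the features mentioned in $B_R$ and leaving all other features intact. The notion of rewrites can be extended to words and rule sets.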
Learning as constrained optimization. We can now formalize our problem as a hard correctness constraint over rules and underlying forms $U_{ij}$:

$$\forall i, j.\ \ \mathcal{R}(U_{ij}) = X_{ij} \ \wedge\ U_{ij} = A_j[S_i] \tag{2}$$

Here, $A_j[S_i]$ denotes a concatenation of the prefix/suffix $A_j$ with the stem $S_i$. There might be many rule sets $\mathcal{R}$ that are consistent with all the data, and what we would like is to pick one that generalizes to other data that exhibits the same phonological process (for example, the rule inferred in Fig. 1 should generalize to other regular English verbs). Hence we frame the learning problem as a constrained optimization problem and derive the objective function using a Bayesian model.

Figure 2: Probabilistic model of a phonological process. A rule set $\mathcal{R}$ is sampled from a description length prior. We observe a set of $N$ surface phonemes $x_k$; each $x_k$ is generated by sampling a rule $R_k$ from $\mathcal{R}$ and an underlying trigram $u_k$, and deterministically applying $R_k$ to $u_k$ (coin flip $b_k$ decides whether $u_k$ should match $R_k$'s condition).

Bayesian Model
Generative process. Intuitively, to generate surface forms $X_{ij}$, we must sample a single rule set $\mathcal{R}$, $I$ stems $S_i$, and $J$ affixes $A_j$, and then deterministically apply $\mathcal{R}$ to each $A_j[S_i]$. Prior work on phonological rule learning (Ellis et al., 2015) assumed that $S_i$ and $A_j$ are sampled uniformly from the language and independently of $\mathcal{R}$. We observe, however, that in most data sets of interest, underlying forms are in fact sampled to contrast the contexts in which $\mathcal{R}$ does and does not apply. We model this intuition as a strong sampling process depicted in Fig. 2.
For simplicity, in this model each observation corresponds to an individual rule application to an underlying trigram $u$ that produces a surface phoneme $x$. For example, the rewrite /zIpz/ → [zIps] is represented as four observations, one per phone of the underlying form: ⟨#zI, z⟩, ⟨zIp, I⟩, ⟨Ipz, p⟩, and ⟨pz#, s⟩, where # marks a word boundary.

Our generative process first samples a rule set $\mathcal{R}$ from the description length prior over the hypothesis space (1):

$$P(\mathcal{R}) \propto \exp\Big(-w_s \sum_{R \in \mathcal{R}} |R|\Big)$$

where $|R|$ is the length of rule $R$ and $w_s > 0$ is a model hyperparameter. For each observation $k \in 1..N$, we pick a rule $R_k$ uniformly from $\mathcal{R}$. Before sampling the underlying trigram $u_k$, we flip a coin $b_k$ to decide whether we want to sample a positive or a negative trigram, i.e. whether $C_{R_k}(u_k)$ should hold true; we then sample $u_k$ uniformly from the set of all positive (resp. negative) trigrams (subject to the hard constraint that they form a factorizable matrix $U_{ij}$). Finally, we deterministically compute $x_k = R_k(u_k)$. Hence we can define:

$$P(u_k \mid R_k, b_k) = \begin{cases} 1 / |C_{R_k}| & \text{if } b_k \\ 1 / |\neg C_{R_k}| & \text{otherwise} \end{cases}$$

Objective function. Taking logs, we can derive the following approximate minimization objective for our constrained optimization problem:

$$\sum_{R \in \mathcal{R}} \big( w_s\, |R| + N^+_R \log |C_R| \big) \tag{3}$$

where $N^+_R$ is the number of positive examples for this rule. (Note that this objective ignores $P(R_k \mid \mathcal{R})$ and $b_k$, which are assumed to be uniform. It also ignores the negative examples. This provides a reasonable approximation, under the assumption that $|\neg C_R| \gg |C_R|$ for each rule $R$, which holds in the current setting.) This function includes a simplicity term, which favors rules with shorter (and hence, more general) conditions, and a likelihood term, which favors more specific conditions if there are sufficient positive examples to support them. This likelihood term stems from our strong sampling assumption; we demonstrate its importance for inferring accurate rules in Sec. 5.
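The trade-off between the simplicity term and the likelihood term can be illustrated numerically. The sketch below scores two candidate conditions for the devoicing rule on a toy six-phone inventory; the inventory, the weight `w_s = 1.0`, and the function names are assumptions for illustration, and the simplicity term counts only the condition (the structural change is the same for both candidates).

```python
import math

# Toy inventory (illustrative assumption): phone -> feature values.
INVENTORY = {
    "z": {"voice": True,  "sonorant": False},
    "s": {"voice": False, "sonorant": False},
    "d": {"voice": True,  "sonorant": False},
    "t": {"voice": False, "sonorant": False},
    "p": {"voice": False, "sonorant": False},
    "I": {"voice": True,  "sonorant": True},
}


def count_models(vector):
    """Number of phones in the inventory that satisfy a feature vector."""
    return sum(
        all(feats[f] == v for f, v in vector.items())
        for feats in INVENTORY.values()
    )


def objective(condition, n_pos, w_s):
    """Simplicity plus likelihood for one rule; condition = (left, target, right)."""
    size = sum(len(vector) for vector in condition)
    # Trigram models factor into per-position counts.
    models = math.prod(count_models(vector) for vector in condition)
    return w_s * size + n_pos * math.log(models)


GENERAL = ({}, {"sonorant": False}, {})
SPECIFIC = ({"voice": False, "sonorant": False}, {"sonorant": False}, {})

for n_pos in (1, 5):
    print(n_pos, objective(GENERAL, n_pos, 1.0), objective(SPECIFIC, n_pos, 1.0))
```

With a single positive example the shorter condition has the lower cost; with five positive examples the likelihood term outweighs the extra two features and the more specific condition wins.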

Inference by Program Synthesis
To solve the constrained optimization problem, we build upon a technique from programming languages called constraint-based program synthesis (Solar-Lezama, 2013).
Constraint-based synthesis. The input to (inductive) program synthesis is a DSL that defines the space of possible programs and a set of input-output examples $E = \overrightarrow{\langle i, o \rangle}$; the goal is to find a program whose behavior is consistent with the examples. In constraint-based synthesis, this search problem is reduced to solving a boolean constraint. To this end, we index the DSL by a bitvector $c$, called a control vector. We then define a mapping from control vectors to program behaviors via an evaluation relation $\varphi(c, y, z)$, a boolean formula that holds if and only if a program indexed by $c$ produces output $z$ on input $y$. Given the evaluation relation, the synthesis problem reduces to solving the following boolean constraint:

$$\exists c.\ \bigwedge_{\langle i, o \rangle \in E} \varphi(c, i, o)$$
An SMT solver (de Moura and Bjørner, 2008) is then used to find a satisfying assignment for c, which allows us to recover the corresponding program. For this approach to succeed, the evaluation relation has to be designed carefully so that it only uses constraints that the solver can efficiently reason about.
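As a toy illustration of indexing a DSL by a control vector, the sketch below uses a two-bit control vector to select one of four arithmetic programs and searches for an assignment that satisfies the evaluation relation on every example. The DSL is invented for this example, and exhaustive enumeration stands in for the SMT solver; a real encoding would hand the same constraint to a solver instead of enumerating.

```python
from itertools import product


def run(c, y):
    """Program indexed by control vector c = (op, k).

    op selects addition (0) or multiplication (1);
    k selects the constant 1 (0) or 2 (1).
    """
    op, k = c
    constant = 2 if k else 1
    return y * constant if op else y + constant


def phi(c, y, z):
    """Evaluation relation: the program indexed by c maps input y to output z."""
    return run(c, y) == z


def synthesize(examples):
    """Return a control vector consistent with all examples, or None."""
    for c in product((0, 1), repeat=2):
        if all(phi(c, y, z) for y, z in examples):
            return c
    return None


print(synthesize([(1, 2), (3, 6)]))  # (1, 1): multiply by 2
print(synthesize([(1, 5)]))          # None: no program in the DSL fits
```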
Synthesis of phonological rules. In our setting, the DSL is the space of all rule sets $\mathcal{R}$ (up to a certain size), and the evaluation relation $\varphi(c, U, X)$ is the correctness condition (2). Importantly, our setting differs from traditional program synthesis in two ways: first, we have to search for both the control vector and the inputs, and second, in addition to satisfying the correctness condition, we also seek to minimize the objective function (3). If we encode the objective function as $\psi(c, \overrightarrow{\langle U, X \rangle})$, we can reduce rule learning to the following constrained optimization:

$$\min_{c,\ \overrightarrow{U}}\ \psi(c, \overrightarrow{\langle U, X \rangle}) \quad \text{subject to} \quad \bigwedge_{i,j} \varphi(c, U_{ij}, X_{ij})$$

Given a proper encoding of $\varphi$ and $\psi$, this constraint can be solved by an optimizing SMT solver (Bjørner et al., 2015); this is the approach used in prior work (Ellis et al., 2015). However, this is a very computationally intensive problem. The reason is the astronomical size of the search space: for a problem of factorizing a 10×2 matrix $X_{ij}$ into stems of length $S = 3$ and affixes of length $A = 2$, if we limit the maximum number of rules $N_R$ to 2 and consider an inventory $\Phi$ with 90 phones and a feature set $F$ with 30 features, we can estimate the size of the search space as $3^{|F| N_R}\, |\Phi|^{I \cdot S + J \cdot A} \approx 2^{600}$.
Decomposition. To achieve scalable inference, we decompose the global constrained optimization problem into three steps, inspired by the phonological intuition we described in Sec. 2.3:
1. Underlying form inference. In the first step we use an SMT solver to generate likely underlying stems and suffixes. We rank them based on the heuristic that underlying forms $U_{ij}$ that have a smaller edit distance from surface forms $X_{ij}$ are more likely to produce simple rules (Sec. 3.3).
2. Change inference. Given the set of edits between each $U_{ij}$ and the corresponding $X_{ij}$, we identify the smallest set of structural changes $B$ that can describe all the edits (Sec. 3.4).
3. Condition inference. Finally, for each structural change $B$, we use program synthesis to infer the condition under which this change occurs (Sec. 3.5). If this step fails, we go back to step 1 and generate the next candidate matrix $U_{ij}$.
In the rest of this section we detail these three steps. For illustration purposes, in all examples we will assume that our feature set has just three features: voice v, sonorant s, and continuant c.

Underlying Form Inference
The input to this step is the matrix of surface forms $X_{ij}$ and the output is a set of aligned pairs $\langle U, X \rangle_{ij}$ (Tab. 1). The output matrix $\langle U, X \rangle_{ij}$ has to satisfy two properties: (i) the matrix $U_{ij}$ can be factorized into stems $S_i$ and affixes $A_j$, and (ii) each pair $\langle U, X \rangle$ has a small edit distance. Our intuition is that underlying forms that have a small edit distance from surface forms are likely to produce simpler rules. Hence we generate candidate matrices $\langle U, X \rangle_{ij}$ in the order of increasing edit distance, until rule inference succeeds for one of them. This strategy will always eventually find a matrix of underlying forms that can be related to the surface forms by a rule set we can infer, as long as one exists. This process is not guaranteed to find the global minimum of the objective function (3), but we show empirically that it produces good results.
We can encode the properties (i) and (ii) as a boolean constraint over unknown strings with concatenation and length, which can be solved efficiently by the Z3STR2 solver (Zheng et al., 2017). From the solutions for those unknowns it is straightforward to recover not only the stems and suffixes, but also the required alignment information between the underlying and surface forms.
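The ranking of candidate underlying forms by edit distance can be sketched without a string solver. The version below is a simplified stand-in, not the Z3STR2 encoding: it assumes suffixing morphology, takes the longest common prefix of each row as the stem, draws candidate suffixes from the observed residues, and enumerates all combinations. The data and the function names are assumptions for illustration.

```python
from itertools import product
from os.path import commonprefix  # character-wise longest common prefix


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def ranked_candidates(surface):
    """Yield (cost, stems, suffixes) in order of increasing total edit distance."""
    stems = [commonprefix(row) for row in surface]
    n_cols = len(surface[0])
    # Candidate suffixes for a column: whatever follows the stem in some row.
    options = [
        sorted({row[j][len(stem):] for row, stem in zip(surface, stems)})
        for j in range(n_cols)
    ]
    ranked = []
    for suffixes in product(*options):
        cost = sum(
            edit_distance(stem + suffix, row[j])
            for row, stem in zip(surface, stems)
            for j, suffix in enumerate(suffixes)
        )
        ranked.append((cost, stems, suffixes))
    return sorted(ranked, key=lambda r: (r[0], r[2]))


# Rows: stems; columns: third-person singular and past tense.
SURFACE = [["zIps", "zIpt"], ["bEgz", "bEgd"], ["nidz", "nid@d"]]
best_cost, best_stems, best_suffixes = ranked_candidates(SURFACE)[0]
print(best_cost, best_stems, best_suffixes)  # 3 ['zIp', 'bEg', 'nid'] ('z', 'd')
```

If rule inference failed for the first candidate, the caller would move on to the next entry of the ranked list, mirroring the generate-and-test loop described above.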

Change Inference
The input to change inference is the set of all edits in the aligned pairs $\langle U, X \rangle_{ij}$, computed in the previous step, and the output is a set of structural changes that captures all the edits. Tab. 2 illustrates this for the edits from Tab. 1; columns LHS and RHS show relevant features of the left- and right-hand sides of the edit. For each edit, we compute the set of all possible structural changes which are consistent with the edit.
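One way to read "all structural changes consistent with an edit" is sketched below for substitution edits over the three illustrative features: a change must set every feature that differs between the two sides, and may additionally mention any feature whose value is already shared. Intersecting these sets across edits gives the changes that explain all of them. The function names are hypothetical, and insertions such as ∅ → [@] are not handled in this sketch.

```python
from itertools import combinations


def consistent_changes(lhs, rhs):
    """All feature vectors B such that applying B to `lhs` yields `rhs`."""
    required = {f: v for f, v in rhs.items() if lhs[f] != v}
    optional = [f for f, v in rhs.items() if lhs[f] == v]
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            change = dict(required)
            change.update((f, rhs[f]) for f in extra)
            yield frozenset(change.items())


def shared_changes(edits):
    """Structural changes consistent with every edit in the list."""
    return set.intersection(*(set(consistent_changes(l, r)) for l, r in edits))


Z = {"voice": True,  "sonorant": False, "continuant": True}
S = {"voice": False, "sonorant": False, "continuant": True}
D = {"voice": True,  "sonorant": False, "continuant": False}
T = {"voice": False, "sonorant": False, "continuant": False}

shared = shared_changes([(Z, S), (D, T)])
smallest = min(shared, key=len)
print(dict(smallest))  # {'voice': False}
```

Both edits are explained by the single change [−voice], which is why they can be merged into one rule.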

Condition Inference
For each structural change $B$ inferred in the previous step, we now attempt to determine the condition LAR under which the change applies. If successful, a rule A → B / L _ R is added to the inferred rule set $\mathcal{R}$; otherwise we go back to underlying form inference and try the next candidate matrix $U_{ij}$.
For a given change $B$, the input to condition inference is the set of pairs $\langle u_k, \ell_k \rangle$, where $u_k$ is a phone trigram in some underlying form and the label $\ell_k$ can be positive ($\top$), negative ($\bot$), or unknown (?). Tab. 3 illustrates this for trigrams from $U$ = /zIpz/. A trigram is labeled positive if its middle phone undergoes the change $B$ in the data, negative if it does not undergo $B$, and unknown if $B$ has no effect on this phone. In our example, neither /I/ nor /p/ in /zIps/ actually changed; however, /zIp/ is labeled $\bot$ while /Ipz/ is labeled ?, because /p/ is already [−v], and hence devoicing has no effect on it. Our goal is to infer a condition consistent with the labels of all the positive and negative trigrams (unknown trigrams are ignored).

Inference by program synthesis. To frame condition inference as a program synthesis problem we need to define the control vector that indexes the space of all possible conditions, and a corresponding evaluation relation. In our control vector, for each feature $f$, we use six control variables, which represent the three positions that a feature can appear in a rule (left context, target, and right context) and the two values it can take on (+ and −). We denote these variables by $f^v_p$ for $v \in V = \{+, -\}$ and $p \in P = \{l, t, r\}$. Our evaluation relation takes the form $\varphi(c, u, \ell) \triangleq \mathit{matches}(c, u) \Leftrightarrow \ell$, where $\mathit{matches}$ is a relation specifying whether the condition indexed by $c$ matches the trigram $u$. The $\mathit{matches}$ relation is further defined as follows:

$$\mathit{matches}(c, u) \triangleq \bigwedge_{f \in F} \bigwedge_{p \in P} \bigwedge_{v \in V} \big( f^v_p \Rightarrow u_{p,f} = v \big)$$

where $u_{p,f}$ is the value of feature $f$ at position $p$ in trigram $u$.
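The search over conditions can be sketched by brute force for the three illustrative features. Instead of six Boolean control variables per feature, the sketch gives each (position, feature) slot one of three options (unconstrained, +, −), which covers the same satisfiable conditions; enumeration replaces the solver, and only condition size is minimized (no likelihood term). The labeled trigrams, the `#` boundary symbol, and all names are assumptions for illustration.

```python
from itertools import product

FEATURES = ("voice", "sonorant", "continuant")
POSITIONS = ("l", "t", "r")  # left context, target, right context

# Feature values per phone; '#' is a word boundary with no feature values.
PHONES = {
    "z": (True, False, True),
    "d": (True, False, False),
    "g": (True, False, False),
    "p": (False, False, False),
    "I": (True, True, True),
    "E": (True, True, True),
    "#": (None, None, None),
}


def matches(cond, trigram):
    """cond maps (position, feature) -> required value; absent slots are free."""
    for (p, f), v in cond.items():
        phone = PHONES[trigram[POSITIONS.index(p)]]
        if phone[FEATURES.index(f)] != v:
            return False
    return True


def infer_condition(examples):
    """Smallest condition that matches every positive and no negative trigram."""
    slots = [(p, f) for p in POSITIONS for f in FEATURES]
    best = None
    for values in product((None, True, False), repeat=len(slots)):
        cond = {s: v for s, v in zip(slots, values) if v is not None}
        if all(matches(cond, u) == label for u, label in examples):
            if best is None or len(cond) < len(best):
                best = cond
    return best


# Labels for the change [-voice]; unknown trigrams such as "Ipz" are omitted.
EXAMPLES = [
    ("pz#", True), ("pd#", True),
    ("#zI", False), ("zIp", False),
    ("gz#", False), ("gd#", False), ("Egz", False),
]
print(infer_condition(EXAMPLES))  # {('l', 'voice'): False}
```

On this toy data, minimizing size alone returns a condition that constrains only the left context to [−voice], which is more general than a condition that also mentions [−sonorant].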

Inductive Bias
In addition to being consistent with the data, we also want the condition to minimize the objective function (3). We encode the objective function as $w_s\, s(c) + l(c)$, where $s(c)$ encodes the simplicity of the condition indexed by $c$ (its size), $l(c)$ encodes the likelihood, and $w_s$ is a model hyperparameter which determines the relative importance of simplicity.
The challenge is to encode the likelihood term in a solver-friendly way. To count the number of models of $C_R$, we observe that $|C_R| = |C^l_R|\,|C^t_R|\,|C^r_R|$, i.e. we can independently count the models of the target and of the left and right contexts, so

$$l(c) \triangleq N^+ \sum_{p \in P} \log |C^p_R|$$

We also observe that $|C^p_R|$ can be encoded efficiently using a constraint whose size is linear in the size of the phone inventory $\Phi$:

$$|C^p_R| = \sum_{a \in \Phi} \mathrm{ite}\big(C^p_R(a),\, 1,\, 0\big)$$

Finally, as the solver does not support logarithms, we encode log using a lookup table. This is tractable, since we only need to evaluate the log of each $|C^p_R|$, which is at most the size of our inventory, roughly 100 phones.
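The factorization of the model count and the lookup-table treatment of the logarithm can be checked on a small example. The sketch below compares a brute-force count over all trigrams with the product of per-position counts, and evaluates the likelihood term using a precomputed table of logs for counts up to the inventory size. The six-phone inventory and the names are assumptions for illustration, and each part of the condition is assumed to match at least one phone.

```python
import math
from itertools import product

# Toy inventory (illustrative assumption): phone -> feature values.
INVENTORY = {
    "z": {"voice": True,  "sonorant": False},
    "s": {"voice": False, "sonorant": False},
    "d": {"voice": True,  "sonorant": False},
    "t": {"voice": False, "sonorant": False},
    "I": {"voice": True,  "sonorant": True},
    "n": {"voice": True,  "sonorant": True},
}


def holds(vector, feats):
    return all(feats[f] == v for f, v in vector.items())


def count(vector):
    """Per-position count: one 0/1 term per phone, linear in the inventory."""
    return sum(holds(vector, feats) for feats in INVENTORY.values())


def count_trigrams(condition):
    """Brute-force count of trigrams satisfying (left, target, right)."""
    return sum(
        all(holds(vector, feats) for vector, feats in zip(condition, trigram))
        for trigram in product(INVENTORY.values(), repeat=3)
    )


# The log is only ever needed for integers 1..|inventory|.
LOG_TABLE = {n: math.log(n) for n in range(1, len(INVENTORY) + 1)}


def likelihood(n_pos, condition):
    return n_pos * sum(LOG_TABLE[count(vector)] for vector in condition)


CONDITION = ({"voice": False, "sonorant": False}, {"sonorant": False}, {})
print(count_trigrams(CONDITION), [count(v) for v in CONDITION])  # 48 [2, 4, 6]
```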

Current limitations
SYPHON currently leverages three simplifying assumptions about rules for domain-specific problem decomposition and SMT encoding, which are crucial to making learning computationally tractable.
Conjunctive conditions. Rule conditions are conjunctions of equalities over feature values, and each rule has a unique change. We can thus decompose the learning process into change inference and condition inference: change inference greedily groups all observed edits into changes, and from then on we assume that each change uniquely corresponds to a rule.
Local context. The condition of each rule is only a function of the target phone and two surrounding phones. This allows us to encode condition inference as learning a formula over trigrams of phones, which has a compact encoding as SMT constraints.
Rule interaction. One rule's change does not create conditions for another. This allows us to perform condition inference for each rule independently.
Many attested patterns in real languages go beyond these limitations. We believe that it is possible to lift these restrictions, and still leverage the structure of conditional rewrite rules to retain most of the benefits of our problem decomposition. We leave this extension to future work.

Data
We evaluate our system on two broad categories of datasets: lexical databases and textbook problems.

Lexical Databases
We use large lexical databases to investigate two (morpho)phonological processes in English: flapping (6457 rows) and regular verb inflections (2756 rows). We process the CMU pronouncing dictionary (Weide, 2014) to create underlying and surface form pairs exemplifying flapping, as in Gildea and Jurafsky (1996). For verb inflections, we combine morphological information extracted from CELEX-2 (Baayen et al., 1993) with CMU transcriptions to create a database of regular verbs, where each row of the database contains the third-person singular present tense wordform and past tense wordform for a given verb. For both datasets we have gold standard solutions for both rule sets and underlying forms, provided by one of our coauthors, who is a phonologist.

Textbook Problems
For this category, we curated a set of 34 problems from phonology textbooks (Gussenhoven and Jacobs, 2017; Odden, 2005; Roca and Johnson, 1999) by selecting problems with local, non-interacting rules. For each problem, the input data set is tailored (by the textbook author) to illustrate a particular phonological process, and contains 20-50 surface forms. For all of these problems we have gold standard solutions, either provided with the textbook or inferred by a phonologist. Gold standard solutions range in complexity from one to two rules. Out of the 34 problems, 21 are shared with Ellis et al. (2019), which we use as the baseline for inference times.
Following the textbooks, these problems are further subdivided into 10 matrix problems, 20 alternation problems, and 4 supervised problems. The matrix problems follow the format presented in Sec. 2. The alternation and supervised problems are easier, as we are given more information about the underlying form. For alternation problems, we are essentially given a set of choices for what the underlying form might be, and for supervised problems the underlying form is given exactly. These problems include examples of phones in complementary distribution. Our problem decomposition allows us to switch out underlying forms inference to handle different kinds of input. According to this classification, the flapping lexical database is an alternation problem and verbs is a matrix problem.

Experiments
We evaluate our system on the two categories of data sets discussed in Sec. 4. We split the 34 textbook problems into 24 development and 10 test problems. Our system has several free parameters (most importantly, the simplicity weight $w_s$). These were hand-tuned on all of the data except the test problems. For the test problems, we only added missing sounds to the inventory as needed. The 10 test problems are all alternation problems. We leave for future work the investigation of these hyperparameter settings on new matrix problems.

Lexical Database Experiments
We evaluate our system on two large English datasets, one demonstrating flapping, and the other verbs. For each dataset, we learn a rule set from 20, 50 and 100 data points, and evaluate its accuracy on the remaining data. We also perform a syntactic comparison of the rule set against the gold standard rules, which we report as average precision and recall among the sets of features in the two rules. Finally, we compare the latent underlying forms we inferred for each problem to the known correct underlying forms. Tab. 4 summarizes the results. Tab. 5 (rows 1-3) shows the actual rules inferred on the three flapping training sets.
Table 4: Accuracy results for the English flapping and verbs corpora data sets on 20, 50 and 100 training examples. SYPHON (SP) and SYPHON- (SP-) are two variants of our model, with and without likelihood, respectively. Accuracy reports the generalization accuracy on unseen inputs; rule match and UF indicate how well the inferred rule and underlying form, respectively, match the gold standard.
To examine the importance of likelihood in our system, we repeat this experiment for a variant of our system, SYPHON-, which does not consider likelihood and simply optimizes our simplicity prior. As the number of data points increases, the effect of the likelihood grows, and so for SYPHON the recall compared to the gold standard quickly climbs. By contrast, the recall of SYPHON- plateaus, which shows that likelihood is indeed important for finding good rules.

Textbook Problem Experiments
We evaluate the textbook problems under the following three experimental conditions. First, to evaluate the generalization accuracy for unseen inputs, for each of the problems, we hold out a randomly sampled 20% of the data. We then learn a rule set on the remaining data, and validate it against the held out data. We repeat this process three times, and report the average accuracy for each class of problems in Tab. 6. We also evaluate syntactic accuracy of the rules and of underlying forms, in the same way as for the lexical databases. Additionally, we evaluate our system on 10 test problems, which were left out entirely when tuning the system hyperparameters. We report the same metrics for these problems. Tab. 5 shows inferred rules for selected development problems (rows 4-8) and test problems (rows 9-13).
The accuracy of our inferred rules and underlying forms is 100% for all textbook problems. This is not surprising: the combination of hard constraints and a restrictive DSL makes inferring incorrect rules or underlying forms very difficult. More interesting is the syntactic comparison to the gold standard. This measure is intended to estimate how well the rules SYPHON produces match phonologists' intuition. The results in Tab. 6 confirm that without the likelihood term, inference tends to exclude important features from the rule condition: the recall for held out problems goes down by 21%.
Finally, we compare inference times of SYPHON with the prior work of Ellis et al. (2019), which is also based on constraint-based program synthesis but does not perform problem decomposition, instead using the global encoding outlined in Sec. 3.2. As shown in Tab. 7, the decomposition makes SYPHON at least two orders of magnitude faster, with an average inference time of just 30 seconds for matrix problems, thus enabling phonologists to use the tool interactively.

Related Work
Learning (morpho-)phonology is a rich and active area of research; in this overview, we focus on approaches that share our goal of inferring fully interpretable models of phonological processes. Most closely related to ours is the work of Ellis et al. (2015) and its (unpublished) follow-up (Ellis et al., 2019) on using program synthesis to infer phonological rules. As mentioned above, the main difference is that SYPHON is two orders of magnitude faster than their system thanks to a novel decomposition and efficient SMT encoding. On the other hand, we impose extra restrictions on the hypothesis space (i.e. we only support local rules), which means that SYPHON is unable to solve some of the harder textbook problems that Ellis et al. (2019) can solve. In addition, Ellis et al. (2019) propose a method for inducing phonological representations which are universal across languages.
Beyond program synthesis, Rasin et al. (2017) use a comparable description length-based approach to unsupervised joint inference of underlying phonological forms and rewrite rule representations of phonological processes, but use a genetic algorithm to find approximate solutions. Gildea and Jurafsky (1996) and Chandlee et al. (2014) discuss supervised learning of restricted classes of finite-state transducer representations of several phonological processes (including English flapping). To date, such work either requires thousands of training observations (Gildea and Jurafsky, 1996) or has used abstracted and greatly simplified symbol inventories and training data (Chandlee et al., 2014). Hayes and Wilson (2008), Goldsmith and Riggle (2012), and Futrell et al. (2017) propose different methods for learning probabilistic models of phonotactics, which represent gradient co-occurrence restrictions between surface segments within a word. Unlike the current implementation of SYPHON, these models include representational structures that enable them to capture certain non-local phenomena. However, because these models focus on phonotactics, they do not infer underlying forms or rules which relate underlying forms to surface forms.
Finally, much work has focused on learning representations of phonological processes as mappings that minimally violate a set of ranked or weighted constraints (Prince and Smolensky, 2004; Legendre et al., 1990), but such work has generally taken the constraint definitions as given and focused on learning rankings or weights (see e.g. Goldwater and Johnson, 2003; Tesar and Smolensky, 2000; Boersma and Hayes, 2001), with some exceptions (Doyle et al., 2014; Doyle and Levy, 2016).

Conclusion
We have presented a new approach to learning fully interpretable phonological rules from sets of related surface forms. We have shown that our approach produces rules that largely match linguists' intuition from a handful of examples and within minutes. The contributions of this paper are a novel decomposition of the global inference problem into three local problems, as well as an encoding of these problems into constraints that can be efficiently solved by an SMT solver.