Canonical Context-Free Grammars and Strong Learning: Two Approaches

Strong learning of context-free grammars is the problem of learning a grammar which is not just weakly equivalent to a target grammar but isomorphic or structurally equivalent to it. This is closely related to the problem of defining a canonical grammar for the language. The current proposal for strong learning of a small class of CFGs uses grammars whose nonterminals correspond to congruence classes of the language, in particular to a subset of those that satisfy a primality condition. Here we extend this approach to larger classes of CFGs where the nonterminals correspond instead to closed sets of strings: to elements of the syntactic concept lattice. We present two different classes of canonical context-free grammars. One is based on all of the primes in the lattice; the other, more suitable for strong learning algorithms, is based on a subset of primes that are irreducible in a certain sense.


Introduction
This paper is concerned with the problem of strong learning of context-free grammars in the distributional framework. One approach, initiated in (Clark, 2014) is to develop strong learning algorithms by defining canonical grammars based on properties of algebraic structures associated with the language: specifically the syntactic monoid of the language. In that paper a strong learning result was presented for a subclass of the class of substitutable languages, languages which have a simple language theoretic closure property.
In this paper we will extend the canonical grammar ideas to a larger class of grammars, while not presenting a full strong learning result, for reasons of space and some technical details not yet resolved. Rather than using the syntactic monoid, we use the syntactic concept lattice (SCL), (Clark, 2013), a richer structure that is suitable for modeling all context-free grammars. In the case of substitutable languages the syntactic monoid is almost identical to the syntactic concept lattice.
We want these canonical grammars to be as unambiguous as possible, and to use as few nonterminals as possible. These two obvious principles pull in the same direction: a grammar with extra nonterminals will typically have extra derivations and thus a higher degree of ambiguity. Finding some global minimum leads in general to intractable computational problems (the set covering problem, a classic NP-hard problem), and the answer may be indeterminate, in that there may be two structurally distinct minima. So rather we stipulate some technical notion which is more determinate, and can be efficiently identified (though we do not discuss the algorithmic issues here). In particular, we want the grammars defined to be compatible with efficient learning algorithms for context-free grammars (Yoshinaka, 2012a; Leiß, 2014).
In the case of the monoid, we only have one operation, concatenation, and given a derivation tree with unlabeled interior nodes, each node in the tree can only be legally labeled with the unique congruence class of the yield of the subtree. Thus given the unlabeled trees, the labeling is determined.
In the case of the SCL, we have a lattice structure, and so there are many different possible ways of modeling the structure, and many different ways of labeling a given tree, from the very specific to the very general. We previously argued in (Clark, 2011) that the most general labelings would be optimal; that view now seems simplistic. We argue that we should model only the unpredictable parts of the structure, that is to say those places where the structure differs from the free structure P(Σ*). The grammar does not need to state that {u} · {v} = {uv} or that {u} ∪ {v} = {u, v}: these are true in the free structure. It is only when these are not equal that we need to represent the difference.
We will give definitions of the basic mathematical concepts we use in Section 2, including a brief introduction to the syntactic concept lattice in Section 2.4 to make the paper self-contained.
In Section 3 we explain the relation between strong learning and canonical grammars.
Then in Section 4 we extend the definition of primes from congruence classes to closed sets of strings. Section 5 presents our first family of canonical grammars that are based directly on all of the primes in the language.
In the case of concepts the lattice structure means that there may be many different concepts that contain a given string, and so in Section 6 we discuss how to exploit the lattice structure to select a smaller set of categories that are irreducible in some sense; then in Section 7 we present a second family of canonical grammars based on this restricted subset of the primes. We finish with a worked example to illustrate the abstract mathematical development, followed by some discussion.

Strings, Languages and Contexts
We assume a fixed alphabet Σ and write Σ* for the set of strings. A language is a subset of Σ*; we write the concatenation of languages L, M as L · M, or sometimes just LM. The empty string is λ. We take a symbol □ ∉ Σ, and using this we define a context as an element of Σ*□Σ*, written l□r. We define l□r ⊙ w = lwr, and extend this to sets of strings and contexts in the usual way. The empty context □ = λ□λ is particularly important: of course □ ⊙ w = w.
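The context machinery above is simple enough to state concretely. A minimal sketch in Python, where a context l□r is modeled as a pair of strings (an implementation choice made here for illustration, not part of the paper's formalism):

```python
# A context l□r is modeled as a pair (l, r); wrapping a string w in it
# yields lwr. The empty context λ□λ is the identity.
def wrap(ctx, w):
    l, r = ctx
    return l + w + r

# Pointwise extension to sets of contexts and sets of strings.
def wrap_sets(ctxs, ws):
    return {wrap(c, w) for c in ctxs for w in ws}

assert wrap(("a", "b"), "x") == "axb"
assert wrap(("", ""), "x") == "x"            # λ□λ ⊙ w = w
assert wrap_sets({("a", "")}, {"x", "y"}) == {"ax", "ay"}
```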

Grammars
We define CFGs standardly as a tuple ⟨Σ, V, S, P⟩ where S ∈ V is a single start symbol, V is the set of nonterminals and P is a finite subset of V × (Σ ∪ V)*, written as N → α. The derivation process is denoted by ⇒ and ⇒*. We define L(G, N) = {w ∈ Σ* | N ⇒*_G w} and define L(G) = L(G, S). We also define the set of derivation contexts: C(G, N) = {l□r | S ⇒*_G lNr}. The following property corresponds to the context-free property of the derivation process: C(G, N) ⊙ L(G, N) ⊆ L(G). Two grammars G₁ = ⟨Σ, V₁, S₁, P₁⟩ and G₂ = ⟨Σ, V₂, S₂, P₂⟩ are weakly equivalent if L(G₁) = L(G₂). They are isomorphic if there is a bijection φ : V₁ → V₂ such that φ(S₁) = S₂ and φ(P₁) = P₂, extending φ to productions and sets of productions in the natural way. Isomorphic grammars are identical except for a relabeling of the nonterminal symbols. Clearly isomorphism implies weak equivalence.

Lattices
We assume some familiarity with lattices: see (Davey and Priestley, 2002) for basic definitions. We write ⊤, ⊥, ∨, ∧ as standard. An element x is join-irreducible if x = y ∨ z implies x = y or x = z; meet-irreducible elements are defined dually. For some lattices, the set of join-irreducible elements and the set of meet-irreducible elements can form a "basis" for the lattice, in that every element can be represented as a finite join of join-irreducible elements and/or a finite meet of meet-irreducible elements. In the lattice of all subsets of Σ*, P(Σ*), the join-irreducible sets are the singleton sets {w} for any string w, and the meet-irreducible sets are the complements Σ* \ {w}.
A descending chain is a strictly descending sequence of elements of a lattice, X₀ ⊃ X₁ ⊃ · · · ⊃ Xₙ. A lattice satisfies the descending chain condition (DCC) if there are no infinite descending chains. If a lattice satisfies the DCC, then every nonempty subset has at least one minimal element. We define the ascending chain condition (ACC) dually.

The Syntactic Concept Lattice
We now describe the syntactic concept lattice briefly; for fuller descriptions see e.g. (Clark, 2013; Leiß, 2014; Wurm, 2012). Given a fixed language L, we have a Galois connection between sets of strings and sets of contexts, defined, where S is a set of strings and C is a set of contexts, by S′ = {l□r | ∀w ∈ S, lwr ∈ L} and C′ = {w | ∀l□r ∈ C, lwr ∈ L}. A closed set of strings is a set of strings S such that S = S′′; a closed set of contexts is one such that C = C′′. A concept is an ordered pair ⟨S, C⟩ such that S′ = C and C′ = S. In this case both S and C are closed. We will therefore often refer to a concept through the corresponding closed set of strings. Note that for any such concept, the following property holds, which corresponds to the context-free property of CFG derivations: C ⊙ S ⊆ L. Clearly w ∈ L iff □ ∈ {w}′, and so L is closed. The Syntactic Concept Lattice of L, written B(L), is the collection of concepts, ordered by inclusion of the closed sets of strings, with constants ⊤ and ⊥, and with operations which we can define in terms of the closed sets of strings alone: X ∨ Y = (X ∪ Y)′′, X ∧ Y = X ∩ Y, and a concatenation X ∘ Y = (X · Y)′′. With these operations B(L) is a complete idempotent semiring, and furthermore a complete residuated lattice.
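For a finite language the two maps of this Galois connection can be computed directly. The sketch below (an illustration, not part of the paper's formal development) restricts attention to the contexts and substrings that actually occur in L, which suffices for the finite examples used in the paper; the language {a, aa} used later in Example 3 comes out as expected:

```python
def contexts_of(S, L):
    """S' : contexts (restricted to those occurring in L) shared by all of S."""
    ctxs = {(w[:i], w[j:]) for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    return {(l, r) for (l, r) in ctxs if all(l + w + r in L for w in S)}

def strings_of(C, L):
    """C' : substrings of words of L that fit every context in C."""
    subs = {w[i:j] for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    return {w for w in subs if all(l + w + r in L for (l, r) in C)}

def closure(S, L):
    """S'' : the smallest closed set of strings containing S."""
    return strings_of(contexts_of(S, L), L)

L = {"a", "aa"}
assert closure({"a"}, L) == {"a"}            # {a} is closed
assert closure({""}, L) == {""}              # {λ}'' = {λ} here
# {λ, a} is also closed: it is C' for the context set C = {a□, □a}
assert strings_of({("a", ""), ("", "a")}, L) == {"", "a"}
```

Note that when the context set is empty the true value of C′ is all of Σ*; restricting to substrings of L is harmless only because every string with a nonempty distribution in a finite L is such a substring.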
This lattice forms a hierarchy of all distributionally definable sets of strings in the language; it has a finite number of elements iff the language is regular. Minimal grammars will have nonterminals that correspond to elements of the syntactic concept lattice, as shown by (Clark, 2013). Given a context-free grammar G such that L(G) = L, we define a universal morphism h_L : V → B(L) given by h_L(N) = L(G, N)′′. We extend this to a CFG morphism in the obvious way. (Clark, 2013) proved that for all CFGs, L(h_L(G)) = L. Therefore any CFG for L can be mapped to a possibly smaller grammar whose nonterminals are elements of B(L). We can therefore assume that the nonterminals of the grammar are elements of B(L).

Weak and Strong Learning
We will not present any learning algorithms here, but the work is motivated by learning considerations and so we need to make the background assumptions clear. In standard models of learning, there is a target grammar G* and the learner, using information only about L(G*), must eventually return a grammar Ĝ such that L(Ĝ) = L(G*). In strong learning (Clark, 2014), in contrast, given the same information source, the learner must pick a Ĝ such that Ĝ is isomorphic to G*. (Clark, 2014) observes that the existence of a canonical grammar is a necessary condition for a strong learning algorithm: any strong learning algorithm will implicitly define a canonical grammar for any language in the class of languages that it learns. Much of that paper is in fact concerned with precisely that definition. Accordingly, in this paper we focus on defining a canonical grammar rather than directly presenting a learning algorithm.
The universal property of the syntactic concept lattice is an important tool: rather than dealing directly with CFGs, which are arbitrary and intractable, we can deal with the lattices B(L), which have nice mathematical properties. We can assume without loss of generality that the nonterminals of the grammar will correspond to concepts or closed sets of strings: to elements of the lattice. Given this, there is a natural notion of a production being correct: a production N → α is correct if the concatenation of the closed sets of strings on its right hand side is a subset of N. Given that a language that is not regular will have an infinite number of concepts, we need a principled way of selecting a finite number of these in an appropriate way so that we have a finite grammar. The general approach we take is to identify some elements that are irreducible in some sense with respect to the algebraic structure of the residuated lattice.
In the case of substitutable grammars, the closed sets of strings are almost exactly the congruence classes: except for ⊤, ⊥ and {λ}′′, every closed set of strings is either equal to a congruence class or to a congruence class together with λ. There seems to be only one plausible way to define a grammar, given that the mathematical structure of the congruence classes is just a monoid. Since this structure is so simple, there is only one reasonable irreducibility property that we can use to select from the congruence classes: primality, which we define later. In the case of general CFLs things are unsurprisingly much more complicated. There seem to be two different factors to be considered. One factor concerns, as in the case of the monoid, the concatenation structure of the strings (the monoid structure of B(L)); the other concerns the partial order: the lattice structure of B(L).
We start by discussing the concatenation structure in Section 4 and discuss the lattice structure later in Section 6.

Primes
Since a monoid is a very simple algebraic structure, with a single associative binary operation, there is only one reasonable technique to define a subset of elements of the syntactic monoid in such a way that the grammar based on those elements is well behaved. (Clark, 2014) argues that we should represent only those elements where concatenation differs from the free operation of concatenation: in other words where [uv] ⊃ [u] · [v]. Language theoretically these represent places where the monoid has some nontrivial structure, and grammatically they provide evidence for a nontrivial nonterminal: a nonterminal which occurs on the left hand side of more than one production. Congruence classes which have this desirable property are called primes.
For a congruence class the definition of a prime is straightforward. If a nonzero nonunit congruence class X has a nontrivial decomposition into two congruence classes Y, Z such that X = Y · Z, then it is composite. The trivial decompositions are X = X · [λ] and X = [λ] · X. The zero congruence class, if it exists, is the class of strings with empty distribution, and the unit is [λ]. A nonzero nonunit congruence class that is not composite is prime; we write P(L) for the set of primes of a language L. In the case of a CFL which is not substitutable, we need a different criterion, since we may use concepts that are not congruence classes but unions of congruence classes. This is complicated by the fact that the empty string may occur in many different closed sets of strings.
For closed sets of strings the unit is {λ}′′, the smallest concept that contains λ, and note that here we do not exclude it: we call it the unit prime. Clearly for any closed set of strings X we have X = X · {λ}′′ = {λ}′′ · X, so for the primality condition to be nontrivial we need to exclude such decompositions; but we also want to exclude cases such as a* = a* · a*.

Definition 1. A decomposition X = Y · Z of a closed set of strings is trivial if Y or Z is equal to {λ}′′ or to X itself. A nonzero closed set of strings is prime if all of its decompositions are trivial; otherwise it is composite.

Example 1. The closed set {a}′′ is prime. Suppose {a}′′ = B · C; since a ∈ {a}′′, either a ∈ B and λ ∈ C, or vice versa. Assume the former. Since a ∈ B, we have B ⊇ {a}′′, and since λ ∈ C, we have B ⊆ B · C = {a}′′. Therefore B = {a}′′; in the other case, by a similar argument, C = {a}′′. Either way the decomposition is trivial, so {a}′′ is prime. Note also that if {λ}′′ = B · C then λ ∈ B and λ ∈ C; since λ ∈ C we have B ⊆ B · C = {λ}′′, and since {λ}′′ is the smallest concept that contains λ, B and C must both be equal to {λ}′′.
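These definitions can be checked mechanically for small finite languages. The following brute-force sketch (an illustration, assuming the convention suggested above: a decomposition is trivial when a factor is the unit {λ}′′ or the set itself, and the unit counts as prime) enumerates closed sets and tests primality via set concatenation:

```python
from itertools import combinations

def closure(S, L):
    """S'' for a finite language L, over the substrings/contexts of L."""
    ctxs = {(w[:i], w[j:]) for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    subs = {w[i:j] for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    C = {(l, r) for (l, r) in ctxs if all(l + w + r in L for w in S)}
    return frozenset(w for w in subs
                     if all(l + w + r in L for (l, r) in C))

def closed_sets(L):
    """All closed sets generated by nonempty sets of substrings of L."""
    subs = sorted({w[i:j] for w in L
                   for i in range(len(w) + 1) for j in range(i, len(w) + 1)})
    return {closure(set(combo), L)
            for n in range(1, len(subs) + 1)
            for combo in combinations(subs, n)}

def cat(X, Y):
    return frozenset(x + y for x in X for y in Y)

def is_prime(X, closed, unit):
    """Prime: only trivial decompositions (a factor is the unit or X itself)."""
    if X == unit:
        return True                      # the unit prime, by convention
    return not any(cat(Y, Z) == X
                   for Y in closed for Z in closed
                   if Y not in (unit, X) and Z not in (unit, X))

L = {"a", "aa"}
cs = closed_sets(L)
unit = closure({""}, L)                  # {λ}'' = {λ} here
A, B = frozenset({"a"}), frozenset({"", "a"})
assert A in cs and B in cs
assert is_prime(A, cs, unit) and is_prime(B, cs, unit)
assert not is_prime(frozenset(L), cs, unit)   # L = A · B is composite
```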
Definition 2. If X ∈ B(L) and α ∈ P(L)⁺ is a nonempty string of primes, we write ᾱ for the concatenation of the primes in α: if α = A₁, . . . , Aₙ then ᾱ = A₁ · · · Aₙ. We say that α is a prime decomposition of X iff ᾱ = X, and none of the elements of α are the unit. In the special case where X = {λ}′′ we consider ⟨{λ}′′⟩ to be a prime decomposition. We need to consider two cases: one where a prime contains λ and one where it does not. If a closed set of strings contains the empty string, that means that it represents an optional category: it can be replaced by the empty string. If X is a closed set of strings that contains λ and X = Y · Z, then clearly λ ∈ Y ∩ Z and Y ⊆ X and Z ⊆ X. Therefore a nontrivial decomposition of a concept that contains λ will be into proper subsets of that concept. Decompositions using concepts with λ may not terminate, if the lattice has infinite descending chains.
Example 2. Let L₁ = (ba)* and Lₙ = Lₙ₋₁ · (baⁿ)*. Consider the language L = ⋃ₙ Lₙ. This is a closed set of strings with an infinite descending chain L ⊃ L \ L₁ ⊃ L \ (L₁ ∪ L₂) ⊃ · · · . For each n, (baⁿ)* is closed and prime, and L has no finite prime decomposition.
Lemma 3. If B(L) satisfies the DCC then every element has a prime decomposition.
Proof. If X is prime, then it has a length-one decomposition, ⟨X⟩. Define the width of a nonempty set of strings to be the minimum length of a string in the set; if it contains the empty string, then the width is zero.
Let M be the set of all non-zero non-unit concepts without prime decompositions. Suppose it is nonempty; then it has at least one minimal element, by the DCC. Take a minimal element of minimal width, X, and suppose X has width n. It is not prime by assumption, and so X = Y · Z nontrivially. Case 1: the width of X is zero, and so both Y and Z contain the empty string; therefore Y and Z are both proper subsets of X. Therefore they are not in M (since X is minimal). Moreover they are not zero or unit, therefore they have prime decompositions Y = ᾱ and Z = β̄, and therefore αβ is a prime decomposition of X. Case 2: the width of X is greater than zero, and the widths of Y and Z are both less than the width of X. Then Y and Z are both not in M, and therefore both have prime decompositions, and therefore so does X. Case 3: the width of X is greater than zero, and one of Y or Z has width zero. Assume that λ ∈ Y (the other case is identical); then Z is a proper subset of X and therefore has a prime decomposition, and Y has width less than that of X and therefore also has a prime decomposition. In each case X has a prime decomposition, which is a contradiction.
These decompositions aren't necessarily unique, but this lemma shows that the set of primes is sufficiently large to express any concept we want through finite concatenations.
It is not the case that every closed set of strings that is composite has a unique prime decomposition, even if we restrict ourselves to maximal decompositions: decompositions where no element can be replaced with a larger one. Clearly we can decompose, for example, a* into {λ, aᵏ} · (a* \ {aᵏ}) for any k, and for a suitable language these can be prime.
Example 3. Consider the language L = {a, aa}. This has closed sets of strings A = {a} and B = {λ, a}, with L = A · B = B · A. So L is composite and it has two distinct prime factorisations, which are clearly both maximal.
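Example 3 is easy to verify mechanically; a two-line check in Python, with λ written as the empty string:

```python
# The closed sets of Example 3, for L = {a, aa}: A = {a}, B = {λ, a}.
L = {"a", "aa"}
A, B = {"a"}, {"", "a"}
cat = lambda X, Y: {x + y for x in X for y in Y}
# Both orders of concatenation give L, so L has two distinct factorisations.
assert cat(A, B) == L and cat(B, A) == L
```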
It simplifies the analysis here if we assume that all the concepts in a language have unique prime decompositions; accordingly we will restrict ourselves to that case for the moment, though we will remove this requirement in Section 7.
Definition 3. A language has the unique factorisation property (UFP) if every closed set of strings with a nonempty distribution has a unique maximal prime factorisation; as before, we stipulate that {λ}′′ has such a unique factorisation.
If a language has the UFP, and P is a closed set of strings, then we can write Φ(P) for the unique prime factorisation of P, which is a string of primes; in the case where P is itself prime, Φ(P) is the length-1 string ⟨P⟩. If α = Φ(P) then ᾱ = P. Now if we have languages which satisfy the UFP and the chain conditions and have a finite number of primes, then we can define a unique grammar.
Definition 4. We define the class of languages L_P to be the set of languages which satisfy all of the following three conditions: 1. they have the unique prime decomposition property, 2. they have no infinite ascending or descending chains, and 3. they have a finite number of primes.
All substitutable context-free languages with a finite number of primes satisfy these conditions, and so this is an extension of the approach in (Clark, 2014). All regular languages have a finite syntactic concept lattice, and therefore a finite number of primes and no infinite chains, but they may not be uniquely decomposable.
Suppose we have a language L ∈ L_P and a prime N; we can construct a set of productions with N on the left hand side as follows.
Definition 6. Define Γ(N) = {α ∈ P(L)⁺ | ᾱ ⊂ N}, with the pre-order α ⪯ β iff ᾱ ⊆ β̄. We can take the maximal elements of this set, max Γ(N): the elements of max Γ(N) will be the right hand sides of productions with N on the left hand side.
Lemma 4. If B(L) has no infinite ascending chains, then every element of Γ(N) is below some element of max Γ(N).

Proof. Given α ∈ Γ(N), consider the set {γ ∈ Γ(N) | γ̄ ⊇ ᾱ}. If B(L) has no infinite ascending chains, then every nonempty such set has a maximal element; any maximal element of this set clearly satisfies the condition required.
Note that ⟨N⟩ is not in Γ(N), since we require a strict inclusion; this saves the definition from vacuity. If we have infinite ascending chains then we may not have a maximal element; this motivates the use of an ascending chain condition.

Canonical Grammar based on primes
We are now in a position to define a set of canonical grammars for L P . Given the nature of the problem it is inevitable that we will have to have a large number of restrictions on the sorts of grammars that we can learn. There are two ways of framing these restrictions either as restrictions on the grammars that generate the language or as language theoretic restrictions themselves. Here we stick to the latter approach; we take a language, and define criteria that define a class of languages. As the reader will see though, we can express the constraints we need in purely language theoretic terms, though those constraints will correspond naturally to finiteness constraints on the various sets of nonterminals and productions.
Definition 7. For each language L ∈ L_P we define the grammar G_P(L) as the tuple ⟨Σ, V, S, P⟩ where
• V = P(L) ∪ {S}, where S is a distinguished symbol,
• P is the union of the following sets of productions, for all N ∈ P(L): N → a if a ∈ N and a ∈ Σ ∪ {λ}; N → α for every α ∈ max Γ(N); and S → Φ(L).

Lemma 5. For every prime N, L(G, N) ⊇ N.
Proof. Induction on the length of w. Base case |w| ≤ 1: if w ∈ N then there is a rule N → w. Otherwise, if w ∈ N and w = a₁ . . . aₙ with aᵢ ∈ Σ and n > 1, then since N ⊇ {a₁}′′ · · · {aₙ}′′ and each {aᵢ}′′ is prime, this string of primes is in Γ(N). Therefore there is some α ∈ max Γ(N) such that w ∈ ᾱ, and a production N → α.
Since w ∈ ᾱ we have strings w₁, . . . , wₖ with w₁ · · · wₖ = w, where k = |α|, α = N₁ . . . Nₖ, and wᵢ ∈ Nᵢ for each 1 ≤ i ≤ k. If each wᵢ has length less than n, then by induction Nᵢ ⇒* wᵢ, and therefore N ⇒* w. We should also consider the case where k = 1, in which case we have a unary rule, and the case where all but one of the wᵢ have length 0. In both cases we have one prime Q strictly less than N, and since we have a finite number of primes and no unary cycles, a simple induction on the derivation height suffices.
Lemma 6. The grammar G_P(L) is finite and generates L.
Proof. Note that all rules are correct, and thus by induction we can show that L(G, N) ⊆ N; combined with the previous lemma, this tells us that L(G, P) = P for all primes P.
It is finite since we cannot have two productions whose right hand sides start with the same prime. Note that B(L) is a residuated lattice, and that for any two closed sets of strings X, Y, the residual X\Y = {w | X · {w} ⊆ Y} is closed. If P → Nα and P → Nβ are both productions with |α| > 0 and |β| > 0, then since ᾱ ⊆ N\P and β̄ ⊆ N\P, we have α = β, since they are both equal to the unique prime factorisation of N\P. Therefore if there are only n primes, we can have at most 2n + |Σ| + 1 productions with P on the left hand side.
Finally we verify that S generates the right strings, which is immediate. Either L is prime, in which case it is trivial since we have a production S → L, or L has prime decomposition N₁ . . . Nₖ, in which case the production S → N₁ . . . Nₖ, combined with the fact that L(G, Nᵢ) = Nᵢ, gives L(G, S) = L.
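The residuation operation used in the finiteness argument above can also be illustrated concretely; a sketch over a finite universe of candidate strings (sufficient for finite languages, and intended only as an illustration of X\Y = {w | X · {w} ⊆ Y}):

```python
def residual(X, Y, universe):
    """X\\Y: the largest Z (within the given universe) with X . Z contained in Y."""
    return {w for w in universe if all(x + w in Y for x in X)}

L = {"a", "aa"}
universe = {"", "a", "aa"}           # substrings of words in L
A, B = {"a"}, {"", "a"}
# A\L is the largest Z with A . Z contained in L; here it is exactly B,
# which is a closed set of strings, as the text asserts.
assert residual(A, L, universe) == B
```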
This gives us a canonical grammar class for L P but the grammars are still very redundant and ambiguous. In the next section we consider how to select a smaller set of nonterminals.
Lemma 7. Every language with a finite number of primes and no infinite chains is a context-free language.
Proof. Clearly L_P is a proper subset of the context-free languages; but even if we do not have unique prime decompositions, we can still get a CFG by picking some shortest prime decomposition nondeterministically.
As we shall see, we can weaken these conditions, since they are not necessary; here is an example of a CFL with infinite chains that still receives an adequate canonical grammar.
Example 4. Let L = {w ∈ {a, b}⁺ : |w|_a > |w|_b}. The congruence classes are obviously indexed by |w|_a − |w|_b; call these Eₙ = {w | |w|_a − |w|_b = n}. The closed sets of strings are Cₙ = {w | |w|_a − |w|_b ≥ n}. L = C₁ is prime, C₋₁ is prime, and C₀ is prime, while Cₙ for n > 1 is composite.
For n > 1 we have Cₙ = C₁ · Cₙ₋₁; ergo these are composite, and similarly for n < −1. Therefore there are exactly three primes. However this language still has a simple canonical grammar.
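The compositeness claim in Example 4 can be sanity-checked by brute force on bounded-length strings; the following sketch verifies C₂ = C₁ · C₁ for all strings of length at most 6 (a bounded check, not a proof):

```python
from itertools import product

def diff(w):                             # |w|_a - |w|_b
    return w.count("a") - w.count("b")

N = 6
words = ["".join(p) for n in range(1, N + 1) for p in product("ab", repeat=n)]
C1 = {w for w in words if diff(w) >= 1}
C2 = {w for w in words if diff(w) >= 2}
# C1 . C1, truncated to the lengths we enumerate: every string with
# surplus >= 2 splits at the first prefix reaching surplus 1.
prod = {x + y for x in C1 for y in C1 if len(x + y) <= N}
assert prod == C2
```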

Lattice structure
For general context-free languages there may be very many prime concepts, and as a result grammars based on all primes may be excessively ambiguous, and unsuitable for the description of natural languages.² Indeed there are finite languages where the number of primes is exponential in the number of strings in the language.

Example 5. For some large Σ = {a₁, . . . , aₙ} define L = {aᵢaⱼ | i ≠ j}. Clearly |L| = n(n − 1). Every nonempty proper subset X of Σ is closed, defined by the set of contexts {□aᵢ | aᵢ ∉ X}. B(L) therefore has 2ⁿ + 1 concepts, none of which are composite.
In a case such as this, the grammar defined by G_P(L) will be exponentially larger than |L|, which is clearly undesirable.
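Example 5's claim that every nonempty proper subset of Σ is closed can be checked directly for small n; a sketch with n = 4, computing closures over the substrings and contexts occurring in L (enough here, since every string with nonempty distribution is such a substring):

```python
from itertools import combinations

def closure(S, L):
    """S'' for a finite language L, over the substrings/contexts of L."""
    ctxs = {(w[:i], w[j:]) for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    subs = {w[i:j] for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    C = {(l, r) for (l, r) in ctxs if all(l + w + r in L for w in S)}
    return frozenset(w for w in subs
                     if all(l + w + r in L for (l, r) in C))

n = 4
Sigma = [chr(ord("a") + i) for i in range(n)]          # a, b, c, d
L = {x + y for x in Sigma for y in Sigma if x != y}    # |L| = n(n-1) = 12
# Every nonempty proper subset of Sigma is a closed set of strings.
for r in range(1, n):
    for X in combinations(Sigma, r):
        assert closure(set(X), L) == frozenset(X)
```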
Moreover, the previous approach relies on the number of primes in the language being finite. Many simple languages have an infinite number of primes, though, and so it is natural to try to extend this approach by considering some additional properties that might serve to pick out a finite subset of these primes. We need some additional constraints to get smaller and less ambiguous grammars. While it is natural to look to the meet and join irreducible elements of the syntactic concept lattice, it seems better to use a slightly larger set: the images of the irreducible elements of the free lattice. We call these semi-irreducible; we would like to use the terms join-prime and meet-prime, but these already have different meanings in lattice theory.

Definition 8. Suppose X ∈ B(L); we say that X is join-semi-irreducible (JSI) if X = {w}′′ for some w ∈ Σ*. We say that X is meet-semi-irreducible (MSI) if X = {l□r}′ for some l, r ∈ Σ*.
Observe that if X is join-semi-irreducible then it contains some strings that are not in any lower concepts. Similarly if X is meet-semi-irreducible then it contains some contexts that do not occur in any higher concepts. Note that L is always MSI.
We will illustrate this with a simple example, using a finite language.

Example 6. Consider the language generated by the example grammar: a finite language which consists of 11 strings, all of length 2. The string cz receives two parses under this grammar. Figure 1 shows the lattice for this language. Note that the ambiguous letters/words {c, m, z} are at the bottom of the diagram. For example {c}′′ = C(A) ∪ C(B); this is clearly JSI, since it is defined by c, but not MSI. At the top of the diagram are concepts that are MSI, but not JSI. In the middle, marked with boxes, we have the concepts that are MSI-JSI.
We can represent every element either as a meet of MSI-concepts or a join of JSI-concepts.
We can now discuss the role of these concepts. Suppose we have some derivation in a grammar S ⇒* lNr ⇒* lwr, where N is a nonterminal corresponding to some concept.
This places two constraints on N. On the one hand l□r ∈ N′; in other words N ≤ {l□r}′, which is an MSI concept. On the other hand w ∈ N; in other words N ≥ {w}′′, a JSI concept. Clearly {l□r}′ ≥ {w}′′, since lwr ∈ L. In the special case where {l□r}′ = {w}′′, we know that this must be the value of N. Therefore the elements that are MSI, JSI and prime are very special.
Definition 9. A closed set of strings is an MSI-JSI-prime if it is MSI, JSI and prime.

Grammar based on MSI-JSI-primes
We now consider how to define a class of grammars where the nonterminals consist only of the set of MSI-JSI-primes. Rather than requiring that the lattice contains no infinite chains, and using the UFP, we define a weaker property, which more directly determines the finiteness of the relevant sets of productions. We define the non-standard term finitely Noetherian.
Definition 10. We say that a set Γ ⊆ B(L)⁺, with the pre-order α ⪯ β iff ᾱ ⊆ β̄, is finitely Noetherian if | max Γ | is finite and every element of Γ is below some maximal element.
A set can fail to be finitely Noetherian either because it has infinite ascending chains with no maximal elements, or because it has an infinite number of maximal elements.
Definition 11. L_MJ is the set of all languages L such that:
1. L is nonempty and does not contain λ.
2. The set V of MSI-JSI-primes is finite, and the sets Γ(X), for each X ∈ V, and ∆(L) (defined below) are finitely Noetherian.
3. For each a ∈ Σ and each l□r ∈ {a}′ there is some X ∈ V such that a ∈ X and l□r ∈ X′.
L MJ is incomparable with L P , as the following two examples show.
Example 7. L = {ax, bx, ay, cy, bz, cz}. This language is in L_P but not in L_MJ, as there are no MSI-JSI-primes that contain, for example, a. {a, b} is MSI, as it is defined by the context □x, but it is not JSI; {a}, on the other hand, is JSI but not MSI.
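The claims of Example 7 can be verified with the same finite-language Galois maps (a sketch, with the maps restricted to contexts and substrings occurring in L):

```python
def contexts_of(S, L):
    ctxs = {(w[:i], w[j:]) for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    return {(l, r) for (l, r) in ctxs if all(l + w + r in L for w in S)}

def strings_of(C, L):
    subs = {w[i:j] for w in L
            for i in range(len(w) + 1) for j in range(i, len(w) + 1)}
    return {w for w in subs if all(l + w + r in L for (l, r) in C)}

def closure(S, L):
    return strings_of(contexts_of(S, L), L)

L = {"ax", "bx", "ay", "cy", "bz", "cz"}
# {a, b} is MSI: it is exactly the set of strings with the context □x ...
assert strings_of({("", "x")}, L) == {"a", "b"}
# ... but it is not JSI: the closure of either single letter is a singleton.
assert closure({"a"}, L) == {"a"}
assert closure({"b"}, L) == {"b"}
```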
Example 8. L = {aⁿxbⁿ | n ≥ 0} ∪ {aⁿxcⁿ | n ≥ 0}. This is in L_MJ but not in L_P. Note that for all n, {bⁿ, cⁿ} is closed and MSI, as it is defined by the single context aⁿx□, and is clearly prime, but not JSI. Therefore there are infinitely many primes. The MSI-JSI-primes are only {aⁿxbⁿ | n ≥ 0}, {aⁿxcⁿ | n ≥ 0}, {a}, {b}, {c} and {x}.

[Figure caption: The boxed elements are those which are both MSI and JSI; apart from the empty string concept, these correspond to the nonterminals of the original grammar. {a, b, c, m} is MSI, since it is equal to {□z}′, but it is not JSI.]

Definition 12. If L ∈ L_MJ and V is the set of MSI-JSI-primes, then for a set of strings X we write Γ(X) = {α ∈ V⁺ | ᾱ ⊂ X} and ∆(X) = {α ∈ V⁺ | ᾱ ⊆ X}.

Definition 13. For every language L in L_MJ, we define a grammar G_MJ(L) = ⟨Σ, V ∪ {S}, S, P_L ∪ P_B ∪ P_S⟩, where:
• P_L is the set of all productions X → a such that X ∈ V, a ∈ Σ ∪ {λ}, a ∈ X, and there is no Y ∈ V such that a ∈ Y and Y < X.
• P_B is the set of productions X → α for every X ∈ V and every α ∈ max Γ(X).
• P_S is the set of productions S → α for every α ∈ max ∆(L).
Note that by Definition 11 this grammar is finite, and so it is a CFG. We now show that it generates all of the strings in the language. We do this by a joint induction on the length of the strings and the height of the nonterminals in the lattice; it seems easier to write these proofs as reductios.
Lemma 9. For any X ∈ V, if λ ∈ X then X ⇒* λ in G*.

Proof. Suppose this is false. Take a minimal X ∈ V such that λ ∈ X but X does not derive λ. If X were a minimal element of the set of nonterminals that contain λ, then there would be a production X → λ in G*; therefore X is not minimal. Let Y be some nonterminal less than X such that λ ∈ Y. Since X was chosen minimal, we have Y ⇒* λ. Now ⟨Y⟩ ∈ Γ(X), so there is some production X → α such that ᾱ ⊇ Y. If α = Z₁ . . . Zₖ then each of the Zᵢ must be a proper subset of X that contains λ (since λ ∈ Y ⊆ ᾱ). Since they are proper subsets, we have Zᵢ ⇒* λ and thus X ⇒* λ, which is a contradiction.
Lemma 10. For any a ∈ Σ and X ∈ V, if a ∈ X then X ⇒* a in G*.
Proof. We use just the same argument as in the previous proof, except that when we consider α = Z₁ . . . Zₖ, there must be some i such that a ∈ Zᵢ and λ ∈ Zⱼ for all j ≠ i. By Lemma 9, Zⱼ ⇒* λ; by minimality of the counterexample, Zᵢ ⇒* a; and therefore X ⇒* a.
Lemma 11. For any w = a₁ . . . aₙ ∈ {l□r}′ for some context l□r, there are A₁, . . . , Aₙ ∈ V such that aᵢ ∈ Aᵢ and A₁ · · · Aₙ ⊆ {l□r}′.
Proof. By induction on n. The base case n = 1 is trivial by Part 3 of Definition 11. For n > 1, clearly a₁ ∈ {l□a₂ . . . aₙr}′. Pick some A₁ ∈ V such that a₁ ∈ A₁ and l□a₂ . . . aₙr ∈ A₁′. Since A₁ ∈ V it is JSI, so there is some v₁ such that A₁ = {v₁}′′. Since a₂ . . . aₙ ∈ {lv₁□r}′, by the inductive hypothesis there are A₂, . . . , Aₙ such that aᵢ ∈ Aᵢ and A₂ · · · Aₙ ⊆ {lv₁□r}′, and therefore A₁A₂ · · · Aₙ ⊆ {l□r}′. The result then follows by induction.
Lemma 12. For any X ∈ V and w ∈ X with |w| > 1, X ⇒* w in G*.

Proof. Suppose this is false for some w and X; pick a shortest such w and a minimal such X. By Lemma 10, |w| > 1. Let w = a₁ . . . aₙ, and let l□r be some context such that X = {l□r}′. By Lemma 11, we have some A₁, . . . , Aₙ such that A₁ · · · Aₙ ⊆ X and aᵢ ∈ Aᵢ for 1 ≤ i ≤ n. Since X is prime, we know that A₁ · · · Aₙ ⊂ X. Therefore there is some production X → β such that A₁ · · · Aₙ ⊆ β̄. Let β = B₁ . . . Bₖ for some k ≥ 1. Now w ∈ β̄, so there are strings v₁, . . . , vₖ with v₁ · · · vₖ = w and vⱼ ∈ Bⱼ for 1 ≤ j ≤ k. For each pair vⱼ, Bⱼ, either |vⱼ| < |w| or Bⱼ ⊂ X; so, by the choice of w and X (using Lemmas 9 and 10 for the strings of length at most one), Bⱼ ⇒* vⱼ for each j, and therefore X ⇒* w, which is a contradiction.

Theorem. If L* ∈ L_MJ then L(G_MJ(L*)) = L*.

Proof. Write G* for G_MJ(L*). It is easy to see that each production X → α is correct; a simple induction then establishes that L(G*) ⊆ L*. The nontrivial part of the proof is to show that every string in L* is generated from S. Given the previous lemmas, this is straightforward; the proofs are a bit repetitive, following the earlier lemmas with minor variations. Suppose w ∈ L*.
• If w = a for some a ∈ Σ, then pick some X ∈ V such that a ∈ X and □ ∈ X′. By Lemma 10, X ⇒* a, and X ⊆ L. If X = L then there is a unary rule S → X. Otherwise ⟨X⟩ ∈ ∆(L), and so there is some production S → B₁ . . . Bₖ such that a ∈ B₁ · · · Bₖ. So there must be some Bᵢ such that a ∈ Bᵢ and λ ∈ Bⱼ for j ≠ i, and the result follows by Lemmas 10 and 9.
• If |w| > 1, then there are two cases. First, it might be that L ∈ V, in which case there is a single rule S → L, and the result is immediate by Lemma 12. If not, then we can use the same argument as in Lemma 12, which we recapitulate briefly here. If w = a₁ … aₙ ∈ L, then we have some A₁, …, Aₙ such that A₁ … Aₙ ∈ ∆(L) and Aᵢ ⇒* aᵢ for 1 ≤ i ≤ n; since there is a production S → A₁ … Aₙ, we have S ⇒* w.

Therefore L(G*) ⊇ L*, which establishes the theorem.

Example
We will illustrate some properties of our proposed solution with a simple toy example, based approximately on an example in (Berwick et al., 2011). We use a finite language as this shows most sharply the distinction between weak learning, which is trivial, and strong learning, which even in the case of acyclic CFGs is either impossible or intractable in the general case, depending on how it is formalised. The language is generated by the following grammar; optional elements are in brackets.
S → NP VP .
S → CAN NP VI ?
NP → EAGLES (RC)
NP → THEY
RC → THAT VP
VP → (CAN) VI
VP → DIED
VI → EAT
VI → FLY

It contains examples like "can eagles that fly eat?", "eagles that can fly can eat." and "can eagles that died eat?". The language contains some optional elements and as a result is not substitutable. This is not the minimal example: for technical reasons the example needs to be sufficiently complex, and to contain enough information to indicate the structure unequivocally. So, for example, we added the word DIED. This alters the structure of the lattice and means that the class VP is then a prime. Without this addition, the concept corresponding to VP would be decomposable as {CAN, λ} · {EAT, FLY}. Similarly we added the word THEY to the class of NP. Without these additions, the language would not contain enough information to distinguish whether, for example, the relative clause attaches to the preceding noun or the following verb. Consider the language L = {ab, acb}. There is no reason, based on the strings, to claim that c attaches to the left or to the right. But if we enlarge the language slightly to {ab, acb, ad}, it is more natural to attach the c optionally to the b.
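The effect of enlarging {ab, acb} to {ab, acb, ad} can be checked mechanically. The following sketch is our own illustration, not code from the paper (the function names and the rendering of the polar maps are our assumptions); it computes the closed set of strings generated by d:

```python
# Sketch only: polar maps for a finite language, with our own function
# names; the language {ab, acb, ad} is the small example from the text.
L = {"ab", "acb", "ad"}

def substrings(lang):
    """All substrings (including the empty string) of words in lang."""
    return {s[i:j] for s in lang
            for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def contexts(w, lang):
    """w's context set: all (l, r) such that l + w + r is in lang."""
    return {(s[:i], s[i + len(w):]) for s in lang
            for i in range(len(s) - len(w) + 1) if s[i:i + len(w)] == w}

def closure(S, lang):
    """The closed set of strings sharing all of S's common contexts."""
    shared = set.intersection(*(contexts(w, lang) for w in S))
    return {w for w in substrings(lang) if shared <= contexts(w, lang)}

# d occurs only in the context (a, ""); b and cb share that context,
# so the closure groups cb together with the plain b.
print(sorted(closure({"d"}, L)))   # ['b', 'cb', 'd']
```

The closure of {d} contains b, cb and d, which is exactly the sense in which the enlarged language attaches the optional c to the b.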
The lattice contains 43 elements, of which 10 are prime; these are listed in Table 1, where we label each prime with a mnemonic label, reusing the nonterminal symbols from the target grammar for ease of reading. The composite elements include, for example, the set of four strings {EAT EAT, EAT FLY, FLY EAT, FLY FLY}, which is clearly not prime. Figures 2 and 3 show trees of some of the sentences that illustrate the structure of the derived grammars.
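The non-primality of that four-string set can be seen directly: it is exactly the concatenation of two smaller sets. A minimal sketch of this (our own illustration; tokens are represented as tuples of words):

```python
# The composite element {EAT EAT, EAT FLY, FLY EAT, FLY FLY} factors as
# Y·Z with Y = Z = {EAT, FLY}; a prime admits no such factorisation.
Y = {("EAT",), ("FLY",)}
Z = {("EAT",), ("FLY",)}
X = {y + z for y in Y for z in Z}   # elementwise concatenation of the sets
print(sorted(X))
```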

Discussion and Conclusion
The two methods we present here do not exhaust the possibilities: rather, they represent two extremes. The word CAN forms a concept on its own, AUX, since when it occurs at the beginning of a sentence it is obligatory. The concept RC, which contains the sequences THAT EAT, THAT CAN FLY and so on, is always optional, and so there is no corresponding concept that does not contain λ. The set of MSI-JSI-primes seems to be the smallest possible set of primes that we can define using these techniques. Our reliance on individual contexts and substrings in the definition of MSI-JSI-primes is natural but inadequate. In terms of earlier traditions of analytical linguistics, we are following Sestier-Kunze domination rather than Dobrushin domination (Marcus, 1994). Kunze (1967, 1968) argues that for the adequate description of German lexical categories simple Dobrushin domination is inadequate. Note also the similar notions of context-separability and expression-separability in (Adriaans, 1999). The set of MSI-JSI-primes may well be too restricted; it seems necessary to have nonterminals that correspond to cases where {l□r}◁ ⊋ {w}▷◁. We leave this problem for future work.
It also seems quite natural to define a dual of primality. For a closed set of strings X, we can define conditions of the form X = Y ∨ Z, for some closed sets of strings Y, Z; there are also weaker variants of this. These correspond to a condition that the nonterminal occurs on the left-hand side of more than one production.
From a linguistic point of view, it is interesting that these approaches give local trees of potentially unbounded rank, as well as empty constituents and unary rules. This means that the claim that natural languages use only a binary syntactic operation (MERGE) becomes a contentful empirical claim.
There is a close relation between the models used here and the primal and dual weak learning algorithms presented in (Yoshinaka, 2012b); in particular, the categories that are MSI have the 1-finite-context-property (1-FCP) and the JSI concepts have the 1-finite-kernel-property (1-FKP). The languages in L_MJ will therefore be weakly learnable under certain paradigms. Turning the results here into a full strong learning result then requires the efficient computation of the canonical grammar given a sufficiently large weakly correct grammar. There do appear to be some technical problems to overcome: for example, showing that the number of errors made in selecting the MSI-JSI-primes will only ever be finite.
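The single-context and single-string definability behind the 1-FCP and 1-FKP can be made concrete on the small language {ab, acb, ad} discussed earlier. The sketch below is our own code (the helper names and the rendering of the polar maps are assumptions, not the paper's notation):

```python
# Our own illustrative code: checking whether a concept is definable by
# a single string (1-FKP flavour) or a single context (1-FCP flavour).
L = {"ab", "acb", "ad"}

def contexts(w):
    """w's context set: all (l, r) with l + w + r in L."""
    return {(s[:i], s[i + len(w):]) for s in L
            for i in range(len(s) - len(w) + 1) if s[i:i + len(w)] == w}

def strings_of(ctxs):
    """All substrings of L that occur in every context in ctxs."""
    subs = {s[i:j] for s in L
            for i in range(len(s) + 1) for j in range(i, len(s) + 1)}
    return {w for w in subs if ctxs <= contexts(w)}

X = strings_of(contexts("d"))        # the concept {b, cb, d}
# 1-FKP flavour: X is generated by the single string d ...
assert X == {"b", "cb", "d"}
# ... and 1-FCP flavour: X is also defined by the single context (a, "")
assert X == strings_of({("a", "")})
```

Here the same concept happens to be definable both ways; in general a category may have one property without the other.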
Given the extension of distributional learning to multiple context-free grammars (MCFGs) (Seki et al., 1991) by (Yoshinaka, 2011), and the extension of the syntactic concept lattice in (Clark and Yoshinaka, 2014), it seems possible to extend these methods straightforwardly to at least some MCFGs. In particular, the notion of a closed set of strings being composite is naturally generalised by replacing the single concatenation operation · with the family of all non-deleting, non-permuting linear regular functions of appropriate arities.
The existence of these canonical grammars seems to be related in an interesting way to algebraic properties of the syntactic concept lattice. Indeed, the finite cardinality of the lattice is exactly equivalent to the regularity of the language. It seems that other finiteness properties of the lattice, for example compactness and the chain conditions, may be crucial. More generally, the results presented here show that it may be possible to have strong learning algorithms for some quite large classes of languages. This suggests that the orthodox view that semantic information is required to learn syntactic structure may be mistaken; the set of strings of the language may define an intrinsic structure that can be learned purely distributionally. If the structures so defined can support compositional interpretation of the semantics, then this would provide strong empirical support for this approach.