General Perspective on Distributionally Learnable Classes

Several algorithms have been proposed to learn different subclasses of context-free grammars based on the idea generically called distributional learning. Those techniques have been applied to many formalisms richer than context-free grammars, such as multiple context-free grammars, simple context-free tree grammars and others. The learning algorithms for those different formalisms are actually quite similar to each other. This paper gives a uniform view on those algorithms.


Introduction
Approaches based on the idea generically called distributional learning have achieved great success in the algorithmic learning of various subclasses of context-free grammars (CFGs) (Clark, 2010c; Yoshinaka, 2012). Those techniques have been applied to richer formalisms as well. The formalisms studied so far include multiple CFGs (Yoshinaka, 2011a), simple context-free tree grammars (CFTGs) (Kasprzik and Yoshinaka, 2011), second-order abstract categorial grammars (Yoshinaka and Kanazawa, 2011), parallel multiple CFGs (Clark and Yoshinaka, 2014), conjunctive grammars (Yoshinaka, 2015) and others. The goal of this paper is to present a uniform view on those algorithms.
Every grammar formalism for which distributional learning techniques have been proposed so far generates its languages through context-free derivation trees, whose nodes are labeled by production rules. The formalism and grammar rules determine how a context-free derivation tree τ is mapped to a derived object τ̄ = d. A context-free derivation tree τ can be decomposed into a subtree σ and a tree-context χ so that τ = χ[σ]. The subtree determines a substructure s = σ̄ of d and the tree-context determines a contextual structure c = χ̄ into which the substructure is plugged to form the derived object d = c ⊙ s, where ⊙ represents the plugging operation. In the CFG case, c is a string pair ⟨l, r⟩, s is a string u, and ⟨l, r⟩ ⊙ u = lur, which may correspond to a derivation I ⇒* lXr ⇒* lur where I is the initial symbol and X is a nonterminal symbol. In richer formalisms those substructures and contexts may have richer structures, like tuples of strings or λ-terms. A learner does not know how a given example d is derived by the hidden grammar behind the observed examples. A learner based on distributional learning simply tries all the possible decompositions of a positive example into two arbitrary parts c′ and s′ such that d = c′ ⊙ s′, where some grammar may derive d through a derivation tree τ′ = χ′[σ′] with χ̄′ = c′ and σ̄′ = s′. Based on observations of the relation between substructures and contexts collected from given examples, a hypothesis grammar is computed. We call the properties of grammars with which distributional learning approaches work distributional properties.
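For the CFG case, the decomposition step described above can be made concrete. The following is a minimal sketch, assuming that derived objects are Python strings, that contexts are pairs (l, r), and that only nonempty substrings are considered; the names `plug` and `decompositions` are illustrative and do not come from the literature.

```python
def plug(context, substring):
    """Return c ⊙ s for a string context c = (l, r), i.e. the string l + s + r."""
    left, right = context
    return left + substring + right


def decompositions(d):
    """Enumerate every pair (c, s) with c ⊙ s == d, where s is a nonempty substring of d."""
    for i in range(len(d)):
        for j in range(i + 1, len(d) + 1):
            yield (d[:i], d[j:]), d[i:j]


# All decompositions of one positive example.
pairs = list(decompositions("aab"))
```

A string of length n has n(n+1)/2 such decompositions, so a learner can afford to try all of them for every example.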
This paper first formally defines grammar formalisms based on context-free derivation trees. We then show that grammars with different distributional properties are learnable by standard distributional learning techniques if the formalism satisfies some conditions, which include polynomial-time decomposability of objects into contexts and substructures. In addition, we discuss cases where we cannot enumerate all of the possible contexts and substructures.

Σ-grammars
There are a number of ways to represent a language, a subset of an object set O * , whose elements are typically strings or trees, although anything encodable is eligible. The formalisms this paper discusses generate objects in O * through context-free derivation trees τ , which are mapped to elements d ∈ O * in a uniform way. The map is inductively defined and computed. Each derivation subtree τ ′ of τ also determines an object, which we call a substructure of d. A substructure is not necessarily a member of O * . For example, nonterminal symbols of multiple CFGs (Seki et al., 1991) derive n-tuples of strings, where the value n is unique to each nonterminal, while the languages generated by multiple CFGs are still simply string sets. A generalization of the CFG formalism is specified by the kinds of objects that each nonterminal generates and the admissible operations over those objects.
Let O be a set of objects, which are identified with their codes of finite length. We have a set Ω of finite representations O which are interpreted as subsets O O of O through an effective procedure. By a sort we flexibly refer to O ∈ Ω or O O ⊆ O. We also have an indexed family of computable functions from tuples of objects of some sorts to objects of some sort. Let F be a set of function names or function indices f , which represent functions f̃ . By O 1 × · · · × O n → O 0 we denote the set of functions whose domain is O 1 × · · · × O n and codomain is O 0 . We assume that the domain sorts O 1 , . . . , O n and the codomain sort O 0 are easily computed from f . We specify a class of grammars by a triple Σ = ⟨Ω, F, O * ⟩, which we call a signature.
A context-free Σ-grammar (Σ-grammar for short) is a tuple G = ⟨N, σ, F, P, I⟩ where N is a finite set of nonterminal symbols, I ⊆ N is a set of initial symbols, σ ∈ N → Ω is a sort assignment on nonterminals such that σ(X) = O * for all X ∈ I, F ⊆ F is a finite set of function names, and P is a finite set of production rules, which are elements of N × F × N * . Each production rule is denoted as X 0 ← f ⟨X 1 , . . . , X n ⟩. A Σ-grammar defines its language via derivation trees, which are recursively defined as follows.
• If τ i are X i -derivation trees for i = 1, . . . , n and ρ is a rule of the form X 0 ← f ⟨X 1 , . . . , X n ⟩, then the term consisting of ρ with immediate subtrees τ 1 , . . . , τ n is an X 0 -derivation tree, whose yield is the result of applying the function represented by f to the yields of τ 1 , . . . , τ n .
The case where n = 0 gives the base of this recursive definition. An X-derivation tree is complete if X ∈ I. The yield of any X-derivation tree is called an X-substructure. By S(G, X) we denote the set of X-substructures. The language of G is L(G) = ∪ X∈I S(G, X), which we call a Σ-language. In other words, L(G) is the set of the yields of complete derivation trees. The class of Σ-languages is denoted by L(Σ).
Distributional learning is concerned with what X-derivation contexts represent. An X-derivation context is obtained by replacing an occurrence of an X-derivation tree in a complete derivation tree by a special symbol □ σ(X) . Accordingly the yield χ̄ of an X-derivation context χ should be a finite representation of a function that gives the yield of χ[τ ] when applied to τ̄ for any X-derivation tree τ . We assume that we have a set E O of representations of functions from O O to O * for each O ∈ Ω, to which the yields of derivation contexts belong.
The yield of any X-derivation context is called an X-context. By C(G, X) we denote the set of X-contexts. For c ∈ C(G, X) and s ∈ S(G, X), c ⊙ s is the result of the application of the function represented by c to s.
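The relation between derivation trees, derivation contexts and their yields can be illustrated with a small sketch. It assumes string-valued functions and a toy grammar for { aⁿbⁿ | n ≥ 1 }; the classes `Rule` and `Tree`, the marker `HOLE` and the helper names are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Rule:
    lhs: str                    # nonterminal X0
    func: Callable[..., str]    # the function represented by the function name f
    rhs: Tuple[str, ...] = ()   # nonterminals X1, ..., Xn


@dataclass(frozen=True)
class Tree:
    rule: Rule
    children: tuple = ()


HOLE = object()  # plays the role of the special symbol □


def evaluate(node, filler):
    """Compute the yield bottom-up, using `filler` as the yield at the hole."""
    if node is HOLE:
        return filler
    return node.rule.func(*(evaluate(t, filler) for t in node.children))


def tree_yield(tree):
    """Yield of a derivation tree (which contains no hole)."""
    return evaluate(tree, None)


def context_yield(context):
    """Yield of a derivation context: a function from substructures to objects."""
    return lambda s: evaluate(context, s)


# Toy grammar (an assumption for this sketch): S <- base<>, S <- wrap<S>.
base = Rule("S", lambda: "ab")
wrap = Rule("S", lambda x: "a" + x + "b", ("S",))

sigma = Tree(wrap, (Tree(base),))   # subtree with yield "aabb"
chi = Tree(wrap, (HOLE,))           # derivation context with yield s -> "a" + s + "b"
tau = Tree(wrap, (sigma,))          # the tree chi[sigma]
```

Applying the yield of `chi` to the yield of `sigma` gives the yield of `tau`, which is exactly the identity c ⊙ s = d used throughout.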

Context-substructure relation
By S and C we denote the sets of substructures and contexts, respectively, which can be obtained by some grammar in G(Σ), and by S O and C O their subsets belonging to a sort O ∈ Ω. We write S * for S O * . Note that these definitions are relative to Σ.
Hereafter, whenever we write c ⊙ s and f (s 1 , . . . , s n ), we assume they are well-formed. That is, the domains of the functions represented by c and f match the sorts to which s and s 1 , . . . , s n belong, respectively. Accordingly we drop the subscript O from □ O and write f̃ (s 1 , . . . , s j−1 , □, s j+1 , . . . , s n ). When we have a substructure set S, we assume S ⊆ S O for some O ∈ Ω. We often identify s with {s} unless confusion arises. Also we assume S O ≠ ∅ for all O ∈ Ω. The same assumptions apply to contexts.
We are interested in whether the composition c⊙s belongs to a concerned language L ∈ L(Σ). Clark (2010b) has introduced syntactic concept lattices to analyze the context-substring relation on string languages and particularly to design a distributional learning algorithm for CFLs. Generalizing his discussion, we define an O-concept lattice B O (L) of a language L ⊆ O * for respective sorts O ∈ Ω. Assuming L and O understood from the context, let us write We call a pair ⟨S, C⟩ ⊆ S O × C O a concept iff S † = C and C ‡ = S. For any S ⊆ S O and C ⊆ C O , ⟨S † , S ‡ ⟩ and ⟨C † , C ‡ ⟩ are concepts. We call them the concepts induced by S and C, respectively. For two concepts ⟨S 1 , C 1 ⟩ and ⟨S 2 , We can introduce a partial order to substructure sets based on the concepts that they induce. Let us The relation represents the substitutability of S 1 for S 2 .
Lemma 1. The following three are equivalent for S, T ⊆ S O : If S 1 ≤ L S 2 and S 2 ≤ L S 1 , we write S 1 ≡ L S 2 .
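For string languages, the passage from a substructure set to the concept it induces can be sketched as follows. The sketch assumes a finite language given as a Python set and restricts attention to the finitely many substrings and contexts occurring in it; the function names are illustrative, and the two maps are written out as plain functions rather than with the paper's superscript notation.

```python
def decompositions(d):
    """All (context, substring) pairs of d with a nonempty substring."""
    for i in range(len(d)):
        for j in range(i + 1, len(d) + 1):
            yield (d[:i], d[j:]), d[i:j]


def contexts_accepting(subs, contexts, language):
    """Contexts c such that c ⊙ s is in the language for every s in subs."""
    return {c for c in contexts if all(c[0] + s + c[1] in language for s in subs)}


def substrings_accepted(ctxs, substrings, language):
    """Substrings s such that c ⊙ s is in the language for every c in ctxs."""
    return {s for s in substrings if all(c[0] + s + c[1] in language for c in ctxs)}


def induced_concept(subs, substrings, contexts, language):
    """Close a substring set: first collect its contexts, then all substrings they accept."""
    C = contexts_accepting(subs, contexts, language)
    S = substrings_accepted(C, substrings, language)
    return S, C


language = {"ab", "aabb"}
all_pairs = [p for d in language for p in decompositions(d)]
contexts = {c for c, _ in all_pairs}
substrings = {s for _, s in all_pairs}
S, C = induced_concept({"ab"}, substrings, contexts, language)
```

In this toy case the concept induced by {ab} pairs the substring set {ab} with the contexts ⟨ε, ε⟩ and ⟨a, b⟩, and applying either map again leaves the pair unchanged.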

Conditions to be distributionally learnable
Distributional learning algorithms decompose examples d ∈ S * into contexts c ∈ C O and substructures s ∈ S O so that c ⊙ s = d. Then a primal approach uses substructures or sets of substructures as nonterminals of a conjecture grammar. We want On the other hand, a dual approach uses contexts or sets of contexts as nonterminals where the semantics of the nonterminal is We require G(Σ) to be a tractable formalism such that composition and decomposition can be done efficiently. Assumption 1. There are polynomial-time algorithms which • decide whether s ∈ L(G) from s ∈ S * and G ∈ G(Σ), Assumption 2. There is p ∈ N such that the arity of every f ∈ F is at most p. Assumption 3. There are polynomial-time algorithms that compute SUB(d), CON(d) and Actually by Assumptions 2 and 3, one can derive the polynomial-time uniform membership decidability. Moreover, it is easy to filter out nonmembers of S |d , C |d and F |d from SUB(d), CON(d) and FUN(d), respectively, but it is not necessary. Assumption 3 implies |Ω |d | is polynomially bounded, It is often the case that elements of Ω represents pairwise disjoint sets. Actually for any signature Σ, one can find Clearly every Σ-grammar has an equivalent Σ ′ -grammar. Moreover, this makes it clear that from s ∈ S one can immediately specify the unique sort O ′ ∈ Ω ′ such that s ∈ O O ′ . Similarly we may assume that each c ∈ C has unique O ∈ Ω such that c ∈ C O and finding that O is a trivial task. Hereafter we work under this assumption. By O and f we mean O O andf for notational convenience.
Example 1. A right regular grammar over an alphabet ∆ is a Σ reg -grammar for Σ reg = ⟨{∆ * }, F, ∆ * ⟩. F has nullary functions, which are members of ∆ ∪ {ε}, and unary functions f a with f a (w) = aw for all w ∈ ∆ * , one for each a ∈ ∆. Clearly the class of right regular grammars satisfies Assumptions 1, 2 and 3.
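The class of CFGs itself satisfies Assumption 1 but not Assumptions 2 and 3, since there is no limit on the arity n of the functions, that is, on the number of nonterminals on the right-hand side of a rule. But several normal forms fulfill Assumptions 2 and 3.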
denotes the set of m-tuples of strings. Linear context-free rewriting systems, equivalent to nondeleting multiple CFGs, are Σ mcfg -grammars where O * = O 1 = ∆ * and every f ∈ F Om 0 ,Om 1 ,...,Om n concatenates strings u i,j occurring in an input ⟨⟨u 1,1 , . . . , u 1,m 1 ⟩, . . . , ⟨u n,1 , . . . , u n,mn ⟩⟩ in some way to form an m 0 -tuple of strings. The uniform membership problem of this class is PSPACE-complete (Kaji et al., 1992). There are infinitely many ways to decompose a string d into substructures and contexts as O m ∈ Ω |d for all m. Assumptions 1 and 3 will be fulfilled when we restrict admissible functions so that F Om 0 ,...,Om n ̸ = ∅ only if n ≤ p and m i ≤ q for all i.
As is the case for multiple CFGs, Assumption 2 is often needed to make the uniform membership problem solvable in polynomial time (Assumption 1).

Learning models
Learning algorithms in this paper work under three different learning models. A positive presentation of a language L is an infinite sequence d 1 , d 2 , . . . of elements of L in which every element of L occurs at least once. In the framework of identification in the limit from positive data, a learner is given a positive presentation of the language L * = L(G * ) of the target grammar G * and each time a new example d i is given, it outputs a grammar G i computed from d 1 , . . . , d i . We say that a learning algorithm A identifies G * in the limit from positive data if for any positive presentation d 1 , d 2 , . . . of L(G * ), there is an integer n such that G n = G m for all m ≥ n and L(G n ) = L(G * ). We say that A identifies a class G of grammars in the limit from positive data iff A identifies all G ∈ G in the limit from positive data.
We say that A identifies a class G of grammars in the limit from positive data and membership queries when we allow A to ask membership queries (MQs) to an oracle when it computes a hypothesis grammar. An instance of an MQ is an object d ∈ O * and the oracle answers whether d ∈ L * in constant time.
The third model is learning with a minimally adequate teacher (MAT). A learner is not given a positive presentation but may ask equivalence queries (EQs) to an oracle in addition to MQs. An instance of an EQ is a grammar G. If L(G) = L * , the oracle answers "Congratulations!" and the learning process ends. Otherwise, the oracle returns a counterexample, an object that belongs to exactly one of L(G) and L * ; it is called positive if it is in L * − L(G) and negative if it is in L(G) − L * . When we have such oracles, the learning task itself is trivial, so a result in this model is meaningful only when it comes with some favorable property on the learning efficiency.
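The two kinds of queries can be mimicked by a toy teacher. The sketch below makes strong simplifying assumptions that are not part of the learning model itself: hypotheses are passed as membership predicates rather than grammars, and equivalence is only checked on strings up to a fixed length, since exact equivalence is not decidable in general. The class name `ToyTeacher` and its methods are hypothetical.

```python
from itertools import product


class ToyTeacher:
    """Answers MQs exactly and EQs approximately, on all strings up to max_len."""

    def __init__(self, target, alphabet, max_len):
        self.target = target  # membership predicate of the target language
        self.universe = [
            "".join(w)
            for n in range(max_len + 1)
            for w in product(alphabet, repeat=n)
        ]

    def membership(self, d):
        """MQ: is d in the target language?"""
        return self.target(d)

    def equivalence(self, hypothesis):
        """EQ: return None if no difference is found, else a counterexample."""
        for d in self.universe:
            if self.target(d) != hypothesis(d):
                return d
        return None


def in_anbn(w):
    """Membership in { a^n b^n | n >= 1 }."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


teacher = ToyTeacher(in_anbn, "ab", max_len=4)
counterexample = teacher.equivalence(lambda d: d == "ab")
```

Here the hypothesis {ab} is too small, so the teacher returns a positive counterexample.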

Learnable subclasses
This section presents how Σ-grammars with distributional properties can be learned. Note that all of those properties are relative to Σ. We assume that the Σ-grammars G * = ⟨N * , σ * , F * , P * , I * ⟩ in this section have no useless nonterminals or functions. That is, S(G * , X) ≠ ∅ and C(G * , X) ≠ ∅ for all X ∈ N * , and every f ∈ F * appears in some rule in P * .

Substitutable Languages
Definition 1 (Clark and Eyraud (2007) The definition can be rephrased as follows: (2008) has proposed a learning algorithm for k, l-substitutable CFLs, which satisfy the following property: x 1 uy 1 vz 1 , x 1 uy 2 vz 1 , x 2 uy 1 vz 2 ∈ L =⇒ x 2 uy 2 vz 2 ∈ L for any x i , y i , z i ∈ ∆ * , u ∈ ∆ k and v ∈ ∆ l . We define a signature Σ k,l = ⟨Ω k,l , F k,l , O * ⟩ as fol- Here we put overlines to make elements of Ω pairwise disjoint.
The binary function + α,β concatenates two strings from sorts α and β and gives the right sort in Ω − {O * }. For example, x is the suffix of vw of length l. The unary operation □ O * α ∈ F O * ,α simply removes the overline and "promotes" ū ∈ α to u ∈ O * . ∆ consists of the nullary functions giving a single letter from ∆. It is not hard to see that every CFG has an equivalent Σ k,l -grammar. Note that O * -nonterminals never occur on the right-hand side of a rule in a Σ k,l -grammar. Hence C O * is just the singleton {□ O * } such that □ O * ⊙ u = u for all u ∈ ∆ * , whereas C α for α ≠ O * contains arbitrary pairs of strings ⟨l, r⟩ ∈ ∆ * × ∆ * such that ⟨l, r⟩ ⊙ ū = lur for any ū ∈ α. Σ k,l -substitutability is exactly k, l-substitutability.
Theorem 1. The class of substitutable Σ-languages is identifiable in the limit from positive data.
The theorem follows Lemmas 2 and 3 below. From a finite set D of positive examples, Algorithm 1 computes the grammar SUBSTP(D) = ⟨N, σ, F, P, I⟩ defined as follows: • P consists of the rules of the form An alternative way to construct a grammar is to use contexts rather than substructures for nonterminals. One can replace SUBSTP(D) in the algorithm by SUBSTD(D) which is defined as follows.
• P consists of the rules of the form The existing algorithms for different classes of substitutable languages (Clark and Eyraud, 2007;Yoshinaka, 2008;Yoshinaka, 2011a) are based on slight variants of SUBSTP. This paper shows the correctness of the algorithm using SUBSTD.
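The inference that underlies learners for substitutable languages can be illustrated in the plain string case. The sketch below is not the grammar construction SUBSTP or SUBSTD; it only shows the idea, in the style of Clark and Eyraud (2007), that two substrings observed in a common context are treated as interchangeable. It groups substrings by shared contexts (taking the transitive closure with a union-find structure) and then closes the sample under substitution up to a length bound. All function names and the length bound are assumptions of this illustration.

```python
def decompositions(d):
    """All (context, substring) pairs of d with a nonempty substring."""
    for i in range(len(d)):
        for j in range(i + 1, len(d) + 1):
            yield (d[:i], d[j:]), d[i:j]


def substitution_classes(sample):
    """Map each observed substring to the set of substrings it is merged with."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    by_context = {}
    for d in sample:
        for c, u in decompositions(d):
            by_context.setdefault(c, []).append(u)
    for subs in by_context.values():
        root = find(subs[0])
        for u in subs[1:]:
            parent[find(u)] = root  # u shares a context with subs[0]
    classes = {}
    for u in list(parent):
        classes.setdefault(find(u), set()).add(u)
    return {u: classes[find(u)] for u in list(parent)}


def substitution_closure(sample, max_len):
    """Strings of length <= max_len reachable by swapping merged substrings."""
    cls = substitution_classes(sample)
    seen, frontier = set(sample), list(sample)
    while frontier:
        d = frontier.pop()
        for (l, r), u in decompositions(d):
            for v in cls.get(u, ()):
                w = l + v + r
                if len(w) <= max_len and w not in seen:
                    seen.add(w)
                    frontier.append(w)
    return seen


generalized = substitution_closure({"c", "acb"}, max_len=7)
```

From the two examples c and acb, which share the context ⟨ε, ε⟩, the closure reaches aacbb and aaacbbb, all of which belong to { aⁿcbⁿ | n ≥ 0 }.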
Lemma 2. Let D be a finite subset of a Σ-substitutable language L * and G the grammar output by SUBSTD(D). Then L(G) ⊆ L * .
Proof. One can show by induction on the derivation that if s ∈ S(G, and s = f (s 1 , . . . , s n ). The induction hypothesis says c i ⊙s i ∈ L * for all i. By the rule construction, there are t i for i = 1, . . . , n such that c i ⊙ t i ∈ D ⊆ L * and c ⊙ f (t 1 , . . . , t n ) ∈ D ⊆ L * . We have s i ≡ L * t i since they occur in the same context c i . By Lemma 1, c ⊙ f (s 1 , . . . , s n ) ∈ L * .
Let G * = ⟨N * , σ * , F * , P * , I * ⟩ be a Σ-grammar generating L * . Fix s X ∈ S(G * , X) and c X ∈ C(G * , X) where c X = □ O * for X ∈ I * . Define D * by | X 0 ← f ⟨X 1 , . . . , X n ⟩ ∈ P * } . Proof. Let G = SUBSTD(D). If G * has a rule X 0 ← f ⟨X 1 , . . . , X n ⟩ then G has the correspond- In particular since c X for X ∈ I is the identity func- This shows that we do not need too many data to achieve a right grammar, since |D * | ≤ |P * | + |N * |, where | · | denotes the cardinality of a set. Moreover, it is easy to see Algorithm 1 updates its conjecture in polynomial time in the total size of D by Assumptions 1, 2 and 3.

Finite kernel property
Definition 2 (Clark et al. (2009), Yoshinaka (2011b)). A nonempty finite set S ⊆ S σ(X) is called a k-kernel of a nonterminal X if |S| ≤ k and S(G, X) ≡ L(G) S .
A Σ-grammar G is said to have the k-finite kernel property (k-FKP) if every nonterminal X has a k-kernel S X .
Theorem 2. Under Assumptions 1, 2 and 3, Algorithm 2 identifies Σ-grammars with the k-FKP in the limit from positive data and membership queries. The conjecture grammar PRIMAL k (K, F, J) = ⟨N, σ, F, P, I⟩ of Algorithm 2 is defined from finite sets of substructures K ⊆ S, functions F ⊆ F and contexts J ⊆ C. The subsets of those sets corresponding to respective sorts are denoted as • P consists of the rules of the form The grammar is constructed by the aid of finitely many MQs by Lemma 1. However, this condition (2) cannot be checked by finitely many MQs. The condition (1) can be seen as an approximation of (2), which is decidable by finitely many MQs. Clearly (2) implies (1) but not vice versa. If a rule satisfies (1) but not (2), we call the rule incorrect. If a rule is incorrect, there is a witness c ∈ C O 0 − J O 0 such that c ⊙ S 0 ∈ L * and c ⊙ f (S 1 , . . . , S n ) / ∈ L * . Lemma 4. For every finite K ⊆ S and F ⊆ F there is J ⊆ C such thatĜ = PRIMAL(K, F, J) has no incorrect rules and |J| ≤ |F ||K| k(p+1) , in which case L(Ĝ) ⊆ L * .
Let S X be a k-kernel of each nonterminal X of a grammar G * = ⟨N * , σ * , F * , P * , I * ⟩ generating L * .
Lemma 5. There is a finite subset D ⊆ L * such that S X ⊆ S |D for all X ∈ N * , F * ⊆ F |D and |D| ≤ k|N * | + |P * |. Moreover, if S X ⊆ K for all X ∈ N * and F * ⊆ F , then L * ⊆ L(Ĝ).
We now prove Theorem 2, discussing the efficiency of the algorithm along the way.
Proof of Theorem 2. Clearly Algorithm 2 updates its conjecture in polynomial time in the data size. By Lemma 5, polynomially many (in the size of G * ) positive examples suffice to stabilize K and F . After K and F have stabilized, all the incorrect rules will be removed by at most polynomially many (in |K||F |) further examples by Lemma 4. After that point Algorithm 2 never changes its conjecture, which generates the target language L * .
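The role of the context set J in filtering rules can be illustrated for the string case with concatenation as the only non-nullary function. The sketch below assumes that a rule with left-hand side kernel S 0 and right-hand side kernels S 1 , . . . , S n is kept exactly when every context in J that accepts S 0 also accepts the concatenation of S 1 , . . . , S n , which is the MQ-checkable approximation discussed above. The function names and the target language are illustrative assumptions.

```python
def accepts(context, strings, member):
    """True iff c ⊙ s is in the language for every s in strings."""
    l, r = context
    return all(member(l + s + r) for s in strings)


def rule_survives(lhs_kernel, rhs_kernels, J, member):
    """Keep the rule unless some c in J accepts the left side but not the right side."""
    combined = {""}
    for S in rhs_kernels:
        combined = {u + s for u in combined for s in S}
    return all(
        accepts(c, combined, member)
        for c in J
        if accepts(c, lhs_kernel, member)
    )


def in_anbn(w):
    """Membership oracle for { a^n b^n | n >= 1 }."""
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


# The rule [a] <- [a][a] is incorrect, but J = {<ε, ε>} cannot see it.
before = rule_survives({"a"}, [{"a"}, {"a"}], {("", "")}, in_anbn)
# Adding the witness <ε, b> exposes it: ab is in the language, aab is not.
after = rule_survives({"a"}, [{"a"}, {"a"}], {("", ""), ("", "b")}, in_anbn)
```

This mirrors the argument of the proof: once K and F are fixed, each incorrect rule only needs one witnessing context to appear in J in order to be discarded.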

Congruential grammars
Definition 3 (Clark (2010a)). A Σ-grammar G is said to be congruential if every s ∈ S(G, X) is a 1-kernel of every X ∈ N .
Congruential Σ-grammars have the 1-FKP. Under the following additional assumption, this special case will be polynomial-time learnable with a minimally adequate teacher.
Assumption 4. For any derivation tree τ , the size of its yield τ̄ is polynomially bounded by that of τ .
Theorem 3. Under Assumptions 1, 2, 3 and 4, Algorithm 3 learns any language L * generated by a congruential Σ-grammar G * with a minimally adequate teacher in time polynomial in |N * |, |F * |, ℓ where ℓ is the total size of counterexamples given to the learner. Algorithm 3 uses the same grammar construction PRIMAL as Algorithm 2 where the parameters K and F are calculated from positive counterexamples given by the oracle. On the other hand, J is computed in a different way. By Lemma 4, when the oracle answers a negative counterexample d towards an EQ, our conjectureĜ must use an incorrect rule to derive d. To find and remove such an incorrect rule, Algorithm 3 calls a subroutine WITNESSP with input (τ d , □), where τ d is a derivation tree of G whose yield is d. To be precise, τ d does not have to be a derivation tree. Rather what we require is that for each s ∈ S |d , one can compute at least one tuple of s 1 , . . . , s n ∈ S |d and f ∈ F |d such that s = f (s 1 , . . . , s n ) and the height of the lowest derivation tree of each s i is strictly lower than that of s. Indeed one can do this in polynomial time by a dynamic programming method from SUB(d) and FUN(d). Yet for explanatory easiness, we treat such information as an (implicit) derivation tree τ d . The procedure WITNESSP returns a context that witnesses an incorrect rule that contributes to generating d by searching τ d recursively calling itself. The procedure WITNESSP in general takes a pair (τ, c) such that τ is an for the yieldsτ i of τ i . One can find i such that This means an incorrect rule is in τ i . We call WITNESSP(τ i , c ⊙ f (s 1 , . . . , s i−1 , □,τ i+1 , . . . ,τ n )).
Proof. The number of recursive calls of WITNESSP is no more than the height of τ d , which is at most |S |d |. Let the instance of the j-th recursive call be (τ j , c j ) and let χ j be the derivation context with c j = χ̄ j . Then χ j+1 is obtained from χ j by replacing at most p subtrees by derivation trees whose yields are elements of K. By Assumption 4, the size of c j , and thus the size of an instance of an MQ, is polynomially bounded by |d|ℓ. Hence WITNESSP runs in polynomial time.
Lemma 7. Each time Algorithm 3 receives a negative counterexample, at least one incorrect rule is removed.
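A recursive search of the kind performed by WITNESSP can be sketched for the string case with 1-kernels. The sketch is a reconstruction under stated assumptions, not the procedure as specified for Algorithm 3: each node of the (implicit) derivation tree carries the kernel string naming its nonterminal, the only non-nullary function is concatenation, and the call is made with a context that accepts the node's kernel but rejects the node's yield. The names `Node`, `yield_of` and `witness_primal` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    kernel: str                        # the nonterminal is [kernel]
    children: Tuple["Node", ...] = ()
    letter: str = ""                   # terminal string produced by a leaf rule


def yield_of(node):
    if not node.children:
        return node.letter
    return "".join(yield_of(t) for t in node.children)


def witness_primal(node, context, member):
    """Return a context witnessing an incorrect rule used in `node`.

    Precondition: `context` accepts node.kernel and rejects yield_of(node).
    """
    l, r = context
    kernels = [t.kernel for t in node.children]
    yields = [yield_of(t) for t in node.children]
    if not node.children or not member(l + "".join(kernels) + r):
        return context  # the rule at this node is itself incorrect
    # Replace yields by kernels one position at a time; membership must flip somewhere.
    for i in range(len(kernels)):
        before = l + "".join(kernels[:i])
        after = "".join(yields[i + 1:]) + r
        if member(before + kernels[i] + after) and not member(before + yields[i] + after):
            return witness_primal(node.children[i], (before, after), member)
    raise ValueError("precondition violated")


def in_anbn(w):
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


# Derivation of the negative counterexample aab using the incorrect rule [a] <- [a][a].
leaf_a, leaf_b = Node("a", (), "a"), Node("b", (), "b")
bad = Node("a", (leaf_a, leaf_a))
root = Node("ab", (bad, leaf_b))
witness = witness_primal(root, ("", ""), in_anbn)
```

Each recursive call descends one level, and the context passed down is built from kernels on the left and yields on the right, which is why a bound on yield sizes matters for the size of the membership queries.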
Lemma 8. Let G * = ⟨N * , σ, F * , P * , I * ⟩ be a congruential grammar generating L * . Each time Algorithm 3 receives a positive counterexample, the car- Proof of Theorem 3. Time between an EQ and another is polynomially bounded by Lemma 6. By Lemmas 5 and 8, Algorithm 3 gets at most |N * | + |F * | positive counterexamples. The grammarĜ = PRIMAL(K, F, J) is constructed from those positive counterexamples, so it has polynomially many rules. Therefore, by Lemma 7, after getting polynomially many negative counterexamples, which suppress all the incorrect rules, Algorithm 3 gets a right grammar representing L * .

Finite context property
Definition 4 (Clark (2010b), Yoshinaka (2011b) 1 ). A nonempty finite set C ⊆ C is called a k-context of a nonterminal X if |C| ≤ k and A Σ-grammar G is said to have the k-(weak) finite context property (k-FCP) if every nonterminal X has a k-context C X .
Theorem 4. Under Assumptions 1, 2 and 3, Algorithm 4 identifies Σ-grammars with the k-FCP in the limit from positive data and membership queries.
The theorem can be shown by an argument similar to the proof of Theorem 2, based on Lemmas 9 and 10 below. The discussion on the learning efficiency of Algorithm 2 applies to Algorithm 4 as well. The conjecture grammar DUAL k (J, F, K) = ⟨N, σ, F, P, I⟩ of Algorithm 4 is defined from finite sets of contexts J ⊆ C, functions F ⊆ F and substructures K ⊆ S. For each C ⊆ J O , we write C (K) to mean C † ∩ K O . This set can be seen as a finite approximation of C † , which is computable with MQs.
• P consists of the rules of the form 1 We adopt the definition by Yoshinaka, which is slightly weaker than Clark's.
We say that a rule [ In that case, there are s i ∈ C † i such that C 0 ⊙f (s 1 , . . . , s n ) ⊈ L * . Lemma 9. For every finite J ⊆ C and F ⊆ F there is K ⊆ S such thatĜ = DUAL(J, F, K) has no incorrect rules and |K| ≤ p|F ||J| k(p+1) , in which case L(Ĝ) ⊆ L * .
Let G * = ⟨N * , σ * , F * , P * , I * ⟩ generate L * and C X a k-context of each nonterminal X ∈ N * . Lemma 10. There is a finite subset D ⊆ L * such that C |D ⊇ C X for all X ∈ N * , F |D ⊇ F * and |D| ≤ k|N * | + |P * |. Moreover, if J ⊇ C X for all X ∈ N * and F ⊇ F * , then L * ⊆ L(Ĝ).
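The dual counterpart of rule filtering can be sketched for strings in the same spirit. The sketch assumes that nonterminals are finite context sets, that C (K) is computed by membership queries on K, and that a rule with left-hand side C 0 and right-hand side C 1 , . . . , C n is kept exactly when every context in C 0 accepts every concatenation of strings drawn from the sets C i (K) . Function names and the target language are illustrative assumptions.

```python
from itertools import product


def approx_dagger(C, K, member):
    """C^(K): the strings in K accepted by every context in C."""
    return {s for s in K if all(member(l + s + r) for l, r in C)}


def dual_rule_survives(C0, rhs, K, member):
    """Keep the rule iff c ⊙ (s1 ... sn) is in the language for all c in C0, si in Ci^(K)."""
    pools = [approx_dagger(C, K, member) for C in rhs]
    return all(
        member(l + "".join(parts) + r)
        for parts in product(*pools)
        for l, r in C0
    )


def in_anbn(w):
    n = len(w) // 2
    return n >= 1 and w == "a" * n + "b" * n


C_S = {("", "")}    # accepts exactly the language itself
C_A = {("", "b")}   # accepts a, aab, ...
C_B = {("a", "")}   # accepts b, abb, ...

small_K = {"a", "b"}
large_K = {"a", "b", "aab", "abb"}
kept = dual_rule_survives(C_S, [C_A, C_B], small_K, in_anbn)
dropped = dual_rule_survives(C_S, [C_A, C_B], large_K, in_anbn)
```

With the small K the rule looks fine, but once K contains aab and abb the concatenation aababb falls outside the language and the rule is discarded, illustrating how a larger K removes incorrect rules.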
Theorem 5. Under Assumptions 1, 2 and 3, Algorithm 5 learns any language L * generated by a context-deterministic Σ-grammar G * with a minimally adequate teacher in time polynomial in |N * |, |F * |, ℓ where ℓ is the total size of counterexamples given to the learner.
Algorithm 5 uses the same grammar construction DUAL as Algorithm 4. By Lemma 9, when the oracle answers an EQ with a negative counterexample d, our conjecture Ĝ must use an incorrect rule to derive d. To find and remove such an incorrect rule, Algorithm 5 calls a subroutine WITNESSD with a derivation tree τ d of Ĝ whose yield is d. The procedure WITNESSD returns a finite set of substructures that witnesses an incorrect rule that contributes to generating d. Otherwise, τ̄ i ∈ c i † for all i, which means the rule ρ is incorrect. WITNESSD(τ ) returns the set {τ̄ 1 , . . . , τ̄ n }. Unlike the case of WITNESSP, an instance of a recursive call is always an (implicit) derivation tree of some s ∈ S |d . This explains why we do not need Assumption 4 in this case.
Lemma 11. Time between an EQ and another is polynomially bounded.
Lemma 12. Each time Algorithm 5 receives a negative counterexample, at least one incorrect rule is removed.
Lemma 13. Let G * = ⟨N * , σ, F * , P * , I * ⟩ be a context-deterministic grammar for L * . Each time Algorithm 5 receives a positive counterexample, the set { X ∈ N * | J ∩ C(G * , X) = ∅ } ∪ (F * − F ) shrinks.

Combined approaches
By combining primal and dual approaches, one can obtain stronger approaches (Yoshinaka, 2012). The class of Σ-grammars whose nonterminals admit either a k-kernel or an l-context can be learned by combining the techniques presented in Sections 6.2 and 6.4 under Assumptions 1, 2 and 3. Also Σ-grammars whose nonterminals satisfy the requirement to be either congruential or context-deterministic can be learned with a minimally adequate teacher under Assumptions 1, 2, 3 and 4 (Sections 6.3 and 6.5).

Restricted cases
In some grammar classes, it may be the case that only (supersets of) C |d and F |d are computable in polynomial time but S |d is not, or the other way around: S |d and F |d are efficiently computable but C |d is not. For example, in non-permuting parallel multiple CFGs (Seki et al., 1991), elements of S |d for a string d are tuples of strings of the form ⟨v 1 , . . . , v m ⟩ for d = u 0 v 1 u 1 . . . v m u m and such tuples are polynomially many if m is fixed. However, C |d contains exponentially many contexts. Clark and Yoshinaka (2014) showed that a dual approach still works for parallel multiple CFGs if nonterminals are known to have k-contexts belonging to a certain subset C ⊆ C such that C |d = C |d ∩ C is polynomial-time computable. A symmetric result for a primal approach has also been obtained by Kanazawa and Yoshinaka (2015), targeting a certain kind of tree grammars. This section does not postulate Assumption 3.
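The polynomial bound on tuple substructures for fixed m can be checked with a short enumeration. The sketch assumes that components appear in d from left to right without overlap and that empty components are allowed; both are assumptions of this illustration, and the function name is hypothetical.

```python
from itertools import combinations_with_replacement


def tuple_substructures(d, m):
    """All <v1, ..., vm> with d = u0 v1 u1 ... vm um, components in left-to-right order."""
    result = set()
    # Choose 2m nondecreasing cut positions p1 <= q1 <= ... <= pm <= qm in 0..len(d).
    for cuts in combinations_with_replacement(range(len(d) + 1), 2 * m):
        result.add(tuple(d[cuts[2 * i]:cuts[2 * i + 1]] for i in range(m)))
    return result


singles = tuple_substructures("ab", 1)
pairs = tuple_substructures("ab", 2)
```

The number of cut sequences is the binomial coefficient C(|d| + 2m, 2m), which is polynomial in |d| for fixed m, whereas nothing comparable bounds the contexts of a parallel multiple CFG, which may copy their argument.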
Definition 6. A Σ-grammar G is said to have the (k, S)-FKP if every nonterminal admits a k-kernel which is a subset of S.
Assumption 5. There are polynomial-time algorithms that compute SUB(d), CON(d) and FUN(d) It is not hard to see that Algorithm 2 works for learning Σ-grammars with (k, S)-FKP under Assumptions 1, 2 and 5. All discussions in Section 6.2 hold for this restricted case.
The symmetric definition and assumption are as follows.
Definition 7. A Σ-grammar G is said to have the (k, C)-FCP if every nonterminal admits a k-context which is a subset of C.
Assumption 6. There are polynomial-time algorithms that compute SUB(d), CON(d) and It is not hard to see that under Assumptions 1, 2 and 6, Algorithms 4 work for learning Σ-grammars with (k, C)-FCP Σ-grammars. All discussions in Section 6.4 hold for this restricted case.
When learning substitutable languages, even a weaker assumption suffices.
Assumption 7. There are sets S ⊆ S and C ⊆ C such that for every nonterminal X of G ∈ G(Σ), we have S(G, X) ∩ S ̸ = ∅ and C(G, X) ∩ C ̸ = ∅. Moreover, there are polynomial-time algorithms that compute SUB(d), CON(d) and Under Assumptions 1, 2 and 7, Algorithm 1 works using either SUBSTP or SUBSTD.
On the other hand, the results on the polynomial-time MAT learnability of congruential and context-deterministic Σ-grammars do not hold anymore under any of Assumptions 5, 6 and 7.

Extending learnable classes
This section compares learnable classes of Σ-languages for different Σ with the same special sort O * . For Σ 1 and Σ 2 with Σ i = ⟨Ω i , F i , O * ⟩, if Ω 1 ⊆ Ω 2 and F 1 ⊆ F 2 , every Σ 1 -grammar is a Σ 2 -grammar, so L(Σ 1 ) ⊆ L(Σ 2 ). However, since the distributional properties defined so far are relative to a signature, a Σ 1 -grammar with a distributional property under Σ 1 does not necessarily have the corresponding property under Σ 2 . Yet if S O and C O are preserved by moving from Σ 1 to Σ 2 , the distributional properties other than substitutability are preserved.
Let us define the direct union Σ 0 = ⟨Ω 0 , F 0 , O * ⟩ of arbitrary signatures Σ 1 and Σ 2 by where G i is a trivial variant of F i working on the new domain and codomain of the form (O, i) and □ i (s, i) = s for all s ∈ O * . Then every Σ i -grammar G can be seen as a special type of Σ 0 -grammar by adding a new initial symbol Z and rules of the form Z ← □ Σ i ⟨X⟩ for all initial symbols X of G. We have L(Σ 1 ) ∪ L(Σ 2 ) ⊆ L(Σ 0 ). Every Σ i -grammar that is congruential, context-deterministic, with the k-FKP or with the k-FCP for i = 1, 2 can be seen as a Σ 0 -grammar with those properties. Note that C O * is the singleton of the identity function in Σ 0 , which means any element of L(G) is a 1-kernel of the new initial symbol Z. In this way, from two signatures, one can obtain a richer learnable class of languages.
The above argument on signature generalization does not hold for the substitutable case. Rather, the opposite holds. If Ω 1 ⊆ Ω 2 and F 1 ⊆ F 2 , then a language substitutable under Σ 2 is substitutable under Σ 1 but not vice versa.
Let us say that Σ 2 is finer than Σ 1 if every sort of Ω 1 is partitioned into finite number of sorts in Ω 2 and every function of F 2 is a subfunction of some function in F 1 which accords with the partition. That is, every sort O of Ω 1 has a finite set Ω For instance, Σ k,l is finer than Σ k ′ ,l ′ for k ′ ≤ k and l ′ ≤ l in Example 4. If Σ 2 is finer than Σ 1 , L(Σ 1 ) = L(Σ 2 ) holds. Every language substitutable under Σ 1 is substitutable under Σ 2 but not vice versa. Moreover, every congruential (resp. context-deterministic) Σ 1 -grammar has an equivalent congruential (resp. context-deterministic) Σ 2 -grammar but not vice versa.

Grammars with partial functions
Yoshinaka (2015) showed that a dual approach can be applied to the learning of conjunctive grammars. Conjunctive grammars (Okhotin, 2001) are CFGs extended with the conjunctive operation & so that one can extract the intersection of the languages of nonterminals. For example, a conjunctive rule A 0 → A 1 &A 2 means that if both A 1 and A 2 generate the same string u then so does A 0 . Conjunctive grammars cannot be seen as Σ-grammars, since the conjunctive operation & is a partial function whose domain is not represented as the direct product of two sorts, which is not legitimate in the general framework of Σ-grammars.
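The partiality of the conjunctive operation is easy to see in code. The following minimal sketch models & as a function on pairs of strings that is defined only on the diagonal, using `None` for "undefined"; the names `conj` and `derive_conjunction` are illustrative.

```python
def conj(u, v):
    """The operation & of a rule A0 -> A1 & A2: defined only when both arguments coincide."""
    return u if u == v else None


def derive_conjunction(lang1, lang2):
    """Strings derivable for A0 from finite sets of strings derivable for A1 and A2."""
    results = set()
    for u in lang1:
        for v in lang2:
            w = conj(u, v)
            if w is not None:
                results.add(w)
    return results


a0 = derive_conjunction({"ab", "aabb", "abab"}, {"ab", "abab", "ba"})
```

The domain of `conj` is the set of pairs (u, u), which is not a product of two sorts, and the derived set is simply the intersection of the two argument sets.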
A partial signature is a triple Π = ⟨Ω, F, O * ⟩ which is defined in a way similar to a (total) signature, except that F may have partial functions. Accordingly contexts in C will be partial functions. We no longer have C(G, X) ⊙ S(G, X) ⊆ L(G), since c ⊙ s may not be defined for some elements c ∈ C(G, X) and s ∈ S(G, X). The correspondence between O-concept lattices and Σ-grammars collapses. This prevents the application of the theory of distributional learning developed in this paper to Π-grammars. Still we can generalize the discussion on the learning of conjunctive grammars.
Definition 8. A Π-grammar G is said to have the strong k-FCP if for any X ∈ N O , there is a finite set C X ⊆ C O with |C X | ≤ k such that S(G, X) = { s | c ⊙ s ∈ L for all c ∈ C X } .
Definition 8 requires every c ∈ C X to be total on S(G, X). One can learn Π-grammars with the strong k-FCP under Assumptions 1, 2 and 6, where C consists of total functions only. The grammar construction DUAL k should be modified so that we have a rule [ . . . , s n ) ∈ L * for any c ∈ C 0 and s i ∈ C (K) i such that f (s 1 , . . . , s n ) is defined. One might think that one can naturally define context-deterministic grammars accordingly: Every c ∈ C(G, X) should be a 1-context of X. However, this means that functions in such a Π-grammar are essentially total.