Probing the Linguistic Strengths and Limitations of Unsupervised Grammar Induction

Work in grammar induction should help shed light on the amount of syntactic structure that is discoverable from raw word or tag sequences. But since most current grammar induction algorithms produce unlabeled dependencies, it is difficult to analyze what types of constructions these algorithms can or cannot capture, and, therefore, to identify where additional supervision may be necessary. This paper provides an in-depth analysis of the errors made by unsupervised CCG parsers by evaluating them against the labeled dependencies in CCGbank, hinting at new research directions necessary for progress in grammar induction.


Introduction
Grammar induction aims to develop algorithms that can automatically discover the latent syntactic structure of language from raw or part-of-speech tagged text. While such algorithms would have the greatest utility for low-resource languages for which no treebank is available to train supervised parsers, most work in this area has focused on languages where existing treebanks can be used to measure and compare the performance of the resultant parsers. Despite significant progress in the last decade (Klein and Manning, 2004; Headden III et al., 2009; Blunsom and Cohn, 2010; Spitkovsky et al., 2013; Mareček and Straka, 2013), there has been little analysis performed on the types of errors these induction systems make, and our understanding of what kinds of constructions these parsers can or cannot recover is still rather limited. One likely reason for this lack of analysis is the fact that most of the work in this domain has focused on parsers that return unlabeled dependencies, which cannot easily be assigned a linguistic interpretation. This paper shows that approaches that are based on categorial grammar (Steedman, 2000) are amenable to more stringent evaluation metrics, which enable detailed analyses of the constructions they capture, while the commonly used unlabeled directed attachment scores hide linguistically important errors. Any categorial grammar-based system, whether deriving its grammar from seed knowledge distinguishing nouns and verbs (Bisk and Hockenmaier, 2013), from a lexicon constructed from a simple questionnaire for linguists (Boonkwan and Steedman, 2011), or from sections of a treebank (Garrette et al., 2015), will attach linguistically expressive categories to individual words, and can therefore produce labeled dependencies. We provide a simple proof of concept for how these labeled dependencies can be used to isolate problem areas in CCG induction algorithms.
We illustrate how they make the linguistic assumptions and mistakes of the model transparent, and are easily comparable to a treebank where available. They also allow us to identify linguistic phenomena that require additional supervision or training signal to master. Our analysis will be based on extensions of our earlier system (Bisk and Hockenmaier, 2013), since it requires less supervision than the CCG-based approaches of Boonkwan and Steedman (2011) or Garrette et al. (2015). Our aim in presenting this analysis is to initiate a broader conversation and classification of the impact of various types of supervision provided to these approaches. We will see that most of the constructions that our system cannot capture, even when they are included in the model's search space, involve precisely the kinds of non-local dependencies that elude even supervised dependency parsers (since they require dependency graphs, instead of trees), and that have motivated the use of categorial grammar-based approaches for supervised parsing.
First, we provide a brief introduction to CCG. Next, we define a labeled evaluation metric that allows us to compare the labeled dependencies produced by Bisk and Hockenmaier (2013)'s unsupervised parser with those in CCGbank (Hockenmaier and Steedman, 2007). Third, we extend their induction algorithm to allow it to induce more complex categories, and refine their probability model to handle punctuation and lexicalization, which we show to be necessary when handling the larger grammars induced by our variant of their algorithm. While we also perform a traditional dependency evaluation for comparison to the non-CCG based literature, we focus on our CCG-based labeled evaluation metrics to perform a comparative analysis of Bisk and Hockenmaier (2013)'s parser and our extensions.

Combinatory Categorial Grammar
CCG categories. CCG (Steedman, 2000) is a lexicalized grammar formalism which associates each word with a set of lexical categories that fully specify its syntactic behavior. Lexical categories indicate the expected number, type and relative location of arguments a word should take, or what constituents it may modify. Even without explicit evaluation against a treebank, the CCG lexicon that an unsupervised parser produces provides an easily interpretable snapshot of the assumptions the model has made about a language (Bisk and Hockenmaier, 2013). The set of CCG categories is defined recursively over a small set of atomic categories (e.g. S, N, NP, PP). Complex categories take the form X\Y or X/Y and represent functions which create a result of category X when combined with an argument Y. The slash indicates whether the argument precedes (\) or follows (/) the functor (descriptions of CCG commonly use the vertical slash | to range over both / and \). Modifiers are categories of the form X|X, and may take arguments of their own.
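The recursive definition of categories can be made concrete with a small data structure. The following is a minimal sketch, not the representation used by any particular parser: the names Atom, Functor, parse_category, is_modifier and arity are hypothetical, and the parser assumes the usual convention that slashes associate to the left.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as S, N, NP or PP."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Functor:
    """A complex category X/Y or X\\Y that yields `result` when given `arg`."""
    result: "Category"
    slash: str  # "/" means the argument follows, "\\" means it precedes
    arg: "Category"

    def __str__(self) -> str:
        def wrap(c):
            # Parenthesize nested complex categories only.
            return f"({c})" if isinstance(c, Functor) else str(c)
        return f"{wrap(self.result)}{self.slash}{wrap(self.arg)}"


Category = Union[Atom, Functor]


def parse_category(text: str) -> Category:
    """Parse a category string (assumption: slashes are left-associative)."""
    pos = 0

    def parse_one() -> Category:
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            cat = parse_chain()
            pos += 1  # consume ")"
            return cat
        start = pos
        while pos < len(text) and text[pos] not in "/\\()":
            pos += 1
        return Atom(text[start:pos])

    def parse_chain() -> Category:
        nonlocal pos
        cat = parse_one()
        while pos < len(text) and text[pos] in "/\\":
            slash = text[pos]
            pos += 1
            cat = Functor(cat, slash, parse_one())
        return cat

    return parse_chain()


def is_modifier(cat: Category) -> bool:
    """Modifiers have the form X|X: result and argument are the same category."""
    return isinstance(cat, Functor) and cat.result == cat.arg


def arity(cat: Category) -> int:
    """Number of arguments the category expects."""
    return 0 if isinstance(cat, Atom) else 1 + arity(cat.result)
```

Under this sketch, the transitive verb category (S\N)/N has arity 2 and is not a modifier, while N/N and (S\S)/(S\S) are modifiers.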
CCG rules. CCG rules are defined schematically as function application (>, <), unary (>B^1, <B^1) and generalized composition (>B^n, <B^n), type-raising (>T, <T) and conjunction.
CCG derivations. In the derivation for the sentence I saw her from afar, forward application is used in line 1) as both the verb and the preposition take their NP arguments. In line 2), the prepositional phrase modifies the verb via backwards composition. Finally, in line 3), the derivation completes by producing a sentence (S) via backwards application.
CCG dependencies. CCG has two standard evaluation metrics. Supertagging accuracy simply computes how often a model chooses the correct lexical category for a given word. The correct category is a prerequisite for recovering the correct labeled dependency. By tracing through which word fills which argument of which category, a set of dependency arcs, labeled by lexical category and slot, can be extracted. The lexical head of a lexical category c_i is the corresponding word w_i. In general, the lexical head of a derived category is determined by the (primary) functor, so that the lexical head of a category X or X|Z_1|...|Z_n that resulted from combining X|Y and Y or Y|Z_1|...|Z_n is identical to the lexical head of X. However, when a modifier X|X with lexical head m is combined with an X|... whose lexical head is w, the lexical head of the resultant X|... is w, not m.² Otherwise, from would become the lexical head of the S\N saw her from afar, and the sentence You know I saw her from afar would have a dependency between know and from, rather than between know and saw.
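The application and unary composition rules, and the three-step derivation described above, can be sketched as follows. This is an illustration under stated assumptions rather than a full CCG implementation: generalized composition, type-raising and conjunction are omitted, categories are encoded as plain tuples, and the lexical category (S\S)/N for from is an assumption chosen so that the prepositional phrase combines with the verb phrase by backward composition. The category (S\N)/N for saw is the one named in the text.

```python
# A category is either an atom (a string) or a tuple (result, slash, argument).
S, N = "S", "N"


def forward_apply(left, right):
    """X/Y  Y  =>  X   (>)"""
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward_apply(left, right):
    """Y  X\\Y  =>  X   (<)"""
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


def forward_compose(left, right):
    """X/Y  Y/Z  =>  X/Z   (>B^1)"""
    if (isinstance(left, tuple) and left[1] == "/"
            and isinstance(right, tuple) and right[1] == "/"
            and left[2] == right[0]):
        return (left[0], "/", right[2])
    return None


def backward_compose(left, right):
    """Y\\Z  X\\Y  =>  X\\Z   (<B^1)"""
    if (isinstance(left, tuple) and left[1] == "\\"
            and isinstance(right, tuple) and right[1] == "\\"
            and right[2] == left[0]):
        return (right[0], "\\", left[2])
    return None


# Hypothetical lexicon for "I saw her from afar".
saw = ((S, "\\", N), "/", N)     # (S\N)/N, transitive verb
from_ = ((S, "\\", S), "/", N)   # (S\S)/N, assumed category for the preposition

# Line 1: verb and preposition take their arguments by forward application.
vp = forward_apply(saw, N)       # saw her    => S\N
pp = forward_apply(from_, N)     # from afar  => S\S

# Line 2: the PP modifies the verb phrase by backward composition.
vp_mod = backward_compose(vp, pp)  # S\N  S\S  => S\N

# Line 3: the subject combines by backward application.
sentence = backward_apply(N, vp_mod)  # N  S\N  => S
```

Each rule returns None when it does not apply, which makes it straightforward to try all rules on a pair of adjacent constituents inside a chart parser.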
In general, word w_j is a dependent of word w_i if the k-th argument of the lexical category c_i of word w_i is instantiated with the lexical category of word w_j. In this example, I fills the first argument of saw. This is represented by an edge from saw to I, labeled as a transitive verb ((S\N)/N). This procedure is followed for every argument of every predicate, leading to a labeled directed graph.
The use of categories as dependency labels makes CCG labels more fine-grained than those of a standard dependency grammar. For example, the subject role of intransitive, transitive and ditransitive verbs is labeled SUB in all three cases in dependency treebanks, but takes at least three different labels in CCGbank.

i  j  w_j   Label
2  1  I     SUB
0  2  saw   ROOT
2  3  her   OBJ
2  4  from  VMOD
4  5  afar  PMOD

An additional complexity in CCGbank is that certain types of lexical categories (e.g. for relative pronouns or control verbs) mediate non-local dependencies via a co-indexation mechanism. Identifying such non-local dependencies, e.g. to distinguish between subject and object control (I promise her to come vs. I persuade her to come), is most likely beyond the scope of any purely syntactic grammar induction system but will begin to emerge in a semi-supervised system.
² That is, the argument X and result X of a modifier X|X are not two distinct instances of the same category, but unify.

Evaluation metrics for supervised CCG parsers (Clark et al., 2002) measure the labeled F-score (LF1) of these dependencies, which requires the functor, the argument, the lexical category of the functor and the slot of the argument to all match. A second, looser evaluation, which measures unlabeled, undirected dependency scores (UF1), is often also performed.
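The difference between the two metrics can be illustrated with a short sketch. It assumes each dependency is stored as a tuple (head index, dependent index, lexical category of the head, argument slot); the function names and the toy arcs below are hypothetical and are not taken from CCGbank or from any system output.

```python
def f1(gold: set, predicted: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def labeled_f1(gold, predicted) -> float:
    """LF1: head, dependent, category and slot must all match."""
    return f1(set(gold), set(predicted))


def unlabeled_undirected_f1(gold, predicted) -> float:
    """UF1: only the unordered pair of word positions must match."""
    def strip(deps):
        return {frozenset((head, dep)) for head, dep, _cat, _slot in deps}
    return f1(strip(gold), strip(predicted))


# Toy arcs for a five-word sentence (hypothetical analyses).
gold = [
    (2, 1, "(S\\N)/N", 1),
    (2, 3, "(S\\N)/N", 2),
    (4, 2, "(S\\S)/N", 1),
    (4, 5, "(S\\S)/N", 2),
]
# The predicted analysis attaches word 4 to word 3 as a noun modifier.
predicted = [
    (2, 1, "(S\\N)/N", 1),
    (2, 3, "(S\\N)/N", 2),
    (4, 3, "(N\\N)/N", 1),
    (4, 5, "(N\\N)/N", 2),
]

lf1 = labeled_f1(gold, predicted)
uf1 = unlabeled_undirected_f1(gold, predicted)
```

In this toy case the arc between words 4 and 5 counts as correct under UF1 but not under LF1, because the category label on the head differs, so UF1 is higher than LF1.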
Non-local dependencies and complex arguments. One advantage of CCG is its ability to recover the non-local dependencies involved in control, raising, or wh-extraction. Since these constructions introduce additional dependencies, CCG parsers return dependency graphs (DAGs), not trees. To obtain these additional dependencies, relative pronouns and control verbs require lexical categories that take complex arguments of the form S\NP or S/NP, and a mechanism for coindexation of the NP inside this argument with another NP argument (e.g. (NP\NP_i)/(S|NP_i) for relative pronouns). These co-indexed subjects can be seen in Figure 1.
.14 WP N/(S\N) .08 WP ((N\N)/S)\((N\N)/N) .07 WDT ((S\S)\(S\S))\N .04 RBR S/(S\N) .04 WP S/(S/N) .02 WP
Table 9: Overall performance of the final systems discussed in this paper (Section 23)
[...] indicate missing information which only becomes available later in the discourse.

Final Overall Model Performance
Finally, we evaluate these models again on the standard Section 23 against our simplified label set and on undirected unlabeled arcs.

CoNLL vs CCGbank dependencies
Finally, we examine whether the performance on standard unlabeled dependencies correlates with performance on CCGbank dependencies (Table 10).² This also allows us to compare our systems directly to the unsupervised dependency parser of Naseem et al. (2010), who report directed attachment (unlabeled dependency) scores of a dependency-based HDP model that incorporates either "universal" knowledge (e.g. that adjectives may modify nouns) or "English-specific" knowledge (e.g. that adjectives tend to precede nouns) in the form of soft constraints. Their universal knowledge is akin to, but more explicit and detailed than the information given to the induction algorithm (see Bisk and Hockenmaier (2013) for a discussion). They evaluate on their training data, i.e. sentences of up to length 20 (without punctuation marks) of Sections 02-21 of the Penn Treebank.³ We see that performance increases on CCGbank translate to similar gains on the CoNLL dependencies on long sentences. We should note that we expect this discrepancy to grow as systems capture more fine-grained distinctions. In this vein, we computed directed attachment recall between CCGbank dependencies and Yamada and Matsumoto's head finding rules and found only a 72.5% overlap. Many of the discrepancies appear to be related to verb chains and the analysis of the many DAG structures previously discussed. A full analysis of the distinctions is beyond the scope of this paper, but there is an interesting empirical question for future work as to whether annotation standards make learning even more burdensome.
² BH13 use hyperparameter schemes and report 64.2@20.

Conclusions
In this paper, we have touched upon many linguistic phenomena that are common in language and that we feel are currently out of scope for grammar induction systems. We focused our analysis on English for simplicity, but many of the same types of problems exist in other languages and can be easily identified as stemming from the same lack [...]

³ With Yamada and Matsumoto's (2003) head rules.
Figure 1: Unlabeled predicate-argument dependency graphs for two sentences with co-indexed subjects.
Errors exposed by labeled evaluation. We now illustrate how the lexical categories and labeled dependencies produced by CCG parsers expose linguistic mistakes. First, we consider a wildly incorrect analysis of the first example sentence, in which the subject is treated as an adverb, and the PP as an NP object of the verb. None of the labeled directed CCG dependencies are correct. But under the more lenient unlabeled directed evaluation of Garrette et al. (2015), and the even more lenient unlabeled undirected metric of Clark et al. (2002), two (or three) of the four dependencies would be deemed correct. We now turn to a subtle distinction that corresponds to a systematic mistake made by all models we evaluate. The categories of noun-modifying prepositions (at) and possessive markers ('s) differ only in the directionality of their slashes.
A full explanation of the calculus can be found in Steedman (2000), including discussion of type-raising and a ternary rule for conjunction. We assume no type-changing in this work.

Spurious ambiguity and normal-form parsing. Composition and type-raising introduce an exponential number of derivations that are semantically equivalent, i.e. yield the same set of dependencies. In supervised CCG parsers (Hockenmaier and Steedman, 2002; Clark and Curran, 2007), this spurious ambiguity is largely eliminated because the derivations in CCGbank are in a normal form that uses composition and type-raising only when necessary, although it can be further alleviated via the use of a normal-form parsing algorithm (Eisner, 1996; Hockenmaier and Bisk, 2010) that minimizes the use of composition (and type-raising). We will show below that this spurious ambiguity is particularly deleterious for unsupervised CCG parsers that do not impose any normal-form constraints.
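The growth in the number of derivations can be observed on a toy example. The sketch below counts derivations with a CKY-style chart for a sequence of noun modifiers N/N followed by a noun N, once with application only and once with unary composition added. It is an illustration only: type-raising and generalized composition are left out, categories are encoded as tuples (result, slash, argument), and the function names are hypothetical.

```python
from collections import defaultdict


def combine(left, right, use_composition):
    """Categories derivable from two adjacent categories."""
    out = []
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        out.append(left[0])                        # forward application
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        out.append(right[0])                       # backward application
    if use_composition and isinstance(left, tuple) and isinstance(right, tuple):
        if left[1] == "/" and right[1] == "/" and left[2] == right[0]:
            out.append((left[0], "/", right[2]))   # forward composition
        if left[1] == "\\" and right[1] == "\\" and right[2] == left[0]:
            out.append((right[0], "\\", left[2]))  # backward composition
    return out


def count_derivations(cats, goal, use_composition):
    """Number of distinct derivations of `goal` spanning all of `cats`."""
    n = len(cats)
    chart = defaultdict(lambda: defaultdict(int))
    for i, cat in enumerate(cats):
        chart[i, i + 1][cat] = 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for lcat, lcount in list(chart[i, j].items()):
                    for rcat, rcount in list(chart[j, k].items()):
                        for result in combine(lcat, rcat, use_composition):
                            chart[i, k][result] += lcount * rcount
    return chart[0, n][goal]


MOD = ("N", "/", "N")  # noun modifier N/N
application_only = [count_derivations([MOD] * m + ["N"], "N", False) for m in (3, 4, 5)]
with_composition = [count_derivations([MOD] * m + ["N"], "N", True) for m in (3, 4, 5)]
```

With application only, each of these strings has a single derivation. Once composition is allowed, every binary bracketing of the string becomes derivable, so the counts follow the Catalan numbers (5, 14 and 42 for four, five and six words). A normal-form constraint would keep only one derivation per equivalence class.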

Unsupervised CCG parsing
We now review the unsupervised CCG parser of Bisk and Hockenmaier (2012b; [...]), which is trained over parse forests obtained from a CCG lexicon that was induced from POS-tagged text.
Unsupervised CCG induction. The induction algorithm needs to identify the set of lexical categories and to learn the mapping between words and lexical categories. Bisk and Hockenmaier (2012b) define an algorithm that automatically induces a CCG lexicon from part-of-speech tagged text in an iterative process. This process starts with a small amount of seed knowledge that defines which atomic categories (S, N and conj) can be assigned to which part-of-speech tags (nominal POS tags may have the category N, while verbs may have the category S). Based on the assumption that, under mild restrictions, words can either subcategorize for or modify the words they are adjacent to, this process produces lexical categories of increasing complexity. Immediate neighbors of words with categories S or N may act as modifiers with categories S|S or N|N. The second round of induction can also introduce modifiers (X|X)|(X|X) of existing modifiers X|X. In the first iteration, words with category S can take adjacent N arguments. In the second round, modifiers and words with category S|N that are adjacent to words with the category N or [...]
These dependencies are the complete predicate-argument structure of the sentence, and supervised evaluation is performed by computing a parser's precision and recall on matching the head, dependent, category and slot of each arc. A second, looser evaluation is often also performed, which simply checks that the undirected and unlabeled arcs match. An example of this difference that is particularly relevant to the discussion in this paper is the headedness of prepositional phrases versus possessives.

Prepositional Phrase
The undirected edges for the initial noun phrase are identical, but the heads differ. In CCG, we assume that categories of the form X|X where X is atomic are modifiers. In this way, the first sentence turns the prepositional phrase (at the company) into a modifier of the woman. In contrast, in the [...] as getting the wrong head leads to the company laughing or other semantically nonsensical analyses.

Using Labels to Diagnose Errors
Finally, we briefly present an incorrect analysis of the first example sentence as a simple exercise in using labels to diagnose mistakes. In this example, the analysis of the verb treats the language as VOS instead of SVO. Once one is familiar with reading CCG categories, the model's output and its mistake can be easily diagnosed. A model producing this analysis is not learning the correct word order of the language, nor the correct role for prepositions, since it takes afar as a subject. This type of mistake is obvious to a speaker of the language even without a treebank for evaluation. In this way, we believe label prediction eases the analysis burden when diagnosing a system's output.

A Simplified Labeled Evaluation
In languages with treebanks, labeled evaluation can make this style of analysis even simpler. Approaches using CCG can produce labeled output, but there are mismatches between the basic set of categories and those used in treebanks. We will focus on the English CCGbank, but these details apply with only minor changes to German and Chinese as well.

Simplification
Because the lexical categories guide parsing, the set used in supervised parsing is extremely large and augmented with features. These features are not strictly part of the CCG calculus but mark properties of the underlying words, for example indicating if a verb is declarative or infinitival or if a noun phrase contains a number. These features [...]

The unlabeled dependencies inside the noun phrases are identical, but the heads differ. The first sentence turns the prepositional phrase (at the company) into a modifier of woman. In contrast, in the possessive case, woman's modifies company. According to an unlabeled (directed) score, confusing these analyses would be 80% correct, whereas LF1 would only be 20%. But without a semantic bias for companies growing and women laughing, there is no signal for the learner.

Labeled Evaluation for CCG Induction
We have just seen that labeled evaluation can expose many linguistically important mistakes. In order to enable a fair and informative comparison of unsupervised CCG parsers against the lexical categories and labeled dependencies in CCGbank, we define a simplification of CCGbank's lexical categories that does not alter the number or direction of dependencies, but makes the categories and dependency labels directly comparable to those produced by an unsupervised parser. We also do not alter the derivations themselves, although these may contain type-changing rules (which allow e.g. participial verb phrases S[ng]\NP to be used as NP modifiers NP\NP) that are beyond the scope of our induction algorithm.
Although the CCG derivations and dependencies that CCG-based parsers return should in principle be amenable to a quantitative labeled evaluation when a gold-standard CCG corpus is available, there may be minor systematic differences between the sets of categories assumed by the induced parser and those in the treebank. In particular, the lexical categories in the English CCGbank are augmented with morphosyntactic features that indicate e.g. whether sentences are declarative (S[dcl]), or verb phrases are infinitival (S[to]\NP). Prior work on supervised parsing with CCG found that many of these features can be recovered with proper modeling of latent state splitting (Fowler and Penn, 2010). Since we wish to evaluate a system that does not aim to induce such features, we remove them. We also remove the distinction between noun phrases (NP) and nouns (N), which is predicated on knowledge of [...], allowing us to maintain the dependency on the subject. With these three simplifications we eliminate much of the detailed knowledge required to construct the precise CCGbank-style categories, and dramatically reduce the set of categories without losing expressive power. One distinction that we do not conflate, even though it is currently beyond the scope of the induction algorithm, is the distinction between PP arguments (requiring prepositions to have the category PP/NP) and adjuncts (requiring prepositions to be (NP\NP)/NP or ((S\NP)\(S\NP))/NP). This simplification is consistent with the most basic components of CCG and can therefore be easily used for the evaluation and analysis of any weakly or fully supervised CCG system, not just that of Bisk and Hockenmaier (2012). An example simplification is shown in Figure 2, and the reduction in the set of categories can be seen in Table 1. Similar simplifications should also be possible for CCGbanks in other languages.
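Two of the simplifications described above, removing morphosyntactic features and conflating NP with N, are simple string rewrites and can be sketched as follows. The function name simplify_category and the regular expressions are assumptions made for illustration, the third simplification is not covered, and the PP category is deliberately left untouched, in line with the distinction the text chooses to keep.

```python
import re

# Bracketed morphosyntactic features such as [dcl], [to] or [ng].
FEATURE = re.compile(r"\[[^\]]*\]")
# The atomic category NP as a whole token (does not touch PP).
NOUN_PHRASE = re.compile(r"\bNP\b")


def simplify_category(category: str) -> str:
    """Strip features and map NP to N in a CCGbank-style category string."""
    without_features = FEATURE.sub("", category)
    return NOUN_PHRASE.sub("N", without_features)


examples = {
    "(S[dcl]\\NP)/NP": simplify_category("(S[dcl]\\NP)/NP"),
    "S[to]\\NP": simplify_category("S[to]\\NP"),
    "PP/NP": simplify_category("PP/NP"),
    "((S\\NP)\\(S\\NP))/NP": simplify_category("((S\\NP)\\(S\\NP))/NP"),
}
```

Because the rewrite only changes category names, it leaves the number and direction of the arguments, and therefore of the dependencies, unchanged.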

Our approach
There are two parts to our approach: 1) inducing a CCG grammar from seed knowledge and 2) learning a probability model over parses. The induction algorithm (Bisk and Hockenmaier, 2012) uses the seed knowledge that nouns can take the CCG category N, that verbs can take the category S and may take N arguments, and that any word may modify a constituent it is adjacent to, to iteratively induce a CCG lexicon to parse the training data. In Bisk and Hockenmaier (2013), we introduced a model that is based on Hierarchical Dirichlet Processes (Teh et al., 2006). This HDP-CCG model gave state-of-the-art performance on a number of languages, and qualitative analysis of the resultant lexicons indicated that the system was learning the word order and many of the correct attachments of the tested languages. But this system also had a number of shortcomings: the induction algorithm was restricted to a small fragment of CCG, the model emitted only POS tags rather than words, and punctuation was ignored. Here, we use our previous HDP-CCG system as a baseline, and introduce three novel extensions that attempt to address these concerns.
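The first part of the approach, inducing a lexicon from seed knowledge, can be sketched for a single round on a single tagged sentence. This is a simplified illustration and not the published algorithm: the seed mapping from Penn Treebank tags to atomic categories, the function name induce_round and the string encoding of categories are all assumptions, the restrictions that the original algorithm places on category creation are ignored, and later rounds that build more complex categories are not shown.

```python
# Hypothetical seed knowledge: nominal tags may be N, verbal tags may be S.
SEED = {
    "NN": {"N"}, "NNS": {"N"}, "PRP": {"N"},
    "VBD": {"S"}, "VBZ": {"S"},
    "CC": {"conj"},
}


def induce_round(tags, lexicon):
    """One induction round over a tag sequence; returns an extended lexicon."""
    new = {tag: set(cats) for tag, cats in lexicon.items()}
    for i, tag in enumerate(tags):
        # A left neighbor is consumed with "\", a right neighbor with "/".
        for j, slash in ((i - 1, "\\"), (i + 1, "/")):
            if not 0 <= j < len(tags):
                continue
            for neighbor_cat in lexicon.get(tags[j], ()):
                if neighbor_cat in ("S", "N"):
                    # Any word may modify an adjacent S or N.
                    new.setdefault(tag, set()).add(
                        f"{neighbor_cat}{slash}{neighbor_cat}")
                if neighbor_cat == "N" and "S" in lexicon.get(tag, ()):
                    # A word with category S may take an adjacent N argument.
                    new[tag].add(f"S{slash}N")
    return new


# "I saw her from afar" with Penn Treebank style tags.
tags = ["PRP", "VBD", "PRP", "IN", "NN"]
lexicon_round1 = induce_round(tags, SEED)
```

After this round the verb tag has gained S\N and S/N, and the preposition tag, which had no seed category, has gained the noun modifier categories N\N and N/N. Applying the function again to its own output would correspond to a further round, although the sketch does not produce the nested modifier categories described in the text.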

Experimental Setup
For our experiments we will follow the standard practice in supervised parsing of using WSJ Sections 02-21 for training, Section 22 for development and error analysis, and a final evaluation of the best models on Section 23. Because the induced lexicons are overly general, the memory footprint grows rapidly as the complexity of the grammar increases. For this reason, we only train on sentences that contain up to 20 words (as well as an arbitrary number of punctuation marks). All analyses and evaluation are performed with sentences of all lengths unless otherwise indicated. Finally, Bisk and Hockenmaier (2013) followed Liang et al. (2007) in setting the values of the hyperparameters α to powers (e.g. the square) of the number of observed outcomes in the distribution. But when the output consists of words rather than POS tags, the concentration parameter $\alpha = V^2$ is too large to allow the model to learn. For this reason, experiments will be reported with all hyperparameters set to a constant of 2500.

Extending the HDP-CCG system
We now examine how extending the HDP-CCG baseline model to capture lexicalization and punctuation, and how increasing the complexity of the induced grammars, affect performance (Table 2).

Modeling Lexicalization
In keeping with most work in grammar induction from part-of-speech tagged text, Bisk and Hockenmaier's (2013) HDP-CCG treats POS tags t rather than words w as the terminals it generates based on their lexical categories c. The advantage of this approach is that tag-based emissions p(t|c) are much less sparse than word-based emissions p(w|c). It is therefore beneficial to first train a model that emits tags rather than words (Carroll and Rooth, 1998), and then to use this simpler model to initialize a lexicalized model that generates words instead of tags. To perform the switch we simply estimate counts for the parse forests using the unlexicalized model during the E-Step and then apply those counts to the lexicalized model during the M-Step. Inside-Outside then continues as before. Many words, like prepositions, differ systematically in their preferred syntactic role from that of their part-of-speech tags, and lexicalization benefits all settings of the model (Column 2 of Table 2).
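The switch from tag emissions to word emissions can be sketched as follows. This is a deliberately simplified Python illustration under loudly stated assumptions: the E-step here scores each token in isolation from p(c) and p(t|c) instead of running Inside-Outside over parse forests, and the M-step is a plain relative-frequency estimate rather than the HDP update. All function names and the toy distributions are hypothetical.

```python
from collections import defaultdict


def e_step_posteriors(tag, p_tag_given_cat, p_cat):
    """Posterior over categories for one token under the tag-emitting model.

    Assumption: a stand-in for Inside-Outside that ignores syntactic context.
    """
    scores = {c: p_cat[c] * p_tag_given_cat[c].get(tag, 0.0) for c in p_cat}
    total = sum(scores.values())
    if total == 0.0:
        return {}
    return {c: s / total for c, s in scores.items() if s > 0.0}


def lexicalize(corpus, p_tag_given_cat, p_cat):
    """Collect expected (category, word) counts with the unlexicalized model
    and normalize them into word emissions p(w|c) for the lexicalized model."""
    counts = defaultdict(lambda: defaultdict(float))
    for sentence in corpus:
        for word, tag in sentence:
            posteriors = e_step_posteriors(tag, p_tag_given_cat, p_cat)
            for cat, post in posteriors.items():
                counts[cat][word] += post  # E-step: counts come from the tag model
    # M-step: the same counts now parameterize word emissions.
    return {
        cat: {w: n / sum(words.values()) for w, n in words.items()}
        for cat, words in counts.items()
    }


if __name__ == "__main__":
    p_cat = {"N": 0.5, r"(N\N)/N": 0.25, r"((S\N)\(S\N))/N": 0.25}
    p_tag_given_cat = {
        "N": {"NN": 1.0},
        r"(N\N)/N": {"IN": 1.0},
        r"((S\N)\(S\N))/N": {"IN": 1.0},
    }
    corpus = [[("dog", "NN"), ("of", "IN")], [("cat", "NN")]]
    print(lexicalize(corpus, p_tag_given_cat, p_cat))
```

After this initialization, further iterations can sharpen the word-specific preferences that a tag-level model cannot express, which is the motivation given above for prepositions.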

Modeling Punctuation
Spitkovsky et al. (2011) performed a detailed analysis of punctuation for dependency-based grammar induction, and proposed a number of constraints that aimed to capture the different ways in which dependencies might cross constituent boundaries implied by punctuation marks. A constituency-based formalism like CCG allows us instead to define a very simple but effective Dirichlet Process (DP) based Markov grammar that emits punctuation marks at the maximal projections of constituents. We note that CCG derivations are binary branching, and that virtually every instance of a binary rule in a normal-form derivation combines a head X or X|Y with an argument Y or modifier X|X. Without reducing the set of strings generated by the grammar, we can therefore assume that punctuation marks can only be attached to the argument Y or the adjunct X|X.

To model this, for each maximal projection (i.e. whenever we generate a non-head child) with category C, we first decide whether punctuation marks should be emitted (M = {true, false}) to the left or right side (Dir) of C. Since there may be multiple adjacent punctuation marks (... ."), we treat this as a Markov process in which the history variable (Hist) captures whether previous punctuation marks have been generated or not. Finally, we generate an actual punctuation mark $w_m$:

$$\ldots \sim DP(\alpha,\ p(M))$$
$$p(w_m \mid Dir, Hist, C) \sim DP(\alpha,\ p(w_m \mid Dir, Hist))$$
$$p(w_m \mid Dir, Hist) \sim DP(\alpha,\ p(w_m))$$

We treat # and $ symbols as ordinary lexical items for which CCG categories will be induced by the regular induction algorithm, but treat all other punctuation marks, including quotes and brackets, as punctuation generated by this Markov grammar. Commas and semicolons (, and ;) can act both as punctuation marks generated by this Markov grammar, and as conjunctions with lexical category conj. This model leads to further performance gains (Columns 3 and 4 of Table 2).

Footnote: [...] reported dependency evaluation comparison with the work of Naseem et al. (2010). We fixed this hyperparameter setting for experimental simplicity, but a more rigorous grid search might find better parameters for the complex models.
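The back-off structure of the punctuation emissions can be sketched with the standard DP posterior predictive, (n(x, ctx) + α · base(x)) / (n(ctx) + α), chained over three levels. This is a minimal Python sketch under assumptions: it uses raw observation counts at every level rather than the table counts of a full HDP sampler, the punctuation inventory `PUNCT` is a small hypothetical list, and the class and function names are ours. The value 2500 is the constant hyperparameter setting reported in the experimental setup.

```python
from collections import Counter

PUNCT = [",", ".", ";", ":", "``", "''"]  # hypothetical inventory for illustration
ALPHA = 2500.0                            # constant hyperparameter from the setup


class BackoffDP:
    """DP posterior predictive with a pluggable base distribution (simplified)."""

    def __init__(self, alpha, base):
        self.alpha = alpha
        self.base = base          # callable: (outcome, context) -> probability
        self.counts = Counter()   # (context, outcome) -> count
        self.totals = Counter()   # context -> count

    def observe(self, outcome, context):
        self.counts[(context, outcome)] += 1
        self.totals[context] += 1

    def prob(self, outcome, context):
        n = self.counts[(context, outcome)]
        smoothed = n + self.alpha * self.base(outcome, context)
        return smoothed / (self.totals[context] + self.alpha)


# p(w_m) backs off to a uniform distribution over the inventory.
p_w = BackoffDP(ALPHA, lambda w, ctx: 1.0 / len(PUNCT))
# p(w_m | Dir, Hist) backs off to p(w_m).
p_w_dh = BackoffDP(ALPHA, lambda w, ctx: p_w.prob(w, ()))
# p(w_m | Dir, Hist, C) backs off to p(w_m | Dir, Hist).
p_w_dhc = BackoffDP(ALPHA, lambda w, ctx: p_w_dh.prob(w, ctx[:2]))


def observe(mark, direction, hist, category):
    """Record one emitted punctuation mark at every level of the hierarchy."""
    p_w.observe(mark, ())
    p_w_dh.observe(mark, (direction, hist))
    p_w_dhc.observe(mark, (direction, hist, category))
```

With no observations every mark receives the uniform probability; as marks are observed in a given (Dir, Hist, C) context, mass shifts toward them while unseen contexts still inherit the more general statistics.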

Increasing Grammatical Complexity
The existing grammar induction scheme is very simplistic. It assumes that adjacent words either modify one another or can be taken as arguments. Left unconstrained, the space of grammatical categories introduced grows very rapidly, introducing a tremendous number of incorrect categories (analyzed later in Table 9). For this reason Bisk and Hockenmaier (2013) applied the HDP-CCG model to a context-free fragment of CCG, limiting the arity of lexical categories (the number of arguments they can take) to two and the arity of composition (how many arguments can be passed through composition) to one. We know the space of grammatical constructions is larger than this, so we will allow the model to induce categories with three arguments and use generalized composition ($B_3$). Bisk and Hockenmaier (2013) allow lexical categories to take only atomic arguments, but, as explained above, non-local dependencies require complex arguments of the form S|N. We therefore allow lexical categories to take up to one complex argument of the form S|N. Atomic lexical categories are not allowed to take complex arguments, eliminating S|(S|N) and N|(S|N). Increasing the search space (Rows 3 and 4 of Table 2) shows corresponding decreases in performance. Finally, Bisk and Hockenmaier (2013) eliminated the possessive-preposition ambiguity explained above by disallowing categories of the form (X\X)/X and (X/X)\X to be used simultaneously. Removing this restriction does not harm performance (Column 5 of Table 2).

Summary and test set performance
Table 2 shows the performance of 20 different model settings on Section 22 under the simplified labeled CCG-based dependency evaluation proposed above, starting with Bisk and Hockenmaier's (2013) original model (henceforth: $B_1$, top left). We see that modeling punctuation and lexicalization both increase performance. We also show that allowing categories of the form (X\X)/X and (X/X)\X on top of the lexicalized models with punctuation does not lead to a noticeable decrease in performance. We also see that an increase in grammatical and lexical complexity is only beneficial for the grammars that allow only atomic arguments, and only if both lexicalization and punctuation are modeled. Allowing complex arguments is generally not beneficial, and performance drops further if the grammatical complexity is increased to $B_3$. Our further analysis will focus on the three bolded models, $B_1$, $B_1^C$ (the best model with complex arguments) and $B_3^{P\&L}$ (the best overall model), whose supertag accuracy, labeled (LF1) and unlabeled undirected (UF1) CCG dependency recovery on Section 23 are shown in Table 3. We see that $B_1^C$ and $B_3^{P\&L}$ both outperform $B_1$ on all metrics, although the unlabeled metric (UF1) perhaps misleadingly suggests that $B_1^C$ leads to a greater improvement than the supertagging and LF1 metrics indicate.

Table 3: Test set performance of the final systems discussed in this paper (Section 23)
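To make the difference between the labeled and the unlabeled undirected metric concrete, the following is a minimal Python sketch. It assumes that a labeled CCG dependency is represented as a tuple (functor index, argument index, functor category, argument slot); this representation, the function names, and the hand-written toy dependencies are our assumptions for illustration and are not the output of any of the parsers discussed here.

```python
def f1(pred, gold):
    """Harmonic mean of precision and recall over two sets."""
    correct = len(pred & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def labeled_f1(pred, gold):
    """LF1: word pair, functor category and slot must all match."""
    return f1(set(pred), set(gold))


def unlabeled_undirected_f1(pred, gold):
    """UF1: only the unordered pair of word positions must match."""
    def strip(deps):
        return {frozenset((head, dep)) for head, dep, _cat, _slot in deps}
    return f1(strip(pred), strip(gold))


# Hypothetical dependencies for a six-word sentence (positions 0-5).
GOLD = [
    (0, 1, "N/N", 1),
    (2, 1, r"(S\N)/(S\N)", 1), (2, 3, r"(S\N)/(S\N)", 2),
    (3, 1, r"(S\N)/N", 1), (3, 5, r"(S\N)/N", 2),
    (4, 5, "N/N", 1),
]
PRED = [
    (0, 1, "N/N", 1),
    (2, 1, r"S\N", 1),
    (3, 2, r"S\S", 1),
    (4, 3, r"(S\S)/N", 1), (4, 5, r"(S\S)/N", 2),
]

if __name__ == "__main__":
    print("LF1 =", round(labeled_f1(PRED, GOLD), 3))
    print("UF1 =", round(unlabeled_undirected_f1(PRED, GOLD), 3))
```

In this toy case four of the predicted word pairs coincide with gold pairs, but only one of them carries the correct category and slot, so the unlabeled score is far higher than the labeled one.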

CCGbank vs. dependency trees
Finally, to compare our models directly to a comparable unsupervised dependency parser (Naseem et al., 2010), we evaluate them against the unlabeled dependencies produced by Yamada and Matsumoto's (2003) head rules for Sections 02-21 of the Penn Treebank (Table 4). Naseem et al. (2010) only report performance on sentences of up to length 20 (without punctuation marks). Their approach incorporates prior linguistic knowledge either in the form of "universal" constraints (e.g. that adjectives may modify nouns) or "English-specific" constraints (e.g. that adjectives tend to modify and precede nouns). These universal constraints are akin to, but more explicit and detailed than, the information given to the induction algorithm (see Bisk and Hockenmaier (2013) for a discussion). Comparing these numbers to labeled and unlabeled CCG dependencies on the same corpus (all sentences, hence, @∞), we see that performance increases on CCGbank do not translate to similar gains on these unlabeled dependencies. While we have done our best to convert the predicate argument structure of CCG into dependencies

Table 5: Detailed supertagging analysis: Recall scores of $B_1$, $B_1^C$, and $B_3^{P\&L}$ on the most common recoverable (simplified) lexical categories in Section 22 along with the most commonly produced error.

Error analysis
Supertagging error analysis. We first consider the lexical categories that are induced by the models. Table 5 shows the accuracy with which they recover the most common gold lexical categories, together with the category that they most often produced instead. We see that the simplest model ($B_1$) performs best on N, and perhaps overgenerates (N\N)/N (noun-modifying prepositions), while the overall best model ($B_3^{P\&L}$) outperforms both other models only on intransitive verbs.
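The per-category analysis described above (recall on each gold category, plus the category most often produced instead) can be computed from aligned gold and predicted supertag sequences. The following Python sketch is illustrative only; the function name `supertag_report` and the toy tag sequences are assumptions.

```python
from collections import Counter, defaultdict


def supertag_report(gold, pred):
    """Map each gold category to (recall, most common wrong prediction or None)."""
    confusions = defaultdict(Counter)
    for g, p in zip(gold, pred):
        confusions[g][p] += 1
    report = {}
    for g, row in confusions.items():
        total = sum(row.values())
        errors = Counter({p: n for p, n in row.items() if p != g})
        top_error = errors.most_common(1)[0][0] if errors else None
        report[g] = (row[g] / total, top_error)
    return report


if __name__ == "__main__":
    gold = ["N", "N", "N/N", r"(S\N)/N", r"(S\N)/N", "N"]
    pred = ["N", "N", "N/N", r"S\N", r"S\N", "N/N"]
    for cat, (recall, error) in supertag_report(gold, pred).items():
        print(f"{cat}: recall={recall:.2f}, most common error={error}")
```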
The most interesting component of our analysis is the long tail of constructions that must be captured in order to produce semantically appropriate representations. We can inspect the confusion matrix of the lexical categories that the model fails to use to obtain insight into how its predictions disagree with the ground truth, and why these constructions may require special attention. Table 6 shows the most common CCGbank categories that were in the search space of some of the more complex models (e.g. $B_3^C$), but were never used by any of the parsers in a Viterbi parse. These include possessives, relative pronouns, modals/auxiliaries, control verbs and ditransitives. We show the categories that the $B_1^C$ model uses instead. The gold categories shown correspond to the bold words in Table 6. While the reason many of these cases are difficult is intuitive (e.g. very modifying tall instead of man), a more difficult type of error than previously discussed is that of recovering non-local dependencies. The recovery of non-local dependencies is beyond the scope of both standard dependency-based approaches and Bisk and Hockenmaier's (2013) original induction algorithm. But the parser does not learn to use lexical categories with complex arguments correctly even when the algorithm is extended to induce them. For example, $B_1^C$ prefers to treat auxiliaries or equi verbs like promise as intransitives rather than as an auxiliary that shares its subject with pay. The surface string supports this decision, as it can be parsed without having to capture the non-local dependencies (top row) present in the correct (bottom row) analysis. We also see that this model uses seemingly non-English verb categories of the form (S/N)/N, both for ditransitives and for object control verbs, perhaps because the possibly spurious /N argument could be swallowed by other categories that take arguments of the form S/N, like the (incorrect) treatment of subject relative pronouns.
One possible lesson we can extract from this is that practical approaches for building parsers for new languages might need to focus on injecting semantic information that is outside the scope of the learner.
Dependency error analysis. Table 7 shows the labeled recall of the most common dependencies. We see that both new models typically outperform the baseline, although they yield different improvements on different dependency types. $B_1^C$ is better at recovering the subjects of intransitive verbs (S\N) and verbs that take sentential complements ((S\N)/S), while $B_3$ is better for simple adjuncts (N/N, S/S, S\S) and transitive verbs.
Wh-words and the long tail. To dig slightly deeper into the set of missing constructions, we tried to identify the most common categories that are beyond the search space of the current induction algorithm. We first computed the set of categories used by each part of speech tag in CCGbank, and thresholded the lexicon at 95% token coverage for each tag. Removing the categories that contain PP and those that can be induced by the algorithm in its most general setting, we are left with the categories shown in Table 8. The tags that are missing categories are predominantly wh-words required for wh-questions, relative clauses or free relative clauses. Some of these categories violate the assumptions made by the induction algorithm: question words return a sentence (S) but are not themselves verbs. Free relative pronouns return a noun, but take arguments. However, this is a surprisingly small set of special function words and therefore perhaps a strategic place for supervision. Questions in particular pose an interesting learning question: how does one learn that these constructions indicate missing information which only becomes available later in the discourse?

Table 9: Size, ambiguity, coverage and precision (evaluated on Section 22) of the induced lexicons.
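The filtering procedure described above (threshold each tag's categories at 95% token coverage, then drop categories that contain PP or that the algorithm can induce) can be sketched in Python as follows. The function names, the representation of the treebank as (tag, category) pairs, and the `inducible` set are assumptions for illustration; in particular, the set of categories the induction algorithm can actually generate must be supplied from elsewhere.

```python
from collections import Counter, defaultdict


def threshold_lexicon(tagged_cats, coverage=0.95):
    """For each tag, keep its most frequent categories until they cover
    at least `coverage` of that tag's tokens."""
    by_tag = defaultdict(Counter)
    for tag, cat in tagged_cats:
        by_tag[tag][cat] += 1
    lexicon = {}
    for tag, counts in by_tag.items():
        total = sum(counts.values())
        kept, covered = [], 0
        for cat, n in counts.most_common():
            if covered >= coverage * total:
                break
            kept.append(cat)
            covered += n
        lexicon[tag] = kept
    return lexicon


def missing_categories(lexicon, inducible):
    """Drop PP categories and inducible ones; return what remains per tag."""
    missing = {}
    for tag, cats in lexicon.items():
        rest = [c for c in cats if "PP" not in c and c not in inducible]
        if rest:
            missing[tag] = rest
    return missing
```

Applied to a gold-standard corpus, the result lists for each tag the frequent categories that lie outside the induction algorithm's search space.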
Grammatical complexity and size of the search space. As lexical categories are a good proxy for the set of constructions the grammar will entertain, we can measure the size and ambiguity of the search space as a function of the number of lexical category types it induces as compared to the percentage that are actually valid categories for the language. In Table 9, we compare the lexicons induced by variants of the induction algorithm by their token-based coverage (the percent of tokens in Section 22 for which the induced tag lexicon contains the correct category), type-based coverage (the percent of category types that the induced lexicon contains), as well as type-based precision (the percent of induced category types that occur in Section 22). This analysis is independent of the learned models, as their probabilities are not taken into account. We see that as the number of lexical categories induced (subject to the constraints of Bisk and Hockenmaier (2012)) increases, the percent that are valid English categories decreases rapidly (type-based precision falls from 81.1% to 36.1%). Despite this, and despite a high token coverage of up to 90%, we still miss almost 70% of the required category types. This helps explain why performance degrades so much for $B_3^C$, the arity three lexicon with complex arguments.
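The three lexicon statistics defined above can be computed directly from an induced tag lexicon and a gold-standard list of (tag, category) tokens. The following Python sketch assumes the induced lexicon is a mapping from POS tags to sets of categories; the function name and the toy data are hypothetical.

```python
def lexicon_stats(induced, gold_tokens):
    """Token-based coverage, type-based coverage and type-based precision.

    induced:     dict mapping a POS tag to the set of categories induced for it
    gold_tokens: list of (tag, gold_category) pairs, one per token
    """
    gold_types = {cat for _tag, cat in gold_tokens}
    induced_types = set().union(*induced.values())
    shared = gold_types & induced_types
    covered = sum(cat in induced.get(tag, ()) for tag, cat in gold_tokens)
    return {
        "token_coverage": covered / len(gold_tokens),
        "type_coverage": len(shared) / len(gold_types),
        "type_precision": len(shared) / len(induced_types),
    }


if __name__ == "__main__":
    induced = {
        "NN": {"N", "N/N"},
        "VBD": {r"S\N", r"(S\N)/N", "(S/N)/N"},
    }
    gold_tokens = [
        ("NN", "N"), ("VBD", r"(S\N)/N"),
        ("NN", "N"), ("VBD", r"((S\N)/N)/N"),
    ]
    print(lexicon_stats(induced, gold_tokens))
```

Because only the lexicon is inspected, these numbers are independent of any learned probabilities, in line with the analysis above.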

Dealing with Non-Local Dependencies
While the methodology used here is restricted to CCG based algorithms, we believe the lessons to be very general. The aforementioned constructions involve optional arguments, non-local dependencies, and multiple potential heads. Even though CCG is theoretically expressive enough to handle these constructions, they present the unsupervised learner with additional ambiguity that will pose difficulties independently of the underlying grammatical representation.
For example, although our approach learns that subject NPs are taken as arguments by verbs, the task of deciding which verb to attach the subject to is frequently ambiguous. This most commonly occurs in verb chains, and is compounded in the presence of subject-modifying relative clauses (in CCGbank, both constructions are in fact treated as several verbs sharing a single subject). To illustrate this, we ran the $B_1^C$ and $B_3^{P\&L}$ systems on the following three sentences:

1. The woman won an award
2. The woman has won an award
3. The woman being promoted has won an award

The single-verb sentence is correctly parsed by both models, but they flounder as distractors are added. Both treat has as an intransitive verb, won as an adverb and an as a preposition:

Sentence 1, $B_3^{P\&L}$ / $B_1^C$:
The   woman   won       an    award
N/N   N       (S\N)/N   N/N   N

Sentence 2, $B_3^{P\&L}$ / $B_1^C$:
The   woman   has   won   an        award
N/N   N       S\N   S\S   (S\S)/N   N

To accommodate the presence of two additional verbs, both models analyze being as a noun modifier that takes promoted as an argument. $B_1^C$ (correctly) stipulates a non-local dependency involving promoted, but treats it (arguably incorrectly) as a case of object extraction: ...

Discovering these, and many of the other systematic errors described here, may be less obvious when analyzing unlabeled dependency trees. But we would expect similar difficulties for any unsupervised approach when sentence complexity grows without a specific bias for a given analysis.

Conclusions
In this paper, we have introduced labeled evaluation metrics for unsupervised CCG parsers, and have shown that these expose many common syntactic phenomena that are currently out of scope for any unsupervised grammar induction system. While we do not wish to claim that CCGbank's analyses are free of arbitrary decisions, we hope to have demonstrated that these labeled metrics enable linguistically informed error analyses, and hence allow us to at least in part address the question of where and why the performance of these approaches might plateau. We focused our analysis on English for simplicity, but many of the same types of problems exist in other languages and can be easily identified as stemming from the same lack of supervision. For example, in Japanese we would expect problems with post-positions, in German with verb clusters, in Chinese with measure words, or in Arabic with morphology and variable word order.
We believe that one way to overcome the issues we have identified is to incorporate a semantic signal. Lexical semantics, if sparsity can be avoided, might suffice; otherwise learning with grounding or an extrinsic task could be used to bias the choice of predicates, their arity and in turn the function words that connect them. Alternatively, a simpler solution might be to follow the lead of Boonkwan and Steedman (2011) or Garrette et al. (2015) where gold categories are assigned by a linguist or treebank to tags and words. It is possible that more limited syntactic supervision might be sufficient if focused on the semantically ambiguous cases we have isolated.
More generally, we hope to initiate a conversation about grammar induction which includes a discussion of how these non-trivial constructions can be discovered, learned, and modeled. Relatedly, in future extensions to semi-supervised or projection based approaches, these types of constructions are probably the most useful to get right despite comprising the tail, as analyses without them may not be semantically appropriate. In summary, we hope to begin to pull back the veil on the types of information that a truly unsupervised system, if one should ever exist, would need to learn, and we pose a challenge to the community to find ways that a learner might discover this knowledge without hand-engineering it.