A Supertag-Context Model for Weakly-Supervised CCG Parser Learning

Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that specify the syntactic configurations in which they may occur. We present a novel parsing model with the capacity to capture the associative adjacent-category relationships intrinsic to CCG by parameterizing the relationships between each constituent label and the preterminal categories directly to its left and right, biasing the model toward constituent categories that can combine with their contexts. This builds on the intuitions of Klein and Manning’s (2002) “constituent-context” model, which demonstrated the value of modeling context, but has the advantage of being able to exploit the properties of CCG. Our experiments show that our model outperforms a baseline in which this context information is not captured.


Introduction
Learning parsers from incomplete or indirect supervision is an important component of moving NLP research toward new domains and languages. But with less information, it becomes necessary to devise ways of making better use of the information that is available. In general, this means constructing inductive biases that take advantage of unannotated data to train probabilistic models.
One important example is the constituent-context model (CCM) of Klein and Manning (2002), which was specifically designed to capture the linguistic observation made by Radford (1988) that there are regularities to the contexts in which constituents appear. This phenomenon, known as substitutability, says that phrases of the same type appear in similar contexts. For example, the part-of-speech (POS) sequence ADJ NOUN frequently occurs between the tags DET and VERB. This DET-VERB context also frequently applies to the single-word sequence NOUN and to ADJ ADJ NOUN. From this, we might deduce that DET-VERB is a likely context for a noun phrase. CCM is able to learn which POS contexts are likely, and does so via a probabilistic generative model, providing a statistical, data-driven take on substitutability. However, since there is nothing intrinsic about the POS pair DET-VERB that indicates a priori that it is a likely constituent context, this fact must be inferred entirely from the data.
Baldridge (2008) observed that unlike opaque, atomic POS labels, the rich structures of Combinatory Categorial Grammar (CCG) (Steedman, 2000; Steedman and Baldridge, 2011) categories reflect universal grammatical properties. CCG is a lexicalized grammar formalism in which every constituent in a sentence is associated with a structured category that specifies its syntactic relationship to other constituents. For example, a category might encode that "this constituent can combine with a noun phrase to the right (an object) and then a noun phrase to the left (a subject) to produce a sentence" instead of simply VERB. CCG has proven useful as a framework for grammar induction due to its ability to incorporate linguistic knowledge to guide parser learning by, for example, specifying rules in lexical-expansion algorithms (Bisk and Hockenmaier, 2012; 2013) or encoding that information as priors within a Bayesian framework (Garrette et al., 2015).
Baldridge observed that, cross-linguistically, grammars prefer simpler syntactic structures when possible, and that due to the natural correspondence of categories and syntactic structure, biasing toward simpler categories encourages simpler structures. In previous work, we were able to incorporate this preference into a Bayesian parsing model, biasing PCFG productions toward simpler categories by encoding a notion of category simplicity into a prior (Garrette et al., 2015). Baldridge further notes that due to the natural associativity of CCG, adjacent categories tend to be combinable. We previously showed that incorporating this intuition into a Bayesian prior can help train a CCG supertagger (Garrette et al., 2014).
In this paper, we present a novel parsing model that is designed specifically for the capacity to capture both of these universal, intrinsic properties of CCG. We do so by extending our previous, PCFG-based parsing model to include parameters that govern the relationship between constituent categories and the preterminal categories (also known as supertags) to the left and right. The advantage of modeling context within a CCG framework is that while CCM must learn which contexts are likely purely from the data, the CCG categories give us obvious a priori information about whether a context is likely for a given constituent based on whether the categories are combinable. Biasing our model towards both simple categories and connecting contexts encourages learning structures with simpler syntax and that have a better global "fit".
The Bayesian framework is well-matched to our problem since our inductive biases (those derived from universal grammar principles, weak supervision, and estimations based on unannotated data) can be encoded as priors, and we can use Markov chain Monte Carlo (MCMC) inference procedures to automatically blend these biases with unannotated text that reflects the way language is actually used "in the wild". Thus, we learn context information based on statistics in the data like CCM, but have the advantage of additional, a priori biases. It is important to note that the Bayesian setup allows us to use these universal biases as soft constraints: they guide the learner toward more appropriate grammars, but may be overridden when there is compelling contradictory evidence in the data.
Methodologically, this work serves as an example of how linguistic-theoretical commitments can be used to benefit data-driven methods, not only through the construction of a model family from a grammar, as done in our previous work, but also when exploiting statistical associations about which the theory is silent. While there has been much work in computational modeling of the interaction between universal grammar and observable data in the context of studying child language acquisition (e.g., Villavicencio, 2002; Goldwater, 2007), we are interested in applying these principles to the design of models and learning procedures that result in better parsing tools. Given our desire to train NLP models in low-supervision scenarios, the possibility of constructing inductive biases out of universal properties of language is enticing: if we can do this well, then it only needs to be done once, and can be applied to any language or domain without adaptation.
In this paper, we seek to learn from only raw data and an incomplete dictionary mapping some words to sets of potential supertags. In order to estimate the parameters of our model, we develop a blocked sampler based on that of Johnson et al. (2007) to sample parse trees for sentences in the raw training corpus according to their posterior probabilities. However, due to the very large sets of potential supertags used in a parse, computing inside charts is intractable, so we design a Metropolis-Hastings step that allows us to sample efficiently from the correct posterior. Our experiments show that the incorporation of supertag context parameters into the model improves learning, and that placing combinability-preferring priors on those parameters yields further gains in many scenarios.

Combinatory Categorial Grammar
In the CCG formalism, every constituent, including those at the lexical level, is associated with a structured CCG category that defines that constituent's relationships to the other constituents in the sentence. Categories are defined by a recursive structure, where a category is either atomic (possibly with features), or a function from one category to another, as indicated by a slash operator:
$$C \rightarrow \{s,\ s_{dcl},\ s_{adj},\ s_{b},\ np,\ n,\ n_{num},\ pp,\ \ldots\}$$
$$C \rightarrow \{(C/C),\ (C\backslash C)\}$$
Categories of adjacent constituents can be combined using one of a set of combination rules to form categories of higher-level constituents, as seen in Figure 1. The direction of the slash operator gives the behavior of the function. A category (s\np)/pp might describe an intransitive verb with a prepositional phrase complement; it combines on the right (/) with a constituent with category pp, and then on the left (\) with a noun phrase (np) that serves as its subject. We follow Lewis and Steedman (2014) in allowing a small set of generic, linguistically-plausible unary and binary grammar rules. We further add rules for combining with punctuation to the left and right and allow for the merge rule X → X X of Clark and Curran (2007).
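The recursive category definition and the behavior of the slash operator can be sketched in code. This is a minimal illustration rather than the implementation used in this work: the class names `Atom` and `Func` and the two helper functions are hypothetical, features on atoms are ignored, and only forward and backward application are shown.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    """An atomic category such as s, np, n, or pp."""
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Func:
    """A function category: result/arg (forward) or result\\arg (backward)."""
    result: "Category"
    slash: str  # "/" or "\\"
    arg: "Category"

    def __str__(self) -> str:
        return f"({self.result}{self.slash}{self.arg})"


Category = Union[Atom, Func]


def forward_apply(left: Category, right: Category) -> Optional[Category]:
    """X/Y  Y  =>  X; returns None when the rule does not apply."""
    if isinstance(left, Func) and left.slash == "/" and left.arg == right:
        return left.result
    return None


def backward_apply(left: Category, right: Category) -> Optional[Category]:
    """Y  X\\Y  =>  X; returns None when the rule does not apply."""
    if isinstance(right, Func) and right.slash == "\\" and right.arg == left:
        return right.result
    return None


# The example category from the text: (s\np)/pp
s, np_, pp = Atom("s"), Atom("np"), Atom("pp")
verb = Func(Func(s, "\\", np_), "/", pp)
verb_phrase = forward_apply(verb, pp)          # consumes the pp on the right
sentence = backward_apply(np_, verb_phrase)    # consumes the subject np on the left
print(verb, verb_phrase, sentence)
```

The frozen dataclasses give structural equality for free, which is what the argument-matching test in each rule relies on.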

Generative Model
In this section, we present our novel supertag-context model (SCM) that augments a standard PCFG with parameters governing the supertags to the left and right of each constituent.
The CCG formalism is said to be naturally associative since a constituent label is often able to combine on either the left or the right. As a motivating example, consider the sentence "The lazy dog sleeps", as shown in Figure 2. The word lazy, with category n/n, can either combine with dog (n) via the Forward Application rule (>), or with The (np/n) via the Forward Composition (>B) rule. Baldridge (2008) showed that this tendency for adjacent supertags to be combinable can be used to bias a sequence model in order to learn better CCG supertaggers. However, we can see that if the supertags of adjacent words lazy (n/n) and dog (n) combine, then they will produce the category n, which describes the entire constituent span "lazy dog". Since we have produced a new category that subsumes that entire span, a valid parse must next combine that n with one of the remaining supertags to the left or right, producing either (The·(lazy·dog))·sleeps or The·((lazy·dog)·sleeps). Because we know that one (or both) of these combinations must be valid, we will similarly want a strong prior on the connectivity between lazy·dog and its supertag context: The↔(lazy·dog)↔sleeps.
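The associativity in this example can be checked mechanically. The sketch below is illustrative only: categories are encoded as plain strings (atoms) or `(result, slash, argument)` tuples, an encoding chosen here for brevity, and only the two rules named in the text are implemented.

```python
def forward_application(x, y):
    """X/Y  Y  =>  X  (rule >)."""
    if isinstance(x, tuple) and x[1] == "/" and x[2] == y:
        return x[0]
    return None


def forward_composition(x, y):
    """X/Y  Y/Z  =>  X/Z  (rule >B)."""
    if (isinstance(x, tuple) and isinstance(y, tuple)
            and x[1] == "/" and y[1] == "/" and x[2] == y[0]):
        return (x[0], "/", y[2])
    return None


the = ("np", "/", "n")   # np/n
lazy = ("n", "/", "n")   # n/n
dog = "n"

# The·(lazy·dog): lazy combines with dog first, via application.
right_branching = forward_application(the, forward_application(lazy, dog))

# (The·lazy)·dog: lazy combines with The first, via composition.
left_branching = forward_application(forward_composition(the, lazy), dog)

print(right_branching, left_branching)
```

Both bracketings yield the same category for the span "The lazy dog", which is the sense in which the supertag n/n is combinable with its neighbors on either side.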
Assuming T is the full set of known categories, the generative process for our model is as follows.
Parameters: The process begins by sampling the parameters from Dirichlet distributions: a distribution $\theta^{\text{ROOT}}$ over root categories, a conditional distribution $\theta^{\text{BIN}}$ [...]
[...] sample a production type z corresponding to either a (B) binary, (U) unary, or (T) terminal production. Depending on z, we then sample either a binary production u, v and recurse, a unary production u and recurse, or a terminal word w and end that branch. A tree is complete when all branches end in terminal words. See Figure 3 for a graphical depiction of the generative behavior of the process. Finally, since it is possible to generate a supertag context category that does not match the actual category generated by the neighboring constituent, we must allow our process to reject such invalid trees and re-attempt to sample.
Figure 2: Higher-level category n subsumes the categories of its constituents. Thus, n should have a strong prior on combinability with its adjacent supertags np/n and s\np.
Figure 3: The generative process starting with non-terminal $A_{ij}$, where $t_x$ is the supertag for $w_x$, the word at position x, and "A → B C" is a valid production in the grammar. We can see that non-terminal $A_{ij}$ generates non-terminals $B_{ik}$ and $C_{kj}$ (solid arrows) as well as generating left context $t_{i-1}$ and right context $t_j$ (dashed arrows); likewise for $B_{ik}$ and $C_{kj}$. The triangle under a non-terminal indicates the complete subtree rooted by the node.
Like CCM, this model is deficient since the same supertags are generated multiple times, and parses with conflicting supertags are not valid. Since we are not generating from the model, this does not introduce difficulties (Klein and Manning, 2002).
One additional complication that must be addressed is that left-frontier non-terminal categories (those whose subtree span includes the first word of the sentence) do not have a left-side supertag to use as context. For these cases, we use the special sentence-start symbol S to serve as context. Similarly, we use the end symbol E for the right-side context of the right-frontier.
We next discuss how the prior distributions are constructed to encode desirable biases, using universal CCG properties.

Non-terminal production prior means
For the root, binary, and unary parameters, we want to choose prior means that encode our bias toward cross-linguistically-plausible categories. To formalize the notion of what it means for a category to be more "plausible", we extend the category generator of our previous work, which we will call $P_{\text{CAT}}$. We can define $P_{\text{CAT}}$ using a probabilistic grammar (Garrette et al., 2014). The grammar may first generate a start or end category (S, E) with probability $p_{se}$ or a special token-deletion category (D; explained in §5) with probability $p_{del}$, or a standard CCG category C: [...]
For each sentence s, there will be one S and one E, so we set $p_{se} = 1/(25 + 2)$, since the average sentence length in the corpora is roughly 25. To discourage the model from deleting tokens (only applies during testing), we set $p_{del} = 10^{-100}$.
For $P_C$, the distribution over standard categories, we use a recursive definition based on the structure of a CCG category. If $\bar{p} = 1 - p$, then: [...]¹
The category grammar captures important aspects of what makes a category more or less likely: (1) simplicity is preferred, with a higher $p_{term}$ meaning a stronger emphasis on simplicity;² (2) atomic types may occur at different rates, as given by $p_{atom}$; (3) modifier categories (A/A or A\A) are more likely than similar-complexity non-modifiers (such as an adverb that modifies a verb); and (4) operators may occur at different rates, as given by $p_{fwd}$.
We can use $P_{\text{CAT}}$ to define priors on our production parameters that bias our model toward rules that result in a priori more likely categories: [...]³
For simplicity, we assume the production-type mixture prior to be uniform: $\lambda^0 = \langle 1/3,\ 1/3,\ 1/3 \rangle$.
¹ Note that this version has also updated the probability definitions for modifiers to be sums, incorporating the fact that any A/A is also an A/B (likewise for A\A). This ensures that our grammar defines a valid probability distribution.
² The probability distribution over categories is guaranteed to be proper so long as $p_{term} > 1/2$ since the probability of the depth of a tree will decrease geometrically (Chi, 1999).
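To make the four properties of the category grammar concrete, here is a hedged sketch of one recursive definition that is consistent with them. It is a reading reconstructed from the prose, not a definition taken from this text: the function name `p_cat`, the tuple encoding of categories, and the atom rates in `atoms` are all assumptions, while the default values of `p_term`, `p_mod`, and `p_fwd` are the ones reported in the experiments.

```python
def p_cat(cat, p_atom, p_term=0.7, p_mod=0.2, p_fwd=0.5):
    """Probability of a standard category under an assumed recursive grammar.

    cat is an atom name (str) or a (result, slash, argument) tuple.
    p_atom maps atom names to rates (hypothetical values in this sketch).
    """
    if isinstance(cat, str):
        # Stop recursing and emit an atom.
        return p_term * p_atom[cat]
    result, slash, arg = cat
    p_slash = p_fwd if slash == "/" else 1.0 - p_fwd
    p_res = p_cat(result, p_atom, p_term, p_mod, p_fwd)
    p_arg = p_cat(arg, p_atom, p_term, p_mod, p_fwd)
    # Non-modifier route: generate result and argument independently.
    prob = (1.0 - p_term) * p_slash * (1.0 - p_mod) * p_res * p_arg
    if result == arg:
        # Modifier route (A/A or A\A): generate A once and copy it.
        # The two routes are summed because any A/A is also an A/B.
        prob += (1.0 - p_term) * p_slash * p_mod * p_res
    return prob


atoms = {"s": 0.5, "np": 0.3, "n": 0.2}  # hypothetical atom rates
for category in ["n", ("n", "/", "n"), ("s", "/", "n"), ("np", "/", "n")]:
    print(category, p_cat(category, atoms))
```

Under these assumed values the modifier n/n scores higher than the equally complex s/n even though s is the more frequent atom, and every function category scores lower than the atom n, which matches properties (1) and (3) above.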

Terminal production prior means
We employ the same procedure as our previous work for setting the terminal production prior distributions $\theta^{\text{TERM-0}}_t(w)$ by estimating word-given-category relationships from the weak supervision: the tag dictionary and raw corpus (Garrette and Baldridge, 2012; Garrette et al., 2015).⁴ This procedure attempts to automatically estimate the frequency of each word/tag combination by dividing the number of raw-corpus occurrences of each word in the dictionary evenly across all of its associated tags. These counts are then combined with estimates of the "openness" of each tag in order to assess its likelihood of appearing with new words.
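The first step of this procedure, splitting each word's raw-corpus count evenly across its dictionary tags, can be sketched as follows. The function name and data layout are illustrative assumptions, and the second step (combining the counts with tag "openness" estimates) is deliberately left out because its details are given only in the previous work.

```python
from collections import Counter, defaultdict


def estimate_word_tag_counts(raw_tokens, tag_dict):
    """Split each dictionary word's raw count evenly across its tags.

    raw_tokens: iterable of word tokens from the raw corpus.
    tag_dict:   mapping from word to the set of supertags it may take.
    Returns a dict mapping (tag, word) to an estimated pseudo-count.
    """
    word_freq = Counter(raw_tokens)
    counts = defaultdict(float)
    for word, tags in tag_dict.items():
        if not tags:
            continue  # no tags listed: nothing to distribute
        share = word_freq[word] / len(tags)
        for tag in tags:
            counts[tag, word] += share
    return dict(counts)


raw = ["the", "dog", "the", "runs"]
dictionary = {"the": {"np/n"}, "dog": {"n", "np"}}
print(estimate_word_tag_counts(raw, dictionary))
```

Words that occur in the raw corpus but not in the dictionary (here "runs") receive no counts at this stage.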

Context parameter prior means
In order to encourage our model to choose trees in which the constituent labels "fit" into their supertag contexts, we want to bias our context parameters toward context categories that are combinable with the constituent label.
The right-side context of a non-terminal category (the probability of generating a category to the right of the current constituent's category) corresponds directly to the category transitions used for the HMM supertagger of Garrette et al. (2014). Thus, the right-side context prior mean $\theta^{\text{RCTX-0}}_t$ can be biased in exactly the same way as the HMM supertagger's transitions: toward context supertags that connect to the constituent label.
To encode a notion of combinability, we follow Baldridge's (2008) definition. Briefly, let κ(t, u) ∈ {0, 1} be an indicator of whether t combines with u (in that order). For any binary rule that can combine t to u, κ(t, u) = 1. To ensure that our prior captures the natural associativity of CCG, we define combinability in this context to include composition rules as well as application rules. If atoms have features associated, then the atoms are allowed to unify if the features match, or if at least one of them does not have a feature. In defining κ, it is also important to ignore possible arguments on the wrong side of the combination since they can be consumed without affecting the connection between the two. To achieve this for κ(t, u), it is assumed that it is possible to consume all preceding arguments of t and all following arguments of u. So κ(np, (s\np)/np) = 1. This helps to ensure the associativity discussed earlier. For "combining" with the start or end of a sentence, we define κ(S, u) = 1 when u seeks no left-side arguments (since there are no tags to the left with which to combine) and κ(t, E) = 1 when t seeks no right-side arguments. So κ(S, np/n) = 1, but κ(S, s\np) = 0. Finally, due to the frequent use of the unary rule that allows n to be rewritten as np, the atom np is allowed to unify with n if n is the argument. So κ(n, s\np) = 1, but κ(np/n, np) = 0.
³ For our experiments, we normalize $P_{\text{CAT}}$ by dividing by $\sum_{c \in T} P_{\text{CAT}}(c)$. This allows for experiments contrasting with a uniform prior ($1/|T|$) without adjusting α values.
⁴ We refer the reader to the previous work (Garrette et al., 2015) for a fuller discussion and implementation details.
The prior mean of producing a right-context supertag r from a constituent category t, $P^{\text{right}}(r \mid t)$, is defined so that combinable pairs are given higher probability than non-combinable pairs. We further experimented with a prior that biases toward both combinability and category likelihood, replacing the uniform treatment of categories with our prior over categories, yielding $P^{\text{right}}_{\text{CAT}}(r \mid t)$. If T is the full set of known CCG categories: [...]
Distributions $P^{\text{left}}(\ell \mid t)$ and $P^{\text{left}}_{\text{CAT}}(\ell \mid t)$ are defined in the same way, but with the combinability direction flipped: κ(ℓ, t), since the left context supertag precedes the constituent category.
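One simple way to realize such a prior is to multiply the base probability of each combinable context by a constant boost and renormalize. The sketch below assumes exactly that multiplicative form, with the boost playing the role of the σ reported in the experiments; the exact functional form is an assumption, as are the function name, the `kappa` callable interface, and the toy combinability relation.

```python
def right_context_prior(t, categories, kappa, sigma, base=None):
    """Prior mean over right-context supertags r for constituent category t.

    kappa(t, r) -> bool says whether t combines with r (in that order).
    sigma > 1 is the boost given to combinable pairs (assumed form).
    base maps r to a base probability such as a category prior; when it is
    None, categories are treated uniformly.
    """
    weights = {}
    for r in categories:
        b = 1.0 / len(categories) if base is None else base[r]
        weights[r] = (sigma if kappa(t, r) else 1.0) * b
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


# Toy combinability relation (hypothetical): n combines with s\np on its right.
combinable = {("n", "s\\np")}
toy_kappa = lambda t, r: (t, r) in combinable

cats = ["np/n", "n", "s\\np"]
print(right_context_prior("n", cats, toy_kappa, sigma=3.0))
```

The left-context version would call `kappa(l, t)` instead, since the left context supertag precedes the constituent category. Passing a category prior as `base` gives the variant that also prefers a priori likely categories.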

Posterior Inference
We wish to infer the distribution over CCG parses, given the model we just described and a corpus of sentences. Since there is no way to analytically compute these modes, we resort to Gibbs sampling to find an approximate solution. Our strategy is based on the approach presented by Johnson et al. (2007). At a high level, we alternate between resampling model parameters ($\theta^{\text{ROOT}}$, $\theta^{\text{BIN}}$, $\theta^{\text{UN}}$, $\theta^{\text{TERM}}$, $\lambda$, $\theta^{\text{LCTX}}$, $\theta^{\text{RCTX}}$) given the current set of parse trees and resampling those trees given the current model parameters and observed word sequences. To efficiently sample new model parameters, we exploit Dirichlet-multinomial conjugacy. By repeating these alternating steps and accumulating the productions, we obtain an approximation of the required posterior quantities.
Our inference procedure takes as input the distribution prior means, along with the raw corpus and tag dictionary. During sampling, we restrict the tag choices for a word w to categories allowed by the tag dictionary. Since real-world learning scenarios will always lack complete knowledge of the lexicon, we, too, want to allow for unknown words; for these, we assume the word may take any known supertag. We refer to the sequence of word tokens as w and a non-terminal category covering the span i through j − 1 as $y_{ij}$.
While it is technically possible to sample directly from our context-sensitive model, the high number of potential supertags available for each context means that computing the inside chart for this model is intractable for most sentences. In order to overcome this limitation, we employ an accept/reject Metropolis-Hastings (MH) step. The basic idea is that we sample trees according to a simpler proposal distribution Q that approximates the full distribution and for which direct sampling is tractable, and then choose to accept or reject those trees based on the true distribution P .
For our model, there is a straightforward and intuitive choice for the proposal distribution: the PCFG model without our context parameters ($\theta^{\text{ROOT}}$, $\theta^{\text{BIN}}$, $\theta^{\text{UN}}$, $\theta^{\text{TERM}}$, $\lambda$), which is known to have an efficient sampling method. Our acceptance step is therefore based on the remaining parameters: the context ($\theta^{\text{LCTX}}$, $\theta^{\text{RCTX}}$).
To sample from our proposal distribution, we use a blocked Gibbs sampler based on the one proposed by Goodman (1998) and used by Johnson et al. (2007) that samples entire parse trees. For a sentence w, the strategy is to use the Inside algorithm (Lari and Young, 1990) to inductively compute, for each potential non-terminal position spanning words $w_i$ through $w_{j-1}$ and category t, going "up" the tree, the probability of generating $w_i, \ldots, w_{j-1}$ via any arrangement of productions that is rooted by $y_{ij} = t$.
We then pass "downward" through the chart, sampling productions until we reach a terminal word on all branches.
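The upward and downward passes can be illustrated on a toy grammar. The sketch below is a simplification under stated assumptions: the rules and probabilities in `BINARY` and `LEXICAL` are invented for the example, unary productions are omitted, and the production-type mixture is folded into the rule probabilities rather than kept as a separate parameter.

```python
import random
from collections import defaultdict

# Toy grammar (hypothetical rules and probabilities).
BINARY = {  # parent -> list of ((left child, right child), probability)
    "s": [(("np", "s\\np"), 1.0)],
    "np": [(("np/n", "n"), 1.0)],
    "n": [(("n/n", "n"), 0.4)],
}
LEXICAL = {  # (category, word) -> probability
    ("np/n", "the"): 1.0,
    ("n/n", "lazy"): 1.0,
    ("n", "dog"): 0.6,
    ("s\\np", "sleeps"): 1.0,
}


def inside_chart(words):
    """Upward pass: chart[i, j][t] = P(words[i:j] | span rooted in t)."""
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (cat, word), p in LEXICAL.items():
            if word == w:
                chart[i, i + 1][cat] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for parent, rules in BINARY.items():
                total = 0.0
                for (left, right), p in rules:
                    for k in range(i + 1, j):  # every split point
                        total += (p * chart[i, k].get(left, 0.0)
                                  * chart[k, j].get(right, 0.0))
                if total > 0.0:
                    chart[i, j][parent] = total
    return chart


def sample_tree(chart, words, cat, i, j, rng):
    """Downward pass: sample productions in proportion to inside mass."""
    if j == i + 1:
        return (cat, words[i])
    options, weights = [], []
    for (left, right), p in BINARY.get(cat, []):
        for k in range(i + 1, j):
            w = p * chart[i, k].get(left, 0.0) * chart[k, j].get(right, 0.0)
            if w > 0.0:
                options.append((k, left, right))
                weights.append(w)
    k, left, right = rng.choices(options, weights=weights)[0]
    return (cat,
            sample_tree(chart, words, left, i, k, rng),
            sample_tree(chart, words, right, k, j, rng))


sentence = ["the", "lazy", "dog", "sleeps"]
chart = inside_chart(sentence)
print(sample_tree(chart, sentence, "s", 0, len(sentence), random.Random(0)))
```

Because each choice is drawn in proportion to the inside probability of the subtrees it would lead to, a complete tree drawn this way is a sample from the toy grammar's distribution over parses of the sentence. `sample_tree` assumes the requested root category has nonzero inside probability for the span.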
In this downward pass, each sampled outcome x is either a split point k and pair of categories $y_{ik}$, $y_{kj}$ resulting from a binary rewrite rule (possible only when $j > i + 1$), a single category $y_{ij}$ resulting from a unary rule, or a word w resulting from a terminal rule. The MH procedure requires an acceptance distribution A that is used to accept or reject a tree sampled from the proposal Q. The probability of accepting new tree $y'$ given the previous tree $y$ is:
$$A(y' \mid y) = \min\left(1,\ \frac{P(y')}{P(y)} \cdot \frac{Q(y)}{Q(y')}\right)$$
Since Q is defined as a subset of P's parameters, it is the case that:
$$P(y) = Q(y) \cdot P(y \mid \theta^{\text{LCTX}}, \theta^{\text{RCTX}})$$
After substituting this for each P in A, all of the Q factors cancel, yielding the acceptance distribution defined purely in terms of context parameters:
$$A(y' \mid y) = \min\left(1,\ \frac{P(y' \mid \theta^{\text{LCTX}}, \theta^{\text{RCTX}})}{P(y \mid \theta^{\text{LCTX}}, \theta^{\text{RCTX}})}\right)$$
For completeness, we note that the probability of a tree y given only the context parameters is: [...]
Before we begin sampling, we initialize each distribution to its prior mean ($\theta^{\text{ROOT}} = \theta^{\text{ROOT-0}}$, $\theta^{\text{BIN}}_t = \theta^{\text{BIN-0}}$, etc.). Since MH requires an initial set of trees to begin sampling, we parse the raw corpus with probabilistic CKY using these initial parameters (excluding the context parameters) to guess an initial tree for each raw sentence.
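The acceptance step that depends only on context parameters can be sketched as below. This is an illustration under assumptions rather than the implementation used here: a tree is reduced to a list of `(category, i, j)` constituents over a supertag sequence, the context score is taken to be the product over constituents of the left-context and right-context probabilities, the boundary markers `"<S>"` and `"<E>"` are placeholder names, and log-space arithmetic is a design choice made for numerical stability.

```python
import math
import random


def context_log_prob(constituents, supertags, lctx, rctx):
    """Log-probability of a tree's contexts.

    constituents: list of (category, i, j), each covering words i..j-1.
    supertags:    the preterminal category of each word in the sentence.
    lctx, rctx:   nested dicts, e.g. lctx[category][left_supertag] = prob.
    """
    padded = ["<S>"] + list(supertags) + ["<E>"]
    total = 0.0
    for cat, i, j in constituents:
        left = padded[i]        # supertag of word i-1, or <S> when i == 0
        right = padded[j + 1]   # supertag of word j, or <E> at sentence end
        total += math.log(lctx[cat][left]) + math.log(rctx[cat][right])
    return total


def mh_accept(new_ctx_logp, old_ctx_logp, rng):
    """Accept the proposed tree with probability min(1, ratio of context scores)."""
    log_ratio = new_ctx_logp - old_ctx_logp
    if log_ratio >= 0.0:
        return True
    return rng.random() < math.exp(log_ratio)


# Toy usage: the constituent "lazy dog" (category n, span 1..3) in its context.
tags = ["np/n", "n/n", "n", "s\\np"]
lctx = {"n": {"np/n": 0.5}}
rctx = {"n": {"s\\np": 0.25}}
score = context_log_prob([("n", 1, 3)], tags, lctx, rctx)
print(score, mh_accept(score, score - 1.0, random.Random(0)))
```

When a proposed tree is rejected, the previous tree for that sentence is kept, so the caller would simply retain the old tree whenever `mh_accept` returns `False`.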
The sampler alternates sampling parse trees for the entire corpus of sentences using the above procedure with resampling the model parameters. Resampling the parameters requires empirical counts of each production. These counts are taken from the trees resulting from the previous round of sampling: new trees that have been "accepted" by the MH step, as well as existing trees for sentences in which the newly-sampled tree was rejected.
It is important to note that this method of resampling allows the draws to incorporate both the data, in the form of counts, and the prior mean, which includes all of our carefully-constructed biases derived from both the intrinsic, universal CCG properties as well as the information we induced from the raw corpus and tag dictionary.
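The way a single draw blends counts with a prior mean can be sketched with the standard Dirichlet-multinomial update. The sketch assumes that the posterior has pseudo-counts `alpha * prior_mean[x] + counts[x]`, which is the usual conjugate form; the function name and the example numbers are illustrative, and the Dirichlet draw is built from normalized Gamma draws so that only the standard library is needed.

```python
import random


def resample_dirichlet(alpha, prior_mean, counts, rng):
    """Draw a distribution from Dirichlet(alpha * prior_mean + counts).

    alpha:      concentration hyperparameter (larger means trust the prior more).
    prior_mean: dict mapping each outcome to its prior probability (all > 0).
    counts:     dict of empirical counts from the current set of trees.
    """
    draws = {}
    for outcome in sorted(prior_mean):
        pseudo_count = alpha * prior_mean[outcome] + counts.get(outcome, 0)
        draws[outcome] = rng.gammavariate(pseudo_count, 1.0)
    total = sum(draws.values())
    return {outcome: g / total for outcome, g in draws.items()}


rng = random.Random(0)
prior = {"np/n": 0.5, "n": 0.5}
theta = resample_dirichlet(100.0, prior, {"np/n": 1000}, rng)
print(theta)
```

With few observed counts the draw stays close to the prior mean, and with many counts the data dominate, which is the soft-constraint behavior described above.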
After all sampling iterations have completed, the final model is estimated by pooling the trees resulting from each sampling iteration, including trees accepted by the MH steps as well as the duplicated trees retained due to rejections. We use this pool of trees to compute model parameters using the same procedure as we used directly above to sample parameters, except that instead of drawing a Dirichlet sample based on the vector of counts, we simply normalize those counts. However, since we require a final model that can parse sentences efficiently, we drop the context parameters, making the model a standard PCFG, which allows us to use the probabilistic CKY algorithm.

Experiments
In our evaluation we compared our supertag-context approach to (our reimplementation of) the best-performing model of our previous work (Garrette et al., 2015), which SCM extends. We evaluated on the English CCGBank (Hockenmaier and Steedman, 2007), which is a transformation of the Penn Treebank (Marcus et al., 1993); the CTB-CCG (Tse and Curran, 2010) transformation of the Penn Chinese Treebank (Xue et al., 2005); and the CCG-TUT corpus (Bos et al., 2009), built from the TUT corpus of Italian text (Bosco et al., 2000).
Each corpus was divided into four distinct data sets: a set from which we extract the tag dictionaries, a set of raw (unannotated) sentences, a development set, and a test set. We use the same splits as Garrette et al. (2014). Since these treebanks use special representations for conjunctions, we chose to rewrite the trees to use conjunction categories of the form (X\X)/X rather than introducing special conjunction rules. In order to increase the amount of raw data available to the sampler, we supplemented the English data with raw, unannotated newswire sentences from the NYT Gigaword 5 corpus (Parker et al., 2011) and supplemented Italian with the out-of-domain WaCky corpus (Baroni et al., 1999). For English and Italian, this allowed us to use 100k raw tokens for training (Chinese uses 62k). For Chinese and Italian, for training efficiency, we used only raw sentences that were 50 words or fewer (note that we did not drop tag dictionary set or test set sentences).
The English development set was used to tune hyperparameters using grid search, and the same hyperparameters were then used for all three languages. For the category grammar, we used $p_{punc} = 0.1$, $p_{term} = 0.7$, $p_{mod} = 0.2$, $p_{fwd} = 0.5$. For the priors, we use $\alpha^{\text{ROOT}} = 1$, $\alpha^{\text{BIN}} = 100$, $\alpha^{\text{UN}} = 100$, $\alpha^{\text{TERM}} = 10^4$, $\alpha^{\lambda} = 3$, $\alpha^{\text{LCTX}} = \alpha^{\text{RCTX}} = 10^3$. For the context prior, we used $\sigma = 10^5$. We ran our sampler for 50 burn-in and 50 sampling iterations.
Table 1: [...] (1) a no-context model baseline, Garrette et al. (2015) directly; (2) our supertag-context model, with uniform priors on contexts; (3) supertag-context model with priors that prefer combinability; (4) supertag-context model with priors that prefer combinability and simpler categories. Results are shown for six different levels of supervision, as determined by the size of the corpus used to extract a tag dictionary.
CCG parsers are typically evaluated on the dependencies they produce instead of their CCG derivations directly since there can be many different CCG parse trees that all represent the same dependency relationships (spurious ambiguity), and CCG-to-dependency conversion can collapse those differences. To convert a CCG tree into a dependency tree, we follow Lewis and Steedman (2014). We traverse the parse tree, dictating at every branching node which words will be the dependents of which. For binary branching nodes of forward rules, the right side (the argument side) is the dependent, unless the left side is a modifier (X/X) of the right, in which case the left is the dependent. The opposite is true for backward rules. For punctuation rules, the punctuation is always the dependent. For merge rules, the right side is always made the parent. The results presented in this paper are dependency accuracy scores: the proportion of words that were assigned the correct parent (or "root" for the root of a tree).
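The head-selection rules for dependency conversion can be written as a small decision function. The sketch is illustrative: the rule labels (`"forward"`, `"backward"`, `"punct_left"`, `"punct_right"`, `"merge"`) and the tuple encoding of categories are assumptions of this example, and the modifier test only checks the X/X or X\X shape of the functor.

```python
def is_modifier(cat, slash):
    """True if cat has the shape X/X (or X\\X): its result equals its argument."""
    return isinstance(cat, tuple) and cat[1] == slash and cat[0] == cat[2]


def head_side(rule, left_cat, right_cat):
    """Return which child ('left' or 'right') becomes the parent at a binary node."""
    if rule == "forward":
        # The right (argument) side is the dependent, unless the left is a modifier.
        return "right" if is_modifier(left_cat, "/") else "left"
    if rule == "backward":
        # Mirror image: the left (argument) side is the dependent,
        # unless the right is a modifier.
        return "left" if is_modifier(right_cat, "\\") else "right"
    if rule == "punct_left":    # punctuation on the left is the dependent
        return "right"
    if rule == "punct_right":   # punctuation on the right is the dependent
        return "left"
    if rule == "merge":         # the right side is always made the parent
        return "right"
    raise ValueError(f"unknown rule label: {rule}")


vp = ("s", "\\", "np")
print(head_side("forward", ("np", "/", "n"), "n"))   # determiner + noun
print(head_side("forward", ("n", "/", "n"), "n"))    # adjective + noun
print(head_side("backward", vp, (vp, "\\", vp)))     # verb phrase + adverb
```

Applying this function at every branching node, and attaching the head word of the dependent child to the head word of the parent child, would yield the dependency tree that is scored.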
When evaluating on test set sentences, if the model is unable to find a parse given the constraints of the tag dictionary, then we would have to take a score of zero for that sentence: every dependency would be "wrong". Thus, it is important that we make a best effort to find a parse. To accomplish this, we implemented a parsing backoff strategy. The parser first tries to find a valid parse that has either $s_{dcl}$ or np at its root. If that fails, then it searches for a parse with any root. If no parse is found yet, then the parser attempts to strategically allow tokens to subsume a neighbor by making it a dependent (first with a restricted root set, then without). This is similar to the "deletion" strategy employed by Zettlemoyer and Collins (2007), but we do it directly in the grammar. We add unary rules of the form D → u for every potential supertag u in the tree. Then, at each node spanning exactly two tokens (but no higher in the tree), we allow rules t → D, v and t → v, D. Recall that in §3.1, we stated that D is given extremely low probability, meaning that the parser will avoid its use unless it is absolutely necessary. Additionally, since u will still remain as the preterminal, it will be the category examined as the context by adjacent constituents.
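The order of the backoff attempts can be summarized as a short control loop. Everything about the interface here is hypothetical: `parse_fn` stands in for a parser that accepts a root restriction and a flag enabling the deletion rules, and the root labels are written as plain strings.

```python
def parse_with_backoff(parse_fn, sentence, preferred_roots=("s_dcl", "np")):
    """Try progressively more permissive settings until a parse is found.

    parse_fn(sentence, roots, allow_deletion) is a hypothetical parser that
    returns a tree, or None on failure; roots=None means any root is allowed.
    """
    attempts = [
        {"roots": preferred_roots, "allow_deletion": False},
        {"roots": None, "allow_deletion": False},
        {"roots": preferred_roots, "allow_deletion": True},
        {"roots": None, "allow_deletion": True},
    ]
    for options in attempts:
        tree = parse_fn(sentence, **options)
        if tree is not None:
            return tree, options
    return None, None


# Stand-in parser that succeeds only with deletion enabled and any root allowed.
def fake_parser(sentence, roots, allow_deletion):
    return ("tree", tuple(sentence)) if allow_deletion and roots is None else None


print(parse_with_backoff(fake_parser, ["the", "dog"]))
```

Returning the options that succeeded alongside the tree makes it easy to report how often each backoff level was needed.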
For each language and level of supervision, we executed four experiments. The no-context baseline used (a reimplementation of) the best model from our previous work (Garrette et al., 2015): using only the non-context parameters ($\theta^{\text{ROOT}}$, $\theta^{\text{BIN}}$, $\theta^{\text{UN}}$, $\theta^{\text{TERM}}$, $\lambda$) along with the category prior $P_{\text{CAT}}$ to bias toward likely categories throughout the tree, and $\theta^{\text{TERM-0}}_t$ estimated from the tag dictionary and raw corpus. We then added the supertag-context parameters ($\theta^{\text{LCTX}}$, $\theta^{\text{RCTX}}$), but used uniform priors for those (still using $P_{\text{CAT}}$ for the rest). Then, we evaluated the supertag-context model using context parameter priors that bias toward categories that combine with their contexts: $P^{\text{left}}$ and $P^{\text{right}}$ (see §3.3). Finally, we evaluated the supertag-context model using context parameter priors that bias toward combinability and toward a priori more likely categories, based on the category grammar ($P^{\text{left}}_{\text{CAT}}$ and $P^{\text{right}}_{\text{CAT}}$).
Because we are interested in understanding how our models perform under varying amounts of supervision, we executed sequences of experiments in which we reduced the size of the corpus from which the tag dictionary is drawn, thus reducing the amount of information provided to the model. As this information is reduced, so is the size of the full inventory of known CCG categories that can be used as supertags. Additionally, a smaller tag dictionary means that there will be vastly more unknown words; since our model must assume that these words may take any supertag from the full set of known labels, the model must contend with a greatly increased level of ambiguity.
The results of our experiments are given in Table 1. We find that the incorporation of supertag-context parameters into a CCG model improves performance in every scenario we tested; we see gains of 2-5% across the board. Adding context parameters never hurts, and in most cases, using priors based on intrinsic, cross-lingual aspects of the CCG formalism to bias those parameters toward connectivity provides further gains. In particular, biasing the model toward trees in which constituent labels are combinable with their adjacent supertags frequently helps the model.
However, for English, we found that additionally biasing context priors toward simpler categories using $P^{\text{left}}_{\text{CAT}}$ / $P^{\text{right}}_{\text{CAT}}$ degraded performance. This is likely due to the fact that the priors on production parameters ($\theta^{\text{BIN}}$, $\theta^{\text{UN}}$) are already biasing the model toward likely categories, and that having the context parameters do the same ends up over-emphasizing the need for simple categories, preventing the model from choosing more complex categories when they are needed. On the other hand, this bias helps in Chinese and Italian.

Related Work
Klein and Manning's (2002) CCM is an unlabeled bracketing model that generates the span of part-of-speech tags that make up each constituent and the pair of tags surrounding each constituent span (as well as the spans and contexts of each non-constituent). They found that modeling constituent context aids in parser learning because it is able to capture the observation that the same contexts tend to appear repeatedly in a corpus, even with different constituents. While CCM is designed to learn which tag pairs make for likely contexts, without regard for the constituents themselves, our model attempts to learn the relationships between context categories and the types of the constituents, allowing us to take advantage of the natural a priori knowledge about which contexts fit with which constituent labels.
Other researchers have shown positive results for grammar induction by introducing relatively small amounts of linguistic knowledge. Naseem et al. (2010) induced dependency parsers by hand-constructing a small set of linguistically-universal dependency rules and using them as soft constraints during learning. These rules were useful for disambiguating between various structures in cases where the data alone suggests multiple valid analyses. Boonkwan and Steedman (2011) made use of language-specific linguistic knowledge collected from non-native linguists via a questionnaire that covered a variety of syntactic parameters. They were able to use this information to induce CCG parsers for multiple languages. Bisk and Hockenmaier (2012; 2013) induced CCG parsers by using a smaller number of linguistically-universal principles to propose syntactic categories for each word in a sentence, allowing EM to estimate the model parameters. This allowed them to induce the inventory of language-specific types from the training data, without prior language-specific knowledge.

Conclusion
Because of the structured nature of CCG categories and the logical framework in which they must assemble to form valid parse trees, the CCG formalism offers multiple opportunities to bias model learning based on universal, intrinsic properties of the grammar. In this paper we presented a novel parsing model with the capacity to capture the associative adjacent-category relationships intrinsic to CCG by parameterizing supertag contexts, the supertags appearing on either side of each constituent. In our Bayesian formulation, we place priors on those context parameters to bias the model toward trees in which constituent labels are combinable with their contexts, thus preferring trees that "fit" together better. Our experiments demonstrate that, across languages, this additional context helps in weak-supervision scenarios.