A Probabilistic Generative Grammar for Semantic Parsing

We present a generative model of natural language sentences and demonstrate its application to semantic parsing. In the generative process, a logical form is sampled from a prior, and, conditioned on this logical form, a grammar probabilistically generates the output sentence. Grammar induction using MCMC is applied to learn the grammar given a set of labeled sentences with corresponding logical forms. We develop a semantic parser that finds the logical form with the highest posterior probability exactly. We obtain strong results on the GeoQuery dataset and achieve state-of-the-art F1 on Jobs.


Introduction
Accurate and efficient semantic parsing is a long-standing goal in natural language processing. Existing approaches are quite successful in particular domains (Zettlemoyer and Collins, 2005, 2007; Wong and Mooney, 2007; Liang et al., 2011; Kwiatkowski et al., 2010, 2011, 2013; Li et al., 2013; Zhao and Huang, 2014; Dong and Lapata, 2016). However, they are largely domain-specific, relying on additional supervision such as a lexicon that provides the semantics or the type of each token (Zettlemoyer and Collins, 2005, 2007; Kwiatkowski et al., 2010, 2011; Liang et al., 2011; Zhao and Huang, 2014; Dong and Lapata, 2016), or a set of initial synchronous context-free grammar rules (Wong and Mooney, 2007; Li et al., 2013). To apply the above systems to a new domain, additional supervision is necessary. When beginning to read text from a new domain, humans do not need to re-learn basic English grammar. Rather, they may encounter novel terminology. With this in mind, our approach is akin to that of Kwiatkowski et al. (2013), in that we provide domain-independent supervision to train a semantic parser on a new domain. More specifically, we restrict the rules that may be learned during training to a set that characterizes the general syntax of English. While we do not explicitly present and evaluate an open-domain semantic parser, we hope our work provides a step in that direction.

Figure 1: High-level illustration of the setting in which our grammar is applied in this paper. The dark arrows outline the generative process. During parsing, the input is the observed sentence, and we wish to find the most probable logical form and derivation given the training data under the semantic prior.
Knowledge plays a critical role in natural language understanding. Even seemingly trivial sentences may have a large number of ambiguous interpretations. Consider the sentence "Ada started the machine with the GPU," for example. Without additional knowledge, such as the fact that "machine" can refer to computing devices that contain GPUs, or that computers generally contain devices such as GPUs, the reader cannot determine whether the GPU is part of the machine or if the GPU is a device that is used to start machines. Context is highly instrumental to quickly and unambiguously understand sentences.
In contrast to most semantic parsers, which are built on discriminative models, our model is fully generative: To generate a sentence, the logical form is first drawn from a prior. A grammar then recursively constructs a derivation tree top-down, probabilistically selecting production rules from distributions that depend on the logical form (see Figure 1 for a high-level schematic diagram). The semantic prior distribution provides a straightforward way to incorporate background knowledge, such as information about the types of entities and predicates, or the context of the utterance. Additionally, our generative model presents a promising direction to jointly learn to understand and generate natural language.
This article describes the following contributions:
• In Section 2, we present our grammar formalism in its general form.
• Section 2.2 discusses aspects of the model in its application to the later experiments.
• In Section 3, we present a method to perform grammar induction in this model. Given a set of observed sentences and their corresponding logical forms, we apply Markov chain Monte Carlo (MCMC) to infer the posterior distributions of the production rules in the grammar.
• In Section 4, given a trained grammar, we develop a method to perform parsing: finding the k-best logical forms for a given sentence, leveraging the semantic prior to guide the search.
• Using the GeoQuery and Jobs datasets, we demonstrate in Section 6 that this framework can be applied to create natural language interfaces for semantic formalisms as complex as Datalog/lambda calculus, which contain variables, scope ambiguity, and superlatives.
All code and datasets are available at github.com/asaparov/parser.

Semantic grammar
A grammar in our formalism operates over a set of nonterminals N and a set of terminal symbols W. It can be understood as an extension of a context-free grammar (CFG) (Chomsky, 1956) in which the generative process for the syntax depends on a logical form, thereby coupling syntax with semantics. In the top-down generative process of a derivation tree, a logical form guides the selection of production rules. Production rules in our grammar have the form A → B_1:f_1 ... B_k:f_k, where A ∈ N is a nonterminal, B_i ∈ N ∪ W are right-hand side symbols, and f_i are semantic transformation functions. These functions can encode how to "decompose" the logical form when recursively generating the subtrees rooted at each B_i; thus, they enable semantic compositionality. An example of a grammar in this framework is shown in Figure 2, and a derivation tree is shown in Figure 3. Let R be the set of production rules in the grammar and R_A be the set of production rules with left-hand nonterminal symbol A.

Figure 2: Example of a grammar in our framework:
S → N : select_arg1  VP : delete_arg1
VP → V : identity  N : select_arg2
VP → V : identity
N → "tennis"    N → "Andre Agassi"    N → "Chopin"
V → "swims"     V → "plays"
This grammar operates on logical forms of the form predicate(first argument, second argument). The semantic function select_arg1 returns the first argument of the logical form. Likewise, the function select_arg2 returns the second argument. The function delete_arg1 removes the first argument, and identity returns the logical form with no change. In our use of the framework, the interior production rules (the first three listed above) are examples of rules that we specify, whereas the terminal rules and the posterior probabilities of all rules are learned via grammar induction. We also use a richer semantic formalism than in this example; Section 2.2 provides more detail.

Figure 3: Derivation tree of "Andre Agassi plays tennis" using the grammar in Figure 2:
S : plays_sport(agassi, tennis)
  N : agassi → "Andre Agassi"
  VP : plays_sport(·, tennis)
    V → "plays"
    N : tennis → "tennis"
The logical form corresponding to every node is shown beside the respective node. The logical form for V is plays_sport(·, tennis) and is omitted to reduce clutter.

Generative process
A parse tree (or derivation) in this formalism is a tree where every interior node is labeled with a nonterminal symbol, every leaf is labeled with a terminal, and the root node is labeled with the root nonterminal S. Moreover, every node in the tree is associated with a logical form: let x n be the logical form assigned to the tree node n, and x 0 = x for the root node 0.
The generative process to build a parse tree begins with the root nonterminal S and a logical form x. We expand S by randomly drawing a production rule from R S , conditioned on the logical form x. This provides the first level of child nodes in the derivation tree. So if, for example, the rule S → B 1 :f 1 . . . B k :f k were drawn, the root node would have k child nodes, n 1 , . . . , n k , respectively labeled B 1 , . . . , B k . The logical form associated with each node is determined by the semantic transformation function: x n i = f i (x). These functions describe the relationship between the logical form at a child node and that of its parent node. This process repeats recursively with every right-hand side nonterminal symbol, until there are no unexpanded nonterminal nodes. The sentence is obtained by taking the yield of the terminals in the tree (a concatenation).
The semantic transformation functions are specific to the semantic formalism and may be defined as appropriate to the application. In our parsing application, we define a domainindependent set of transformation functions (e.g., one function selects the left n conjuncts in a conjunction, another selects the n th argument of a predicate instance, etc).
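As a concrete illustration, the toy grammar of Figure 2 and the generative process above can be sketched as follows (a minimal sketch of our own; the rule and function names mirror Figure 2, but rule choice here is uniform rather than drawn from the HDP described in Section 2.2):

```python
import random

# Logical forms in the toy formalism: predicate(arg1, arg2), encoded as tuples.
# Semantic transformation functions decompose a logical form for each child.
def select_arg1(lf): return lf[1]                  # first argument
def select_arg2(lf): return lf[2]                  # second argument
def delete_arg1(lf): return (lf[0], None, lf[2])   # remove first argument
def identity(lf): return lf

# Interior production rules: nonterminal -> list of (symbol, transform) pairs.
RULES = {
    "S": [[("N", select_arg1), ("VP", delete_arg1)]],
    "VP": [[("V", identity), ("N", select_arg2)]],
}
# Terminal rules, keyed by (preterminal, logical form).
LEXICON = {
    ("N", "agassi"): "Andre Agassi",
    ("N", "tennis"): "tennis",
    ("V", ("plays_sport", None, "tennis")): "plays",
}

def generate(symbol, lf, rng=random):
    """Top-down generation: expand nonterminals, emit terminals; the sentence
    is the concatenated yield."""
    key = (symbol, lf)
    if key in LEXICON:
        return [LEXICON[key]]
    rule = rng.choice(RULES[symbol])  # a real draw would come from the HDP
    tokens = []
    for child, f in rule:
        tokens.extend(generate(child, f(lf), rng))
    return tokens

print(" ".join(generate("S", ("plays_sport", "agassi", "tennis"))))
# prints: Andre Agassi plays tennis
```

This reproduces the derivation of Figure 3: S splits the logical form between the subject N and the VP, which in turn splits it between V and the object N.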

Selecting production rules
In the above description, we did not specify the distribution from which rules are selected from R_A. There are many modeling options available when specifying this distribution. In our approach, we choose a hierarchical Dirichlet process (HDP) prior (Teh et al., 2006). Every nonterminal A ∈ N in our grammar will be associated with an HDP hierarchy. For each nonterminal, we specify a sequence of semantic feature functions {g_1, ..., g_m}, each of which returns a discrete feature (such as an integer) of an input logical form x. We use this sequence of feature functions to define the hierarchy of the HDP: starting with the root node, we add a child node for every possible value of the first feature function g_1. For each of these child nodes, we add a grandchild node for every possible value of the second feature function g_2, and so forth. The result is a complete tree of depth m. Each node n in this tree is assigned a distribution G_n as follows:

    G_0 ~ DP(α_0, H),    G_n ~ DP(α_n, G_π(n)) for n ≠ 0,

where 0 is the root node, π(n) is the parent of n, the α are a set of concentration parameters, and H is a base distribution over R_A. This base distribution is independent of the logical form x.
To select a rule in the generative process, given the logical form x, we can compute its feature values (g 1 (x), . . . , g m (x)) which specify a unique path in the HDP hierarchy to a leaf node G x . We then draw the production rule from G x . The specified set of production rules and semantic features are included with the code package. The specified rules and features do not change across our experiments.
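The path lookup can be sketched as follows (the feature functions and the toy hierarchy below are our own illustration; in the actual model, each leaf holds a draw G_n from the HDP, so sparsely observed paths back off smoothly to ancestor distributions rather than failing):

```python
# Feature functions g_1, ..., g_m for one nonterminal (illustrative names).
def predicate(lf):
    return lf[0]

def arg2(lf):
    return lf[2]

FEATURES = [predicate, arg2]

# Nested dicts keyed by feature values; leaves map production rules to weights.
HIERARCHY = {
    "plays_sport": {
        "tennis": {"VP -> V N": 0.9, "VP -> V": 0.1},
    },
}

def leaf_distribution(lf, hierarchy=HIERARCHY, features=FEATURES):
    """Follow the path (g_1(x), ..., g_m(x)) to the leaf distribution G_x."""
    node = hierarchy
    for g in features:
        node = node[g(lf)]
    return node

dist = leaf_distribution(("plays_sport", None, "tennis"))
```

For the logical form plays_sport(·, tennis), the path (plays_sport, tennis) leads to a leaf that strongly prefers the rule VP → V N, as in the example of Figure 3.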
Take, for example, the derivation in Figure 3. In the generative process where the node VP is expanded, the production rule is drawn from the HDP associated with the nonterminal VP. Suppose the HDP was constructed using a sequence of two semantic features: (predicate, arg2). In the example, the feature functions are evaluated on the logical form plays_sport(·, tennis) and return the sequence (plays_sport, tennis). This sequence uniquely identifies a path in the HDP hierarchy from the root node 0 to a leaf node n. The production rule VP → V N is drawn from this leaf node G_n, and the generative process continues recursively.
In our implementation, we divide the set of nonterminals N into two groups: (1) the set of "interior" nonterminals, and (2) preterminals. The production rules of preterminals are restricted such that the right-hand side contains only terminal symbols. The rules of interior nonterminals are restricted such that only nonterminal symbols appear on the right side.
1. For preterminals, we set H to be a distribution over sequences of terminal symbols as follows: we generate each token in the sequence i.i.d. from a uniform distribution over a finite set of terminals and a special stop symbol with probability φ_A. Once the stop symbol is drawn, we have finished generating the rule. Note that we do not specify a set of domain-specific terminal symbols in defining this distribution.

2. For interior nonterminals, we specify H as a discrete distribution over a domain-independent set of production rules. This requires specifying a set of nonterminal symbols, such as S, NP, VP, etc. Since these production rules contain semantic transformation functions, they are specific to the semantic formalism. We emphasize that only the prior is specified here; we use grammar induction to infer the posterior. In principle, a more relaxed choice of H may enable grammar induction without pre-specified production rules, and therefore without dependence on a particular semantic formalism or natural language, if an efficient inference algorithm can be developed for such cases.
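The base distribution H for preterminals can be sketched as a simple sampler (our own rendering; `stop_prob` plays the role of φ_A):

```python
import random

def sample_terminal_rhs(terminals, stop_prob, rng=random):
    """Sample a right-hand side for a preterminal from the base distribution H:
    each token is drawn i.i.d. uniformly from `terminals`, and generation stops
    as soon as the special stop symbol (probability `stop_prob`) is drawn."""
    tokens = []
    while rng.random() >= stop_prob:  # continue until the stop symbol is drawn
        tokens.append(rng.choice(terminals))
    return tokens
```

The sequence length is geometrically distributed with mean (1 − φ_A)/φ_A, so φ_A controls how strongly H favors short terminal sequences.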

Induction
We describe grammar induction independently of the choice of rule distribution. Let θ be the random variables in the grammar: in the case of the HDP prior, θ is the set of all distributions G_n at every node in the hierarchies. Given a set of sentences y = {y_1, ..., y_n} and corresponding logical forms x = {x_1, ..., x_n}, we wish to compute the posterior p(t, θ | x, y) over the unobserved variables: the grammar θ and the latent derivations/parse trees t = {t_1, ..., t_n}. This is intractable to compute exactly, so we resort to Markov chain Monte Carlo (MCMC) (Gelfand and Smith, 1990; Robert and Casella, 2010). To perform blocked Gibbs sampling, we pick initial values for t and θ and repeat the following:
1. For i = 1, ..., n, sample t_i | θ, x_i, y_i.
2. Sample θ | t.
However, since the sampling of each tree t_i depends on θ, and we need to resample all n parse trees before sampling θ, this Markov chain can be slow to mix. Thus, we employ collapsed Gibbs sampling by integrating out θ. In this algorithm, we repeatedly sample

    t_i ~ p(t_i | t_{-i}, x, y),    (3)

which, for each nonterminal A, depends on the rules {r_n : n ∈ t_i, n labeled with the nonterminal A} together with the corresponding rule counts in the other trees t_{-i}; here r_n is the production rule at node n, and 1{·} is 1 if the condition is true and zero otherwise. With θ integrated out, the probability does not necessarily factorize over rules. In the case of the HDP prior, selecting a rule increases the probability that the same rule is selected again (due to the "rich get richer" effect observed in the Chinese restaurant process). We instead use a Metropolis-Hastings step to sample t_i, where the proposal distribution is given by the fully factorized form

    q(t_i) = ∏_{n ∈ t_i} p(r_n | x_n, t_{-i}).

After sampling t*_i, we choose to accept the new sample with probability

    min{ 1, [p(t*_i | t_{-i}, x, y) q(t_i)] / [p(t_i | t_{-i}, x, y) q(t*_i)] },

where t_i, here, is the old sample, and t*_i is the newly proposed sample. In practice, this acceptance probability is very high. This approach is very similar in structure to that of Johnson et al. (2007).
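The Metropolis-Hastings accept/reject step itself is generic and can be sketched as follows (the tree-specific log probabilities p(t_i | t_{-i}, x, y) and the factorized proposal probabilities are assumed to be computed elsewhere):

```python
import math
import random

def mh_accept(log_p_new, log_p_old, log_q_new, log_q_old, rng=random):
    """Generic Metropolis-Hastings accept/reject step. log_p_* are the
    collapsed log posteriors of the proposed and current trees; log_q_* are
    their log probabilities under the factorized proposal distribution."""
    # log acceptance ratio: (p_new / p_old) * (q_old / q_new), in log space
    log_alpha = (log_p_new - log_p_old) + (log_q_old - log_q_new)
    return rng.random() < math.exp(min(0.0, log_alpha))
```

Because the factorized proposal q closely tracks the collapsed posterior, the acceptance ratio is near 1 in practice, as noted above.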
If an application requires posterior samples of the grammar variables θ, we can obtain them by drawing from θ | t after the collapsed Gibbs sampler has mixed. Note that this algorithm requires no further supervision beyond the utterances y and logical forms x. However, it is able to exploit additional information, such as supervised derivations/parse trees. For example, a lexicon can be provided where each entry is a terminal symbol y_i with a corresponding logical form label x_i. We evaluate our method with and without such a lexicon.
Refer to Saparov and Mitchell (2016) for details on HDP inference and computing p(r n |x n , t −i ).

Sampling t*_i
To sample from equation (3), we use inside-outside sampling (Finkel et al., 2006; Johnson et al., 2007), a dynamic programming approach, where the inside step is implemented using an agenda-driven chart parser (Indurkhya and Damerau, 2010). The algorithm fills a chart, which has a cell for every nonterminal A, sentence start position i, end position j, and logical form x. The algorithm aims to compute the inside probability of every chart cell: that is, for every cell (A, i, j, x), we compute the probability that t*_i contains a subtree rooted with the nonterminal A and logical form x, spanning the sentence positions (i, j). Let I_(A,i,j,x) be the inside probability at the chart cell (A, i, j, x):

    I_(A,i,j,x) = Σ_{A → B_1:f_1 ... B_K:f_K} p(A → B_1:f_1 ... B_K:f_K | x, t_{-i}) Σ_{i = l_1 ≤ ... ≤ l_{K+1} = j} ∏_{k=1}^{K} I_(B_k, l_k, l_{k+1}, f_k(x)).

Each item in the agenda represents a snapshot of the computation of this expression for a single rule A → B_1:f_1 ... B_K:f_K. The agenda item stores the current position in the rule k, the set of sentence spans that correspond to the first k right-hand side symbols l_1, ..., l_{k+1}, the span of the rule (i, j), the logical form x, and the inside probability of the portion of the rule computed so far. At every iteration, the algorithm pops an item from the agenda, adds it to the chart, and considers the next right-hand side symbol B_k.
• If B_k is a terminal, the algorithm matches it against the input sentence. If the terminal does not match the sentence, this agenda item is discarded and the algorithm continues to the next iteration. If the terminal does match, the algorithm increments the rule: for each possible value of l_{k+2}, the algorithm constructs a new agenda item containing the same contents as the old agenda item, but with rule position k + 1.
• If B_k is a nonterminal, the algorithm expands it (if it was not previously expanded at this cell). The algorithm considers every production rule of the form B_k → β, and every possible end position for the next nonterminal l_{k+2} = l_{k+1} + 1, ..., j − 1, and enqueues a new agenda item with rule B_k → β, rule position set to 1, span set to (l_k, l_{k+1}), logical form set to f_k(x), and inside probability initialized to 1. The original agenda item is said to be "waiting" for B_k to be completed later in the algorithm.
• If the rule is complete (there are no subsequent symbols in the rule of this agenda item), we can compute its inner probability p(A → B_1:f_1 ... B_K:f_K | x, t_{-i}). First, we record that this rule was used to complete the left-hand nonterminal A at the cell (A, i, j, x). Then, we consider every agenda item in the chart that is currently "waiting" for the left-hand nonterminal A at this sentence span. The search increments each "waiting" item, adding a new item to the agenda for each, whose log probability is the sum of the log probability of the old agenda item and the log probability of the completed rule.
We prioritize items in the agenda by i − j (so items with smaller spans are dequeued first). This ensures that whenever the search considers expanding B k , if B k was previously expanded at this cell, its inside probability is fully computed. Thus, we can avoid re-expanding B k and directly increment the agenda item. The algorithm terminates when there are no items in the agenda. All that remains is the outside step: to sample the tree given the computed inside probabilities. To do so, we begin with the chart cell (S, 0, |y i |, x i ) where |y i | is the length of sentence y i , and we consider all completed rules at this cell (these rules will be of the form S → β). Each rule will have a computed inside probability, and we can sample the rule from the categorical distribution according to these inside probabilities. Then, we consider the right-hand side nonterminals in the selected rule, and continue sampling recursively. The end result is a tree sampled from equation (3).
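The outside step's rule choice at each chart cell is a draw from a categorical distribution weighted by inside probabilities, which can be sketched as follows (our own minimal rendering):

```python
import random

def sample_rule(completed, rng=random):
    """Outside step at one chart cell: draw one completed rule with probability
    proportional to its inside probability.
    `completed` maps each rule to its computed inside probability."""
    rules = list(completed)
    weights = [completed[r] for r in rules]
    u = rng.random() * sum(weights)
    for rule, w in zip(rules, weights):
        u -= w
        if u < 0:
            return rule
    return rules[-1]  # guard against floating-point underflow
```

Starting at the root cell (S, 0, |y_i|, x_i), one rule is drawn this way, and the procedure recurses into the right-hand side nonterminals of the chosen rule.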

Parsing
For a new sentence y*, we aim to find the logical form x* and derivation t* that maximize

    p(x*, t* | y*, θ) ∝ p(x*) ∏_{n ∈ t*} p(r_n | x_n, θ) 1{yield(t*) = y*}.    (5)

Here, θ is a point estimate of the grammar, which may be obtained from a single sample, or from a Monte Carlo average over a finite set of samples.
To perform parsing, we first describe an algorithm to compute the derivation t* that maximizes the above quantity, given the logical form x* and input sentence y*. We will later demonstrate how this algorithm can be used to find the optimal logical form and derivation x*, t*. To find the optimal t*, we again use an agenda-driven chart parser to perform the optimization, with a number of important differences. Each agenda item keeps track of the derivation tree completed so far.
The algorithm is very similar in structure to the inside algorithm described above. At every iteration of the algorithm, an item is popped from the agenda and added to the chart, applying one of the three operations available to the inside algorithm. The algorithm begins by expanding the root nonterminal S at (0, |y * |) with the logical form x * .

Agenda prioritization
The most important difference from the inside algorithm is the prioritization of agenda items. For a given agenda item with rule A → B_1:f_1 ... B_K:f_K with logical form x at sentence position (i, j), we aim to assign as its priority an upper bound on equation (5) for any derivation that contains this rule at this position. To do so, we can split the product in the objective, ∏_{n ∈ t*} p(r_n | x_n, θ), into two components: (1) the inner probability, the product of the terms that correspond to the subtree of t* rooted at the current agenda item, and (2) the outer probability, the product of the remaining terms, which correspond to the parts of t* outside of the subtree rooted at the agenda item. A schematic decomposition of a derivation tree is shown in Figure 4.
We define an upper bound I_(A,i,j) on the log inner probability of any subtree rooted at nonterminal A at sentence span (i, j):

    I_(A,i,j) = max_{A → B_1:f_1 ... B_K:f_K} [ max_{x'} log p(A → B_1:f_1 ... B_K:f_K | x', θ) + max_{l_2, ..., l_K} Σ_{k=1}^{K} I_(B_k, l_k, l_{k+1}) ],

where l_1 = i, l_{K+1} = j. Note that the left term is a maximum over all logical forms x', and so this upper bound only considers syntactic information. The right term can be maximized using dynamic programming in O(K²). As such, classical syntactic parsing algorithms can be applied to compute I for every chart cell in O(n³). For any terminal symbol T, we define I_(T,i,j) = 0. We similarly define O_(A,i,j,x), a bound on the outer probability at every cell:

    O_(A,i,j,x) = max_t [ Σ_{n ∈ t_L} log p(r_n | x_n, θ) + Σ_{n ∈ r(t_R)} I_(A_n, i_n, j_n) ],

where the maximum is taken over derivations t containing a subtree rooted at A at sentence position (i, j). In this expression, t_L is the outer-left portion of the derivation tree t, t_R is the outer-right portion, r(t_R) is the set of root vertices of the trees in t_R, and A_n and (i_n, j_n) denote the nonterminal label and span of node n. Using these two upper bounds, we define the priority of any agenda item with rule A → B_1:f_1 ... B_K:f_K at rule position k, with log probability score ρ, and logical form x as:

    priority = ρ + max_{l_{k+2}, ..., l_K} Σ_{k'=k+1}^{K} I_(B_{k'}, l_{k'}, l_{k'+1}) + O_(A,i,j,x),    (8)

where the spans of the remaining right-hand side symbols are maximized over, with l_{K+1} = j.

Theorem 1. If the priority of agenda items is computed as in equation (8), then at every iteration of the chart parser, the priority of new agenda items will be at most the priority of the current item.

Proof. See supplementary material A.
Thus, the search is monotonic 1 . That is, the maximum priority of items in the agenda never increases.
This property allows us to compute the outer probability bound O_(A,i,j,x) for free; computing it directly is intractable. Consider the expansion step for an agenda item with rule A → B_1:f_1 ... B_K:f_K at rule position k, with log probability score ρ, and logical form x. The nonterminal B_k is expanded next at sentence position (l_k, l_{k+1}), and its outer probability bound is simply the priority of the current agenda item with the inner bound I_(B_k, l_k, l_{k+1}) for B_k itself removed. The monotonicity of the search guarantees that any subsequent expansion of B_k at (l_k, l_{k+1}) will not yield a more optimal bound. Monotonicity also guarantees that when the algorithm completes a derivation for the root nonterminal S, it is optimal (i.e., the Viterbi parse). In this way, we can continue execution to obtain the k-best parses for the given sentence.

Figure 4: Decomposition of a parse tree into its left outer parse, inner parse, and right outer parse. This is one example of such a decomposition; for instance, we may similarly produce a decomposition where the prepositional phrase is the inner parse, or where the verb is the inner parse. The terminals are omitted and only the syntactic portion of the parse is displayed here for conciseness.
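Stripped of the chart-specific details, the search is a monotone best-first (A*-style) loop over a max-priority agenda; a generic sketch of our own, with a toy `expand` function standing in for the chart operations:

```python
import heapq

def best_first(start_items, expand, is_goal):
    """Monotone best-first search: because successor priorities never exceed
    their parent's (Theorem 1), the first goal item popped is optimal.
    start_items: list of (priority, item); expand(item) yields (priority, successor)."""
    counter = 0  # tie-breaker so heap entries never compare items directly
    agenda = []
    for p, item in start_items:
        heapq.heappush(agenda, (-p, counter, item))  # negate: heapq is a min-heap
        counter += 1
    while agenda:
        neg_p, _, item = heapq.heappop(agenda)
        if is_goal(item):
            return -neg_p, item
        for p2, succ in expand(item):
            heapq.heappush(agenda, (-p2, counter, succ))
            counter += 1
    return None
```

Continuing to pop goal items after the first one yields the k-best solutions in order, which is how the k-best parses are obtained.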

Optimization over logical forms
The above algorithm finds the optimal derivation t*, given a sentence y*, logical form x*, and grammar θ. To jointly optimize over both the derivation and the logical form, given θ, imagine running the above algorithm repeatedly for every logical form. This approach, implemented naively, is clearly infeasible due to the sheer number of possible logical forms. However, there is a great deal of overlap across the multiple runs, corresponding to shared substructure across logical forms, which we can exploit to develop an efficient and exact algorithm. At the first step of every run, the root nonterminal is expanded for every logical form. This would create a new agenda item for every logical form, identical in every field except for the logical form (and therefore its prior probability). Thus, we can represent this set of agenda items as a single agenda item, where instead of an individual logical form x, we store a logical form set X. The outer probability bound is now defined over sets of logical forms: O_(A,i,j,X) = max_{x ∈ X} O_(A,i,j,x). We can use this quantity in equation (8) to compute the priority of these "aggregated" agenda items. Thus, this algorithm is a kind of branch-and-bound approach to the combinatorial optimization problem. A sparse representation of a set of logical forms is essential for efficient parsing. Another difference arises after completing the parsing of a rule A → B_1:f_1 ... B_K:f_K with a set of logical forms X, where we need to compute log p(A → B_1:f_1 ... B_K:f_K | x, θ). In the inside algorithm, this was straightforward since there was only a single logical form. But in the parsing setting, X is a set of logical forms, and the aforementioned probability can vary across instances within this set (for the HDP prior, for example, the set may correspond to multiple distinct paths in the HDP hierarchy). Therefore, we divide X into its equivalence classes.
More precisely, consider the partition of X into disjoint subsets X = X_1 ∪ ... ∪ X_m, where X_i ∩ X_j = ∅ for i ≠ j, such that p(A → B_1:f_1 ... B_K:f_K | x', θ) is the same for every x' ∈ X_i. For each equivalence class X_i, we create a "completed nonterminal" item with the appropriate parse tree, log probability, and logical form set X_i. With these, we continue inspecting the chart for search states "waiting" for the nonterminal A.
The increment operation is also slightly different in the parser. When we increment a rule A → B_1:f_1 ... B_K:f_K after completing parsing for the symbol B_k with logical form set X, we create a new agenda item with the same contents as the old item, but with the rule position increased by one. The log probability of the new agenda item is the sum of the log probabilities of the old agenda item and the completed subtree. Similarly, the logical form set of the new agenda item is the intersection of {f_k^{-1}(x) : x ∈ X} and the logical form set of the old agenda item.
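The set-valued increment can be sketched as follows (our own rendering; here `f_inverse` maps a child logical form back to a single parent logical form, whereas in general f_k^{-1} may be set-valued):

```python
def increment_lf_set(item_lfs, completed_lfs, f_inverse):
    """Pull each completed child logical form back through the inverse
    transformation f_k^{-1} and intersect with the logical forms still
    compatible with the partially parsed rule."""
    pulled_back = {f_inverse(x) for x in completed_lfs}
    return item_lfs & pulled_back
```

For example, completing the object N with logical form "tennis" under a rule whose transform selects the second argument prunes the parent set down to the logical forms whose second argument is tennis.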

Semantic prior
The modular nature of the semantic prior allows us to explore many different models of logical forms. We experiment with a fairly straightforward prior: predicate instances are generated left-to-right, conditioned only on the last predicate instance that was sampled for each variable. When a predicate instance is sampled, its predicate, arity, and "direction" 2 are simultaneously sampled from a categorical distribution. Functions like largest, shortest, etc., are sampled in the same process. We again use an HDP to model the discrete distribution conditioned on a discrete random variable. We also follow Wong and Mooney (2007), Li et al. (2013), and Zhao and Huang (2014) and experiment with type-checking, where every entity is assigned a type in a type hierarchy, and every predicate is assigned a functional type. We incorporate type-checking into the semantic prior by placing zero probability on type-incorrect logical forms. More precisely, logical forms are distributed according to the original prior, conditioned on the fact that the logical form is type-correct. Type-checking requires the specification of a type hierarchy. Our hierarchy contains 11 types for GeoQuery and 12 for Jobs. We run experiments with and without type-checking for comparison.

Figure 5: The methods in the top part of the table were evaluated using 10-fold cross validation, whereas those in the bottom part were evaluated with an independent test set. (Supervision types referenced in the table include: a domain-specific set of initial synchronous CFG rules, a domain-independent set of lexical templates, a domain-independent set of interior production rules, a domain-specific initial lexicon, and type-checking and type specification for entities.)

Figure 6: Examples of sentences generated from our trained grammar on logical forms in the GeoQuery test set. Generation is performed by computing arg max_y* p(y* | x*, θ).
Logical form: answer(A,smallest(A,state(A)))
Test sentence: "Which state is the smallest?"
Generated: "What state is the smallest?"
Logical form: answer(A,largest(B,(state(A),population(A,B))))
Test sentence: "Which state has the most population?"
Generated: "What is the state with the largest population?"

Results
To evaluate our parser, we use the GeoQuery and Jobs datasets. Following Zettlemoyer and Collins (2007), we use the same 600 GeoQuery sentences for training and an independent test set of 280 sentences. On Jobs, we use the same 500 sentences for training and 140 for testing. We run our parser with two setups: (1) with no domain-specific supervision, and (2) using a small domain-specific lexicon and a set of beliefs (such as the fact that Portland is a city). For each setup, we run the experiments with and without type-checking, for a total of 4 experimental setups. A given output logical form is considered correct if it is semantically equivalent to the true logical form. 3 We measure the precision and recall of our method, where precision is the number of correct parses divided by the number of sentences for which our parser provided output, and recall is the number of correct parses divided by the total number of sentences in each dataset. Our results are shown compared against many other semantic parsers in Figure 5. Our method is labeled GSG for "generative semantic grammar." The numbers for the baselines were copied from their respective papers, and so their specified lexicons/type hierarchies may differ slightly.
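The evaluation metrics can be stated precisely as a small helper (our own sketch; the numbers in the usage line are illustrative, not results from the paper):

```python
def precision_recall_f1(n_correct, n_output, n_total):
    """Precision over sentences for which the parser produced output, recall
    over all sentences in the dataset, and the resulting F1 score."""
    precision = n_correct / n_output
    recall = n_correct / n_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative numbers only:
p, r, f1 = precision_recall_f1(90, 100, 120)
```

Note that precision and recall differ only in the denominator, so a parser that declines to output a parse on hard sentences trades recall for precision.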
Many sentences in the test set contain tokens previously unseen in the training set. In such cases, the maximum possible recall is 88.2 and 82.3 on GeoQuery and Jobs, respectively. Therefore, we also measure the effect of adding a domain-specific lexicon, which maps semantic constants like maine to the noun "maine," for example. This lexicon is analogous to the string-matching and argument identification steps of previous parsers. We constructed the lexicon manually, with an entry for every city, state, river, and mountain in GeoQuery (141 entries), and an entry for every city, company, position, and platform in Jobs (180 entries).
Aside from the lexicon and type hierarchy, the only training information is given by the set of sentences y, corresponding logical forms x, and the domain-independent set of interior production rules, as described in Section 2.2. In our experiments, we found that the sampler converges rapidly, with only 10 passes over the data. This is largely due to our restriction of the interior production rules to a domain-independent set.
We emphasize that the addition of type-checking and a lexicon is mainly to enable a fair comparison with past approaches. As expected, their addition greatly improves parsing performance. Our method achieves state-of-the-art F1 on the Jobs dataset. However, even without such domain-specific supervision, the parser performs reasonably well.

Related work
Our grammar formalism can be related to synchronous CFGs (SCFGs) (Aho and Ullman, 1972), in which the semantics and syntax are generated simultaneously. However, instead of modeling the joint probability of the logical form and natural language utterance p(x, y), we model the factorized probability p(x)p(y|x). Modeling each component in isolation provides a cleaner division between syntax and semantics, and one half of the model can be modified without affecting the other (such as the addition of new background knowledge, or changing the language/semantic formalism). We used a CFG in the syntactic portion of our model (although our grammar is not context-free, due to the dependence on the logical form). Richer syntactic formalisms such as combinatory categorial grammar (Steedman, 1996) or head-driven phrase structure grammar (Pollard and Sag, 1994) could replace the syntactic component in our framework and may provide a more uniform analysis across languages. Our model is similar to lexical functional grammar (LFG) (Kaplan and Bresnan, 1995), with f-structures replaced by logical forms. Nothing in our model precludes incorporating syntactic information like f-structures into the logical form, and as such, LFG can be realized in our framework. Our approach can be used to define new generative models of these grammatical formalisms. We implemented our method with a particular semantic formalism, but the grammatical model is agnostic to the choice of semantic formalism or language. As in some previous parsers, a parallel can be drawn between our parsing problem and the problem of finding shortest paths in hypergraphs using A* search (Klein and Manning, 2001, 2003; Pauls and Klein, 2009; Pauls et al., 2010; Gallo et al., 1993).

Discussion
In this article, we presented a generative model of sentences, in which each sentence is generated recursively top-down according to a semantic grammar, with each step conditioned on the logical form. We developed a method to learn the posterior of the grammar using a Metropolis-Hastings sampler. We also derived a Viterbi parsing algorithm that takes into account the prior probability of the logical forms. Through this semantic prior, background knowledge and other information can be easily incorporated to better guide the parser during its search. Our parser provides state-of-the-art results when compared with past approaches. As a generative model, there are promising applications to interactive learning, caption generation, data augmentation, etc. Richer semantic priors can be applied to perform ontology learning, relation extraction, or context modeling. Applying this work to semi-supervised settings is also interesting. The avenues for future work are numerous.

The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, DARPA, or the US government. The US Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.