
Domain-general semantic parsing is a long-standing goal in natural language processing, where the semantic parser is capable of robustly parsing sentences from domains outside of which it was trained. Current approaches largely rely on additional supervision from new domains in order to generalize to those domains. We present a generative model of natural language utterances and logical forms and demonstrate its application to semantic parsing. Our approach relies on domain-independent supervision to generalize to new domains. We derive and implement efficient algorithms for training, parsing, and sentence generation. The work relies on a novel application of hierarchical Dirichlet processes (HDPs) for structured prediction, which we also present in this manuscript. This manuscript is an excerpt of chapter 4 from the Ph.D. thesis of Saparov (2022), where the model plays a central role in a larger natural language understanding system. This manuscript provides a new simplified and more complete presentation of the work first introduced in Saparov, Saraswat, and Mitchell (2017). The description and proofs of correctness of the training algorithm, parsing algorithm, and sentence generation algorithm are much simplified in this new presentation. We also describe the novel application of hierarchical Dirichlet processes for structured prediction. In addition, we extend the earlier work with a new model of word morphology, which utilizes the comprehensive morphological data from Wiktionary.


Introduction
Accurate and efficient semantic parsing is a long-standing goal in natural language processing. Existing approaches are quite successful in particular domains (Zettlemoyer and Collins 2005, 2007; Wong and Mooney 2007; Liang, Jordan, and Klein 2013; Kwiatkowski et al. 2010, 2011, 2013; Li, Liu, and Sun 2013; Wang, Kwiatkowski, and Zettlemoyer 2014; Zhao and Huang 2015; Dong and Lapata 2016; Rabinovich, Stern, and Klein 2017). However, they are largely domain-specific, relying on additional supervision such as a lexicon that provides the semantics or the type of each token in a set (Zettlemoyer and Collins 2005, 2007; Kwiatkowski et al. 2010, 2011; Liang, Jordan, and Klein 2013; Wang, Kwiatkowski, and Zettlemoyer 2014; Zhao and Huang 2015; Dong and Lapata 2016; Rabinovich, Stern, and Klein 2017), or a set of initial synchronous context-free grammar rules (Wong and Mooney 2007; Li, Liu, and Sun 2013). To apply the above systems to a new domain, additional supervision is necessary. When beginning to read text from a new domain, humans do not need to re-learn basic English grammar. Rather, they may encounter novel terminology. With this in mind, our approach is akin to that of Kwiatkowski et al. (2013), where we provide domain-independent supervision to help train a semantic parser. More specifically, our semantic parsing model restricts the rules that may be learned during training to a set that characterizes the general syntax of English. While we do not explicitly present and evaluate an open-domain semantic parser, we hope our work provides a step in that direction.
Knowledge plays a critical role in natural language understanding. Even seemingly trivial sentences may have a large number of ambiguous interpretations. Consider the sentence "Ada started the machine with the GPU," for example. Without additional knowledge, such as the fact that "machine" can refer to computing devices that contain GPUs, or that computers generally contain devices such as GPUs, the reader cannot determine whether the GPU is part of the machine or if the GPU is a device that is used to start machines. Context is highly instrumental in quickly and unambiguously understanding sentences.
In contrast to most semantic parsers, which are built on discriminative models, our model is fully generative: to generate a sentence, the logical form is first drawn from a prior. A grammar then recursively constructs a derivation tree top-down, probabilistically selecting production rules from distributions that depend on the logical form. The semantic prior distribution provides a straightforward way to incorporate background knowledge, such as information about the types of entities and predicates, or the context of the utterance. Additionally, our generative model presents a promising direction to jointly learn to understand and generate natural language. Further, our parser can return partial parses of sentences, which is useful for sentences that contain a small number of unseen words, such as definitions of new tokens. This can be exploited to learn new tokens and concepts outside of training.
In section 2, we present a novel application of hierarchical Dirichlet processes (HDPs) to structured prediction. We use this HDP model within our semantic parsing model, where HDPs are used to model dependence on logical forms. A mathematical description of the semantic parsing model is given in section 3. In section 4, we describe the algorithms for training, parsing, and generation, including details on their implementation. In section 5, we apply this parsing approach to the GEOQUERY and JOBS datasets (Zelle and Mooney 1996; Tang and Mooney 2000), using the Datalog representation of the provided logical form labels, and demonstrate that the accuracy of the parsed logical forms is comparable to that of the state of the art on these datasets.

Hierarchical Dirichlet processes for structured prediction
In order to describe our novel application of HDPs to structured prediction, which plays a central role in our semantic parsing model, we must first define some notation as well as some useful properties of Dirichlet processes.
The Dirichlet process (DP) (Ferguson 1973) is a distribution over probability distributions (i.e., samples from a DP are themselves distributions). If a distribution G is drawn from a DP, we can write G ∼ DP(α, H), where the DP is characterized by two parameters: a concentration parameter α > 0 and a base distribution H. The DP has the useful property that E[G] = H, and the concentration parameter α describes the "closeness" of G to the base distribution H. If α is small, G tends to differ more from the base distribution H. If α is large, G is more similar to H.
DPs are often used in statistical machine learning models where observations y_1, y_2, . . . are distributed according to G:

y_1, y_2, . . . ∼ G.

The joint probability of y_1, . . ., y_n, with G marginalized out, is given by

p(y_1, . . ., y_n) = α_n ∏_{k=1}^m α Γ(n_k) H(y*_k),

where the y*_k are the unique values among y_1, . . ., y_n, m is the number of such values, n_k ≜ #{i : y_i = y*_k} is the number of times y*_k appears in y_1, . . ., y_n, and α_n ≜ Γ(α)/Γ(α + n) is the normalization term.
In these models, the Chinese restaurant process (CRP) (Aldous 1985) provides a convenient equivalent description:

z_{i+1} = k with probability n_k / (α + i), and z_{i+1} = k_new with probability α / (α + i),

where n_k ≜ #{j ≤ i : z_j = k} is the number of times k appears in {z_1, . . ., z_i}, and k_new ≜ max{z_1, . . ., z_i} + 1 is the next integer that does not appear in {z_1, . . ., z_i}. The analogy is to imagine a restaurant with a countably infinite sequence of tables, labeled 1, 2, 3, . . . The first person comes into the restaurant and sits at table 1. Each subsequent person that enters the restaurant chooses to sit at a table with probability proportional to the number of people already sitting at that table; otherwise, they choose to sit at an empty table with probability proportional to α. Here, z_i indicates which table the i-th customer chose to sit at, n_k is the number of people sitting at table k, and k_new is the index of the next unoccupied table. Each table is assigned a sample from H, drawn independently and identically distributed (i.i.d.), where φ_i is the sample assigned to table i. Each observation y_i is the sample from H assigned to the table that the i-th customer chose to sit at (i.e., table z_i). The CRP provides a simple algorithm to generate samples from a DP model. Notice that if α is very large, every customer is likely to choose to sit at a new table, and so each y_i is likely to be drawn i.i.d. from H (and therefore, G would be very similar to H). The opposite is true when α is small: G would be heavily concentrated on a small handful of observations, as each customer is more likely to sit at a table with existing customers. The CRP is exchangeable, a useful property whereby the joint distribution of the table assignments z is independent of their order. That is, for any permutation σ of the integers: p(z_1, z_2, . . .) = p(z_{σ(1)}, z_{σ(2)}, . . .).
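To make the table-assignment process concrete, the following is a minimal Python sketch of forward sampling from a single CRP; the function sample_H, standing in for the base distribution, is a hypothetical placeholder for whatever H is in the application.

```python
import random

def sample_crp(num_customers, alpha, sample_H):
    """Forward-sample observations y_1, ..., y_n from DP(alpha, H) via the
    Chinese restaurant process; sample_H draws a sample from H."""
    table_counts = []   # n_k: number of customers seated at table k
    table_dishes = []   # phi_k: the sample from H assigned to table k
    observations = []
    for i in range(num_customers):
        # the (i+1)-th customer sits at table k with probability n_k / (alpha + i),
        # or at a new table with probability alpha / (alpha + i)
        r = random.uniform(0, alpha + i)
        for k, n_k in enumerate(table_counts):
            if r < n_k:
                table_counts[k] += 1
                observations.append(table_dishes[k])
                break
            r -= n_k
        else:  # open a new table and assign it a fresh draw from H
            table_counts.append(1)
            table_dishes.append(sample_H())
            observations.append(table_dishes[-1])
    return observations

# small alpha concentrates the draws on a handful of values; large alpha approaches i.i.d. H
print(sample_crp(10, alpha=0.5, sample_H=lambda: random.choice("abcde")))
```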
Note that this presentation of the DP differs from the classical presentation, where the DP is part of a mixture model, as in:

G ∼ DP(α, H), θ_i ∼ G, y_i ∼ F(θ_i),

where F(θ_i) is a distribution with parameter θ_i. If H is a conjugate prior of F, then an efficient Gibbs sampling algorithm is available, for example if H is a Dirichlet distribution and F is a multinomial, or if both H and F are normal distributions. In this manuscript, F is assumed to be the delta function (the distribution whose samples are identical to the input parameter), and no assumptions are made on H other than that there exists an efficient way to compute the prior probability p(φ_i).

Hierarchical Dirichlet processes
The DP can be used as a component in larger models. The hierarchical Dirichlet process (HDP) (Teh et al. 2006) is a hierarchy of random variables, where each random variable is distributed according to a Dirichlet process whose base distribution is given by the parent node in the hierarchy. Suppose each observation y_i is coupled with a parameter x_i that indicates the source node from which to sample the observation. Let the label of the root node in the hierarchy be 0. The model can then be written:

G^0 ∼ DP(α^0, H), G^n ∼ DP(α^n, G^{parent(n)}) for all non-root nodes n in the hierarchy, y_i ∼ G^{x_i}.

An equivalent "Chinese restaurant" representation may be written, which is called a Chinese restaurant franchise (CRF), where each node n has its own restaurant. For simplicity, assume that all x_i are leaf nodes. In the CRF, the customers at the restaurant of each node n are seated according to a CRP with concentration parameter α^n, where n^n_k ≜ #{j ≤ i : z^n_j = k} is the number of customers at node n sitting at table k, k_new ≜ max{z^n_1, . . ., z^n_i} + 1 is the next available table at node n, and u_i ≜ #{j < i : x_j = x_i} is the number of previous observations drawn from the same node as y_i. In this extended metaphor, whenever a customer sits at a new table in the restaurant at a non-root node n, a "new customer" appears in the parent node parent(n) corresponding to this table; at the root, each new table is assigned an i.i.d. sample from H. The ψ^n_i are the samples from G^n. Note that the above model is valid only when x_i is a leaf node. If x_i were a parent node, then its output samples ψ^{x_i}_j would be used both by the child nodes of x_i and by the observations. In the restaurant metaphor, the customers at node x_i not only come from its child nodes but also from the observations. In this case, the ψ^{x_i}_j that are assigned to the observations come after those assigned to child nodes (the order does not actually matter thanks to exchangeability, so long as the samples/customers are partitioned between the two). More precisely, y_i would be assigned the sample at the table chosen by its corresponding customer, where that customer is seated after the customers that come from the child nodes of x_i.
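As an illustration of the restaurant metaphor, here is a minimal Python sketch of forward sampling from a Chinese restaurant franchise over a small, fixed hierarchy; the base distribution H is again assumed to be available only through a sampling function, and the class and parameter names are ours, not the implementation's.

```python
import random

class Restaurant:
    """One node of a Chinese restaurant franchise: a CRP whose base
    distribution is its parent's restaurant (or H at the root)."""
    def __init__(self, alpha, parent=None, sample_H=None):
        self.alpha = alpha
        self.parent = parent        # parent Restaurant, or None at the root
        self.sample_H = sample_H    # draw from H (used at the root only)
        self.table_counts = []      # number of customers at each table
        self.table_dishes = []      # psi_k: the sample assigned to each table

    def seat_customer(self):
        """Seat one customer and return the sample it is served."""
        r = random.uniform(0, sum(self.table_counts) + self.alpha)
        for k, n_k in enumerate(self.table_counts):
            if r < n_k:
                self.table_counts[k] += 1
                return self.table_dishes[k]
            r -= n_k
        # a new table's sample is a draw from the base distribution: a new
        # customer in the parent restaurant, or a draw from H at the root
        dish = self.parent.seat_customer() if self.parent else self.sample_H()
        self.table_counts.append(1)
        self.table_dishes.append(dish)
        return dish

root = Restaurant(alpha=1.0, sample_H=lambda: random.choice(["a", "b", "c"]))
leaf1 = Restaurant(alpha=0.5, parent=root)   # two leaf nodes sharing the root
leaf2 = Restaurant(alpha=0.5, parent=root)
print([leaf1.seat_customer() for _ in range(5)], [leaf2.seat_customer() for _ in range(5)])
```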

Inferring the source node x
Sections A and B describe how to efficiently obtain posterior samples of z (and therefore, of φ and ψ) using Markov chain Monte Carlo (MCMC), given a set of observations y_i and the corresponding nodes x_i from which they were sampled. But now consider the case where the x are random variables: we encounter a new observation y*, the source node x* (from which y* was sampled) is unknown, and we would like to infer it.
That is, we would like to compute

x* = arg max_n p(x* = n | y*, x, y) ≈ arg max_n p(x* = n) Σ_t p(y* | x* = n, z^(t), φ^(t), ψ^(t)), (21)

where p(y* | x* = n, z^(t), φ^(t), ψ^(t)) is computed as in equations 59 and 60 using the posterior samples of z, φ, and ψ. The arg max over this objective is a discrete optimization problem, which, if solved naïvely, would require computing the objective function for every node n in the tree. This is intractable if the tree is very large. Therefore, we present a branch-and-bound algorithm to perform this optimization efficiently.

Branch-and-bound (Land and Doig 1960) is a method for solving discrete optimization problems. Pseudocode is shown in algorithm 1. Given an objective function f, a heuristic h, and a search space X, the algorithm returns the k best elements of X that maximize the objective f. The algorithm requires that the heuristic h be an upper bound for f: that is, for any set S, h(S) ≥ max_{x∈S} f(x). The algorithm begins by considering the full search space X. A procedure called branch then partitions X into n disjoint subsets X_i (this procedure is specific to the optimization problem). Each subset is pushed onto a priority queue, with its key given by the heuristic h(X_i). Then, in each iteration of the main loop, the algorithm pops a set S from the priority queue and repeats the process: using branch, it partitions S into (S_1, . . ., S_n), and then pushes each subset onto the priority queue with key h(S_i). If S = {x} is a singleton set containing only the element x, then x is added to a list of potential solutions. The algorithm terminates when there are k potential solutions whose objective function values are at least the priority of S, or when the priority queue becomes empty.

Algorithm 1: Pseudocode for a generic branch-and-bound algorithm for k-best discrete optimization.

Once the algorithm terminates, the objective function values of the returned solutions are at least as large as the heuristic of the remainder of the search space. And since h is an upper bound for f, the returned solutions are guaranteed to be optimal.

We develop a branch-and-bound algorithm to perform the optimization in equation 21. The HDP hierarchy provides a convenient search tree structure for the optimization. Let D(n) be the set of descendant nodes of n, including n itself. The function branch(D(n)) is defined to partition D(n) into ({n}, D(c_1), . . ., D(c_k)), where the c_i are the child nodes of n. We define a heuristic h(D(n)) for D(n) as the product of h_x(D(n)) and a sum over the MCMC samples, where h_x(S) is an upper bound on the prior satisfying h_x(S) ≥ max_{x∈S} p(x), the max within each term of the sum is taken over all occupied tables in the restaurant at node n, and the references to ψ within the sum are to the t-th sample, ψ^(t). D(n) can be sparsely represented in the implementation as a simple pointer to n. The heuristic is convenient since it can be computed using only the information available at node n, so its running time is not a function of the size of the HDP hierarchy, as long as the heuristic on the prior h_x(·) is easy to compute. Furthermore, our algorithm avoids the recursion in the computation of p(ψ^n_new), since the term p(ψ^{parent(n)}_new) was already computed in the computation of the heuristic for the parent node, and our algorithm re-uses it in future heuristic evaluations.
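The following Python sketch mirrors the generic k-best branch-and-bound procedure described above (algorithm 1); branch, the objective f, and the heuristic h are supplied by the caller, and h is assumed to upper-bound f on every set. The toy usage at the end is only for illustration.

```python
import heapq

def branch_and_bound(search_space, branch, f, h, k=1):
    """Return up to the k best elements maximizing f, assuming h(S) >= max of f over S."""
    queue = [(-h(search_space), 0, search_space)]   # max-heap via negated keys
    counter = 1                                     # tie-breaker so sets are never compared
    solutions = []
    while queue:
        neg_priority, _, S = heapq.heappop(queue)
        priority = -neg_priority
        # terminate once k found solutions score at least the best remaining priority
        if sum(1 for x in solutions if f(x) >= priority) >= k:
            break
        if len(S) == 1:
            solutions.append(next(iter(S)))         # singleton: a candidate solution
            continue
        for subset in branch(S):                    # partition S into disjoint subsets
            if subset:
                heapq.heappush(queue, (-h(subset), counter, subset))
                counter += 1
    return sorted(solutions, key=f, reverse=True)[:k]

# toy usage: find the two integers in [0, 100) maximizing f(x) = -(x - 37)^2
f = lambda x: -(x - 37) ** 2
h = lambda S: max(f(x) for x in S)          # an exact bound, for illustration only
branch = lambda S: (frozenset(sorted(S)[:len(S) // 2]), frozenset(sorted(S)[len(S) // 2:]))
print(branch_and_bound(frozenset(range(100)), branch, f, h, k=2))   # [37, 36] or [37, 38]
```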

Theorem 1
The heuristic h(D(n)) is an upper bound on max_{x∈D(n)} f(x), where f is the objective function given by equation 21.
Proof. Consider any node m ∈ D(n), a descendant of n, and any MCMC sample t. We first aim to show that the quantity within the sum of the heuristic is an upper bound on the corresponding term of the objective. Since the right-hand side is equal to p(ψ^m_new = y*), the bound is trivially true in the case where m = n. So we can assume without loss of generality that m ≠ n, and the right-hand side can be rewritten according to equation 59 as a convex combination of the indicators 1{ψ^m_k = y*} and of p(ψ^m_new = y*). Due to equation 59, observe that p(ψ^a_new = y*) ≤ p(ψ^{parent(a)}_new = y*) for any node a. In addition, by construction of the HDP, the ψ^a_k at any node a are a subset of the ψ^{parent(a)}_k at its parent. These observations extend to all ancestors of a. Applying these two observations to the node m, we can conclude that the above expression is bounded above by the quantity within the sum of the heuristic at node n. We have thus shown that the quantity within the sum of the heuristic is an upper bound.

Since by definition h_x(D(n)) ≥ max_{x∈D(n)} p(x) ≥ p(x* = m), the full heuristic h(D(n)) is an upper bound on f(m), the objective function evaluated at m, for all m ∈ D(n). Therefore, h(D(n)) ≥ max_{m∈D(n)} f(m).
The branch-and-bound algorithm starts with the input set D(0), which is the set of all nodes in the tree, and efficiently computes the k most probable values of the source node x* from which the observation y* was sampled. Note that the above algorithm is easily extended to the case where the HDP is part of a mixture model (i.e., F is not a delta function). To do so, replace each instance of 1{ψ^n_k = y*} with p(y* | y* ∼ F(θ^n_k), z, φ), for all n and k. The above algorithm can also be generalized to the case where x* is restricted to a subset X of the nodes in the hierarchy: arg max_{x*} p(x* | x* ∈ X, y*, x, y). In this case, the algorithm is started with the input set D(0) ∩ X, and the branch function is modified to partition D(n) ∩ X into ({n} ∩ X, D(c_1) ∩ X, . . ., D(c_k) ∩ X), where the c_i are the child nodes of n.

Infinite hierarchies
To apply the HDP in our semantic parsing model, we need to be able to handle the case where the HDP hierarchy is infinite (but with finite height). That is, every non-leaf node in the hierarchy may have an infinite number of children. But this makes no difference in the MCMC algorithm to infer z, φ, and ψ, since the number of given observations (x, y) is finite. We only need to compute and keep track of the variables that are associated with an observation (either at the current node or at a descendant). Thus, the only nodes of the tree that we need to explicitly keep in memory are those of x and their ancestors, as the restaurants at all other nodes are empty. The explicitly-stored tree size is bounded by the product of the number of distinct x_i and the height of the tree.
However, the branch-and-bound algorithm to find the most probable source node x* needs to be adapted, since the branch function would otherwise return an infinite number of subsets. Consider any node n that has no observations (i.e., has an empty restaurant). Then by equation 59, p(ψ^n_new = y*) = p(ψ^a_new = y*), where a is the closest non-empty ancestor of n. For such nodes, the objective function in equation 21 simplifies: aside from the prior term p(n), all empty descendant nodes of a have the same objective function value, which is independent of n. So to adapt the algorithm to the infinite-hierarchy case, the branch function is modified to partition D(a) into ({a}, D(c_1), . . ., D(c_k), D(c_{k+1}) ∪ D(c_{k+2}) ∪ . . .), where (c_1, . . ., c_k) are the non-empty child nodes of a, and (c_{k+1}, c_{k+2}, . . .) are the empty child nodes of a, grouped into a single subset. Next, in algorithm 1, following line 8, we add a new else-if statement to check for the case that S is a set of empty nodes. If so, S is added to the list of potential solutions, and we do not continue the search into the empty descendant nodes. The resulting adapted branch-and-bound algorithm correctly and efficiently solves the optimization problem for infinite hierarchies.

Modeling dependence on discrete structures
HDPs can be used to learn distributions that depend on sequences of non-negative integers. Consider the data {(x_1, y_1), . . ., (x_n, y_n)}, where each x_i ∈ Z_+^h is a sequence of h non-negative integers, and the distribution of y_i depends on the value of x_i. We can use the HDP to learn this dependence: construct a hierarchy of height h, where each non-leaf node has a countably infinite number of children, every child node corresponding to a non-negative integer. Here, each x_i uniquely identifies a leaf node in the hierarchy by characterizing a path from the root 0 to a leaf: the first integer in the sequence identifies the child of the root node, the second integer identifies the grandchild, and so on. The y_i are then sampled from the corresponding leaf node. We can apply MCMC to learn the distributions of the y_i, and how those distributions relate to the integer sequences x_i.
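As a minimal Python sketch of this indexing scheme, the conceptually infinite hierarchy can be stored sparsely, materializing only the nodes that lie on the path of some observed integer sequence (the class and method names here are illustrative):

```python
class HDPNode:
    """A node of the (conceptually infinite) HDP hierarchy, stored sparsely:
    only children reached by some observed integer sequence are materialized."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}        # maps a non-negative integer to a child HDPNode
        self.observations = []    # observations y attached at this node

    def descend(self, sequence):
        """Follow the path given by a sequence of non-negative integers,
        creating any missing nodes, and return the final node."""
        node = self
        for w in sequence:
            node = node.children.setdefault(w, HDPNode(parent=node))
        return node

root = HDPNode()
for x, y in [((4, 3, 1), "y1"), ((4, 7, 4), "y2"), ((4, 8, 2), "y3")]:
    root.descend(x).observations.append(y)
# only the nodes on these three paths exist explicitly in memory
print(sorted(root.children[4].children.keys()))   # -> [3, 7, 8]
```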
Given a new observation y*, the branch-and-bound algorithm can be used to find the most probable corresponding integer sequence x*, but we need to be able to convert the output of the branch-and-bound into the corresponding integer sequence. The algorithm outputs a list of the k most probable source nodes from which y* was sampled, or sets of empty source nodes (since the HDP hierarchy is infinite). More precisely, let (o_1, . . ., o_k) be the output of the branch-and-bound algorithm. For each o_j, there are two cases: (1) o_j is a single leaf node, in which case it is straightforward to convert the node into its corresponding integer sequence; or (2) o_j is the set of empty descendants of a node a, in which case it can be converted into an "incomplete" sequence of integers, where the first L(a) numbers of the sequence correspond to the node a, and L(a) is the level of a. This incomplete sequence represents the set of all integer sequences that begin with the same L(a) integers and that do not already explicitly exist in the tree.
For example, let n_a be the a-th child of the root node 0 in the HDP hierarchy, let n_{a,b} be the b-th child of n_a, and so on. Suppose the training set contains only the sequences (4, 3, 1), (4, 7, 4), and (4, 8, 2). Then the nodes in the HDP with non-empty restaurants are n_{4,3,1}, n_{4,7,4}, n_{4,8,2}, n_{4,3}, n_{4,7}, n_{4,8}, n_4, and 0. If the branch-and-bound algorithm returns n_{4,7,4}, the corresponding output integer sequence is (4, 7, 4). If instead branch-and-bound returns the set of the empty descendant nodes of n_4, the corresponding output integer sequence is (4, * \ {3, 7, 8}, *). The '*' is a "wildcard" symbol that represents the set of all non-negative integers. Thus, (4, * \ {3, 7, 8}, *) represents the set of all integer sequences that start with (4, . . .) but do not start with (4, 3, . . .), (4, 7, . . .), or (4, 8, . . .). This model can be extended to the case where the x_i ∈ X have richer structure (e.g., X is a set of labeled trees, graphs, logical forms, etc.), i.e., to structured prediction. To do so, define d functions f_k : X → Z_+ that each characterize an aspect of the input structures x_i. We call these functions f_k feature functions. For example, if x is a labeled binary tree, f_1(x) might return the label of the root node, f_2(x) the label of the left child, and so on. The feature functions serve to map each structure x_i into a sequence of non-negative integers (f_1(x_i), f_2(x_i), . . ., f_d(x_i)). The above HDP model can then be directly used to learn the relationship between these integer sequences and the distribution of the observations y_i.
For a new observation y*, the branch-and-bound algorithm will return the k most likely integer sequences (possibly with wildcard symbols) that represent the unknown structure x*. To convert an integer sequence (w_1, . . ., w_d) into the corresponding set of structures in X, we can compute the set of all x ∈ X such that f_k(x) = w_k for every k where w_k is an integer, and f_k(x) ∉ W for every k where w_k is a wildcard * \ W. Our code implements three functions to perform the above mapping between integer sequences and more structured representations in X:
1. get_feature(f, X): Given a feature function f and a set X ⊆ X, return {f(x) : x ∈ X}.
2. set_feature(f, X_old, w): Given a feature function f, a set X_old ⊆ X, and a non-negative integer w ∈ Z_+, return X_old ∩ f^{-1}(w). This function is used in the case that w_k is an integer (not a wildcard).
3. exclude_features(f, X_old, W): Given a feature function f, a set X_old ⊆ X, and a finite set of non-negative integers W ⊂ Z_+, return X_old \ f^{-1}(W). This is used in the case that w_k is a wildcard * \ W.
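A minimal Python sketch of these three operations, assuming for illustration that sets of structures are represented extensionally as finite Python sets (the parser instead represents logical form sets sparsely):

```python
def get_feature(f, X):
    """Return the set of feature values {f(x) : x in X}."""
    return {f(x) for x in X}

def set_feature(f, X_old, w):
    """Keep only the structures whose feature value equals the integer w."""
    return {x for x in X_old if f(x) == w}

def exclude_features(f, X_old, W):
    """Remove the structures whose feature value lies in the finite set W
    (used when the corresponding sequence component is a wildcard * \\ W)."""
    return {x for x in X_old if f(x) not in W}

# toy structures: labeled binary trees encoded as (root label, left label, right label)
trees = {(1, 2, 3), (1, 5, 3), (4, 2, 3)}
f1 = lambda t: t[0]   # feature: label of the root node
f2 = lambda t: t[1]   # feature: label of the left child
print(set_feature(f1, trees, 1))           # structures whose root label is 1
print(exclude_features(f2, trees, {2}))    # structures whose left-child label is not 2
```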

Related work
The HDP hierarchy in our proposed model in section 2.4 resembles a decision tree (Russell and Norvig 2010): the input features determine the path within the tree, and the output is sampled from a leaf node. Teh (2006) constructs a language model using a hierarchical Pitman-Yor process (HPY), which is a generalization of the HDP that exhibits power-law behavior. In their model, the HPY describes the distribution of the next character in a sequence of characters, conditioned on the previous d characters. The sequence of the preceding d characters corresponds to a path in a hierarchy of depth d. Our approach is a novel application of HDPs to structured prediction, where the path in the hierarchy is a random variable that corresponds to the structure we aim to predict. Since the HDP hierarchies are infinite, the model does not a priori impose a limit on the number of possible structures or logical forms. An idea for future work is to replace the HDP in our model with the HPY to better capture the power-law behavior that is prevalent in natural language.
Figure 1
Example of a grammar in our framework. This example grammar operates on logical forms of the form predicate(first argument, second argument). The semantic function select_arg1 returns the first argument of the logical form. Likewise, the function select_arg2 returns the second argument. The function delete_arg1 removes the first argument, and identity returns the logical form with no change. In our work, the interior production rules (the first three listed above) are examples of rules that we specify, whereas the terminal rules and the posterior probabilities of all rules are learned via grammar induction. A simplified semantic representation is shown here for the sake of illustration; our semantic parser uses a richer semantic representation. Section 3.2 provides more detail.

Model: semantic grammar
A grammar in our formalism operates over a set of nonterminals N and a set of terminal symbols W. It can be understood as an extension of a context-free grammar (CFG) (Chomsky 1956) in which the generative process for the syntax is dependent on a logical form, thereby coupling syntax with semantics. In the top-down generative process of a derivation tree, a logical form guides the selection of production rules. Production rules in our grammar have the form A → B_1:f_1 . . . B_K:f_K, where A ∈ N is the left-hand nonterminal, the B_i are right-hand side symbols, and the f_i are semantic transformation functions. These functions describe how to "decompose" the logical form when recursively generating the subtrees rooted at each B_i. Thus, they enable semantic compositionality. An example of a grammar in this framework is shown in figure 1, and a derivation tree is shown in figure 2. Let R be the set of production rules in the grammar and R_A be the set of production rules with left-hand nonterminal symbol A.
Figure 2
Example of a derivation tree under the grammar given in figure 1. The logical form corresponding to every node is shown in blue beside the respective node. The logical form for V is borders(,nj) and is omitted to reduce clutter.

Generative process
A derivation tree in this formalism is a tree where every interior node is labeled with a nonterminal symbol in N, every leaf is labeled with a terminal in W, and the root node is labeled with the root nonterminal S. Moreover, every node in the tree is associated with a logical form: let x_n be the logical form assigned to the tree node n, with x_0 = x for the root node 0.
The generative process to build a derivation tree begins with the root nonterminal S and a logical form x, drawn from a prior distribution on logical forms p(x). The generative process expands S by randomly drawing a production rule from R_S, conditioned on the logical form x. This provides the first level of child nodes in the derivation tree. For example, if the rule S → B_1:f_1 . . . B_k:f_k were drawn, the root node would have k child nodes, n_1, . . ., n_k, respectively labeled with the symbols B_1, . . ., B_k. The logical form associated with each node is determined by the semantic transformation function: x_{n_i} = f_i(x_0). These functions describe the relationship between the logical form at a child node and that of its parent node. This process repeats recursively with every right-hand side nonterminal symbol, until there are no unexpanded nonterminal nodes. The sentence is obtained by taking the yield (i.e., the concatenation) of the terminals in the tree.
The semantic transformation functions are specific to the semantic formalism and may be defined as appropriate to the application. In our semantic parsing experiments in section 5, we define a domain-independent set of transformation functions specific to the Datalog representation of GEOQUERY and JOBS (e.g., one function selects the left n conjuncts in a conjunction, another selects the n-th argument of a predicate instance, etc.). Some examples of these transformation functions are:
• The function select_left returns the left conjunct of a conjunction. For example, given the Datalog expression (river(A), loc(A,B), const(B,stateid(colorado))), this function returns river(A).
• The function select_arg2 returns the second argument in an atomic formula. For example, given const(A,stateid(maine)), this function returns stateid(maine).
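As an illustration, here is a minimal Python sketch of these two transformation functions over a hypothetical tuple encoding of Datalog terms (an atom is a tuple of its predicate symbol followed by its arguments, and a conjunction is a tuple headed by ','); the actual implementation uses a richer representation, and returning None here stands in for the failure case discussed next.

```python
# hypothetical encoding: ('river', 'A') is river(A);
# (',', t1, t2, ...) is the conjunction (t1, t2, ...)
def select_left(lf):
    """Return the left conjunct of a conjunction, or None (failure) otherwise."""
    if lf[0] != ',' or len(lf) < 3:
        return None
    return lf[1]

def select_arg2(lf):
    """Return the second argument of an atomic formula, or None (failure) if absent."""
    if lf[0] == ',' or len(lf) < 3:
        return None
    return lf[2]

conj = (',', ('river', 'A'), ('loc', 'A', 'B'),
        ('const', 'B', ('stateid', 'colorado')))
print(select_left(conj))                                  # ('river', 'A')
print(select_arg2(('const', 'A', ('stateid', 'maine'))))  # ('stateid', 'maine')
```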
Semantic transformation functions are allowed to fail, which is useful for defining richer transformation functions and provides more flexibility when designing the production rules of the grammar. If, in the generative process, a transformation function returns failure, the generative process is restarted from the root (all progress up to the failure is discarded). As an example, failure enables the definition of transformation functions that check whether the input logical form satisfies a specific property: require_binary_conjunction returns the input logical form, unchanged, if it is a conjunction of length 2; otherwise, it returns failure. Since failure can cause the generative process to repeatedly restart, sampling using the generative process can be expensive. However, our approach does not generate sentences using this algorithm, and as we show in section 4, the performance of the parser is not adversely affected.

Selecting production rules
The above description does not specify the conditional distribution from which rules are selected from R_A given the logical form. There are many modeling options available in choosing this distribution, but we need a distribution that captures complex dependencies between the logical form and the selected production rule. For example, consider the grammar in figure 1 and the logical form plays_sport(michael_phelps,tennis). When generating a sentence for this logical form, at the root nonterminal S, there is only one production rule available, S → N:select_arg1 VP:delete_arg1, so this rule is selected. Now consider the child node corresponding to the nonterminal VP. Its logical form is plays_sport(,tennis), which is the output of the function delete_arg1 applied to the logical form at the root node. At this point, there are two choices of production rules with VP on the left-hand side: VP → V N and VP → V. In this case, we want the conditional distribution to select VP → V N, since the most likely sentence that conveys the semantics of the logical form is "plays tennis." In fact, VP → V N should be selected even when the logical form is plays_sport(,baseball) or plays_sport(,soccer) or almost any other sport. However, if the logical form were plays_sport(,swimming), we want the conditional distribution to give higher probability to VP → V, since the verb phrase "swims" is much more likely. Therefore, a desirable property of the conditional distribution for VP production rules is that it depends primarily on the predicate symbol (e.g., plays_sport) but also secondarily on the object argument (e.g., swimming or tennis).
Our model uses the HDP to capture this dependence, as presented in section 2.4. Every nonterminal A ∈ N in our grammar is associated with an HDP hierarchy. For each nonterminal, we specify a sequence of semantic feature functions {g_1, . . ., g_m}, each of which, given an input logical form x, returns a non-negative integer. The HDP hierarchy is a complete infinite tree of height m, where every parent node has an infinite number of child nodes, one for each non-negative integer. The base distribution H at the root of the HDP is over R_A.
Take, for example, the derivation in figure 2. In the generative process, when the node VP is expanded, the production rule is drawn from the HDP associated with the nonterminal VP. Suppose the HDP was constructed using a sequence of two semantic features: (predicate, arg2). In the example, the feature functions are evaluated on the logical form borders(,nj) and return a sequence of two integers, the first being the identifier for the symbol borders and the second the identifier for the symbol nj. This sequence uniquely identifies a path in the HDP hierarchy from the root node 0 to a leaf node n. The production rule VP → V N is drawn from this leaf node's distribution G_n, and the generative process continues recursively. As desired, the distribution of the selected production rule G_n depends on the HDP source node n, which itself depends primarily on the first feature, secondarily on the second feature, and so on (in this example, the predicate and arg2 of the logical form are the first and second features, respectively).
In our implementation, the set of nonterminals N is divided into two disjoint groups: (1) the set of "interior" nonterminals, and (2) the set of preterminals. The production rules of preterminals are restricted such that the right-hand side contains only terminal symbols. The rules of interior nonterminals are restricted such that only nonterminal symbols appear on the right-hand side.
1. For preterminals, H is a distribution over sequences of terminal symbols. A sequence of terminal symbols is distributed as follows: first sample the length of the sequence (i.e., the number of words) from a geometric distribution, and then generate each word in the sequence i.i.d. from a uniform distribution over a finite set of (initially unknown) terminals. Note that we do not specify a set of domain-specific terminal symbols in defining this distribution.
2. For interior nonterminals, H is a discrete distribution over a domain-independent set of production rules, which we specify. Since the production rules contain transformation functions, they are specific to the semantic formalism. However, prior knowledge of the English language can be encoded in these specified production rules, which dramatically improves the statistical efficiency of our model and obviates the need for massive training sets to learn English syntax. It is nonetheless tedious to design these rules while maintaining domain-generality. Once specified, however, these rules can in principle be re-used in new tasks and domains without further changes.
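A minimal Python sketch of the preterminal base distribution H described in item 1, assuming a geometric distribution over the number of words (with a hypothetical success probability p) and a uniform distribution over the terminals observed so far:

```python
import random

def sample_preterminal_rhs(known_terminals, p=0.5):
    """Sample a preterminal right-hand side: draw the number of words from a
    geometric distribution, then draw each word i.i.d. uniformly from the
    current (finite) set of known terminals."""
    length = 1
    while random.random() > p:   # geometric length, at least one word
        length += 1
    terminals = sorted(known_terminals)
    return tuple(random.choice(terminals) for _ in range(length))

print(sample_preterminal_rhs({"new", "jersey", "borders", "state"}))
```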
We emphasize that only the prior is specified here, and our algorithm uses grammar induction to infer the posterior. In principle, a more relaxed choice of H may enable grammar induction without pre-specified production rules, and therefore without dependence on a particular semantic formalism or natural language, if an efficient inference algorithm can be developed in such cases.

Modeling morphology
The grammar model is easily modified to incorporate word morphology. To do so, we add an additional step to the generative process after generating the terminal symbols. Instead of the terminal symbols constituting the tokens of the sentence directly, the terminal symbols are word roots coupled with morphological flags that indicate their inflection. For example, in the grammar in figure 1, rather than having multiple rules for the various inflections of the verb "to border," such as V → "borders", V → "bordered", V → "bordering", there would be only a single production rule for the root: V → "border". In order to produce the various inflections, the logical form is augmented to carry morphology information, and semantic transformation functions may add or modify morphological flags. For example, suppose we have the rule VP → V : add_third_person, add_present_tense, where add_third_person is a function that adds the 3RD flag (indicating third person) to the logical form, and add_present_tense is a function that adds the PRES flag (indicating present tense) to the logical form. These morphological flags are copied into the terminal symbols, and as a final step, the roots are inflected according to the morphological flags (e.g., "border[3RD,PRES]" is inflected to "borders"). See figure 3 for an example of a derivation tree for a grammar that models morphology.
Figure 3
Example of a derivation tree under a grammar with a model of morphology. The logical form corresponding to every node is shown in blue beside the respective node. The logical form for V is borders(,nj)[3RD,PRES] and is omitted to reduce clutter. Morphology is not modeled for proper nouns such as "Pennsylvania" and "NJ."

If a root with morphological flags has multiple inflections, such as "octopus"[PL] (PL indicates plural), the generative process selects one uniformly at random. During inference (i.e., parsing), this morphological model has the effect of performing morphological and syntactic-semantic parsing jointly, as we will show in the next section. Wiktionary (Wikimedia Foundation 2020) provides comprehensive high-quality morphology information for English verbs, common nouns, adjectives, and adverbs. Our implementation uses Wiktionary to construct a mapping between uninflected roots and inflected words, which is used in both directions: (1) given a root and morphological flags, find the corresponding set of inflections, or (2) given an inflected word, find the corresponding set of roots and morphological flags. Note that only (2) is necessary for parsing and training, whereas (1) is necessary for generation.
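A minimal Python sketch of the two directions of this mapping, with a tiny hand-written table standing in for the Wiktionary-derived data (the entries and flag names below are illustrative):

```python
import random

# toy inflection table: (root, flags) -> inflected forms; Wiktionary supplies the real data
INFLECTIONS = {
    ("border", frozenset({"3RD", "PRES"})): {"borders"},
    ("border", frozenset({"PAST"})): {"bordered"},
    ("octopus", frozenset({"PL"})): {"octopuses", "octopi", "octopodes"},
}

def inflect(root, flags):
    """Direction (1): root + morphological flags -> set of inflections (generation)."""
    return INFLECTIONS.get((root, frozenset(flags)), set())

def analyze(word):
    """Direction (2): inflected word -> set of (root, flags) pairs (parsing and training)."""
    return {(root, flags) for (root, flags), forms in INFLECTIONS.items() if word in forms}

print(analyze("borders"))                                  # {('border', frozenset({'3RD', 'PRES'}))}
print(random.choice(sorted(inflect("octopus", {"PL"}))))   # one inflection, chosen uniformly
```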
Although this morphology model is implemented in our code, we do not use it in our experiments on GEOQUERY and JOBS. The morphology model is used in the experiments described later in the thesis of Saparov (2022).

Training
In this section, we describe how to infer the latent derivation trees t ≜ {t_1, . . ., t_n}, given a collection of sentences y ≜ {y_1, . . ., y_n} and logical form labels x ≜ {x_1, . . ., x_n}, where each derivation tree t_i is distributed according to the conditional distribution described by the generative process in section 3.1 above.
We describe grammar induction independently of the choice of rule distribution. We wish to compute the posterior p(t | x, y) of the latent derivation trees. This is intractable to compute exactly, and so we resort to MCMC. To perform blocked Gibbs sampling, we pick initial values for t and repeatedly resample each derivation tree t_i from its conditional distribution given all other trees,

p(t_i | t_{-i}, x, y) ∝ 1{yield(t_i) = y_i} ∏_{A∈N} p(⋂_{n∈t_i} r_n | x, t_{-i}),

where N is the set of nonterminals, the intersection is taken over the nodes n ∈ t_i labeled with the nonterminal A in the i-th derivation tree, and r_n is the production rule at node n. Note that this probability does not necessarily factorize over rules, as is the case when using the HDP. So in order to sample t_i, we use Metropolis-Hastings (MH), where the proposal distribution is given by the fully factorized form

q(t*_i) ∝ ∏_{n∈t*_i} p(r_n | x_n, t_{-i}). (33)

The algorithm for sampling t*_i is detailed in section 4.1.1. After sampling t*_i, we accept the new sample with probability

min{1, (p(⋂_{n∈t*_i} r_n | x, t_{-i}) q(t_i)) / (p(⋂_{n∈t_i} r_n | x, t_{-i}) q(t*_i))}, (34)

where t_i here is the old sample and t*_i is the newly proposed sample. In practice, this acceptance probability is very high. This approach is very similar in structure to that of Johnson, Griffiths, and Goldwater (2007), Blunsom and Cohn (2010), and Cohn, Blunsom, and Goldwater (2010).
Computing the conditional probabilities of the production rules p(r_n | x_n, t_{-i}) and p(⋂_{n∈t_i} r_n | x, t_{-i}) (as well as the quantities required in sampling t*_i) depends on the model for selecting production rules. In our semantic parsing model, which uses an HDP model, these quantities can be computed using equations 61 and 64. Our parsing method only keeps the last MCMC sample (N_samples = 1), so for each node m ∈ t_j in every derivation tree, the production rule r_m at that node corresponds to a customer in the Chinese restaurant representation of the HDP associated with the nonterminal at node m. When resampling the derivation tree t_i, our method removes all customers that correspond to a production rule in t_i. Then it is straightforward to compute conditional probabilities according to equations 61 and 64 with the remaining customers. Once a new t_i is sampled, the customers corresponding to production rules in t_i are added to their respective restaurants.
There may be additional random variables in the grammar apart from the derivation trees, such as α in the HDPs. We perform Gibbs sampling steps for these variables after each loop of resampling the trees t_i, i = 1, . . ., n. The grammar induction algorithm is summarized as follows: pick initial values for t and α and repeat the following.
1. For i = 1, . . ., n, sample t*_i | α, t_{-i}, x_i, y_i from the distribution given by equation 33. Then accept this sample as the new value for t_i with probability given by equation 34.
2. Perform the Gibbs sampling step for α | t.
In all our experiments, we run the above loop for 10 iterations. Note that this algorithm requires no further supervision beyond the utterances y and logical forms x. However, it is able to exploit additional information such as supervised derivation trees: if a subset of the derivation trees is supervised, the Gibbs sampling algorithm simply avoids resampling those trees. These supervised derivation trees do not necessarily need to be rooted in the nonterminal S. For example, a lexicon can be provided where each entry is a terminal symbol y_i with a corresponding logical form label x_i. In our experiments on GEOQUERY and JOBS, we evaluate our method with and without such a lexicon.
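The following Python sketch outlines the grammar induction loop summarized above. The model-specific pieces (the inside-outside proposal sampler, the two conditional probabilities, the updates to the Chinese restaurant representation, and the α resampling step) are passed in as functions, since in our model they are computed from the HDPs; the function names are illustrative.

```python
import math, random

def grammar_induction(trees, num_iterations,
                      propose_tree,        # i -> a proposal t_i* sampled from equation 33
                      log_proposal_prob,   # (i, tree) -> log of the factorized proposal probability
                      log_joint_prob,      # (i, tree) -> log p of the tree's rules given the other trees
                      remove_tree, add_tree,   # update the Chinese restaurant representation
                      resample_alpha):
    for _ in range(num_iterations):
        for i in range(len(trees)):
            remove_tree(i, trees[i])             # remove the customers for the old t_i
            proposal = propose_tree(i)           # inside-outside sample (section 4.1.1)
            log_accept = (log_joint_prob(i, proposal) + log_proposal_prob(i, trees[i])
                          - log_joint_prob(i, trees[i]) - log_proposal_prob(i, proposal))
            if math.log(random.random()) < min(0.0, log_accept):
                trees[i] = proposal              # Metropolis-Hastings accept (equation 34)
            add_tree(i, trees[i])                # re-add the customers for the kept tree
        resample_alpha(trees)                    # Gibbs step for the HDP concentration parameters
    return trees
```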

Sampling t*_i
To sample from equation 33, we use inside-outside sampling (Finkel, Manning, and Ng 2006; Johnson, Griffiths, and Goldwater 2007), a dynamic programming approach. For every nonterminal A ∈ N, sentence start position i, end position j, and logical form x, let I_(A,i,j,x) be the probability that t*_i has a node n with the label A and logical form x that spans the sentence from position i to j. This is known as the inside probability. Similarly, for every production rule in the grammar A → B_1:f_1 . . . B_K:f_K, sentence boundary positions between the right-hand side symbols l_1 < . . . < l_{K+1}, and logical form x, let I_(A→B_1:f_1 ... B_K:f_K, l, x) be the probability that t*_i has a node n with the label A and logical form x whose child nodes are labeled B_u, each with logical form f_u(x), and each spanning the sentence from l_u to l_{u+1}. This is known as the inside rule probability. Note that we do not need to compute all possible inside probabilities for all logical forms (in many applications, the set of logical forms is infinite). Therefore, we compute these inside probabilities top-down, beginning at the root inside probability I_(S,0,|y_i|,x_i) with the known logical form x_i, where |y_i| is the length of sentence y_i. The following formulas can be used to compute these quantities recursively:

I_(A→B_1:f_1 ... B_K:f_K, l, x) = p(A → B_1:f_1 . . . B_K:f_K | x, t_{-i}) ∏_{u=1}^K I_(B_u, l_u, l_{u+1}, f_u(x)),
I_(A,i,j,x) = Σ_{A→B_1:f_1 ... B_K:f_K ∈ R_A} Σ_{i=l_1<...<l_{K+1}=j} I_(A→B_1:f_1 ... B_K:f_K, l, x).

If f_u(x) returns failure, then I_(B_u, l_u, l_{u+1}, f_u(x)) = 0. In the case that A is a preterminal, I_(A,i,j,x) = p(A → w | x, t_{-i}), where w is the substring of the sentence spanning positions i to j. Aside from the inside probabilities that were required to compute the root inside probability I_(S,0,|y_i|,x_i), all other inside probabilities are 0. In our code, this recursion is implemented iteratively, in order to avoid any issues with limited stack size and to share code with the parsing and generation algorithms. We also take care not to recompute previously computed inside probabilities. All that remains is the outside step: sample the derivation tree using the computed inside probabilities. To do so, start with the root nonterminal S at positions i = 0 to j = |y_i| with logical form x_i, and consider all production rules with S on the left-hand side, S → B_1:f_1 . . . B_K:f_K, and all sentence boundaries between the right-hand side symbols l_1 < . . . < l_K < l_{K+1} with l_1 = i and l_{K+1} = j. Sample a production rule and sentence boundaries with probability proportional to the inside rule probability I_(S→B_1:f_1 ... B_K:f_K, l, x_i). Next, consider each right-hand side nonterminal B_u of the selected rule, with start position l_u, end position l_{u+1}, and logical form f_u(x_i), and recursively repeat the sampling procedure. The end result is a tree sampled from equation 33.

Parsing
For a new sentence y*, we aim to find the logical form x* and derivation tree t* that maximize p(x*, t* | y*, x, y), where the posterior is approximated using the samples of t obtained from the above training procedure. For the parsing approach presented in this section, it is assumed that N_samples = 1, so p(x*, t* | y*, x, y) ≈ p(x*, t* | y*, t), where t is the last MH sample from the training procedure. Thus, we can write the objective function for parsing as

p(x*, t* | y*, t) ∝ p(x*) ∏_{n∈t*} p(r_n | x*_n, t), subject to yield(t*) = y*. (41)

This is a discrete optimization problem, which we solve using branch-and-bound (see algorithm 1). The algorithm starts by considering the set of all derivation trees of y* and partitions it into a number of subsets (the "branch" step). For each subset S, we compute an upper bound on the log probability of any derivation in S (the "bound" step). This bound is given by equations 43, 44, and 45. Having computed the bound for each subset, we push the subsets onto a priority queue, prioritized by the bound. We then pop the subset with the highest bound and repeat this process, further subdividing the set into subsets, computing the bound for each subset, and pushing them onto the queue. Eventually, we will pop a subset containing a single derivation, which is provably optimal if its objective function value according to the above equation is at least the priority of the next item in the queue. We can continue the algorithm to obtain the top-k derivations/logical forms. Since this algorithm operates over sets of logical forms (where each set is possibly infinite), we must implement a data structure to sparsely represent such sets of formulas, as well as algorithms to perform set operations, such as intersection and subtraction. Each set of derivations is sparsely represented in our implementation as a single incomplete derivation tree (i.e., the leaf nodes may be either terminals or nonterminals) and a logical form set. The logical form set represents the logical form of the root node of every derivation tree in the set. The logical forms at the other nodes can be computed by using the semantic transformation functions. In addition, every nonterminal node with at least one child has two integer indices that indicate its start and end positions in the sentence y*. This data structure represents the set of all derivation trees whose nodes match the nodes in the incomplete derivation tree at the given sentence positions. In addition, each derivation tree set has an integer counter that indicates to the branch function how to subdivide the set. Each such set of derivation trees is also called a search state. As an example, consider the input sentence "Trenton is the capital of New Jersey." Now consider a search state where the incomplete derivation tree contains only a single node labeled NP with start position 3 and end position 7 (corresponding to "capital of New Jersey") and whose logical form set is the set of all logical forms. This search state represents the set of all derivation trees that have a node with label NP that is the common ancestor of the terminals in "capital of New Jersey." Given a set of derivation trees, the branch function is defined in algorithm 2. The branch-and-bound algorithm is started with a derivation tree set whose incomplete derivation tree has a single root node with nonterminal S, the set of all logical forms, start position 0, and end position |y*|.
Algorithm 2: Pseudocode for branch in the branch-and-bound algorithm for the parser, which aims to maximize equation 41.

In the branch step, new search states are constructed by iterating over the sentence positions j_{k+1} such that j_k < j_{k+1} < j; where X_m is the logical form set of S_m, the new derivation tree set S* has counter 1, its incomplete derivation tree is identical to that of S except that c_k is substituted with the incomplete derivation tree of S_m, the logical form set at its root is X_{k,l}, and the end position of c_{k+1} is j_{k+1}. The other missing piece in algorithm 2 is line 28, which depends on the model for selecting production rules. Our semantic parsing model uses an HDP model and is able to directly use the algorithm described in section 2.2 to compute X*. Algorithm 4 may also be used here to return the m-th most likely logical form(s).
The above branch-and-bound algorithm requires a heuristic function that, for an input search state (a set of derivation trees), returns an upper bound on the objective function in equation 41 over all derivation trees in the set. This heuristic function determines the order in which search states are visited. The product in the objective, ∏_{n∈t*} p(r_n | x*_n, t), can be decomposed at any node n ∈ t* into a product of two components: (1) the inner probability at n, which is the product of the terms corresponding to the subtree rooted at n, and (2) the outer probability, which is the product of the remaining terms, corresponding to the parts of t* outside of the subtree rooted at n.
To help define this heuristic, we define an upper bound I_(A,i,j) on the log inner probability of any derivation tree rooted at nonterminal A with start position i and end position j in the sentence. This bound maximizes over production rules A → B_1:f_1 . . . B_K:f_K, over boundary positions l_1 < . . . < l_{K+1} with l_1 = i and l_{K+1} = j, and over all logical forms x′; because of the maximum over all logical forms x′, this upper bound only considers syntactic information. Computing the maximum over rules and logical forms depends on the model for selecting production rules; since our semantic parsing model uses an HDP, it uses the branch-and-bound approach in section 2.2 to compute this term. The maximum over boundary positions can be computed using dynamic programming with running time O(K^2). As such, classical syntactic parsing algorithms can be applied to compute I for every chart cell in O(n^3). For any terminal symbol w ∈ W, we define I_(w,i,j) = 0.

We now define the upper bound heuristic on any search state S whose incomplete derivation tree has root node n, start position i, end position j, and logical form set X. If n has no child nodes, the heuristic is given by equation 43. Else, if n has a nonterminal child node without children, where the production rule at n is A → B_1:f_1 . . . B_K:f_K, k is the smallest index of a nonterminal child node, and m is the value of the counter, the heuristic is given by equation 44. Else, if all the nonterminal child nodes of n have children, and m is the value of the counter, the heuristic is given by equation 45. The max in the third term of equation 44 can be computed via dynamic programming with running time O(K^2). In the equation for ρ, the sum over m ∈ S \ n is over all nodes in the incomplete derivation tree of S, excluding n. To avoid recomputing ρ every time h is invoked, our implementation stores it in every search state; its initial value is 0. In algorithm 2, on line 22, the log probability of the new search state S* is equal to the sum of the log probability of the old state S and the log probability of S_m. On line 30, the log probability of the new search state is equal to the sum of the log probability of the old search state S and log p(A → B_1:f_1 . . . B_K:f_K | X, t). Our implementation then uses this quantity directly as ρ in the above heuristic. The heuristic also has the nice property that when a search state is marked COMPLETE, its heuristic value is equal to the logarithm of the objective, aside from the prior term. Thus, when computing the objective function, such as when checking the termination condition in the branch-and-bound, we only need to compute the prior term. With a sufficiently tight upper bound on the objective, this algorithm ignores a very large number of subproblems whose upper bound is too low. Figure 4 shows the search tree for the branch-and-bound algorithm. By ignoring sets of derivation trees with an upper bound smaller than that of the highest-scoring element in the search queue, the parser can ignore a large number of improbable logical forms and derivations. Thus, with a good upper bound, the parser can run in sublinear time with respect to the size of the theory. The parser resembles a generalized version of the Earley parsing algorithm (Earley 1970).

Generating sentences
In contrast with parsing, given a new logical form x*, natural language generation is the task of finding the unknown sentence y* and derivation tree t*. A straightforward way to do this in our model is to sample t* | t, x*, and simply compute y* = yield(t*). The sampling follows the generative process directly.
However, in many situations, it is desirable to find the sentence y* and derivation t* that maximize p(y*, t* | x*, x, y) (equation 48). As with parsing, we assume that N_samples = 1, so the objective becomes p(y*, t* | x*, t). This is also a discrete optimization problem, albeit simpler than parsing, and we again apply branch-and-bound. As in parsing, each search state represents a set of derivation trees, represented by an incomplete derivation tree, except that the nodes do not have sentence positions, since y* is not known, and there is only a single logical form rather than a set of logical forms, since x* is known. The branch function for generation is shown in algorithm 5. The algorithm is started with a derivation tree set whose incomplete derivation tree consists of a single node labeled S with logical form x*.
Algorithm 5: Pseudocode for branch and expand in the branch-and-bound algorithm for generating the most likely sentence(s), given a logical form, which aims to maximize equation 48.
In branch(S) (algorithm 5), L is an empty list of new search states, and each new derivation tree set S* has counter 1 and an incomplete derivation tree consisting of a root node n with nonterminal A and logical form x, with a child node c_i for each right-hand side symbol B_i whose logical form is f_i(x). The heuristic upper bound for a search state S is simply

log h(S) = Σ_{m ∈ S \ n} log p(r_m | x_m, t),

where the sum is over all nodes in the incomplete derivation tree of S, excluding the root node n. Note that, just as in parsing, the algorithm keeps track of this quantity in each search state as the log probability ρ, and so log h(S) = ρ.

Figure 5
Examples of sentences and logical form labels from GEOQUERY.

Figure 6
Examples of sentences and logical form labels from JOBS.
Just as in parsing, in order to execute line 15, we can use the augmented branch-and-bound in algorithm 4 to return the m-th best derivation tree that maximizes the objective over the set of derivation trees rooted at B_k with logical form f_k(x). Whenever line 17 is first executed for a given nonterminal B_k and logical form f_k(x), we initialize the priority queue Q in algorithm 4 with Q.push(S*, h(S*)), where S* is the search state with an incomplete derivation tree consisting of a single node at the root with nonterminal B_k and logical form f_k(x). The implementation of our inside-outside sampler, branch-and-bound parser, and generator is available at github.com/asaparov/grammar.

Semantic parsing experiments on GEOQUERY and JOBS
To evaluate our parser, we use the GEOQUERY and JOBS datasets (Zelle and Mooney 1996; Tang and Mooney 2000). GEOQUERY contains 880 questions about U.S. geography. Each question is labeled with a logical form in Datalog. The dataset includes a database called GEOBASE which, when each logical form is executed against it, returns the answer to the corresponding question. The JOBS dataset contains 640 questions about computer-related job postings (from the USENET group austin.jobs). Each question is also labeled with a Datalog logical form, similar to the semantic formalism of GEOQUERY. Most questions in the two datasets are interrogative sentences, but there are some imperative sentences. Figures 5 and 6 show some examples from each dataset, respectively. The task is semantic parsing: given each sentence, predict the logical form that represents its meaning.
We created a semantic grammar for the Datalog representation of GEOQUERY and JOBS, specifying the "interior" production rules and implementing the semantic transformation functions and their inverses. We experiment with a simple prior for the logical forms: let x be a Datalog logical form, and let x_{a,i} be the i-th predicate or "function" node in x in prefix order whose smallest variable is a ("smallest" in the sense that A is smaller than B, which is smaller than C, etc.). For example, size(A,B) is a predicate node whose smallest variable is A, and most(B,C,...) is a "function" node whose smallest variable is B. The prior probability of x is given by p(x) ∝ ∏_{a,i} p(x_{a,i} | x_{a,i-1}), where the conditional p(x_{a,i} | x_{a,i-1}) is modeled with an HDP as in section 2.4. This HDP has height 2: the first feature function is the predicate or "function" symbol of the input node (e.g., size or most), and the second feature function is the arity and "order" of the arguments (e.g., size(A) vs. size(A,B) vs. size(B,A)).
We also follow Wong and Mooney (2007), Li, Liu, and Sun (2013), and Zhao and Huang (2015) and experiment with type-checking, where every entity is assigned a type from a type hierarchy (e.g., alaska has type state, state has supertype polity, etc.), and every predicate is assigned a functional type (e.g., population has type polity → int → bool, etc.). We incorporate type-checking into the semantic prior by assigning zero probability to type-incorrect logical forms. More precisely, logical forms are distributed according to the original prior, conditioned on the fact that the logical form is type-correct. Type-checking requires the specification of a type hierarchy; our hierarchy contains 11 types for GEOQUERY and 12 for JOBS. We run experiments with and without type-checking for comparison.
Following Zettlemoyer and Collins (2007), we use the same 600 GEOQUERY sentences for training and an independent test set of 280 sentences. On JOBS, we use the same 500 sentences for training and 140 for testing. We run our parser with two setups: (1) with no domain-specific supervision, and (2) using a small domain-specific lexicon and a set of beliefs (such as the fact that Portland is a city). For each setup, we run the experiments with and without type-checking, for a total of 4 experimental setups. A given output logical form is considered correct if it is semantically equivalent to the true logical form. In these experiments, we did not use a model of morphology in the grammar. We measure the precision and recall of our method, where precision is the number of correct parses divided by the number of sentences for which our parser provided output, and recall is the number of correct parses divided by the total number of sentences in each dataset. Our results are shown alongside many other semantic parsers in table 1. Our method is labeled PWL-LM. The numbers for the baselines were copied from their respective papers, and so their specified lexicons/type hierarchies may differ slightly. All code for these experiments is available at github.com/asaparov/parser.
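For clarity, the metrics in table 1 follow the definitions above; in the sketch below, `is_equivalent` stands in for the semantic equivalence check and is not part of our released code.

```python
def precision_recall_f1(outputs, gold, is_equivalent):
    """outputs: sentence id -> predicted logical form (absent if no parse was produced);
    gold: sentence id -> labeled logical form."""
    correct = sum(1 for s, y in outputs.items() if is_equivalent(y, gold[s]))
    precision = correct / len(outputs) if outputs else 0.0   # correct / parsed
    recall = correct / len(gold)                             # correct / all sentences
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```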
Many sentences in the test set contain tokens previously unseen in the training set. In such cases, the maximum possible recall is 88.2 and 82.3 on GEOQUERY and JOBS, respectively. Therefore, we also measure the effect of adding a domain-specific lexicon, which maps semantic constants like maine to the noun "Maine," for example. This lexicon is analogous to the string-matching and argument identification steps in some other semantic parsers. We constructed the lexicon manually, with an entry for

Table 1
Results of semantic parsing experiments on the GEOQUERY and JOBS datasets (Saparov, Saraswat, and Mitchell 2017). Precision, recall, and F1 scores are shown. The methods in the top portion of the table were evaluated using 10-fold cross validation, whereas those in the bottom portion were evaluated with an independent test set. As a consequence, the methods evaluated using 10-fold cross validation were trained on 792 GEOQUERY examples and tested on 88 examples for each fold (hence the additional supervision label "A" in the table). In contrast, the methods evaluated using an independent test set were trained on 600 GEOQUERY examples and tested on 280 examples. The domain-independent set of interior production rules (labeled "D" in the table) is described in section 3.2. Some of the methods use the preprocessed version of the data from Dong and Lapata (2016), where entity names and numbers in the training and test sets are replaced with typed placeholders. This provides the same additional information as a typed domain-specific lexicon.
Logical form:   answer(A,smallest(A,state(A)))
Test sentence:  "Which state is the smallest?"
Generated:      "What state is the smallest?"

Logical form:   answer(A,largest(B,(state(A),population(A,B))))
Test sentence:  "Which state has the most population?"
Generated:      "What is the state with the largest population?"

Figure 7
Examples of sentences generated from our trained grammar on logical forms in the GEOQUERY test set (Saparov, Saraswat, and Mitchell 2017). Generation is performed by computing arg max_{y*,t*} p(y*, t* | x*, t) as described in section 4.3.
every city, state, river, and mountain in GEOQUERY (141 entries), and an entry for every city, company, position, and platform in JOBS (180 entries). Aside from the lexicon and type hierarchy, the only training information is given by the set of sentences y, the corresponding logical forms x, and the domain-independent set of interior production rules, as described in section 3.2. In our experiments, we found that the sampler converges rapidly, requiring only 10 passes over the data. This is largely due to our restriction of the interior production rules to a domain-independent set, which provides significant information about English syntax.
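A minimal sketch of how such a lexicon can be represented, with each entry acting like an additional preterminal rule; the entries and the preterminal symbol below are illustrative placeholders, not the entries used in our experiments.

```python
# Hypothetical domain-specific lexicon: each semantic constant maps to the
# surface form(s) of a noun that can refer to it.
GEO_LEXICON = {
    "maine": ["Maine"],
    "portland": ["Portland"],
    "mount_mckinley": ["Mount McKinley", "McKinley"],
}

def lexicon_rules(lexicon, preterminal="N"):
    """Yield (preterminal, surface form, semantic constant) triples that can be
    added to the grammar before training, alongside the induced preterminal rules."""
    for constant, forms in lexicon.items():
        for form in forms:
            yield (preterminal, form, constant)
```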
We emphasize that the addition of type-checking and a lexicon is mainly to enable a fair comparison with past approaches. As expected, their addition greatly improves parsing performance. At the time of the publication of our method (Saparov, Saraswat, and Mitchell 2017), we achieved state-of-the-art F1 on the JOBS dataset. However, even without such domain-specific supervision, the parser performs reasonably well and is able to correctly parse sentences with complex and nested semantics. This is a promising indication that the parser will work effectively in the broader NLU system (described in Saparov (2022)). However, we notice that a common error is the incorrect determination of the scope of functions like highest, shortest, etc. This is likely because the semantic prior does not explicitly model the scope of these functions (it assumes a uniform probability over all possible scopes). Thus, a more explicit model of scope might further improve parsing performance. We found that the semantic parsing problem is easier when the logical forms are more similar to the syntactic structure of the corresponding sentences. In the extreme case, the logical forms would be identical to the sentences themselves, in which case parsing would be trivial. Thus, there is an inevitable balancing act in designing a semantic grammar and logical formalism for natural language: on one hand, we want the parsing problem to be as simple as possible; on the other hand, we want the logical forms to be useful for downstream tasks, such as question answering and reasoning, and ideally the logical forms of two distinct sentences with the same meaning should be equivalent. These are important lessons to keep in mind when designing a domain-general semantic grammar and logical formalism.

Related work
Our grammar formalism can be related to synchronous CFGs (SCFGs) (Aho and Ullman 1972), where the semantics and syntax are generated simultaneously. However, instead of modeling the joint probability of the logical form and natural language utterance p(x, y), we model the factorized probability p(x)p(y | x), where the logical form x may have its own complex prior distribution p(x). Modeling each component in isolation provides a cleaner division between syntax and semantics: one half of the model can be modified without affecting the other. This is instrumental in the larger NLU model (described in Saparov (2022)), since in that model the logical form is derived from a larger theory containing background knowledge. We used a CFG in the syntactic portion of our model. Note that, due to the coupling with semantics, our formalism is more powerful than purely syntactic CFGs: the set of string languages generated by grammars in our formalism is strictly larger than that generated by plain CFGs. In fact, any indexed grammar can be converted into a grammar in our formalism, where the stack of indices can be interpreted as the logical form (Aho 1968). Linear indexed grammars (LIGs) are strictly less powerful than indexed grammars, and are weakly equivalent to combinatory categorial grammars (CCGs), head grammars, and tree-adjoining grammars (Vijay-Shanker and Weir 1994), which in turn are strictly more powerful than CFGs. Richer syntactic formalisms such as CCGs (Steedman 1997) or head-driven phrase structure grammars (HPSGs) (Proudian and Pollard 1985) could replace the syntactic component in our framework and may provide a more uniform analysis across languages. Our model is similar to lexical functional grammar (LFG) (Kaplan and Bresnan 1995), with f-structures replaced by logical forms. Nothing in our model precludes incorporating syntactic information like f-structures into the logical form, and as such, LFG is realized in our framework. Including a model of morphology in our grammar furthers the comparison to LFG. Our approach can be used to define new generative models of these grammatical formalisms. We implemented our method with a particular semantic formalism, but the grammatical model is agnostic to the choice of semantic formalism and language. As in some previous parsers, our parsing problem can be related to the problem of finding shortest paths in hypergraphs using A* search (Klein and Manning 2001, 2003; Pauls and Klein 2009; Pauls, Klein, and Quirk 2010; Gallo, Longo, and Pallottino 1993).

Future work
There is significant room for future work and exploration on the topics presented in this manuscript. In this section, we discuss shortcomings of various aspects of our approach and suggest ways to overcome them.
The performance of our parser and generator depends heavily on the production rules of the grammar. Although the preterminal production rules are induced during training, we had to specify the other production rules by hand. While this does give us a great deal of control over the grammar, and enables us to incorporate prior knowledge about the English language into the grammar, it is very time-consuming. It would be valuable to investigate ways in which these production rules can be induced from data. Recall that every production rule in our grammar is annotated with semantic transformation functions. These functions are intimately tied to the semantic formalism and effectively implement a theory of formal semantics. It would also be valuable to explore whether these transformation functions can be learned as well. One promising direction would be to decompose the semantic transformation functions into sequences of elementary "instructions." Each semantic transformation function could then be written equivalently as a short program in a simple programming language. We could then induce the semantic transformation functions by searching over the space of these short programs, perhaps by attempting to add or remove instructions. However, it is not clear how much grammar induction would improve our current grammar for English. Such an approach would certainly help in learning grammars for other languages, about which we have much less knowledge. The statistical efficiency of our approach could greatly aid natural language processing for low-resource languages, for which training data is very scarce.
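As a purely speculative illustration of the "elementary instructions" idea, a semantic transformation function could be represented as a short program over a handful of primitive operations. The instruction set below is hypothetical and not one used in our grammar; it assumes logical form nodes with `symbol` and `args` fields as in the earlier sketches.

```python
# Hypothetical elementary instructions; each returns a function over a node.
def require_symbol(s):
    """Fail (return None) unless the current node has symbol s."""
    return lambda node: node if getattr(node, "symbol", None) == s else None

def select_arg(i):
    """Descend into the i-th argument of the current node."""
    return lambda node: node.args[i]

def identity():
    return lambda node: node

def run_program(instructions, logical_form):
    """Apply a sequence of elementary instructions; None plays the role of 'fail'."""
    node = logical_form
    for instr in instructions:
        if node is None:
            return None
        node = instr(node)
    return node

# A transformation function would then be a short, searchable program, e.g.:
f = [require_symbol("answer"), select_arg(1)]
```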
During parsing, our method uses an upper bound on the objective function (as defined in equations 43, 44, and 45) that takes into account syntactic information. While this works well enough for our purposes, it may be possible to further improve the performance of the parser by defining a tighter upper bound, possibly one that also takes into account semantic information.
Our semantic parsing model assumes that the sentences are noiseless: there are no spelling or grammatical errors in the utterances. This assumption helps to simplify the problem and to focus the scope of the thesis on language understanding and reasoning. But real-world language is noisy, and thus further work to extend the semantic parsing model to noisy settings is warranted. To properly handle grammatical errors, additional "incorrect" production rules must be added to the grammar, such as a rule where the grammatical number of the subject noun and the verb do not agree, or a rule where the subject is dropped entirely (and left to be inferred from context). Grammar induction could be used to learn these "incorrect" production rules. A possible way to handle spelling errors is to add another step to the generative process (as described in section 3.1). This extra step would take the correctly-spelled sentence as its input and introduce errors, such as insertions, deletions, or substitutions of characters. During inference, this process is inverted: given the noisy sentence as input, the parser first infers the correctly-spelled sentence (which is now latent), and then proceeds with the parsing algorithm as described earlier in this chapter.
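A minimal sketch of such an extra noise step is given below; the error rates and alphabet are illustrative placeholders, not estimated parameters of any model.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def add_spelling_noise(sentence, p_ins=0.002, p_del=0.002, p_sub=0.002, rng=random):
    """Take the correctly-spelled sentence and independently corrupt each
    character position with insertions, deletions, and substitutions."""
    out = []
    for ch in sentence:
        r = rng.random()
        if r < p_del:
            continue                              # deletion
        elif r < p_del + p_sub:
            out.append(rng.choice(ALPHABET))      # substitution
        else:
            out.append(ch)
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))      # insertion
    return "".join(out)
```

During inference, this step would be inverted by treating the clean sentence as a latent variable, as described above.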
The syntactic component of our grammatical formalism is a CFG, which is a projective model of grammar. That is, in any derivation tree of a sentence, the leaves of any subtree form a contiguous substring of the sentence. For example, in the sentence "John saw a dog which was a Yorkshire Terrier yesterday," the object noun phrase is "dog which was a Yorkshire Terrier," which appears contiguously in the sentence. However, natural languages exhibit non-projectivity, as in the example "John saw a dog yesterday which was a Yorkshire Terrier," where the object noun phrase is now split by the adverb "yesterday" (McDonald et al. 2005). Techniques such as feature passing can be used to model non-projective phenomena such as syntactic movement in non-transformational models of grammar (Gazdar 1981). In principle, it is also possible to replace the CFG with a non-projective grammar formalism such as a mildly non-projective dependency grammar (Kuhlmann 2013; Bodirsky, Kuhlmann, and Möhl 2005).

Modeling context
The logical forms in our model are assumed to be context-independent. Conditioned on the theory, they are independently and identically distributed. This assumption greatly simplifies the natural language that we need to be able to parse. While it helps to focus the scope of the thesis, it is not representative of real-world language. In real language, the distribution of a sentence is highly dependent on the sentences that precede it, even when conditioned on the theory, which contains all of the background knowledge. For example, this assumption disallows inter-sentential coreference (e.g. pronouns that refer to objects mentioned in other sentences). Our model also assumes that the universe of discourse does not vary, so the sentence "All of the children are asleep" would mean that, literally, every child in the universe is sleeping. The more likely meaning of the sentence is that all of the children within the local area, such as the home or town, are sleeping. The definite article "the" often indicates the uniqueness of an object: "the tallest mountain" indicates that there is exactly one tallest mountain. However, this is not the case in the example "A cat walked into the room. The cat purred." Here, "the cat" does not imply that there is exactly one cat in the universe. Rather, it means that the cat is unique in the context. The universe of discourse can change across sentences (and sometimes even within sentences). Relaxing the assumption that logical forms are context-independent would enable our parser to correctly understand these example sentences. To relax this assumption, our model must be augmented with a model of context. See Saparov (2022) for a more concrete proposal for a model of context.

A. Gibbs sampling for the Dirichlet process
The Chinese restaurant process (CRP) representation of the Dirichlet process enables efficient inference using Markov chain Monte Carlo (MCMC) methods. Suppose that we are given observations y = {y_1, ..., y_n} and we wish to infer the values of the latent variables φ_i and z_i. A Gibbs sampling algorithm can be derived: initial values for φ_i and z_i are selected, and for each iteration t, we sample new values of φ_i^(t) and z_i^(t). A simple initialization is to assign each observation to its own table: φ_i^(0) = y_i and z_i^(0) = i for i = 1, ..., n. Note that the value of φ_i^(t) is deterministic and equal to y_j for any j with z_j^(t) = i. Thus, only z_i^(t) needs to be sampled at each iteration (for all i = 1, ..., n). In Gibbs sampling, each random variable is resampled conditioned on the current values of all other variables, with the conditional distribution of z_i^n given by equations 59 and 60. In the case where m = 0, i.e. z_i^n is sampled to be a "new" table, a new customer will appear in the parent node of n, and its table assignment must be sampled next. This new customer may itself be assigned to a new table, and so this process continues recursively until a customer sits at a non-empty table, or a customer sits at an empty table at the root node 0. The computation required in this recursive sampling procedure overlaps heavily with that of computing the probabilities in equations 59 and 60, so they should be done simultaneously to avoid wasted computation.
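For intuition, the following sketch performs one Gibbs sweep of table assignments in a single restaurant (a flat Dirichlet process). The full HDP additionally seats a new customer in the parent restaurant whenever a new table is opened, which this simplified sketch omits; the `base_prob` function stands in for the base distribution H.

```python
import random
from collections import Counter

def gibbs_sweep(y, z, dish, alpha, base_prob, rng=random):
    """One Gibbs sweep over table assignments z for observations y.
    dish[t] is the observation served at table t, alpha the concentration
    parameter, and base_prob(v) the base distribution's probability of v."""
    for i in rng.sample(range(len(y)), len(y)):          # resample in random order
        counts = Counter(z)
        counts[z[i]] -= 1                                # unseat customer i
        # customer i may only join tables whose dish equals its observation
        options = [t for t, c in counts.items() if c > 0 and dish[t] == y[i]]
        weights = [counts[t] for t in options]           # proportional to table size
        options.append(max(dish) + 1)                    # a brand-new table
        weights.append(alpha * base_prob(y[i]))          # proportional to alpha * H(y_i)
        z[i] = rng.choices(options, weights=weights)[0]
        dish[z[i]] = y[i]                                # the chosen table serves y_i
    return z, dish

# Initialization corresponding to "each observation at its own table":
#   z = list(range(len(y))); dish = {i: y[i] for i in range(len(y))}
```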
In our code, for each iteration of Gibbs sampling, we traverse the tree nodes n in prefix order, and resample z_i^n in random order. In many applications, including in our semantic parsing approach, we need to compute the probability of a new observation y_{n+1}, given its source node x_{n+1} and previous observations (x, y):

    p(y_{n+1} | x_{n+1}, x, y) = ∫ p(y_{n+1} | x_{n+1}, z) p(z | x, y) dz
                               ≈ (1 / N_samples) ∑_{z^(t) ∼ z | x, y} p(y_{n+1} | x_{n+1}, z^(t), φ^(t), ψ^(t)).

The integral is approximated as a sum over posterior samples of z, which can be obtained using the MCMC algorithm described above. However, we find in our experiments that the posterior is concentrated at a single point, and it suffices to keep only the final sample (i.e. N_samples = 1) as a point estimate of z, ψ, φ. In either case, we can compute the quantity within the sum, p(y_{n+1} | x_{n+1}, z, φ, ψ), as in equations 59 and 60 (but since we are not resampling z_i^n, we do not exclude any customers in the n_k^m terms). The above can be extended to the case where, rather than 1 new observation, there are k new observations, and we want to compute their joint probability

    p(y_{n+1}, ..., y_{n+k} | x_{n+1}, ..., x_{n+k}, x, y).    (64)

To compute this, first compute the probability of the first observation y_{n+1} alone. Next, add (x_{n+1}, y_{n+1}) to the HDP (treat them as part of x and y) and compute the probability of y_{n+2} alone. Repeat until all k probabilities are computed, and then return their product. We observe that the joint probability does not factorize over the observations. This is due to the "rich get richer" effect of the Chinese restaurant process: if one observation is sampled, the same observation is more likely to be sampled again in the future, since future customers are more likely to sit at tables with existing customers. So the distribution is not i.i.d. But as the number of customers n becomes very large, the effect of α and of any single customer on the distribution of the next observation becomes negligible, and the distribution becomes more nearly i.i.d.:

    p(y_{n+1}, ..., y_{n+k} | x_{n+1}, ..., x_{n+k}, x, y) ≈ ∏_{i=1}^{k} p(y_{n+i} | x_{n+i}, x, y),    (66)

when n is large.
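The sequential computation of the joint probability in equation 64 can be sketched as follows; the `hdp` object with `predictive_prob` and `add` methods is a hypothetical interface, not the actual implementation.

```python
def joint_probability(new_pairs, hdp):
    """Compute p(y_{n+1},...,y_{n+k} | x_{n+1},...,x_{n+k}, x, y) by the chain rule:
    score each new observation given everything seen so far, then add it to the model.
    new_pairs is a list of (x_new, y_new) pairs in order."""
    total = 1.0
    for x_new, y_new in new_pairs:
        total *= hdp.predictive_prob(x_new, y_new)   # p(y_new | x_new, data so far)
        hdp.add(x_new, y_new)                        # treat it as observed from now on
    return total
```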

B.1 Learning the concentration parameter α
We learn the concentration parameter from the data by placing a Gamma prior on α: α_n ∼ Gamma(a, b). An auxiliary variable sampling method can be used to infer α, which is described in appendix A of Teh et al. (2006) and section 6 of Escobar and West (1995). The Gibbs sampling step for α_n is:

    w_n ∼ Beta(α_n + 1, n_n),    (69)
    α_n ∼ Gamma(a + max{z_i^n} − s_n, b − log w_n),    (70)

where s_n is an auxiliary Bernoulli variable and max{z_i^n} is the number of occupied tables in restaurant n. The above updates assume that each node n has an independent α_n. However, in many scenarios, we wish to tie the concentration parameters together to improve statistical efficiency. Suppose we constrain all the concentration parameters at each level in the hierarchy to be equal. Let L(n) be the level of the node n (i.e. L(0) = 0 and L(n) = L(parent(n)) + 1). Let α_i be the concentration parameter at level i, so that α_n = α_{L(n)}. Let its prior be α_i ∼ Gamma(a_i, b_i). Then, the Gibbs sampling step for α_i is:

    α_i ∼ Gamma( a_i + ∑_{n : L(n) = i} (max{z_i^n} − s_n),  b_i − ∑_{n : L(n) = i} log w_n ).    (71)

This is the approach we implement in our semantic parsing model during training. Our semantic parsing model has several HDP hierarchies, and each has its own set of hyperparameters a and b. As an example, for one such hierarchy in our semantic parsing model (corresponding to the nonterminal VP_R), the hyperparameters are a_1 = 100, a_2 = 10, b_1 = 0.1, b_2 = 1; the other hierarchies have similar values for their hyperparameters.
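A sketch of the auxiliary-variable update for a single restaurant's concentration parameter, in the style of Teh et al. (2006) and Escobar and West (1995); the exact parameterization in our implementation may differ.

```python
import math
import random

def resample_alpha(alpha, num_tables, num_customers, a, b, rng=random):
    """One auxiliary-variable Gibbs update for alpha with prior Gamma(a, b).
    num_tables = max{z_i} (occupied tables), num_customers = n (>= 1)."""
    w = rng.betavariate(alpha + 1.0, num_customers)                 # w ~ Beta(alpha + 1, n)
    s = 1 if rng.random() < num_customers / (num_customers + alpha) else 0
    shape = a + num_tables - s
    rate = b - math.log(w)
    return rng.gammavariate(shape, 1.0 / rate)                      # Gamma(shape, rate)
```

For the tied case, the updates for all restaurants at a level are pooled before drawing a single Gamma sample, as in equation 71.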

1   function branch(derivation tree set S)
2       L is an empty list
3       n is the root of the incomplete derivation tree of S
4       X is the logical form set at n
5       i is the start sentence position of n
6       j is the end sentence position of n
7       if n has no child nodes
8           A is the nonterminal symbol of n
9           return expand(A, i, j, X)    /* see algorithm 3 */
10      else if n has a nonterminal child node with no children
11          A → B_1:f_1 ... B_K:f_K is the production rule at n
12          c_k is the first nonterminal child node of n with no children
13          B_k is the nonterminal symbol of c_k
14          i_k is the start sentence position of c_k
15          j_k is the end sentence position of c_k
16          m is the counter of S
17          S_m is the m-th most probable set of derivation trees with root nonterminal B_k,
                start position i_k, end position j_k, whose logical forms are a subset of
                f_k(X) := {f_k(x) ≠ fail : x ∈ X}, according to equation 41
18          if S_m exists
19              ...

24          L.add( a new derivation tree set identical to S except its counter is m + 1 )
25      else
26          A → B_1:f_1 ... B_K:f_K is the production rule at n
27          m is the counter of S
28          X_m is the m-th most likely set of logical forms according to
                p(A → B_1:f_1 ... B_K:f_K | x ∈ X, t)
29          if X_m exists
30              L.add( a new derivation tree set identical to S except its logical form set is X_m,
                       and is marked as COMPLETE )
31          L.add( a new derivation tree set identical to S except its counter is m + 1 )
32      return L

8       else if n has a nonterminal child node with no children
9           A → B_1:f_1 ... B_K:f_K is the production rule at n
10          c_k is the first nonterminal child node of n with no children
11          B_k is the nonterminal symbol of c_k
12          m is the counter of S
13          ρ is the log probability of S
14          if f_k(x) fails, return ∅
15          S_m is the m-th most probable derivation tree with root nonterminal B_k and
                logical form f_k(x), according to equation 48
16          if S_m exists
17              S* = S ∩ S_m is a new derivation tree set with counter 1, whose incomplete
                    derivation tree is identical to that of S except c_k is substituted with the
                    incomplete derivation tree of S_m, and whose log probability is the sum of ρ
                    and the log probability of S_m
18              L.add(S*)
19          L.add( a new derivation tree set identical to S except its counter is m + 1 )
20      else
21          A → B_1:f_1 ... B_K:f_K is the production rule at n
22          ρ is the log probability of S
23          L.add( a new derivation tree set identical to S except its log probability is the sum
                   of ρ and log p(A → B_1:f_1 ... B_K:f_K | x, t), and is marked COMPLETE )
24      return L
25  function expand(nonterminal A, logical form x)
26      L is an empty list
27      if A is a preterminal
28          for rules A → w do
29              S* is a new derivation tree set where the incomplete derivation tree consists of a
                    root node n with nonterminal A, logical form x, and child node w
30              L.add(S*)
31      else
32          for rules A → B_1:f_1 ... B_K:f_K do
33              ...