Learning Structured Natural Language Representations for Semantic Parsing

We introduce a neural semantic parser which is interpretable and scalable. Our model converts natural language utterances to intermediate, domain-general natural language representations in the form of predicate-argument structures, which are induced with a transition system and subsequently mapped to target domains. The semantic parser is trained end-to-end using annotated logical forms or their denotations. We achieve the state of the art on SPADES and GRAPHQUESTIONS and obtain competitive results on GEOQUERY and WEBQUESTIONS. The induced predicate-argument structures shed light on the types of representations useful for semantic parsing and how these are different from linguistically motivated ones.


Introduction
Semantic parsing is the task of mapping natural language utterances to machine interpretable meaning representations.
Despite differences in the choice of meaning representation and model structure, most existing work conceptualizes semantic parsing following two main approaches.
Under the second approach, the utterance is first parsed to an intermediate task-independent representation tied to a syntactic parser and then mapped to a grounded representation (Kwiatkowski et al., 2013; Reddy et al., 2014; Krishnamurthy and Mitchell, 2015; Gardner and Krishnamurthy, 2017).
1 Our code will be available at https://github.com/cheng6076/scanner.
A merit of the two-stage approach is that it creates reusable intermediate interpretations, which potentially enables the handling of unseen words and knowledge transfer across domains (Bender et al., 2015).
The successful application of encoder-decoder models (Bahdanau et al., 2015; Sutskever et al., 2014) to a variety of NLP tasks has provided strong impetus to treat semantic parsing as a sequence transduction problem where an utterance is mapped to a target meaning representation in string format (Dong and Lapata, 2016; Jia and Liang, 2016; Kočiský et al., 2016). Such models still fall under the first approach; however, in contrast to previous work (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011), they reduce the need for domain-specific assumptions, grammar learning, and more generally extensive feature engineering. But this modeling flexibility comes at a cost, since it is no longer possible to interpret how meaning composition is performed. Such knowledge plays a critical role in understanding modeling limitations so as to build better semantic parsers. Moreover, without any task-specific prior knowledge, the learning problem is fairly unconstrained, both in terms of the possible derivations to consider and in terms of the target output, which can be ill-formed (e.g., with extra or missing brackets).
In this work, we propose a neural semantic parser that alleviates the aforementioned problems. Our model falls under the second class of approaches, where utterances are first mapped to an intermediate representation containing natural language predicates. However, rather than using an external parser (Reddy et al., 2014) or manually specified CCG grammars (Kwiatkowski et al., 2013), we induce intermediate representations in the form of predicate-argument structures from data. This is achieved with a transition-based approach which by design yields recursive semantic structures, avoiding the problem of generating ill-formed meaning representations. Compared to existing chart-based semantic parsers (Krishnamurthy and Mitchell, 2012; Cai and Yates, 2013; Berant et al., 2013; Berant and Liang, 2014), the transition-based approach does not require feature decomposition over structures and thereby enables the exploration of rich, non-local features. The output of the transition system is then grounded (e.g., to a knowledge base) with a neural mapping model under the assumption that grounded and ungrounded structures are isomorphic. As a result, we obtain a neural network that jointly learns to parse natural language semantics and induce a lexicon that helps grounding.
The whole network is trained end-to-end on natural language utterances paired with annotated logical forms or their denotations. We conduct experiments on four datasets, including GEOQUERY (which has logical forms; Zelle and Mooney 1996), SPADES (Bisk et al., 2016), WEBQUESTIONS (Berant et al., 2013), and GRAPHQUESTIONS (Su et al., 2016) (which have denotations). Our semantic parser achieves the state of the art on SPADES and GRAPHQUESTIONS, while obtaining competitive results on GEOQUERY and WEBQUESTIONS. A side product of our modeling framework is that the induced intermediate representations can contribute to rationalizing neural predictions (Lei et al., 2016). Specifically, they can shed light on the kinds of representations (especially predicates) useful for semantic parsing. Evaluation of the induced predicate-argument relations against syntax-based ones reveals that they are interpretable and meaningful compared to heuristic baselines, but they sometimes deviate from linguistic conventions.

Preliminaries
Problem Formulation
Let K denote a knowledge base or more generally a reasoning system, and x an utterance paired with a grounded meaning representation G or its denotation y. Our problem is to learn a semantic parser that maps x to G via an intermediate ungrounded representation U. When G is executed against K, it outputs denotation y.

Grounded Meaning Representation
We represent grounded meaning representations in FunQL (Kate et al., 2005) amongst many other alternatives such as lambda calculus (Zettlemoyer and Collins, 2005), λ-DCS (Liang, 2013) or graph queries (Holzschuher and Peinl, 2013; Harris et al., 2013). FunQL is a variable-free query language, where each predicate is treated as a function symbol that modifies an argument list. For example, the FunQL representation for the utterance which states do not border texas is: answer(exclude(state(all), next_to(texas))) where next_to is a domain-specific binary predicate that takes one argument (i.e., the entity texas) and returns a set of entities (e.g., the states bordering Texas) as its denotation. all is a special predicate that returns a collection of entities. exclude is a predicate that returns the difference between two input sets.
An advantage of FunQL is that the resulting s-expression encodes semantic compositionality and the derivation of the logical forms. This property makes FunQL logical forms natural to generate with recurrent neural networks (Vinyals et al., 2015; Choe and Charniak, 2016). However, FunQL is less expressive than lambda calculus, partially due to the elimination of variables. A more compact logical formulation which our method also applies to is λ-DCS (Liang, 2013). In the absence of anaphora and composite binary predicates, conversion algorithms exist between FunQL and λ-DCS. However, we leave this to future work.
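To make the nesting of FunQL concrete, the following minimal sketch parses the example logical form into a tree and executes it against a toy knowledge base. It is only an illustration: the function names `parse_funql` and `execute`, the tuple-based tree encoding, and the handful of states and border facts are all assumptions introduced here, not part of the system described in this paper.

```python
import re

# Toy knowledge base: these entities and border facts are illustrative only.
STATES = {"texas", "oklahoma", "new_mexico", "louisiana", "arkansas", "ohio"}
BORDERS = {"texas": {"oklahoma", "new_mexico", "louisiana", "arkansas"}}


def parse_funql(text):
    """Parse a FunQL string into nested (name, [arguments]) tuples."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z_0-9]*|[(),]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def parse_term():
        nonlocal pos
        name = peek()
        if name is None or name in {"(", ")", ","}:
            raise ValueError("ill-formed FunQL: expected a predicate or entity")
        pos += 1
        args = []
        if peek() == "(":
            pos += 1  # consume the open bracket
            while True:
                args.append(parse_term())
                separator = peek()
                pos += 1
                if separator == ")":
                    break
                if separator != ",":
                    raise ValueError("ill-formed FunQL: missing bracket or comma")
        return (name, args)

    tree = parse_term()
    if pos != len(tokens):
        raise ValueError("ill-formed FunQL: trailing tokens")
    return tree


def execute(tree):
    """Evaluate a parsed FunQL tree bottom-up; every term denotes a set."""
    name, args = tree
    values = [execute(arg) for arg in args]
    if name == "answer":
        return values[0]
    if name == "exclude":
        return values[0] - values[1]  # set difference
    if name == "state":
        return values[0] & STATES
    if name == "next_to":
        return set().union(*(BORDERS.get(entity, set()) for entity in values[0]))
    if name == "all":
        return set(STATES)  # in this toy KB every entity is a state
    return {name}  # any other terminal is treated as an entity


tree = parse_funql("answer(exclude(state(all), next_to(texas)))")
print(sorted(execute(tree)))  # ['ohio', 'texas']
```

Because each predicate is a function over its argument list, a single recursive descent is enough both to recover the derivation and to reject ill-formed strings such as one with a missing bracket.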

Ungrounded Meaning Representation
We also use FunQL to express ungrounded meaning representations. The latter consist primarily of natural language predicates and domain-general predicates. Assuming for simplicity that domain-general predicates share the same vocabulary in ungrounded and grounded representations, the ungrounded representation for the example utterance is: answer(exclude(states(all), border(texas))) where states and border are natural language predicates. In this work we consider five types of domain-general predicates illustrated in Table 1. Notice that domain-general predicates are often implicit, or represent extra-sentential knowledge. For example, the predicate all in the above utterance represents all states in the domain which are not mentioned in the utterance but are critical for working out the utterance denotation. Finally, note that for certain domain-general predicates, it also makes sense to extract natural language rationales (e.g., not is indicative for exclude). But we do not find this helpful in experiments.
In this work we constrain ungrounded representations to be structurally isomorphic to grounded ones. In order to derive the target logical forms, all we have to do is replace predicates in the ungrounded representations with symbols in the knowledge base. 3
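Under the isomorphism constraint, grounding amounts to relabeling the nodes of a fixed tree. The sketch below illustrates this with a hand-written dictionary standing in for the lexicon; in the model the mapping is learned and probabilistic, so the fixed `lexicon`, the tuple encoding of trees, and the helper names `ground` and `to_string` are assumptions made purely for illustration.

```python
def ground(tree, lexicon):
    """Relabel every node of an ungrounded tree; the structure is unchanged."""
    name, args = tree
    return (lexicon.get(name, name), [ground(arg, lexicon) for arg in args])


def to_string(tree):
    """Render a (name, [arguments]) tree back into FunQL notation."""
    name, args = tree
    if not args:
        return name
    return f"{name}({', '.join(to_string(arg) for arg in args)})"


# Ungrounded representation of "which states do not border texas".
ungrounded = ("answer", [("exclude", [("states", [("all", [])]),
                                      ("border", [("texas", [])])])])

# Hypothetical lexicon; domain-general predicates map to themselves.
lexicon = {"states": "state", "border": "next_to"}

grounded = ground(ungrounded, lexicon)
print(to_string(grounded))  # answer(exclude(state(all), next_to(texas)))
```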

Modeling
In this section, we discuss our neural model which maps utterances to target logical forms. The semantic parsing task is decomposed into two stages: we first explain how an utterance is converted to an intermediate representation (Section 3.1), and then describe how it is grounded to a knowledge base (Section 3.2).

Generating Ungrounded Representations
At this stage, utterances are mapped to intermediate representations with a transition-based algorithm. In general, the transition system generates the representation by following a derivation tree (which contains a set of applied rules) and some canonical generation order (e.g., pre-order). For FunQL, a simple solution exists since the representation itself encodes the derivation. Consider again answer(exclude(states(all), border(texas))) which is tree structured. Each predicate (e.g., border) can be visualized as a non-terminal node of the tree and each entity (e.g., texas) as a terminal. The predicate all is a special case which acts as a terminal directly. We can generate the tree top-down with a transition system reminiscent of recurrent neural network grammars (RNNGs). Similar to RNNG, our algorithm uses a buffer to store input tokens in the utterance and a stack to store partially completed trees. A major difference in our semantic parsing scenario is that tokens in the buffer are not fetched in a sequential order or removed from the buffer. This is because the lexical alignment between an utterance and its semantic representation is hidden. Moreover, some domain-general predicates cannot be clearly anchored to a token span. Therefore, we allow the generation algorithm to pick tokens and combine logical forms in arbitrary orders, conditioning on the entire set of sentential features. Alternative solutions in the traditional semantic parsing literature include a floating chart parser (Pasupat and Liang, 2015), which allows logical predicates to be constructed out of thin air.
3 As a more general definition, we consider two semantic graphs isomorphic if the graph structures governed by domain-general predicates, ignoring local structures containing only natural language predicates, are the same (Section 5).
Our transition system defines three actions, namely NT, TER, and RED, explained below.
NT(X) generates a Non-Terminal predicate. This predicate is either a natural language expression such as border, or one of the domain-general predicates exemplified in Table 1 (e.g., exclude). The type of predicate is determined by the placeholder X and once generated, it is pushed onto the stack and represented as a non-terminal followed by an open bracket (e.g., 'border ('). The open bracket will be closed by a reduce operation.
TER(X) generates a TERminal entity or the special predicate all. Note that the terminal choice does not include variables (e.g., $0, $1), since FunQL is a variable-free language which sufficiently captures the semantics of the datasets we work with. The framework could be extended to generate directed acyclic graphs by incorporating variables with additional transition actions for handling variable mentions and co-reference.
RED stands for REDuce and is used for subtree completion. It recursively pops elements from the stack until an open non-terminal node is encountered. The non-terminal is popped as well, after which a composite term representing the entire  subtree, e.g., border(texas), is pushed back to the stack. If a RED action results in having no more open non-terminals left on the stack, the transition system terminates. Table 2 shows the transition actions used to generate our running example.
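The three actions can be replayed with an ordinary list used as a stack. The sketch below is a symbolic illustration only: it takes an already chosen sequence of actions and terms as input, whereas the model predicts them with neural networks. The function name `run_transitions`, the "open"/"closed" tags, and the particular action sequence (one that derives the running example under the rules above) are assumptions introduced for this example.

```python
def run_transitions(actions):
    """Replay (action, term) pairs and return the completed FunQL string."""
    stack = []
    for action, term in actions:
        if action == "NT":
            # Open non-terminal, conceptually "term (" awaiting its arguments.
            stack.append(("open", term))
        elif action == "TER":
            # Terminal entity (or the special predicate all).
            stack.append(("closed", term))
        elif action == "RED":
            # Pop completed items until the nearest open non-terminal.
            children = []
            while stack and stack[-1][0] == "closed":
                children.append(stack.pop()[1])
            if not stack:
                raise ValueError("RED without an open non-terminal")
            _, predicate = stack.pop()
            children.reverse()  # restore left-to-right argument order
            stack.append(("closed", f"{predicate}({', '.join(children)})"))
            if not any(kind == "open" for kind, _ in stack):
                break  # no open non-terminals left: the system terminates
        else:
            raise ValueError(f"unknown action: {action}")
    if len(stack) != 1 or stack[0][0] != "closed":
        raise ValueError("incomplete derivation")
    return stack[0][1]


actions = [
    ("NT", "answer"), ("NT", "exclude"), ("NT", "states"), ("TER", "all"),
    ("RED", None), ("NT", "border"), ("TER", "texas"),
    ("RED", None), ("RED", None), ("RED", None),
]
print(run_transitions(actions))  # answer(exclude(states(all), border(texas)))
```

Since every open bracket is introduced by NT and closed by RED, any sequence that terminates yields a well-formed expression, which is the property that rules out outputs with extra or missing brackets.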
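The model generates the ungrounded representation U conditioned on utterance x by recursively calling one of the above three actions. Note that U is defined by a sequence of actions (denoted by a) and a sequence of term choices (denoted by u) as shown in Table 2. The conditional probability p(U|x) is factorized over time steps as:

$$p(U \mid x) = \prod_{t=1}^{n} p(a_t, u_t \mid a_{<t}, x) = \prod_{t=1}^{n} p(a_t \mid a_{<t}, x)\, p(u_t \mid a_{<t}, x)^{I(a_t \neq \mathrm{RED})}$$

where I is an indicator function, so that a term probability contributes only at steps where a term is actually chosen (NT or TER).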
To predict the actions of the transition system, we encode the input buffer with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) and the output stack with a stack-LSTM (Dyer et al., 2015). At each time step, the model predicts an action from the representation of the transition system e_t, which is the concatenation of the buffer representation b_t and the stack representation s_t. While the stack representation s_t is easy to retrieve as the top state of the stack-LSTM, obtaining the buffer representation b_t is more involved. This is because we do not have an explicit buffer representation due to the non-projectivity of semantic parsing. We therefore compute at each time step an adaptively weighted representation b_t (Bahdanau et al., 2015) conditioned on the stack representation s_t. This buffer representation is then concatenated with the stack representation to form the system representation e_t.

When the predicted action is either NT or TER, an ungrounded term u_t (either a predicate or an entity) needs to be chosen from the candidate list depending on the specific placeholder X. To select a domain-general term, we use the same representation of the transition system e_t to compute a probability distribution over candidate terms. To choose a natural language term, we directly compute a probability distribution over all natural language terms (in the buffer) conditioned on the stack representation s_t and select the most relevant term (Jia and Liang, 2016; Gu et al., 2016).

When the predicted action is RED, the completed subtree is composed into a single representation on the stack. For the choice of composition function, we use a single-layer neural network as in Dyer et al. (2015), which takes as input the concatenated representation of the predicate and arguments of the subtree.
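The way the buffer representation depends on the stack can be illustrated with a few lines of plain Python. This is a hedged sketch rather than the model's exact parameterization: the dot-product scoring function, the action weight matrix `W_a`, and the function names are assumptions, and the vectors are hand-written stand-ins for LSTM states.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def buffer_representation(buffer_states, stack_state):
    """Attention-weighted sum of buffer states, conditioned on the stack state.

    Assumption: relevance is scored with a dot product; other scoring
    functions (e.g., a small feed-forward network) would fit equally well.
    """
    weights = softmax([dot(h, stack_state) for h in buffer_states])
    dim = len(buffer_states[0])
    b_t = [sum(w * h[i] for w, h in zip(weights, buffer_states)) for i in range(dim)]
    return b_t, weights


def action_distribution(b_t, s_t, W_a):
    """Distribution over actions from e_t, the concatenation of b_t and s_t."""
    e_t = b_t + s_t  # list concatenation
    return softmax([dot(row, e_t) for row in W_a])


buffer_states = [[1.0, 0.0], [0.0, 1.0]]  # one vector per utterance token
s_t = [2.0, 0.0]                          # top state of the stack encoder
b_t, weights = buffer_representation(buffer_states, s_t)
W_a = [[0.1] * 4, [0.2] * 4, [0.3] * 4]   # hypothetical rows for NT, TER, RED
probs = action_distribution(b_t, s_t, W_a)
```

The attention weights computed here also give a distribution over buffer tokens, which is the kind of quantity needed when a natural language term has to be selected from the utterance.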

Generating Grounded Representations
Since we constrain the network to learn ungrounded structures that are isomorphic to the target meaning representation, converting ungrounded representations to grounded ones becomes a simple lexical mapping problem. For simplicity, hereafter we do not differentiate natural language and domain-general predicates.
To map an ungrounded term u_t to a grounded term g_t, we compute the conditional probability of g_t given u_t with a bi-linear neural network:

$$p(g_t \mid u_t) \propto \exp\left(\vec{u}_t^{\top} W_{ug}\, \vec{g}_t\right)$$

where $\vec{u}_t$ is the contextual representation of the ungrounded term given by the bidirectional LSTM, $\vec{g}_t$ is the grounded term embedding, and $W_{ug}$ is the weight matrix.
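A bi-linear scorer of this kind can be sketched without any library support. In the example below the vectors, the identity weight matrix, and the two candidate predicates are toy values chosen for illustration, and the function names are hypothetical; only the scoring form (an ungrounded vector, a weight matrix, and a grounded embedding, normalized over candidates) follows the description above.

```python
import math


def bilinear_scores(u, W, grounded_embeddings):
    """Score each grounded candidate g as u^T W g."""
    # First compute the row vector u^T W, then take dot products with each g.
    uW = [sum(u[i] * W[i][j] for i in range(len(u))) for j in range(len(W[0]))]
    return {name: sum(a * b for a, b in zip(uW, g))
            for name, g in grounded_embeddings.items()}


def grounding_distribution(u, W, grounded_embeddings):
    """Normalize the bilinear scores into p(g | u) with a softmax."""
    scores = bilinear_scores(u, W, grounded_embeddings)
    peak = max(scores.values())
    exps = {name: math.exp(score - peak) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


u_border = [1.0, 0.0]                # toy contextual vector for "border"
W_ug = [[1.0, 0.0], [0.0, 1.0]]      # toy weight matrix (identity)
candidates = {"next_to": [2.0, 0.0], "capital": [0.0, 2.0]}  # toy embeddings

distribution = grounding_distribution(u_border, W_ug, candidates)
best = max(distribution, key=distribution.get)
print(best)  # next_to
```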
The above grounding step can be interpreted as learning a lexicon: the model exclusively relies on the intermediate representation U to predict the target meaning representation G without taking into account any additional features based on the utterance. In practice, U may provide sufficient contextual background for closed domain semantic parsing where an ungrounded predicate often maps to a single grounded predicate, but is a relatively impoverished representation for parsing large open-domain knowledge bases like Freebase. In this case, we additionally rely on a discriminative reranker which ranks the grounded representations derived from ungrounded representations (see Section 3.4).

Training Objective
When the target meaning representation is available, we directly compare it against our predictions and back-propagate. When only denotations are available, we compare surrogate meaning representations against our predictions (Reddy et al., 2014). Surrogate representations are those with the correct denotations, filtered with rules (see Section 4). When there exist multiple surrogate representations, we select one randomly and back-propagate.
Consider utterance x with ungrounded meaning representation U, and grounded meaning representation G. Both U and G are defined with a sequence of transition actions (same for U and G) and a sequence of terms (different for U and G). Recall that a = [a_1, ..., a_n] denotes the transition action sequence defining U and G; let u = [u_1, ..., u_k] denote the ungrounded terms (e.g., predicates), and g = [g_1, ..., g_k] the grounded terms. We aim to maximize the likelihood of the grounded meaning representation p(G|x) over all training examples. This likelihood can be decomposed into the likelihood of the grounded action sequence p(a|x) and the grounded term sequence p(g|x), which we optimize separately.
For the grounded action sequence (which by design is the same as the ungrounded action sequence and therefore the output of the transition system), we can directly maximize the log likelihood log p(a|x) for all examples:

$$L_a = \sum_{x \in T} \log p(a \mid x)$$

where T denotes examples in the training data.
For the grounded term sequence g, since the intermediate ungrounded terms are latent, we maximize the expected log likelihood of the grounded terms, $\sum_u [p(u \mid x) \log p(g \mid u, x)]$, for all examples, which is a lower bound of the log likelihood log p(g|x) by Jensen's Inequality:

$$L_g = \sum_{x \in T} \sum_{u} p(u \mid x) \log p(g \mid u, x) \le \sum_{x \in T} \log p(g \mid x)$$

The final objective is the combination of L_a and L_g, denoted as L_G = L_a + L_g. We optimize this objective with the method described in Lei et al. (2016).
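The Jensen bound on the term likelihood is easy to verify numerically for a single example with a small, enumerable set of latent choices. The probabilities below are made-up values for three hypothetical ungrounded candidates; the sketch only checks that the expected log likelihood never exceeds the exact marginal log likelihood.

```python
import math


def expected_log_likelihood(p_u, p_g_given_u):
    """Sum over u of p(u|x) * log p(g|u, x): the quantity being maximized."""
    return sum(p * math.log(q) for p, q in zip(p_u, p_g_given_u) if p > 0)


def marginal_log_likelihood(p_u, p_g_given_u):
    """log of the sum over u of p(u|x) * p(g|u, x): the exact log p(g|x)."""
    return math.log(sum(p * q for p, q in zip(p_u, p_g_given_u)))


# Made-up distributions over three candidate ungrounded term sequences.
p_u = [0.7, 0.2, 0.1]          # p(u | x), sums to one
p_g_given_u = [0.9, 0.5, 0.1]  # p(g | u, x) for the gold grounded terms

lower_bound = expected_log_likelihood(p_u, p_g_given_u)
exact = marginal_log_likelihood(p_u, p_g_given_u)
assert lower_bound <= exact  # Jensen's inequality for the concave logarithm
```

The gap closes when p(g|u, x) is the same for every u with non-zero probability, for instance when p(u|x) concentrates on a single ungrounded sequence.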

Reranker
As discussed above, for open domain semantic parsing, solely relying on the ungrounded representation would result in an impoverished model lacking sentential context useful for disambiguation decisions. For all Freebase experiments, we followed previous work (Berant et al., 2013;Berant and Liang, 2014;Reddy et al., 2014) in additionally training a discriminative ranker to rerank grounded representations globally.
The discriminative ranker is a maximum-entropy model (Berant et al., 2013). The objective is to maximize the log likelihood of the correct answer y given x by summing over all grounded candidates G with denotation y (i.e., [[G]]_K = y):

$$\log p(y \mid x) = \log \sum_{G:\, [[G]]_K = y} p(G \mid x)$$

where p(G|x) is given by the maximum-entropy model and f(G, x) is a feature function that maps pair (G, x) into a feature vector. We give details on the features we used in Section 4.2.
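The reranking objective can be sketched as a log-linear model over a candidate list. The weight vector `weights`, the two-dimensional feature vectors, and the candidate denotations below are hypothetical; the sketch assumes candidate probabilities are proportional to the exponentiated weighted feature sum, which is the usual form of a maximum-entropy model.

```python
import math


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a non-empty list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def answer_log_likelihood(candidates, weights, gold_denotation):
    """log p(y|x): probability mass of all candidates whose denotation is y."""
    scores = [sum(w * f for w, f in zip(weights, c["features"])) for c in candidates]
    correct = [s for s, c in zip(scores, candidates)
               if c["denotation"] == gold_denotation]
    if not correct:
        return float("-inf")  # no candidate reaches the gold answer
    return log_sum_exp(correct) - log_sum_exp(scores)


# Hypothetical grounded candidates with toy feature vectors f(G, x).
candidates = [
    {"features": [1.0, 0.0], "denotation": "austin"},
    {"features": [0.0, 1.0], "denotation": "austin"},
    {"features": [0.5, 0.5], "denotation": "dallas"},
]
weights = [1.0, 1.0]  # hypothetical model parameters

ll = answer_log_likelihood(candidates, weights, "austin")
```

Summing over every candidate with the right denotation, rather than picking one, lets the ranker be trained from answers alone when several grounded representations lead to the same result.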

Experiments
In this section, we verify empirically that our semantic parser derives useful meaning representations. We give details on the evaluation datasets and baselines used for comparison. We also describe implementation details and the features used in the discriminative ranker.

Datasets
We evaluated our model on the following datasets which cover different domains, and use different types of training data, i.e., pairs of natural language utterances and grounded meanings or question-answer pairs.

GEOQUERY (Zelle and Mooney, 1996) contains 880 questions and database queries about US geography. The utterances are compositional, but the language is simple and vocabulary size small. The majority of questions include at most one entity.

SPADES (Bisk et al., 2016) contains 93,319 questions derived from CLUEWEB09 (Gabrilovich et al., 2013) sentences. Specifically, the questions were created by randomly removing an entity, thus producing sentence-denotation pairs (Reddy et al., 2014). The sentences include two or more entities and although they are not very compositional, they constitute a large-scale dataset for neural network training.

WEBQUESTIONS (Berant et al., 2013) contains 5,810 question-answer pairs. Similar to SPADES, it is based on Freebase and the questions are not very compositional. However, they are real questions asked by people on the Web.

Finally, GRAPHQUESTIONS (Su et al., 2016) contains 5,166 question-answer pairs which were created by showing 500 Freebase graph queries to Amazon Mechanical Turk workers and asking them to paraphrase them into natural language.

Implementation Details
Amongst the four datasets described above, GEOQUERY has annotated logical forms which we directly use for training. For the other three datasets, we treat surrogate meaning representations which lead to the correct answer as gold standard. The surrogates were selected from a subset of candidate Freebase graphs, which were obtained by entity linking. Entity mentions in SPADES have been automatically annotated with Freebase entities (Gabrilovich et al., 2013). For WEBQUESTIONS and GRAPHQUESTIONS, we follow a previously described procedure: we identify potential entity spans using seven handcrafted part-of-speech patterns and associate them with Freebase entities obtained from the Freebase/KG API. We use a structured perceptron trained on the entities found in WEBQUESTIONS and GRAPHQUESTIONS to select the top 10 non-overlapping entity disambiguation possibilities. We treat each possibility as a candidate input utterance, and use the perceptron score as a feature in the discriminative reranker, thus leaving the final disambiguation to the semantic parser.
Apart from the entity score, the discriminative ranker uses the following basic features. The first feature is the likelihood score of a grounded representation aggregating all intermediate representations. The second set of features includes the embedding similarity between the relation and the utterance, as well as the similarity between the relation and the question words. The last set of features includes the answer type as indicated by the last word in the Freebase relation (Xu et al., 2016).
We used the Adam optimizer for training with an initial learning rate of 0.001, two momentum parameters [0.99, 0.999], and batch size 1. The dimensions of the word embeddings, LSTM states, entity embeddings and relation embeddings are 50, 100, 100, and 100, respectively. The word embeddings were initialized with GloVe embeddings (Pennington et al., 2014). All other embeddings were randomly initialized.
in the literature. GEOQUERY results are shown in Table 5. The first block contains symbolic systems, whereas neural models are presented in the second block. We report accuracy, which is defined as the proportion of utterances that are correctly parsed to their gold standard logical forms. All previous neural systems (Dong and Lapata, 2016; Jia and Liang, 2016) treat semantic parsing as a sequence transduction problem and use LSTMs to directly map utterances to logical forms. SCANNER yields performance improvements over these systems when using comparable data sources for training. Jia and Liang (2016) achieve better results with synthetic data that expands GEOQUERY; we could adopt their approach to improve model performance, but we leave this to future work.

Table 6 reports SCANNER's performance on SPADES. For all Freebase related datasets we use average F1 (Berant et al., 2013) as our evaluation metric. Previous work on this dataset has used a semantic parsing framework similar to ours where natural language is converted to an intermediate syntactic representation and then grounded to Freebase. Specifically, Bisk et al. (2016) evaluate the effectiveness of four different CCG parsers on the semantic parsing task when varying the amount of supervision required. As can be seen, SCANNER outperforms all CCG variants (from unsupervised to fully supervised) without having access to any manually annotated derivations or lexicons.

Table 5: GEOQUERY results.
Models                                   Accuracy
Zettlemoyer and Collins (2005)           79.3
Zettlemoyer and Collins (2007)           86.1
Kwiatkowski et al. (2010)                87.9
Kwiatkowski et al. (2011)                88.6
Kwiatkowski et al. (2013)                88.0
Zhao and Huang (2015)                    88.9
Liang et al. (2011)                      91.1

Dong and Lapata (2016)                   84.6
Jia and Liang (2016)                     85.0
Jia and Liang (2016) with extra data     89.1
SCANNER                                  86.7

Table 6: SPADES results.
Models                                   F1
(Bisk et al., 2016)                      24.8
Semi-supervised CCG (Bisk et al., 2016)  28.4
Neural baseline                          28.6
Supervised CCG (Bisk et al., 2016)       30.9
Rule-based system (Bisk et al., 2016)    31.4
SCANNER                                  31.5
For fair comparison, we also built a neural baseline that encodes an utterance with a recurrent neural network and then predicts a grounded meaning representation directly (Ture and Jojic, 2016; Yih et al., 2016). Again, we observe that SCANNER outperforms this baseline. Results on WEBQUESTIONS are summarized in Table 3. SCANNER obtains performance on par with the best symbolic systems (see the first block in the table). It is important to note that Bast and Haussmann (2015) develop a question answering system which, contrary to ours, cannot produce meaning representations, whereas Berant and Liang (2015) propose a sophisticated agenda-based parser which is trained borrowing ideas from imitation learning. A further system learns a semantic parser via intermediate representations generated from the output of a dependency parser. SCANNER performs competitively despite not having access to any linguistically-informed syntactic structures. The second block in Table 3 reports the results of several neural systems. Xu et al. (2016) represent the state of the art on WEBQUESTIONS. Their system uses Wikipedia to prune out erroneous candidate answers extracted from Freebase. Our model would also benefit from a similar post-processing step. As in previous experiments, SCANNER outperforms the neural baseline, too.
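For reference, the average F1 metric used on the Freebase datasets can be sketched as follows, under the common reading that an F1 score is computed per question between the predicted and gold answer sets and then averaged over questions. The function names and the toy answer sets are assumptions made for this illustration.

```python
def f1(predicted, gold):
    """F1 between a predicted answer set and a gold answer set."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def average_f1(predictions, golds):
    """Mean of per-question F1 scores (macro average over questions)."""
    if len(predictions) != len(golds) or not golds:
        raise ValueError("need one prediction per gold answer set")
    return sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)


# Toy example: the first question is partially right, the second is wrong.
predictions = [{"a", "b"}, {"c"}]
golds = [{"a"}, {"d"}]
score = average_f1(predictions, golds)
```

Averaging per question gives partial credit when a system returns some but not all of the correct entities, which matters for questions with several answers.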
Finally, Table 4 presents our results on GRAPHQUESTIONS. We report F1 for SCANNER, the neural baseline model, and three symbolic systems.

In Table 7, the first row shows the percentage of exact matches between the predicted representations and the human annotations. The second row refers to the percentage of structure matches, where the predicted representations have the same structure as the human annotations, but may not use the same lexical terms. Among structurally correct predictions, we additionally compute how many tokens are correct, as shown in the third row. As can be seen, the induced meaning representations overlap to a large extent with the human gold standard. We also evaluated the intermediate representations created by SCANNER on the other three (Freebase) datasets.
Since creating a manual gold standard for these large datasets is time-consuming, we compared the induced representations against the output of a syntactic parser. Specifically, we converted the questions to event-argument structures with EASYCCG (Lewis and Steedman, 2014), a high coverage and high accuracy CCG parser. EASYCCG extracts predicate-argument structures with a labeled F-score of 83.37%. For further comparison, we built a simple baseline which identifies predicates based on the output of the Stanford POS-tagger following the ordering VBD ≫ VBN ≫ VB ≫ VBP ≫ VBZ ≫ MD.

Table 8: Evaluation of predicates induced by SCANNER against EASYCCG. We report F1(%) across datasets. For SPADES, we also provide a breakdown for various utterance types.

As shown in Table 8, on SPADES and WEBQUESTIONS, the predicates learned by our model match the output of EASYCCG more closely than the heuristic baseline. But for GRAPHQUESTIONS, which contains more compositional questions, the mismatch is higher. However, since the key idea of our model is to capture salient meaning for the task at hand rather than strictly obey syntax, we would not expect the predicates induced by our system to entirely agree with those produced by the syntactic parser. To further analyze how the learned predicates differ from syntax-based ones, we grouped utterances in SPADES into four types of linguistic constructions: coordination (conj), control and raising (control), prepositional phrase attachment (pp), and subordinate clauses (subord). Table 8 also shows the breakdown of matching scores per linguistic construction, with the number of utterances in each type. In Table 9, we provide examples of predicates identified by SCANNER, indicating whether they agree or not with the output of EASYCCG. As a reminder, the task in SPADES is to predict the entity masked by a blank symbol ( ).
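The POS-based heuristic baseline mentioned above amounts to a priority lookup over tags. The sketch below assumes the utterance has already been tagged (the tags are supplied by hand here rather than produced by a tagger) and that ties within a tag are broken by taking the leftmost token; both points, along with the function name, are assumptions.

```python
# Tag priority from highest to lowest, following VBD >> VBN >> VB >> VBP >> VBZ >> MD.
TAG_PRIORITY = ["VBD", "VBN", "VB", "VBP", "VBZ", "MD"]


def baseline_predicate(tagged_tokens):
    """Return the leftmost token with the highest-priority tag, or None."""
    for tag in TAG_PRIORITY:
        for token, token_tag in tagged_tokens:
            if token_tag == tag:
                return token
    return None  # no verb-like token found


# Hand-tagged toy utterances (Penn Treebank style tags).
first = [("jobs", "NNP"), ("will", "MD"), ("retire", "VB"), ("from", "IN")]
second = [("nstar", "NNP"), ("was", "VBD"), ("founded", "VBN"), ("in", "IN")]

print(baseline_predicate(first))   # retire
print(baseline_predicate(second))  # was
```

The second utterance shows why such a baseline is crude: the ordering prefers the auxiliary was over the more informative founded.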
As can be seen in Table 8, the matching score is relatively high for utterances involving coordination and prepositional phrase attachments.
The model will often identify informative predicates (e.g., nouns) which do not necessarily agree with linguistic intuition. For example, in the utterance wilhelm maybach and his son started maybach in 1909 (see Table 9), SCANNER identifies the predicate-argument structure son(wilhelm maybach) rather than started(wilhelm maybach). We also observed that the model struggles with control and subordinate constructions. It has difficulty distinguishing control from raising predicates, as exemplified in the utterance ceo john thain agreed to leave from Table 9, where it identifies the control predicate agreed. For subordinate clauses, SCANNER tends to take shortcuts, identifying as predicates the words closest to the blank symbol.

Table 9: Informative predicates identified by SCANNER in various types of utterances. Yellow predicates were identified by both SCANNER and EASYCCG, red predicates by SCANNER alone, and green predicates by EASYCCG alone.

conj
  the boeing company was founded in 1916 and is headquartered in , illinois .
  nstar was founded in 1886 and is based in boston , .
  the is owned and operated by zuffa , llc , headquarted in las vegas , nevada .
  hugh attended and then shifted to uppingham school in england .
  was incorporated in 1947 and is based in new york city .
  the ifbb was formed in 1946 by president ben weider and his brother .
  wilhelm maybach and his son started maybach in 1909 .
  was founded in 1996 and is headquartered in chicago .

control
  threatened to kidnap russ .
  has also been confirmed to play captain haddock .
  hoffenberg decided to leave .
  is reportedly trying to get impregnated by djimon now .
  for right now , are inclined to trust obama to do just that .
  agreed to purchase wachovia corp .
  ceo john thain agreed to leave .
  so nick decided to create .
  salva later went on to make the non clown-based horror .
  eddie dumped debbie to marry when carrie was 2 .

pp
  is the home of the university of tennessee .
  chu is currently a physics professor at .
  youtube is based in , near san francisco , california .
  mathematica is a product of .
  jobs will retire from .
  the nab is a strong advocacy group in .
  this one starred robert reed , known mostly as .
  is positively frightening as detective bud white .

subord
  the is a national testing board that is based in toronto .
  is a corporation that is wholly owned by the city of edmonton .
  unborn is a scary movie that stars .
  's third wife was actress melina mercouri , who died in 1994 .
  sure , there were who liked the shah .
  founded the , which is now also a designated terrorist group .
  is an online bank that ebay owns .
  zoya akhtar is a director , who has directed the upcoming movie .
  imelda staunton , who plays , is genius .
  is the important president that american ever had .
  plus mitt romney is the worst governor that has had .

Discussion
We presented a neural semantic parser which converts natural language utterances to grounded meaning representations via intermediate predicate-argument structures.
Our model essentially jointly learns how to parse natural language semantics and the lexicons that help grounding. Compared to previous neural semantic parsers, our model is more interpretable as the intermediate structures are useful for inspecting what the model has learned and whether it matches linguistic intuition.
An assumption our model imposes is that ungrounded and grounded representations are structurally isomorphic. An advantage of this assumption is that tokens in the ungrounded and grounded representations are strictly aligned. This allows the neural network to focus on parsing and lexical mapping, sidestepping the challenging structure mapping problem which would result in a larger search space and higher variance. On the negative side, the structural isomorphism assumption restricts the expressiveness of the model, especially since one of the main benefits of adopting a two-stage parser is the potential of capturing domain-independent semantic information via the intermediate representation. While it would be challenging to handle drastically non-isomorphic structures in the current model, it is possible to perform local structure matching, i.e., when the mapping between natural language and domain-specific predicates is many-to-one or one-to-many. For instance, Freebase does not contain a relation representing daughter, using instead two relations representing female and child. Previous work (Kwiatkowski et al., 2013) models such cases by introducing collapsing (for many-to-one mapping) and expansion (for one-to-many mapping) operators. Within our current framework, these two types of structural mismatches can be handled with semi-Markov assumptions (Sarawagi and Cohen, 2005; Kong et al., 2016) in the parsing (i.e., predicate selection) and the grounding steps, respectively. Aside from relaxing strict isomorphism, we would also like to perform cross-domain semantic parsing where the first stage of the semantic parser is shared across domains.