Grounded Adaptation for Zero-shot Executable Semantic Parsing

We propose Grounded Adaptation for Zero-shot Executable Semantic Parsing (GAZP) to adapt an existing semantic parser to new environments (e.g. new database schemas). GAZP combines a forward semantic parser with a backward utterance generator to synthesize data (e.g. utterances and SQL queries) in the new environment, then selects cycle-consistent examples to adapt the parser. Unlike data-augmentation, which typically synthesizes unverified examples in the training environment, GAZP synthesizes examples in the new environment whose input-output consistency is verified. On the Spider, Sparc, and CoSQL zero-shot semantic parsing tasks, GAZP improves logical form and execution accuracy of the baseline parser. Our analyses show that GAZP outperforms data-augmentation in the training environment, performance increases with the amount of GAZP-synthesized data, and cycle-consistency is central to successful adaptation.


Introduction
Semantic parsers (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011) build executable meaning representations for a range of tasks such as question-answering (Yih et al., 2014), robotic control (Matuszek et al., 2013), and intelligent tutoring systems (Graesser et al., 2005). However, they are usually engineered for each application environment. For example, a language-to-SQL parser trained on a university management database struggles when deployed to a sales database. How do we adapt a semantic parser to new environments where no training data exists?
We propose Grounded Adaptation for Zero-shot Executable Semantic Parsing, which adapts existing semantic parsers to new environments by synthesizing new, cycle-consistent data. In the previous example, GAZP synthesizes high-quality sales questions and SQL queries using the new sales database, then adapts the parser using the synthesized data. This procedure is shown in Figure 1. GAZP is complementary to prior modeling work in that it can be applied to any model architecture, in any domain where one can enforce cycle-consistency by evaluating equivalence between logical forms. Compared to data-augmentation, which typically synthesizes unverified data in the training environment, GAZP instead synthesizes consistency-verified data in the new environment.
GAZP synthesizes data for consistency-verified adaptation using a forward semantic parser and a backward utterance generator. Given a new environment (e.g. a new database), we first sample logical forms with respect to a grammar (e.g. SQL grammar conditioned on the new database schema). Next, we generate utterances corresponding to these logical forms using the generator. Then, we parse the generated utterances using the parser, keeping those whose parses are equivalent to the original sampled logical form (more in Section 2.4). Finally, we adapt the parser to the new environment by training on the combination of the original data and the synthesized cycle-consistent data.
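The synthesize-then-filter loop above can be sketched as follows. Everything here is a toy stand-in: in GAZP the parser and generator are trained neural models, the grammar is derived from the training data, and equivalence is checked by execution in the new environment.

```python
import random

def sample_logical_form(env, rng):
    # toy "grammar": pick a column of the new environment at random
    col = rng.choice(sorted(env))
    return f"SELECT {col}"

def execute(q, env):
    # toy denotation: look up the selected column's values
    return env[q.split()[-1]]

def gazp_adapt(parser, generator, env, train_data, n_samples=100):
    rng = random.Random(0)
    synthesized = []
    for _ in range(n_samples):
        q = sample_logical_form(env, rng)           # 1. sample a logical form
        u = generator(q, env)                       # 2. backward generation
        q_hat = parser(u, env)                      # 3. forward parse
        if execute(q_hat, env) == execute(q, env):  # 4. cycle-consistency
            synthesized.append((u, q))
    # 5. the parser would then be retrained on the combined data
    return train_data + synthesized

env = {"name": [1, 2], "age": [3]}                  # a toy new environment
generator = lambda q, e: f"show the {q.split()[-1]}"
parser = lambda u, e: f"SELECT {u.split()[-1]}"
adapt_data = gazp_adapt(parser, generator, env, [])
```

With these toy components every synthesized pair happens to be cycle-consistent; with a real generator trained on a different environment, a large fraction of samples would be pruned at step 4.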
We evaluate GAZP on the Spider, Sparc, and CoSQL (Yu et al., 2018b, 2019a,b) language-to-SQL zero-shot semantic parsing tasks, which test on unseen databases. GAZP improves logical form and execution accuracy of the baseline parser on all tasks, successfully adapting the existing parser to new environments. In further analyses, we show that GAZP outperforms data augmentation in the training environment. Moreover, adaptation performance increases with the amount of GAZP-synthesized data. Finally, we show that cycle-consistency is critical to synthesizing high-quality examples in the new environment, which in turn allows for successful adaptation and performance improvement.


Grounded Adaptation for Zero-shot Executable Semantic Parsing

Semantic parsing involves producing a logical form q that corresponds to an input utterance u, such that executing q in the environment e produces the desired denotation EXE(q, e). In the context of language-to-SQL parsing, q and e correspond to SQL queries and databases.
We propose GAZP for zero-shot semantic parsing, where inference environments have not been observed during training (e.g. producing SQL queries in new databases). GAZP consists of a forward semantic parser F(u, e) → q, which produces a logical form q given an utterance u in environment e, and a backward utterance generator G(q, e) → u. The models F and G condition on the environment by reading an environment description w, which consists of a set of documents d. In the context of SQL parsing, the description is the database schema, which consists of a set of table schemas (i.e. documents).
We assume that the logical form consists of three types of tokens: syntax candidates c_s from a fixed syntax vocabulary (e.g. SQL syntax), environment candidates c_e from the environment description (e.g. table names from the database schema), and utterance candidates c_u from the utterance (e.g. values in the SQL query). Finally, c_e tokens have corresponding spans in the description d. For example, a SQL query q consists of columns c_e that directly map to the related column schema (e.g. table, name, type) in the database schema w.
In GAZP, we first train the forward semantic parser F and the backward utterance generator G in the training environment e. Given a new inference environment e', we sample logical forms q from e' using a grammar. For each q, we generate a corresponding utterance u' = G(q, e'). We then parse the generated utterance into a logical form q' = F(u', e'). We combine cycle-consistent examples from the new environment, for which q' is equivalent to q, with the original labeled data to retrain and adapt the parser. Figure 1 illustrates the components of GAZP. We now detail the sampling procedure, forward parser, backward generator, and cycle-consistency.

Query sampling
To synthesize data for adaptation, we first sample logical forms q with respect to a grammar. We begin by building an empirical distribution over q using the training data. For language-to-SQL parsing, we preprocess queries similar to Zhang et al. (2019) and further replace mentions of columns and values with typed slots to form coarse templates. Note that we remove JOINs, which are later filled back deterministically after sampling the columns. Next, we build an empirical distribution P_Z over these coarse templates by counting occurrences in the training data. The sampling procedure is shown in Algorithm 1 for the language-to-SQL example. Invalid queries and those that execute to the empty set are discarded.
Given some coarse template z = SELECT key1, text1 WHERE text2 = val, the function d.CANFILL(z) returns whether the database d contains a sufficient number of columns of each type. In this case, at the minimum, d should have a key column and two text columns. The function d.RANDASSIGNCOLSTOSLOTS() returns a database copy d' such that each of its columns is mapped to some identifier (text1, key1, etc.).
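A minimal sketch of these two helpers, assuming a toy schema representation (column name → type); the function names mirror the text, but the implementation is ours:

```python
import random
from collections import Counter

def can_fill(db_columns, template_slots):
    # does the database have enough columns of each required type?
    need = Counter(t for t, _ in template_slots)
    have = Counter(db_columns.values())
    return all(have[t] >= n for t, n in need.items())

def rand_assign_cols_to_slots(db_columns, template_slots, rng):
    # map each typed slot (e.g. "text1") to a distinct column of that type
    assignment, used = {}, set()
    for typ, slot in template_slots:
        choices = [c for c, t in sorted(db_columns.items())
                   if t == typ and c not in used]
        col = rng.choice(choices)
        used.add(col)
        assignment[slot] = col
    return assignment

# toy database and the slots of the example template above
db = {"id": "key", "name": "text", "city": "text"}
slots = [("key", "key1"), ("text", "text1"), ("text", "text2")]
rng = random.Random(0)
ok = can_fill(db, slots)
mapping = rand_assign_cols_to_slots(db, slots, rng)
```

Substituting the assigned columns back into the template then yields a concrete query that can be executed and kept or discarded.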
Appendix A.1 quantifies query coverage of the sampling procedure on the Spider task, and shows how to extend Algorithm 1 to multi-turn queries.

Forward semantic parser
The forward semantic parser F produces a logical form q = F(u, e) for an utterance u in the environment e. We begin by cross-encoding u with the environment description w to model coreferences. Since w may be very long (e.g. an entire database schema), we instead cross-encode u with each document d_i in the description (e.g. each table schema), similar to Zhang et al. (2019). We then combine each environment candidate c_e,i across documents (e.g. table columns) using RNNs, such that the final representations capture dependencies between c_e from different documents. To produce the logical form q, we first generate a logical form template q̂ whose utterance candidates c_u (e.g. SQL values) are replaced by slots. We generate q̂ with a pointer-decoder that selects among syntax candidates c_s (e.g. SQL keywords) and environment candidates c_e (e.g. table columns). Then, we fill in slots in q̂ with a separate decoder that selects among c_u in the utterance to form q. Note that the logical form template q̂ is distinct from the coarse templates z described in sampling (Section 2.1). Figure 2 describes the forward semantic parser.
Let u denote words in the utterance, and d_i denote words in the ith document in the environment description. Let [a; b] denote the concatenation of a and b. First, we cross-encode the utterance and the document using BERT (Devlin et al., 2019), which has led to improvements on a number of NLP tasks.
Next, we extract environment candidates in document i using self-attention. Let s and e denote the start and end positions of the jth environment candidate in the ith document. We compute an intermediate representation x_ij for each environment candidate by scoring each position of the span B_i[s:e], normalizing the scores with a softmax, and taking the attention-weighted sum of the span. For ease of exposition, we abbreviate this self-attention function as

x_ij = selfattn(B_i[s:e])

Because x_ij does not model dependencies between different documents, we further process x with bidirectional LSTMs (Hochreiter and Schmidhuber, 1997). We use one LSTM followed by self-attention to summarize each ith document. We use another LSTM to build representations for the environment candidates c_e,i:

c_e = BiLSTM([x_11; x_12; ... x_21; x_22; ...])

We do not share weights between different LSTMs and between different self-attentions.
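The span self-attention above can be sketched in a few lines of pure Python (our own variable names; the real model scores learned BERT vectors with learned weights):

```python
import math

def selfattn(span, w):
    # span: list of token vectors; w: scoring weights, one per dimension
    scores = [sum(wi * vi for wi, vi in zip(w, vec)) for vec in span]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]          # softmax over span positions
    dim = len(span[0])
    # attention-weighted sum pools the span into one vector
    return [sum(a * vec[d] for a, vec in zip(alpha, span)) for d in range(dim)]

span = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three token vectors
w = [1.0, 1.0]                               # toy scoring weights
x = selfattn(span, w)
```

The pooled vector is a convex combination of the span's token vectors, so each of its components stays within the range of the corresponding inputs.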
Next, we use a pointer-decoder (Vinyals et al., 2015) to produce the output logical form template q̂ by selecting among a set of candidates that corresponds to the union of the environment candidates c_e and the syntax candidates c_s (the fixed syntax vocabulary, e.g. SELECT, FROM, WHERE, >, <). Here, we represent a syntax token using its BERT word embedding. The representation for all candidates is the concatenation c = [c_e,1; c_e,2; ... c_s,1; c_s,2; ...]. At each step t of the decoder, we first update the states of the decoder LSTM, then attend over the document representations given the current decoder state using dot-product attention (Bahdanau et al., 2015), and finally score each candidate c_i against the attended representation.

Figure 2: Forward semantic parser. Model components are shown in purple, inputs in blue, and outputs in red. First, we cross-encode each environment description text and the utterance using BERT. We then extract document-level phrase representations for candidate phrases in each text, which we subsequently encode using LSTMs to form input and environment-level candidate phrase representations. A pointer-decoder attends over the input and selects among candidates to produce the output logical form.

Value-generation. The previous template decoder produces the logical form template q̂, which is not executable because it does not include utterance candidates c_u. To generate fully-specified executable logical forms q, we use a separate value pointer-decoder that selects among utterance tokens. The attention input for this decoder is identical to that of the template decoder. The pointer candidates c_u are obtained by running a separate BERT encoder on the utterance u. The produced values are inserted into each slot in q̂ to form q.
Both the template and value decoders are trained using cross-entropy loss with respect to the ground-truth sequence of candidates.

Backward utterance generator
The utterance generator G produces an utterance u = G(q, e) for the logical form q in the environment e. The alignment problem between q and the environment description w is simpler than that between u and w, because environment candidates c_e (e.g. column names) in q are described by corresponding spans in w (e.g. column schemas in the database schema). To leverage this deterministic alignment, we augment c_e in q with relevant spans from w, and encode this augmented logical form q. The pointer-decoder selects among words c_v from a fixed vocabulary (e.g. when, where, who) and words c_q from q. Figure 3 illustrates the backward utterance generator.

First, we encode the augmented logical form using BERT. Next, we apply a bidirectional LSTM to obtain the input encoding h_enc, and another bidirectional LSTM to obtain representations of the tokens c_q in the augmented logical form. To represent the vocabulary words c_v, we use word embeddings from BERT. Finally, we apply a pointer-decoder that attends over h_enc and selects among the candidates c = [c_q; c_v] to obtain the predicted utterance.

Synthesizing cycle-consistent examples
Having trained a forward semantic parser F and a backward utterance generator G in environment e, we can synthesize new examples with which to adapt the parser in the new environment e'. First, we sample a logical form q using a grammar (Algorithm 1 in Section 2.1). Next, we predict an utterance u' = G(q, e'). Because G was trained only on e, many of its outputs are low-quality or do not correspond to the input q. On their own, these examples (u', q) do not facilitate parser adaptation (see Section 3.1 for analyses).
To filter out low-quality examples, we additionally predict a logical form q' = F(u', e'), and keep only examples that are cycle-consistent: the synthesized logical form q' is equivalent to the originally sampled logical form q in e'. In the case of SQL parsing, the example is cycle-consistent if executing the synthesized query EXE(q', e') results in the same denotation (i.e. the same set of database records) as executing the original sampled query EXE(q, e'). Finally, we combine cycle-consistent examples synthesized in e' with the original training data in e to retrain and adapt the parser.
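For SQL, the denotation check can be illustrated with sqlite3. The schema, queries, and helper name below are toy examples, not the paper's setup:

```python
import sqlite3

def same_denotation(q1, q2, conn):
    # cycle-consistent iff both queries return the same set of records
    r1 = set(map(tuple, conn.execute(q1).fetchall()))
    r2 = set(map(tuple, conn.execute(q2).fetchall()))
    return r1 == r2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birth_place TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Tesla", "Smiljan"), ("Edison", "Milan")])

sampled = "SELECT birth_place FROM people WHERE name = 'Tesla'"
parsed = "SELECT birth_place FROM people WHERE name LIKE 'Tesla'"
consistent = same_denotation(sampled, parsed, conn)
inconsistent = same_denotation(sampled, "SELECT birth_place FROM people", conn)
```

Note that the two consistent queries differ as strings but execute to the same records, which is exactly the case where denotation-based consistency is more permissive than string match.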

Experiments
We evaluate performance on the Spider (Yu et al., 2018b), Sparc (Yu et al., 2019b), and CoSQL (Yu et al., 2019a) zero-shot semantic parsing tasks. Table 1 shows dataset statistics. Figure 4 shows examples from each dataset. For all three datasets, we use preprocessing steps from Zhang et al. (2019) to preprocess SQL logical forms. Evaluation consists of exact match over logical form templates (EM), in which values are stripped out, as well as execution accuracy (EX). Official evaluations also recently incorporated fuzz-test accuracy (FX) as a tighter variant of execution accuracy. In fuzz-testing, the query is executed over randomized database content numerous times. Compared to an execution match, a fuzz-test execution match is less likely to be spurious (e.g. the predicted query coincidentally executes to the correct result). The FX implementation is not public as of writing, hence we only report test FX.
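As a toy illustration of the EM metric, values can be stripped from both queries before comparing templates. The normalization below is our own simplification, not the official evaluation script:

```python
import re

def strip_values(sql):
    # replace literals with a placeholder so only the template remains
    sql = re.sub(r"'[^']*'", "value", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", "value", sql)   # numeric literals
    return sql

a = "SELECT name FROM people WHERE age = 31"
b = "SELECT name FROM people WHERE age = 42"
em = strip_values(a) == strip_values(b)      # template exact match
```

Under this metric the two queries match even though they would execute to different results, which is why execution accuracy (EX) is reported alongside EM.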
Spider. Spider is a collection of database-utterance-SQL query triplets. The task involves producing the SQL query given the utterance and the database. Figures 2 and 3 show preprocessed inputs for the parser and generator.
Sparc. In Sparc, the user repeatedly asks questions that must be converted to SQL queries by the system. Compared to Spider, Sparc additionally contains prior interactions from the same user session (e.g. database-utterance-query-previous-query quadruplets). For Sparc evaluation, we concatenate the previous system-produced query (if present) to each utterance. For example, suppose the system was previously asked "where is Tesla born?" and is now asked "how many people are born there?", we produce the utterance

[PREV] SELECT birth place FROM people WHERE name = 'Tesla' [UTT] how many people are born there ?

For training and data synthesis, the ground-truth previous query is used as generation context for forward parsing and backward utterance generation.
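The context construction above can be sketched as follows; the [PREV]/[UTT] markers follow the example, while the helper name is ours:

```python
def build_input(prev_query, utterance):
    # prepend the previous system-produced query, if any, to the utterance
    if prev_query is None:
        return f"[UTT] {utterance}"
    return f"[PREV] {prev_query} [UTT] {utterance}"

s = build_input("SELECT birth place FROM people WHERE name = 'Tesla'",
                "how many people are born there ?")
first_turn = build_input(None, "where is Tesla born ?")
```

At evaluation time prev_query is the model's own output from the previous turn, while during training and synthesis it is the ground-truth previous query.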
CoSQL. CoSQL combines task-oriented dialogue and semantic parsing. It consists of a number of tasks, such as response generation, user act prediction, and state-tracking. We focus on state-tracking, in which the user intent is mapped to a SQL query. Similar to Zhang et al. (2019), we restrict the context to be the previous query and the current utterance. Hence, the input utterance and environment description are obtained in the same way as for Sparc.

Results
We primarily compare GAZP with the baseline forward semantic parser, because prior systems produce queries without values, which are not executable. We include one such non-executable model, EditSQL (Zhang et al., 2019), one of the top parsers on Spider at the time of writing, for reference. However, EditSQL EM is not directly comparable because of its different outputs. Due to high variance from small datasets, we tune the forward parser and backward generator using cross-validation. We then retrain the model with early stopping on the development set using hyperparameters found via cross-validation. For each task, we synthesize 100k examples, of which ∼40k are kept after checking for cycle-consistency. The adapted parser is trained using the same hyperparameters as the baseline. Please see appendix A.2 for hyperparameter settings.

Table 2 shows that adaptation by GAZP results in consistent performance improvement across Spider, Sparc, and CoSQL in terms of EM, EX, and FX. We also examine the performance breakdown across query classes and turns (details in appendix A.4). First, we divide queries into difficulty classes based on the number of SQL components, selections, and conditions (Yu et al., 2018b). For example, queries that contain more components such as nested subqueries, column selections, and aggregators are considered to be harder. Second, for Sparc and CoSQL, we divide multi-turn queries by how many turns into the interaction they occur (Yu et al., 2019b,a). We observe that the gains from GAZP are generally more pronounced in more difficult queries and in turns later in the interaction. Finally, we answer the following questions regarding the effectiveness of cycle-consistency and grounded adaptation.
Does adaptation on the inference environment outperform data augmentation on the training environment? For this experiment, we synthesize data on training environments instead of inference environments. The resulting data is similar to data augmentation with verification. As shown in the "syntrain" row of Table 3, retraining the model on the combination of this data and the supervised data leads to overfitting in the training environments. A method related to data augmentation is jointly supervising the model using the training data in the reverse direction, for example by generating utterances from queries (Fried et al., 2018; Cao et al., 2019). For Spider, we find that this dual objective (57.2 EM) underperforms GAZP adaptation (59.1 EM). Our results indicate that adaptation to the new environment significantly outperforms augmentation in the training environment.
How important is cycle-consistency? For this experiment, we do not check for cycle-consistency and instead keep all synthesized queries in the inference environments. As shown in the "nocycle" row of Table 3, the inclusion of cycle-consistency effectively prunes ∼60% of synthesized examples, which otherwise significantly degrade performance. This shows that enforcing cycle-consistency is crucial to successful adaptation.
In another experiment, we keep examples that have consistent logical forms, as deemed by string match (e.g. q == q'), instead of consistent denotations from execution. The "EM consistency" row of Table 3 shows that this variant of cycle-consistency also improves performance. In particular, EM consistency performs similarly to execution consistency, albeit typically with lower execution accuracy.
How much GAZP-synthesized data should one use for grounded adaptation? For this experiment, we vary the amount of cycle-consistent synthesized data used for adaptation. Figure 5 shows that adaptation performance generally increases with the amount of synthesized data in the inference environment, with diminishing returns after 30-40k examples.

Related work
Semantic parsing. Semantic parsers parse natural language utterances into executable logical forms with respect to an environment (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011). In zero-shot semantic parsing, the model is required to generalize to environments (e.g. new domains, new database schemas) not seen during training (Pasupat and Liang, 2015; Zhong et al., 2017; Yu et al., 2018b). For language-to-SQL zero-shot semantic parsing, a variety of methods have been proposed to generalize to new databases by selecting from table schemas in the new database (Zhang et al., 2019; Guo et al., 2019). Our method is complementary to these works: the synthesis, cycle-consistency, and adaptation steps in GAZP can be applied to any parser, so long as we can learn a backward utterance generator and evaluate logical-form equivalence.
Data augmentation. Data augmentation transforms original training data to synthesize artificial training data. Krizhevsky et al. (2017) crop and rotate input images to improve object recognition. Dong et al. (2017) and Yu et al. (2018a) respectively paraphrase and back-translate (Sennrich et al., 2016; Edunov et al., 2018) questions and documents to improve question-answering. Jia and Liang (2016) perform data-recombination in the training domain to improve semantic parsing. Hannun et al. (2014) superimpose noisy background tracks on input tracks to improve speech recognition. Our method is distinct from data augmentation in the following ways. First, we synthesize data from logical forms sampled in the new environment instead of the original environment, which allows for adaptation to new environments. Second, we propose cycle-consistency to prune low-quality data and keep high-quality data for adaptation. Our analyses show that these core differences from data augmentation are central to improving parsing performance.
Cycle-consistent generative adversarial models (cycle-GANs). In cycle-GANs (Zhu et al., 2017; Hoffman et al., 2018), a generator forms images that fool a discriminator, while the discriminator tries to distinguish generated images from naturally occurring images. The adversarial objectives of the generator and the discriminator are optimized jointly. Our method differs from cycle-GANs in that we do not use adversarial objectives and instead rely on matching denotations from executing synthesized queries. This provides an exact signal, compared to potentially incorrect outputs from a discriminator. Moreover, cycle-GANs only synthesize the input and verify whether the input looks genuine (e.g. the utterance looks like a user request). In contrast, GAZP synthesizes both the input and the output, and verifies consistency between the input and the output (e.g. the utterance matches the query).

Conclusion and Future work
We proposed GAZP to adapt an existing semantic parser to new environments by synthesizing cycle-consistent data. GAZP improved parsing performance on three zero-shot parsing tasks. Our analyses showed that GAZP outperforms data augmentation, performance improvement scales with the amount of GAZP-synthesized data, and cycle-consistency is central to successful adaptation.
In principle, GAZP applies to any problems that lack annotated data and differ between training and inference environments.One such area is robotics, where one trains in simulation because it is prohibitively expensive to collect annotated trajectories in the real world.In future work, we will consider how to interpret environment specifications to facilitate grounded adaptation in these other areas.

A Appendix
A.1 Coverage and multi-turn sampling

When we build an empirical distribution over templates on the training set of Spider, we observe an 85% coverage of dev set templates. That is, 85% of dev set examples have a query whose template occurs in the training set. In other words, while this simple template-filling sampling scheme doesn't provide full coverage over the dev set as a complex grammar would, it covers a large portion of examples.
For Sparc and CoSQL, the sampling procedure is similar to Algorithm 1. However, because there are two queries (one previous, one current), we first sample a previous query z1 from P_temp(z), then sample the current query z2 from P_temp(z | z1). As before, the empirical template distributions are obtained by counting templates in the training set.

A.2 Hyperparameters

We use 300-dimensional LSTMs throughout the model. The BERT model we use is DistilBERT (Sanh et al., 2020), which we optimize with Adam (Kingma and Ba, 2015) with an initial learning rate of 5e-5. We train for 50 epochs with a batch size of 10 and gradient clipping with a norm of 20. We use dropout after BERT, after encoder LSTMs, and before the pointer scorer. The dropout values used by our leaderboard submissions are shown in Table 4 and Table 5. For each task, these rates are tuned using 3-fold cross-validation with a coarse grid-search over values {0.1, 0.3} for each dropout with a fixed seed.

A single training run of the forward parser took approximately 16 hours on a single NVIDIA Titan X GPU. Each task required 3 folds in addition to the final official train/dev run. For each fold, we grid-searched over dropout rates, which amounts to 8 runs. In total, we conducted 27 runs on a Slurm cluster. Including pretrained BERT parameters, the final forward parser contains 142 million parameters. The final backward utterance generator contains 73 million parameters.

A.3 Synthesized examples
In order to quantify the distribution of synthesized examples, we classify synthesized queries according to the difficulty criteria from Spider (Yu et al., 2018b).Compared to the Spider development set, GAZP-synthesized data has an average of 0.60 vs. 0.47 joins, 1.21 vs. 1.37 conditions, 0.20 vs. 0.26 group bys, 0.23 vs. 0.25 order bys, 0.07 vs. 0.04 intersections, and 1.25 vs. 1.32 selection columns per query.This suggests that GAZP queries are similar to real data.
Moreover, we examine a random sample of 60 synthesized examples. Out of the 60, 51 are correct. Mistakes come from aggregation over wrong columns (e.g. "has the most course" becomes order by sum T2.grade) and underspecification (e.g. "lowest of the stadium who has the lowest age"). There are grammatical errors (e.g. "that has the most" becomes "that has been most"), but most questions are fluent and sensible (e.g. "find the name and district of the employee that has the highest evaluation bonus"). A subset of these queries are shown in Table 6.

A.4 Performance breakdown

In addition to the main experiment results in Table 2 of Section 3.1, we also examine the performance breakdown across query classes and turns.
GAZP improves performance on harder queries. First, we divide queries into difficulty classes following the classification in Yu et al. (2018b). These difficulty classes are based on the number of SQL components, selections, and conditions. For example, queries that contain more SQL keywords such as GROUP BY, ORDER BY, INTERSECT, nested subqueries, column selections, and aggregators are considered to be harder. Yu et al. (2018b) show examples of SQL queries in the four hardness categories. Note that extra is a catch-all category for queries that exceed the qualifications of hard; as a result it includes artifacts (e.g. set exclusion operations) that may introduce other confounding factors. Tables 7, 8, and 9 respectively break down the performance of models on Spider, Sparc, and CoSQL. We observe that the gains from GAZP are generally more pronounced in more difficult queries. This finding is consistent across tasks (with some variance) and across all three evaluation metrics.
One potential explanation for this gain is that the generalization problem is exacerbated in more difficult queries. Consider the example of language-to-SQL parsing, in which we have trained a parser on a university database and are now evaluating it on a sales database. While it is difficult to produce simple queries in the sales database due to a lack of training data, it is likely even more difficult to produce nested queries, queries with groupings, queries with multiple conditions, etc. Because GAZP synthesizes queries, including difficult ones, in the sales database, the adapted parser learns to handle these cases. In contrast, simpler queries are likely easier to learn, hence adaptation does not help as much.
GAZP improves performance in longer interactions. For Sparc and CoSQL, which include multi-turn interactions between the user and the system, we divide queries by how many turns into the interaction they occur. This classification is described in Yu et al. (2019b) and Yu et al. (2019a). Tables 10 and 11 respectively break down the performance of models on Sparc and CoSQL. We observe that the gains from GAZP are more pronounced in turns later in the interaction. Again, this finding is consistent not only across tasks, but across all three evaluation metrics.
A possible reason for this gain is that the conditional sampling procedure shown in Algorithm 1 improves multi-turn parsing by synthesizing multi-turn examples. How much additional variation should we expect in a multi-turn setting? Suppose we discover T coarse-grain templates by counting the training data, where each coarse-grain template has S slots on average. For simplicity, let us ignore value slots and only consider column slots. Given a new database with N columns, the number of possible filled queries is on the order of O(T × N^S), since each of the S slots can be filled by one of the N columns. For K turns, the number of possible query sequences is then O((T × N^S)^K).
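A toy calculation of this counting argument (the numbers are purely illustrative, not taken from the datasets):

```python
# T templates, S column slots per template, N columns, K turns
def n_single_turn(T, S, N):
    return T * N ** S          # each of S slots filled by one of N columns

def n_sequences(T, S, N, K):
    return n_single_turn(T, S, N) ** K   # independent choices per turn

single = n_single_turn(T=100, S=3, N=20)     # 100 * 20**3
seqs = n_sequences(T=100, S=3, N=20, K=2)    # squared for two turns
```

Even with modest values, the space of multi-turn sequences grows exponentially in K, which is the variety the conditional sampling procedure can tap into.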
This exponential increase in query variety may improve parser performance on later-turn queries (e.g.those with a previous interaction), which in turn reduce cascading errors throughout the interaction.

Figure 3: Backward utterance generator. Model components are shown in purple, inputs in blue, and outputs in red. First, we encode the input logical form along with the environment description for each of its symbols, which we subsequently encode using LSTMs to form the input and environment-level candidate token representations. A pointer-decoder attends over the input and selects among candidate representations to produce the output utterance.

Figure 4 :
Figure 4: Examples from (a) Spider and (b) CoSQL. Context and output are respectively shown in purple and blue. We do not show Sparc because its data format is similar to CoSQL, but without user dialogue act prediction and without response generation. For our experiments, we produce the output logical form given the data, utterance, and the previous logical form if applicable. During evaluation, the previous logical form is the output of the model during the previous turn (i.e. no teacher forcing on the ground-truth previous output).
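The no-teacher-forcing evaluation protocol amounts to the following loop, where `parse` is a hypothetical placeholder for the forward parser:

```python
# Sketch of multi-turn evaluation without teacher forcing: at each turn the
# parser conditions on its own previous prediction, not the gold logical
# form. `parse` is a placeholder, not the real model.

def parse(schema, utterance, prev_logical_form):
    # A real parser would decode a SQL logical form here.
    return f"SELECT ... /* for: {utterance} */"

def evaluate_interaction(schema, utterances):
    prev = None                      # no previous logical form at turn 1
    predictions = []
    for utt in utterances:
        pred = parse(schema, utt, prev)
        predictions.append(pred)
        prev = pred                  # feed back the model's own output
    return predictions

preds = evaluate_interaction("sales_db", ["show all orders", "only from 2019"])
```

Because each turn consumes the previous prediction, an early parsing error can cascade through the rest of the interaction, which is precisely the failure mode the multi-turn synthesized data targets.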
Appendix A.3 shows examples of synthesized adaptation examples and compares them to real examples.

Figure 5 :
Figure 5: Effect of the amount of synthesized data on adaptation performance on the development set. EM and EX denote template exact match and logical form execution accuracy, respectively. The x-axis shows the number of cycle-consistent examples synthesized in the inference environments (e.g. all databases in the development set).
select last name from Owners order by last name

how many friend are there ?
select count ( * ) from Friend

what is the id of the votes that has been most distinct contestants ?
select T2.vote id from CONTESTANTS as T1 join VOTES as T2 on T1.contestant number = T2.contestant number group by ( T2.vote id ) order by count ( T1.contestant number ) desc limit 1

what are the name of higher ?
select name from Highschooler

how many car makers has the horsepower of 81 ?
select count ( * ) from cars data as T1 join car names as T2 on T1.Id = T2.MakeId join model list as T3 on T2.Model = T3.Model join car makers as T4 on T3.Maker = T4.Id where T1.Horsepower = '81'

what are the starts of hiring who are located in the city of Bristol ?
select T2.Start from employee as T1 join hiring as T2 on T1.Employee ID = T2.Employee ID where T1.City = 'Bristol'

find the name and district of the employee that has the highest evaluation bonus .
select T2.Name , T4.District from evaluation as T1 join employee as T2 on T1.Employee ID = T2.Employee ID join hiring as T3 on T2.Employee ID = T3.Employee ID join shop as T4 on T3.Shop ID = T4.Shop ID order by T1.Bonus desc limit 1

what is the cell number of the owners with the largest charges amount ?
select T1.cell number from Owners as T1 join Charges as T2 order by T2.charge amount desc limit 1

what is the minimum , average , and maximum grade of all high schooler ?
select min ( grade ) , avg ( grade ) , max ( grade ) from Highschooler

what is the age of the teacher who has the most course ?
select T1.Age from teacher as T1 join course arrange as T2 on T1.Teacher ID = T2.Teacher ID group by T2.Teacher ID order by sum ( T2.Grade ) desc limit 1

arXiv:2009.07396v2 [cs.CL] 17 Sep 2020

Figure 1: Grounded Adaptation for Zero-shot Executable Semantic Parsing. GAZP adapts a parser to new inference environments. Data and models for training and inference environments are respectively shown in blue and purple.

Table 2 :
Development set evaluation results on Spider, Sparc, and CoSQL. EM is exact match accuracy of logical form templates without values. EX is execution accuracy of fully-specified logical forms with values. FX is execution accuracy from fuzz-testing with randomized databases. Baseline is the forward parser without adaptation. EditSQL is a state-of-the-art language-to-SQL parser that produces logical form templates that are not executable.

Table 3 :
Ablation performance on development sets. For each ablation, 100,000 examples are synthesized, out of which queries that do not execute or execute to the empty set are discarded. "nocycle" uses adaptation without cycle-consistency. "syntrain" uses data-augmentation on training environments. "EM consistency" enforces logical form consistency instead of execution consistency.
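The discard criterion used in these ablations (queries that fail to execute, or execute to the empty set) can be sketched with a toy SQLite database. The schema and queries here are illustrative, not taken from our datasets:

```python
import sqlite3

# Sketch of the filtering step: a synthesized query is kept only if it
# executes successfully AND returns a non-empty result.

def keep_example(conn, sql):
    """Return True iff `sql` executes successfully and returns rows."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False                 # discard: does not execute
    return len(rows) > 0             # discard if it executes to empty set

# Toy database standing in for a new inference environment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Friend (id INTEGER, grade INTEGER)")
conn.execute("INSERT INTO Friend VALUES (1, 9), (2, 10)")

keep_example(conn, "SELECT count(*) FROM Friend")       # True: non-empty
keep_example(conn, "SELECT * FROM Friend WHERE id = 99")  # False: empty
keep_example(conn, "SELECT * FROM NoSuchTable")         # False: fails
```

This execution filter is applied before the cycle-consistency check, so the adaptation data contains only queries that are meaningful in the new database.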

Table 4 :
Dropout rates for the forward parser.

Table 5 :
Dropout rates for the backward generator.

Table 6 :
Examples of synthesized queries

Table 7 :
Difficulty breakdown for Spider test set.

Table 8 :
Difficulty breakdown for Sparc test set.

Table 9 :
Difficulty breakdown for CoSQL test set.

Table 10 :
Turn breakdown for Sparc test set.

Table 11 :
Turn breakdown for CoSQL test set.