Improving Compositional Generalization in Semantic Parsing

Generalization of models to out-of-distribution (OOD) data has captured tremendous attention recently. Specifically, compositional generalization, i.e., whether a model generalizes to new structures built of components observed during training, has sparked substantial interest. In this work, we investigate compositional generalization in semantic parsing, a natural test-bed for compositional generalization, as output programs are constructed from sub-components. We analyze a wide variety of models and propose multiple extensions to the attention module of the semantic parser, aiming to improve compositional generalization. We find that the following factors improve compositional generalization: (a) using contextual representations, such as ELMo and BERT, (b) informing the decoder what input tokens have previously been attended to, (c) training the decoder attention to agree with pre-computed token alignments, and (d) downsampling examples corresponding to frequent program templates. While we substantially reduce the gap between in-distribution and OOD generalization, performance on OOD compositions is still substantially lower.


Introduction
Neural models trained on large datasets have recently shown great performance on data sampled from the training distribution. However, generalization to out-of-distribution (OOD) scenarios has been dramatically lower (Sagawa et al., 2019; Kaushik et al., 2020). A particularly interesting case of OOD generalization is compositional generalization, the ability to systematically generalize to test examples composed of components seen during training. For example, we expect a model that observed the questions "What is the capital of France?" and "What is the population of Spain?" at training time to generalize to questions such as "What is the population of the capital of Spain?". While humans generalize systematically to such compositions (Fodor et al., 1988), models often fail to capture the structure underlying the problem, and thus miserably fail (Atzmon et al., 2016; Loula et al., 2018; Bahdanau et al., 2019b; Ruis et al., 2020).
* The authors contributed equally.
Semantic parsing, mapping natural language utterances to structured programs, is a task where compositional generalization is expected, as substructures in the input utterance and output program often align. For example, in "What is the capital of the largest US state?", the span "largest US state" might correspond to an argmax clause in the output program. Nevertheless, prior work (Finegan-Dollak et al., 2018; Herzig and Berant, 2019; Keysers et al., 2020) has shown that data splits that require generalizing to new program templates result in drastic loss of performance. However, past work did not investigate how different modeling choices interact with compositional generalization.
In this paper, we thoroughly analyze the impact of different modeling choices on compositional generalization in five semantic parsing datasets: four text-to-SQL datasets, and DROP, a dataset for executing programs over text paragraphs. Following Finegan-Dollak et al. (2018), we examine performance on a compositional split, where target programs are partitioned into "program templates", and templates appearing at test time are unobserved at training time. We examine the effect of standard practices, such as contextualized representations (§3.1) and grammar-based decoding (§3.2). Moreover, we propose novel extensions to decoder attention (§3.3), the component responsible for aligning sub-structures in the question and program: (a) supervising attention based on precomputed token alignments, (b) attending over constituency spans, and (c) encouraging the decoder attention to cover the entire input utterance. Lastly, we also propose downsampling examples from frequent templates to reduce dataset bias (§3.4).
Our main findings are that (i) contextualized representations, (ii) supervising the decoder attention, (iii) informing the decoder on coverage of the input by the attention mechanism, and (iv) downsampling frequent program templates, all reduce the gap in generalization when comparing standard iid splits to compositional splits. For SQL, the gap in exact match accuracy between in-distribution and OOD is reduced from 84.6 → 62.2 and for DROP from 96.4 → 77.1. While this is a substantial improvement, the gap between in-distribution and OOD generalization is still significant. All our code and data are publicly available at http://github.com/inbaroren/improving-compgen-in-semparse.

Compositional Generalization
Natural language is compositional in a sense that complex utterances are interpreted by understanding the structure of the utterance and the meaning of its parts (Montague, 1973). For example, the meaning of "a person below the tree" can be composed from the meaning of "a person", "below" and "tree". By virtue of compositionality, an agent can derive the meaning of new utterances, even at first encounter. Thus, we expect our systems to model this compositional nature of language and generalize to new utterances, generated from subparts observed during training but composed in novel ways. This sort of model generalization is often called compositional generalization.
Recent work has proposed various benchmarks to measure different aspects of compositional generalization, showing that current models struggle in this setup. Lake and Baroni (2018) introduce a benchmark called SCAN for mapping a command to actions in a synthetic language, and propose a data split that requires generalizing to commands that map to a longer sequence of actions than observed during training. Bahdanau et al. (2019a) study the impact of modularity in neural models on the ability to answer visual questions about pairs of objects that were not observed during training. Bahdanau et al. (2019b) assess the ability of models trained on CLEVR (Johnson et al., 2017) to interpret new referring expressions composed of parts observed at training time. Keysers et al. (2020) propose a data split such that the test set contains new combinations of knowledge-base constants (entities and relations) that were not seen during training. Ruis et al. (2020) propose gSCAN, which focuses on compositional generalization when mapping commands to actions in a situated environment.
In this work, we focus on a specific kind of compositional data split, proposed by Finegan-Dollak et al. (2018), that despite its simplicity leads to large drops in performance. Finegan-Dollak et al. (2018) propose to split semantic parsing data such that a model cannot memorize a mapping from question templates to programs. To achieve this, they take question-program pairs, and anonymize the entities in the question-program pair with typed variables. Thus, questions that require the same abstract reasoning structure now get mapped to the same anonymized program, referred to as a program template. For example, in the top two rows of Figure 1, after anonymizing the name of a river to the typed variable river name0, two lexically different questions map to the same program template. Similarly, in the bottom two rows we see two different questions that map to the same program even before anonymization.
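The anonymization step can be illustrated with a minimal sketch. It assumes that entity mentions and their types are already known; the helper name `anonymize`, the toy questions and programs, and the underscore in the variable name `river_name0` are all illustrative assumptions rather than the original procedure.

```python
def anonymize(question, program, entities):
    """Replace each entity mention with a typed variable such as river_name0.

    `entities` maps a surface string to its (assumed) type,
    e.g. {"mississippi": "river_name"}.
    """
    counters = {}
    for surface, entity_type in entities.items():
        index = counters.get(entity_type, 0)
        counters[entity_type] = index + 1
        variable = f"{entity_type}{index}"
        # The same variable replaces the mention in both question and program.
        question = question.replace(surface, variable)
        program = program.replace(surface, variable)
    return question, program


# Two lexically different toy questions that share one program template.
q1, t1 = anonymize(
    "how long is the mississippi ?",
    'select river.length from river where river.name = "mississippi"',
    {"mississippi": "river_name"},
)
q2, t2 = anonymize(
    "what is the length of the colorado ?",
    'select river.length from river where river.name = "colorado"',
    {"colorado": "river_name"},
)
```

After anonymization the two questions still differ, while `t1` and `t2` are the same string, which is what makes them members of one program template.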
The data is then split in a manner such that a program template and all its accompanying questions belong to the same set, called the program split. This ensures that all test-set program templates are unobserved during training. For example, in an iid split of the data, it is possible that the question "what is the capital of France?" will appear in the training set, and the question "Name Spain's capital." will appear in the test set. Thus, the model only needs to memorize a mapping from question templates to program templates. However, in the program split, each program template is in either the training set or test set, and thus a model must generalize at test time to new combinations of predicates and entities (see Figure 1, Program split).
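A program split of this kind can be sketched as follows. This is a simplified illustration, not the exact procedure: the function name `program_split`, the `test_fraction` parameter, and the choice to assign whole templates at random are assumptions, and the sketch makes no attempt to balance the number of examples across sets.

```python
import random
from collections import defaultdict


def program_split(examples, test_fraction=0.2, seed=0):
    """Split (question, template) pairs so that no template is shared across sets."""
    by_template = defaultdict(list)
    for question, template in examples:
        by_template[template].append((question, template))

    # Assign whole templates, never individual examples, to the test set.
    templates = sorted(by_template)
    random.Random(seed).shuffle(templates)
    n_test = max(1, int(len(templates) * test_fraction))
    test_templates = set(templates[:n_test])

    train, test = [], []
    for template, group in by_template.items():
        (test if template in test_templates else train).extend(group)
    return train, test


# Toy data: five templates with three questions each.
examples = [(f"question {t}-{i}", f"template {t}") for t in range(5) for i in range(3)]
train, test = program_split(examples)
```

An iid split would instead shuffle the individual examples, so the same template could appear on both sides.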
We perform the compositional split proposed by Finegan-Dollak et al. (2018) on four text-to-SQL datasets from Finegan-Dollak et al. (2018) and one dataset for mapping questions to QDMR programs (Wolfson et al., 2020) on DROP (Dua et al., 2019). Exact experimental details are in §4.

Model
Finegan-Dollak et al. (2018) convincingly showed that a program split leads to low semantic parsing performance. However, they examined only a simple baseline parser, disregarding many standard variations that have been shown to improve in-distribution generalization, and might affect OOD generalization as well. In this section, we describe variants to both the model and training, and evaluate their effect on generalization in §5. We examine well-known choices, such as the effect of contextualized representations (§3.1) and grammar-based decoding (§3.2), as well as several novel extensions to the decoder attention (§3.3), which include (a) eliciting supervision (automatically) for the decoder attention distribution, (b) allowing attention over question spans, and (c) encouraging attention to cover all of the question tokens. For DROP, where the distribution over program templates is skewed, we also examine the effect of reducing this bias by downsampling frequent program templates (§3.4).
Baseline Semantic Parser A semantic parser maps an input question x into a program z, and in the supervised setup is trained from (x, z) pairs. Similar to Finegan-Dollak et al. (2018), our baseline semantic parser is a standard sequence-to-sequence model (Dong and Lapata, 2016) that encodes the question x with a BiLSTM encoder (Hochreiter and Schmidhuber, 1997) over GloVe embeddings (Pennington et al., 2014), and decodes the program z token-by-token from left to right with an attention-based LSTM decoder (Bahdanau et al., 2015).

Contextualized Representations
Pre-trained contextualized representations revolutionized natural language processing in recent years, and semantic parsing has been no exception (Guo et al., 2019). We hypothesize that better representations for question tokens should improve compositional generalization, because they reduce language variability and thus may help improve the mapping from input to output tokens. We evaluate the effect of using ELMO and BERT (Devlin et al., 2019) to represent question tokens.

Grammar-Based Decoding
A unique property of semantic parsing, compared to other generation tasks, is that programs have a clear hierarchical structure that is based on the target formal language. Decoding the output program token-by-token from left to right (Dong and Lapata, 2016; Jia and Liang, 2016) can thus generate programs that are not syntactically valid, and the model must effectively learn the syntax of the target language at training time. Grammar-based decoding resolves this issue and has been shown to consistently improve in-distribution performance (Rabinovich et al., 2017; Yin and Neubig, 2017). In grammar-based decoding, the decoder outputs the abstract syntax tree of the program based on a formal grammar of the target language. At each step, a production rule from the grammar is chosen, eventually outputting a top-down left-to-right linearization of the program tree. Because decoding is constrained by the grammar, the model outputs only valid programs. We refer the reader to the aforementioned papers for details on grammar-based decoding.
Compositional generalization involves combining known sub-structures in novel ways. In grammar-based decoding, the structure of the output program is explicitly generated, and this could potentially help compositional generalization. We discuss the grammars used in this work in §4.
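To make the top-down, left-to-right linearization concrete, the following sketch expands a sequence of production rules into program tokens and rejects rules that the grammar does not allow for the current non-terminal. The toy grammar `TOY_GRAMMAR` and the function `expand` are hypothetical and far smaller than the grammars used in this work; in a real decoder the rule at each step would be chosen by the model rather than read from a list.

```python
TOY_GRAMMAR = {
    # non-terminal: list of allowed right-hand sides
    "query": [("select", "column", "from", "table")],
    "column": [("state.name",), ("state.area",)],
    "table": [("state",)],
}


def expand(rule_sequence, grammar, start="query"):
    """Apply production rules top-down, left-to-right, and return the terminal tokens."""
    rules = iter(rule_sequence)

    def derive(symbol):
        if symbol not in grammar:  # terminal symbol: emit as-is
            return [symbol]
        rhs = next(rules)  # the rule chosen at this decoding step
        if rhs not in grammar[symbol]:
            raise ValueError(f"rule {rhs} is not valid for non-terminal {symbol}")
        tokens = []
        for child in rhs:  # children are expanded left to right
            tokens.extend(derive(child))
        return tokens

    return derive(start)


program = expand(
    [("select", "column", "from", "table"), ("state.area",), ("state",)],
    TOY_GRAMMAR,
)
```

Because each step may only pick a rule whose left side is the non-terminal being expanded, every completed derivation is a syntactically valid program under the toy grammar.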

Decoder Attention
Semantic parsers use attention-based decoding: at every decoding step, the model computes a distribution $(p_1, \ldots, p_n)$ over the question tokens $x = (x_1, \ldots, x_n)$ and the decoder computes its next prediction based on the weighted average $\sum_{i=1}^{n} p_i \cdot h_i$, where $h_i$ is the encoder representation of $x_i$. Attention has been shown to both improve in-distribution performance (Dong and Lapata, 2016) and also lead to better compositional generalization (Finegan-Dollak et al., 2018), by learning a soft alignment between question and program tokens. Since attention is the component in a sequence-to-sequence model that aligns parts of the input to parts of the output, we propose new extensions to the attention mechanism, and examine their effect on compositional generalization.
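As a small illustration of the weighted average used by the decoder, the sketch below turns unnormalized attention scores into a distribution and combines the encoder representations accordingly. How the scores themselves are computed is not shown and is taken as given; the function names and the toy vectors are assumptions for illustration only.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(scores, encoder_states):
    """Return the attention distribution p and the weighted average of the h_i."""
    p = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(p[i] * encoder_states[i][d] for i in range(len(p)))
        for d in range(dim)
    ]
    return p, context


# Two question tokens with equal scores: the context is the mean of h_1 and h_2.
p, context = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```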
(a) Attention Supervision Intuitively, learning good alignments between question and program tokens should improve compositional generalization: a model that correctly aligns the token largest to the predicate max should output this predicate when encountering largest in novel contexts.
To encourage learning better alignments, we supervise the attention distribution computed by the decoder to attend to specific question tokens at each time-step (Liu et al., 2016). We use an off-the-shelf word aligner to produce a "gold" alignment between question and program tokens (where program tokens correspond to grammar rules in grammar-based decoding) for all training set examples. Then, at every decoding step where the next prediction symbol's "gold" alignment is to question tokens at indices $I$, we add the term $-\log \sum_{i \in I} p_i$ to the objective, pushing the model to put attention probability mass on the aligned tokens. We use the FastAlign word alignment package (Dyer et al., 2013), based on IBM model 2, which is a generative model that allows extracting word alignments from a parallel corpus without any annotated data. Figure 2 shows an example question-program pair and the alignments induced by FastAlign.
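The auxiliary term can be sketched for a single decoding step as follows. The function name, the small floor used to avoid taking the logarithm of zero, and the toy attention values are assumptions added for illustration; steps without any aligned question token contribute no extra loss, in line with the description above.

```python
import math


def attention_supervision_loss(attention, aligned_indices):
    """Return -log of the attention mass placed on the aligned question tokens.

    `attention` is the distribution (p_1, ..., p_n) for one decoding step and
    `aligned_indices` is the set I of "gold" aligned token positions.
    """
    if not aligned_indices:  # no alignment for this step: no extra term
        return 0.0
    mass = sum(attention[i] for i in aligned_indices)
    return -math.log(max(mass, 1e-12))  # the floor is an assumed numerical safeguard


# Toy step: 90% of the attention mass falls on the aligned tokens 1 and 2.
loss = attention_supervision_loss([0.1, 0.7, 0.2], {1, 2})
```

The term shrinks toward zero as more of the attention mass moves onto the aligned tokens, and it does not constrain how that mass is distributed among them.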
(b) Attention over Spans Question spans can align to subtrees in the corresponding program. For example, in Fig. 1, largest state aligns to state.area = (select max . . . from state). Similarly, in a question such as "What does Lionel Messi do for a living?", the multiword phrase "do for a living" aligns to the KB relation Profession. Allowing the model to directly attend to multi-token phrases could induce more meaningful alignments that improve compositional generalization.
Here, rather than computing an attention distribution over input tokens $(x_1, \ldots, x_n)$, we compute a distribution over the set of spans corresponding to all constituents (including all tokens) as predicted by an off-the-shelf constituency parser (Joshi et al., 2018). Spans are represented using a self-attention mechanism over the hidden representations of the tokens in the span, as in Lee et al. (2017).
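The sketch below illustrates the two ingredients under simplifying assumptions: the set of candidate spans (all single tokens plus the constituents returned by a parser) and a self-attentive pooling of token representations within a span. The per-token scores would normally be produced by learned parameters; here they are passed in directly, and all names and values are hypothetical.

```python
import math


def candidate_spans(n_tokens, constituents):
    """All single-token spans plus the constituent spans, without duplicates."""
    spans = {(i, i) for i in range(n_tokens)}
    spans.update(constituents)
    return sorted(spans)


def span_representation(token_states, token_scores, start, end):
    """Self-attentive pooling over the tokens start..end (inclusive)."""
    scores = token_scores[start:end + 1]
    largest = max(scores)
    weights = [math.exp(s - largest) for s in scores]
    total = sum(weights)
    dim = len(token_states[0])
    return [
        sum(w / total * token_states[start + k][d] for k, w in enumerate(weights))
        for d in range(dim)
    ]


# Toy example: three tokens and two constituents as (start, end) pairs.
spans = candidate_spans(3, [(1, 2), (0, 2)])
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
rep = span_representation(states, [0.0, 0.0, 0.0], 0, 1)
```

The decoder attention is then computed over these span representations instead of over individual token representations.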
(c) Coverage Questions at test time are sometimes similar to training questions, but include new information expressed by a few tokens. A model that memorizes a mapping from question templates to programs can ignore this new information, hampering compositional generalization. To encourage models to attend to the entire question, we add the attention-coverage mechanism from See et al. (2017) to our model. Specifically, at each decoding step the decoder holds a coverage vector $c = (c_1, \ldots, c_n)$, where $c_i$ corresponds to the sum of attention probabilities over $x_i$ in all previous time steps. The coverage vector is given as another input to the decoder, and a loss term is added that penalizes attending to tokens with high coverage: $\sum_{i=1}^{n} \min(c_i, p_i)$, encouraging the model to attend to tokens not yet attended to.
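The bookkeeping for the coverage vector and the coverage loss can be sketched as below. The function name and the toy attention distributions are illustrative assumptions; feeding the coverage vector to the decoder as an extra input is not shown.

```python
def coverage_step(coverage, attention):
    """Return the coverage loss for one decoding step and the updated coverage vector.

    `coverage[i]` is the attention mass token i received in all previous steps;
    `attention[i]` is the mass it receives in the current step.
    """
    loss = sum(min(c, p) for c, p in zip(coverage, attention))
    updated = [c + p for c, p in zip(coverage, attention)]
    return loss, updated


coverage = [0.0, 0.0, 0.0]
# Step 1: all attention on token 0; nothing has been covered yet, so no penalty.
loss1, coverage = coverage_step(coverage, [1.0, 0.0, 0.0])
# Step 2: half of the attention returns to the already covered token 0.
loss2, coverage = coverage_step(coverage, [0.5, 0.5, 0.0])
```

In this toy run only the mass that revisits token 0 is penalized, while the mass placed on the previously unattended token 1 is free.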

Downsampling Frequent Program Templates
Training a semantic parser can be hampered if the training data contains a highly skewed distribution over program templates, i.e., a large fraction of the training examples correspond to the same template.
In such a biased environment, the model might memorize question-to-template mappings instead of modeling the underlying structure of the problem. We propose to downsample examples from frequent templates such that the resulting training data has a more balanced template distribution.
Our initial investigation showed that the distribution over program templates in DROP is highly skewed (20 templates out of 111 constitute 90% of the data), making it difficult to achieve any generalization to examples from the program split. Thus, in DROP, for any program template in the training set where there are more than 20 examples, we randomly sample 20 examples for training. Downsampling is related to AFLite (Sakaguchi et al., 2020; Bras et al., 2020), an algorithmic approach to bias reduction in datasets. AFLite is applied when bias is hard to define; as we have direct access to a skewed program distribution, we can take a much simpler approach for reducing bias.
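A minimal sketch of this downsampling step, assuming training examples are given as (question, template) pairs; the function name, the seed handling, and the toy data are illustrative assumptions.

```python
import random
from collections import defaultdict


def downsample(examples, max_per_template=20, seed=0):
    """Keep at most `max_per_template` randomly chosen examples per program template."""
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for question, template in examples:
        by_template[template].append((question, template))

    kept = []
    for group in by_template.values():
        if len(group) > max_per_template:
            group = rng.sample(group, max_per_template)
        kept.extend(group)
    return kept


# Toy skewed data: one frequent template (100 examples) and one rare template (5).
examples = [(f"qa{i}", "A") for i in range(100)] + [(f"qb{i}", "B") for i in range(5)]
kept = downsample(examples)
```

Rare templates are left untouched, so the procedure only flattens the head of the template distribution.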

Datasets
We create iid and program splits for five datasets according to the procedure of Finegan-Dollak et al. (2018) as described in §2: four text-to-SQL datasets from Finegan-Dollak et al. (2018) and one dataset for mapping questions to QDMR programs (Wolfson et al., 2020) in DROP (Dua et al., 2019). Similar to prior work (Finegan-Dollak et al., 2018), we train and test models on anonymized programs, that is, entities are replaced with typed variables (§2). Table 1 gives an example question and program for each of these datasets.
• ATIS: questions for a flight-booking task (Price, 1990; Dahl et al., 1994).
SQL Grammar: We adapt the SQL grammar developed for ATIS (Lin et al., 2019) to cover the four SQL datasets. To achieve that, additional data normalization steps were taken (see appendix), such as rewriting programs to have a consistent SQL style. The grammar uses the DB schema to produce domain-specific production rules, e.g., in ATIS, table name → FLIGHTSalias0, column name → FLIGHTSalias0.MEAL DESCRIPTION, and value → class type0. At inference time, we enforce context-sensitive constraints that eliminate production rules that are invalid given the previous context. For example, in the WHERE clause, the set of column name rules is limited to columns that are part of previously mentioned tables. These constraints reduce the number of syntactically invalid programs, but do not eliminate them completely.
DROP Grammar: We manually develop a grammar over QDMR programs to perform grammar-based decoding for DROP, similar to prior work. This grammar contains typed operations required for answering questions, such as ARITHMETIC diff(NUM, NUM) → NUM, SELECT num(PassageSpan) → NUM, and SELECT → PassageSpan.
Because QDMR programs are executed over text paragraphs (rather than a KB), QDMR operators receive string arguments as inputs (analogous to KB constants), which we remove for anonymization (Table 1). This results in program templates that include only the logical operations required for finding the answer. While such programs cannot be executed as-is on a database, they are sufficient for the purpose of testing compositional generalization in semantic parsing, and can be used as "layouts" in a neural module network approach.

Experiments
We now present our empirical evaluation of compositional generalization.

Experimental Setup
We create training/development/test splits using both an iid split and a program split, such that the number of examples is similar across splits. Table 2 presents exact statistics on the number of unique examples and program templates for all datasets. There are far fewer new templates in the development and test sets for the iid split than for the program split, thus the iid split requires less compositional generalization. In DROP, we report results for the downsampled dataset (§3.4), and analyze downsampling below.
Evaluation Metric We evaluate models using exact match (EM), that is, whether the predicted program is identical to the gold program. In addition, we report relative gap, defined as $1 - \frac{\mathrm{EM}_{\text{program}}}{\mathrm{EM}_{\text{iid}}}$, where $\mathrm{EM}_{\text{program}}$ and $\mathrm{EM}_{\text{iid}}$ are the EM on the program and iid splits, respectively. This metric measures the gap between in-distribution generalization and OOD generalization, and our goal is to minimize it (while additionally maximizing $\mathrm{EM}_{\text{iid}}$).
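As a worked illustration of the metric, the sketch below computes the relative gap for a pair of EM scores. The numbers are hypothetical and chosen only for easy arithmetic; they are not results from this work.

```python
def relative_gap(em_program, em_iid):
    """Relative gap between program-split EM and iid-split EM: 1 - EM_program / EM_iid."""
    return 1.0 - em_program / em_iid


# Hypothetical scores: 80 EM on the iid split and 20 EM on the program split.
gap = relative_gap(em_program=20.0, em_iid=80.0)
```

A gap of 0 would mean the model does as well on unseen templates as on seen ones, while a gap of 1 would mean it never produces a correct program on the program split.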
We select hyper-parameters by tuning the learning rate, batch size, dropout, hidden dimension, and use early-stopping w.r.t. development set EM (specific values are in the appendix). The results reported are averaged over 5 different random seeds.
Evaluated Models Our goal is to measure the impact of various modeling choices on compositional generalization. We term our baseline sequence-to-sequence semantic parser SEQ2SEQ, and denote the parser that uses grammar-based decoding by GRAMMAR (§3.2). Use of contextualized representations in these parsers is denoted by +ELMO and +BERT (§3.1). We also experiment with the proposed additions to the decoder attention (§3.3). In a parser, use of (a) auxiliary attention supervision obtained from FastAlign is denoted by +ATTNSUP, (b) attention over constituent spans by +ATTNSPAN, and (c) the attention-coverage mechanism by +COVERAGE.

Table 1 (example question x and program z for each dataset):

Dataset: ATIS
x: what is the distance from airport code0 airport to city name0 ?
z: select distinct airport service.miles distant from airport as airport , airport service as airport service , city as city where airport.airport code = "airport code0" and airport.airport code = airport service.airport code and city.city code = airport service.city code and city.city name = "city name0"

Dataset: SCHOLAR
x: What papers has authorname0 written?
z: select distinct paper.paperid from author as author , paper as paper , writes as writes where author.authorname = "authorname0" and writes.authorid = author.authorid and writes.paperid = paper.paperid

Dataset: ADVISING
x: Can undergrads enroll in the course number0 ?
z: select distinct course.advisory requirement , course.enforced requirement , course.name from course as course where course.department = "department0" and course.number = number0

Dataset: DROP
x: How many yards longer was Johnson's longest touchdown compared to his shortest touchdown of the first quarter?
z: ARITHMETIC diff( SELECT num( ARGMAX( SELECT ) ) SELECT num( ARGMIN( FILTER( SELECT ) ) ) )

Main Results
Below we present the performance of our various models on the test set, and discuss the impact of these modeling choices. For SQL, we present results averaged across the four datasets, and report the exact numbers for each dataset in Table 9.
Baseline Performance The top row in Table 3 shows the performance of our baseline SEQ2SEQ model using GloVe representations. In SQL, it achieves 74.9 EM on the iid split and 10.8 EM on the program split, and in DROP, 45.4 EM and a surprisingly low 1.6 EM on the iid and program splits, respectively. A possible reason for the low program split performance on DROP is that programs include only logical operations without any KB constants (§4), making generalization to new compositions harder than in SQL (see also analysis in §5.3). As observed by Finegan-Dollak et al. (2018), there is a large relative gap in performance on the iid vs. program split.
Table 3 shows that contextualized representations consistently improve absolute performance and reduce the relative gap in DROP. In SQL, contextualized representations improve absolute performance and reduce the relative gap in the SEQ2SEQ model, but not in the GRAMMAR model. The relative gap is reduced by roughly 7 points in SQL, and 17 points in DROP. As ELMO performs slightly better than BERT, we present results only for ELMO in some of the subsequent experiments, and report results for BERT in Table 9.
Table 3 shows that grammar-based decoding both increases accuracy and reduces the relative gap on DROP in all cases. In SQL, grammar-based decoding consistently decreases the absolute performance compared to SEQ2SEQ. We conjecture this is because our SQL grammar contains a large set of rules meant to support the normalized SQL structure of Finegan-Dollak et al. (2018), which makes decoding this structure challenging. We provide further in-depth comparison of performance in §5.3.
Attention Supervision Table 4 shows that attention supervision has a substantial positive effect on compositional generalization, especially in SQL. In SQL, adding auxiliary attention supervision to a SEQ2SEQ model improves the program split EM from 10.8 → 18.5, and combining with ELMO leads to an EM of 20.3. Overall, using ELMO and ATTNSUP reduces the relative gap from 84.6 → 70.6 compared to SEQ2SEQ. In DROP, attention supervision improves iid performance and reduces the relative gap for GRAMMAR using GloVe representations, but does not lead to additional improvements when combined with ELMO.
Attention-coverage Table 5 shows that attention-coverage improves absolute performance and compositional generalization in all cases. Interestingly, in SQL, best results are obtained without the attention coverage loss term, but still providing the coverage vector as additional input to the decoder. In SQL, adding attention-coverage improves program split EM from 10.8 → 17.
Combining coverage with ELMO and ATTNSUP leads to our best results, where program split EM reaches 25.4, and the relative gap drops from 84.6 → 62.2 (with a slight drop in iid split EM). In DROP, using attention-coverage mechanism with auxiliary coverage loss improves iid performance from 53.2 → 64.6 and reduces the relative gap from 96 → 93.1.
Attention over Spans Table 6 shows that, without ELMO, attention over spans improves iid and program split EM in both SQL and DROP, but when combined with ELMO differences are small and inconsistent.
Downsampling Frequent Templates Table 7 shows that for DROP, where the distribution over program templates is extremely skewed, downsampling training examples for frequent templates leads to better compositional generalization in all models. For example, without downsampling (w/o DS), program split EM drops from 13.2 → 0.8 for the GRAMMAR+ELMO model.
Takeaways We find that contextualized representations, attention supervision, and attention coverage generally improve program split EM and reduce the relative gap, perhaps at a small cost to iid split EM. In DROP, grammar-based decoding is important, as well as downsampling of frequent templates. Overall the gap between in-distribution and OOD performance dropped from 84.6 → 62.2 for SQL, and from 96.4 → 77.1 for DROP. While this improvement is significant, it leaves much to be desired in terms of models and training procedures that truly close this gap.

Analysis
Error Analysis We analyze the errors of each model on the program split development set for all SQL datasets and label each example with one of three categories (Table 8): Seen programs are errors resulting from outputting program templates that appear in the training set, while new programs are wrong programs that were not observed in the training set. Invalid syntax errors are outputs that are syntactically invalid programs. Table 8 shows that for SEQ2SEQ models, those that improve compositional generalization also increase the frequency of new programs and invalid syntax errors. Grammar-based models output significantly more new programs than SEQ2SEQ models, and fewer invalid syntax errors. Overall, the correlation between successful compositional generalization and the rate of new programs is inconsistent.
We further inspect 30 random predictions of multiple models on both the program split and the iid split (Table 10). Semantically equivalent errors are predictions that are equivalent to the target programs. Semantically similar is a relaxation of the former category (e.g., an output that represents "flights that depart at time0", where the gold program represents "flights that depart after time0"). Limited divergence or significant divergence corresponds to invalid programs that slightly or significantly diverge from the target output, respectively. Table 10 shows that adding attention-supervision, attention-coverage, and attention over spans increases the number of predictions that are semantically close to the target programs. We also find that the frequency of correct typed variables in predictions is significantly higher when using attention-supervision and attention-coverage compared to the baseline model (p < 0.05). In addition, the errors of the GRAMMAR model tend to be closer to the target program compared to SEQ2SEQ.
Compositional Generalization in DROP Our results show that compositional generalization in DROP is harder than in the SQL datasets. We hypothesized that this could be due to the existence of KB relations in SQL programs after program anonymization, while QDMR programs do not contain any arguments. To assess that, we further anonymize the predicates in all SQL programs in all four datasets, such that the SQL programs do not contain any KB constants at all (similar to DROP). We split the data based on this anonymization, and term it the KB-free split. On the development set, when moving from a program split to a KB-free split, the average accuracy drops from 14.5 → 9.8. This demonstrates that indeed a KB-free split is harder than the program split from Finegan-Dollak et al. (2018), partially explaining the difference between SQL and DROP.

Conclusion
We presented a comprehensive evaluation of compositional generalization in semantic parsers by analyzing the performance of a wide variety of models across 5 different datasets. We experimented with well-known extensions to sequence-to-sequence models and also proposed novel extensions to the decoder's attention mechanism. Moreover, we proposed reducing dataset bias towards a heavily skewed program template distribution by downsampling examples from frequent templates.
We find that our proposed techniques improve generalization to OOD examples. However, the generalization gap between in-distribution and OOD data remains high. This suggests that future research in semantic parsing should consider more drastic changes to the prevailing encoder-decoder approach to address compositional generalization.

A SQL Style
SQL programs vary in style across datasets. We address one specific syntactic difference, both to keep it from interacting with the models we analyze and to allow comparability across models and datasets. We standardize the form <table1> <join> <table2> ON <condition> by replacing <join> with a comma and adding <condition> to the WHERE clause.
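The rewrite can be sketched with regular expressions for a deliberately restricted case. The sketch assumes a single flat query with plain or inner joins, no nested sub-queries, and no clauses other than an optional WHERE after the joins; the function name and the choice to place the moved conditions at the start of the WHERE clause are assumptions, not the normalization code used in this work.

```python
import re

# Matches " [inner] join <table> on <cond>", stopping before the next join,
# the WHERE keyword, or the end of the query.
JOIN_ON = re.compile(
    r"\s+(?:inner\s+)?join\s+(?P<table>.+?)\s+on\s+(?P<cond>.+?)"
    r"(?=\s+(?:inner\s+)?join\s|\s+where\s|$)",
    re.IGNORECASE,
)


def normalize_joins(sql):
    """Replace '<t1> JOIN <t2> ON <cond>' with '<t1> , <t2>' and move <cond> to WHERE."""
    conditions = []

    def to_comma(match):
        conditions.append(match.group("cond"))
        return " , " + match.group("table")

    rewritten = JOIN_ON.sub(to_comma, sql)
    if not conditions:
        return sql
    moved = " and ".join(conditions)
    if re.search(r"\swhere\s", rewritten, re.IGNORECASE):
        # Prepend the moved join conditions to the existing WHERE clause.
        return re.sub(
            r"(\swhere\s)",
            lambda m: m.group(1) + moved + " and ",
            rewritten,
            count=1,
            flags=re.IGNORECASE,
        )
    return rewritten + " where " + moved


normalized = normalize_joins(
    "select a.x from a as a join b as b on a.id = b.id where a.y = 1"
)
```

Outer joins, sub-queries, and GROUP BY or ORDER BY clauses without a WHERE clause would need a real SQL parser rather than this pattern.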

B SQL Grammar Development
Our SQL grammar is a context-free grammar. We fit an existing implementation for text-to-SQL (Lin et al., 2019) to the datasets we experimented with. Examples for grammar rules are in Table 12. At each step, a sequence of non-terminal or terminal expressions (right side) is derived from some nonterminal (left side).
The SQL programs in the text-to-SQL datasets have aliases for all tables, sub-queries, and custom fields. Also, each column in the program is preceded by an aliased table or a sub-query. To allow the model to generate all aliases, we add terminal rules based on the dataset schema. We modify the rules to create sub-queries and fields so that the use of aliases is enforced, and we add the alias patterns for custom field and tables. We add the table names in the schema, concatenated with the alias patterns, to table name. We define col ref as the concatenation of an aliased table and a column of this table. Additionally, we add valid combinations of aliased variables and schema entities.
To allow comparability with SEQ2SEQ models, we use only examples that are parsed by the grammar in the development and test sets, eliminating 39 examples from ADVISING, 18 from ATIS and one example from GEOQUERY. The grammar covers at least 95% of each train set.
During inference we enforce contextual rules. For example, we force the derivation of from clause to include the tables that were selected in select results. We check validity by executing the programs against the dataset database in MySQL Server 5.7. Some of the programs in our datasets were not executable due to inconsistent use of aliases, or partial column references. We were not able to automatically fix all the programs. We relaxed our constraints to allow the generation of all target programs, hence allowing some invalid outputs.
For all models, we use pre-trained GloVe embeddings of size 100, and the target embedding dimension is 100. Encoder hidden size is selected from {200, 300}. Dropout is kept fixed at p = 0.5. We train each model with five random seeds. We perform a grid-search and use accuracy on the development set for model selection. ELMO and BERT representations are concatenated to the trainable 100 dimension GloVe embeddings. For BERT we use the top layer of the bert-base-uncased model. ELMO and BERT based models are trained with the Noam learning rate scheduler, with 800, 600, or 400 warmup steps.
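For reference, the Noam schedule increases the learning rate linearly during warmup and then decays it with the inverse square root of the step number. The sketch below implements the standard formula; the `model_size` and `factor` values are not specified above and are therefore placeholders chosen for illustration.

```python
def noam_lr(step, model_size, warmup_steps, factor=1.0):
    """Noam learning rate: linear warmup followed by inverse square-root decay."""
    step = max(step, 1)  # avoid raising zero to a negative power
    return factor * model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Placeholder model size of 300 and one of the warmup settings listed above (400).
rates = [noam_lr(s, model_size=300, warmup_steps=400) for s in (200, 400, 800)]
```

With these settings the rate peaks at step 400, so a smaller warmup value reaches the peak earlier and starts decaying sooner.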
For the ATTNSUP and COVERAGE models, the additional loss term scaling hyper-parameter was tuned using the values {0.0, 0.1, 0.5, 1.0, 2.5, 5.0}. For our best performing models, SEQ2SEQ+COVERAGE+ELMO, on all datasets, we used an encoder-decoder hidden size of 300, with coverage loss parameter 0. Learning rate was set to 0.0001 for ATIS, and 0.001 for the other datasets.
Table 12 (examples of grammar rules):

Global structure
query → select core, groupby clause, orderby clause, limit
select core → select with distinct, select results, from clause, "WHERE", where clause

Select clause
select results → select result, ",", select result
select result → function

From clause
source → single source, ",", source
single source → source subq
source subq → "(", query, ")", "AS", subq alias
source table → "TABLE PLACEHOLDER", "AS", table name

Where clause
where clause → expr, ",", where conj
where conj → "AND", where clause

Group by clause
groupby clause → "GROUP BY", group clause
group clause → expr, group clause

Expressions
expr → value, "BETWEEN", value, "AND", value
value → col ref

Terminal rules
table name → "FLIGHTalias0"
column name → "FLIGHT ID"
col ref → "FLIGHTalias0.FLIGHT ID"
col alias → "DERIVED FIELDalias0"
subq alias → "DERIVED TABLEalias0"

DROP hyper-parameters Similar to SQL, we perform a grid-search to choose hyper-parameters based on the development set accuracy. We tune the following parameters in the specified range and select a single value for all experiments (denoted by bold): learning rate for Adam optimizer in range {0.001, 0.0005}, batch-size in {4, 16, 32, 64}, and hidden-size for the encoder-decoder LSTMs in {100, 200}. Dropout is kept fixed at p = 0.2, gradient clipping is performed with a norm threshold of 5.0, beam-size is set to 5, and training is stopped early if the development set accuracy does not improve for 15 consecutive epochs.

Table 13: iid development set exact match for all models on the DROP dataset. We do not create a program-split development set for DROP, one containing templates not seen in training or test. Instead, we use the same iid development set to choose the best model for both iid and program split settings. Note that this is a more challenging setting, since the model selection for the program split is also done on the basis of an in-distribution development set.