On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries

Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce SQUALL, a dataset that enriches 11,276 WIKITABLEQUESTIONS English-language questions with manually created SQL equivalents plus alignments between SQL and question fragments. Our annotation enables new training possibilities for encoder-decoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention and (2) an auxiliary objective of disambiguating references in the input questions to table columns. In 5-fold cross validation, these strategies improve over strong baselines by 4.4% execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to 23.9%.


Introduction
The availability of large-scale datasets pairing natural utterances with logical forms (Dahl et al., 1994; Wang et al., 2015; Zhong et al., 2017; Yu et al., 2018, inter alia) has enabled significant progress on supervised approaches to semantic parsing (Jia and Liang, 2016; Xiao et al., 2016; Dong and Lapata, 2016, 2018, inter alia). However, the provision of logical forms alone does not indicate important fine-grained relationships between individual words or phrases and logical form tokens. This is unfortunate because researchers have in fact hypothesized that the lack of such alignment information hampers progress in semantic parsing (Zhang et al., 2019, pg. 80).

* Equal contribution; listed in alphabetical order.

Figure 1: Two examples from SQUALL. The table-question-answer triplets come from WIKITABLEQUESTIONS. We provide the logical forms as SQL plus alignments between question and logical form. In the bottom example, for instance, "the highest" ↔ ORDER BY and LIMIT 1, as indicated by both matching highlight color ( blue ) and circled-number labels ( 2 ).
We address this lack by introducing SQUALL, 1 the first large-scale semantic-parsing dataset with manual lexical-to-logical alignments; and we investigate the potential accuracy boosts achievable from such alignments. The starting point for SQUALL is WIKITABLEQUESTIONS (WTQ; Pasupat and Liang, 2015), containing data tables, English questions regarding the tables, and table-based answers. We manually enrich the 11,276-instance subset of WTQ's training data that is translatable to SQL by providing expert annotations, consisting not only of target logical forms in SQL, but also labeled alignments between the input question tokens (e.g., "how many") and their corresponding SQL fragments (e.g., COUNT(. . .)). Figure 1 shows two SQUALL instances.
These new data enable training of encoder-decoder neural models that incorporate manual alignments. Consider the bottom example in Figure 1: A decoder can benefit from knowing that ORDER BY . . . LIMIT 1 comes from "the highest" (where rank 1 is best); and an encoder should match "who" with the "athlete" column even though the two strings have no overlapping tokens. We implement these ideas with two training strategies: 1. Supervised attention that guides models to produce attention weights mimicking human judgments during both encoding and decoding. Supervised attention has improved both alignment and translation quality in machine translation (Liu et al., 2016; Mi et al., 2016), but has only been applied in semantic parsing to heuristically generated alignments (Rabinovich et al., 2017) due to the lack of manual annotations.
2. Column prediction that infers which column in the data table a question fragment refers to.
Using BERT features, our models reach 54.1% execution accuracy on the WTQ test set, surpassing the previous weakly-supervised state of the art of 48.8% (where weak supervision means access to only the answer, not the logical form, of the question). More germane to the issue of alignment utility, in 5-fold cross validation, our additional fine-grained supervision improves execution accuracy by 4.4% over models supervised with only logical forms; ablation studies indicate that mappings between question tokens and columns help the most. Additionally, we construct oracle models that have access to the full alignments at test time to show the unrealized potential of our data, seeing improvements of up to 23.9% absolute logical form accuracy.
Through annotation-cost and learning-curve analysis, we conclude that lexical alignments are cost-effective for training parsers: lexical alignments take less than half the time to annotate as a logical form does, and we can improve execution accuracy by 2.5 percentage points by aligning merely 5% of the logical forms in the training set.
Our contributions are threefold: 1) we release a high-quality semantic parsing dataset with manually-annotated logical forms; 2) we label the alignments between the English questions and the corresponding logical forms to provide additional supervision; 3) we propose two training strategies that use our alignments to improve strong base models. Our dataset and code are publicly available at https://www.github.com/tzshi/squall.

Task: Table-based Semantic Parsing
Our task is to answer questions about structured tables through semantic parsing to logical forms (LFs). Formally, the input x = (q, T ) consists of a question q about a table T , and the goal of a semantic parser is to reproduce the target LF y for q (and thus have high LF accuracy) or, in a less strict setting, to generate any query LF y that, when executed against T , yields the correct output z (and thus have high execution accuracy).
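The contrast between the two metrics can be sketched with an in-memory SQLite table (the table name w follows the paper's convention; the rows and both queries here are invented for illustration):

```python
import sqlite3

def execution_match(pred_sql, gold_sql, rows):
    """Execute both queries against a toy table and compare denotations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE w (id INTEGER, c1 TEXT, c2_number INTEGER)")
    conn.executemany("INSERT INTO w VALUES (?, ?, ?)", rows)
    pred = conn.execute(pred_sql).fetchall()
    gold = conn.execute(gold_sql).fetchall()
    conn.close()
    return pred == gold

rows = [(1, "alice", 25000), (2, "bob", 12000)]
# Two different logical forms with the same denotation: exact-match LF
# accuracy counts the prediction as wrong, execution accuracy as right.
assert execution_match("SELECT MAX(c2_number) FROM w",
                       "SELECT c2_number FROM w ORDER BY c2_number DESC LIMIT 1",
                       rows)
```

This is why ACC EXE is always at least as high as ACC LF: any exact LF match also executes to the gold answer.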
In a weakly supervised setting, training examples consist only of input-answer pairs (x, z). Recent datasets (Zhong et al., 2017; Yu et al., 2018, inter alia) provide enough logical forms, i.e., (x, y) training pairs, to learn mappings from x to y in a supervised setting. Unsurprisingly, supervised models are more accurate than weakly supervised ones. However, training supervised models is still challenging: both x and y are structured, so models typically generate y in multiple steps, but the training data do not reveal which parts of x generate which parts of y or how they are combined.
Just as adding supervised training improves accuracy over weak supervision, we explore whether even finer-grained supervision further helps. Since no large-scale datasets furnishing fine-grained supervision exist (to the best of our knowledge), we introduce SQUALL.

SQUALL: Our New Dataset
SQUALL is based on WIKITABLEQUESTIONS (WTQ; Pasupat and Liang, 2015). WTQ is a large-scale question-answering dataset that contains diverse and challenging crowd-sourced question-answer pairs over 2,108 semi-structured Wikipedia tables. Most of the questions are more than simple table-cell look-ups and are highly compositional, a fact that motivated us to study lexical mappings between questions and logical forms. We hand-generate SQL equivalents of the WTQ queries and align question tokens with corresponding SQL query fragments. 2 We leave lexical alignments of other text-to-SQL datasets and cross-dataset model generalization to future work.

Data Annotation
We annotated WTQ's training fold in three stages: database construction, SQL query annotation, and alignment. Two expert annotators familiar with SQL annotated half of the dataset each and then checked each other's annotations and resolved all conflicts via discussion. See Appendix C for the annotation guidelines.

Database Construction
Tables encode semi-structured information. Each table column usually contains data of the same type (e.g., text, numbers, dates), as is typical in relational databases. While pre-processing the WTQ tables, we considered both basic data types (e.g., raw text, numbers) and composite types (e.g., lists, binary tuples), and we suffixed column names with their inferred data types (e.g., number in Figure 1). For annotation consistency, all tables were assigned the same name w and columns were given the sequential names c1, c2, . . . in the database schema, but we kept the original table headers for feature extraction. We additionally added a special column id to every table denoting the linear order of its rows. See Appendix D for details.
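This pre-processing step might be sketched as follows. The regular-expression patterns and helper names are illustrative assumptions, not the paper's actual pipeline, and cover only two of the nine basic types listed in Appendix D:

```python
import re

# Illustrative patterns for two of the basic data types.
CELL_PATTERNS = [
    ("number", r"^-?\d+(,\d{3})*(\.\d+)?$"),
    ("fraction", r"^\d+/\d+$"),
]

def infer_cell_type(cell):
    for typ, pat in CELL_PATTERNS:
        if re.match(pat, cell.strip()):
            return typ
    return "text"  # fallback: unrecognized content is raw text

def build_schema(headers, columns):
    """Assign uniform names w.c1, w.c2, ... with type suffixes, plus an id
    column recording row order; original headers are kept for features."""
    schema = [("id", "id", "number")]
    for i, (header, cells) in enumerate(zip(headers, columns), start=1):
        types = {infer_cell_type(c) for c in cells}
        # A column gets a non-text type only if every cell agrees on it.
        typ = types.pop() if len(types) == 1 else "text"
        name = f"c{i}_{typ}" if typ != "text" else f"c{i}"
        schema.append((name, header, typ))
    return schema
```

For example, a two-column table with headers "Name" and "Attendance" would yield columns c1 (text) and c2_number, matching the naming convention in Figure 1.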
Conversion of Queries to SQL For every question in WTQ's training fold, we manually created its corresponding SQL query, choosing the shortest when there were multiple possibilities; for instance, we wrote "SELECT MAX(c1) FROM w" instead of "SELECT c1 FROM w ORDER BY c1 DESC LIMIT 1". An exception is that we opted for versions less dependent on table structure even if their complexity was higher. As an example, if the table listed games (c2) pre-sorted by date (c1), and the question was "what is the next game after A?", we wrote "SELECT c2 FROM w WHERE c1 > (SELECT c1 FROM w WHERE c2 = A) ORDER BY c1 LIMIT 1" instead of "SELECT c2 FROM w WHERE id = (SELECT id FROM w WHERE c2 = A) + 1". Out of 14,149 questions spanning 1,679 tables, SQUALL provides SQL queries for 11,468 questions, or 81.1%. The remaining 18.9% consist of questions with non-deterministic answers (e.g., "show me an example of . . . "), questions requiring additional pre-processing (e.g., looking up a date inside a text-based details column), and cases where SQL queries would be insufficiently expressive (e.g., "what team has the most consecutive wins?").

2 SQL is a widely adopted formalism. Other formalisms, including LambdaDCS (Pasupat and Liang, 2015), have been used on WTQ. SQL and LambdaDCS can express roughly the same percentage of queries: 81% (our finding) vs. 79% (analysis of a 200-question sample by Pasupat and Liang, 2016). We leave automatic conversion between SQL and other formalisms to future work.
Alignment Annotation Given a tokenized question/LF pair, the annotators selected and aligned corresponding fragments from the two sides. The selected tokens did not need to be contiguous, but they had to be units that decompose no further. For the bottom example in Figure 1, there were three alignment pairs, where the non-contiguous "ORDER BY . . . LIMIT 1" was treated as an atomic unit and aligned to "the highest" in the input. Additionally, not all tokens on either side needed to be aligned. For instance, the SQL keywords SELECT and FROM and question tokens "what", "is", etc. were mostly unaligned. Table 1 shows that the same question phrase was aligned to a range of SQL expressions, and vice versa. Overall, 49.8% of question tokens were aligned. Comparative and superlative question tokens were the most frequently aligned, while many function words were unaligned; see Appendix E for part-of-speech distributions of the aligned and unaligned tokens. Except for the four keywords in the basic structure "SELECT . . . FROM w WHERE . . .", 90.2% of SQL keywords were aligned. Among the remaining SQL tokens, the alignment ratios were 18.0% for =, 25.5% for AND, and 86.1% for column names. The first two cases arose because equality checks and conjunctions of filtering conditions are often implicit in natural language.
Inter-Annotator Agreement and Annotation Cost The two annotators' initial SQL annotation agreement in a pilot trial 3 was 70.4% and after discussion, they agreed on 94.5% of data instances; similarly, alignment agreement rose from 75.1% to 93.3%. With respect to annotation speed, an average SQL query took 33.9 seconds to produce and an additional 15.0 seconds to enrich with alignments: the cost of annotating 100 instances with alignment enrichment was comparable to that of 144 instances with only logical forms.

Post-processing
Literal values in the SQL queries such as "25,000" in Figure 1 and "star one" in Figure 3 are often directly copied from the input questions. We thus adapted WikiSQL's (Zhong et al., 2017) task setting, where all literal values correspond to spans in the input questions. We used our alignment to generate gold selection spans, filtering out instances where literal values could not be reconstructed through fuzzy match from the gold spans. After post-processing, SQUALL contained 11,276 table-question-answer triplets with logical form and lexical alignment annotations.
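A minimal sketch of this span-recovery step, assuming a difflib-based similarity and an invented threshold:

```python
import difflib

def literal_span(question_tokens, literal, threshold=0.75):
    """Return the question span (i, j) whose text best fuzzy-matches the SQL
    literal, or None if no span clears the threshold (in which case the
    instance would be filtered out)."""
    best, best_score = None, threshold
    for i in range(len(question_tokens)):
        for j in range(i + 1, len(question_tokens) + 1):
            cand = " ".join(question_tokens[i:j])
            score = difflib.SequenceMatcher(None, cand.lower(),
                                            literal.lower()).ratio()
            if score > best_score:
                best, best_score = (i, j), score
    return best
```

In practice the gold spans come from the alignment annotation itself; fuzzy matching is only needed to verify that the literal is reconstructable from the span.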

(State-of-the-Art) Base Model: Seq2seq with Attention and Copying
Recent state-of-the-art text-to-SQL models extend the sequence-to-sequence (seq2seq) framework with attention and copying mechanisms (Zhong et al., 2017; Dong and Lapata, 2016, 2018; Suhr et al., 2020, inter alia). We adopt this strong neural paradigm as our base model. 4 The seq2seq model generates one output token at a time via a probability distribution P(y_i | x, y_{<i}) conditioned on both the input sequence representations and the partially-generated output sequence, where x and y are the feature representations for the input and output sequences, and y_{<i} denotes a prefix of y. The last token of y must be a special <STOP> token that terminates the output generation. The per-token probability distribution is modeled through Long Short-Term Memory networks (LSTMs; Hochreiter and Schmidhuber, 1997) and multi-layer perceptrons (MLPs):

h_i = LSTM(h_{i-1}, y_{i-1}),    (1)
P(y_i | x, y_{<i}) = MLP(h_i).    (2)

The training objective is the negative log likelihood of the gold y*, defined for each timestep as

L_seq2seq,i = −log P(y*_i | x, y*_{<i}).

3 In the pilot study, the annotators independently labeled questions over the same 50 tables. We report the percentage of cases where one annotator accepted the other annotator's labels.

4 In Appendix B, we show that on SQUALL, our base model is competitive with a state-of-the-art system benchmarked on the Spider dataset.

Question and Table Encoding
An input x contains a length-n question q = q_1, . . . , q_n and a table with m columns c = c_1, . . . , c_m. The input question is represented through a bi-directional LSTM (bi-LSTM) encoder that summarizes information from both directions within the sequence. Inputs to the bi-LSTM are concatenations of word embeddings, character-level bi-LSTM vectors, part-of-speech embeddings, and named entity type embeddings. We denote the resulting feature vector associated with q_i as q_i. For column names, the representation c_j concatenates the final hidden states of two LSTMs running in opposite directions that take the concatenated word embeddings, character encodings, and column data type embeddings as inputs. We also experiment with pre-trained BERT feature extractors (Devlin et al., 2019), where we feed the BERT model the question and the columns as a single sequence delimited by the special [SEP] token, and we take the final-layer representations of the question words and of the last token of each column as their representations.

Attention in Encoding
To enhance feature interaction between the question and the table schema, for each question word representation q_i, we use an attention mechanism to determine its relevant columns and calculate a linearly-weighted context vector q̄_i as follows:

α_{ij} = softmax_j(q_i^T W^{q2c} c_j),    (3)
q̄_i = Σ_j α_{ij} c_j.

Then we run another bi-LSTM, taking the concatenation of the question representation q_i and context representation q̄_i as inputs, to derive a column-sensitive representation for each question word q_i. We apply a similar procedure (Eq. (4)) to get the column representation c̄_j for each column.
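A numpy sketch of one direction of this attention; the bilinear scoring form is an assumption consistent with the copying scores in §4, and W, which would be learned, is random here:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(q_i, C, W):
    """Question-to-column attention for one question word: bilinear scores
    against every column row of C, softmax attention weights, and the
    linearly-weighted context vector."""
    scores = C @ (W @ q_i)      # shape (m,): one score per column
    alpha = softmax(scores)     # attention distribution over columns
    context = alpha @ C         # weighted sum of column representations
    return alpha, context

rng = np.random.default_rng(0)
q_i = rng.normal(size=4)            # one question-word vector
C = rng.normal(size=(3, 4))         # three column vectors
W = rng.normal(size=(4, 4))         # bilinear scoring matrix (learned)
alpha, context = attend(q_i, C, W)
assert abs(alpha.sum() - 1.0) < 1e-9 and context.shape == (4,)
```

The column-to-question direction (Eq. (4)) is symmetric: each column attends over the question words with its own scoring matrix.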
Attention in Decoding During decoding, to allow LSTMs to capture long-distance dependencies from the input, we add attention-based features to the recurrent feature definition of Eq. (1):

h_i = LSTM(h_{i-1}, [y_{i-1}; q̄^{dec}_i]),    (6)

where q̄^{dec}_i is a linearly-weighted context vector over the question representations, computed with decoder-to-question attention analogous to Eq. (3).

SQL Token Prediction with Copying Mechanism Since each output token can be an SQL keyword, a column name, or a literal value, we factor the probability defined in Eq. (2) into two components: one that decides the type t_i ∈ {KEY, COL, STR} of y_i,

P(t_i | x, y_{<i}) = MLP^{type}(h_i),

and another that predicts the token conditioned on the type t_i. For token type KEY, we predict the keyword token with another MLP:

P(y_i | t_i = KEY, x, y_{<i}) = MLP^{KEY}(h_i).

For COL and STR tokens, the model selects directly from the input column names c or question q via a copying mechanism. We define a probability distribution with softmax-normalized bilinear scores:

P(y_i = c_j | t_i = COL, x, y_{<i}) = softmax_j(c_j^T W^{COL} h_i).

Similarly, we define literal string copying from q with another bilinear scoring matrix W^{STR}.
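The factored distribution can be sketched in numpy; the parameter names, dimensions, and the linear (rather than MLP) type and keyword scorers are illustrative simplifications:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def output_distribution(h, params, C, Q):
    """Factor the next-token distribution into a type decision over
    {KEY, COL, STR} and a type-conditioned token choice; COL and STR
    copy from column names C and question tokens Q via bilinear scores."""
    p_type = softmax(params["type_W"] @ h)       # p(t_i): KEY, COL, STR
    p_key = softmax(params["key_W"] @ h)         # p(keyword | t_i = KEY)
    p_col = softmax(C @ (params["col_W"] @ h))   # p(copy column | t_i = COL)
    p_str = softmax(Q @ (params["str_W"] @ h))   # p(copy token | t_i = STR)
    return p_type, p_key, p_col, p_str

rng = np.random.default_rng(1)
d = 6
params = {"type_W": rng.normal(size=(3, d)),
          "key_W": rng.normal(size=(20, d)),  # 20 keywords, illustrative
          "col_W": rng.normal(size=(d, d)),
          "str_W": rng.normal(size=(d, d))}
p_type, p_key, p_col, p_str = output_distribution(
    rng.normal(size=d), params, rng.normal(size=(4, d)), rng.normal(size=(8, d)))
assert all(abs(p.sum() - 1.0) < 1e-9 for p in (p_type, p_key, p_col, p_str))
```

Copying sidesteps a fixed output vocabulary: column names and literals vary per instance, so their distributions are defined over the input itself.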

Using Alignments in Model Training
The model design in §4 includes many latent interactions within and across the encoder and the decoder. We now describe how our manual alignments can enable direct supervision on such previously latent interactions. Our alignments can be used as supervision for the necessary attention weights ( §5.1). In an oracle experiment where we replace induced attention with manual alignments, the jump in logical form accuracy shows alignments are valuable, if only the models could reproduce them ( §5.2). Moreover, alignments enable a column-prediction auxiliary task ( §5.3).
The loss function L of our full model is a linear combination of the loss terms of the seq2seq model, supervised attention, and column prediction:

L = L_seq2seq + λ_att L_att + λ_CP L_CP,

where we define L_att and L_CP below.

Supervised Attention
Our annotated lexical alignments resemble our base model's attention mechanisms. At the encoding stage, question tokens and the relevant columns are aligned (e.g., "who" ↔ column "athlete"), which should induce higher weights in both question-to-column and column-to-question attention (Eq. (3) and Eq. (4)); similarly, for decoding, the annotation reflects which question words are most relevant to the current output token. Inspired by improvements from supervised attention in machine translation (Liu et al., 2016; Mi et al., 2016), we train the base model's attention mechanisms to minimize the Euclidean distance 5 between the human-annotated alignment vector a* and the model-generated attention vector a:

L_att = ||a − a*||_2^2.

The vector a* is a one-hot vector when the annotation aligns to a single element, or represents a uniform distribution over the aligned subset when the annotation aligns multiple elements.
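A sketch of the gold alignment vector construction and the distance-based loss, assuming the squared Euclidean distance described above (function names are invented):

```python
import numpy as np

def alignment_target(aligned, n):
    """Gold attention vector a*: one-hot for a single aligned element,
    uniform over the aligned subset otherwise, all-zero if nothing aligns."""
    a = np.zeros(n)
    if aligned:
        a[list(aligned)] = 1.0 / len(aligned)
    return a

def attention_loss(alpha, a_star):
    """Squared Euclidean distance between model attention and gold alignment."""
    return float(np.sum((alpha - a_star) ** 2))
```

For example, a question word aligned to column 3 of 4 gets target [0, 0, 1, 0], and the loss vanishes exactly when the model's attention reproduces the annotation.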

Oracle Experiments with Manual Alignments
To assess the potential of alignment annotations for models with supervised attention, we first assume a model that can flawlessly reproduce our annotations within the base model. During training and inference, we feed the true alignment vectors in place of the attention weights to the encoder and/or decoder. Table 2 shows the resultant logical form accuracies. Access to oracle alignments provides up to 23.9% absolute higher accuracy over the base model. This wide gap suggests the high potential for training models with our lexical alignments.

Column Prediction
Wang et al. (2019) show the importance of inferring token-column correspondence in a weakly-supervised setting; SQUALL enables full supervision for an auxiliary task that directly predicts the corresponding column c_j for each question token q_i. We model this auxiliary prediction as

P(c_j | q_i) = softmax_j(c_j^T W^{CP} q_i).

For the corresponding loss L_CP over tokens that match columns, we use cross-entropy.
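A numpy sketch of the auxiliary loss, assuming a bilinear scoring function; W here stands in for the learned matrix, and the averaging over annotated tokens is an illustrative choice:

```python
import numpy as np

def column_prediction_loss(Q, C, W, gold):
    """Cross-entropy auxiliary loss: for each question token i annotated as
    referring to column j (gold is a list of (i, j) pairs), penalize
    -log p(c_j | q_i) under softmax-normalized bilinear scores."""
    total = 0.0
    for i, j in gold:
        scores = C @ (W @ Q[i])
        scores -= scores.max()                      # numerical stability
        log_p = scores - np.log(np.exp(scores).sum())
        total -= log_p[j]
    return total / max(len(gold), 1)
```

With all-zero features the scores are uniform over m columns, so each annotated token contributes log m, which gives a quick sanity check on the implementation.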
Exact-match Features: An Unsupervised Alternative A heuristic-based, albeit lower-coverage, alternative to manual alignment is to use questions' mentions of column names. Thus, we use automatically-generated exact-match features in our baseline models for comparison in our experiments. For question encoders, we include two embeddings derived from binary exact-match features: indicators of whether the token appears in (1) any of the column headers and (2) any of the table cells. Similarly, for the column encoders, we also include an exact-match feature of whether the column name appears in the question.
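These binary features can be sketched directly; the helper name and word-level matching granularity are invented for illustration:

```python
def exact_match_features(question_tokens, headers, cells):
    """Binary indicator pairs per question token: whether the token appears
    in any column header and whether it appears in any table cell. The
    column-side feature (column name appearing in the question) is the
    symmetric counterpart."""
    header_words = {w.lower() for h in headers for w in h.split()}
    cell_words = {w.lower() for c in cells for w in str(c).split()}
    return [(t.lower() in header_words, t.lower() in cell_words)
            for t in question_tokens]
```

Such features fire only on literal string overlap, which is exactly where they fall short of manual alignments: "who" never exact-matches the "athlete" column.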

Experiments
Setup We randomly shuffle the tables in SQUALL and divide them into five splits. For each setting, we report the average logical form accuracy ACC LF (the output LF exactly matches the target LF) and execution accuracy ACC EXE (the output LF may not match the target LF, but its execution yields the gold-standard answer), as well as the standard deviation over five models, each trained with four of the splits as its training set and the remaining split as its dev set. We denote the base model from §4 as SEQ2SEQ and our model trained with both proposed training strategies in §5 as ALIGN. The main baseline model we compare with, SEQ2SEQ + , is the base model enhanced with the automatically-derived exact-match features ( §5.3). See Appendix A for model implementation details.

Table 3 presents the WTQ test-set ACC EXE of ALIGN compared with previous models. Unsurprisingly, SQUALL's supervision allows our models to surpass weakly supervised models. Single models trained with BERT feature extractors exceed the prior state of the art by 5.3%. However, our main scientific interest is not these numbers per se, but how beneficial additional lexical supervision is.
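The table-level five-fold split described in the setup can be sketched as follows (the seed and helper name are illustrative):

```python
import random

def table_level_folds(table_ids, k=5, seed=0):
    """Shuffle tables (not questions) and deal them into k folds, so that
    all questions about one table share a fold and each model's dev tables
    are unseen during its training."""
    ids = list(table_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]
```

Splitting at the table level rather than the question level prevents a model from seeing a dev table's schema and contents during training, which would inflate accuracy.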

Effect of Alignment Annotations
To examine the utility of lexical alignments as a finer-grained type of supervision, we compare ALIGN with SEQ2SEQ + in Table 4. Both have access to logical form supervision, but ALIGN additionally uses lexical alignments during training. ALIGN improves over SEQ2SEQ + by 2.3% with BERT and 3.1% without, showing that lexical alignment annotation is more beneficial than automatically-derived exact-match column reference features. 6

Figure 2 shows what happens if ALIGN has access to all the training logical forms, but only a percentage of the accompanying alignments. Surprisingly, more than half of the accuracy improvement comes from as little as 5% of the alignment annotations. Because the cost of aligning an example is less than half of that for writing a logical form ( §3.1), we conclude that annotating lexical alignments is a cost-effective approach on a fixed budget.

Effect of Individual Strategies
Where Do Our Models Improve the Most? ALIGN achieves its largest gains with respect to SEQ2SEQ + on the subtask of column selection (+4.9%), compared with a +2.0% improvement on generating correct SQL templates. The gain is larger on complex SQL templates (i.e., those with more aggregation functions and nested queries), 7 which demonstrates the effectiveness of reinforcing question-column correspondence through supervised attention and a column prediction auxiliary task.

Table 8: Recall against hand-annotated alignments and average entropy of the attention distributions in the question-to-column (q2c), column-to-question (c2q) and decoder-to-question (d2q) modules, comparing models trained with supervised encoder/decoder attention, none (SEQ2SEQ + ), or both strategies (ALIGN).
Models trained with supervised attention recover the annotated alignments better and produce attention distributions that accord more closely with human judgments. This is an arguably surprising benefit, since the supervised decoder was not trained with q2c supervision, and so one might have expected it to perform similarly to SEQ2SEQ + . However, one needs to be careful in interpreting these results, as machine-induced attention distributions are not intended for direct human interpretation (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019).
Qualitative Analysis Our additional supervision helps when the question has little textual overlap with the referenced columns. Figure 3 shows an example. With finer-grained supervision, ALIGN learns that the column "Serial Name" corresponds to the question word "show", but SEQ2SEQ + selects the wrong column "Co-Star".

Related Work
Attention and Alignments Explicit supervision for attention mechanisms (Bahdanau et al., 2015) is helpful for many tasks, including machine translation (Liu et al., 2016; Mi et al., 2016), image captioning (Liu et al., 2017), and visual question answering. Zhang et al. (2019) argue that structured alignment is crucial to text-to-SQL models, and they induce latent alignments in a weakly-supervised setting. In contrast, we take a fully-supervised approach and train models with manual alignments.
Lexical Focus and Semantic Parsing Our lexical alignment annotations are similar to semantic lexicons in lexicalized-grammar-based semantic parsing (Zettlemoyer and Collins, 2005, 2007; Kwiatkowski et al., 2010; Krishnamurthy and Mitchell, 2012; Artzi and Zettlemoyer, 2013). Those lexicons are usually well-typed to support semantic composition. It is an interesting future direction to explore how to model analogous compositional aspects with our type-flexible alignments through, for example, syntax-based alignment (Zhang and Gildea, 2004).
Annotator Rationales A related direction for enriching annotations is supplying annotator rationales (Zaidan et al., 2007), i.e., evidence supporting the annotations in addition to the final labels. Many recent datasets on machine reading comprehension and question answering, such as HotpotQA (Yang et al., 2018) and CoQA (Reddy et al., 2019), include such intermediate annotations at dataset release. Dua et al. (2020) show that these annotator rationales improve model accuracy for a given annotation budget on machine reading comprehension. The alignments we provide could, at a stretch, be considered a type of rationale for the output SQL annotation.
Text-to-SQL Datasets There is growing interest in both the database and NLP communities in text-to-SQL applications. Widely-used domain-specific datasets include ATIS (Price, 1990; Dahl et al., 1994), GeoQuery (Zelle and Mooney, 1996; Popescu et al., 2003), Restaurants (Tang and Mooney, 2000; Popescu et al., 2003), and Scholar (Iyer et al., 2017). WikiSQL (Zhong et al., 2017) is among the first large-scale datasets with question-logical form pairs querying a wide range of data tables extracted from Wikipedia, but WikiSQL's logical forms are generated from a limited set of templates. In contrast, WTQ questions are authored by humans under no specific constraints, and as a result WTQ includes more diverse semantics and logical operations. The family of Spider datasets (Yu et al., 2018, 2019a) contains queries even more complex than in WTQ, including a higher percentage of nested queries and multiple table joins. We leave extensions of lexical alignments to Spider's complex-structure queries to future work.

Conclusion
We introduce SQUALL, the first large-scale semantic parsing dataset with both hand-produced target logical forms and manually-derived lexical alignments between questions and SQL queries. Our dataset enables finer-grained supervision than existing datasets have previously supported. We incorporate the alignments into encoder-decoder-based neural models through supervised attention and an auxiliary task of column prediction. Experiments confirm our intuition that finer-grained supervision is helpful to model training. Our oracle studies also show that there is large unrealized potential in our annotations. Thus, it remains an exciting challenge for future research to use our lexical alignment annotations more effectively. Our annotation cost analysis shows that collecting additional lexical alignments is more cost-effective for improving model accuracy than having only logical forms. We hope that our findings will help future dataset design decisions and extensions of other existing datasets. One potential future direction is to further investigate the utility of lexical alignments in a cross-dataset/domain evaluation setting.

A Model Implementation Details
We use and compare two different feature extractors in our experiments. For bi-LSTM encoders, we concatenate 100-dimensional word embeddings initialized from pre-trained GloVe embeddings (Pennington et al., 2014), 8-dimensional part-of-speech embeddings, and 8-dimensional named-entity embeddings as input to the LSTM encoders. Tokens that appear fewer than five times are replaced with a special "UNK" token. For the BERT setting, we fine-tune a BERT base model 8 and use the 768-dimensional final-layer representations. For the decoder, we embed previously decoded tokens, such as keywords, into 256-dimensional vectors and feed them as next-timestep input to the decoder LSTM. Both the encoder and decoder LSTMs have 128 hidden units and 2 layers. If the decoder predicts question words as literal strings in the output SQL queries, we replace them with the most similar table cell values using fuzzy match. 9 We set both λ att and λ CP to 0.2. During training, we use a batch size of 8 and we set the dropout rate to 0.3 in all MLPs and LSTMs. We use the Adam optimizer (Kingma and Ba, 2015) with default learning rate 0.001 and we clip gradients to 5.0. We train our models for up to 50 epochs and conduct early stopping based on per-epoch dev-set evaluation. On a single GTX 1080 Ti GPU, a training mini-batch takes 0.7 seconds on average and the training process finishes within 10 hours. We do not tune hyper-parameters.

B Comparison of Our Baseline Model with a State-of-the-Art Text-to-SQL Parser
To evaluate the strength of our baseline model, we compare it with Suhr et al.'s (2020) state-of-the-art model previously tested on the Spider dataset. Our task formulation is unlike the Spider dataset's in that 1) the official Spider evaluation does not require predictions of literal values and 2) on our dataset, the model needs to predict data types for each column (e.g., number in Figure 1). We therefore report relaxed logical form accuracies (ACC−LF) that accept column type disparities between the prediction and the gold standard; Table B1 shows the evaluation results. Our baseline SEQ2SEQ + model has competitive ACC−LF with Suhr et al.'s (2020) state-of-the-art text-to-SQL parser.

C Annotation Guidelines
In our pilot study, we instructed two expert SQL annotators to write down SQL equivalents of the English questions and to pick out the lexical mappings between question and SQL tokens that correspond to each other semantically and are atomic, i.e., cannot be decomposed into smaller meaningful mappings. These underspecified instructions led to 70.4% agreement on SQL annotation and 75.1% agreement on alignment annotation. The annotators had similar but not identical intuitions about, for example, what constitutes an atomic unit, especially when there were equally plausible alternative options. Following discussions, we refined our annotation guidelines for frequently occurring patterns to ensure consistent annotations, as follows:

General Rules
1. SQL queries should reflect the semantic intent of the English questions, even if shorter SQL queries return the same execution results. The only exception is when SQL offers no straightforward implementation of the implicit semantic constraints. In that case, answer the first appearing subquestion, i.e., assume that the implicit semantic constraints are always met. For example, it is implicitly assumed in the question "which city are A and B located in?" that A and B are located in the same city; write down the SQL equivalent for "which city is A located in?".

2. When there are competing choices of annotation, select the simplest version. Among alternative SQL queries, select the one with fewer nestings and fewer SQL tokens: SELECT MAX(col) FROM w is prioritized over SELECT col FROM w ORDER BY col DESC LIMIT 1. Following this rule, default values are always omitted since the queries are shorter without them. These include, for example, the keyword ASC in an ORDER BY clause.

3. Lexical alignments should cover as many semantically-meaningful tokens as possible, even if there is no word overlap. For example, for the question "who performed better, toshida or young-sun?", align the word "performed" to its corresponding column ("result" or "rank"). For wh-tokens, align "when", "who" and "where" if appropriate, but omit alignments of "what" and "which" when they do not contribute to concrete meanings.

4. Prioritize alignments with exact lexical matches. This means that for many noun phrases, align bare nouns excluding the determiners instead of maximal noun phrases (e.g., "movie" rather than "the movie" should be aligned to the "movie" column token in the SQL query). In contrast, include "the" in the alignment of superlatives (e.g., "the least"), since superlatives usually do not lexically overlap with the column tokens.

5. In general, the annotation should not depend on the table contents and sorting assumptions. In other words, use direct references to the presented row order id as little as possible. However, use id if the question explicitly asks about the presentation order, e.g., "the first on the list" or "the first listed".

Some Frequent Specific Cases
1. Align "how many" to the aggregation operation when appropriate, but do not align "how many" when the SQL query directly selects a column without aggregation, e.g., when the question is "how many total medals has Spain won?" and the SQL directly selects the existing "total" column.
5. Align the question word "game" to the "date" column if necessary, but use COUNT(*) for counting the number of games when there is no better alignment alternative.
6. Align words referring to performance, such as "fast", to the corresponding "result"/"time" columns; if those are not available, align them to "rank" columns, which indirectly refer to performance; if still not available, align them to id, which explicitly relies on the table being pre-sorted by performance.
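The contrast between cases 1 and 5 can be sketched with sqlite3 on hypothetical tables (the column and table names below are illustrative, not from the dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Case 1: the table already stores a pre-computed "total" column, so
# "how many total medals has spain won?" selects it directly; "how many"
# is NOT aligned to any aggregation operation.
conn.execute("CREATE TABLE w (nation TEXT, total INTEGER)")
conn.executemany("INSERT INTO w VALUES (?, ?)",
                 [("spain", 17), ("italy", 28)])
total = conn.execute(
    "SELECT total FROM w WHERE nation = 'spain'"
).fetchone()[0]

# Case 5: no column stores the number of games, so counting falls back
# to COUNT(*) over the rows.
conn.execute("CREATE TABLE games (date TEXT)")
conn.executemany("INSERT INTO games VALUES (?)",
                 [("2001-05-01",), ("2001-06-02",)])
n_games = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]

print(total, n_games)  # 17 2
```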

D Database Construction
We assume 9 basic data types for WTQ tables: numbers (e.g., "5"), numbers with units (e.g., "5 kg"), date and time (e.g., "May 29, 1968", "3:56"), (sports) scores (e.g., "w 5:3"), number spans (e.g., "12-89"), time spans (e.g., "May 2011-June 2012"), fractions (e.g., "3/5"), street addresses (e.g., "2020 Westchester Street"), and raw texts (e.g., "John Shermer"). Additionally, we consider two composite types: binary tuples (e.g., "KO (head kick)") and lists (e.g., "Wojtek Fibak, Joakim Nyström"). Binary tuples are split into two subcolumns in the generated databases, and lists are automatically transformed into a separate table joined with the original table through primary-foreign key relations. Data types for each column are first identified with regular expressions and then manually verified by annotators. Any column that contains a type outside of these 9 types is interpreted as raw text. We also filter out aggregation rows from the tables, so that SQL aggregation functions over the table skip those pre-computed aggregates.

F Supervised Attention Loss

We compare supervised attention losses that measure a distance between a_i and a_i*, where a_i and a_i* denote the learned attention weights and the annotated gold-standard alignments, respectively. A smaller distance between a_i and a_i* indicates a model that better reproduces our alignment annotation. While both mean squared error and multiplication are symmetric in a_i and a_i*, cross entropy is asymmetric and has previously been shown to be the most effective measure for machine translation (Liu et al., 2016). Table F3 shows dev-set results with different supervised attention loss choices in ALIGN's encoder. The mean squared error loss is the strongest, with 1.5% higher execution accuracy than the multiplication loss and 0.6% higher than the cross-entropy loss.
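The three loss choices can be sketched as follows with NumPy. This is only an illustrative comparison on a toy attention distribution; the exact formulation used in training (e.g., normalization or batching) may differ, and the multiplication objective of Liu et al. (2016) is written here as a negated product so that all three are minimized:

```python
import numpy as np

def mse_loss(a, a_star):
    # Mean squared error between learned and gold attention; symmetric.
    return np.mean((a - a_star) ** 2)

def xent_loss(a, a_star, eps=1e-12):
    # Cross entropy of learned attention under the gold alignment; asymmetric.
    return -np.sum(a_star * np.log(a + eps))

def mult_loss(a, a_star):
    # Negated element-wise product, so a higher overlap gives a lower loss.
    return -np.sum(a * a_star)

# Toy example: attention over 4 source tokens, gold alignment on token 2.
a = np.array([0.1, 0.7, 0.1, 0.1])
a_star = np.array([0.0, 1.0, 0.0, 0.0])
print(mse_loss(a, a_star), xent_loss(a, a_star), mult_loss(a, a_star))
```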
We also compare with automatic alignments derived from exact lexical matches between column names and question 5-grams. Table G4 shows dev-set results. Training with automatic alignments improves over the SEQ2SEQ+ model, but manual annotations provide an additional +1.7% ACC_EXE. The manual annotations are cleaner and more informative, since many column mentions have no lexical overlap with the column headers (e.g., "who" ↔ column "athlete").

Table H5 shows dev-set results for the 10 most frequent templates. We report logical form (ACC_LF), template (ACC_TEMP) and column (ACC_COL) accuracies. ACC_COL is calculated on the subset where template predictions are accurate. The improvement of ALIGN over SEQ2SEQ+ is larger on ACC_COL than on ACC_TEMP. Additionally, ALIGN tends to yield higher ACC_COL gains on complex templates than on simple and common templates.

Table I6: Dev logical form (ACC_LF) and execution (ACC_EXE) accuracies in a generalization evaluation setting following Finegan-Dollak et al. (2018), where instances of a given template are ablated from training and model accuracy is evaluated on that unseen template.

Table I6 considers the evaluation setting of Finegan-Dollak et al. (2018), which tests model accuracy on unseen SQL templates: we exclude all instances of a given template from the training set and then evaluate only on that template. ALIGN outperforms SEQ2SEQ+ in ACC_EXE on 9 out of the 10 most frequent templates. Notably, on a template that contains both GROUP BY and ORDER BY clauses, the ACC_EXE improvement of ALIGN is as large as +49.2%.