Structure-Grounded Pretraining for Text-to-SQL

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (STRUG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel pretraining tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set Spider-Realistic based on the Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms all baselines on more realistic sets. All the code and data used in this work will be open-sourced to facilitate future research.


Introduction
Semantic parsing is the task of mapping a natural language (NL) utterance to a machine-understandable representation such as lambda calculus, abstract meaning representation, or a structured query language (e.g., SQL). In this paper, we focus on the task of translating NL questions to executable SQL queries (text-to-SQL). This is a fundamental task for building natural language interfaces for databases, which can enable non-expert users to effortlessly query databases (Androutsopoulos et al., 1995; Li and Jagadish, 2014a).

*Work done during an internship at Microsoft Research.

Figure 1: Illustration of text-to-SQL text-table alignment (top half) and a parallel text-table corpus (bottom half). In both examples, the associations between tokens in the NL utterance and columns in the table are indicated. In this paper, we aim to leverage the text-table alignment knowledge in the parallel text-table corpus to help text-to-SQL.
One of the key challenges in text-to-SQL is text-table alignment, that is, to correctly recognize natural language references to columns and values and to ground them in the given database schema. Consider the example in the top half of Fig. 1. A model needs to first identify the column mentions total credits, department, and value mention History, and then ground them to the given schema. This is challenging for three reasons. First, the model needs to jointly understand the NL utterance and the database schema, as the user may refer to a column using various expressions which usually differ from the original column name. Second, the model needs to be able to generalize to new database schemas and referential language that is not seen in training. Finally, in the case that accessing cell values is not possible, the model still needs to identify potential value mentions and link them to the correct columns without exhaustively searching and matching over the database.
On the other hand, text-table alignment naturally exists in parallel text-table corpora, e.g., web tables with their textual context (Lehmberg et al., 2016), table-to-text generation datasets (Parikh et al., 2020; Chen et al., 2020a), and table-based question answering datasets (Pasupat and Liang, 2015; Chen et al., 2020b). Such datasets can be collected from web pages, documents, etc., and require much less human effort to create compared with text-to-SQL datasets. The bottom half of Fig. 1 gives an example from such an alignment dataset. There are three value mentions, 11417, Pune Junction and Nagpur Junction, which can be grounded to the train number, departure station and arrival station columns, respectively. Such alignment information can be easily obtained by leveraging the table contents or with some human annotation. In this work, we aim to incorporate the text-table alignment knowledge contained in a parallel corpus via pretraining and use it to help the downstream text-to-SQL task.
We present a novel weakly supervised Structure-Grounded pretraining framework (STRUG) for text-to-SQL. We design a set of prediction tasks and optimize them over a parallel corpus containing both NL sentences and tabular data, encouraging the encoded representations to capture the information required for table grounding. More specifically, we identify three critical tasks for aligning text with tables: column grounding, value grounding and column-value mapping (examples shown in Fig. 2). We re-purpose an existing large-scale table-to-text generation dataset, ToTTo (Parikh et al., 2020), for pretraining and obtain labels for the three tasks via weak supervision. We experiment under two settings, with or without human assistance: (1) the human assisted setting, using ToTTo's revised descriptions and cell annotations; (2) the automatic setting, using the raw sentences and inferring the cell correspondences via string matching against the table contents.
As pointed out by Suhr et al. (2020), existing text-to-SQL benchmarks like Spider (Yu et al., 2018b) render the text-table alignment challenge easier than expected by explicitly mentioning exact column names in the NL utterances. Contrast this with more realistic settings, where users may refer to columns using a variety of expressions. Suhr et al. (2020) propose a new cross-database setting that uses Spider for training and includes eight other single-domain text-to-SQL datasets for evaluation. In addition to adopting their setting, we create a new evaluation set called Spider-Realistic from the original Spider dev set, by removing explicit mentions of column names from the utterances.
We pretrain STRUG using 120k text-table pairs from ToTTo. Experiments show that our structure-grounded pretraining objectives are very efficient and usually converge within around 5 epochs, in less than 4 hours. This dramatically reduces the pretraining cost compared with previous pretraining methods (Herzig et al., 2020; Yin et al., 2020). We adopt the same model architecture as BERT (Devlin et al., 2019), with simple classification layers on top for pretraining. For downstream tasks, STRUG can be used as a text-table encoder and easily integrated with any existing state-of-the-art model. We conduct extensive experiments and show that: (1) Combined with the state-of-the-art text-to-SQL model RAT-SQL, using STRUG as the encoder significantly outperforms directly adopting pretrained BERT-LARGE (RAT-SQL's default encoder) and performs on par with other text-table pretraining models like GRAPPA on the widely used Spider benchmark.
(2) On more realistic evaluation settings, including Spider-Realistic and the Suhr et al. (2020) datasets, our method outperforms all baselines. This demonstrates the superiority of our pretraining framework in solving the text-table alignment challenge, and its usefulness in practice.
(3) STRUG also helps reduce the need for large amounts of costly supervised training data. We experiment on the WikiSQL benchmark (Zhong et al., 2017) with limited training data sizes, and show that our pretraining method boosts model performance by a large margin and consistently outperforms existing pretraining methods.

Related Work
Cross-Database Text-to-SQL. Remarkable progress has been made in text-to-SQL over the past few years. With sufficient in-domain training data, existing models already achieve over 80% exact matching accuracy (Finegan-Dollak et al., 2018; Wang et al., 2018) on single-domain benchmarks like ATIS (Hemphill et al., 1990; Dahl et al., 1994) and GeoQuery (Zelle and Mooney, 1996). However, annotating NL questions with SQL queries is expensive, making it cost-prohibitive to collect training examples for all possible databases; a model that can generalize across domains and databases is desired. In light of this, Yu et al. (2018b) present Spider, a cross-database text-to-SQL benchmark that trains and evaluates a system on different databases. More recently, Suhr et al. (2020) provide a holistic analysis of the challenges introduced by cross-database text-to-SQL and propose including single-domain datasets in evaluation. Their study uncovers the limitations of current text-to-SQL models, and demonstrates the need for models that can better handle the generalization challenges.

Pretraining for Text-Table Data. Inspired by the success of pretrained language models, some recent work has tried to apply similar pretraining objectives to text-table data. TaBERT (Yin et al., 2020) and TAPAS (Herzig et al., 2020) jointly learn text-table representations by leveraging a large number of web tables and their textual context. They flatten the tables and use special embeddings to model the structure information. A masked language model (MLM) objective is then used to predict the masked tokens in the text-table data. MLM is good at modeling the contextualized semantic representation of a token, but is weak at capturing the alignment between a pair of sequences (e.g., text and table). More recently, GRAPPA explores a different direction for pretraining which shares some similarity with existing work on data augmentation for semantic parsing.
GRAPPA first constructs synthetic question-SQL pairs using templates (a synchronous context-free grammar) induced from existing text-to-SQL datasets; a SQL semantic prediction objective is then used to learn compositional inductive bias from the synthetic data. However, because the synthetic data is generated from templates, with column names and values directly filled into the questions, it has the same problem as existing text-to-SQL datasets of easing the text-table alignment challenge. In contrast, STRUG aims to directly learn text-table alignment knowledge from parallel text-table corpora via structure-grounded pretraining objectives. We also note that existing pretraining methods and STRUG are complementary and may be combined in the future.

Structure Grounding in Text-to-SQL. Structure grounding has been proven crucial for text-to-SQL, where a model needs to correctly identify column and value mentions in an NL utterance and link them to the given database schema (Guo et al., 2019; Bogin et al., 2019; Lei et al., 2020). Most existing text-to-SQL systems have specially designed components for structure grounding, which is also referred to as schema linking. For example, Guo et al. (2019) and Yu et al. (2018a) explore simple heuristics like string matching for schema linking, and use the linking results as direct hints to their systems. However, such heuristics may not generalize well in real-world scenarios, where there are varied ways to refer to a column that usually differ from the original column name. More recently, Shi et al. (2020) and Lei et al. (2020) take a step forward and manually annotate WikiTableQuestions (Pasupat and Liang, 2015) and Spider with fine-grained alignment labels for supervised training (together with the text-to-SQL objective), which brings significant improvements.
The main drawback of these models is that they are limited to learning alignment knowledge from a relatively small training corpus, and do not generalize well in a cross-domain setting. Moreover, SQL annotations and fine-grained alignment labels are both expensive to obtain manually. In contrast, this paper aims to re-purpose an existing parallel text-table corpus for pretraining models to learn structure grounding, where we generate alignment labels at large scale with little or no cost.

Motivation
One of the critical generalization challenges in cross-database text-to-SQL is text-table alignment, i.e., a model needs to understand NL utterances and database schemas unseen in training, including value mentions and novel columns, and to correctly map between them. Similar generalization challenges have long been studied in NLP. Recently, pretrained language models (Devlin et al., 2019) have achieved great success in tackling such challenges by learning contextualized representations of words from large text corpora. Inspired by this, in this work we aim to develop a pretraining method that can directly learn text-table alignment knowledge from a large parallel text-table corpus.
Unlike previous text-table pretraining work (Herzig et al., 2020; Yin et al., 2020) that optimizes unsupervised objectives like MLM during pretraining, we carefully design three structure-grounded tasks: column grounding, value grounding and column-value mapping. These tasks are closely related to text-to-SQL and directly capture text-table alignment during pretraining. As a result, the learned alignment knowledge can be effectively transferred to the downstream task and improve final performance.

Pretraining Objectives
We use the same model architecture as BERT, and add simple classification layers on top for the three structure-grounded tasks. For downstream tasks, our model can be easily integrated into existing models as a text-table encoder. Following previous work (Hwang et al., 2019; Guo et al., 2019), we linearize the input by concatenating the NL utterance and the column headers, using a <sep> token as separator.
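As a concrete illustration, this linearization can be sketched as follows. This is a hypothetical helper written for this explanation, not the authors' released code, and the example utterance and column names are assumptions based on Fig. 1.

```python
def linearize(utterance, column_headers):
    """Concatenate the NL utterance and column headers, separated by <sep>."""
    parts = [utterance]
    for header in column_headers:
        parts.extend(["<sep>", header])
    return " ".join(parts)

# Example with columns assumed from the Fig. 1 train table:
print(linearize("How many trains depart from Pune Junction ?",
                ["train number", "departure station", "arrival station"]))
```

The resulting string is then tokenized and fed to the BERT encoder; for a database with multiple tables, the headers of all tables would be concatenated in the same way.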
Formally, given a pair of an NL utterance {x_i} and a table with a list of column headers {c_j} (in case there are multiple tables, as in databases, we concatenate all the column names together), we first obtain the contextualized representation x_i of each token in the utterance and c_j of each column using the last-layer output of the BERT encoder. Here each column header c_j may contain multiple tokens c_{j,0}, ..., c_{j,|c_j|}. We obtain a single vector representation for each column using column pooling: we take the output of the first and last token of the header, and calculate the column representation as c_j = (c_{j,0} + c_{j,|c_j|}) / 2. {x_i} and {c_j} are then used to compute the losses for the three tasks. An overview of our model architecture and pretraining objectives is shown in Fig. 2.

Column grounding. An important task in text-to-SQL is to identify grounded columns from the schema and use them in the generated SQL query. With a parallel text-table corpus, this is similar to selecting the columns that are mentioned in the associated NL sentence. This task requires a model to understand the semantic meaning of a column based on its header alone, and to infer its relation with the NL sentence from the contextualized representations. We formulate it as a binary classification task. For each column c_j, we use a one-layer feed-forward network f(·) to get a prediction p^c_j = f(c_j) of whether c_j is mentioned in the sentence or not. The column grounding loss L_c is then calculated as the binary cross entropy loss w.r.t. the ground truth labels y^c_j ∈ {0, 1}. Note this task requires the model to identify the meaning of a column without access to any of its values. Hence, it is suitable for the typical text-to-SQL setting where the model only has access to the database schema.

Value grounding.
For clauses like WHERE and HAVING, to generate an executable SQL query, a model also needs to extract from the NL utterance the value to be compared with the grounded column. With a parallel text-table corpus, this can be transformed into the task of finding cell mentions in the NL sentence. Since the contents of the table may not be accessible, we again formulate this as a binary classification task: for each token x_i in the sentence, a one-layer feed-forward network predicts p^v_i = f(x_i), i.e., whether x_i is part of a value mention. The value grounding loss L_v is then calculated as the binary cross entropy loss w.r.t. the ground truth labels y^v_i ∈ {0, 1}.

Column-value mapping. As there may be multiple columns and values used in the SQL query, a text-to-SQL model also needs to correctly map the grounded columns and values. This task further strengthens the model's ability to capture the correlation between the two input sequences by learning to align columns and values. We formulate it as a matching task between the tokens in the NL sentence and the columns. For every grounded token x_i (i.e., y^v_i = 1), we pair it with each column c_j and calculate the probability p^cv_{i,j} = f([x_i, c_j]). Here [·, ·] is the vector concatenation operation. We then apply a softmax layer over the predictions p^cv_i for each token, and the final column-value mapping loss is calculated as L_cv = CrossEntropy(softmax(p^cv_i), y^cv_i), where y^cv_i ∈ {0, 1}^|c| is the ground truth label.

Table 1: Statistics of the eight single-domain evaluation datasets (recovered from extraction residue; the trailing GeoQuery entries were lost).

Dataset | # Examples | Exec Acc (Suhr et al., 2020) | % Col Mentioned
ATIS (Hemphill et al., 1990; Dahl et al., 1994) | 289 (486) | 0.8 | 0.0
Restaurants (Tang and Mooney, 2000) | 27 (378) | 3.7 | 0.0
Academic (Li and Jagadish, 2014b) | 180 (196) | 8.2 | 11.4
Yelp (Yaghmazadeh et al., 2017) | 54 (128) | 19.8 | 8.0
Scholar (Iyer et al., 2017) | 394 (599) | 0.5 | 0.0
Advising (Finegan-Dollak et al., 2018) | 309 (2858) | 2.3 | 0.3
IMDB (Yaghmazadeh et al., 2017) | 107 (131) | 24.6 | 1.0
GeoQuery (Zelle and Mooney, 1996) | 532 (598) | — | —
The final loss L for pretraining is the sum of all three losses. We experimented with different weights for each term, but did not observe significant improvement on the results. Hence we only report results with equally weighted losses.
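The combined objective can be sketched in plain Python as follows. This is a minimal illustration of the loss computation only, under the assumption that the contextualized token vectors and the classifier outputs (probabilities and matching scores) are already given; the BERT encoder and the feed-forward scoring networks themselves are omitted.

```python
import math

def bce(p, y, eps=1e-9):
    # Binary cross entropy for one prediction p against a label y in {0, 1}.
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def column_pool(header_token_vecs):
    # c_j = (c_{j,0} + c_{j,|c_j|}) / 2: average of first and last token outputs.
    first, last = header_token_vecs[0], header_token_vecs[-1]
    return [(a + b) / 2 for a, b in zip(first, last)]

def pretraining_loss(col_probs, y_col, val_probs, y_val, map_scores, y_map):
    """L = L_c + L_v + L_cv (equally weighted, as in the paper).

    col_probs[j]  : predicted probability that column j is mentioned
    val_probs[i]  : predicted probability that token i is part of a value
    map_scores[i] : matching scores over all columns for grounded token i
    y_map[i]      : one-hot label over columns for grounded token i
    """
    L_c = sum(bce(p, y) for p, y in zip(col_probs, y_col)) / len(col_probs)
    L_v = sum(bce(p, y) for p, y in zip(val_probs, y_val)) / len(val_probs)
    L_cv = 0.0
    for scores, onehot in zip(map_scores, y_map):
        probs = softmax(scores)                     # softmax over columns
        L_cv -= math.log(probs[onehot.index(1)] + 1e-9)
    L_cv /= max(len(map_scores), 1)
    return L_c + L_v + L_cv
```

A model whose predictions agree with the weak labels yields a small total loss, while disagreement on any of the three tasks increases it; averaging choices within each term are our own simplification.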

Obtaining Pretraining Data via Weak Supervision
We obtain the ground truth labels y^c_j, y^v_i and y^cv_i from a parallel text-table corpus based on a simple intuition: given a column in the table, if any of its cell values can be matched to a phrase in the sentence, this column is likely mentioned in the sentence, and the matched phrase is the value aligned with the column. To ensure high-quality text-table alignment information in the pretraining corpus, unlike previous work (Herzig et al., 2020; Yin et al., 2020) that uses loosely connected web tables and their surrounding text, we leverage an existing large-scale table-to-text generation dataset, ToTTo (Parikh et al., 2020). ToTTo contains 120,761 NL descriptions and corresponding web tables automatically collected from Wikipedia using heuristics. Additionally, it provides cell-level annotations that highlight the cells mentioned in each description, and revised versions of the NL descriptions with irrelevant or ambiguous phrases removed.
We experiment with two pretraining settings, with or without human assistance. In the human assisted setting, we use the cell annotations along with the revised descriptions to infer the ground truth labels. More specifically, we first label all columns c_j that contain at least one highlighted cell as positive (y^c_j = 1). We then iterate through all the values of the highlighted cells and match them with the NL description via exact string matching to extract value mentions. If a phrase is matched to a highlighted cell, we select all the tokens x_i in that phrase and align them with the corresponding columns c_j (y^v_i = 1, y^cv_{i,j} = 1). In the automatic setting, we use only the tables and the raw sentences, and obtain cell annotations by comparing each cell with the NL sentence using exact string matching. Note that in both settings, the cell values are used only for preparing supervision for the pretraining objectives, not as inputs to the pretraining model.
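The core of the automatic setting can be sketched as below. This is our simplified illustration of the matching heuristic, not the released preprocessing code: real preprocessing must also handle tokenization, normalization, and overlapping matches.

```python
def weak_labels(sentence_tokens, columns, rows):
    """Derive y^c (per column), y^v (per token), and y^cv (token-column)
    by exact string matching of cell values against the sentence."""
    n, m = len(sentence_tokens), len(columns)
    y_col = [0] * m
    y_val = [0] * n
    y_map = [[0] * m for _ in range(n)]
    for row in rows:
        for j, cell in enumerate(row):
            cell_toks = cell.split()
            k = len(cell_toks)
            for i in range(n - k + 1):
                if sentence_tokens[i:i + k] == cell_toks:  # exact match
                    y_col[j] = 1                  # column is grounded
                    for t in range(i, i + k):
                        y_val[t] = 1              # tokens are a value mention
                        y_map[t][j] = 1           # token aligned with column
    return y_col, y_val, y_map
```

Running this on the Fig. 1 example (sentence "train 11417 departs from Pune Junction", columns train number and departure station) marks both columns as grounded and aligns the matched tokens with them.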
To make the pretraining more effective and to achieve better generalization performance, we also incorporate two data augmentation techniques. First, since the original parallel corpus only contains one table for each training example, we randomly sample K_neg tables as negative samples and append their column names to the input sequence. This simulates a database with multiple tables and potentially hundreds of columns, which is common in text-to-SQL. Second, we randomly replace the matched phrases in the NL sentences with values of cells from the same column (the labels are kept the same). This way we can better leverage the contents of the table during pretraining and improve the model's generalization ability by exposing it to more cell values.
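The two augmentations can be sketched as follows. This is our own illustration under simplified inputs (tables as lists of column names, values as whitespace-tokenized strings), not the released pipeline.

```python
import random

def add_negative_tables(columns, other_tables, k_neg, rng):
    """Append the column headers of K_neg randomly sampled negative tables.
    Columns from negative tables keep label 0 for column grounding."""
    extra = []
    for table_columns in rng.sample(other_tables, k_neg):
        extra.extend(table_columns)
    return columns + extra

def replace_value_mention(tokens, span, same_column_cells, rng):
    """Replace a matched value phrase (a token span) with another cell value
    drawn from the same column; the alignment labels are kept unchanged."""
    start, end = span
    return tokens[:start] + rng.choice(same_column_cells).split() + tokens[end:]
```

Because the replacement value comes from the same column, the original grounding labels remain valid for the augmented sentence.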

Creating a More Realistic Evaluation Set
As one of the first datasets to study cross-database text-to-SQL, Spider has been widely used to benchmark a model's ability to generalize to unseen programs and databases. However, as pointed out by Suhr et al. (2020), Spider eases the task by using utterances that closely match their paired SQL queries, for example by explicitly mentioning the column names in the question, while in practice NL references to columns usually differ from the original column names. To alleviate this problem, Suhr et al. (2020) propose training on a cross-domain dataset like Spider and evaluating on eight additional single-domain datasets such as ATIS (Hemphill et al., 1990; Dahl et al., 1994) and GeoQuery (Zelle and Mooney, 1996). However, some of these datasets differ a lot from Spider, introducing many novel query structures and dataset conventions. As we can see from Table 1, the execution accuracy of the Suhr et al. (2020) baseline is very low on several of these datasets.

[Table 3: results on Spider-Realistic and the eight single-domain evaluation sets for RAT-SQL with STRUG (Human Assisted) and STRUG (Automatic); the row data was too garbled in extraction to reconstruct reliably.]

Table 4: Results on Spider. The top half shows models using only the database schema, the bottom half shows models using the database content. We train our model three times with different random seeds and report the mean and standard deviation. Recoverable entries, in the order they appear in the residue: TaBERT (Yin et al., 2020): 64.5; RAT-SQL w. BERT-LARGE: 69.8 ± 0.8, 72.3 ± 0.6; w. GRAPPA: 73.4, 69.6; w. STRUG (Human Assisted): 72.7 ± 0.7, 75.5 ± 0.8, 68.4; w. STRUG (Automatic): 72.6 ± 0.1, 74.9 ± 0.1.
We use the exact set match metric from the official Spider evaluation script and execution accuracy for evaluation on Spider and Spider-Realistic. On the Suhr et al. (2020) datasets, we use the official evaluation script released by the authors and report execution accuracy.

WikiSQL. WikiSQL (Zhong et al., 2017) is a large-scale text-to-SQL dataset consisting of over 80k question-query pairs grounded on over 30k Wikipedia tables. Existing models already approach the upper-bound performance on this dataset (Hwang et al., 2019; Yavuz et al., 2018), mainly because of the simplicity of the SQL queries and the large amount of training data; nevertheless, previous work has also used this dataset to demonstrate a model's generalization ability with limited training data (e.g., Yao et al., 2020). For the base model, we use SQLova (Hwang et al., 2019). Following the official leaderboard, we report both logical form accuracy and execution accuracy.

Training Details
For all experiments, we use the BERT implementation from Huggingface (Wolf et al., 2020) and the pretrained BERT-LARGE model from Google. For pretraining, we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 2e-5 and a batch size of 48. In both settings, we use K_neg = 1 and pretrain our model for 5 epochs. We use 4 V100 GPUs for pretraining, which takes less than 4 hours. For Spider and the realistic evaluation sets, we use the official implementation of RAT-SQL and modify it to generate executable SQL queries. We follow the original settings and perform hyperparameter search over the learning rate (3e-4, 7.44e-4) and warmup steps (5k, 10k). We use the same polynomial learning rate scheduler with warmup and train for 40,000 steps with a batch size of 24. The learning rate for the pretrained encoder (e.g., BERT) is 3e-6, and the encoder is frozen during warmup.
For WikiSQL, we use the official SQLova implementation 7 . We use the default setting, with a learning rate of 1e-3 for the main model and 1e-5 for the pretrained encoder. We train the model for up to 50 epochs and select the best model using the dev set.

Main Results
Spider. We first show results on the Spider dev set in Table 4. The original Spider setting assumes only the schema information of the target database is known in both training and evaluation, as the content of the database may not be accessible to the system due to privacy concerns. More recently, some works have tried using the database content to help understand the columns and link them with the NL utterance. Here we show results for both settings. In the first setting, where only schema information is known, we disable the value-based linking module in RAT-SQL. As we can see from Table 4, replacing BERT-LARGE with STRUG consistently improves model performance in both settings. Under the setting where content is available, using STRUG achieves similar performance to GRAPPA and outperforms all other models. GRAPPA uses both synthetic data and a larger text-table corpus for pretraining; however, it mainly learns inductive bias from the synthetic data, while our model focuses on learning text-table alignment knowledge from the parallel text-table data. In an error analysis on the Spider dev set, we notice that our best model 8 corrects 76 out of 270 wrong predictions made by GRAPPA, while GRAPPA corrects 80 out of 274 wrong predictions made by our model. This suggests that the two pretraining techniques are complementary, and we expect combining them to lead to further performance improvement. For results on different difficulty levels and components, please see Appendix B.1.

More realistic evaluation sets. Results on the realistic evaluation sets are summarized in Table 3. First, we notice that the performance of all models drops significantly on Spider-Realistic, demonstrating that inferring columns without explicit hints is a challenging task with much room for improvement. Second, using STRUG brings consistent improvement over BERT-LARGE on all realistic evaluation sets.

7 https://github.com/naver/sqlova
8 RAT-SQL w. STRUG (Human Assisted)
On the Spider-Realistic set, using STRUG also outperforms GRAPPA by 2.9%. Under the original Suhr et al. (2020) setting, combining RAT-SQL with STRUG significantly outperforms Suhr et al. (2020) on all datasets, even though we do not include WikiSQL as additional training data as they did. Third, comparing the results in Table 4 with Table 3, STRUG brings a larger improvement over BERT-LARGE on the more realistic evaluation sets. As shown in Table 1, the original Spider dataset has a high column mention ratio, so models can use exact matching for column grounding without really understanding the utterance and database schema. The more realistic evaluation sets better simulate real-world scenarios and contain far fewer such explicit clues, making text-table alignment more challenging.

WikiSQL. Existing models already reach the upper-bound performance on WikiSQL, largely due to the large size of the training data and the simple SQL structure of WikiSQL. To better demonstrate that the knowledge learned in pretraining can be effectively transferred to the text-to-SQL task and reduce the need for supervised training data, we also conduct experiments with randomly sampled training examples. From Fig. 4 we can see that with only 1% of the training data (around 500 examples), models using STRUG achieve over 0.70 accuracy, outperforming both BERT-LARGE and TaBERT by a large margin. STRUG brings consistent improvement over BERT-LARGE until we use half of the training data, at which point all models reach nearly the same performance as with the full training data. We also show the training progress using 5% of the training data in Fig. 5; STRUG also helps speed up training. For more break-down results on several subtasks, please see Appendix B.2.

Comparison of the human assisted and automatic settings. On all benchmarks, we notice that STRUG pretrained in the automatic setting performs similarly to the setting where cell annotations are used.
This indicates the effectiveness of our heuristic for cell annotation and the potential to pretrain STRUG with more unannotated parallel text-table data.

Case Study
We compare the predictions made by RAT-SQL w. BERT-LARGE and w. STRUG (Automatic). Some examples are shown in Table 6. In the first example from Spider-Realistic, we can see that the model w. BERT-LARGE fails to align tournaments with the tourney_name column because of the string mismatch.
In the second example from IMDB, although the model correctly recognizes James Bond as value reference, it fails to ground it to the correct column which is movie_title. This supports our hypothesis that using STRUG helps to improve the structure grounding ability of the model.

Conclusion
In this paper, we propose a novel and effective structure-grounded pretraining technique for text-to-SQL. Our pretraining approach leverages a set of novel prediction tasks over a parallel text-table corpus to help solve the text-table alignment challenge in text-to-SQL. We design two settings for obtaining pretraining labels without requiring complex SQL query annotation: using human-labeled cell associations, or leveraging the table contents. In both settings, STRUG significantly outperforms BERT-LARGE on all evaluation sets. Meanwhile, although STRUG is surprisingly effective (using only 120k text-table pairs for pretraining) and performs on par on Spider with models like TaBERT (which uses 26M tables and their English contexts) and GRAPPA (which uses 475k synthetic examples and 391.5k examples from existing text-table datasets), we believe it is complementary to these existing text-table pretraining methods. In the future, we plan to further increase the size of the pretraining corpus, and to explore how to incorporate MLM objectives and synthetic data.

Ethical Considerations
Dataset. In this work, we re-purpose an existing table-to-text generation dataset, ToTTo (Parikh et al., 2020), for our pretraining. We obtain labels for our three pretraining tasks via weak supervision, using only the raw sentence-table pairs, or the cell annotations and revised descriptions that are already included in the ToTTo dataset. As a result, no extra human effort is required to collect our pretraining corpus. We also curate a more realistic evaluation dataset for text-to-SQL based on the Spider dev set. In particular, we first select a complex subset from the Spider dev set and manually revise the NL questions to remove explicit mentions of column names. The detailed description of the process can be found in Section 4. The first author manually revised all the questions himself, resulting in 508 examples in total.

Application. We focus on the task of text-to-SQL, which is a fundamental task for building natural language interfaces for databases. Such interfaces can enable non-expert users to effortlessly query databases. In particular, we focus on improving the structure grounding ability of text-to-SQL models, which is critical in real-world use cases. We evaluate our model on the widely used Spider benchmark and several more realistic datasets. Experimental results show that our method brings significant improvement over existing baselines, especially in more realistic settings.

Computing cost. We use 4 V100 GPUs for pretraining, and 1 V100 GPU for finetuning the model for text-to-SQL on Spider and WikiSQL. One advantage of our method is its efficiency: in our experiments, we pretrain the model for only 5 epochs, which finishes within 4 hours. For comparison, the largest TaBERT model (Yin et al., 2020) takes 6 days to train for 10 epochs on 128 Tesla V100 GPUs using mixed precision training.

We use the filtering scripts released by the authors of Suhr et al. (2020).
More specifically, they remove examples that fall into the following categories: (1) a numeric or text value in the query is not copiable from the utterance (except for the numbers 0 and 1, which are often not copied from the input); (2) the result of the query is an empty table, or a query for count returns [1]; (3) the query requires selecting more than one final column.
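The three filtering rules can be sketched as a predicate. This is our simplified illustration: the released scripts operate on parsed SQL queries and executed results, not on these pre-extracted fields.

```python
def keep_example(values_in_query, utterance, result_rows, is_count, n_final_columns):
    """Return True if the example survives all three filtering rules."""
    # (1) every numeric/text value must be copiable from the utterance,
    #     except the numbers 0 and 1
    for v in values_in_query:
        if v not in ("0", "1") and v not in utterance:
            return False
    # (2) drop queries whose result is an empty table, and count queries
    #     that return [1]
    if not result_rows:
        return False
    if is_count and result_rows == [1]:
        return False
    # (3) drop queries selecting more than one final column
    if n_final_columns > 1:
        return False
    return True
```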

B.1 Detailed Results on Spider and Spider-Realistic
We show more detailed results on the Spider dev set and Spider-Realistic in Table 7, Table 8 and Table 9. From Table 7 we can see that STRUG brings significant improvements at all difficulty levels, and is not biased towards a certain subset. Since STRUG mostly improves the structure grounding ability of the model, in Table 8 and Table 9 we notice greater improvement using STRUG, especially for GROUP BY clauses.

B.2 Detailed Results on WikiSQL
We show subtask performance for WikiSQL in Table 10, Fig. 7 and Fig. 4. Again, we can see that STRUG mainly improves WHERE column and WHERE value accuracy. From Fig. 6 we can see that with only 1% of the training data, the model with STRUG already achieves over 0.87 WHERE column accuracy and nearly 0.85 WHERE value accuracy.