Re-examining the Role of Schema Linking in Text-to-SQL

In existing sophisticated text-to-SQL models, schema linking is often considered as a simple, minor component, belying its importance. By providing a schema linking corpus based on the Spider text-to-SQL dataset, we systematically study the role of schema linking. We also build a simple BERT-based baseline, called S chema-L inking SQL (SLSQL) to perform a data-driven study. We ﬁnd when schema linking is done well, SLSQL demonstrates good performance on Spider despite its structural simplicity. Many remaining errors are attributable to corpus noise. This suggests schema linking is the crux for the current text-to-SQL task. Our analytic studies provide insights on the characteristics of schema linking for future developments of text-to-SQL tasks. 1


Introduction
Structured Query Language (SQL), while exact and powerful, suffers from a complex grammar presenting significant challenges for laymen to write queries. Automatically parsing natural language into SQL (text-to-SQL) thus has huge potential, as it would enable lay users to mine the world's structured data using natural language queries.
To achieve practical text-to-SQL workflow, a model needs to correlate natural language queries with the given database. Therefore, schema linking is considered helpful for text-to-SQL parsing (Guo et al., 2019;Bogin et al., 2019b;Dong et al., 2019;Wang et al., 2020). Here, schema linking means identifying references of columns, tables and condition values in natural language queries. For example, for the question "Find the names of schools that have a donation with amount * Equal contribution. 1 Our code and annotation are available at https:// github.com/WING-NUS/slsql. above 8.5" (shown with relevant tables in Figure 1), "name" is a column reference to school.name, "donation" a table reference to endowment, and "8.5" and "sale" are value references, corresponding to the condition values in the SQL query.
Existing solutions largely treat schema linking as a minor component implemented with simple string matching (Guo et al., 2019;Yu et al., 2018a; heuristics to support sophisticated textto-SQL models. An exception is Dong et al. (2019), which framed schema linking as a task to be solved by sequential tagging. While they did show the importance of schema linking, how it contribute to text-to-SQL task performance remains unanswered as there is no annotated corpus to analyze.
To address these shortcomings, we perform an indepth study on the role of schema linking in text-to-SQL parsing. Intuitively, schema linking helps both cross-domain generalizability and complex SQL generation, which have been identified as the current bottlenecks of the text-to-SQL task (Finegan-Dollak et al., 2018;Yu et al., 2018c). By crossdomain generalizability, we refer to the proper separation of training and testing instances and databases, requiring a model to infer against with arbitrary databases where the schema and the domain are previously unknown. This means the model must be aware of what tables and columns are involved in the question -exactly what schema linking does. Schema linking indirectly addresses the complex SQL generation challenge: the writing of SQL queries comprising a mixture of select, group by, and nested clauses. Generating such queries requires the modeling of complex semantic dependencies in the input and to manage complex SQL grammar during decoding. As discussed in Dong et al. (2019), detecting and removing domainspecific words from the model's purview allows the model to focus on learning syntactic conversion between natural language and SQL, reducing the input's syntactic sparsity. For example, if we view the words linked with the schema as placeholders, the two natural language queries in Figure 1 can be deemed syntactically similar.
To perform a systematic data-driven study, we annotate and contribute ground truth schema linking data for the publicly-available training and development set of the Spider dataset (Yu et al., 2018c). We then build a simple BERT baseline, named Schema-Linking SQL (SLSQL), which links the schema in a natural language query and parses the SQL query with awareness of the predicted schema linking results. We systematically compare several variants of SLSQL, each of which utilizes schema linking differently. We find that schema linking always leads to better SQL parsing performance. But the performance of schema linking is far from perfect, even under supervised learning (≤ 0.83 F1). To maximize the potential of schema linking, we study how SLSQL performs during inference when provisioned with oracle schema linking results (i.e., ground truth annotation). As such, we find our simple SLSQL model performs impressively on Spider -the remaining gap is largely due to corpus noise, including inconsistent patterns and errors in the dataset. This evidence points to schema linking as a critical task for text-to-SQL parsing, and also provides an indicative upper bound on performance on the Spider dataset. Interestingly, our analyses on the failure cases caused by model deficiencies reveal advanced challenges on the text-to-SQL task like deep logical reasoning and extremely complex structure (c.f. Section 5). Our annotated data enables us to address these challenges without the interference of the current noisy schema linker.
In summary, we contribute an annotation of schema linking and in-depth analyses on the role of schema linking in the text-to-SQL task. We identify schema linking as a crux for the further improvement of text-to-SQL parsing. Our analyses provide insights to advance the understanding of text-to-SQL parsing, facilitating future research on the areas of problem identification, dataset construction and model evaluation.

Related Work
Text-to-SQL Parsing: Text-to-SQL parsing has been long studied in past decades (Finegan-Dollak et al., 2018;Yu et al., 2018c). Early text-to-SQL systems rely heavily on complicated rules and handcrafted feature engineering (Zhong et al., 2017;Finegan-Dollak et al., 2018). Fortunately, the research progress has been largely accelerated in recent years thanks to both large-scale text-to-SQL datasets (Zhong et al., 2017;Yu et al., 2018c) and interests in neural modeling (Xu et al., 2017;Dong and Lapata, 2018;Sun et al., 2018;Yu et al., 2018b;Guo et al., 2019;Wang et al., 2020). With years of studies, current research on this task focuses on addressing cross-domain generalizability and generating complex SQL queries. To improve cross-domain generalizability, advanced representations of the schema and the queries are explored, e.g., graph-based schema representations (Bogin et al., 2019b,a), contextualized question representations (Hwang et al., 2019;Guo et al., 2019) and relation-aware self-attention (Wang et al., 2020). As for the complex SQL query generation, approaches are proposed to constrain the output with SQL grammar, e.g., modular decoders for separate SQL clauses (Yu et al., 2018b), intermediate language representation (Guo et al., 2019), recursive decoding for nested queries (Lee, 2019), schemadependent grammar for SQL decoding , etc. Unlike their perspective, this work calls attention to schema linking, which we consider is the crux for the text-to-SQL task and yet to be sufficiently studied.
Schema Linking: The idea of schema linking has been broadly studied in similar tasks like entity linking in the field of knowledge graphs (Fu et al., 2020;Wu et al., 2019;Rijhwani et al., 2019;Logeswaran et al., 2019) and slot filling in dialogue systems (Xu and Hu, 2018;Ren et al., 2018;Nouri and Hosseini-Asl, 2018;Rastogi et al., 2017), where ample annotated data and models have been proposed to address their specific properties. In the general domain of semantic parsing, it has been demonstrated that decoupling underlying structure with lexicon benefits cross-domain semantic parsing (Su and Yan, 2017;Herzig and Berant, 2018). However, when it comes to the text-to-SQL problem, many existing approaches treat schema linking as a minor pre-processing procedure using simple heuristics, such as string matching between natural language utterances and column/table names (Guo et al., 2019;Yu et al., 2018a;. As discussed in Dong et al. (2019), such simple heuristics are difficult to accurately identify columns/tables involved in a natural language utterance and well understand the relation between an utterance and the corresponding database schema. Therefore, they make the first step towards treating schema linking as an individual research problem. Nevertheless, due to the lack of direct schema linking supervision, they achieve limited improvement on the challenging Spider dataset, further illustrating the difficulties of this problem. Unlike these prior approaches, more recent models (Bogin et al., 2019a,b;Wang et al., 2020) integrate schema linking as a learnable component into the network, which brings significant improvements. In this work, we take one step further along this line to perform a thorough study by conducting schema linking annotation, discussing its importance and revealing its unique characteristics.

Schema Linking Annotation
To support a data-driven and systematical study, we annotate the schema linking information for each instance in the training and development set of Spider (Yu et al., 2018c) (the test set is hidden), the largest and most challenging text-to-SQL dataset so far. A simple way is conducting automatic annotation by matching table/column names and values in a ground truth SQL query against its corresponding natural language query. However, such automatic annotation method can bring much noise that would potentially hinder the model performance. Therefore, we annotate the dataset combining both automatic and manual processing.
Automatic Annotation: We first programmatically annotate the schema linking for easy cases to reduce the manual work. The string matching strategy we use is inspired by Guo et al. (2019). We first generate n-grams for each natural language query and only keep those with a length less than 6. For each condition value in a ground truth SQL query, we label the n-gram which it exactly matches. After labeling all n-grams matching with SQL condition values, we enumerate all unlabeled n-grams in the ascending order of length. If an n-gram contains all tokens in a column name, we label and regard this n-gram as a reference to that column. This process deals with cases where a column name is exactly mentioned or slightly paraphrased (e.g., column "type code" mentioned as "code of type"). However, a column/table name can also partially occur (e.g., column "type code" mentioned as "code").
To deal with such cases, we enumerate all unlabeled n-grams in descending order of length. If an n-gram contains any token in a column name, we label it as the reference to the corresponding column. Table reference labeling is conducted similarly.
Manual Annotation: We then recruit three computer science majors to further manually refine the automatic annotation. They are trained with a detailed annotation guideline and 50 trial samples. One is allowed to start after getting all trial samples correctly annotated. During the process, strict quality control is conducted by calculating their inter-annotator agreement (IAA) 2 . Specifically, the dataset is divided into 10 batches where instances in each batch are equally distributed to the annotators with 5% overlap. We accept a batch if the IAA is higher than 0.7 (at least 70% of the instances have exactly the same annotation by all three annotators); otherwise, the annotators are required to re-examine their annotations individually. Once a batch of annotation is accepted, we let the annotators discuss their disagreed annotations and come up with a final agreed result. Analysis: To evaluate how noisy and coarse the automatic annotation is, we calculate its F1 score by treating the manually improved result as the ground truth (see Table 1). For value reference, some missing annotations are caused by abbreviation. For example, the text span "assistant professor" in a natural language query is abbreviated as rank="AsstProf" in the corresponding ground truth SQL query. For column and table reference, an issue is that it cannot deal with columns and tables are not textually referred to in the corresponding utterance. For example, the column elevation is mentioned as "altitude" in a sentence. Other issues of the automatic annotation are largely caused by similar column/table names.

Model
To conduct an in-depth study of the role of schema linking, we develop the SLSQL model, as depicted in Figure 2. It comprises a base model (the left part), which is based on the encoder-decoder structure, and an explicit schema linking component (the right part). While many sophisticated models have been proposed, we adopt this structure to leverage its simplicity to systematically analyze the factor of schema linking. In our experiments, we will study different variants of this structure by configuring the model in various ways. Before the detailed model description, we first introduce our mathematical notations. We denote the natural language query Q as It is the concatenation of a special token [none], the name of each table and the names of its columns. Here, [none], which we treat as a special element in the schema, is designed for schema linking -if a word in Q does not link to any column or table, it links to this token. In addition, we use M LP to mean multilayer perceptron, ⊕ to represent the concatenation operation, and bold symbols to denote dense representations.

Base Model
We now detail the base model which consists of an encoder and a decoder. The encoder (part 1 in Figure 2) processes the input (i.e., Q and E) into hidden representation (denoted as h) and the decoder (part 2 in Figure 2) generates the SQL query (i.e, S) accordingly.
Encoder: Following (Hwang et al., 2019;Guo et al., 2019;, we concatenate the input query Q and database schema E to an integrated sequence as input for BERT  to generate embeddings for each question token and element in the schema (namely ) and the overall representation for the input as h. Here, E consists of embeddings of all the columns/tables and the special token [none]. The embedding of the special token [CLS] in BERT is taken as h. Formally, we have: (1) Note that, in this representation, the schema linking information has also been captured by the multilayer self-attention implicitly. However, we argue the explicit supervisions are required. While a plausible solution is to use the relation-aware encoding proposed by Wang et al. (2020) to do this, we later propose a simpler solution to facilitate our analytical study.
Decoder: Inspired by the prior work (Yin and Neubig, 2017;Lapata, 2016, 2018;, we adopt a two-step decoder to generate the SQL query from the hidden representation h. We first generate a coarse SQL query S , namely a SQL sequence without aggregate functions, using a GRU network . We then synthesize the final SQL query S based on S . The 2 part in Figure 2 illustrates the generation of aggregate functions for the column budget during the decoding process.

Schema Linking Extension
To study the role of schema linking, we extend the encoder to explicitly capture the schema linking information. It works in two steps: in step 1.1, we learn the explicit schema linking based on our annotation; in step 1.2, we learn h, the overall representation for the input, by integrating our explicit schema linking results.

Schema Linking Learning:
We denote the ground truth schema linking distribution as Θ and the estimated one asΘ. We denote the linking probability of the token q i and the schema element e j asP i,j , which is calculated aŝ where 1 ≤ i ≤ |Q| and 1 ≤ j ≤ |E|. Therefore, the estimated schema linking distribution is formulated asΘ = {Pi,j} 1≤i≤|Q|,1≤j≤|E| . The loss function of the schema linking step is defined as Pi,j logPi,j where P i,j ∈ Θ is the ground truth value for each pair of q i and e j .
Schema-aware Representation: In step 1.2, we learn the schema-aware representation (i.e., h) based on the predicted schema linking results (i.e., Θ), Q, and E. To cover the temporal relation, we use a bi-directional GRU to generate h. Thus, the schema-aware representation is learned from where f is the reference mechanism. It is calculated by concatenating the embedding of q i and the embedding of the schema element (including [none]) it likely links with. We here imple-mented it in a soft manner through weighted average embedding. Specifically,

Experiments
In this section, we conduct systematic studies on the role of schema linking in text-to-SQL parsing. We examine several variants of SLSQL and thoroughly analyze the experimental results, providing detailed analyses and discussions to shed light on its unique characteristics.

Experiment Settings
Dataset: We conduct the experiments on the Spider (Yu et al., 2018c) dataset, a large-scale benchmark for cross-domain complex text-to-SQL task. Spider consists of 11,840 examples which are split into training (size: 7,000), development (size: 2,134) and test set (size: 1,034), covering 138 different domains. In addition, SQL queries in the dataset are categorized into four difficulty levels based on the number of SQL keywords. Models are evaluated using the official exact matching accuracy metric of Spider. We conduct ablation studies on the development set since the test set is used for scoring models on the leaderboard and is not publicly accessible.
Model Variants: To study the contribution of schema linking to the text-to-SQL parsing problem, we examine the following variants of SLSQL, each with one way of utilizing the schema linking information. To facilitate discussions, the model described in Section 4 is referred to as default in this section.
• base model: To evaluate the impact of explicit schema linking, we use the base model, as a variant, whose encoder is defined as EQ. (1) and followed by the decoder directly. Note there is no schema linking component in this variant.
• auto: To validate the advance of the manual annotation, we build this variant, which is trained with the automatic schema linking annotation in EQ.
(3) instead of the manual annotation.
• hard reference: As introduced in Section 4.2, we generate schema-aware representation using reference mechanism f . Here, we further build this variant using hard reference concatenation. Specifically, instead of computing weighted average embeddings of all database schema elements as described in EQ. (5), we concatenate a question token embedding q i with e j which is the embedding of the schema element with the highest linking probability, i.e.P i,j , to validate whether a different way of integrating schema linking information impacts the performance.
• oracle: To explore the maximum potential benefit of schema linking to the text-to-SQL task, we design the oracle variant. In this variant, we remove the schema liking learning component (step 1.1 in Figure 2), connect the encoder part with step 1.2 directly, and replace the estimated distributionΘ with the ground truth distribution Θ in EQ. (4), namely, h = Bi-GRU f (Θ, Q, E) . Like the auto variant, we build the oracle auto variant using the automatic annotation instead.   . As expected, feeding SLSQL with Θ rather thanΘ during inference leads to a significantly higher result (oracle and oracle auto). Similarly, using the manual annotation instead of the automatic annotation for the oracle setting largely improves the model performance.

Overall Performance Analysis
We also list top models on the Spider leaderboard for reference in Table 2. As we do not aim at putting effort in building sophisticated models, our default and hard ref variants cannot compete with the state-of-the-art models like RYANSQL (Choi et al., 2020) and RATSQL (Wang et al., 2020). However, the oracle variant shows that a simple model has the potential to sharply outperform these strong models (on the development set), by improving schema linking performance, which shows an important future direction. We will have further discussions on this issue in Section 5.5.

Can Schema Linking Help Manage Complex Queries?
To have further insights on how schema linking can help complex queries, we investigate detailed model performance under different difficulty levels. Table 3 presents the result. We can observe that, generally schema linking helps boost the performance of SLSQL across all different difficulty levels (base v.s. other variants). More accurate schema linking predictions lead to more significant accuracy improvements. For example, the oracle variant, which has access to the manual annotation, achieves the highest score on all different difficulty levels. Besides, default and hard ref outperform the auto variant trained with automatic annotation on all difficulty levels, thanks to the higher quality of schema linking annotation.

Schema Linking Performance Analyses
To have a better understanding of the schema linking task itself, we test its performance for SLSQLdefault and list some representative wrong predictions, as shown in Table 4 and Table 5. We observe that, with the model trained with explicit supervision, the F1 scores for column, table and value linking are still far from satisfactory, demonstrating that schema linking is not an easy task and requires future efforts to improve. Particularly, linking value references with the schema is the most difficult part as its F1 score is the lowest. We find most of wrong value predictions are due to the lack of world knowledge. As shown in Ex.1, the model mistakenly predicts Aruba as a language instead of a  country. In Ex.2, the model fails to understand that "republic" is a government form. We find that, despite using BERT as underlying language understanding module, the model still has difficulty in dealing with some of such value reference linking. To help the model accurately link value references with the schema, a solution can be scanning the content stored in the database, as applied in some prior work (Bogin et al., 2019b;Wang et al., 2020), to facilitate the model inference. The motivation behind it is that "Aruba" occurs in the column country.name instead of countrylangauge.language or city.name. However, in real scenarios, database contents are not always accessible to text-to-SQL models and condition values mentioned in human utterances do not necessarily exist in the database (Zhong et al., 2017). An alternative solution can be looking for external knowledge resources which easily identify that the word "republic" is related to "government form", as adopted in Guo et al. (2019). Nevertheless, such solution relies heavily on the quality and availability of knowledge resources, making the model less portable for practical use. Most of the remaining errors have commonalities with wrong table/column reference linking, in terms of causes.
When it comes to column and table reference linking, we find the error sources are complex, which mainly include failing to capture semantic relations between words and tables/columns and predicting other linking types. We observe that sometimes the model is biased towards predicting columns/tables that exactly occur in an utterance while neglecting more global information. As shown in Ex.3, the model links the word "rank" to the column ranking.rank while the correct choice should be matches.winner rank if considering more global information. In addition to neglecting global semantic information, failing to capture semantic similarity between words is another cause of errors such as the case shown in Ex.4. We find that some of such issues are caused by the WordPiece tokenization (Wu et al., 2016) in BERT. For example, it tokenizes "highschooler" as "highs ##cho ##ole ##r" while tokenizing "high schooler" as "high school ##er", which is the cause of this case. Besides, columns can have common words with their table names, making the model mistakenly predict some part of a column reference as a table reference (or vice versa) in some cases like Ex.5. However, with such linking errors, the model usually is still able to finally generate desired SQL queries, which means they typically are not as harmful as other errors described above.
Fortunately, many of the aforementioned problems have been extensively studied in similar tasks such as zero-shot entity linking in knowledge graph tasks (Wu et al., 2019;Logeswaran et al., 2019;Rijhwani et al., 2019;Fu et al., 2020) and domain adaptive slot filling in dialogue system domains (Xu and Hu, 2018;Rastogi et al., 2017;Ren et al., 2018;Nouri and Hosseini-Asl, 2018). With annotated data, there is an ample room to transfer these approaches, which are mostly based on supervised learning, to the schema linking problem for further improving the text-to-SQL parsing ability.

Error Analysis of the Oracle Variant
To further investigate what the remaining problems lie on, provided that schema linking can be done perfectly, we conduct error analyses using the oracle variant. We randomly sample 100 error instances on the development set. We analyze the errors and classify them into three categories: Correct Equivalent, Corpus Error and Model Incapability. Considering many examples in the dataset have textually similar questions, erroneous predictions having the same cause are counted once during analysis. This process is repeated for five times and we take the average percentage for each error type. We find many errors are due to the the corpus noises, namely the first two error types. Table 6 provides representative examples for each category. We now detail the three error categories.
Correct Equivalent: One SQL query can have several semantic equivalents with different writing patterns. We find that in the Spider dataset, it is not always consistent that which of such patterns is given as the ground truth. We identify some SQL queries generated by SLSQL are actually semantically correct while treated as wrong predictions due to not only co-existence of different SQL writing patterns in the training set but also the exact matching evaluation. According to our manual verification, such false negative samples take up around 30% of the sampled errors. Ex.1 shows a case where two SQL queries are semantically equivalent, despite different writing patterns. There are also some inconsistent patterns in particular clauses like group by, as illustrated in Ex.2. Such errors suggest that either pattern consistency or more flexible, robust evaluation metric should be focused on for future dataset construction.
Corpus Error: After carefully examining each sampled errors, we also identify around 26% of them are caused by incorrectly annotated example in the dataset, e.g., wrong ground truth SQL queries, incomprehensible utterances, problematic database schemas, etc. As shown in Ex.3, "greater area than that of any country" is indeed logically equivalent to "greater than the maximum area of all countries", while the ground truth SQL query means "greater area than that of some countries". Some of these errors are even hard to identify as incorrect ground truth at first glance. Ex.4 looks like a correct equivalent case while actually the ground truth SQL query is wrong. Moreover, we find typos in natural language queries can lead to incorrect SQL queries. For example, one instance has a text span "the sname of every sing" in the natural language query which we believe should be "the name of every song". While better data annotations definitely result in better SQL parsing performance, such errors suggest a robust text-to-SQL parser should be tolerant of noises like typos and grammatical errors, which is an important but overlooked problem for real application.
Model Incapability: Even with oracle schema linking annotation, SLSQL is yet to be perfect. We find about 44% of the failing instances are due to modeling incapability. Many problems lie in the requirement of deep logical reasoning and extremely complex structure. Considering the case shown in Ex.5, the model directly translates the word "and" Q: Which countries have greater area than that of any country in Europe? G: select name from country where surface area > (select min(surface area) from country where continent = 'Europe') P: select name from country where surface area > (select max(surface area) from country where continent = 'Europe') Ex.4 Q Find the number of concerts happened in the stadium with the highest capacity. G: select count( * ) from concert as t1 join stadium as t2 order by t2.capacity desc limit 1 P: select count( * ) from stadium join concert where stadium.capacity = (select max(capacity) from stadium) Model Incapability (44.1%) Ex.5 Q: What is the total surface area of the continents Asia and Europe? G: select sum(surface area) from country where continent = 'Asia' or continent = 'Europe' P: select sum(surface area) from country where continent = 'Asia' and continent = 'Europe' Ex.6 Q: How many countries speak both English and Dutch? G: select count( * ) from (select t1.name from country as t1 join countrylanguage as t2 where t2.language = 'English' intersect select t1.name from country as t1 join countrylanguage as t2 where t2.language = 'Dutch') P: select count( * ) from countrylanguage where countrylanguage.language = 'English' intersect select count( * ) from countrylanguage where countrylanguage.language = 'Dutch' Table 6: Representative examples of the three error types and their average percentages during sampling. Notations Q, G and P stand for question, ground truth and prediction, respectively. Some on clauses are omitted for display.
into the SQL keyword and, leading to a SQL query with contradictory conditions. Although this query is classified as "medium" by the Spider evaluation script, it is actually difficult as it requires a model to perform logic reasoning based on the understanding that a country cannot be in Asia and Europe at the same time. A similar case is "singers with birth year before 1945 and after 1955" where a numeric comparison is required to avoid generating contradictory where conditions. Ex.6 is a case of extremely complex structure where three logical steps are required to synthesize the SQL query, i.e., 1) selecting English-speaking countries and Dutchspeaking countries; 2) finding their intersection using intersect; and 3) counting the intersection size with an outer query. Unfortunately, the model writes a plausible but wrong SQL query.

Discussion:
The above results and analyses suggest that, with schema linking well solved, even a simple BERT baseline can capture quite a large portion of the patterns in the Spider dataset. This indicates that schema linking is the crux for current research on text-to-SQL task, providing an appealing perspective to this task. Also, through experiments and analyses with schema linking annotation, some previously unnoticed challenges like deep logical reasoning and extremely complex structure have emerged, also pointing further research directions. Such problems were interwoven with schema linking problems in the original Spider dataset. Our schema linking annotation makes it possible for such problems to be separately approached without the interference of database schemas.

Conclusion
We critically examine the role of schema linking for the text-to-SQL task. To support modelindependent and thorough studies, we invest human resources to annotate schema references and contribute a high-quality, large-scale schema linking corpus. Experimenting with our designed Schema Linking SQL (SLSQL) model, we demonstrate that more accurate schema linking conclusively leads to better text-to-SQL parsing performance. Importantly, given oracular schema references, a simple BERT model like SLSQL can achieve an impressive performance. Our experiments show that schema linking, often overlooked as simple preprocessing, is actually a requisite for good SQL parsing performance, providing an intriguing perspective for future improvements on this task. Our study sheds light on the characteristics of text-to-SQL parsing for future efforts including advanced modeling, problem identification, dataset construction and model evaluation.