Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task

We present Spider, a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 college students. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. We define a new complex and cross-domain semantic parsing and text-to-SQL task so that different complicated SQL queries and databases appear in train and test sets. In this way, the task requires the model to generalize well to both new SQL queries and new database schemas. Therefore, Spider is distinct from most of the previous semantic parsing tasks because they all use a single database and have the exact same program in the train set and the test set. We experiment with various state-of-the-art models and the best model achieves only 9.7% exact matching accuracy on a database split setting. This shows that Spider presents a strong challenge for future research. Our dataset and task with the most recent updates are publicly available at https://yale-lily.github.io/seq2sql/spider.


Introduction
Semantic parsing (SP) is one of the most important tasks in natural language processing (NLP). It requires both understanding the meaning of natural language sentences and mapping them to meaningful executable queries such as logical forms, SQL queries, and Python code.
Recently, some state-of-the-art methods with Seq2Seq architectures are able to achieve over 80% exact matching accuracy even on some complex benchmarks such as ATIS and GeoQuery. These models seem to have already solved most problems in this field.
However, previous tasks in this field have a simple but problematic task definition because most of these results are predicted by semantic "matching" rather than semantic parsing. Existing datasets for SP have two shortcomings. First, those that have complex programs (Zelle and Mooney, 1996;Li and Jagadish, 2014;Yaghmazadeh et al., 2017a;Iyer et al., 2017) are too small in terms of number of programs for training modern data-intensive models and have only a single dataset, meaning that the same database is used for both training and testing the model. More importantly, the number of logic forms or SQL labels is small and each program has about 4-10 paraphrases of natural language problem to expand the size of the dataset. Therefore, the exact same target programs appear in both the train and test sets. The models can achieve decent performances even on very complex programs by memorizing the patterns of question and program pairs during training and decoding the programs exactly the same way as it saw in the training set during testing. Finegan-Dollak et al. (2018) split the dataset by programs so that no two identical program would be in both the train and test sets. They show that the models built on this question-splitting data setting fail to generalize to unseen programs. Second, existing datasets that are large in terms of the number of programs and databases such as WikiSQL (Zhong et al., 2017) contain only simple SQL queries and single tables. In order to test a model's real semantic parsing performance on unseen complex programs and its ability to generalize to new domains, an SP dataset that includes a large amount of complex programs and databases with multiple tables is a must.
However, compared to other large, realistic datasets such as ImageNet for object recognition (Deng et al., 2009) and SQuAD for reading comprehension (Rajpurkar et al., 2016), creating such SP dataset is even more time-consuming and challenging in some aspects due to the following reasons. First, it is hard to find many databases with multiple tables online. Second, given a database, annotators have to understand the complex database schema to create a set of questions such that their corresponding SQL queries cover all SQL patterns. Moreover, it is even more challenging to write different complex SQL queries. Additionally, reviewing and quality-checking of question and SQL pairs takes a significant amount of time. All of these processes require very specific knowledge in databases.
To address the need for a large and high-quality dataset for a new complex and cross-domain semantic parsing task, we introduce Spider, which consists of 200 databases with multiple tables, 10,181 questions, and 5,693 corresponding complex SQL queries, all written by 11 college students spending a total of 1,000 man-hours. As Figure 1 illustrates, given a database with multiple tables including foreign keys, our corpus creates and annotates complex questions and SQL queries including different SQL clauses such as joining and nested query. In order to generate the SQL query given the input question, models need to understand both the natural language question and relationships between tables and columns in the database schema.
In addition, we also propose a new task for text-to-SQL problem. Since Spider contains 200 databases with foreign keys, we can split the dataset with complex SQL queries in a way that no database overlaps in train and test, which over-comes the two shortcomings of prior datasets, and defines a new semantic parsing task in which the model needs to generalize not only to new programs but also to new databases. Models have to take questions and database schemas as inputs and predict unseen queries on new databases.
To assess the task difficulty, we experiment with several state-of-the-art semantic parsing models. All of them struggle on this task. The best model achieves only 9.7% exact matching accuracy in the database split setting. This suggests that there is a large room for improvement.
Most of the previous work train their models without schemas as inputs because they use a sin-gle database for both training and testing. Thus, they do not need to generalize to new domains. Most importantly, these datasets have a limited number of labeled logic forms or SQL queries. In order to expand the size of these datasets and apply neural network approaches, each logic form or SQL query has about 4-10 paraphrases for the natural language input. Most previous studies follow the standard question-based train and test split (Zettlemoyer and Collins, 2005). This way, the exact same target queries (with similar paraphrases) in the test appear in training set as well. Utilizing this assumption, existing models can achieve decent performances even on complex programs by memorizing database-specific SQL templates. However, this accuracy is artificially inflated because the model merely needs to decide which template to use during testing. Finegan-Dollak et al. (2018) show that template-based approaches can get even higher results. To avoid getting this inflated result, Finegan-Dollak et al. (2018) propose a new, program-based splitting evaluation, where the exact same queries do not appear in both training and testing. They show that under this framework, the performance of all the current state-of-the-art semantic parsing systems drops dramatically even on the same database, indicating that these models fail to generalize to unseen queries. This indicates that current studies in semantic parsing have limitations.
We also want the model to generalize not only to unseen queries but also to unseen databases. Zhong et al. (2017) published the WikiSQL dataset. In their problem definition, the databases in the test set do not appear in the train or development sets. Also, the task needs to take different table schemas as inputs. Therefore, the model has to generalize to new databases. However, in order to generate about 90,000 questions and SQL pairs for about 26,000 databases, Zhong et al. (2017) made simplified assumptions about the SQL queries and databases. Their SQL labels only cover single SELECT column and aggregation, and WHERE conditions. Moreover, all the databases only contain single tables. No JOIN, GROUP BY, and ORDER BY, etc. are included.

Corpus Construction
All questions and SQL queries were written and reviewed by 11 computer science students who were native English speakers.As illustrated in Figure 2, we develop our dataset in five steps, spending around 1,000 hours of human labor in total: §3.1 Database Collection and Creation, §3.2 Question and SQL Annotation, §3.3 SQL Review, §3.4 Question Review and Paraphrase, §3.5 Final Question and SQL Review.

Database Collection and Creation
Collecting databases with complex schemas is hard. Although relational databases are widely used in industry and academia, most of them are not publicly available. Only a few databases with multiple tables are easily accessible online. Our 200 databases covering 138 different domains are collected from three resources. First, we collected about 70 complex databases from different college database courses, SQL tutorial websites, online csv files, and textbook examples. Second, we collected about 40 databases from the DatabaseAnswers 1 where contains over 1,000 data models across different domains. These data models contain only database schemas. We converted them into SQLite, populated them using an online database population tool 2 , and then manually corrected some important fields so that the table contents looked natural. Finally, we created the remaining 90 databases based on WikiSQL. To ensure the domain diversity, we select about 500 tables in about 90 different domains to create these 90 databases. To create each database, we chose several related tables from WikiSQL dev or test splits, and then created a relational database schema with foreign keys based on the tables we selected. We had to create some intersection tables in order to link several tables together. For most other cases, we did not need to populate these databases since tables in WikiSQL are from Wikipedia, which already had real world data stored.
We manually corrected some database schemas if they had some column names that did not make sense or missed some foreign keys. For table and column names, it is common to use abbreviations in databases. For example, 'student id' might be represented by 'stu id'. For our task definition, we manually changed each column name back to regular words so that the system only handled semantic parsing issues.

Question and SQL Annotation
For each database, we ask eight computer science students proficient in SQL to create 20-50 natural questions and their SQL labels. To make our questions diverse, natural, and reflective of how humans actually use databases, we did not use any template or script to generate question and SQL queries. Our annotation procedure ensures the following three aspects.
A) SQL pattern coverage. We ensure that our corpus contains enough examples for all common SQL patterns. For each database, we ask annotators to write SQL queries that cover all the following SQL components: SELECT with multiple columns and aggregations, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, JOIN, INTERSECT, EXCEPT, UNION, NOT IN, OR, AND, EXISTS, LIKE as well as nested queries. The annotators made sure that each table in the database appears in at least one query. B) SQL consistency. Some questions have multiple acceptable SQL queries with the same result. However, giving totally different SQL labels to similar questions can hinder the training of semantic parsing models. To avoid this issue, we designed the annotation protocol so that all annotators choose the same SQL query pattern if multiple equivalent queries are possible. More detail is explained in our appendix.
C) Question clarity. We did not create questions that are (1) vague or too ambiguous, or (2) require knowledge outside the database to answer.
First, ambiguous questions refer to the questions that do not have enough clues to infer which columns to return and which conditions to consider. For example, we would not ask "What is the most popular class at University X?" because the definition of "popular" is not clear: it could mean the rating of the class or the number of students taking the course. Instead, we choose to ask "What is the name of the class which the largest number of students are taking at University X?". Here, "popular" refers to the size of student enrollment. Thus, the "student enrollment" column can be used in condition to answer this question. We recognize that ambiguous questions appear in real-world natural language database interfaces.
We agree that future work needs to address this issue by having multi-turn interactions between the system and users for clarification. However, our main aim here is to develop a corpus to tackle the problem of handling complex queries and generalizing across databases, which no existing semantic parsing datasets could do. Moreover, the low performances of current state-of-theart models already show that our task is challenging enough, without ambiguous questions. In addition, questions are required to contain the specific information to return. Otherwise, we don't know if class id is also acceptable in the previous case. Most of questions in the existing semantic parsing datasets are ambiguous. This is not a big problem if we use one single dataset because we have enough data domain specific examples to know which columns are default. However, it would be a serious problem in cross domain tasks since the default return values differ cross domain and people.
Second, humans sometimes ask questions that require common sense knowledge outside the given database. For instance, when people ask "Display the employee id for the employees who report to John", the correct SQL is SELECT employee id FROM employees WHERE manager id = ( SELECT employee id FROM employees WHERE first name = 'John') which requires the common knowledge that "X reports to Y" corresponds to an "employee-manager" relation. we do not include such questions and leave them as a future research direction.
Annotation tools We open each database on a web-based interface powered by the sqlite web 3 tool. It allows the annotators to see the schema and content of each table, execute SQL queries, and check the returned results. This tool was extremely helpful for the annotators to write executable SQL queries that reflect the true meaning of the given questions and return correct answers.

SQL Review
Once the database is labeled with question-query pairs, we ask a different annotator to check if the questions are clear and contain enough information to answer the query. For a question with multiple possible SQL translations, the reviewers double check whether the SQL label is correctly chosen under our protocol. Finally, the reviewers check if all the SQL labels in the current database cover all the common SQL clauses.

Question Review and Paraphrase
After SQL labels are reviewed, native English speakers review and correct each question. They first check if the question is grammatically correct and natural. Next, they make sure that the question reflects the meaning of its corresponding SQL label. Finally, to improve the diversity in questions, we ask annotators to add a paraphrased version to some questions.

Final Review
Finally, we ask the most experienced annotator to conduct the final question and SQL review. This annotator makes the final decision if multiple reviewers are not sure about some annotation issues. Also, we run a script to execute and parse all SQL labels to make sure they are correct.

Dataset Statistics and Comparison
We summarize the statistics of Spider and other text-to-SQL datasets in Table 1. Compared with other datasets, Spider contains databases with multiple tables and contains SQL queries including many complex SQL components. For example, Spider contains about twice more nested queries and 10 times more ORDER BY (LIMIT) and GROUP BY (HAVING) components than the total of previous text-to-SQL datasets. Spider has 200 distinct databases covering 138 different domains such as college, club, TV show, government, etc. Most domains have one database, thus containing 20-50 questions, and a few domains such as flight information have multiple databases with more than 100 questions in total. On average, each database in Spider has 28 columns and 9 foreign keys. The average question length and SQL length are about 13 and 21 respectively. Our task uses different databases for training and testing, evaluating the cross-domain performance. Therefore, Spider is the only one text-to-SQL dataset that contains both databases with multiple tables in different domains and complex SQL queries It tests the ability of a system to generalize to not only new SQL queries and database schemas but also new domains.

Task Definition
On top of the proposed dataset, we define a textto-SQL task that is more realistic than prior work. Unlike most of the previous semantic parsing or text-to-SQL tasks, models will be tested on both different complex SQL queries and different complex databases in different domains in our task. It aims to ensure that models can only make the correct prediction when they truly understand the semantic meaning of the questions, rather than just memorization. Also, because our databases contain different domains, our corpus tests model's ability to generalize to new databases. In this way, model performance on this task can reflect the real semantic parsing ability.
In order to make the task feasible and to focus on the more fundamental part of semantic parsing, we make the following assumptions: • In our current task, we do not evaluate model performance on generating values. Predicting correct SQL structures and columns is more realistic and critical at this stage based on the low performances of various current state-ofthe-art models on our task. In a real world situation, people need to double check what condition values are and finalize them after multiple times. It is unrealistic to predict condition values without interacting with users. In reality, most people know what values to ask but do not know the SQL logic. A more reasonable way is to ask users to use an interface searching the Dataset # Q # SQL # DB # Domain # Table / DB ORDER BY GROUP BY NESTED HAVING  ATIS  5,280  947  1  1  32  0  5  315  0  GeoQuery  877  247  1  1  6  20  46  167  9  Scholar  817  193  1  1  7  75  100  7  20  Academic  196  185  1  1  15  23  40  7  18  IMDB  131  89  1  1  16  10  6  1  0  Yelp  128  110  1  1  7  18  21  0  4  Advising  3,898  208  1  1  10  15  9  22  0  Restaurants  378  378  1  1  3  0  0  4  0  WikiSQL  values, then ask more specific questions. Also, other previous work with value prediction uses one single database in both train and test which makes it possible to overfit. However, in our task, we have different databases of different domains in train and test.
• As mentioned in the previous sections, we exclude some queries that require outside knowledge such as common sense inference and math calculation. For example, imagine a table with birth and death year columns. To answer the questions like "How long is X's life length?", we use SELECT death year -birth year. Even though this example is easy for humans, it requires some common knowledge of the life length definition and the use of a math operation, which is not the focus of our dataset.
• We assume all table and column names in the database are clear and self-contained. For example, some databases use database specific short-cut names for table and column names such as "stu id", which we manually converted to "student id" in our corpus.

Evaluation Metrics
Our evaluation metrics include Component Matching, Exact Matching, and Execution Accuracy. In addition, we measure the system's accuracy as a function of the difficulty of a query. Since our task definition does not predict value string, our evaluation metrics do not take value strings into account. We will release the official evaluation script along with our corpus so that the research community can share the same evaluation platform.

Component Matching
To conduct a detailed analysis of model performance, we measure the average exact match between the prediction and ground truth on different SQL components. For each of the following components: SQL keywords without column names and operators) we decompose each component in the prediction and the ground truth as bags of several subcomponents, and check whether or not these two sets of components match exactly. To evaluate each SELECT component, for example, consider SELECT avg(col1), max(col2), min(col1), we first parse and decompose into a set (avg, min, col1), (max, col2), and see if the gold and predicted sets are the same. Previous work directly compared decoded SQL with gold SQL. However, some SQL components do not have order constraints. In our evaluation, we treat each component as a set so that for example, SELECT avg(col1), min(col1), max(col2) and SELECT avg(col1), max(col2), min(col1) would be treated as the same query. To report a model's overall performance on each component, we compute F1 score on exact set matching.
Exact Matching We measure whether the predicted query as a whole is equivalent to the gold query. We first evaluate on the SQL clauses as described in the last section. The predicted query is correct only if all of the components are correct. Because we conduct set comparison in each clause, this exact matching metric can handle the "ordering issue" (Xu et al., 2017).
Execution Accuracy 4 Since Exact Matching can create false negative evaluation when the semantic parser generates novel and correct syntax structures, we also consider Execution Accuracy. All our databases have executable SQLite files, so we can measure execution accuracy as well. However, it is also important to note that Execution Accuracy can create false positive evaluation as a predicted SQL could return the same result (for example, 'NULL') as the gold SQL when they are semantically different. So we can use both to complement each other. Finally, our evaluation also considers multiple acceptable keys if JOIN and GROUP are in the query. For example, suppose "stu id" in one table refers to "stu id" in another table, GROUP BY either is acceptable.

SQL Hardness Criteria
To better understand the model performance on different queries, we divide SQL queries into 4 levels: easy, medium, hard, extra hard.
We define the difficulty based on the number of SQL components, selections, and conditions, so that queries that contain more SQL keywords (GROUP BY, ORDER BY, INTERSECT, nested subqueries, column selections and aggregators, etc) are considered to be harder. For example, a query is considered as hard if it includes more than two SELECT columns, more than two WHERE conditions, and GROUP BY two columns, or contains EXCEPT or nested queries. A SQL with more additions on top of that is considered as extra hard. Figure 3 shows examples of SQL queries in 4 hardness levels.

Methods
In order to analyze the difficulty and demonstrate the purpose of our corpus, we experiment with several state-of-the-art semantic parsing models. As our dataset is fundamentally different from the prior datasets such as Geoquery and WikiSQL, we adapted these models to our task as follows. We created a 'big' column list by concatenating columns in all tables of the database together as a input to all models. Also, for each model, we limit the column selection space for each question example to all column of the database which the question is asking instead of all column names in 4 We will provide the results in the later version. Please check our website for the latest updates on the task at https://yale-lily.github.io/seq2sql/ spider What is the number of cars with more than 4 cylinders?  the whole corpus.
Seq2Seq Inspired by neural machine translation (Sutskever et al., 2014), we first apply a basic sequence-to-sequence model, Seq2Seq. Then, we also explore Seq2Seq+Attention from (Dong and Lapata, 2016) by adding an attention mechanism (Bahdanau et al., 2015). In addition, we include Seq2Seq+Copying by adding an attention-based copying operation similar to (Jia and Liang, 2016). The original model does not take the schema into account because it has the same schema in both train and test. We modify the model so that it
SQLNet introduced by (Xu et al., 2017) uses column attention and employs a sketch-based method and generates SQL as a slot-filling task. This fundamentally avoids the sequence-tosequence structure when ordering does not matter in SQL query conditions. Because it is originally designed for WikiSQL, we also extend its SELECT and WHERE modules to other compo-nents.
TypeSQL is the state-of-the-art model on the WikiSQL task (Yu et al., 2018). It improves upon SQLNet by proposing a different training procedure and utilizing types extracted from either knowledge graph or table content to help model better understand entities and numbers in the question. In our experiment, we use the question type info extracted from database content. Also, we extend their modules to other components.

Experimental Results and Discussion
We summarize the performance of all models on our test set including accuracy of exact matching in Table 2 and F1 scores of component matching in Table 3. For the final training dataset, we also select and include 752 queries and 1659 questions that follow our annotation protocol from six existing datasets: Restaurants, GeoQuery, Scholar, Academic, IMDB, and Yelp. We report results on two different settings for all models: (1)  In general, the overall performances of all models are low, indicating that our task is challenging and there is still a large room for improvement.
Example Split vs Database Split As discussed in Section 5, another challenge of the dataset is to generalize to new databases. To study this, in Table 2 and Table 3 we compare model performances under the two settings. For all models, the performance under database split is much lower than that under example split. In addition, we observe that all models perform poorly on column selection. This shows that our dataset presents a challenge for the model to generalize to new databases.

Complexity of Database Schema
In order to show how the complexity of the database schema affects model performance, Figure 4 plots the exact matching accuracy as a function of the number of foreign keys in a database. The performance decreases as the database has more foreign keys. The first reason is because the model has to choose column and table names from many candidates in a complex database schema. Second, a complex database schema presents a great challenge for the model to capture the relationship between different tables with foreign keys. It indicates that this task requires more effective methods to encode the

Conclusion
In this paper we introduce Spider, a large, complex and cross-domain semantic parsing and textto-SQL dataset, which directly benefits both NLP and DB communities. Based on Spider, we define a new challenging and realistic semantic parsing task. Experimental results on several state-of-theart models on this task suggests plenty space of improvement.