DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset



Introduction
In the past few decades, a large amount of research has focused on searching for answers in unstructured texts given natural language questions, a problem known as the question answering (QA) task (Burke et al., 1997; Kwok et al., 2001; Allam and Haggag, 2012; Nguyen et al., 2016). However, much high-quality knowledge and data in the real world is actually stored in databases. It is thus extremely useful to allow ordinary users to directly interact with databases via natural questions. To meet this need, researchers have proposed the text-to-SQL task along with released English datasets for model training and evaluation, such as ATIS (Iyer et al., 2017), GeoQuery (Popescu et al., 2003), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018b).
Formally, given a natural language (NL) question and a relational database, the text-to-SQL task aims to produce a legal and executable SQL query that leads directly to the correct answer, as depicted in Figure 1. A database is composed of multiple tables and denoted as DB = {T_1, T_2, ..., T_n}. A table is composed of multiple columns and denoted as T_i = {col_1, col_2, ..., col_m}. Tables are usually linked with each other by foreign keys.
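As a minimal sketch (the table and column names here are invented for illustration, not taken from DuSQL), such a multi-table database can be modeled as a mapping from tables to column lists plus a set of foreign-key links between (table, column) pairs:

```python
# Toy schema representation for a multi-table database.
# Table and column names are illustrative only.
database = {
    "tables": {
        "T1": ["name", "province", "population", "area"],  # a city table
        "T2": ["term_id", "name", "ranker"],               # a ranking table
    },
    # Foreign keys link (table, column) pairs across tables.
    "foreign_keys": [(("T2", "name"), ("T1", "name"))],
}

def columns_of(db, table):
    """Return the column list of one table in the database."""
    return db["tables"][table]

print(columns_of(database, "T1"))  # ['name', 'province', 'population', 'area']
```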
The earliest datasets include ATIS (Iyer et al., 2017) , GeoQuery (Popescu et al., 2003), Restaurants (Tang and Mooney, 2001), Academic (Li and Jagadish, 2014), etc. Each dataset only has a single database containing a certain number of tables. All question/SQL pairs of train/dev/test sets are generated against the same database. Many interesting approaches are proposed to handle those datasets (Iyer et al., 2017;Yaghmazadeh et al., 2017;Finegan-Dollak et al., 2018).
However, real-world applications usually involve more than one database, and require the model to generalize to unseen databases during evaluation. To accommodate this need, the WikiSQL dataset was released by Zhong et al. (2017). It consists of 80,654 question/SQL pairs over 24,241 single-table databases. They propose a new data split setting to ensure that databases in train/dev/test do not overlap. However, they focus on very simple SQL queries containing one SELECT statement with one WHERE clause. In addition, Sun et al. (2020) released TableQA, a Chinese dataset similar to WikiSQL. Yu et al. (2018b) released the more challenging Spider dataset, consisting of 10,181 question/SQL pairs against 200 multi-table databases. Compared with WikiSQL and TableQA, Spider is much more complex for two reasons: 1) the need to select relevant tables; 2) many nested queries and advanced SQL clauses like GROUP BY and ORDER BY.
As far as we know, most existing datasets are constructed for English. Another issue is that they do not refer to the question distribution in real-world applications during data construction. Take Spider as an example. Given a database, annotators are asked to write many SQL queries from scratch. The only requirement is that the SQL queries have to cover a list of SQL clauses and nested queries. Meanwhile, the annotators write NL questions corresponding to the SQL queries. In particular, all these datasets contain very few questions involving calculations between rows or columns, which we find are very common in real applications. This paper presents DuSQL, a large-scale and pragmatic Chinese text-to-SQL dataset, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. Specifically, our contributions are summarized as follows.
• In order to determine a more realistic distribution of SQL queries, we collect user questions from three representative database-oriented applications and perform manual analysis. In particular, we find that a considerable proportion of questions require row/column calculations, which are not included in existing datasets.
• We adopt an effective data construction framework via human-computer collaboration. The basic idea is automatically generating SQL queries based on the SQL grammar and constrained by the given database. For each SQL query, we first generate a pseudo question by traversing it in the execution order and then ask annotators to paraphrase it into a NL question.
• We conduct experiments on DuSQL using three open-source parsing models. In particular, we extend the state-of-the-art IRNet model (Guo et al., 2019) to accommodate the characteristics of DuSQL. Results and analysis show that DuSQL is a very challenging dataset. We will release our data at https://github.com/luge-ai/luge-ai/tree/master/semantic-parsing.

SQL Query Distribution
As far as we know, existing text-to-SQL datasets mainly consider the complexity of SQL syntax when creating SQL queries. For example, WikiSQL has only simple SQL queries containing SELECT and WHERE clauses. Spider covers 15 SQL clauses including SELECT, WHERE, ORDER BY, GROUP BY, etc., and allows nested queries. However, to build a pragmatic text-to-SQL system that allows ordinary users to directly interact with databases via NL questions, it is very important to know the SQL query distribution in real-world applications, from the perspective of user needs rather than SQL syntax. Our analysis shows that Spider mainly covers three types of SQL queries, i.e., matching, sorting, and clustering, whereas WikiSQL only has matching queries. Neither of them contains the calculation type, which we find composes a large portion of questions in certain real-world applications.
To find out the SQL query distribution in real-life applications, we consider the following three representative types of database-oriented applications, and conduct manual analysis of user questions. We ask annotators to divide user questions into five categories (see Appendix B for details), i.e., matching, sorting, clustering, calculation, and others.
Information retrieval applications. We use Baidu, the Chinese search engine, as a typical information retrieval application. Nowadays, search engines are still the most important way for web users to acquire answers. Thanks to progress in knowledge graph research, search engines can return structured tables or even direct answers from infobox websites such as Wikipedia and Baidu Encyclopedia. From one day of Baidu search logs, we randomly select 1,000 questions for which one of the top-10 returned relevant websites is an infobox website. Then, we manually classify each question into the above five types.
Customer service robots. Big companies build AI robots to answer customer questions, which usually requires access to industrial databases. We provide a free trial API 1 for developers to create customer service robots. With the permission of the developers, we randomly select 1,500 questions and the corresponding databases from their created robots. These questions cover multiple domains such as banks, airlines, and communication carriers.
Data analysis robots. Every day, innumerable tables are generated, such as financial statements, business orders, etc. To perform data analysis over such data, companies hire professionals to write SQL queries. Obviously, it is extremely useful to build robots that allow financial experts to directly perform data analysis using NL questions. We collect 500 questions from our data analysis robot.

Figure 2 shows the query distributions of the three applications. It is obvious that calculation questions occupy a considerable proportion in all three applications. For customer service robots, users mainly try to search for information, and therefore most questions belong to the matching type. Yet, 8% of questions require calculation SQL queries to be answered. For data analysis robots, calculation questions dominate the distribution, since users try to figure out useful clues behind the data.
To gain more insights, we further divide calculation questions into three subtypes according to the SQL syntax, i.e., row calculation, column calculation, and calculation with a constant. Figure 3 shows some examples: a column calculation such as "What is the population density of Hefei?", and calculations with a constant such as "How old is Jenny?" (SELECT curdate - birthday FROM student WHERE name = 'Jenny') and "How far is Beijing's population from 23 million?".

Corpus Construction
Building a large-scale text-to-SQL dataset with multi-table databases is extremely challenging. First, though there are a large number of independent tables on the Internet, connections among the tables are usually unavailable. Therefore, great effort is needed to create multi-table databases. Second, it is usually difficult to obtain NL questions against a given database. Third, given a question and the corresponding database, we need proficient annotators who understand both the database schema and the SQL syntax to write a SQL query for the question.
Different from previous works, which usually rely on humans to create both NL questions and SQL queries (Yu et al., 2018b), we build our dataset in a human-computer collaboration manner, as illustrated in Figure 4. The key idea is to automatically generate SQL queries paired with pseudo questions given a database. Then the pseudo questions are paraphrased into NL questions by humans. Finally, to guarantee data quality, low-confidence SQL queries and NL questions are detected according to overlap and similarity metrics, and are further checked by humans.

Database Creation
Most mature databases used in industry are not publicly available, so we collect our databases mainly from the Internet. However, databases available on the Internet come in the form of independent tables, which need to be linked with other tables. We create databases in three steps: table acquisition, table merging, and foreign key creation.
We collect websites to crawl tables, ensuring that they cover multiple domains. As the largest Chinese encyclopedia, Baidu Baike contains more than 17 million entries across more than 200 domains. We start with all the entries in Baike as the initial sites, and extend the collection based on the reference sites in each entry page. We keep sites where tables are crawled. The final collection contains entries of Baike, annual report websites 2 , vertical domain websites 3 , and other websites such as community forums 4 . Table 1 shows the data distribution regarding database sources.
To make each domain correspond to a database, we merge tables with the same schema into a new table with a new schema, e.g., tables about Chinese cities with the schema {population, area, ...} are merged into a new table with the schema {termid, name, population, area, ...}, where termid is a randomly generated primary key and name is the name of the city. Meanwhile, we add a type for each column according to the form of its values, where the type is one of text, number, and date.
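A rough sketch of this merging step might look as follows; the exact merging and typing logic used for DuSQL is not spelled out here, so the helper names and the regex-based type inference are our own assumptions:

```python
# Hypothetical sketch of table merging with a generated primary key and
# value-form-based column typing (text / number / date).
import re
import uuid

def infer_type(values):
    """Guess a column type from the surface form of its (string) values."""
    if values and all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
        return "date"
    if values and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in values):
        return "number"
    return "text"

def merge_tables(tables, schema):
    """Merge several row-lists sharing `schema` into one table with a termid key."""
    merged = []
    for rows in tables:
        for row in rows:
            merged.append([uuid.uuid4().hex] + row)  # prepend primary key
    new_schema = ["termid"] + schema
    types = ["text"] + [
        infer_type([row[i + 1] for row in merged]) for i in range(len(schema))
    ]
    return new_schema, types, merged
```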
We create foreign keys between two tables via entity linking, e.g., a table named "Livable cities in 2019" with the schema {city_name, ranker, ...} joins to a table named "China cities" with the schema {term_id, name, area, ...} through the links between entities in "city_name" and "name". According to the foreign keys, all tables are split into separate graphs, each of which consists of several linked tables.
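The entity-linking step can be sketched as follows; the overlap threshold and function names are our assumptions, not the actual implementation:

```python
# Sketch: propose a foreign key between two columns when the value set of one
# column is largely contained in the other (entity linking by exact match).
def link_columns(col_a_values, col_b_values, threshold=0.5):
    a, b = set(col_a_values), set(col_b_values)
    if not a:
        return False
    return len(a & b) / len(a) >= threshold

# "city_name" of "Livable cities in 2019" links to "name" of "China cities".
livable = ["Beijing", "Hefei"]
cities = ["Beijing", "Shanghai", "Hefei"]
print(link_columns(livable, cities))  # True
```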

Automatic Generation of SQL Queries
Given a database, we want to generate as many common SQL queries as possible. Both manually writing SQL queries and quality-checking take a significant amount of time. Obviously, SQL queries can be automatically generated from the grammar. We utilize production rules from the grammar to automatically generate SQL queries, instead of asking annotators to write them. According to the difficulty 5 and semantic correctness of a SQL query, we prune the rule paths in the generation. Then, we sample the generated SQL queries according to the distribution in Figure 2 and carry out the follow-up work based on them.
As illustrated in Figure 5, the SQL query can be represented as a tree using the rule sequence of {SQLs = SQL, SQL = Select Where, Select = SELECT A, Where = WHERE Conditions, ...}, all of which are production rules of the grammar. Guided by the SQL query distributions in real applications, we design production rules to ensure that all common SQL queries can be generated, e.g., the rule of {C = table.column mathop table.column} allows calculations between columns or rows. By exercising every rule of the grammar, we can generate SQL queries covering patterns of different complexity.
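The generation procedure can be illustrated with a toy grammar in the same spirit: nonterminals are expanded by production rules until only terminals remain. The rules below are invented for illustration; DuSQL's real grammar is far larger.

```python
# Toy grammar-driven SQL generation by random rule expansion.
import random

GRAMMAR = {
    "SQL": [["Select"], ["Select", "Where"]],
    "Select": [["SELECT", "A"]],
    "Where": [["WHERE", "Cond"]],
    "A": [["population"], ["area"], ["population / area"]],  # column calculation
    "Cond": [["name = 'Hefei'"], ["population > 10000000"]],
}

def expand(symbol, rng):
    if symbol not in GRAMMAR:  # terminal symbol
        return symbol
    rule = rng.choice(GRAMMAR[symbol])
    return " ".join(expand(s, rng) for s in rule)

print(expand("SQL", random.Random(0)))
```

Exercising every rule (instead of sampling) would enumerate all query patterns the grammar covers, which is how calculation queries are guaranteed to appear.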
We consider two aspects in the automatic SQL generation: the difficulty and the semantic correctness of a SQL query. To control the difficulty of the generated queries, we make some restrictions based on our analysis of real-life questions: first, a SQL query contains only one nested query; second, there are no more than three conditions in a where clause and no more than four answers in a select statement; third, a SQL query has at most one math operation; fourth, most text values are from databases 6. To ensure the semantic correctness of the generated query, we abide by the preconditions of each clause and expression during generation, e.g., the expression {A > SQL} requires that the nested SQL return a number value. The full list of preconditions is shown in Appendix C.
Under these requirements, we generate a large number of candidate SQL queries against 200 databases. Among them, only a tiny proportion of SQL queries are of the calculation type, since only a few columns support calculation operations. We keep all queries of the calculation type, randomly select the same number of sorting and clustering queries, and select three times as many matching queries 7. We make sure that these selected queries are spread across all 200 databases. Then these queries are used as input for the follow-up work.

6 The text values in a SQL query are from the database to reduce the difficulty of SQL prediction. We plan to remove this restriction in the next release of DuSQL.
7 Including combinations of the matching type and other types, e.g., the SQL query {SELECT ... WHERE ... ORDER BY ...} represents the combination of the matching and sorting types.

Semi-automatic Generation of Questions
For each SQL query, we automatically generate a pseudo question to explain it. Then the pseudo questions are shown to annotators, who paraphrase them into NL questions without looking at the databases or SQL queries.
We generate a pseudo question for a SQL query according to its execution order. As shown in Figure 6, the entire pseudo question of the SQL query consists of pseudo descriptions of all clauses according to their execution orders. The pseudo description of a clause consists of pseudo descriptions of all its components. We give a description for each component, e.g., list for SELECT, average for the aggregator of avg. Appendix D shows the descriptions for all components. To ensure that the pseudo question is clear and reflects the meaning of the SQL query, intermediate variables are introduced to express sub-SQL queries, e.g., "v1" in the example of Figure 6 represents the result of the nested query and is used as a value in other expressions.
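The clause-by-clause composition can be sketched like this; the description strings are simplified stand-ins for the real component-to-description mapping in Appendix D:

```python
# Sketch: build a pseudo question by concatenating clause descriptions in the
# SQL query's execution order. DESCRIPTIONS is a tiny illustrative subset.
DESCRIPTIONS = {"SELECT": "list", "avg": "the average of", "WHERE": "where"}

def pseudo_question(sql_clauses):
    """sql_clauses: (keyword, argument) pairs in execution order."""
    parts = [f"{DESCRIPTIONS.get(kw, kw)} {arg}" for kw, arg in sql_clauses]
    return ", ".join(parts)

clauses = [("WHERE", "province = 'Anhui'"), ("SELECT", "population")]
print(pseudo_question(clauses))  # where province = 'Anhui', list population
```

For nested queries, a variable such as "v1" would be emitted for the sub-query and then referenced as a value by later clauses, as in the Figure 6 example.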
We ask two annotators 8 to reformulate pseudo questions into NL questions 9 , and filter two kinds of questions: 1) incomprehensible ones which are semantically unclear; 2) unnatural ones which are not the focus of humans 10 . During the process of paraphrasing, 6.7% of question/SQL pairs are filtered, among which 76.5% are complex queries. Then we ask other annotators to check the correctness of reformulated questions, and find 8% of questions are inaccurate.

Review and Checking
To guarantee data quality, we automatically detect low-quality question/SQL pairs according to the following evaluation metrics.
• Overlap. To ensure the naturalness of our questions, we calculate the overlap between the pseudo question and the corresponding NL question. The question with an overlap higher than 0.6 is considered to be of low quality.
• Similarity. To ensure that the question contains enough information for the SQL query, we train a similarity model based on question/SQL pairs. The question with a similarity score less than 0.8 is considered to be of low quality.
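A simple token-level version of the overlap check could look like this. The exact tokenization and formula are not specified, so this is only an assumed sketch; for Chinese text, tokens would come from a word segmenter rather than whitespace splitting.

```python
# Sketch: flag a paraphrase as too literal when its token overlap with the
# pseudo question exceeds 0.6. Whitespace tokenization is a simplification.
def overlap(pseudo, natural):
    p, n = set(pseudo.split()), set(natural.split())
    return len(p & n) / len(p) if p else 0.0

def too_literal(pseudo, natural, threshold=0.6):
    return overlap(pseudo, natural) > threshold
```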
In the first round, about 18% of question/SQL pairs are of low quality. We ask annotators to check these pairs and correct the erroneous ones. This process iterates through the collaboration of human and computer until the above metrics no longer change; it took two iterations in the construction of DuSQL.

Dataset Statistics
We summarize the statistics of DuSQL and other cross-domain datasets in Table 2, and give some examples in Figure 7. DuSQL contains enough question/SQL pairs for all common types. WikiSQL and TableQA are simple datasets, only containing matching questions. Spider and CSpider (Min et al., 2019) mainly cover matching, sorting, clustering, and their combinations. They have very few questions of the calculation type, all of which only need column calculations. Spider does not focus on questions that require common knowledge and math operations. According to our analysis in Figure 2, the calculation type is very common, accounting for 8% to 65% of questions in different applications. DuSQL, a pragmatic industry-oriented dataset, conforms to the distribution of SQL queries in real applications. Meanwhile, DuSQL is larger, twice the size of other complex datasets. DuSQL contains 200 databases, covering about 70% of entries in Baike and more than 160 domains, e.g., cities, singers, movies, animals, etc. We provide content for each database.

8 They are full-time employees familiar with the SQL language, with extensive experience in annotating QA data.
9 Some values in SQL queries are rewritten as synonyms.
10 E.g., "When province is Sichuan, list the total rank of these cities." for the SQL query {SELECT sum(rank) FROM T2 WHERE province = 'Sichuan'} is considered an unnatural question, as the total rank would not be asked by humans.
All the values of a SQL query can be found in the database, except for numeric values. All table and column names in the database are clear and self-contained. In addition, we provide an English schema for each database, including table names and column headers.

Benchmark Approaches
Most existing text-to-SQL works focus on English datasets. Considering that DuSQL is most similar to Spider, we choose the following three representative publicly available parsers as our benchmark approaches, to understand the performance of existing approaches on our new Chinese dataset.

We also extend the state-of-the-art IRNet model to accommodate the two characteristics of our data, i.e., calculation questions and the need for value prediction.
Seq2Seq+Copying (Zhong et al., 2017) incorporates the database schemas into the model input and uses a copying mechanism in the decoder.
SyntaxSQLNet (Yu et al., 2018a) proposes a SQL syntax tree-based network to generate SQL structures, and uses generation path history and table-aware column attention in the decoder.
IRNet (Guo et al., 2019) designs an intermediate representation called SemQL for encoding higher-level abstraction structures than SQL, and then uses a grammar-based decoder (Yin and Neubig, 2017) to synthesize a SemQL query. At present, IRNet reports the state-of-the-art results on the Spider dataset.
Both SyntaxSQLNet and IRNet utilize a grammar to guide SQL generation and conduct experiments on the Spider dataset. However, neither of their grammars can handle calculation questions. Another major difference between our dataset and Spider is that our evaluation metric (see Section §5) also considers value prediction, since the values in a SQL query come from the corresponding question or database, both of which are available inputs to the model. Please refer to our discussion in Section §3 for details. Due to these characteristics of our dataset, all three models perform poorly on DuSQL. Therefore, we extend the IRNet model to accommodate DuSQL as follows.
Firstly, we extend the grammar of SemQL to accommodate the two characteristics of our dataset, as shown in Figure 8. The production rules in bold are added to parse calculation questions. Other production rules are modified based on the original rules to support value prediction (due to space limitations, we attach the full list of the extended grammar in Appendix F). Then we use all the n-grams of length 1-6 in the question to match database cells or numbers/dates to determine candidate values for the predicted SQL query. The values are used in the same way as the columns and tables in the IRNet model.
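The n-gram value matching can be sketched as follows; the number-detection rule is simplified, and whitespace tokens stand in for the real (Chinese) tokenization:

```python
# Sketch: collect candidate values by matching question n-grams (length 1-6)
# against database cells, or keeping n-grams that look like numbers.
def candidate_values(question_tokens, db_cells, max_n=6):
    cands = set()
    for n in range(1, max_n + 1):
        for i in range(len(question_tokens) - n + 1):
            gram = " ".join(question_tokens[i:i + n])
            if gram in db_cells or gram.replace(".", "", 1).isdigit():
                cands.add(gram)
    return cands

tokens = ["population", "of", "Hefei", "over", "23000000"]
print(candidate_values(tokens, {"Hefei", "Beijing"}))
```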

Experiments
Data Settings Following WikiSQL, we split our dataset into train/dev/test such that databases are non-overlapping among the three subsets. In other words, all question/SQL pairs for the same database are in the same subset. This is also referred to as the cross-domain parsing problem, since some database schemas in dev/test do not appear in train. In the end, the 200 databases are split into 160/17/23, and the 23,797 question/SQL pairs are split into 18,602/2,039/3,156.
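Such a database-level split can be sketched as follows (the ratios and seed here are illustrative; the actual DuSQL split is 160/17/23 databases):

```python
# Sketch: split question/SQL pairs so that each database's pairs fall into
# exactly one of train/dev/test.
import random

def split_by_database(pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """pairs: list of (db_id, question, sql) tuples."""
    dbs = sorted({db for db, _, _ in pairs})
    random.Random(seed).shuffle(dbs)
    n_train = int(len(dbs) * ratios[0])
    n_dev = int(len(dbs) * ratios[1])
    train_dbs = set(dbs[:n_train])
    dev_dbs = set(dbs[n_train:n_train + n_dev])
    train = [p for p in pairs if p[0] in train_dbs]
    dev = [p for p in pairs if p[0] in dev_dbs]
    test = [p for p in pairs if p[0] not in train_dbs | dev_dbs]
    return train, dev, test
```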
Evaluation Metrics Evaluation metrics for the text-to-SQL task include component matching, exact matching, and execution accuracy. Component matching (Yu et al., 2018b) uses F1 score to evaluate the performance of the model on each clause. Exact matching, namely the percentage of questions whose predicted SQL query is equivalent to the gold SQL query, is widely used in text-to-SQL tasks. Execution accuracy, namely the percentage of questions whose predicted SQL query obtains the correct answer, assumes that each SQL query has an answer.
We use exact matching as the main metric, and follow Xu et al. (2017) and Yu et al. (2018b) to handle the "ordering issue". Finally, we report model performance with (w) and without (w/o) value evaluation.
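One common way to implement exact matching while sidestepping the ordering issue is to compare unordered clause components as sets; this is an illustrative sketch, not the official evaluation script:

```python
# Sketch: exact matching that ignores the order of components inside clauses
# (e.g. SELECT a, b vs. SELECT b, a), a common fix for the "ordering issue".
def exact_match(pred, gold):
    """pred/gold: dict mapping clause name -> list of component strings."""
    if pred.keys() != gold.keys():
        return False
    return all(set(pred[k]) == set(gold[k]) for k in gold)

pred = {"select": ["a", "b"], "where": ["x > 1"]}
gold = {"select": ["b", "a"], "where": ["x > 1"]}
print(exact_match(pred, gold))  # True
```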
Main results. Table 4 shows the performance of the benchmark approaches. The performance of Seq2Seq+Copying is the lowest. It uses the copying mechanism to reduce errors posed by out-of-domain words in the databases of the test set, but it predicts lots of invalid SQL queries with grammatical errors, since its decoder does not consider SQL structures at all. SyntaxSQLNet and IRNet outperform Seq2Seq+Copying by utilizing a grammar derived from SQL structures to guide SQL generation. In particular, IRNet utilizes SemQL as an abstract representation of SQL queries. However, neither of the two vanilla models handles calculation questions and value prediction properly. The basic IRNet achieves only 34.2/15.4 accuracy on the test set w/o and w/ value evaluation.
We can see that by simply extending IRNet to parse calculation questions and predict values, the IRNetExt model achieves much higher accuracy (54.3/50.1).
Ablation study. We perform an ablation study to gain more insight into the contribution of our extensions. As shown in Table 4, the accuracy on the test set drops 4.5 points when the added production rules are excluded from the grammar of SemQL. The accuracy on the calculation type becomes 0, and this type composes 20.7% of the questions in the test set. After excluding the prediction of values, the test performance drops significantly for two reasons. First, a large number of questions contain values, accounting for about 75% of the dev set and 70% of the test set. Second, the generation of where clauses can be improved by leveraging the column-cell relationship.

Table 3 shows the performance on different SQL query types. First, the grammar extension is effective: the accuracy of all types is significantly improved. Second, the accuracy of the calculation type is lower than that of other types, as many calculation questions require incorporating common knowledge, e.g., age = dateOfDeath - dateOfBirth. How to represent and incorporate such knowledge into the model is very challenging. Third, models perform poorly on questions requiring common knowledge, as such questions need understanding rather than matching, e.g., matching "the oldest" to "age".

Related Work
Semantic parsing. Semantic parsing aims to map NL utterances into semantic representations, such as logical forms (Liang, 2013), SQL queries (Tang and Mooney, 2001), Python code (Ling et al., 2016), etc. In order to facilitate model training and evaluation, researchers release a variety of datasets. ATIS and GeoQuery are two popular early datasets originally in logical forms, and are converted into SQL queries (Iyer et al., 2017;Popescu et al., 2003). As two recently released datasets, WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018b) have attracted extensive research attention. It is also noteworthy that Min et al. (2019) propose the CSpider dataset by translating English questions of Spider into Chinese.
Data construction methods. As discussed in Section §3, creating a large-scale semantic parsing dataset is extremely challenging. To construct Spider, Yu et al. (2018b) ask annotators to write both questions and SQL queries given a database. Both Iyer et al. (2017) and Herzig and Berant (2019) assume that the database and questions are given and try to reduce the effort of creating semantic representations. Our data construction is most closely related to Overnight (Wang et al., 2015), which proposes to automatically generate logical forms based on a hand-crafted grammar and ask annotators to paraphrase pseudo questions into NL questions. Overnight focuses on logical form (LF) based semantic representation, while our work focuses on the SQL representation. The differences are two-fold. First, the databases of Overnight are much simpler, composed of a set of entity-property-entity triples. Second, the LF operations of Overnight are much simpler, consisting of only matching and aggregation operations, such as count, min, and max. Our dataset is more complex and thus imposes more challenges on data construction.
Text-to-SQL parsing approaches. Seq2Seq models achieve state-of-the-art results on single-database datasets such as ATIS and GeoQuery (Dong and Lapata, 2016). With the release of the WikiSQL dataset, researchers make efforts to handle unseen databases by using the database schema as input. Two mainstream approaches are the Seq2Seq model with a copy mechanism (Sun et al., 2018) and the Seq2Set model (Xu et al., 2017). With BERT representations (Devlin et al., 2019), the execution accuracy exceeds 90% (He et al., 2019). For the more challenging Spider dataset with multi-table databases, Guo et al. (2019) introduce an intermediate representation (SemQL) for SQL queries, and use a grammar-based decoder to generate SemQL, reporting state-of-the-art performance. Bogin et al. (2019) propose to encode the database schema with a graph neural network. Recently, Wang et al. (2019) propose RATSQL, which uses relation-aware self-attention to better encode the question and database schema simultaneously.

Conclusion
We present the first large-scale and pragmatic Chinese dataset for cross-domain text-to-SQL parsing. Based on our analysis of questions from real-world applications, our dataset contains a considerable proportion of questions that require row/column calculations. We extend the state-of-the-art IRNet model on Spider to accommodate DuSQL, and obtain a substantial performance boost. Yet, there is still large room for improvement, especially on calculation questions, which usually require incorporating common-sense knowledge into the model. For future work, we will continually improve the scale and quality of our dataset, to facilitate future research and to meet the needs of database-oriented applications.

Figure 9 shows the production rules used for SQL generation.

B Query Type Definition
Question classification is mostly based on the operations used in the corresponding SQL queries. Matching means the answer can be directly obtained from the database. Sorting means we need to sort the returned results or only return the top-k results. Clustering means we have to perform aggregations (count, min/max, etc.) on each cluster. Calculation means we need to calculate between columns or rows to get the answer. Other usually corresponds to questions requiring reasoning or subjective questions, e.g., "Is Beijing bigger than Shanghai?" and "Is the ticket expensive?". Figure 10 shows some examples for the types in Figure 2, except for the calculation type (shown in Figure 3) and the other type, which does not have corresponding SQL queries.

Matching: List cities with a population less than 10 million.

Sorting: Give the top 5 cities with the largest population.
SELECT name FROM T1 ORDER BY population DESC LIMIT 5

Clustering: Give the total population of each province.
SELECT province, sum(population) FROM T1 GROUP BY province

Figure 10: Examples of types in Figure 2. All of them are based on the database in Figure 1.

C Preconditions in SQL Generation
To ensure the semantic correctness of the generated SQL query, we define the preconditions for each production rule, and abide by these preconditions in the SQL query generation.
• For the generation of a SQL query with multiple SQLs, e.g., {SQLs ::= SQL union SQLs}: the columns in the select clause of the previous SQL must match the columns in the select clause of the subsequent SQL, i.e., the columns of the two select clauses are the same or connected by foreign keys.
• For the rule of generating GroupC: the C is generated from the rule of {C ::=

D Descriptions of SQL Components
We provide a description for each basic component, as follows:
• The descriptions for columns, tables, and values are equal to themselves.
Meanwhile, we provide a description for each production rule, as shown in Figure 12.

Table 5 shows the statistics of our dataset and other cross-domain datasets in the manner of Spider. We provide enough examples for both advanced SQL clauses and the calculation type.

F The extended grammar of SemQL
We extend the grammar used in the IRNet model to accommodate DuSQL, as shown in Figure 11. Figure 8 shows the main changes.