Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation

We present a neural approach called IRNet for complex and cross-domain Text-to-SQL. IRNet aims to address two challenges: 1) the mismatch between intents expressed in natural language (NL) and the implementation details in SQL; 2) the challenge in predicting columns caused by the large number of out-of-domain words. Instead of end-to-end synthesizing a SQL query, IRNet decomposes the synthesis process into three phases. In the first phase, IRNet performs a schema linking over a question and a database schema. Then, IRNet adopts a grammar-based neural model to synthesize a SemQL query which is an intermediate representation that we design to bridge NL and SQL. Finally, IRNet deterministically infers a SQL query from the synthesized SemQL query with domain knowledge. On the challenging Text-to-SQL benchmark Spider, IRNet achieves 46.7% accuracy, obtaining 19.5% absolute improvement over previous state-of-the-art approaches. At the time of writing, IRNet achieves the first position on the Spider leaderboard.


Introduction
Recent years have seen a great deal of renewed interest in Text-to-SQL, i.e., synthesizing a SQL query from a question. Advanced neural approaches synthesize SQL queries in an end-to-end manner and achieve beyond 80% exact matching accuracy on public Text-to-SQL benchmarks (e.g., ATIS, GeoQuery and WikiSQL) (Krishnamurthy et al., 2017;Zhong et al., 2017;Xu et al., 2017;Yaghmazadeh et al., 2017;Yu et al., 2018a;Dong and Lapata, 2018;Wang et al., 2018;Hwang et al., 2019). However, Yu et al. (2018c) presents unsatisfactory performance of state-of-the-art approaches on a newly released, cross-domain Textto-SQL benchmark, Spider. * Contribution during internship in MSRA. NL: Show the names of students who have a grade higher than 5 and have at least 2 friends.
SQL: SELECT T1.name FROM friend AS T1 JOIN highschooler AS T2 ON T1.student_id = T2.id WHERE T2.grade > 5 GROUP BY T1.student_id HAVING count(*) >= 2 Figure 1: An example from Spider benchmark to illustrate the mismatch between the intent expressed in NL and the implementation details in SQL. The column 'student id' to be grouped by in the SQL query is not mentioned in the question.
Spider benchmark brings new challenges that prove to be hard for existing models. Firstly, the SQL queries in Spider contain nested queries and clauses like GROUPBY and HAVING, which are far more complicated than that in another well-studied cross-domain benchmark, Wik-iSQL (Zhong et al., 2017). Considering the example in Figure 1, the column 'student id' to be grouped by in the SQL query is never mentioned in the question. In fact, the GROUPBY clause is introduced in SQL to facilitate the implementation of aggregate functions. Such implementation details, however, are rarely considered by end users and therefore rarely mentioned in questions. This poses a severe challenge for existing end-to-end neural approaches to synthesize SQL queries in the absence of detailed specification. The challenge in essence stems from the fact that SQL is designed for effectively querying relational databases instead of for representing the meaning of NL (Kate, 2008). Hence, there inevitably exists a mismatch between intents expressed in natural language and the implementation details in SQL. We regard this challenge as the mismatch problem.
Secondly, given the cross-domain settings of Spider, there are a large number of out-of-domain (OOD) words. For example, 35% of words in database schemas on development set do not occur in the schemas on training set in Spider. As a comparison, the number in WikiSQL is only 22%. The large number of OOD words poses another steep challenge in predicting columns in SQL queries (Yu et al., 2018b), because the OOD words usually lack of accurate representations in neural models. We regard this challenge as the lexical problem.
In this work, we propose a neural approach, called IRNet, towards tackling the mismatch problem and the lexical problem with intermediate representation and schema linking. Specifically, instead of end-to-end synthesizing a SQL query from a question, IRNet decomposes the synthesis process into three phases. In the first phase, IRNet performs a schema linking over a question and a schema. The goal of the schema linking is to recognize the columns and the tables mentioned in a question, and to assign different types to the columns based on how they are mentioned in the question. Incorporating the schema linking can enhance the representations of question and schema, especially when the OOD words lack of accurate representations in neural models during testing. Then, IRNet adopts a grammar-based neural model to synthesize a SemQL query, which is an intermediate representation (IR) that we design to bridge NL and SQL. Finally, IRNet deterministically infers a SQL query from the synthesized SemQL query with domain knowledge.
On the challenging Spider benchmark (Yu et al., 2018c), IRNet achieves a 46.7% exact matching accuracy, obtaining a 19.5% absolute improvement over previous state-of-the-art approach. At the time of writing, IRNet achieves the first position on the Spider leaderboard. When augmented with BERT (Devlin et al., 2018), IRNet reaches up to 54.7% accuracy. In addition, as we show in the experiments, learning to synthesize SemQL queries rather than SQL queries can substantially benefit other neural approaches for Text- The context-free grammar of SemQL. column ranges over distinct column names in a schema. table ranges over tables in a schema. to-SQL, such as SQLNet (Xu et al., 2017), Type-SQL (Yu et al., 2018a) and SyntaxSQLNet (Yu et al., 2018b). Such results on the one hand demonstrate the effectiveness of SemQL in bridging NL and SQL. On the other hand, it reveals that designing an effective intermediate representation to bridge NL and SQL is a promising direction to being there for complex and cross-domain Textto-SQL.

Approach
In this section, we present IRNet in details. We first describe how to tackle the mismatch problem and the lexical problem with intermediate representation and schema linking. Then we present the neural model to synthesize SemQL queries.

Intermediate Representation
To eliminate the mismatch, we design a domain specific language, called SemQL, which serves as an intermediate representation between NL and SQL. Figure 2 presents the context-free grammar of SemQL and an illustrative SemQL query is shown in Figure 3. We elaborate on the design of SemQL in the following.
Inspired by lambda DCS (Liang, 2013), SemQL is designed to be tree-structured. This structure, on the one hand, can effectively constrain the search space during synthesis. On the other hand, in view of the tree-structure nature of SQL (Yu et al., 2018b;Yin and Neubig, 2018), following the same structure also makes it easier to translate to SQL intuitively.
The mismatch problem is mainly caused by the implementation details in SQL and missing specification in questions as discussed in Section 1. Therefore, it is natural to hide the implementation details in the intermediate representation, which forms the basic idea of SemQL. Considering the example in Figure 3, the GROUPBY, HAVING and FROM clauses in the SQL query are eliminated in the SemQL query, and the conditions in WHERE and HAVING are uniformly expressed in the subtree of Filter in the SemQL query. The implementation details can be deterministically inferred from the SemQL query in the later inference phase with domain knowledge. For example, a column in the GROUPBY clause of a SQL query usually occurs in the SELECT clause or it is the primary key of a table where an aggregate function is applied to one of its columns.
In addition, we strictly require to declare the table that a column belongs to in SemQL. As illustrated in Figure 3, the column 'name' along with its table 'friend' are declared in the SemQL query. The declaration of table helps to differentiate duplicated column names in the schema. We also declare a table for the special column ' * ' because we observe that ' * ' usually aligns with a table mentioned in a question. Considering the example in Figure 3, the column ' * ' in essence aligns with the table 'friend', which is explicitly mentioned in the question. Declaring a table for ' * ' also helps infer the FROM clause in the next inference phase.
When it comes to inferring a SQL query from a SemQL query, we perform the inference based on an assumption that the definition of a database schema is precise and complete. Specifically, if a column is a foreign key of another table, there should be a foreign key constraint declared in the schema. This assumption usually holds as it is the best practice in database design. More than 95% of examples in the training set of the Spider benchmark hold this assumption. The assumption forms the basis of the inference. Take the inference of the FROM clause in a SQL query as an example. We first identify the shortest path that connects all the declared tables in a SemQL query in the schema (A database schema can be formulated as an undirected graph, where vertex are tables and edges are foreign key relations among tables). Joining all the tables in the path eventually builds the FROM clause. Supplementary materials provide detailed procedures of the inference and more examples of SemQL queries.

Schema Linking
The goal of schema linking in IRNet is to recognize the columns and the tables mentioned in a question, and assign different types to the columns based on how they are mentioned in the question. Schema linking is an instantiation of entity linking in the context of Text-to-SQL, where entity is referred to columns, tables and cell values in a database. We use a simple yet effective stringmatch based method to implement the recognition. In the followings, we illustrate how IRNet performs schema linking in details based on the assumption that the cell values in a database are not available.
As a whole, we define three types of entity that may be mentioned in a question, namely, table, column and value, where value stands for a cell value in database. In order to recognize entities, we first enumerate all the n-grams of length 1-6 in a question. Then, we enumerate them in the descending order of length. If an n-gram exactly matches a column name or is a subset of a column name, we recognize this n-gram as column. The recognition of table follows the same way. If an ngram can be recognized as both column and table, we prioritize column. If an n-gram begins with and ends with a single quote, we recognize it as value.
Once an n-gram is recognized, we will remove other n-grams that have overlap with it. To this end, we can recognize all the entities mentioned in a question and obtain a non-overlap n-gram sequence of the question by joining those recognized n-grams and the remaining 1-grams. We refer each n-gram in the sequence as a span and assign each span a type according to its entity. For example, if a span is recognized as column, we will assign it a type COLUMN. Figure 4 depicts Figure 4: An overview of the neural model to synthesize SemQL queries. Basically, IRNet is constituted by an NL encoder, a schema encoder and a decoder. As shown in the figure, the column 'book title' is selected from the schema, while the second column 'year' is selected from the memory.
linking results of a question. For those spans recognized as column, if they exactly match the column names in schema, we assign these columns a type EXACT MATCH, otherwise a type PARTIAL MATCH. To link the cell value with its corresponding column in schema, we first query the value span in ConceptNet (Speer and Havasi, 2012) which is an open, large-scale knowledge graph and search the results returned by ConceptNet over the schema. We only consider the query results in two categories of Con-ceptNet, namely, 'is a type of' and 'related terms', as we observe that the column that a cell value belongs to usually occurs in these two categories. If there exists a result exactly or partially matches a column name in schema, we assign the column a type VALUE EXACT MATCH or VALUE PARTIAL MATCH.

Model
We present the neural model to synthesize SemQL queries, which takes a question, a database schema and the schema linking results as input. Figure 4 depicts the overall architecture of the model via an illustrative example.
To address the lexical problem, we consider the schema linking results when constructing representations for the question and columns in the schema. In addition, we design a memory augmented pointer network for selecting columns during synthesis. When selecting a column, it makes a decision first on whether selecting from memory or not, which sets it apart from the vanilla pointer network (Vinyals et al., 2015). The motivation be-hind memory augmented pointer network is that the vanilla pointer network is prone to selecting same columns according to our observation in experiment. NL Encoder. Let x=[(x 1 , τ 1 ), · · · , (x L , τ L )] denote a non-overlap span sequence of a question, where x i is the i th span and τ i is the type of span x i assigned in schema linking. The NL encoder takes x as input and encodes x into a sequence of hidden states H x . Each word in x i is converted into its embedding vector and its type τ i is also converted into an embedding vector. Then, the NL encoder takes the average of the type and word embeddings as the span embedding e i x . Finally, the NL encoder runs a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) over all the span embeddings. The output hidden states of the forward and backward LSTM are concatenated to construct H x . Schema Encoder. Let s=(c, t) denote a database schema, where c={(c 1 , φ i ), · · · , (c n , φ n )} is the set of distinct columns and their types that we assign in schema linking, and t={t 1 , · · · , t m } is the set of tables. The schema encoder takes s as input and outputs representations for columns E c and tables E t . We take the column representations as an example below. The construction of table representations follows the same way except that we do not assign a type to a table in schema linking.
Concretely, each word in c i is first converted into its embedding vector and its type φ i is also converted into an embedding vector ϕ i . Then, the schema encoder takes the average of word embeddings as the initial representationsê i c for the column. The schema encoder further performs an attention over the span embeddings and obtains a context vector c i c . Finally, the schema encoder takes the sum of the initial embedding, context vector and the type embedding as the column representation e i c . The calculation of the representations for column c i is as follows.
Decoder. The goal of the decoder is to synthesize SemQL queries. Given the tree structure of SemQL, we use a grammar-based decoder (Yin andNeubig, 2017, 2018) which leverages a LSTM to model the generation process of a SemQL query via sequential applications of actions. Formally, the generation process of a SemQL query y can be formalized as follows.
where a i is an action taken at time step i, a <i is the sequence of actions before i, and T is the number of total time steps of the whole action sequence. The decoder interacts with three types of actions to generate a SemQL query, including AP-PLYRULE, SELECTCOLUMN and SELECTTABLE. APPLYRULE(r) applies a production rule r to the current derivation tree of a SemQL query. SE- LECTCOLUMN(c) and SELECTTABLE(t) selects a column c and a table t from schema, respectively. In the followings, we detail the action SELECT-COLUMN and SELECTTABLE. Interested readers can refer to Yin and Neubig (2017) for details of the action APPLYRULE.
We design a memory augmented pointer network to implement the action SELECTCOLUMN. The memory is used to record the selected columns, which is similar to the memory mechanism used in Liang et al. (2017). When the decoder is going to select a column, it first makes a decision on whether to select from the memory or not, and then selects a column from the memory or the schema based on the decision. Once a column is selected, it would be removed from the schema and be recorded in the memory. The probability of selecting a column c is calculated as follows.
, where S represents selecting from schema, MEM represents selecting from memory, v i denotes the context vector that is obtained by performing an attention over H x , E m c denotes the embedding of columns in memory and E s c denotes the embedding of columns that are never selected. w m is trainable parameter.
When it comes to SELECTTABLE, the decoder would select a table t from the schema via a pointer network: As shown in Figure 4, the decoder first predicts a column and then predicts the table that it belongs to. To this end, we can leverage the relations between columns and tables to prune the irrelevant tables.
Coarse-to-fine. We further adopt a coarse-to-fine framework (Solar-Lezama, 2008;Bornholt et al., 2016;Dong and Lapata, 2018), decomposing the decoding process of a SemQL query into two stages. In the first stage, a skeleton decoder outputs a skeleton of the SemQL query. Then, a detail decoder fills in the missing details in the skeleton by selecting columns and tables. Supplementary materials provide a detailed description of the skeleton of a SemQL query and the coarse-to-fine framework.

Experiment
In this section, we evaluate the effectiveness of IRNet by comparing it to the state-of-the-art approaches and ablating several design choices in IRNet to understand their contributions. Baselines. We also evaluate the sequence-tosequence model (Sutskever et al., 2014)   ure to illustrate the structure. To establish baseline, we also augment SyntaxSQLNet with BERT. Note that we only use the base version of BERT due to the resource limitations.

Experiment
We do not perform any data augmentation for fair comparison. All our code are publicly available. 1

Experimental Results
Table 1 presents the exact matching accuracy of IRNet and various baselines on development set and test set. IRNet clearly outperforms all the baselines by a substantial margin. It obtains 27.0% absolute improvement over SyntaxSQLNet on test set. It also obtains 19.5% absolute improvement over SyntaxSQLNet(augment) that performs large-scale data augmentation. When incorporating BERT, the performance of both SyntaxSQL-Net and IRNet is substantially improved and the accuracy gap between them on both development set and test set is widened.
To study the performance of IRNet in detail, following Yu et al. (2018b), we measure the average F1 score on different SQL components on test set.
We compare between SyntaxSQLNet and IRNet. As shown in Figure 5  To investigate the effectiveness of SemQL, we alter the baseline approaches and let them learn to generate SemQL queries rather than SQL queries. As shown in Table 3, there are at least 6.6% and up to 14.4% absolute improvements on accuracy of exact matching on development set. For example, when SyntaxSQLNet is learned to generate SemQL queries instead of SQL queries, it registers 8.6% absolute improvement and even outperforms SyntaxSQLNet(augment) which performs largescale data augmentation. The relatively limited improvement on TypeSQL and SQLNet is because their slot-filling based models only support a subset of SemQL queries. The notable improvement, on the one hand, demonstrates the effectiveness of SemQL. On the other hand, it shows that designing an intermediate representations to bridge NL and SQL is promising in Text-to-SQL.

Ablation Study
We conduct ablation study on IRNet and IR-Net(BERT) to analyze the contribution of each component. Specifically, we first evaluate a base model that does not apply schema linking (SL) and the coarse-to-fine framework (CF), and replace the memory augment pointer network (MEM) with   the vanilla pointer network (Vinyals et al., 2015). Then, we gradually apply each component on the base model. The ablation study is conducted on the development set. Table 4 presents the ablation study results. It is clear that our base model has already significantly outperforms SyntaxSQLNet, SyntaxSQL-Net( augment) and SyntaxSQLNet(BERT). Performing schema linking ('+SL') brings about 8.5% and 6.4% absolute improvement on IRNet and IRNet(BERT). Predicting columns in the WHERE clause is known to be challenging (Yavuz et al., 2018). The F1 score on the WHERE clause increases by 12.5% when IRNet performs schema linking. The significant improvement demonstrates the effectiveness of schema linking in addressing the lexical problem. Using the memory augmented pointer network ('+MEM') further improves the performance of IRNet and IR-Net(BERT). We observe that the vanilla pointer network is prone to select the same columns during synthesis. The number of examples suffering from this problem decreases by 70%, when using the memory augmented pointer network. At last, adopting coarse-to-fine framework ('+CF') can further boost the performance.

Error Analysis
To understand the source of errors, we analyze 483 failed examples of IRNet on development set. We identify three major causes for failures: Column Prediction. We find that 32.3% of failed examples are caused by incorrect column predictions based on cell values. That is, the correct column name is not mentioned in a question, but the cell value that belongs to it is mentioned. As the study points out (Yavuz et al., 2018), the cell values of a database are crucial in order to solve this problem. 15.2% of the failed examples fail to predict correct columns that partially appear in questions or appear in their synonym forms. Such failures may can be further resolved by combining our string-match based method with embeddingmatch based methods (Krishnamurthy et al., 2017) to improve the schema linking in the future. Nested Query. 23.9% of failed examples are caused by the complicated nested queries. Most of these examples are in the Extra Hard level. In the current training set, the number of SQL queries in Extra Hard level (∼20%) is the least, even less than the SQL queries in Easy level (∼23%). In view of the extremely large search space of the complicated SQL queries, data augmentation techniques may be needed. Operator. 12.4% of failed examples make mistake in the operator as it requires common knowledge to predict the correct one. Considering the following example, 'Find the name and membership level of the visitors whose membership level is higher than 4, and sort by their age from old to young', the phrase 'from old to young' indicates that sorting should be conducted in descending order. The operator defined here includes aggregate functions, operators in WHERE clause and the sorting orders (ASC and DESC).
Other failed examples cannot be easily categorized into one of the categories above. A few of them are caused by the incorrect FROM clause, because the ground truth SQL queries join those tables without foreign key relations defined in schema. This violates our assumption that the definition of a database schema is precise and complete.
When incorporated with BERT, 30.5% of failed examples are fixed. Most of them are in the category Column Prediction and Operator, but the improvement on Nested Query is quite limited.

Discussion
Performance Gap. There exists a performance gap on IRNet between the development set and test set, as shown in Table 1. Considering the explosive combination of nested queries in SQL and the limited number of data (1034 in development, 2147 in test), the gap is probably caused by the different distributions of the SQL queries in Hard and Extra level. To verify the hypothesis, we construct a pseudo test set from the official training set. We train IRNet on the remaining data in training set and evaluate them on development set and the pseudo test set, respectively. We find that even though the pseudo set has the same number of complicated SQL queries (Hard and Extra Hard) with the development set, there still exists a performance gap. Note that other approaches do not exhibit the performance gap, because of their relatively poor performance on the complicated SQL queries. For example, SyntaxSQLNet only achieves 4.6% on the SQL queries in Extra Hard level on test set. Supplementary material provides detailed experiments results on the pseudo test set. Limitations of SemQL. There are a few limitations of our intermediate representation. Firstly, it cannot support the self join in the FROM clause of SQL. In order to support the self join, the variable mechanism in lambda calculus (Carpenter, 1997) or the scope mechanism in Discourse Representation Structure (Kamp and Reyle, 1993) may be necessary. Secondly, SemQL has not completely eliminate the mismatch between NL and SQL. Some implementation details in SQL are still preserved in SemQL. For example, the INTERSECT clause is often used to express disjoint conditions. However, when specifying requirements, end uses rarely concern about whether two conditions are disjoint or not. Despite the limitations of SemQL, the experimental results demonstrate its effectiveness in Text-to-SQL. To this end, we argue that designing an effective intermediate representation to bridge NL and SQL is a promising direction to being there for complex and cross-domain Textto-SQL. We leave a better intermediate representation as one of our future works.

Related Work
Natural Language Interface to Database. The task of Natural Language Interface to Database (NLIDB) has received significant attention since 1970s (Warren and Pereira, 1981;Androutsopou-los et al., 1995;Popescu et al., 2004;Hallett, 2006;Giordani and Moschitti, 2012). Most of the early proposed systems are hand-crafted to a specific database (Warren and Pereira, 1982;Woods, 1986;Hendrix et al., 1978), making it challenging to accommodate the cross-domain settings. Later work focus on building a system that can be reused for multiple databases with minimal human effort (Grosz et al., 1987;Androutsopoulos et al., 1993;Tang and Mooney, 2000). Recently, with the development of advanced neural approaches on Semantic Parsing and the release of large-scale, cross-domain Text-to-SQL benchmarks such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018c), there is a renewed interest in the task (Xu et al., 2017;Iyer et al., 2017;Sun et al., 2018;Gur et al., 2018;Yu et al., 2018a,b;Wang et al., 2018;Finegan-Dollak et al., 2018;Hwang et al., 2019). Unlike these neural approaches that end-to-end synthesize a SQL query, IRNet first synthesizes a SemQL query and then infers a SQL query from it.
Intermediate Representations in NLIDB. Early proposed systems like as LUNAR (Woods, 1986) and MASQUE (Androutsopoulos et al., 1993) also propose intermediate representations (IR) to represent the meaning of a question and then translate it into SQL. The predicates in these IRs are designed for a specific database, which sets SemQL apart. SemQL focuses on a wide adoption and no human effort is needed when it is ported to a new domain. Li and Jagadish (2014) propose a query tree in their NLIDB system to represent the meaning of a question and it mainly serves as an interaction medium between users and their system. Entity Linking. The insight behind performing schema linking is partly inspired by the success of incorporating entity linking in knowledge base question answering and semantic parsing (Yih et al., 2016;Krishnamurthy et al., 2017;Yu et al., 2018a;Herzig and Berant, 2018;Kolitsas et al., 2018). In the context of semantic parsing, Krishnamurthy et al. (2017) propose a neural entity linking module for answering compositional questions on semi-structured tables. TypeSQL (Yu et al., 2018a) proposes to utilize type informations to better understand rare entities and numbers in questions. Similar to TypeSQL, IRNet also recognizes the columns and tables mentioned in a question. What sets IRNet apart is that IRNet assigns different types to the columns based on how they are mentioned in the question.

Conclusion
We present a neural approach SemQL for complex and cross-domain Text-to-SQL, aiming to address the lexical problem and the mismatch problem with schema linking and an intermediate representation. Experimental results on the challenging Spider benchmark demonstrate the effectiveness of IRNet.

Inference of SQL Query
To infer a SQL query from a SemQL query, we traverse the tree-structured SemQL query in preorder and map each tree node to corresponding SQL query components according to the production rule applied on it.
The production rule applied on the node Z denotes whether the SQL query has one of the components, e.g., UNION, EXCEPT, INTERSECT or it is just a single SQL query. The node R stands for the start of a single SQL query. The production rule applied on R denotes whether the SQL query has SELECT clause, WHERE clause and ORDERBY clause. The production rule applied on node Select denotes how many columns does the SELECT clause has. Each node A denotes a column/aggregate function pair. Specifically, nodes under A denote the aggregate function, column name and table name of that a column belongs to respectively. Both sub-trees under nodes Superlative and Order are mapped to the ORDERBY clause in the SQL query. The production rules applied on Filter denotes different condition operators in SQL query, e.g. and, or, >, <, =, in, not in and so on. If there is a node A under the node Filter and its aggregate function is not N one, it will be filled in the HAVING clause, otherwise in the WHERE clause. If there is a node R under node Filter, we will repeat the process recursively on node R and return a nested SQL query. The FROM clause is generated from the selected tables in the SemQL query by identifying the shortest path that connects these tables in schema (Database schema can be formulated as an undirected graph, where vertex are tables and edges are relations among tables). At last, if there exists an aggregate function applied on a column in the SemQL query, there should be GROUPBY clause in the SQL query. The column to be grouped by occurs in the SELECT clause in most cases, or it is the primary key of a table where an aggregate function is applied on one of its columns.

Transforming SQL to SemQL
To generate a SemQL query from a SQL query, we first initialize a node Z. If the SQL query has one of the components UNION, EXCEPT, INTERSECT, we attach the corresponding keywords and two node R under Z, otherwise a single node R. Then, we attach a Select node under R, and the number of columns in SELECT clause determines the num-> ≥ Figure 6: The skeleton of the SemQL query presented in Figure 3 ber of nodes A attached under node Select. If an ORDERBY clause in SQL query contains LIMIT keyword, it will be transformed into node Superlative, otherwise Order. Next, the sub-tree of node Filter is determined by the condition in WHERE and HAVING clause. If it has nested query in WHERE clause, we will process the subquery recursively. For each column in SQL query, we will attach its aggregate function node, a node C and a node T under A. Node C attaches the column name and node T attaches its table name. For the special column ' * ', if there is only one table in the FORM clause that is not belongs to any columns, we assign it the column ' * ', otherwise, we will label the table name of ' * ' manually. Besides, if a table in FROM clause is not assigned to any columns, it will be transformed into a sub-tree under node F ilter with in condition. In this way, all the SQL queries can be successfully transformed into SemQL queries.

Coarse-to-Fine Framework
The skeleton of a SemQL query is derived directly from SemQL query by removing all sub-nodes under node A. Figure 6 shows the skeleton of the SemQL query presented in Figure 3. Figure 7 depicts our coarse-to-fine framework to synthesize a SemQL query. In the first stage, a skeleton decoder outputs the skeleton of a SemQL. Then, a detail decoder fills in the missing details in the skeleton by selecting columns and tables. The probability of generating a SemQL y in the coarseto-fine framework is formalized as follows. where q denotes the skeleton. λ i = 1 when the ith action type is SelectColumn, otherwise λ i = 0.
At training time, our model is optimized by maximizing the log-likelihood of the ground true action sequences: max (x,s,q,y)∈D log p(y|x, s, q) + γ log p(q|x, s) where D denotes the training data and γ represents the scale between log p(y|x, s) and log p(s|x). γ is set to 1 in our experiment.   To test our hypothesis that the performance gap is caused by the different distribution of the SQL queries in Hard and Extra Hard level, we first construct a pseudo test set from the official training set of Spider benchmark. Then, we conduct further experiment on the pseudo test set and the official development set. Specifically, we sample 20 databases from the training set to construct a pseudo test set, which has the same hardness distributions with the development set. Then, we train IRNet on the remaining training set, and evaluate it on the development set and the pseudo test set, respectively. We sample the pseudo test set from the training set in three times, named pseudo test A, pseudo test B and pseudo test C respectively. They contain 1134, 1000, and 955 test data respectively. Table 5 presents the hardness distribution of the three pseudo test sets and the official development set. Figure 9 presents the exact matching accuracy of SQL on the development set and the pseudo test set after each epoch during training. IRNet performs competitively on the development set and the pseudo set C (Figure 9c)   ( Figure 9a and Figure 9b). Although the hardness distributions among the development set and the three pseudo sets are nearly the same, the data distribution still has some difference, which results in the performance gap.

BERT
We further study the performance gap of Syn-taxSQLNet on the development set and the pseudo test A. As shown in Table 6, SyntaxSQLNet achieves 16.9% on the development set and 17.4% on the pseudo test A. When incorporating BERT and learning to synthesizing SemQL, SyntaxSQL-Net(BERT,SemQL) achieves 34.5% on development set and 30.2% on the pseudo test A, exhibiting a clear performance gap (4.3%). Syn-taxSQLNet(BERT, SemQL) significantly outperforms SyntaxSQLNet in the Hard and Extra Hard level. The experimental results show that when SyntaxSQLNet performs better in the Hard and Extra Hard level, the performance gap will be larger, since that the performance gap is caused by the different data distributions.

NL:
What is the hometown of the youngest teacher?

NL:
List the names of the customers who have once bought product "food".