Clause-Wise and Recursive Decoding for Complex and Cross-Domain Text-to-SQL Generation

Most deep learning approaches for text-to-SQL generation are limited to the WikiSQL dataset, which only supports very simple queries over a single table. We focus on the Spider dataset, a complex and cross-domain text-to-SQL task that includes complex queries over multiple tables. In this paper, we propose a SQL clause-wise decoding neural architecture with a self-attention-based database schema encoder to address the Spider task. Each clause-specific decoder consists of a set of sub-modules, which are defined by the syntax of the corresponding clause. Additionally, our model works recursively to support nested queries. When evaluated on the Spider dataset, our approach achieves accuracy gains of 4.6% and 9.8% on the test and dev sets, respectively. In addition, we show that our model is significantly more effective at predicting complex and nested queries than previous work.


Introduction
Text-to-SQL generation is the task of translating a natural language question into the corresponding SQL. Recently, various deep learning approaches have been proposed based on the WikiSQL dataset (Zhong et al., 2017). However, because WikiSQL contains only very simple queries over just a single table, these approaches (Xu et al., 2017; Huang et al., 2018; Yu et al., 2018a; Dong and Lapata, 2018) cannot be applied directly to generate complex queries containing elements such as JOIN, GROUP BY, and nested queries.
To overcome this limitation, Yu et al. (2018c) introduced Spider, a new complex and cross-domain text-to-SQL dataset. It contains a large number of complex queries over different databases with multiple tables. It also requires a model to generalize to unseen database schemas, as different databases are used for training and testing. Therefore, a model must understand not only the natural language question but also the schema of the corresponding database to predict the correct SQL query.
In this paper, we propose a novel SQL-specific clause-wise decoding neural network model to address the Spider task. We first predict a sketch for each SQL clause (e.g., SELECT, WHERE) with text classification modules. Then, clause-specific decoders find the columns and corresponding operators based on the sketches. Our contributions are summarized as follows.
• We decompose the SQL decoding process by clause. We also modularize each of the clause-specific decoders into sub-modules based on the syntax of each clause. Our architecture enables the model to learn clause-dependent context and ensures the syntactic correctness of the predicted SQL.
• Our model works recursively so that it can predict nested queries.
• We also introduce a self-attention based database schema encoder that enables our model to generalize to unseen databases.
In the experiment on the Spider dataset, we achieve 24.3% and 28.8% exact SQL matching accuracy on the test and dev sets, respectively, which outperforms the previous state-of-the-art approach (Yu et al., 2018b) by 4.6% and 9.8%. In addition, we show that our approach is significantly more effective than previous work at predicting not only simple SQL queries but also complex and nested queries.

Related Work
Our work is related to grammar-based constrained decoding approaches for semantic parsing (Yin and Neubig, 2017; Rabinovich et al., 2017; Iyer et al., 2018). While those approaches focus on general-purpose code generation, we instead focus on SQL-specific grammar to address the text-to-SQL task. Our task differs from code generation in two aspects. First, it takes a database schema as an input in addition to natural language. To predict SQL correctly, a model must fully understand the relationship between the question and the schema. Second, as SQL is a non-procedural language, predictions of SQL clauses do not need to be made sequentially.
For text-to-SQL generation, several SQL-specific approaches have been proposed (Zhong et al., 2017; Xu et al., 2017; Huang et al., 2018; Yu et al., 2018a; Dong and Lapata, 2018; Yavuz et al., 2018) based on the WikiSQL dataset (Zhong et al., 2017). However, all of them are limited to the specific WikiSQL SQL sketch, which supports only very simple queries: it includes only the SELECT and WHERE clauses, allows only a single expression in the SELECT clause, and works only for a single table. To predict more complex SQL queries, sequence-to-sequence (Iyer et al., 2017; Finegan-Dollak et al., 2018) and template-based (Finegan-Dollak et al., 2018; Lee et al., 2019) approaches have been proposed. However, they focused only on specific databases such as ATIS (Price, 1990) and GeoQuery (Zelle and Mooney, 1996). Because they only consider question and SQL pairs, without requiring an understanding of the database schema, these approaches cannot generalize to unseen databases.
SyntaxSQLNet (Yu et al., 2018b) is the first and current state-of-the-art model for Spider (Yu et al., 2018c), a complex and cross-domain text-to-SQL task. It uses a SQL-specific syntax-tree-based decoder with SQL generation history. Our approach differs from this model in the following aspects. First, taking into account that SQL is a non-procedural language, we develop a clause-specific decoder for each SQL clause, whereas SyntaxSQLNet predicts SQL tokens sequentially. For example, in SyntaxSQLNet, a single column prediction module works in both the SELECT and WHERE clauses, depending on the SQL decoding history. In contrast, we define and train decoding modules separately for each SQL clause to fully utilize clause-dependent context. Second, we apply a sequence-to-sequence architecture to predict columns instead of the sequence-to-set framework of SyntaxSQLNet, because correct ordering is essential for the GROUP BY and ORDER BY clauses. Finally, we introduce a self-attention mechanism (Lin et al., 2017) to efficiently encode database schemas that include multiple tables.

Methodology
We predict complex SQL queries clause by clause, as described in Figure 1. Each clause is predicted consecutively by at most three different types of modules (sketch, column, operator). The same architecture is applied recursively to predict nested queries, taking the intermediate predicted SQL as an additional input.

Question and Schema Encoding
We encode a natural language question with a bi-directional LSTM. We denote $H_Q \in \mathbb{R}^{d \times |X|}$ as the question encoding, where $d$ is the number of LSTM units and $|X|$ is the number of tokens in the question.
To encode a database schema, we represent each column of its tables as the concatenated sequence of words from the table name and the column name, joined by a separation token (e.g., [student, [SEP], first, name]). First, we apply a bi-directional LSTM over this sequence for each column. Then, we apply the self-attention mechanism (Lin et al., 2017) over the LSTM outputs to form a summarized fixed-size vector for each column. For the $i$-th column, its encoding $h^{(i)}_{col} \in \mathbb{R}^d$ is computed as

$$h^{(i)}_{col} = H_L \, \mathrm{softmax}(w^\top H_L)^\top,$$

where $H_L \in \mathbb{R}^{d \times |L|}$ denotes the LSTM outputs, $|L|$ is the number of tokens in the column, and $w \in \mathbb{R}^d$ is a trainable parameter. We denote $H_{col} = [h^{(1)}_{col}, \ldots, h^{(|C|)}_{col}]$ as the columns encoding, where $|C|$ is the number of columns in the database.
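To make this concrete, below is a minimal PyTorch sketch of the self-attention column encoder; the class name, the `d // 2` per-direction LSTM sizing, and the one-column-at-a-time interface are our own assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnEncoder(nn.Module):
    """Encode one column (table-name + [SEP] + column-name tokens) into a
    fixed-size vector via a BiLSTM and self-attention (Lin et al., 2017).
    A sketch; naming and sizing conventions are assumptions."""

    def __init__(self, emb_dim: int, d: int):
        super().__init__()
        # d total units -> d // 2 per direction (assumption)
        self.lstm = nn.LSTM(emb_dim, d // 2, batch_first=True, bidirectional=True)
        self.w = nn.Parameter(torch.randn(d))  # trainable attention vector w

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (|L|, emb_dim) embeddings of one column's token sequence
        H_L, _ = self.lstm(tokens.unsqueeze(0))   # (1, |L|, d) LSTM outputs
        H_L = H_L.squeeze(0)                      # (|L|, d)
        alpha = F.softmax(H_L @ self.w, dim=0)    # (|L|,) self-attention weights
        return alpha @ H_L                        # (d,) summarized column vector
```

Stacking the resulting vectors over all columns yields $H_{col} \in \mathbb{R}^{d \times |C|}$.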

Sketch Prediction
We predict the clause-wise sketch via 8 different text classification modules, which predict the number of SQL expressions in each clause, the presence of a LIMIT clause, and the presence of INTERSECT/UNION/EXCEPT, as described in Figure 1. All of them share the same model architecture but are trained separately. For the classification, we apply an attention-based bi-directional LSTM following Zhou et al. (2016).
First, we compute a sentence representation $r_s \in \mathbb{R}^d$ as a weighted sum of the question encoding $H_Q \in \mathbb{R}^{d \times |X|}$:

$$r_s = H_Q \, \mathrm{softmax}(w_s^\top H_Q)^\top.$$

Then we apply a softmax classifier to choose the sketch:

$$P_{sketch} = \mathrm{softmax}(W_s r_s + b_s),$$

where $w_s \in \mathbb{R}^d$, $W_s \in \mathbb{R}^{n_s \times d}$, and $b_s \in \mathbb{R}^{n_s}$ are trainable parameters and $n_s$ is the number of possible sketches.
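For illustration, one such classification head could be sketched in PyTorch as follows; it follows the $w_s$, $W_s$, $b_s$ notation above, while the module structure itself is our assumption, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchClassifier(nn.Module):
    """One of the 8 sketch prediction heads: attention-pooled question
    encoding followed by a softmax classifier (a sketch)."""

    def __init__(self, d: int, n_s: int):
        super().__init__()
        self.w_s = nn.Parameter(torch.randn(d))  # attention vector w_s
        self.out = nn.Linear(d, n_s)             # W_s and b_s

    def forward(self, H_Q: torch.Tensor) -> torch.Tensor:
        # H_Q: (|X|, d) question encoding from the BiLSTM
        alpha = F.softmax(H_Q @ self.w_s, dim=0)  # (|X|,) attention weights
        r_s = alpha @ H_Q                         # (d,) sentence representation
        return F.softmax(self.out(r_s), dim=-1)   # distribution over n_s sketches
```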

Columns and Operators Prediction
To predict columns and operators, we use an LSTM decoder with the attention mechanism (Luong et al., 2015), where the number of decoding steps is decided by the sketch prediction modules. We train 5 different column prediction modules, one for each SQL clause; they are trained separately but share the same architecture.
In the column prediction module, the hidden state of the decoder at the $t$-th decoding step is computed as

$$d^{(t)}_{col} = \mathrm{LSTM}\big(h^{(t-1)}_{col},\, d^{(t-1)}_{col}\big),$$

where $h^{(t-1)}_{col} \in \mathbb{R}^d$ is the encoding of the column predicted at the previous decoding step. The context vector $r^{(t)} \in \mathbb{R}^d$ is computed as a weighted sum of the question encoding $H_Q \in \mathbb{R}^{d \times |X|}$ based on the attention weights:

$$r^{(t)} = H_Q \, \mathrm{softmax}\big(H_Q^\top d^{(t)}_{col}\big).$$

Then, the attentional output of the $t$-th decoding step, $a^{(t)}_{col} \in \mathbb{R}^d$, is computed as a linear combination of $d^{(t)}_{col}$ and $r^{(t)}$ followed by a tanh activation:

$$a^{(t)}_{col} = \tanh\big(W_1 d^{(t)}_{col} + W_2 r^{(t)}\big),$$

where $W_1, W_2 \in \mathbb{R}^{d \times d}$ are trainable parameters. Finally, the probability of each column at the $t$-th decoding step is computed as the dot product between $a^{(t)}_{col}$ and the encoding of each column in $H_{col} \in \mathbb{R}^{d \times |C|}$, followed by a softmax:

$$P^{(t)}_{col} = \mathrm{softmax}\big(a^{(t)\top}_{col} H_{col}\big).$$
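The following PyTorch sketch wires these equations into one decoding step; the `LSTMCell`-based state handling and the sequence-major shapes (transposed relative to the paper's $d \times |X|$ layout) are our simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColumnDecoderStep(nn.Module):
    """One step of a clause-specific column decoder with Luong-style
    attention over the question. A sketch under assumed shapes:
    H_Q is (|X|, d) and H_col is (|C|, d)."""

    def __init__(self, d: int):
        super().__init__()
        self.cell = nn.LSTMCell(d, d)            # decoder LSTM
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)

    def forward(self, prev_col, state, H_Q, H_col):
        # prev_col: (d,) encoding of the column predicted at step t-1
        # state: ((1, d), (1, d)) previous LSTM hidden and cell states
        h, c = self.cell(prev_col.unsqueeze(0), state)  # d_col^(t)
        d_t = h.squeeze(0)                              # (d,)
        attn = F.softmax(H_Q @ d_t, dim=0)              # attention over tokens
        r_t = attn @ H_Q                                # (d,) context vector
        a_t = torch.tanh(self.W1(d_t) + self.W2(r_t))   # attentional output
        p_col = F.softmax(H_col @ a_t, dim=0)           # (|C|,) column scores
        return p_col, (h, c)
```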
To predict the corresponding operator for each predicted column, we use a decoder with the same architecture as the column prediction module. The only difference is that the decoder input at the $t$-th decoding step is the encoding of the $t$-th column predicted by the column prediction module.
The attentional output $a^{(t)}_{op} \in \mathbb{R}^d$ is computed identically to $a^{(t)}_{col}$ above. Then, the probability of each operator for the $t$-th predicted column is computed by a softmax classifier:

$$P^{(t)}_{op} = \mathrm{softmax}\big(W_o a^{(t)}_{op} + b_o\big),$$

where $W_o \in \mathbb{R}^{n_o \times d}$ and $b_o \in \mathbb{R}^{n_o}$ are trainable parameters and $n_o$ is the number of possible operators.
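On top of the attentional output, the operator prediction reduces to a single linear layer plus softmax, e.g. (a sketch; the head name is ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OperatorHead(nn.Module):
    """Softmax classifier over n_o operators, applied to a_op^(t)."""

    def __init__(self, d: int, n_o: int):
        super().__init__()
        self.out = nn.Linear(d, n_o)  # W_o and b_o

    def forward(self, a_op: torch.Tensor) -> torch.Tensor:
        # a_op: (d,) attentional output for the t-th predicted column
        return F.softmax(self.out(a_op), dim=-1)
```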

FROM Clause Prediction
After predicting all the other clauses, we use a heuristic to generate the FROM clause: we collect all the columns that appear in the predicted SQL and then JOIN the tables that contain these columns.
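A plain-Python sketch of this heuristic is shown below; the `column_to_table` mapping is a hypothetical input, and we omit join predicates (e.g., foreign-key conditions), which the paper does not specify.

```python
def build_from_clause(predicted_columns, column_to_table):
    """Generate a FROM clause by joining every table that owns one of the
    predicted columns. A sketch: join predicates are omitted."""
    tables = []
    for col in predicted_columns:
        table = column_to_table[col]     # table that owns this column
        if table not in tables:          # keep first-seen order, no duplicates
            tables.append(table)
    clause = "FROM " + tables[0]
    for t in tables[1:]:
        clause += " JOIN " + t           # ON condition omitted in this sketch
    return clause
```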

Recursion for Nested Queries
To predict the presence of a sub-query, we train another module that has the same architecture as the operator prediction module. Instead of predicting corresponding operators for each column, it predicts whether each column is compared to a variable (e.g., WHERE age > 3) or to a sub-query (e.g., WHERE age > (SELECT avg(age) ..)).
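The recursive control flow can be sketched as follows; `predict_clauses` and `needs_subquery` are hypothetical stand-ins for the clause-wise decoders and the sub-query presence module described above.

```python
def decode_sql(question_enc, schema_enc, predict_clauses, needs_subquery,
               partial_sql=""):
    """Recursive decoding driver (a sketch). The same model is re-applied to
    generate a sub-query, taking the SQL predicted so far as extra input."""
    sql = predict_clauses(question_enc, schema_enc, partial_sql)
    for cond in sql.where_conditions:
        if needs_subquery(cond):  # is this column compared to a sub-query?
            cond.value = decode_sql(question_enc, schema_enc,
                                    predict_clauses, needs_subquery,
                                    partial_sql=str(sql))
    return sql
```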

Experimental Setup
We evaluate our model on Spider (Yu et al., 2018c), a large-scale, complex, and cross-domain text-to-SQL dataset. We follow the same database split as Yu et al. (2018c), which ensures that no database schema that appears in the training set appears in the dev or test set. Through this split, we examine how well our model generalizes to unseen databases. Because the test set is not open to the public, we use the dev set for the ablation analysis. For the evaluation metrics, we use 1) the accuracy of exact SQL matching and 2) the F1 score of SQL component matching, both proposed by Yu et al. (2018c). We also follow their query hardness criteria to understand model performance on queries of different difficulty levels. Our model and all the baseline models are trained on the Spider dataset alone, without data augmentation.

Model Configuration
We use the same hyperparameters for every module. For the word embeddings, we apply deep contextualized word representations (ELMo) from Peters et al. (2018) and allow them to be fine-tuned during training. For the question and column encoders, we use a 1-layer 512-unit bi-directional LSTM. For the decoders in the columns and operators prediction modules, we use a 1-layer 1024-unit uni-directional LSTM. For training, we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 and early stopping with 50 epochs. Additionally, we use dropout (Hinton et al., 2012) with a rate of 0.2 for regularization.

Results and Analysis
Table 1 shows the exact SQL matching accuracy of our model and previous models. We achieve 24.3% and 28.8% on the test and dev sets, respectively, outperforming the previous best model, SyntaxSQLNet (Yu et al., 2018b), by 4.6% and 9.8%. Moreover, our model outperforms previous models at every query hardness level.
To examine how each technique contributes to the performance, we conduct an ablation analysis of three aspects: 1) without recursion, 2) without self-attention for database schema encoding, and 3) without the sketch prediction modules that decide the number of decoding steps. Without recursive sub-query generation, the accuracy drops by 5.7% and 3.6% for hard and extra hard queries, respectively. This result shows that the recursion enables the model to predict nested queries. When using the final LSTM hidden state, as in Yu et al. (2018b), instead of self-attention for schema encoding, the accuracy drops by 4.0% over all queries. Finally, when using only an encoder-decoder architecture without sketch prediction for column prediction, the accuracy drops by 4.7%.
For component matching on each SQL clause, our model outperforms previous approaches on all SQL components by a significant margin, as shown in Table 2. Examples of predicted SQL from different models are shown in Appendix A.

Conclusion
In this paper, we propose a recursive, SQL clause-wise decoding neural architecture to address the complex and cross-domain text-to-SQL task. We evaluate our model on the Spider dataset, and the experimental results show that our model significantly outperforms previous work at generating not only simple queries but also complex and nested queries.

Table 3: Sample SQL predictions by our model and previous state-of-the-art models on the dev split. NL denotes the natural language question and Truth denotes the corresponding ground truth SQL query. Ours, Syntax, and SQLNet denote the SQL predictions from our model, SyntaxSQLNet (Yu et al., 2018b), and the modified SQLNet (Xu et al., 2017) of Yu et al. (2018c), respectively.