Semantic Parsing with Syntax- and Table-Aware SQL Generation

We present a generative model to map natural language questions into SQL queries. Existing neural network based approaches typically generate a SQL query word-by-word, however, a large portion of the generated results is incorrect or not executable due to the mismatch between question words and table contents. Our approach addresses this problem by considering the structure of table and the syntax of SQL language. The quality of the generated SQL query is significantly improved through (1) learning to replicate content from column names, cells or SQL keywords; and (2) improving the generation of WHERE clause by leveraging the column-cell relation. Experiments are conducted on WikiSQL, a recently released dataset with the largest question- SQL pairs. Our approach significantly improves the state-of-the-art execution accuracy from 69.0% to 74.4%.


Introduction
We focus on semantic parsing that maps natural language utterances to executable programs (Zelle and Mooney, 1996;Wong and Mooney, 2007;Zettlemoyer and Collins, 2007;Kwiatkowski et al., 2011;Pasupat and Liang, 2015;Iyer et al., 2017;Iyyer et al., 2017). In this work, we regard SQL as the program language, which could be executed on a table or a relational database to obtain an outcome. Datasets are the main driver of progress for statistical approaches in semantic parsing (Liang, 2016). Recently, Zhong * Work is done during internship at Microsoft Research Asia. et al. (2017) release WikiSQL, the largest handannotated semantic parsing dataset which is an order of magnitude larger than other datasets in terms of both the number of logical forms and the number of tables. Pointer network (Vinyals et al., 2015) based approach is developed, which generates a SQL query word-by-word through replicating from a word sequence consisting of question words, column names and SQL keywords. However, a large portion of generated results are incorrect or not executable due to the mismatch between question words and column names (or cells). This also reflects the real scenario where users do not always use exactly the same column name or cell content to express the question.
To address the aforementioned problem, we present a generative semantic parser that considers the structure of table and the syntax of SQL language. The approach is partly inspired by the success of structure/grammar driven neural network approaches in semantic parsing (Xiao et al., 2016;. Our approach is based on pointer networks, which encodes the question into continuous vectors, and synthesizes the SQL query with three channels. The model learns when to generate a column name, a cell or a SQL keyword. We further incorporate columncell relation to mitigate the ill-formed outcomes.
We conduct experiments on WikiSQL. Results show that our approach outperforms existing systems, improving state-of-the-art execution accuracy to 74.4% and logical form accuracy to 60.7%. Extensive analysis reveals the advantages and limitations of our approach.

Task Formulation and Dataset
As shown in Figure 1, we focus on sequence-to-SQL generation in this work. Formally, the task takes a question q and a table t consisting of n col-   Figure 1: An brief illustration of the task. The focus of this work is sequence-to-SQL generation.
umn names and n × m cells as the input, and outputs a SQL query y. We do not consider the join operation over multiple relational tables, which we leave in the future work. We use WikiSQL (Zhong et al., 2017), the largest hand-annotated semantic parsing dataset to date which consists of 87,726 questions and SQL queries distributed across 26,375 tables from Wikipedia.

Related Work
Semantic Parsing. Semantic parsing aims to map natural language utterances to programs (e.g., logical forms), which will be executed to obtain the answer (denotation) (Zettlemoyer and Collins, 2005;Liang et al., 2011;Berant et al., 2013;Poon, 2013;Krishnamurthy and Kollar, 2013;Sun et al., 2016;Jia and Liang, 2016;Kočiský et al., 2016;Lin et al., 2017). Existing studies differ from (1) the form of the knowledge base, e.g. facts from Freebase, a table (or relational database), an image (Suhr et al., 2017;Johnson et al., 2017;Hu et al., 2017;Goldman et al., 2017) or a world state (Long et al., 2016); (2) the program language, e.g. first-order logic, lambda calculus, lambda DCS, SQL, parameterized neural programmer (Yin et al., 2015;Neelakantan et al., 2016), or coupled distributed and symbolic executors (Mou et al., 2017); (3) the supervision used for learning the semantic parser, e.g. question-denotation pairs, binary correct/incorrect feedback (Artzi and Zettlemoyer, 2013), or richer supervision of question-logical form pairs (Dong and Lapata, 2016). In this work, we study semantic parsing over tables, which is critical for users to access relational databases with natural language, and could serve users' in-formation need for structured data on the web. We use SQL as the program language, which has a broad acceptance to programmers.

Natural Language Interface for Databases.
Our work relates to the area of accessing database with natural language interface (Dahl et al., 1994;Brad et al., 2017). Popescu et al. (2003) use a parser to parse the question, and then use lexicon matching between question and the table column names/cells. Giordani and Moschitti (2012) parse the question with dependency parser, compose candidate SQL queries with heuristic rules, and use kernel based SVM ranker to rank the results. Li and Jagadish (2014) translate natural language utterances into SQL queries based on dependency parsing results, and interact with users to ensure the correctness of the interpretation process. Yaghmazadeh et al. (2017) build a semantic parser on the top of SEMPRE (Pasupat and Liang, 2015) to get a SQL sketch, which only has the SQL shape and will be subsequently completed based on the table content. Iyer et al. (2017) maps utterances to SQL queries through sequence-tosequence learning. User feedbacks are incorporated to reduce the number of queries to be labeled. Zhong et al. (2017) develop an augmented pointer network, which is further improved with reinforcement learning for SQL sequence prediction. Xu et al. (2017) adopts sequence-toset model to predict WHERE columns, and uses attentional model to predict the slots in where clause.
Different from (Iyer et al., 2017;Zhong et al., 2017), our approach leverages SQL syntax and table structure. Compared to (Popescu et al., 2003;Giordani and Moschitti, 2012;Yaghmazadeh et al., 2017), our approach is end-to-end learn-ing and independent of syntactic parser or manually designed templates. We are aware of existing studies that combine reinforcement learning and maximum likelihood estimation (MLE) (Guu et al., 2017;Mou et al., 2017;. However, the focus of this work is the design of the neural architecture, despite we also implement a RL strategy (refer to §4.4).

Structure/Grammar Guided Neural Decoder
Our approach could be viewed as an extension of the sequence-to-sequence learning (Sutskever et al., 2014;Bahdanau et al., 2015) with a tailored neural decoder driven by the characteristic of the target language (Yin and Neubig, 2017;Rabinovich et al., 2017). Methods with similar intuitions have been developed for language modeling , neural machine translation (Wu et al., 2017) and lambda calculus based semantic parsing (Dong and Lapata, 2016;. The difference is that our model is developed for sequence-to-SQL generation, in which table structure and SQL syntax are considered.

Methodology
We first describe the background on pointer networks, and then present our approach that considers the table structure and the SQL syntax.

Background: Pointer Networks
Pointer networks is originally introduced by (Vinyals et al., 2015), which takes a sequence of elements as the input and outputs a sequence of discrete tokens corresponding to positions in the input sequence. The approach has been successfully applied in reading comprehension (Kadlec et al., 2016) for pointing to the positions of answer span from the document, and also in sequenceto-sequence based machine translation (Gulcehre et al., 2016) and text summarization (Gu et al., 2016) for replicating rare words from the source sequence to the target sequence.
The approach of Zhong et al. (2017) is based on pointer networks. The encoder is a recurrent neural network (RNN) with gated recurrent unit (GRU) (Cho et al., 2014), whose input is the concatenation of question words, words from column names and SQL keywords. The decoder is another GRU based RNN, which works in a sequential way and generates a word at each time step. The generation of a word is actually selectively replicating a word from the input sequence, the probability distribution of which is calculated with an attention mechanism (Bahdanau et al., 2015). The probability of generating the i-th word x i in the input sequence at the t-th time step is calculated as Equation 1, where h dec t is the decoder hidden state at the t-th time step, h enc i is the encoder hidden state of the word x i , W a is the model parameter.
It is worth to note that if a column name consists of multiple words (such as "original artist" in Figure 1), these words are separated in the input sequence. The approach has no guarantee that a multi-word column name could be successively generated, which would affect the executability of the generated SQL query. Table-Aware seMantic Parser Figure 2 illustrates an overview of the proposed model, which is abbreviated as STAMP. Different from Zhong et al. (2017), word is not the basic unit to be generated in STAMP. As is shown, there are three "channels" in STAMP, among which the column channel predicts a column name, the value channel predicts a table cell and the SQL channel predicts a SQL keyword. Accordingly, the probability of generating a target token is formulated in Equation 2, where z t stands for the channel selected by the switching gate, p z (·) is the probability to choose a channel, and p w (·) is similar to Equation 1 which is a probability distribution over the tokens from one of the three channels.

STAMP: Syntax-and
p(y t |y <t , x) = zt p w (y t |z t , y <t , x)p z (z t |y <t , x) (2) One advantage of this architecture is that it inherently addresses the problem of generating partial column name/cell because an entire column name/cell is the basic unit to be generated. Another advantage is that the column-cell relation and question-cell connection can be naturally integrated in the model, which will be described below.
Specifically, our encoder takes a question as the input. Bidirectional RNN with GRU unit is applied to the question, and the concatenation of both ends is used as the initial state of the decoder. Another bidirectional RNN is used to compute the   Figure 2: An illustration of the proposed approach. At each time step, a switching gate selects a channel to predict a column name (maybe composed of multiple words), a cell or a SQL keyword. The words in green below the SQL tokens stand for the results of the switching gate at each time step.
representation of a column name (or a cell), in case that each unit contains multiple words (Dong et al., 2015). Essentially, each channel is an attentional neural network. For cell and SQL channels, the input of the attention module only contains the decoder hidden state and the representation of the token to be calculated as follows, where e sql i stands for the representation of the ith SQL keyword. As suggested by (Zhong et al., 2017), we also concatenate the question representation into the input of the column channel in order to improve the accuracy of the SELECT column. We implement the switching gate with a feed-forward neural network, in which the output is a sof tmax function and the input is the decoder hidden state h dec t .

Improved with Column-Cell Relation
We further improve the STAMP model by considering the column-cell relation, which is important for predicting the WHERE clause. On one hand, the column-cell relation could improve the prediction of SELECT column. We observe that a cell or a part of it typically appears at the question acting as the WHERE value, such as "anna nalick" for "anna christine nalick"). However, a column name might be represented with a totally different utterance, which is a "semantic gap". Supposing the question is "How many schools did player number 3 play at?" and the SQL query is "Select count School Club Team where No. = 3". We could see that the column names "School Club Team" and "No." are different from their corresponding utterances "schools", "number" in natural language question. Thus, table cells could be regarded as the pivot that connects the question and column names (the "linking" component in Figure 2). For instance, taking the question from Figure 2, the word "York" would help to predict the column name as "College" rather than "Player". There might be different possible ways to implement this intuition. We use cell information to enhance the column name representation in this work. The vector of a column name is further concatenated with a question-aware cell vector, which is weighted averaged over the cell vectors belonging to the same column. The probability distribution in the column channel is calculated as Equation 4. We use the number of cell words occurring in the question to measure the importance of a cell, which is further normalized through a sof tmax function to yield the final weight α cell j ∈ [0, 1]. An alternative measurement is to using an additional attention model whose input contains the question vector and the cell vector. We favor to the intuitive and efficient way in this work.
(4) On the other hand, the column-cell relation could improve the prediction of the WHERE value. To yield an executable SQL, the WHERE value should be a cell that belongs to the same WHERE column 1 . Taking Figure 2 as an example, it should be avoided to predict a where clause like "Player = York" because the cell "York" does not belong to the column name "Player". To achieve this, we incorporate an global variable to memorize the last predicted column name. When the switching gate selects the value channel, the cell distribution is only calculated over the cells belonging to the last predicted column name. Furthermore, we incorporate an additional probability distribution over cells based on the aforementioned word co-occurrence between the question and cells, and weighted average two cell distributions, which is calculated as follows.
wherep cell w (j) is the standard probability distribution obtained from the attentional neural network, and λ is a hyper parameter which is tuned on the dev set.

Improved with Policy Gradient
The model described so far could be conventionally learned via cross-entropy loss over question-SQL pairs. However, different SQL queries might be executed to yield the same result, and possible SQL queries of different variations could not be exhaustively covered in the training dataset. Two possible ways to handle this are (1) shuffling the WHERE clause to generate more SQL queries, and (2) using reinforcement learning (RL) which regards the correctness of the executed output as the goodness (reward) of the generated SQL query. We follow Zhong et al. (2017) and adopt a policy gradient based approach. We use a baseline strategy (Zaremba and Sutskever, 2015) to decrease the learning variance. The expected reward (Williams, 1992) for an instance is calculated as E (y g ) = k j=1 logp(y j )R(y j , y g ), where y g is the ground truth SQL query, y j is a generated SQL query, p(y j ) is the probability of y j being generated by our model, and k is the number of sampled SQL queries. R(y j , y g ) is the same reward function defined by Zhong et al. (2017), which is +1 if y j is executed to yield the correct answer; −1 if y j is a valid SQL query and is executed to get an incorrect answer; and −2 if y j is not a valid SQL query. In this way, model parameters could be updated with policy gradient over question-answer pairs.

Training and Inference
As the WikiSQL data contains rich supervision of question-SQL pairs, we use them to train model parameters. The model has two cross-entropy loss functions, as given below. One is for the switching gate classifier (p z ) and another is for the attentional probability distribution of a channel (p w ).
Our parameter setting strictly follows Zhong et al. (2017). We represent each word using word embedding 2 (Pennington et al., 2014) and the mean of the sub-word embeddings of all the n-grams in the word (Hashimoto et al., 2016) 3 . The dimension of the concatenated word embedding is 400. We clamp the embedding values to avoid over-fitting. We set the dimension of encoder and decoder hidden state as 200. During training, we randomize model parameters from a uniform distribution with fan-in and fan-out, set batch size as 64, set the learning rate of SGD as 0.5, and update the model with stochastic gradient decent. Greedy search is used in the inference process. We use the model trained from question-SQL pairs as initialization and use RL strategy to fine-tune the model. SQL queries used for training RL are sampled based on the probability distribution of the model learned from question-SQL pairs. We tune the best model on the dev set and do inference on the test set for only once. This protocol is used in model comparison as well as in ablations.

Experiment
We conduct experiments on the WikiSQL dataset 4 , which includes 61, 297/9, 145/17, 284 examples in the training/dev/test sets. Each instance consists of a question, a table, a SQL query and a result. Following Zhong et al. (2017), we use two evaluation metrics. One metric is logical form accuracy (Acc lf ), which measures the percentage of the generated SQL queries that have exact string match with the ground truth SQL queries. Since different SQL queries might obtain the same result, another metric is execution accuracy (Acc ex ), which measures the percentage of the generated SQL queries that obtain the correct answer.

Model Comparisons
After released, WikiSQL dataset has attracted a lot of attentions from both industry and research communities. Zhong et al. (2017) develop following methods, including (1) Aug.PntNet which is an end-to-end learning pointer network approach; (2) Seq2SQL (no RL), in which two separate classifiers are trained for SELECT aggregator and SELECT column, separately; and (3) Seq2SQL, in which reinforcement learning is further used for model training. Results of the attentional sequence-to-sequence learning baseline (Attentional Seq2Seq) are also reported in (Zhong et al., 2017). Xu et al. (2017)  STAMP+RL is the model that is fine-tuned with the reinforcement learning strategy as described in §4.4. We implement a simplified version of our approach (w/o cell), in which WHERE values come from the question. Thus, this setting differs from Aug.PntNet in the generation of WHERE column. We also study the influence of the relation-cell relation (w/o column-cell relation) through removing the enhanced column vector, which is calculated by weighted averaging cell vectors.
From Table 1, we can see that STAMP performs better than existing systems on WikiSQL. Incorporating RL strategy does not significantly improve the performance. Our simplified model, STAMP (w/o cell), achieves better accuracy than Aug.PntNet, which further reveals the effects of the column channel. Results also demonstrate the effects of incorporating the column-cell relation, removing which leads to about 4% performance drop in terms of Acc ex .

Model Analysis: Fine-Grained Accuracy
We analyze the STAMP model from different perspectives in this part.
Firstly, since SQL queries in WikiSQL consists of SELECT column, SELECT aggregator, and WHERE clause, we report the results with regard to more fine-grained evaluation metrics over these  Table 2: Fine-grained accuracies on the WikiSQL dev and test sets. Accuracy (Acc lf ) is evaluated on SELECT column (Acc sel ) , SELECT aggregator (Acc agg ), and WHERE clause (Acc where ), respectively.
aspects. Results are given in Table 2, in which the numbers of Seq2SQL and SQLNet are reported in (Xu et al., 2017). We can see that the main improvement of STAMP comes from the WHERE clause, which is also the key challenge of the Wik-iSQL dataset. This is consistent with our primary intuition on improving the prediction of WHERE column and WHERE value. The accuracies of STAMP on SELECT column and SELECT aggregator are not as high as SQLNet. The main reason is that these two approaches train the SELECT clause separately while STAMP learns all these components in a unified paradigm.

Model Analysis: Difficulty Analysis
We study the performance of STAMP on different portions of the test set according to the difficulties of examples. We compare between Aug.PntNet (re-implemented by us) and STAMP. In this work, the difficulty of an example is reflected by the number of WHERE columns.  From Table 3, we can see that STAMP outperforms Aug.PntNet in all these groups. The accuracy decreases with the increase of the number of WHERE conditions.

Model Analysis: Executable Analysis
We study the percentage of executable SQL queries in the generated results. As shown in

Model Analysis: Case Study
We give a case study to illustrate the generated results by STAMP, with a comparison to Aug.PntNet. Results are given in Figure 3. In the first example, Aug.PntNet generates incomplete column name ("style"), which is addressed in STAMP through replicating an entire column name. In the second example, the WHERE value ("brazilian jiu-jitsu") does not belong to the generated WHERE column ("Masters") in Aug.PntNet. This problem is avoided in STAMP through incorporating the table content.

Error Analysis
We conduct error analysis on the dev set of Wik-iSQL to show the limitation of the STAMP model and where is the room for making further improvements. We analyze the 2,302 examples which are executed to wrong answers by the STAMP model, and find that 33.6% of them have wrong SELECT   Even though the overall accuracy of the SELECT column prediction is about 90% and we also use cell information to enhance the column representation, this semantic gap is still the main bottleneck. Extracting and incorporating various expressions for a table column (i.e. relation in a relational database) might be a potential way to mitigate this problem. Compared to column prediction, the quality of cell prediction is much better because cell content typically (partially) appears in the question.

Transfers to WikiTableQuestions
WikiTableQuestions (Pasupat and Liang, 2015) is a widely used dataset for semantic parsing. To further test the performance of our approach, we conduct an additional transfer learning experiment. Firstly, we directly apply the STAMP model trained on WikiSQL to WikiTableQuestions, which is an unsupervised learning setting for the WikiTableQuestions dataset. Results show that the test accuracy of STAMP in this setting is 14.5%, which has a big gap between best systems on WikiTableQuestions, where  and  yield 43.3% and 43.7%, respectively. Furthermore, we apply the learnt STAMP model to generate SQL queries on natural language questions from WikiTableQuestions, and regard the generated SQL queries which could be executed to correct answers as additional pseudo question-SQL pairs. In this way, the STAMP model learnt from a combination of WikiSQL and pseudo question-SQL pairs could achieve 21.0% on the test set. We find that this big gap is caused by the difference between the two datasets. Among 8 types of questions in Wik-iTableQuestions, half of them including {"Union", "Intersection", "Reverse", "Arithmetic"} are not covered in the WikiSQL dataset. It is an interesting direction to leverage algorithms developed from two datasets to improve one another.

Discussion
Compared to slot-filling based models that restrict target SQL queries to fixed forms of "selectaggregator-where", our model is less tailored. We believe that it is easy to expand our model to generate nested SQL queries or JOIN clauses, which could also be easily trained with back-propagation if enough training instances of these SQL types are available. For example, we could incorporate a hierarchical "value" channel to handle nest queries. Let us suppose our decoder works horizontally that next generated token is at the right hand of the current token. Inspired by chunk-based decoder for neural machine translation (Ishiwatari et al., 2017), we could increase the depth of the "value" channel to generates tokens of a nested WHERE value along the vertical axis. During inference, an addition gating function might be necessary to determine whether to generate a nested query, followed by the generation of WHERE value. An intuitive way that extends our model to handle JOIN clauses is to add the 4th channel, which predicts a table from a collection of tables. Therefore, the decoder should learn to select one of the four channels at each time step. Accordingly, we need to add "from" as a new SQL keyword in order to generate SQL queries including "from xxxTable". In terms of the syntax of SQL, the grammar we used in this work could be regarded as shallow syntax, such as three channels and columncell relation. We do not use deep syntax, such as the sketch of SQL language utilized in some slotfilling models, because incorporating them would make the model clumpy. Instead, we let the model to learn the sequential and compositional relations of SQL queries automatically from data. Empirical results show that our model learns these patterns well.

Conclusion and Future Work
In this work, we develop STAMP, a Syntax-and Table-Aware seMantic Parser that automatically maps natural language questions to SQL queries, which could be executed on web table or relational dataset to get the answer. STAMP has three channels, and it learns to switch to which channel at each time step. STAMP considers cell information and the relation between cell and column name in the generation process. Experiments are conducted on the WikiSQL dataset. Results show that STAMP achieves the new state-of-the-art performance on WikiSQL. We conduct extensive experiment analysis to show advantages and limitations of our approach, and where is the room for others to make further improvements. SQL language has more complicated queries than the cases included in the WikiSQL dataset, including (1) querying over multiple relational databases, (2) nested SQL query as condition value, (3) more operations such as "group by" and "order by", etc. In this work, the STAMP model is not designed for the first and second cases, but it could be easily adapted to the third case through incorporating additional SQL keywords and of course the learning of which requires dataset of the same type. In the future, we plan to improve the accuracy of the column prediction component. We also plan to build a large-scale dataset that considers more sophisticated SQL queries.