Question Generation from SQL Queries Improves Neural Semantic Parsing

In this paper, we study how to learn a semantic parser of state-of-the-art accuracy with less supervised training data. We conduct our study on WikiSQL, the largest hand-annotated semantic parsing dataset to date. First, we demonstrate that question generation is an effective method that empowers us to learn a state-of-the-art neural network based semantic parser with thirty percent of the supervised training data. Second, we show that applying question generation to the full supervised training data further improves the state-of-the-art model. In addition, we observe that there is a logarithmic relationship between the accuracy of a semantic parser and the amount of training data.


Introduction
Semantic parsing aims to map a natural language utterance to an executable program (logical form) (Zelle and Mooney, 1996; Wong and Mooney, 2007; Zettlemoyer and Collins, 2007). Recently, neural network based approaches (Dong and Lapata, 2016; Jia and Liang, 2016; Xiao et al., 2016; Guu et al., 2017; Dong et al., 2018) have achieved promising performance in semantic parsing. However, neural network approaches are data hungry: their performance closely correlates with the volume of training data. In this work, we study the influence of training data on the accuracy of neural semantic parsing, and how to train a state-of-the-art model with less training data.
We conduct the study on WikiSQL (Zhong et al., 2017), the largest hand-annotated semantic parsing dataset which is larger than other datasets in terms of both the number of logical forms and the number of schemata. The task is to map a natural language question to a SQL query. We use a state-of-the-art end-to-end semantic parser based on neural networks (detailed in Section 3), and vary the number of supervised training instances. Results show that there is a logarithmic relationship between accuracy and the amount of training data, which is consistent with the observations in computer vision tasks (Sun et al., 2017).
We further study how to achieve state-of-the-art parsing accuracy with less supervised data, since annotating a large-scale semantic parsing dataset requires funds and domain expertise. We achieve this through question generation, which generates natural language questions from SQL queries. Our question generation model is based on sequence-to-sequence learning. Latent variables (Cao and Clark, 2017) are introduced to increase the diversity of generated questions. The artificially generated question-SQL pairs can be viewed as pseudo-labeled data, which can be combined with a small amount of human-labeled data to train the semantic parser.
Results on WikiSQL show that the state-of-the-art logical form accuracy drops from 60.7% to 53.7% with only thirty percent of the training data, while it increases to 61.0% when we add the pseudo-labeled data generated by the question generation model. Applying the question generation model to the full training data brings further improvements, with a 3.0% absolute gain. We further conduct a transfer learning experiment that applies our approach trained on WikiSQL to WikiTableQuestions (Pasupat and Liang, 2015). Results show that incorporating generated instances improves the state-of-the-art neural semantic parser.

Overview of the Approach
Our task aims to map a question to a SQL query, which is executable over a table to yield the answer. Formally, the task takes a question q and a table t consisting of n column names and n × m cells as the input, and outputs a SQL query y. In this section, we describe an overview of our approach, which is composed of several components.

Model Training
Figure 1: An overview of our approach that improves semantic parsing with question generation.
Figure 1 gives an overview of our approach. First, given a table, a SQL query sampler is used to sample valid, realistic, and representative SQL queries. Second, a question generation component takes SQL queries as inputs to obtain natural language questions. Here, the question generation model is learnt from small-scale supervised training data that consists of SQL-question pairs. Lastly, the generated question-SQL pairs are viewed as pseudo-labeled data, which are combined with the supervised training data to train the semantic parser.
Since we conduct the experiments on the WikiSQL dataset, we follow Zhong et al. (2017) and use the same template-based SQL sampler, as summarized in Table 1. The details of the semantic parser and the question generation model are introduced in Sections 3 and 4, respectively.
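The sampling rules summarized in Table 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the sampler of Zhong et al. (2017): the function names, the table representation, and the choice to draw a random number only for the > and < operators are all hypothetical, and the first filter rule (dropping a condition that does not change the result) is omitted.

```python
import random

def matches(value, op, cond):
    """Check a single WHERE condition against one cell value."""
    return {"=": value == cond, ">": value > cond, "<": value < cond}[op]

def sample_sql(columns, types, rows, rng=random):
    """Sample (agg_op, agg_col, cond_col, cond_op, cond) from one table.

    columns: column names; types: parallel list of "numeric" or "text";
    rows: list of rows, each a list of cell values aligned with columns.
    """
    agg_idx = rng.randrange(len(columns))
    cond_idx = rng.randrange(len(columns))
    agg_ops = ["", "COUNT"]
    if types[agg_idx] == "numeric":
        agg_ops += ["MAX", "MIN"]          # extra operators for numeric columns
    cond_ops = ["="]
    if types[cond_idx] == "numeric":
        cond_ops += [">", "<"]
    cond_op = rng.choice(cond_ops)
    values = [row[cond_idx] for row in rows]
    if types[cond_idx] == "numeric" and cond_op != "=":
        # Assumption: numeric thresholds are drawn between the column min and max.
        cond = rng.uniform(min(values), max(values))
    else:
        cond = rng.choice(values)          # any cell value under cond_col
    return rng.choice(agg_ops), columns[agg_idx], columns[cond_idx], cond_op, cond

def sample_valid_sql(columns, types, rows, rng=random, max_tries=100):
    """Keep only queries whose condition selects at least one row."""
    for _ in range(max_tries):
        query = sample_sql(columns, types, rows, rng)
        _, _, cond_col, cond_op, cond = query
        idx = columns.index(cond_col)
        if any(matches(row[idx], cond_op, cond) for row in rows):
            return query
    return None
```

Passing a seeded `random.Random` instance as `rng` makes the sampled queries reproducible.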

Semantic Parsing Model
We use a state-of-the-art end-to-end semantic parser (Sun et al., 2018) that takes a natural language question as the input and outputs a SQL query, which is executed on a table to obtain the answer. To make the paper self-contained, we briefly describe the approach in this section.
Table 1: The template-based SQL sampler (Zhong et al., 2017).
agg col, cond col: The aggregation column agg col and the condition column cond col can be one of the columns in the table.
agg op: The aggregation operator agg op can be empty or COUNT. If the type of agg col is numeric, agg op can additionally be one of MAX and MIN.
cond op: The condition operator cond op is =. If the type of cond col is numeric, cond op can additionally be one of > and <.
cond: The condition value cond can be any cell value under the cond col. If the type of cond col is numeric, cond can be a numerical value sampled from the minimum value to the maximum value in the cond col.
Filter Rules: 1. The condition will be removed if removing it does not change the execution result. 2. We only save the sampled SQL queries that produce a non-empty result set.
The semantic parser is abbreviated as STAMP, which is short for Syntax- and Table-Aware seMantic Parser. Based on the encoder-decoder framework, STAMP takes a question as the input and generates a SQL query. It extends pointer networks (Zhong et al., 2017; Vinyals et al., 2015) by incorporating three "channels" in the decoder, in which the column channel predicts column names, the value channel predicts table cells and the SQL channel predicts SQL keywords. An additional switching gate selects which channel is used for generation. In STAMP, the probability of generating a token is calculated as in Equation 1, where $p_z(\cdot)$ is the probability of choosing the channel $z_t$, and $p_w(\cdot)$ is the probability distribution of generating a word $y_t$ from the selected channel.
$$p(y_t \mid y_{<t}, x) = \sum_{z_t} p_w(y_t \mid z_t, y_{<t}, x)\, p_z(z_t \mid y_{<t}, x) \quad (1)$$
Specifically, the encoder takes a question as the input, uses a bidirectional RNN with GRU cells to compute the hidden states, and feeds the concatenation of both ends as the initial state of the decoder. The decoder has another GRU to calculate the hidden states.
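To make the channel mixture in Equation 1 concrete, the following minimal sketch marginalizes over the three channels for a single decoding step. The probabilities are made-up illustrative numbers and the dictionary layout is an assumption, not the STAMP implementation.

```python
def token_probability(token, channel_probs, word_probs):
    """p(y_t) = sum over channels z of p_w(y_t | z) * p_z(z)."""
    return sum(
        p_z * word_probs[channel].get(token, 0.0)
        for channel, p_z in channel_probs.items()
    )

# Illustrative (hypothetical) distributions for one decoding step.
channel_probs = {"sql": 0.5, "column": 0.3, "value": 0.2}
word_probs = {
    "sql": {"SELECT": 0.6, "WHERE": 0.4},
    "column": {"aggregate": 0.7, "2nd leg": 0.3},
    "value": {"7-2": 1.0},
}
p_aggregate = token_probability("aggregate", channel_probs, word_probs)
```

Because each channel distribution sums to one and the gate probabilities sum to one, the mixture is itself a valid distribution over all tokens.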
Each channel is implemented with an attentional neural network. In the SQL channel, the input of the attention module includes the decoder hidden state and the embedding of the SQL keyword to be calculated (i.e. $e^{sql}_i$).
In the column channel, the vector of a column name includes two parts, as given in Equation 3. The first vector ($h^{col}_i$) is calculated with a bidirectional GRU because a column name might contain multiple words. The second vector is a question-aware cell vector, which is a weighted average of the cell vectors belonging to the column. Cell vectors ($h^{cell}_i$) are also obtained by a bidirectional GRU. The importance of a cell is measured by the number of question words that co-occur in it, which is further normalized through a softmax function to yield the final weight $\alpha^{cell}$.
In the value channel, the model has two distributions and takes their weighted average as in Equation 4. Similar to $p^{sql}(\cdot)$, a standard cell distribution $p^{cell}_w(\cdot)$ is calculated over the cells belonging to the last predicted column name. The model incorporates an additional probability distribution $\alpha^{cell}(\cdot)$ based on the aforementioned word co-occurrence. The hyperparameter $\lambda$ is tuned on the dev set.
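As a rough illustration of the co-occurrence weighting and the weighted average in the value channel, consider the sketch below. Whitespace tokenization and the convex combination with weight `lam` are assumptions made for the example; the exact form used by STAMP is given in Sun et al. (2018).

```python
import math

def cooccurrence_weights(question, cells):
    """Softmax over the number of question words that also appear in each cell."""
    q_words = set(question.lower().split())
    counts = [len(q_words & set(cell.lower().split())) for cell in cells]
    exps = [math.exp(c) for c in counts]
    total = sum(exps)
    return [e / total for e in exps]

def value_distribution(p_cell, alpha_cell, lam):
    """Assumed mixture: lam * p_cell + (1 - lam) * alpha_cell."""
    return [lam * p + (1.0 - lam) * a for p, a in zip(p_cell, alpha_cell)]

question = "which club has an aggregate of 7-2"
cells = ["7-2", "3-1", "0-0"]
alpha = cooccurrence_weights(question, cells)
p_val = value_distribution([0.5, 0.3, 0.2], alpha, lam=0.5)
```

Here the cell "7-2" shares one word with the question, so it receives the largest co-occurrence weight.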
Please see more details on model training and inference in Sun et al. (2018).

Question Generation Model
In this section, we present our SQL-to-question generation approach, which takes a SQL query as the input and outputs a natural language question. Our approach is based on sequence-to-sequence learning (Sutskever et al., 2014;Bahdanau et al., 2015). In order to replicate rare words from SQL queries, we adopt the copying mechanism. In addition, we incorporate latent variables to increase the diversity of generated questions.

Encoder-Decoder
Encoder: A bidirectional RNN with gated recurrent units (GRU) (Cho et al., 2014) is used as the encoder to read a SQL query $x = (x_1, ..., x_T)$. The forward RNN reads the SQL query in a left-to-right direction, obtaining hidden states $(\overrightarrow{h}_1, ..., \overrightarrow{h}_T)$. The backward RNN reads in reverse and outputs $(\overleftarrow{h}_1, ..., \overleftarrow{h}_T)$. We then combine the two directions to get the final representation $(h_1, ..., h_T)$ for each word in the query, and the representation of the source sentence, $h_x$, is used as the initial hidden state of the decoder.
Decoder: We use a GRU with an attention mechanism as the decoder. At each time-step $t$, the attention mechanism obtains the context vector $c_t$, which is computed in the same way as multiplicative attention (Luong et al., 2015). Afterwards, the concatenation of the context vector, the embedding of the previously predicted word $y_{t-1}$, and the last hidden state $s_{t-1}$ is fed to the next step.
After obtaining the hidden states $s_t$, we adopt the copying mechanism that predicts a word from the target vocabulary or from the source sentence (detailed in Subsection 4.2).
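The context vector used by the decoder can be sketched with multiplicative attention in the style of Luong et al. (2015). The shapes and the parameter matrix `W` below are illustrative assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def multiplicative_attention(s, H, W):
    """Return (context, weights) for one decoder step.

    s: decoder state, shape (d_s,)
    H: encoder states, shape (T, d_h)
    W: bilinear parameter, shape (d_s, d_h)
    """
    scores = H @ (W.T @ s)                    # score_i = s^T W h_i, shape (T,)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    context = weights @ H                     # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))
s = rng.standard_normal(3)
W = rng.standard_normal((3, 4))
context, weights = multiplicative_attention(s, H, W)
```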

Incorporating Copying Mechanism
In our task, the generated question utterances typically include informative yet low-frequency words such as named entities or numbers. Usually, these words are not included in the target vocabulary but come from SQL queries. To address this, we follow CopyNet (Gu et al., 2016) and incorporate a copying mechanism to select whether to generate from the vocabulary or copy from SQL queries. The probability distribution of generating the $t$-th word is calculated as in Equation 6, where $\psi_g(\cdot)$ and $\psi_c(\cdot)$ are scoring functions for generating from the vocabulary $\nu$ and copying from the source sentence $x$, respectively.
The two scoring functions are calculated as follows, where $W_g$ and $W_c$ are model parameters, $v_i$ is the one-hot indicator vector for $y_i$ and $h_i$ is the hidden state of word $y_i$ in the source sentence.
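A CopyNet-style combination of the two scoring functions can be sketched as a single softmax over generation scores and copy scores, with the probability of a word summed over both modes. The score values below are placeholders; how the scores themselves are computed is not modeled here.

```python
import numpy as np

def copy_distribution(gen_scores, copy_scores, vocab, source_tokens):
    """Joint softmax over generate scores (per vocab word) and copy scores (per source position)."""
    all_scores = np.concatenate([gen_scores, copy_scores])
    exp = np.exp(all_scores - all_scores.max())
    exp /= exp.sum()
    probs = {}
    # Generation mode: one entry per vocabulary word.
    for word, p in zip(vocab, exp[: len(vocab)]):
        probs[word] = probs.get(word, 0.0) + float(p)
    # Copy mode: one entry per source position; repeated words accumulate.
    for word, p in zip(source_tokens, exp[len(vocab):]):
        probs[word] = probs.get(word, 0.0) + float(p)
    return probs

vocab = ["what", "is", "the"]
source_tokens = ["SELECT", "7-2"]
probs = copy_distribution(np.zeros(3), np.zeros(2), vocab, source_tokens)
```

With this formulation, an out-of-vocabulary token such as "7-2" can still receive probability mass through the copy mode.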

Incorporating Latent Variable
Increasing the diversity of generated questions is very important to improve the accuracy, generalization, and stability of the semantic parser, since this increases the amount of training data and produces more diverse questions for the same intent. In this work, we incorporate stochastic latent variables (Cao and Clark, 2017; Serban et al., 2017) into the sequence-to-sequence model in order to increase question diversity. Specifically, we introduce a latent variable $z \sim p(z)$, which is a standard Gaussian distribution $\mathcal{N}(0, I_n)$ in our case, and calculate the likelihood of a target sentence $y$ by marginalizing over $z$.
We maximize the evidence lower bound (ELBO), which decomposes the loss into two parts: (1) the KL divergence between the posterior distribution and the prior distribution, and (2) a cross-entropy loss between the generated question and the ground truth.
The KL divergence in Equation 9 is calculated as follows, where $n$ is the dimensionality of $z$.
$Q(z \mid x, y)$ is a Gaussian posterior distribution. The mean $\mu$ and standard deviation $\sigma$ are calculated as follows, where $h_x$ and $h_y$ are the representations of the source and target sentences in the encoder, respectively. Similar to $h_x$, $h_y$ is obtained by encoding the target sentence.
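For reference, the quantities described in this subsection are commonly written as below for a conditional latent-variable model with a standard Gaussian prior and a diagonal Gaussian posterior. This is a sketch in standard notation under those assumptions, and it may differ in detail from the numbered equations of the paper.

```latex
\begin{align}
p(y \mid x) &= \int p(y \mid z, x)\, p(z)\, dz \\
\log p(y \mid x) &\ge \mathbb{E}_{z \sim Q(z \mid x, y)}\big[\log p(y \mid z, x)\big]
  - \mathrm{KL}\big(Q(z \mid x, y)\,\|\,p(z)\big) \\
\mathrm{KL}\big(Q(z \mid x, y)\,\|\,p(z)\big) &= -\frac{1}{2} \sum_{i=1}^{n}
  \big(1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2\big)
\end{align}
```

The first term of the bound corresponds to the cross-entropy loss on the ground-truth question, and the second is the KL term that is weighted during training.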

Training and Inference
At the training phase, we sample $z$ from $Q(z \mid x, y)$ using the re-parametrization trick (Kingma and Welling, 2014), and concatenate the last hidden state of the source, $h_x$, and $z$ as the initial state of the decoder. Since the model tends to ignore the latent variables by forcing the KL divergence to 0 (Bowman et al., 2016), we add a variable weight to the KL term during training. At the inference phase, the model generates different questions by first sampling $z$ from $p(z)$, concatenating $h_x$ and $z$ as the initial state of the decoder, and then decoding deterministically for each sample.
Here, we list our training details. We set the dimension of the encoder hidden state to 300, and the dimension of the latent variable $z$ to 64. We use dropout with a rate of 0.5, which is applied to the inputs of the RNN. Model parameters are initialized from a uniform distribution and updated with stochastic gradient descent. Word embedding values are initialized with GloVe vectors (Pennington et al., 2014). We set the learning rate to 0.1 and the batch size to 32. We tune hyperparameters on the development set, and use beam search in the inference process.
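The training-time sampling step can be sketched as follows. The dimensions 300 and 64 are the ones reported in this section; the linear KL weight schedule and all function names are hypothetical, since the exact form of the variable weight is not specified here.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Re-parametrization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I))."""
    return -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)

def kl_weight(step, warmup_steps):
    """Hypothetical linear schedule raising the KL weight from 0 to 1."""
    return min(1.0, step / warmup_steps)

rng = np.random.default_rng(0)
mu, sigma = np.zeros(64), np.ones(64)       # posterior parameters (placeholders)
z = sample_latent(mu, sigma, rng)
h_x = np.zeros(300)                         # source representation (placeholder)
decoder_init = np.concatenate([h_x, z])     # initial decoder state, shape (364,)
```

At inference time the same concatenation is used, but `z` is drawn from the prior, which corresponds to `mu = 0` and `sigma = 1` in this sketch.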

Experiment
We conduct experiments on the WikiSQL dataset (Zhong et al., 2017). WikiSQL is the largest hand-annotated semantic parsing dataset, which is an order of magnitude larger than other datasets in terms of both the number of logical forms and the number of schemata (tables). We follow Zhong et al. (2017) and use two evaluation metrics. One is logical form accuracy (Acc_lf), which measures the percentage of exact string matches between the generated SQL queries and the ground truth SQL queries. Since different logical forms might obtain the same result, the other metric is execution accuracy (Acc_ex), which is the percentage of generated SQL queries that result in the correct answer.
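The difference between the two metrics can be illustrated with a small sketch that uses an in-memory SQLite table. The table, the queries, and the helper names are invented for the example, and the official WikiSQL evaluation script may differ in how it normalizes and executes queries.

```python
import sqlite3

def logical_form_accuracy(predicted, gold):
    """Fraction of predictions that exactly match the gold SQL string."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def execute(conn, sql):
    """Run a query and return its rows, or None if the query is invalid."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None

def execution_accuracy(conn, predicted, gold):
    """Fraction of predictions whose execution result equals the gold result."""
    correct = 0
    for p, g in zip(predicted, gold):
        gold_result = execute(conn, g)
        if gold_result is not None and execute(conn, p) == gold_result:
            correct += 1
    return correct / len(gold)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (club TEXT, aggregate TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("a", "7-2"), ("b", "3-1")])
gold = ["SELECT COUNT(club) FROM t WHERE aggregate = '7-2'"]
pred = ["SELECT COUNT(aggregate) FROM t WHERE aggregate = '7-2'"]
acc_lf = logical_form_accuracy(pred, gold)      # strings differ
acc_ex = execution_accuracy(conn, pred, gold)   # both queries count one row
```

The predicted query differs from the gold query as a string but returns the same result, so it counts toward execution accuracy only.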

Impact of Data Size
We study how the number of training instances affects the accuracy of semantic parsing.
In this experiment, we randomly sample 20 subsets of examples from the WikiSQL training data, incrementally increased by 3K examples (about 1/20 of the full WikiSQL training data). We use the same training protocol and report the accuracy of the STAMP model on the dev set. Results are given in Figure 2. It is not surprising that more training examples bring higher accuracy. Interestingly, we observe that both accuracies of the neural network based semantic parser grow logarithmically as training data expands, which is consistent with the observations in computer vision tasks (Sun et al., 2017).
Table 2: Performance of different approaches on the WikiSQL dataset. The two evaluation metrics are logical form accuracy (Acc_lf) and execution accuracy (Acc_ex). The settings of the training data represent the proportion of supervised data we use.
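One simple way to check a logarithmic trend of this kind is to fit accuracy = a + b ln(n) by least squares. The sketch below does this on synthetic points generated from a known curve; the numbers are not the results of this paper, and only the 3K step size is taken from the text.

```python
import numpy as np

def fit_log_curve(sizes, accuracies):
    """Least-squares fit of accuracy = a + b * ln(size); returns (a, b)."""
    b, a = np.polyfit(np.log(sizes), accuracies, deg=1)
    return a, b

# Synthetic points from a known curve, NOT the paper's measurements.
sizes = np.arange(1, 21) * 3000
synthetic_acc = 0.10 + 0.05 * np.log(sizes)
a, b = fit_log_curve(sizes, synthetic_acc)
```

On real measurements, a small residual of this fit compared with a linear fit in `sizes` would support the logarithmic relationship.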

Model Comparisons
We report the results of existing methods on WikiSQL, and demonstrate that question generation is an effective way to improve the accuracy of semantic parsing. Zhong et al. (2017) implement several methods, including Attentional Seq2Seq, which is a basic attentional sequence-to-sequence learning baseline; Aug.PntNet, which is an augmented pointer network in which words of the target sequence come from the source sequence; and Seq2SQL, which extends Aug.PntNet by further learning two separate classifiers for the SELECT aggregator and the SELECT column. SQLNet uses two separate models to predict the SELECT and WHERE clauses, respectively, and introduces a sequence-to-set neural network to predict the WHERE clause. STAMP stands for the semantic parser described in Section 3.
From Table 2, we can see that STAMP performs better than existing systems when trained on the full WikiSQL training dataset, achieving state-of-the-art execution accuracy and logical form accuracy on WikiSQL. We further conduct experiments to demonstrate the effectiveness of our question generation driven approach. We run the entire pipeline (STAMP+QG) with different percentages of training data. The second column "Training Data" in Table 2 and the x-axis in Figure 3 represent the proportion of WikiSQL training data we use for training the QG model and the semantic parser. That is to say, STAMP+QG with 30% means that we sample 30% of the WikiSQL training data to train the QG model, and then combine the QG-generated data with exactly the same 30% of the WikiSQL training data to train the semantic parser. In this experiment, we sample five SQL queries for each table in the training data, resulting in 43.5K SQL queries. Applying the QG model to these SQL queries, we get 92.8K SQL-question pairs. From Figure 3, we see that accuracy increases as the amount of supervised training data expands. Results show that QG empowers the STAMP model to achieve the same accuracy on the WikiSQL dataset with 30% of the training data. Applying QG to the STAMP model under the full setting brings further improvements, resulting in new state-of-the-art accuracies.
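The data flow of the STAMP+QG setting described above can be sketched as a small function. The callables `train_qg` and `sample_sql` are hypothetical stand-ins for the QG trainer and the SQL sampler, passed in as arguments so the sketch stays self-contained.

```python
import random

def build_training_set(supervised, fraction, train_qg, sample_sql, seed=0):
    """Combine a supervised subset with QG-generated pseudo-labeled pairs.

    supervised: list of (question, sql) pairs.
    train_qg: callable taking (question, sql) pairs and returning a
        function that maps a SQL query to a list of generated questions.
    sample_sql: callable returning a list of sampled SQL queries.
    """
    rng = random.Random(seed)
    subset = rng.sample(supervised, int(len(supervised) * fraction))
    generate = train_qg(subset)              # QG sees only the sampled subset
    pseudo = [(q, sql) for sql in sample_sql() for q in generate(sql)]
    return subset + pseudo                   # the parser is trained on both
```

The same subset is used both to train the QG model and as the supervised part of the parser's training data, which mirrors the "exactly the same 30%" protocol.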

Fine-grained Accuracies
Since SQL queries in WikiSQL consist of a SELECT column, a SELECT aggregator, and a WHERE clause, we report fine-grained accuracies with regard to these aspects, respectively. From Table 3, we observe that the main advantage of STAMP+QG over STAMP comes from the prediction of the WHERE clause, which is also the main challenge of the WikiSQL dataset. We further analyze STAMP and STAMP+QG on the WHERE clause by splitting the dev and test sets into three groups according to the number of conditions in the WHERE clause. From Table 4, we see that combining QG is helpful when the number of WHERE conditions is more than one. The main reason is that the dominant instances in the WikiSQL training set have only one WHERE condition, as shown in Table 5, so the model might not have memorized enough patterns for the other two limited-data groups. Therefore, the pseudo-labeled instances generated by our SQL sampler and QG approach are more valuable to the limited-data groups (i.e. #where = 2 and #where ≥ 3).
Table 5: Distribution of the number of WHERE conditions in supervised and generated data.
#where   supervised data   generated data
= 1      69.1%             55.4%
= 2      24.1%             33.0%
≥ 3      6.1%              11.4%

Influences of Different QG Variations
To better understand how various components in our QG model impact the overall performance, we study different QG model variations. We use three evaluation metrics, including the two accuracies and the BLEU score (Papineni et al., 2002). The BLEU score evaluates the question generation.
Results are shown in Table 6, in which s2s represents the basic attentional sequence-to-sequence learning model (Luong et al., 2015), cp means the copying mechanism, and lv stands for the latent variable. We can see that incorporating a latent variable improves QG model performance, especially in limited-supervision scenarios. Consistent with our intuition, the performance of the QG model is also improved by incorporating the copying mechanism, since rare words of great importance mainly come from the input sequence.
Table 7: Examples generated by different QG variations.
SQL: SELECT COUNT 2nd leg WHERE aggregate = 7-2
Question (ground truth): what is the total number of 2nd leg where aggregate is 7-2
Question (s2s + cp): how many 2nd leg with aggregate being 7-2
Question (s2s + cp + lv): (1) what is the total number of 2nd leg when the aggregate is 7-2 ?
(2) how many 2nd leg with aggregate being 7-2
(3) name the number of 2nd leg for 7-2
To better understand the impact of incorporating a latent variable, we show examples generated by different QG variations in Table 7. We can see that incorporating a latent variable empowers the model to generate diverse questions for the same intent.

Transfer Learning on WikiTableQuestions
In this part, we conduct an additional experiment on WikiTableQuestions (Pasupat and Liang, 2015) in a transfer learning scenario to verify the effectiveness of our approach. WikiTableQuestions contains 22,033 complex questions on 2,108 Wikipedia tables. Each instance consists of a natural language question, a table and an answer. Following Pasupat and Liang (2015), we report development accuracy averaged over the first three 80-20 training data splits. Test accuracy is reported on the train-test data.
In this experiment, we apply the QG model learnt from WikiSQL to improve the state-of-the-art semantic parser on this dataset. Different from WikiSQL, this dataset requires question-answer pairs for training. Thus, we generate question-answer pairs with the following steps. We first sample SQL queries on the tables from WikiTableQuestions, and then use our QG model to generate question-SQL pairs. Afterwards, we obtain question-answer pairs by executing the SQL queries. The generated question-answer pairs are combined with the original WikiTableQuestions training data to train the model.

Table 8: Accuracy on WikiTableQuestions.
Method                       Dev      Test
Pasupat and Liang (2015)     37.0%    37.1%
Neelakantan et al. (2016)    37.5%    37.7%
Haug et al. (2017)           -        38.7%
Zhang et al. (2017)          40 [...]
Results are shown in Table 8, in which NSP is short for the state-of-the-art neural semantic parser. Since the train-test data used in NSP is different from others, we retrain the NSP under the same protocol. STAMP (WikiSQL) means that the STAMP model trained on WikiSQL is directly tested on WikiTableQuestions. Although applying QG slightly improves STAMP in this setting, the low accuracy reflects the different question distributions of the two datasets. In the supervised learning setting, we can see that incorporating QG further improves the accuracy of NSP from 43.8% to 44.2%.

Discussion
To better understand the limitations of our QG model, we analyze a randomly selected set of 100 questions. We observe that 27% of the examples do not correctly express the meanings of the SQL queries, the majority of which miss information from the WHERE clause. This problem might be mitigated by incorporating a dedicated encoder/decoder that takes into account the SQL structure. Among the other 73% of examples that correctly express the SQL queries, there are two potential directions for further improvement. The first direction is to leverage table information such as the type of a column name or column-cell correlations. For instance, without knowing that cells under the column name "built" are all building years, the model can hardly predict the question "what is the average building year for superb?" for "SELECT AVG built WHERE name = superb". The second direction is to incorporate common knowledge, which would help the model to predict the earliest week rather than the lowest week.

Related Work
Semantic Parsing. Semantic parsing is a fundamental problem in NLP that maps natural language utterances to logical forms, which can be executed to obtain the answer (denotation) (Zettlemoyer and Collins, 2005; Liang et al., 2011; Berant et al., 2013; Krishnamurthy and Kollar, 2013; Pasupat and Liang, 2016). Existing works can be classified along three dimensions: (1) the language of the logical form, e.g. first-order logic, lambda calculus, lambda dependency-based compositional semantics (lambda DCS) and structured query language (SQL); (2) the form of the knowledge base, e.g. facts from large collaborative knowledge bases, semi-structured tables and images; and (3) the supervision used for learning the semantic parser, e.g. question-denotation pairs and question-logical form pairs. In this work, we regard the table as the knowledge base, which is critical for accessing relational databases with natural language, and also for serving information retrieval for structured data. We use SQL as the logical form, which is broadly accepted. In terms of supervision, this work uses a small portion of question-logical form pairs to initialize the QA model and train the QG model, and incorporates more generated question-logical form pairs to further improve the QA model.

Question Generation
Our work also relates to the area of question generation, which has drawn plenty of attention recently, partly influenced by the remarkable success of neural networks in text generation. Studies in this area are classified based on the definition of the answer, including a sentence (Heilman, 2011), a topic word (Chali and Hasan, 2015), a fact (including a subject, a relation phrase and an object) from knowledge bases (Serban et al., 2016), an image (Mostafazadeh et al., 2016), etc. Recent studies in machine reading comprehension generate questions from an answer span and its context from the document (Du et al., 2017; Golub et al., 2017). Wang et al. (2015) first generate logical forms, and then use AMTurkers to paraphrase them to get natural language questions. A template-based approach based on the Paraphrase Database (Ganitkevitch et al., 2013) has also been used to generate questions from SQL. In this work, we generate questions from logical forms, in which the amount of information in the two directions is almost identical. This differs from the majority of existing studies because a question typically conveys less semantic information than the answer.
Improving QA with QG. This work also relates to recent studies that use a QG model to improve the performance of a discriminative QA model (Wang et al., 2017; Yang et al., 2017). The majority of these works generate a question from an answer, while there also exists a recent work (Dong et al., 2017) that generates a question from a question through paraphrasing. In addition, QA and QG have been considered as dual tasks, with the QG model further improved in a dual learning framework. These works fall into three categories: (1) regarding the artificially generated results as additional training instances (Yang et al., 2017; Golub et al., 2017); (2) using generated questions to calculate additional features (Dong et al., 2017); and (3) using the QG results as additional constraints in the training objectives. This work belongs to the first category. Our QG approach takes a logical form as the input, and considers the diversity of generated questions by incorporating latent variables.

Conclusion
In this paper, we observe a logarithmic relationship between the accuracy of a semantic parser and the amount of training data, and present an approach that improves neural semantic parsing with question generation. We show that question generation helps us obtain a state-of-the-art neural semantic parser with less supervised data, and further improves the state-of-the-art model with fully annotated data on the WikiSQL and WikiTableQuestions datasets. In future work, we would like to make use of table information and external knowledge to improve our QG model. We also plan to apply the approach to other tasks.