NL2pSQL: Generating Pseudo-SQL Queries from Under-Specified Natural Language Questions

Generating SQL codes from natural language questions (NL2SQL) is an emerging research area. Existing studies have mainly focused on clear scenarios where specified information is fully given to generate a SQL query. However, in developer forums such as Stack Overflow, questions cover more diverse tasks including table manipulation or performance issues, where a table is not specified. The SQL query posted in Stack Overflow, Pseudo-SQL (pSQL), does not usually contain table schemas and is not necessarily executable, is sufficient to guide developers. Here we describe a new NL2pSQL task to generate pSQL codes from natural language questions on under-specified database issues, NL2pSQL. In addition, we define two new metrics suitable for the proposed NL2pSQL task, Canonical-BLEU and SQL-BLEU, instead of the conventional BLEU. With a baseline model using sequence-to-sequence architecture integrated by denoising autoencoder, we confirm the validity of our task. Experiments show that the proposed NL2pSQL approach yields well-formed queries (up to 43% more than a standard Seq2Seq model). Our code and datasets will be publicly released.


Introduction
Converting natural-language questions to SQL codes (NL2pSQL) is an active area of research in natural language processing. However, earlier work (Hemphill et al., 1990;Brad et al., 2017;Zelle and Mooney, 1996;Tang and Mooney, 2000;Popescu et al., 2003;Lawrence and Riezler, 2016;Li and Jagadish, 2014;Roy et al., 2013;Yaghmazadeh et al., 2017;Zhong et al., 2017;Yu et al., 2018c;Finegan-Dollak et al., 2018) mainly focused on fully specified questions where the given question and the table can be uniquely answered by a generated SQL query. For example, a question such as "who is the manufacturer for the order year 1998?" can only be uniquely answered when the table that can answer the question is attached to the question. This requires the generated SQL query to be executable, in order to verify its correctness.
Meanwhile, real-life NL2pSQL questions from developer forums such as Stack Overflow (SO) (StackOverflow, 2018) are mostly under-specified to generate an executable answer. Actual questions deal with more diverse issues, including table creation, backup, or performance, and often present with a less detailed context (such as missing table name or schema). For example, a SO question 2 such as "How to reset AUTO INCREMENT in MySQL?" can only be answered by a non-executable template "ALTER TABLE [tablename] AUTO INCREMENT = 1", where human developers are expected to read and then feed the missing context. We call these questions Under-specified NL2pSQL, distinguished from the conventional NL2pSQL task on fully specified questions.
Our key contribution is to address 1) the novel problem of generating human-readable pseudo SQL code to diverse ranges of under-specified developer questions with our own curated dataset and to propose 2) a new approach as well as 3) evaluation metrics suitable for the proposed task. Our dataset contains crawled posts in SO between the year of 2008 and 2017. The proposed task addresses the following new opportunities and challenges.
First, our new dataset covers diverse scenarios with 362 distinct SQL keywords, such as USING, UNIQUE, SELECT, UPDATE, TRUNCATE, etc, unlike specified scenarios such as data lookup using simple SELECT queries. Meanwhile, 19% of the answers in SO do not contain the SELECT keyword at all.
Second, we eliminate the requirement of a generated code to be executable, for which we propose new evaluation metrics without such a requirement. An existing metric is lexical comparisons with the human-generated templates using BLEU scores (Papineni et al., 2002). This metric based on counting matching n-gram may penalize mismatches in non-essential tokens such as variable names (Lin et al., 2018;Zhong et al., 2017), while mismatches of SQL keywords should not be tolerated.
Our contribution includes the following: • Dataset: We publicly release the NL2pSQL dataset containing 1,244 pairs as well as the denoising dataset for NL2pSQL containing 406,384 pairs. 3 • NL2pSQL: We propose an end-to-end natural NL2pSQL model without the restriction of an executable SQL query trained and evaluated for under-specifed questions. Existing sequence-to-sequence models are known to generate repetitive tokens, but our denoising autoencoder can effectively fix errors in SQL statements towards readable codes. Our model performs a two-step translation by cascading two sequence-to-sequence models, one for generating initial SQL statements from natural language input, and the other to fix ill-formed SQL statements.
• Evaluation metric: While existing NL2pSQL metrics penalize the mismatch of any terms, we propose novel metrics allowing the mismatches for non-essential terms, yet requiring matches for important SQL keyword: Specifically, Canonical-BLEU abstracts the identifier names of the SQL statements before calculating the BLEU scores while SQL-BLEU calculates the BLEU scores using only the SQL keywords. Different forms of SQL queries may exist for a given natural language, and we note that no single metric is capable to accurately discern one form from the other. Thus, two proposed metrics complement each other. 3 https://github.com/clovaai/nl2psql 2 Related Work Natural Language to SQL. Bridging the gap between natural language and SQL queries has been a long-studied area (Warren and Pereira, 1982;Popescu et al., 2003Popescu et al., , 2004Li et al., 2006;Giordani and Moschitti, 2012). With the emergence of deep learning, neural network-based approaches have gained popularity including models using sequence-to-sequence and reinforcement learning (Zhong et al., 2017;Xu et al., 2018;Iyer et al., 2017;Dong and Lapata, 2016;Yu et al., 2018a). These models assume fully-specified questions and require human-annotated executable SQL queries for training and evaluation. Thus, they are not applicable to under-specified questions, which are the majority of issues in realworld software developer forums.
An alternative approach is designing semantic parsers, to exploit rich prior knowledge in the form of features and grammars (Zettlemoyer and Collins, 2005). It aims at mapping natural language utterances to an executable logical form such as SQL. They require rich resources -in the form of either grammars or training data to extract such grammars. As existing semantic parsers on SQL build on pairs of SQL and their execution results, which cannot exist in our problem context (Xu et al., 2018;Zhong et al., 2017;Sun et al., 2018;Yu et al., 2018c,b) of under-specified questions, we adopt a generative approach in our model.
Program Synthesis Program synthesis or automatic code generation is a relatively new field and has gained much attention lately. In previous work (Ling et al., 2016;Rabinovich et al., 2017;Yin and Neubig, 2017;Balog et al., 2017;Murali et al., 2018), the generated code is targeted at programming languages such as Java and Python, or they used a reduced specification such as a domain-specific language (DSL). Meanwhile, Beltramelli utilized GUI screenshots to generate markup language codes (Beltramelli, 2017). Our work differs in the sense that we are not constrained to a reduced specification but deal with a new domain of SQL statements, instead of markup languages or fully-fledged programming languages such as Java and Python.
Denoising and Program Repair Denoising is a popular research area in computer vision (Buades et al., 2005;Elad and Aharon, 2006;Dabov et al., 2007;Zhang et al., 2017;Meiniel et al., 2018) and natural language processing (Kouno et al., 2015;Napoles et al., 2017). In software engineering, denoising tasks focus on bug fixing in fully-fledged programming languages such as Java and C (Long and Rinard, 2015;Mechtaev et al., 2016;Gupta et al., 2017). Our work is different in that we are combining the denoising aspect into the autogenerated SQL queries to complement the limitation of neural generation, which has not been previously considered.

NL2pSQL Dataset
To obtain human-written pseudo SQL statements with their corresponding natural language description, we turn to Stack Overflow (SO). We use the public SO data from the Stack Exchange Data Dump website for convenience purposes as the data are already structured in an XML form, e.g., 'Posts.xml', for easy extraction. Our goal is to curate question and code pairs: For example, in the SO Question 6308594, 4 the natural language input is "how can I copy data from one column to another in the same table", and the code is "update table set columnB = columnA".
In particular, we collect all the answers where they have a single code block and their corresponding questions are labeled with an SQL tag from 'Posts.xml'. Note that this xml file contains all the SO questions and answers since its establishment from August 2008 to September 2017. We also used the data from 'Postlinks.xml' (described in the later paragraph).
A code block can contain multiple lines of codes. We observed that indentation for code readability made numerous SQL statements be represented with multiple lines. A SQL statement indented with a different style will have a trivial effect when being read and executed by a SQL parser. Thus, we transformed all the multi-line SQL statements into a single line by replacing the line break with a white space. In this study, we only consider the SQL statement with a maximum token (delimited by a white space) length of 100.
Among questions with SQL answers, we avoid error debugging questions but focus on implementation questions with code answers. To this end, we first collect all the QA pairs tagged with SQL. We then extract the title from the QA pairs for classifying the implementation questions, as the title description has been reported as sufficient (Iyer et al., 2016). We only keep QA pairs where their question titles start with "how", 5 and not containing the following keywords bug, error, fix within the title content to exclude questions on error debugging. We manually inspected these QA pairs to ensure all of them to be implementation-related questions. As a result, 1,225 QA pairs in total were found.
Another issue is redundancy: 'Postlinks.xml' may also contain recurring question pairs from SO, asking the same questions in lexically different forms (or paraphrased) (Chen and Kim, 2015). We found 19 such recurring questions, and we union them to the previously mentioned 1,225 QA pairs. A total of 1,244 implementation question QA pairs were collected.
We observed that the queries in our dataset mostly involve complex conditions such as the combination of multiple joined tables with ordering and grouping clauses that are not supported by existing NL2pSQL approaches. A majority of the natural text contains 11 to 40 tokens, with SQL queries containing 11 and 15 tokens. To gain more insights on the difficulty of SQL queries, we measured Halstead complexity metrics (Halstead et al., 1977) on the training, development, and test sets and found most queries fall under the complexity score of 0-5 and well-split with nearidentical distributions across the three splits.

Model: NL2pSQL with Denoising
While semantic parsing approaches require the machine-executable code for training (as outlined in Section 2), we focus on a new generative model without executability requirements for our proposed scenario. Essentially, our model consists of two layers of autoencoder architecture (Figure 1). The first layer is for generating SQL code given a natural language as input (NL2pSQL) through an autoencoder. The input to the first autoencoder is the word embedding of the natural language. However, this model suffers a well-known problem of low-quality generation, such as token repetitions (Holtzman et al., 2018), or "ill-formed". Though our approach does not require a code segment to be machine-executable, generated codes how to copy or update from one table to another table

Denoised Generated SQL Code Well-Formed SQL Statement
Denoising Encoder Denoising Decoder Figure 1: Overview of NL2pSQL with Denoising Autoencoder. Our model performs a two-step translation by cascading two sequence-to-sequence models. The input layers are the natural-language word embeddings in the 1 st autoencoder and the SQL statement (generated by the first autoencoder) word embeddings in the 2 nd autoencoder.
should be sufficiently "well-formed" to ensure human readablity. For example, a generated code with repetitive tokens will be considered ill-formed and transformed into its well-formed counterpart (or, denoising). Our contribution is thus to denoise the ill-formed queries by employing a subsequent autoencoder architecture. The input to the second autoencoder is the word embeddings of the generated SQL statements from the first autoencoder. This denoising functionality is included in the second layer with the goal of producing well-formed SQL statements. For training the denoiser, we propose to synthesize a pair of ill-and well-formed SQL statements from the given dataset: <a well-formed SQL statement, an ill-formed SQL statement> as the input to the denoising autoencoder module. Formally, we define a well-formed SQL statement as a valid SQL statement recognized by an SQL parser whereas an ill-formed SQL statement is detected as an erroneous SQL statement construct.
We first collect well-formed SQL statements among SQL statements written by developers in SO using MySQL parser (GuduSoftware, 2018), as MySQL is the most popular SQL dialect among developers (StackOverflowSurvey, 2018). Next, we generate its ill-formed counterpart, by perturbing each of the collected well-formed SQL statements. Specifically, we randomly decide to inject into or remove from the well-formed SQL statement a token for k iterations to generate multiple ill-formed statements for a given well-formed SQL statement. In our model, we set k to be four.
To the best of our knowledge, this is the first NL2pSQL model that is not restricted to the executability from specified questions. This work is also the first in attempting to fix (denoise) illformed SQL statements through a neural networkbased autoencoder architecture.

Evaluation Metrics
To evaluate the generated SQL statements, we first check if the generated SQL statements are wellformed (Section 3.3.1). In addition, we propose two new BLEU variants. Previous studies have reported that the BLEU score may be inappropriate in measuring the correctness of the generated code against the target code (Lin et al., Zhong et al., 2017). This is because a generated code may be written differently but still perform the intended function. The first example in Table 6 illustrates this problem. The generated SQL statement performs the semantics as required in the natural language but is structurally different as compared to the ground truth (GT). By using the vanilla BLEU, it produces a much heavily penalized score of 0.065. Thus, in our study, we use two modified versions of BLEU, namely Canonical-BLEU and SQL-BLEU, as will be described in Section 3.3.2.

Well-Formed SQL Statements
To evaluate the effectiveness of our denoising autoencoder, we compare the relative number of well-formed SQL statements between the generated SQL statements before and after denoising. Whether a particular SQL statement is wellformed is determined by using a SQL parser.

Canonical-BLEU and SQL-BLEU
Here, we first motivate the need for introducing two new BLEU variants, namely Canonical-BLEU and SQL-BLEU, before describing them in more detail. Canonical-BLEU first abstracts the identifiers of the SQL statement before calculating the BLEU score, while SQL-BLEU only considers the SQL keywords within the SQL statement when computing the BLEU score. We note that two queries with the same semantics may have syntactically different forms. For example, a join statement (S1) such as "select a.car from a,b where a.id=b.id" can be equivalent to a nested select statement (S2) such as "select a.car from a where a.id in (select b.id from b where b.name='Mary')". When S1 is generated and the ground truth is a similar join statement, it will have a correct and high SQL-BLEU score, yet having lower computing complexity. When the ground truth is similar to S2, SQL-BLEU and Vanilla BLEU would penalize more heavily than our Canonical-BLEU. In this respect, both SQL-BLEU and Canonical-BLEU complement each other and should be used together.
Canonical-BLEU Generated SQL statements may have different identifiers such as variable and table names, as compared to the original (ground truth) SO SQL statements. The difference in the identifiers' name between the generated SQL statements and the ground truth should not be heavily penalized if their SQL forms are semantically the same. To mitigate this issue, we first "canonicalize" all the identifier names (e.g., MY COLUMN) into indexed placeholders (e.g., var 1). This transformation is performed both on the ground truth SQL statements and the generated SQL statements, and their BLEU scores are calculated. We call this Canonical-BLEU.
SQL-BLEU In a SQL statement, several words are considered keywords that serve as particular functions. For example, SELECT and UPDATE are both keywords in the SQL language. SELECT is used for retrieving a record from a relational database table whereas UPDATE is used to modify the content of a record. In SO, developers are free to adopt and use different column or table names. However, despite that, if the SQL code is about the retrieval of records, there should have a SE-LECT keyword within the SQL statement. Furthermore, we analyze the syntactically correct answers from SO between the period of 2008 and 2017 and found that 362 distinct SQL keywords (e.g., YEAR, DATABASES, USING, UNIQUE, SE-LECT, UPDATE, TRUNCATE, etc) are used, and 19% of the answers (that are well-formed) do not contain the SELECT keyword. We also found that multiple different variations of retrieval-type answers (i.e., 348 different keywords are used together with SELECT) use different combinations of keywords such as WHEN, EXIST, INNER JOIN, IN, GROUP BY, etc. Thus, we decided to use SQL keywords as another form of measure.
We first transform the generated SQL statements into their token types using a SQL parser. Each word will have its corresponding token type. For example, the word SELECT will be tagged as a keyword by the SQL parser. We remove all words that are non-keywords and only those words tagged as keywords are left in the SQL statements (we preserve their word ordering) to calculate the BLEU score. This transformation is performed both on the ground truth SQL code as well as the generated SQL statements and their BLEU scores are calculated. We call this SQL-BLEU.

Data Description
The used dataset is randomly split into train, dev and test sets in the ratio 60:20:20 to be mutual exclusive.

Baseline and Backbone Models
We stress that our work is focused on underspecified NL2pSQL and is designed for users to read/adjust the missing context for their specific usage. This is the first work on under-specified NL2pSQL and is mostly a resource and new problem/metric definition paper. Note that existing NL2pSQL work is all fully-specified and cannot be used for comparison as they involve semantic parsing and query execution on a valid database schema, which is absent in our context. In addition, making the system fully-specified is beyond the scope of this paper.
We conduct experiments for NL2pSQL generation with denoising on two neural machine translation models, Seq2Seq and CopyNet, which have shown competitive performance on naturallanguage-to-code translation (Lin et al., 2018). In addition, we carry out experiments in both token and character levels on both of these models, denoted as Seq2Seq(Token), Seq2Seq(Char), Copy-Net(Token), CopyNet(Char). For each model, we perform a two-step translation by cascading two sequence-to-sequence models, for pSQL generation and denoising respectively. Each model is trained with the following hyperpameters: 6 • Seq2Seq (Cho et al., 2014) encodes questions and decodes into queries using recurrent neural networks (RNNs). The encoder dimension is 200 by using a bi-directional RNN with concatenation. The dimension of the decoder RNN is set as 100. We optimized our objective function with Adam (Kingma and Ba, 2015), using the default momentum hyperparameters. We set the initial learning rate as 0.0001 and the mini-batch size as 32.
We also used variational RNN dropout (Gal and Ghahramani, 2016) with a dropout rate of 0.4. During beam-search decoding (Hwang and Chang, 2007), we set the beam size as 100, and we ran for 100 epochs to train the model. For beam size, it has been reported that a larger beam size will lead to a higher translation quality at the expense of the decoder speed (Freitag and Al-Onaizan, 2017). We wanted to find a balance between translation quality and decoding speed. We tested different beam sizes and an entire decoding process to generate a SQL statement takes only a few seconds when the beam size is set to 100.
• CopyNet (Gu et al., 2016) extends the model above, to selectively "copy" some input sequence into output, using the same hyperparameters discussed above.
• The information retrieval (IR)-based baseline is a Lucene-based code retrieval tool, widely adopted for code search such as Sourcerer (Linstead et al., 2009). It retrieves top-3 matches, based on answer titles, matching the given question.

Results
This section shows the quantitative (Sections 5.1 and 5.2) as well as qualitative analysis (Section 5.3) from our experiments. Table 1 shows the change in the number of SQL statements determined as well-formed by the SQL parser, after going through our denosing autoencoder module trained with a Seq2Seq and a Copy-Net model.

Well-Formed SQL Statements
Our module certainly increases the number of well-formed SQL statements. For example, when paired with a Seq2Seq denoising autoencoder for the top-1 prediction, the number of well-formed statements increases from 47 to 66 (19% increase). Such increase ranges from 12% to 19%, from 14% to 19%, from 30% to 43%, and from 0% to 7%, in Seq2Seq (Token), CopyNet (Token), Seq2Seq (Char), and CopyNet (Char), respectively. When paired with a CopyNet Denoising autoencoder, such increase is less significant. When comparing token-and character-level generation after denoising, we can see that token-level generation shows higher increase, as shown in bold.

Comparison of Canonical and SQL-BLEU
Here, we report the mean scores of both Canonical-BLEU and SQL-BLEU over all the generated SQL statements in the test set (including their denoised version) using the Seq2Seq and CopyNet autoencoder models in both the token and character levels. Tables 2 and 3 show the changes in the mean SQL-BLEU scores on the first three generated SQL statements, after using denoising autoencoders trained with a Seq2Seq and a CopyNet model, respectively. For example, when paired with a Seq2Seq denoising autoencoder for the top-1 prediction, the mean SQL-BLEU scores has a negligible decrease of 0.01 (from 0.28 to 0.27). Such decrease ranges from 0.01 to 0.01, from 0.01 to 0.01, from 0.02 to 0.03, and from 0.02 to 0.02, in Seq2Seq(Token), CopyNet (Token), Seq2Seq (Char), and CopyNet(Char), respectively. When paired with a CopyNet Denoising autoencoder, we observed similar negligible decrease in the Table 1: % of well-formed, generated SQL queries using a Seq2Seq/CopyNet & after denoising autoencoder models. P1, P2, and P3 correspond to the first three predictions. Columns (2, 4, 6, 8) & (3, 5, 7, 9) show the increased in well-formed queries after going through a Seq2Seq & CopyNet denoising autoencoder respectively.         mean SQL-BLEU after denoising. Compared to IR approach, token level approaches mostly yield higher SQL-BLEU scores. Table 4 and 5 show the change in the mean Canonical-BLEU scores on the first three generated SQL statements, after using denoising au-toencoders trained with a Seq2Seq and a CopyNet model respectively. For example, when paired with a Seq2Seq denoising autoencoder for the top-1 prediction, the mean Canonical-BLEU scores has a negligible decrease in Seq2Seq (Token), from 0.00 to 0.01, or 0.01 decrease. Such decrease ranges from 0.00 to 0.02, from 0.00 to 0.01, from 0.01 to 0.02, and from 0.02 to 0.03, in Seq2Seq(Token), CopyNet (Token), Seq2Seq (Char), and CopyNet(Char), respectively. When paired with a CopyNet Denoising autoencoder, we observed similar negligible decrease in the mean SQL-BLEU after denoising. Compared to the IR approach, we observed that most of the generative approaches yield higher Canonical-BLEU scores.

Mean Canonical-BLEU
In Section 5.1, we previously reported up to 43% more well-formed SQL statements in the denoised version than the non-denoised version. This suggests that the two metrics (Well-Formed SQL Statements and the two variants of BLEU -Canonical BLEU and SQL BLEU) we proposed are complementary, in that Canonical-BLEU and SQL-BLEU alone cannot distinguish well-formed SQL statements. We also note that for denoising, while enhancing one metric, it does not hurt another. Table 6 shows the output examples generated by our models in the unseen data (the test set), including the denoised version. We only show the generated outputs from the Seq2Seq model as they yield the best results in our experiments. Each example consists of the natural language input, the ground truth, the top prediction of the generated output from the Seq2Seq model, and its denoised version.

Qualitative Analysis
The first example is about copying or updating from one table to another. In the generated SQL statement, it performs the intended copying function but is syntactically incorrect in the use of date time. The denoised version fixes it by replacing date time : : date and date time : : time with proper column names. Furthermore, the ground truth is coupled with comments, and its SQL statement is much longer than the generated version. The retrieved statement by the IR baseline has similar semantics on observing the same type of an insertion query with our approach. Although both of them are similar, the IR baseline appears more complicated as it contains more columns for insertion and a nested table structure. We believe that the generated SQL statement is a much better choice in this case.
The second example is about counting the number of rows. The generated SQL statement shows the counting-related code. Interestingly, the IR approach produces the exact query. In the last example, it is about inserting a record from another table. Similar to the IR approach, the generated SQL statement is able to churn out a query performing an insertion operation from another table. Both generated SQL statements from the last two examples do not contain any syntax errors, and the denoised autocoders did not change the SQL statements. In the last example, it is difficult to include "top500" which requires verbal explanation of defining one's own function in the "order by" clause. This part was omitted, but all the other semantics are fully preserved.

Conclusions
This paper addressed the problem of generating human-readable pseudo SQL queries from natural-languge questions without codeexecutability restriction, and propose a new dataset. Despite excluding executability, our task covers much more diverse issues including table manipulation and performance-related problems beyond simple table lookup. This coverage allows our task to help software engineers to deal with various practical implementation and operation issues. In addition, we describe a new NL2pSQL model and two evaluation metrics suitable for dealing with our task. Our result shows that adding a denoising autoencoder module enables our model to generate a more number of wellformed queries, to complement a known weakness of neural code generation.