SQL Generation via Machine Reading Comprehension

Text-to-SQL systems offer natural language interfaces to databases by automatically generating SQL queries from natural language questions. On the WikiSQL benchmark, state-of-the-art text-to-SQL systems typically take a slot-filling approach, building a specialized model for each type of slot. Despite being effective, such modularized systems are complex and fall short in jointly learning different slots. To address these problems, this paper proposes a novel approach that formulates the task as a question answering problem, where different slots are predicted by a unified machine reading comprehension (MRC) model. For this purpose, we use a BERT-based MRC model, which can also benefit from intermediate training on other MRC datasets. The proposed method achieves competitive results on WikiSQL, suggesting that it is a promising direction for text-to-SQL.


Introduction
Text-to-SQL systems generate SQL queries according to given natural language (NL) queries, as shown in example (1), where the headers in the table schema are {PLAYER, COUNTRY, YEAR(S) WON, TOTAL, TO PAR, FINISH}. Text-to-SQL technology is very useful as it can empower humans to naturally interact with relational databases, which serve as foundations for the digital world today. As a subarea of semantic parsing (Berant et al., 2013), text-to-SQL is known to be difficult due to the flexibility in natural language.
(1) a. NL Query: Who is the player from the United States with a total less than 293?
    b. SQL Query: SELECT Player FROM T WHERE Country = 'United States' AND Total < 293

Recently, driven by the development of deep learning, significant advances have been made in text-to-SQL. On the WikiSQL benchmark for the multi-domain, single-table text-to-SQL task, state-of-the-art systems (Hwang et al., 2019; He et al., 2019) can predict more than 80% of entire SQL queries correctly. Most such systems take a slot-filling approach (Xu et al., 2018) that builds several (e.g. 6) specialized models, each of which is dedicated to predicting a particular type of slot, such as the column in SELECT, or the filter value in WHERE. For practical applications, however, such methods have two drawbacks. First, relying on many dedicated modules working together to generate SQL makes the architecture complex and delicate, which poses challenges in (joint) training, deployment and maintenance. Second, since most slot types are modeled with no or only limited dependencies on other slots, it is difficult for such models to leverage the inter-dependencies of SQL slots.

To deal with these problems, this paper formulates text-to-SQL as a question answering task (Section 3). In this formulation, we use a unified BERT-based (Devlin et al., 2019) machine reading comprehension (MRC) model to predict each type of SQL slot by answering template-generated questions. The SQL query is then synthesized in the same way as in slot-filling approaches. For instance, the SQL query for example (1) can be reconstructed by answering questions, some of which are shown in Table 1.

Question: What are the filter values? Answer: {United States, 293}
Question: What is the filter column for "United States"? Answer: {Country}
Question: What is the filter column for "293"? Answer: {Total}
Table 1: Context, questions and answers in the MRC formulation for example (1).
In the QA formulation, as all slot types are predicted by the same MRC model in the same manner, we arrive at a much simpler architecture with the benefits of easier training, deployment and maintenance. Moreover, with a well-designed question generation strategy, important prior information for slot prediction can be added to the questions to leverage the power of BERT even more. Besides, our MRC-based model can naturally benefit from supplementary training on intermediate labeled-data tasks (STILTs) (Phang et al., 2018).
The main contribution of this paper is an MRC approach to text-to-SQL. To the best of our knowledge, this is the first work that casts sketch-based text-to-SQL into question answering. We show that the proposed method can achieve competitive results on the WikiSQL dataset.

Related Work
Text-to-SQL is a sub-area of semantic parsing (Berant et al., 2013), which maps natural language utterances to machine-interpretable representations, such as logic forms (Dong and Lapata, 2016), program code (Yin and Neubig, 2017), and SQL. For the single-table, simple-query text-to-SQL task of WikiSQL, much earlier work (Dong and Lapata, 2016; Krishnamurthy et al., 2017; Sun et al., 2018; Wang et al., 2018) follows a neural sequence-to-sequence architecture (Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2014). This approach often suffers from the "ordering issue" when the WHERE-clause has more than one condition. Xu et al. (2018) introduce a sketch-based method, which treats text-to-SQL as slot-filling by decomposing SQL synthesis into several independent classification sub-tasks. Specifically, the aggregation function, the column in the SELECT-clause, the number of conditions, and each element of the <column, operator, cell value> triplets in the WHERE-clause are predicted separately. Recent advances (Yu et al., 2018a; Dong and Lapata, 2018; Hwang et al., 2019; He et al., 2019) mostly follow this approach and achieve competitive results on WikiSQL. However, sketch-based models, most of which are based on SQLNet (Xu et al., 2018), usually consist of six or more sub-modules and are thus complex. By contrast, our question answering-based approach uses a unified MRC model to make predictions for all the SQL slots, thus enjoying a much simpler architecture and providing a natural way to jointly model different slot types.
Much recent work (Krishnamurthy et al., 2017; Guo et al., 2019; Wang et al., 2020; Choi et al., 2020) has focused on the multi-table, complex-query setting of text-to-SQL, as in the Spider task (Yu et al., 2018b). State-of-the-art methods on Spider typically fall into two categories: the grammar-based approach (Guo et al., 2019; Wang et al., 2020), and the sketch-based approach, such as RYANSQL (Choi et al., 2020). Sketch-based methods have slot prediction modules similar to SQLNet for WikiSQL, while recursion modules are developed to handle the generation of complex SQL sketches, a characteristic of Spider but absent in WikiSQL. At a high level, our method is along the same line as SQLNet and RYANSQL, yet differs from them, as our method recognizes slots in a unified way rather than using dedicated modules to predict each slot type. We can extend our method to the Spider task by following existing sketch construction methods as in RYANSQL, while replacing their slot classification modules with our MRC-based method.
Machine reading comprehension (MRC) models (Wang et al., 2016) are typically trained to answer questions by extracting a text span from a given context passage. Thus the task is often reduced to predicting the start and end positions of the answer in the context passage. Recently, there is a trend to cast non-QA NLP tasks, such as information extraction (Levy et al., 2017; Li et al., 2020), text classification and more (McCann et al., 2018; Keskar et al., 2019), into MRC, which can achieve comparable or improved results on the original task, thanks to the flexible and unified modeling of the MRC formulation. Our work is inspired by these previous works, but tackles a new and sophisticated scenario of semantic parsing.

Method
Task formulation and dataset conversion. Given a question Q = q_1 q_2 ... q_L and a context passage C = c_1 c_2 ... c_M, where |Q| = L and |C| = M are their numbers of tokens, the task is to find the start token C_start and end token C_end of the answer in the context passage for the given question. Some example questions and their context are shown in Table 1. To fit this task formulation and apply MRC models, we first convert standard text-to-SQL annotations, which take the form of NL query-SQL query pairs as in example (1), together with the table headers, into a set of <question, answer, context> triples, similar to the SQuAD dataset (Rajpurkar et al., 2016). For each type of slot y ∈ Y that we would like to predict, we use a template to generate a question Q. For all the generated questions associated with the same SQL query, we provide a context C, which consists of the original NL query, the table headers and textual descriptions of the aggregation functions. The context is constructed in such a way that the answer to Q, i.e. the SQL slot to be predicted, is represented by a textual span in the context, which we denote as C_start..C_end. As in the standard MRC setting, all predictions are reduced to predicting the start and end indices of this textual span in the context passage.
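The conversion above can be sketched as follows. This is a minimal illustration, not the authors' released code; the question templates, the context separator, and the annotation dictionary layout are all assumptions made for the example.

```python
# Hypothetical sketch of converting a WikiSQL-style annotation into
# <question, answer, context> triples. Templates and field names are
# illustrative assumptions, not the paper's exact wording.

AGG_DESCRIPTIONS = "none max min count sum average"  # textual AGG descriptions

def build_context(nl_query, headers):
    """Context = NL query + table headers + AGG descriptions, so that every
    slot to be predicted appears as a textual span in the context."""
    return nl_query + " | " + " , ".join(headers) + " | " + AGG_DESCRIPTIONS

def make_triples(nl_query, headers, sql):
    context = build_context(nl_query, headers)
    triples = []
    # One template-generated question per slot type.
    triples.append(("What is the select column?", sql["select_column"], context))
    triples.append(("What is the aggregation function?", sql["agg"], context))
    for col, _op, val in sql["conditions"]:
        # Filter-column questions carry the extracted value as prior info.
        q = 'What is the filter column for "%s"?' % val
        triples.append((q, col, context))
    return triples

example = make_triples(
    "Who is the player from the United States with a total less than 293?",
    ["PLAYER", "COUNTRY", "YEAR(S) WON", "TOTAL", "TO PAR", "FINISH"],
    {"select_column": "PLAYER", "agg": "none",
     "conditions": [("COUNTRY", "=", "United States"), ("TOTAL", "<", "293")]},
)
```

Note how the filter value appears inside the question itself; this is one way prior information from earlier predictions can be injected into later questions.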
Since one SQL query in the WikiSQL dataset may contain multiple filters, the MRC method becomes a multi-turn procedure. In each turn, one type of slot is asked about, conditioned on the results of previous turns. For example, in Table 1 the questions about the select column and the aggregation function are answered first, without needing previous results. Multi-span extraction is conducted to answer the question about the filter values. Once the values are extracted by the MRC model, the filter column corresponding to each value is asked about using the extracted value. For example, the question What is the filter column for "United States"? is asked only after the value "United States" has been extracted in the previous step. Note that the questions are fed into the MRC model in the above order only in the prediction phase, where the predicted elements of the SQL query are assembled one by one into the WikiSQL SQL format. In the training phase, by contrast, triples from different samples are shuffled into batches, and the questions about values are constructed from the gold annotations in the dataset.
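The two-turn ordering at prediction time can be summarized in a few lines. This is a simplification under stated assumptions: the question wording is invented, and only the dependency of filter-column questions on extracted values is modeled.

```python
def prediction_turns(extracted_values):
    """Ordering of questions at prediction time: slot questions that need no
    previous results come first; filter-column questions are generated only
    after the values have been extracted. Wording is an assumption."""
    first_turn = [
        "What is the select column?",
        "What is the aggregation function?",
        "What are the filter values?",   # answered by multi-span extraction
    ]
    # Second turn: one filter-column question per extracted value.
    second_turn = ['What is the filter column for "%s"?' % v
                   for v in extracted_values]
    return first_turn, second_turn

turn1, turn2 = prediction_turns(["United States", "293"])
```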

BERT-based MRC model
Given the question Q, we extract the answer span from the context C using an MRC model. For this purpose, we use BERT (Devlin et al., 2019). In particular, we follow the standard question-answering usage of BERT, feeding the token sequence [CLS], q_1, q_2, ..., q_L, [SEP], c_1, c_2, ..., c_M as input to the BERT model, where the special [SEP] token is inserted between the question Q and the context C to distinguish them. BERT then outputs a contextualized representation matrix H ∈ R^((L+M+2)×d), where d is the hidden dimension of the last layer of BERT. Following the MRC setup in Devlin et al. (2019), we use two additional trainable parameter vectors v_start and v_end, both of dimension d, to compute the probability of each token position being the start and end position of the answer span, respectively. The computation simply applies the softmax function over the products of the BERT representation H_i of each token with the two vectors v_start and v_end, as shown in (1).

  p_start = softmax(H v_start),  p_end = softmax(H v_end)   (1)
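The span-probability computation described above amounts to two matrix-vector products followed by softmaxes over token positions. A minimal numpy sketch (a real implementation would operate on BERT outputs in a deep learning framework; the random matrix below merely stands in for H):

```python
import numpy as np

def span_probabilities(H, v_start, v_end):
    """Softmax over token positions of the dot product between each token
    representation H_i and the trainable vectors v_start / v_end."""
    def softmax(x):
        z = np.exp(x - x.max())   # subtract max for numerical stability
        return z / z.sum()
    p_start = softmax(H @ v_start)   # shape (L+M+2,)
    p_end = softmax(H @ v_end)
    return p_start, p_end

# Toy check with random stand-in "BERT outputs" (sequence length 10, d = 8).
rng = np.random.default_rng(0)
H = rng.normal(size=(10, 8))
p_start, p_end = span_probabilities(H, rng.normal(size=8), rng.normal(size=8))
```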
The MRC model described above fits scenarios where only one answer span (one start position distribution p_start and one end position distribution p_end) is extracted from the context for a given question, such as predicting the single SELECT column and aggregation function in the WikiSQL format, or predicting the filter column when the value is given. However, such a model is not suitable for value prediction, as there can be multiple filter conditions, each with a corresponding value slot, i.e. multiple value spans in the query must be extracted simultaneously as answers. To overcome this limitation, we adopt sequence labeling with the BIO tag set to predict values. Specifically, the BERT representation of each token in the context part, i.e. H_C, is fed to a conditional random field (CRF) (Lafferty et al., 2001) layer to yield the output labels, as shown in (2).

  T = CRF(H_C)   (2)
The output T has the same length as the context C; each token in C is assigned a BIO label in T indicating whether it is the beginning (B) token of a value, a continuation (I) token of a value, or outside (O) any value. Value predictions are extracted from this label sequence by combining each B label with its following I labels, ignoring O labels. The predictions of the other SQL elements then follow the MRC framework. This treatment is similar to MRC-based entity-relation extraction work. As future work, we will experiment with techniques from Hu et al. (2019) to further unify value prediction into the MRC framework.
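The span-recovery step from a BIO sequence is straightforward. A self-contained sketch of the decoding rule described above (combine a B token with its following I tokens, skip O tokens):

```python
def decode_bio(tokens, labels):
    """Extract value spans from a BIO label sequence: a span starts at a
    B token and extends over the following I tokens; O tokens end a span."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:                       # close any open span first
                spans.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:
            current.append(tok)
        else:                                 # "O", or a stray I with no B
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

values = decode_bio(
    "who is the player from the united states with a total less than 293 ?".split(),
    ["O", "O", "O", "O", "O", "O", "B", "I", "O", "O", "O", "O", "O", "B", "O"],
)
```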
STILTs and AGG prediction enhancement. STILTs (Phang et al., 2018) refers to the procedure of first fine-tuning a pre-trained language model on an intermediate task before fine-tuning on the final task. The procedure is known to improve MRC models through intermediate fine-tuning on other QA datasets (Keskar et al., 2019), so we take advantage of STILTs to boost the performance of our BERT-based MRC model. An additional improvement targets the aggregation function (AGG) prediction. Analysis of preliminary results suggests that AGG prediction is a bottleneck for our system, which is partly attributable to the finding by Hwang et al. (2019) that AGG annotations in WikiSQL contain up to 10% errors. Since our unified MRC model also has to handle the other types of questions, these extra constraints make it more challenging for our model to fit flawed data than for a dedicated AGG classifier, as used in most SOTA methods. We therefore improve the AGG results over the original MRC predictions using only simple association signals in the training data. To this end, we adopt the transformation-based learning algorithm (Brill, 1995) to update the AGG predictions based on simple association rules of the form "change AGG from x to x', given certain word tuple occurrences." Such rules are mined and ranked from the training data by the algorithm.
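To make the rule format concrete, here is a hypothetical sketch of applying such ranked transformation rules; the mining and ranking step (Brill, 1995) is omitted, and the single rule shown is invented purely for illustration, not taken from the paper.

```python
def apply_agg_rules(query_tokens, predicted_agg, rules):
    """Apply the first matching rule from a ranked list to revise the AGG
    prediction. Each rule is (trigger_words, from_agg, to_agg): 'change AGG
    from x to x-prime, given certain word tuple occurrences'."""
    words = set(query_tokens)
    for trigger, from_agg, to_agg in rules:     # rules are ranked by score
        if predicted_agg == from_agg and set(trigger) <= words:
            return to_agg
    return predicted_agg                        # no rule fires: keep original

# Illustrative rule only (assumed, not mined from real data).
rules = [(("how", "many"), "NONE", "COUNT")]
agg = apply_agg_rules("how many players won in 2006".split(), "NONE", rules)
```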

Dataset, Metric and Implementation Details
We use the largest human-annotated text-to-SQL dataset, WikiSQL, which consists of 80,654 pairs of questions and human-verified SQL queries. Tables that appear in the train or dev set never appear in the test set. As in previous work, the following two metrics are used to evaluate SQL query synthesis accuracy: (1) Logical Form Accuracy, denoted LF, where LF = (# of SQL queries with correct logical form) / (total # of SQL queries); and (2) Execution Accuracy, denoted EX, where EX = (# of SQL queries with correct execution results) / (total # of SQL queries). Execution-guided decoding (Wang et al., 2018) is not evaluated. The word embeddings are initialized by BERT and fine-tuned during training. Adam (Kingma and Ba, 2014) is used to optimize the model with default hyper-parameters. We choose the uncased BERT-base pre-trained model with default settings due to resource limitations. The code is implemented in PyTorch 1.3 and will be made publicly available.
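The two metrics can be sketched directly from their definitions. This is an illustrative simplification: here exact string equality stands in for logical-form equivalence, whereas the real WikiSQL evaluation normalizes the query structure (e.g. the order of WHERE conditions).

```python
def logical_form_accuracy(pred_sqls, gold_sqls):
    """LF = (# of SQL with correct logical form) / (total # of SQL).
    String equality is a stand-in for logical-form comparison."""
    correct = sum(p == g for p, g in zip(pred_sqls, gold_sqls))
    return correct / len(gold_sqls)

def execution_accuracy(pred_results, gold_results):
    """EX = (# of SQL whose execution result matches gold) / (total # of SQL)."""
    correct = sum(p == g for p, g in zip(pred_results, gold_results))
    return correct / len(gold_results)

lf = logical_form_accuracy(["SELECT a FROM t", "SELECT b FROM t"],
                           ["SELECT a FROM t", "SELECT c FROM t"])
ex = execution_accuracy([[("x",)], [("y",)]], [[("x",)], [("y",)]])
```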

Results
We compare our method with notable published work that has reported results on WikiSQL, including Seq2SQL, SQLNet (Xu et al., 2018), TypeSQL (Yu et al., 2018a), Coarse-to-Fine (Dong and Lapata, 2018), SQLova (Hwang et al., 2019) and X-SQL (He et al., 2019), in Table 2. On the test set, our final model with BERT-base outperforms SQLova, the BERT-large based strong baseline, and rivals the SOTA X-SQL with MT-DNN. For STILTs, we fine-tuned BERT on the SQuAD 1.1 dataset for 3 epochs with hyper-parameters similar to Devlin et al. (2019), before fine-tuning on WikiSQL; the results are shown in Table 2.

Analysis. Table 3 shows slot-type-wise results, indicating that aggregation function accuracy S_agg is the bottleneck of the pure MRC model (ours), probably because our unified model meets more obstacles in fitting partially erroneous data; the AGG enhancement method (AE) is very effective here. Our error analysis on 100 randomly sampled errors shows that while 47% of the errors can be attributed to the model, 53% are best described as data flaws or errors, the majority of which involve AGG. For example, "What year has a bronze of Valentin Novikov", "What year has a silver for Matthias Merz", and "What is the year of the Film Klondike Annie" share the same (WH-word + SELECT column) pattern, but their gold AGGs are AVG, MIN and SUM, respectively. A further automatic analysis finds that 8.91% of the data are cases where queries of the same pattern are annotated with at least 3 distinct AGGs. Such inconsistency suggests that even higher accuracy would mean fitting both signal and noise.
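The automatic consistency analysis above can be sketched as a grouping-and-counting pass. The coarse (WH-word + SELECT-column) pattern key and the toy samples below are assumptions for illustration, not the paper's exact procedure.

```python
from collections import defaultdict

def agg_inconsistency(samples, min_distinct=3):
    """Fraction of samples whose (WH-word, SELECT-column) pattern is
    annotated with at least `min_distinct` distinct gold AGG labels."""
    by_pattern = defaultdict(set)
    for wh_word, select_col, agg in samples:
        by_pattern[(wh_word, select_col)].add(agg)
    flagged = {p for p, aggs in by_pattern.items() if len(aggs) >= min_distinct}
    hits = sum((wh, col) in flagged for wh, col, _ in samples)
    return hits / len(samples)

# Toy data mirroring the AGG inconsistency discussed above.
rate = agg_inconsistency([
    ("what", "year", "AVG"), ("what", "year", "MIN"),
    ("what", "year", "SUM"), ("who", "player", "NONE"),
])
```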

Conclusion
This paper proposes a question answering approach to text-to-SQL, in which a BERT-based MRC model is trained to predict all the slots needed for SQL generation. Our approach enjoys the advantages of easier deployment and maintenance in practice, as well as the potential to leverage other MRC datasets via STILTs supplementary training. Capable of jointly learning slots with a simple, unified model, the proposed method proves to be a promising direction for text-to-SQL. As future work, we plan to extend our model to the multi-table text-to-SQL task, Spider.