TypeSQL: Knowledge-Based Type-Aware Neural Text-to-SQL Generation

Interacting with relational databases through natural language helps users of any background easily query and analyze a vast amount of data. This requires a system that understands users' questions and converts them to SQL queries automatically. In this paper, we present a novel approach, TypeSQL, which formulates the problem as a slot filling task in a more reasonable way. In addition, TypeSQL utilizes type information to better understand rare entities and numbers in the questions. We experiment with this idea on the WikiSQL dataset and outperform the prior state of the art by 6% in a much shorter time. We also show that accessing the content of databases can significantly improve performance when users' queries are not well-formed. TypeSQL reaches 82.6% accuracy, a 17.5% absolute improvement over the previous content-sensitive model.


Introduction
Building natural language interfaces to relational databases is an important and challenging problem (Li and Jagadish, 2014; Pasupat and Liang, 2015; Yin et al., 2016; Zhong et al., 2017; Yaghmazadeh et al., 2017; Xu et al., 2017; Wang et al., 2017a). It requires a system that is able to understand natural language questions and generate corresponding SQL queries. In this paper, we consider the WikiSQL task proposed by Zhong et al. (2017), a large-scale benchmark dataset for the text-to-SQL problem. Given a natural language question for a table and the table's schema, the system needs to produce a SQL query corresponding to the question.
We introduce a knowledge-based type-aware text-to-SQL generator, TYPESQL. Based on the prior state-of-the-art SQLNet (Xu et al., 2017), TYPESQL employs a sketch-based approach and views the task as a slot filling problem (Figure 2). By grouping different slots in a reasonable way and capturing relationships between attributes, TYPESQL outperforms SQLNet by about 3.5% in half the original training time.
Furthermore, natural language questions often contain rare entities and numbers specific to the underlying database. Previous work (Agrawal and Srikant, 2003) shows that such words are crucial to many downstream tasks, such as inferring column names and condition values in the SQL query. However, most of these key words lack accurate embeddings in popular pre-trained word embedding models. To solve this problem, TYPESQL assigns each word a type: an entity from the knowledge graph, a column name, or a number. For example, for the question in Figure 1, we label "mort drucker" as PERSON according to our knowledge graph; "spoofed title," "artist," and "issue" as COLUMN since they are column names; and "88.5" as FLOAT. Incorporating this type information, TYPESQL further improves the state-of-the-art performance by about another 2% on the WikiSQL dataset, resulting in a 5.5% improvement in total.
Moreover, most previous work assumes that user queries contain exact column names and entries. However, it is unrealistic that users always formulate their questions with the exact column names and string entries in the table. To tackle this issue, when scalability and privacy are not a concern, the system needs to search the database content to better understand what the user is querying. Our content-sensitive model TYPESQL + TC gains roughly 9% improvement over the content-insensitive model, and outperforms the previous content-sensitive model by 17.5%.

Related Work
Semantic parsing maps natural language to meaningful executable programs. The programs could be a range of representations such as logic forms (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Das et al., 2010; Liang et al., 2011; Banarescu et al., 2013; Artzi and Zettlemoyer, 2013; Reddy et al., 2014; Berant and Liang, 2014; Pasupat and Liang, 2015). Another area close to our task is code generation, which parses natural language descriptions into a more general-purpose programming language such as Python (Allamanis et al., 2015; Ling et al., 2016; Rabinovich et al., 2017; Yin and Neubig, 2017). As a sub-task of semantic parsing, the text-to-SQL problem has been studied for decades (Warren and Pereira, 1982; Popescu et al., 2003, 2004; Li et al., 2006; Giordani and Moschitti, 2012; Wang et al., 2017b). The methods of the database community (Li and Jagadish, 2014; Yaghmazadeh et al., 2017) involve more hand-crafted feature engineering and user interactions with the systems. In this work, we focus on recent neural network based approaches (Yin et al., 2016; Zhong et al., 2017; Xu et al., 2017; Wang et al., 2017a; Iyer et al., 2017). Dong and Lapata (2016) introduce a sequence-to-sequence approach to converting text to logical forms. Most previous work focuses on a specific table schema, using a single database for both training and testing; such models do not generalize to new databases. Zhong et al. (2017) publish the WikiSQL dataset and propose a sequence-to-sequence model with reinforcement learning to generate SQL queries. In the problem definition of the WikiSQL task, the databases in the test set do not appear in the train and development sets. Also, the task needs to take different table schemas into account. Xu et al. (2017) further improve the results by using a SQL sketch based approach employing a sequence-to-set model.

Figure 2: SQL Sketch. The tokens starting with "$" are slots to fill. "*" indicates zero or more AND clauses.

Methodology
Like SQLNet, we employ a sketch-based approach and format the task as a slot filling problem. Figure 2 shows the SQL sketch; our model needs to predict all slots that begin with $. Figure 1 illustrates the architecture of TYPESQL on the right and a detailed overview of one of its three main models, MODEL_COL, on the left. We first preprocess question inputs by type recognition (Section 3.1). Then we use two bi-directional LSTMs to encode the words in the question with their types and the column names separately (Section 3.2). The output hidden states of the LSTMs are then used to predict the values for the slots in the SQL sketch (Section 3.3).
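Concretely, the sketch can be treated as a template whose slots are filled independently and then rendered into a query. The renderer below is our own minimal sketch, not part of TYPESQL, with a plausible filling for the Figure 1 question:

```python
# SQL sketch (Figure 2): SELECT $AGG $SELECT_COL WHERE $COND_COL $OP $COND_VAL (AND ...)*
# Hypothetical helper: renders filled slots into a query string.

def render_sketch(agg, select_col, conds):
    """`conds` is a list of ($COND_COL, $OP, $COND_VAL) triples."""
    select = f"{agg}({select_col})" if agg else select_col
    query = f"SELECT {select}"
    if conds:
        query += " WHERE " + " AND ".join(
            f"{col} {op} {val!r}" for col, op, val in conds)
    return query

# A plausible query for the Figure 1 question about "mort drucker":
print(render_sketch("", "spoofed title",
                    [("artist", "=", "mort drucker"), ("issue", "=", 88.5)]))
# -> SELECT spoofed title WHERE artist = 'mort drucker' AND issue = 88.5
```

The model never generates this string token by token; it only predicts the slot values, which makes the output space much smaller than in sequence-to-sequence decoding.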

Type Recognition for Input Preprocessing
In order to create a one-to-one type input for each question, we first tokenize each question into n-grams of length 2 to 6 and use them to search over the table schema, labeling any column name that appears in the question as COLUMN. Then, we assign numbers and dates in the question to four self-explanatory categories: INTEGER, FLOAT, DATE, and YEAR. To identify named entities, we search for five types of entities: PERSON, PLACE, COUNTRY, ORGANIZATION, and SPORT, on Freebase (https://developers.google.com/freebase/) using the n-grams as keyword queries. The five categories cover the majority of entities in the dataset, so we do not use the other entity types provided by Freebase. Domain-specific knowledge graphs can be used for other applications.
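The content-insensitive preprocessing above can be sketched in a few lines of Python. This is our own illustration, not the released code: the Freebase search is stubbed out by an `entity_lookup` callback, and we also match unigrams so that single-word column names such as "artist" are caught:

```python
import re

def type_tokens(question_toks, columns, entity_lookup=lambda g: None):
    """Assign one type per question token (content-insensitive variant).

    `entity_lookup` stands in for the Freebase keyword search and should
    return e.g. "person" or "place" for a known entity n-gram, else None.
    """
    types = ["none"] * len(question_toks)

    def tag(i, j, label):
        # Longer n-grams are tried first, so never overwrite a label.
        for k in range(i, j):
            if types[k] == "none":
                types[k] = label

    col_grams = {" ".join(c.lower().split()) for c in columns}
    n_toks = len(question_toks)
    for n in range(6, 0, -1):  # n-grams of length 6 down to 1
        for i in range(n_toks - n + 1):
            gram = " ".join(question_toks[i:i + n]).lower()
            if gram in col_grams:
                tag(i, i + n, "column")
            else:
                ent = entity_lookup(gram)
                if ent:
                    tag(i, i + n, ent)
    # Numbers: years, integers, floats (dates omitted for brevity).
    for i, tok in enumerate(question_toks):
        if types[i] != "none":
            continue
        if re.fullmatch(r"\d{4}", tok):
            types[i] = "year"
        elif re.fullmatch(r"\d+", tok):
            types[i] = "integer"
        elif re.fullmatch(r"\d+\.\d+", tok):
            types[i] = "float"
    return types
```

On the Figure 1 example, this produces COLUMN for "spoofed title", "artist", and "issue", PERSON for "mort drucker", and FLOAT for "88.5", with all other tokens typed NONE.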
In the case where the content of databases is available, we match the words in the question against both the table schema and the table content, labeling matched column names as COLUMN and matched entry values with the name of the column they belong to. For example, the type sequence for the question in Figure 1 would be [none, column, column, none, artist, artist, none, none, column, none, column, issue, none] in this case. The other parts of Figure 1 stay the same as in the content-insensitive approach.

Input Encoder
As shown in Figure 1, our input encoder consists of two bi-directional LSTMs, BI-LSTM_QT and BI-LSTM_COL. To encode the word-and-type pairs of the question, we concatenate the embeddings of the words and their corresponding types and feed them into BI-LSTM_QT. The output hidden states of BI-LSTM_QT and BI-LSTM_COL are H_QT and H_COL, respectively.
For encoding column names, SQLNet runs a bi-directional LSTM over each column name separately. In contrast, we first average the embeddings of the words in each column name, and then run a single BI-LSTM_COL over the sequence of column names. This encoding method improves the result by 1.5% and cuts the training time in half. Even though the order of column names does not matter, we attribute this improvement to the fact that the LSTM can capture their occurrences and relationships.
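Assuming PyTorch, the two encoders can be sketched as follows. Module names and dimensions follow the description above, but the code is illustrative rather than the released implementation:

```python
import torch
import torch.nn as nn

class QuestionTypeEncoder(nn.Module):
    """BI-LSTM_QT: concatenate word and type embeddings, then a BiLSTM."""
    def __init__(self, word_dim=300, type_dim=300, hidden=120):
        super().__init__()
        self.lstm = nn.LSTM(word_dim + type_dim, hidden // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, word_embs, type_embs):
        # word_embs, type_embs: [batch, n_words, dim]
        x = torch.cat([word_embs, type_embs], dim=-1)
        h_qt, _ = self.lstm(x)
        return h_qt  # H_QT: [batch, n_words, hidden]

class ColumnEncoder(nn.Module):
    """BI-LSTM_COL: average each column name's word embeddings, then run
    ONE BiLSTM across the column sequence (not one LSTM per column)."""
    def __init__(self, emb_dim=300, hidden=120):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden // 2,
                            bidirectional=True, batch_first=True)

    def forward(self, col_word_embs):
        # col_word_embs: list of [n_words_in_name, emb_dim] tensors
        avg = torch.stack([e.mean(dim=0) for e in col_word_embs])
        h_col, _ = self.lstm(avg.unsqueeze(0))
        return h_col.squeeze(0)  # H_COL: [n_cols, hidden]
```

Averaging first means the BiLSTM processes one vector per column instead of one per word, which is where the reported halving of training time comes from.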

Slot-Filling Model
Next, we predict values for the slots in the SQL sketch. SQLNet has a separate model for each slot in Figure 2, with no shared trainable parameters: five models for the five slots plus one model for $COND# (12 BI-LSTMs in total). However, since the prediction procedures of $SELECT_COL, $COND_COL, and $COND# are similar, we combine them into a single model. Additionally, $COND_COL depends on the output of $SELECT_COL, which reduces errors of predicting the same column in these two slots. Moreover, we group $OP and $COND_VAL together because both depend on the outputs of $COND_COL. Finally, we keep a separate model for $AGG because we notice that the $AGG model converges much faster and suffers from overfitting when combined with other models. TYPESQL thus consists of three models (Figure 1 right): MODEL_COL for $SELECT_COL, $COND#, and $COND_COL; one model for $AGG; and one model for $OP and $COND_VAL, where the parameters of BI-LSTM_QT and BI-LSTM_COL are shared within each model (6 BI-LSTMs in total). Since all three models compute the weighted question and type representation H_QT/COL in the same way, using the column attention mechanism proposed in SQLNet, we first introduce the following step shared by all three models:

α_QT/COL = softmax(H_COL W (H_QT)^T),  H_QT/COL = α_QT/COL H_QT

where softmax applies the softmax operator over each row of the input matrix, α_QT/COL is a matrix of attention scores, and H_QT/COL is the weighted question and type representation. In our equations, we use W and V to represent all trainable parameter matrices and vectors, respectively.
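Under our reading of the column attention step, a NumPy sketch looks like this (the shapes and helper names are ours):

```python
import numpy as np

def softmax_rows(m):
    """Numerically stable row-wise softmax."""
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def column_attention(h_qt, h_col, w):
    """Weighted question/type representation H_QT/COL.

    h_qt:  [n_words, d]  question/type hidden states (H_QT)
    h_col: [n_cols, d]   column hidden states (H_COL)
    w:     [d, d]        trainable matrix (W)
    """
    alpha = softmax_rows(h_col @ w @ h_qt.T)  # [n_cols, n_words]
    return alpha @ h_qt                        # [n_cols, d]
```

Each row of the output is a convex combination of question states, attended from the perspective of one column, which is what lets the slot predictors score every column against the question.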
MODEL_COL-$SELECT_COL: H_QT/COL is used to predict the column name in the $SELECT_COL slot.

MODEL_COL-$COND#: Unlike SQLNet, we compute the number of conditions in the WHERE clause in a simpler way, setting the maximum number of conditions to 4.

Table 2: Breakdown results on WikiSQL. Acc_agg, Acc_sel, and Acc_where are the accuracies of canonical representation matches on the AGGREGATOR, SELECT COLUMN, and WHERE clauses between the synthesized SQL and the ground truth, respectively.
MODEL_COL-$COND_COL: We find that SQLNet often selects the same column name in $COND_COL as in $SELECT_COL, which is incorrect in most cases. To avoid this problem, we condition the prediction on the column chosen in $SELECT_COL, passing in the weighted sum of question and type hidden states H_QT/SCOL (expanded to the same shape as H_QT/COL).

For $COND_VAL, we use a pointer network (Vinyals et al., 2015) to compute the distribution of the next token in the decoder. In particular, the probability of selecting the i-th token w_i in the natural language question as the next token of the substring is computed from the hidden state h of the previously generated token. The generation process continues until the END token is the most probable next token of the substring.
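A rough NumPy sketch of pointer-style greedy decoding for $COND_VAL follows. The additive scoring form and every name here are our assumptions, since the exact equation is not reproduced above:

```python
import numpy as np

def pointer_step(h_question, h_dec, w_q, w_h, v):
    """Distribution over question token positions given decoder state h_dec."""
    scores = np.tanh(h_question @ w_q + h_dec @ w_h) @ v  # [n_tokens]
    e = np.exp(scores - scores.max())
    return e / e.sum()

def greedy_copy(h_question, h_init, w_q, w_h, v, end_idx, max_len=5):
    """Copy question tokens until the END position is most probable."""
    out, h = [], h_init
    for _ in range(max_len):
        i = int(np.argmax(pointer_step(h_question, h, w_q, w_h, v)))
        if i == end_idx:  # END token wins: stop generating
            break
        out.append(i)
        h = h_question[i]  # illustrative state update: reuse token encoding
    return out
```

Because every output token is a position in the question, the decoder can copy rare entities and numbers verbatim, which is exactly where pre-trained embeddings are weakest.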

Experiments
Dataset: We use the WikiSQL dataset (Zhong et al., 2017).

Implementation Details: We implement our model based on SQLNet (Xu et al., 2017) in PyTorch (Paszke et al., 2017). We concatenate pre-trained GloVe (Pennington et al., 2014) and paraphrase (Wieting and Gimpel, 2017) embeddings. The dimensions and dropout rates of all hidden layers are set to 120 and 0.3, respectively. We use Adam (Kingma and Ba, 2015) with the default hyperparameters for optimization. The batch size is set to 64. We use the same loss functions as in (Xu et al., 2017). Our code is available at https://github.com/taoyds/typesql.

Table 1 shows the main results on the WikiSQL task. We compare our work with previous results using the three evaluation metrics from (Xu et al., 2017). Table 2 provides the breakdown results on the AGGREGATION, SELECTION, and WHERE clauses.
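For reference, the reported training setup can be collected in a short PyTorch snippet. The model below is only a stand-in (the real system is the full TYPESQL model with SQLNet's losses), and the 600-dimensional input is our assumption for the concatenated GloVe and paraphrase embeddings:

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 120   # hidden size of all layers
DROPOUT = 0.3      # dropout rate of all hidden layers
BATCH_SIZE = 64

# Stand-in module; 600 = assumed size of concatenated word embeddings.
model = nn.LSTM(600, HIDDEN_DIM // 2, num_layers=2, dropout=DROPOUT,
                bidirectional=True, batch_first=True)
# Adam with default hyperparameters, as reported.
optimizer = torch.optim.Adam(model.parameters())
```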

Results and Discussion
Without looking at the content of databases, our model outperforms the previous best work by 5.5% in execution accuracy. According to Table 2, TYPESQL improves the accuracy of the SELECT clause by 1.3% and the WHERE clause by 5.9%. By encoding column names and grouping model components in a simpler but more reasonable way, TYPESQL achieves a much higher result on the most challenging sub-task, the WHERE clause. Also, the further improvement from integrating word types shows that TYPESQL encodes rare entities and numbers in a better way.
Furthermore, if complete access to the database is allowed, TYPESQL achieves 82.6% execution accuracy and improves on the previous content-aware system by 17.5%. Although Zhong et al. (2017) enforced some limitations when creating the WikiSQL dataset, many questions still contain no column name or entity indicator, which makes generating the right SQL impossible in such cases without searching the database content. This is not a critical problem for WikiSQL, but it is for most real-world tasks.

Conclusion and Future Work
We propose TYPESQL, a text-to-SQL model that views the problem as a slot filling task and uses type information to better understand rare entities and numbers in the input. TYPESQL can also use the database content to better understand user queries that are not well-formed. TYPESQL significantly improves upon the previous state of the art on the WikiSQL dataset.
Although, unlike most previous work, the WikiSQL task requires models to generalize to new databases, the dataset does not cover some important SQL operators such as JOIN and GROUP BY. This limits the generalization of the task to other SQL components. In the future, we plan to extend this work to more complex datasets under the database-split setting. In this way, we can study the performance of a generalized model on a more realistic text-to-SQL task that includes complex SQL queries and many different databases.