PG-GSQL: Pointer-Generator Network with Guide Decoding for Cross-Domain Context-Dependent Text-to-SQL Generation

Text-to-SQL is the task of translating natural language utterances into SQL queries, and most existing neural approaches focus on the cross-domain context-independent generation task. We pay close attention to the cross-domain context-dependent text-to-SQL generation task, which requires a model to rely on both the interaction history and the current utterance to generate a SQL query. In this paper, we present an encoder-decoder model called PG-GSQL, built on an interaction-level encoder with two effective innovations in the decoder, to solve the cross-domain context-dependent text-to-SQL task. 1) To effectively capture historical information of the SQL query and reuse tokens of the previous SQL query, we use a hybrid pointer-generator network as the decoder: the pointer copies tokens from the previous SQL query, while the generator produces new tokens. 2) We propose a guide component that limits the prediction space of the vocabulary during the decoding phase to avoid table-column dependency and foreign key dependency errors. In addition, we design a column-table linking mechanism to improve the prediction accuracy of tables. On the challenging cross-domain context-dependent text-to-SQL benchmark SParC, PG-GSQL achieves 34.0% question matching accuracy and 19.0% interaction matching accuracy on the dev set. With BERT augmentation, PG-GSQL obtains 53.1% question matching accuracy and 34.7% interaction matching accuracy on the dev set, outperforming the previous state-of-the-art model by 5.9% question matching accuracy and 5.2% interaction matching accuracy. Our code is publicly available.


Introduction
Text-to-SQL is a sub-task of semantic parsing which aims to convert utterances into SQL queries. A large number of deep learning approaches (Xu et al., 2018; Hwang et al., 2019; Bogin et al., 2019; Guo et al., 2019; Wang et al., 2019) have been proposed to solve context-independent text-to-SQL tasks such as WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018c). Among them, SQLova (Hwang et al., 2019) achieves 84.2% and 83.6% logical form accuracy on the WikiSQL dev and test sets, respectively, and RAT-SQL (Wang et al., 2019) achieves 69.7% and 65.6% exact matching accuracy on the Spider dev and test sets when accessing database content.
However, in real-world interaction scenarios, users often interact with the system through multi-turn dialogue. The system then needs to generate a complete SQL query based on the historical interaction information and the current user utterance, as shown in Figure 1. Suhr et al. (2018) propose a seq2seq model with an interaction-level encoder for the domain-specific context-dependent task ATIS (Hemphill et al., 1990), which covers only the flight-booking domain. Unlike ATIS, SParC (Yu et al., 2019b) is a new complex cross-domain, context-dependent text-to-SQL dataset built on Spider. Most previous approaches in the field of text-to-SQL only focus on translating stand-alone utterances into SQL queries, and are not suitable for SParC, which contains contextual dependencies.
Database: tvshow (tables: tv_channel, tv_series, cartoon)
Utterance 1: Tell me the package option for the series named "Rock TV".
SQL query 1: SELECT package_option FROM tv_channel WHERE series_name = "Rock TV"
Utterance 2: Tell me the language of this series.
SQL query 2: SELECT language FROM tv_channel WHERE series_name = "Rock TV"
Utterance 3: List the language used least number of TV Channel. List language and number of TV Channel.
SQL query 3: SELECT language , COUNT(*) FROM tv_channel GROUP BY language ORDER BY COUNT(*) ASC LIMIT 1
Figure 1: An example of an interaction from the SParC dataset (Yu et al., 2019b). The generation of each SQL query depends on the historical information and the current utterance, except for the first SQL query.
CD-Seq2Seq (Yu et al., 2019b) is a baseline model of SParC, and it does not consider the table-column dependency, which causes errors in predicting columns when tables are mentioned in the question but the columns are ambiguous. Recently, Zhang et al. (2019) propose EditSQL, which contains an editing mechanism that treats the previous SQL query as a sequence and reuses it, achieving the new state-of-the-art performance on SParC. However, EditSQL generates table and column together as a single token (e.g., tv_channel.series_name) during the decoding phase, which raises the noise of table prediction. Moreover, we observe that the foreign key dependency is significant in cross-domain text-to-SQL datasets (Yu et al., 2018c; Yu et al., 2019a; Yu et al., 2019b), in which more than 95% of SQL queries depend on foreign keys to link tables in both the train and dev sets. Last but not least, to our knowledge, there is no research on utilizing tables to guide the prediction of SQL queries in the decoding phase, and it is worth studying whether such a method improves performance.
In this work, we propose a novel model called PG-GSQL to address the cross-domain context-dependent text-to-SQL task. Our encoder is an interaction-level encoder, and we use a pointer-generator network (PG) with a guide (G) component as the decoder. The interaction-level encoder and the pointer-generator network capture historical information in the encoding and decoding phases, respectively. At each decoding step, the pointer part copies a token from the previous SQL query and the generator part generates a new token from the vocabulary. The guide component is designed to avoid dependency errors, namely table-column dependency and foreign key dependency errors, during the token generation phase. In addition, we observe that linking a table and its corresponding columns improves the prediction accuracy of tables; hence we design a novel column-table linking mechanism that concatenates a table and its corresponding columns to represent the table. Experimental results show that PG-GSQL achieves 34.0% question matching accuracy and 19.0% interaction matching accuracy on the SParC dev set. When using BERT augmentation, PG-GSQL obtains 53.1% question matching accuracy and 34.7% interaction matching accuracy on the SParC dev set, a gain of 5.9% question matching accuracy and 5.2% interaction matching accuracy over the previous state-of-the-art model.
Related Work
The work most relevant to our research is the cross-domain context-dependent text-to-SQL task. SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) are the latest datasets for cross-domain context-dependent text-to-SQL; CoSQL is more complex than SParC because it includes system responses. There is not enough research on the cross-domain context-dependent text-to-SQL task, which requires a model to rely on the historical information and the current utterance to generate a SQL query over an unseen database. Suhr et al. (2018) propose a seq2seq model with an interaction-level encoder to solve ATIS (Hemphill et al., 1990), and this model is extended to SParC as CD-Seq2Seq (Yu et al., 2019b). CD-Seq2Seq uses position embeddings to obtain the position information of each utterance, and proposes a copy mechanism to copy segments from the previous SQL query. In addition, its interaction-level encoder uses a discourse state that is maintained and updated over the entire interaction. Yu et al. (2019b) propose SyntaxSQL-con based on SyntaxSQL (Yu et al., 2018b); the difference between them is that SyntaxSQL-con uses two bi-directional LSTMs (Hochreiter and Schmidhuber, 1997) with different parameters to encode the previous utterance and the current utterance. Both of them utilize the SQL syntax and generation history to generate SQL queries. Recently, Zhang et al. (2019) propose EditSQL, which employs the interaction-level encoder together with an utterance-table encoder as the encoder, and a table-aware decoder with an editing mechanism as the decoder. The editing mechanism is similar to our hybrid pointer-generator network in that it copies tokens from the previous SQL query or inserts new tokens.
Our hybrid pointer-generator network is inspired by See et al. (2017), who focus on abstractive text summarization. The difference is that our model copies tokens from the previous SQL query rather than from the source sentence.

PG-GSQL
In this section, we present the model PG-GSQL to tackle the cross-domain context-dependent text-to-SQL task. Similar to CD-Seq2Seq (Yu et al., 2019b), we employ an interaction-level encoder as the PG-GSQL encoder. The decoder consists of two effective innovations. (1) We use a pointer-generator network as the decoder to copy tokens from the previous SQL query or generate new tokens from the vocabulary.
(2) We use a guide component to limit the prediction space of the vocabulary during the decoding phase, which avoids dependency errors and improves performance. The architecture of the decoder is illustrated in Figure 2. In addition, we use a column-table linking mechanism that concatenates each table with its corresponding columns to augment the representation of tables.

Schema Embedding
Let $s = \{t_1, \cdots, t_n\}$ denote the set of tables in the schema, where $n$ is the size of the table set, and let $t_i = \{c_{i1}, \cdots, c_{im}\}$ denote the set of columns belonging to table $t_i$, where $m$ is the size of this set. We use two different patterns to encode tables and columns, respectively; Figure 3 illustrates an example of our method. For column embeddings, we use a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) to encode the concatenation of the type token (column) and the column name for each column (e.g., column series name), and use the final hidden state as the column embedding. For table embeddings, we concatenate the column names in table $t_i$ and the table name, each with its type token, to represent $t_i$ (e.g., column id . column series name . table tv channel), which augments the relation between columns and their table. We then apply the same bi-directional LSTM to obtain the embedding of table $t_i$. The schema embedding $h^C$ consists of the table embeddings and column embeddings.
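To make the column-table linking concrete, here is a minimal PyTorch sketch; SchemaEncoder and the helper functions are illustrative names of our own, and tokenization is simplified relative to the paper.

```python
# Minimal sketch of the schema embedding with column-table linking.
# The embedding table, vocabulary handling, and tokenizer are simplified;
# this is an illustration of the pattern, not the paper's implementation.
import torch
import torch.nn as nn

class SchemaEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One bi-directional LSTM shared by columns and tables.
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def encode_sequence(self, token_ids):
        # token_ids: (1, seq_len); return the final forward/backward states concatenated.
        emb = self.embed(token_ids)
        _, (h_n, _) = self.lstm(emb)               # h_n: (2, 1, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (1, 2 * hidden_dim)

def column_tokens(col_name):
    # Prepend the type token, e.g. "series name" -> ["column", "series", "name"].
    return ["column"] + col_name.split()

def table_tokens(table_name, col_names):
    # Column-table linking: every typed column name first, then the typed table
    # name, e.g. "column id . column series name . table tv channel".
    tokens = []
    for col in col_names:
        tokens += column_tokens(col) + ["."]
    return tokens + ["table"] + table_name.split()
```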

Interaction-level Encoder
To obtain the relevance between utterance and schema, we use a simple string-matching algorithm to discern the tables and columns that are mentioned in an utterance. We prioritize longer matches and assign the type None if there is no match; if a column and a table share the same name in an utterance, we prioritize the table. Let $x_i = [(x_{i,1}, \tau_{i,1}), \cdots, (x_{i,L}, \tau_{i,L})]$ denote the matching result of the utterance in turn $i$, where $L$ is the length of $x_i$ and $\tau_{i,j}$ is the matching type of $x_{i,j}$. First, we modify the original sequence to $[\tau_{i,1}, x_{i,1}, \cdots, \tau_{i,L}, x_{i,L}]$ and denote by $e^x_{i,j}$ the $j$-th token embedding of the new sequence in turn $i$. Then we use a bi-directional LSTM to encode the new sequence; the forward LSTM is defined as
$$\overrightarrow{h}^E_{i,j} = \text{LSTM}([e^x_{i,j}; h^I_{i-1}], \overrightarrow{h}^E_{i,j-1}),$$
where $\overrightarrow{h}^E_{i,j}$ is the $j$-th forward hidden state in turn $i$ and $h^I_{i-1}$ is the discourse state of turn $i-1$, which the interaction-level encoder maintains and updates over the entire interaction. The backward hidden state $\overleftarrow{h}^E_{i,j}$ is computed analogously, and the hidden state of $e^x_{i,j}$ is $h^E_{i,j} = [\overrightarrow{h}^E_{i,j}; \overleftarrow{h}^E_{i,j}]$.
Figure 4: The interaction-level encoder architecture.
The architecture of the interaction-level encoder is illustrated in Figure 4.
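A minimal PyTorch sketch of this encoder, under our reconstruction of the equations above, follows; the dimensions, initialization, and class names are assumptions rather than the paper's implementation.

```python
# Sketch of the interaction-level encoder with a discourse state that is
# carried across turns (CD-Seq2Seq-style). Shapes assume batch size 1.
import torch
import torch.nn as nn

class InteractionEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden_dim=150, discourse_dim=300):
        super().__init__()
        # Utterance-level BiLSTM: input is [token embedding; previous discourse state].
        self.utt_lstm = nn.LSTM(emb_dim + discourse_dim, hidden_dim,
                                bidirectional=True, batch_first=True)
        # Turn-level LSTM cell that maintains the discourse state h^I across turns.
        self.turn_cell = nn.LSTMCell(2 * hidden_dim, discourse_dim)

    def forward(self, utterances, state=None):
        # utterances: list of (1, L_i, emb_dim) tensors, one per turn.
        if state is None:
            h = torch.zeros(1, self.turn_cell.hidden_size)
            c = torch.zeros(1, self.turn_cell.hidden_size)
        else:
            h, c = state
        all_hidden = []
        for utt in utterances:
            L = utt.size(1)
            disc = h.unsqueeze(1).expand(-1, L, -1)  # broadcast h^I_{i-1} over tokens
            hidden, _ = self.utt_lstm(torch.cat([utt, disc], dim=-1))  # (1, L, 2*hidden_dim)
            all_hidden.append(hidden)
            # Update the discourse state with the final utterance hidden state.
            h, c = self.turn_cell(hidden[:, -1, :], (h, c))
        return all_hidden, (h, c)
```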

Base Decoder
Our base decoder is an LSTM decoder with attention (Bahdanau et al., 2015; Luong et al., 2015) that generates SQL queries. The decoder hidden state at step $k$ is computed as
$$h^D_k = \text{LSTM}([e^y_{k-1}; c_{k-1}], h^D_{k-1}),$$
where $h^D_k$ is the decoder hidden state at step $k$, $e^y_{k-1}$ is the embedding of the $(k-1)$-th output token, and $c_{k-1}$ is the context vector at step $k-1$. The context vector $c_k$ is the concatenation of the interaction attention and schema attention vectors:
$$c_k = [c^{token}_k; c^{schema}_k],$$
where $c^{token}_k$ is the interaction attention vector and $c^{schema}_k$ is the schema attention vector. We use the hidden states of the current utterance and the $h$ previous utterances as the interaction encoder hidden states, and we add relative position embeddings $\phi^I$ to each utterance hidden state in the attention computation. The attention between the decoder hidden state and the interaction encoder hidden states is computed as
$$s^{token}_k(j) = {v^{di}}^\top \tanh(W^{di}_1 h^D_k + W^{di}_2 (h^E_j + \phi^I_j)), \quad \alpha^{token}_k = \text{softmax}(s^{token}_k), \quad c^{token}_k = \sum_{j=1}^{|x_t|} \alpha^{token}_k(j)\, h^E_j,$$
where $\alpha^{token}_k$ is the attention distribution, $|x_t|$ is the length of the interaction encoder hidden state sequence, and $v^{di}$, $W^{di}_1$ and $W^{di}_2$ are learnable parameters. The attention between the decoder hidden state and the schema embedding is calculated analogously:
$$s^{schema}_k(j) = {v^{dc}}^\top \tanh(W^{dc}_1 h^D_k + W^{dc}_2 h^C_j), \quad \alpha^{schema}_k = \text{softmax}(s^{schema}_k), \quad c^{schema}_k = \sum_{j=1}^{|schema|} \alpha^{schema}_k(j)\, h^C_j,$$
where $|schema|$ denotes the length of $h^C$, and $v^{dc}$, $W^{dc}_1$ and $W^{dc}_2$ are learnable parameters. The probabilities of generating SQL keywords and schema entries are computed as
$$o_k = \tanh(W_o [h^D_k; c_k]), \quad S^{SQL} = W^{SQL} o_k + b^{SQL}, \quad S^{schema} = h^C W^{schema} o_k, \quad P(y_k) = \text{softmax}([S^{SQL}; S^{schema}]),$$
where $S^{SQL}$ and $S^{schema}$ are the SQL keyword scores and schema entry scores, respectively, and $W_o$, $W^{SQL}$, $b^{SQL}$ and $W^{schema}$ are learnable parameters.
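For concreteness, below is a minimal PyTorch sketch of the additive attention used for both the interaction and schema attention ($v^\top \tanh(W_1 h^D_k + W_2 e_j)$); the class name and shapes are our own assumptions.

```python
# Additive (Bahdanau-style) attention between one decoder state and a set of
# encoder states; returns the attention distribution and the context vector.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.W1 = nn.Linear(dec_dim, att_dim, bias=False)
        self.W2 = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, dec_hidden, enc_hidden):
        # dec_hidden: (1, dec_dim); enc_hidden: (n, enc_dim)
        scores = self.v(torch.tanh(self.W1(dec_hidden) + self.W2(enc_hidden)))  # (n, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=0)  # attention distribution
        context = alpha.unsqueeze(0) @ enc_hidden         # (1, enc_dim)
        return alpha, context

# The decoder context vector is then the concatenation of both attention
# contexts: c_k = [c_k^token; c_k^schema].
```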

Pointer-generator Network
We empirically verify that the current SQL query is often similar to the previous SQL query, and the difference between them is driven by the current utterance. Hence, how to connect the previous SQL query and the current utterance when generating the corresponding SQL is important in the context-dependent scenario. We use a hybrid pointer-generator network to either copy a token from the previous SQL query or generate a new token from the vocabulary at each decoding step. First, we use another bi-directional LSTM to encode the previous SQL query and define $h^Q_p$ as the $p$-th hidden state of the previously predicted SQL query. Then, we calculate the attention between the current decoder hidden state and the previous SQL query as
$$s^{query}_k(p) = {v^{dp}}^\top \tanh(W^{dp}_1 h^D_k + W^{dp}_2 h^Q_p), \quad \alpha^{query}_k = \text{softmax}(s^{query}_k), \quad c^{query}_k = \sum_{p=1}^{|q|} \alpha^{query}_k(p)\, h^Q_p,$$
where $|q|$ is the length of the previous SQL query, and $v^{dp}$, $W^{dp}_1$ and $W^{dp}_2$ are learnable parameters. We then modify the context vector as
$$c_k = [c^{token}_k; c^{schema}_k; c^{query}_k].$$
In addition, we generate a switch $p^{copy}$ from the context vector and the decoder hidden state at step $k$:
$$p^{copy} = \sigma(W^1_{copy} c_k + W^2_{copy} h^D_k),$$
where $W^1_{copy}$ and $W^2_{copy}$ are learnable parameters and $\sigma$ is the sigmoid function. $p^{copy}$ is used to choose whether to copy a token from the previous SQL query via the attention distribution $\alpha^{query}_k$ or to generate a new token. We define $q_i$ as the $i$-th token of the previous SQL query and, following See et al. (2017), modify the output probability distribution to
$$P(y_k = w) = (1 - p^{copy})\, P_{gen}(y_k = w) + p^{copy} \sum_{i: q_i = w} \alpha^{query}_k(i),$$
where $P_{gen}$ is the generation distribution of the base decoder.
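The following sketch shows the copy/generate mixture in PyTorch, following the See et al. (2017) formulation; output_distribution is our own helper name and the tensor shapes are assumptions.

```python
# Mix the generation distribution with the copy distribution over the shared
# output vocabulary, as in a pointer-generator network.
import torch

def output_distribution(p_copy, gen_scores, alpha_query, prev_query_ids):
    """p_copy:         scalar tensor, the switch sigma(W1_copy c_k + W2_copy h^D_k)
    gen_scores:     (vocab_size,) scores over SQL keywords and schema entries
    alpha_query:    (q_len,) attention over the previous SQL query tokens
    prev_query_ids: (q_len,) long tensor, output-vocabulary id of each token"""
    p_vocab = torch.softmax(gen_scores, dim=0)
    out = (1.0 - p_copy) * p_vocab
    # Tokens appearing several times in the previous query accumulate their
    # attention mass onto the same vocabulary entry.
    out = out.index_add(0, prev_query_ids, p_copy * alpha_query)
    return out

# Example: 6-symbol vocabulary, previous query tokens mapped to ids [0, 4].
dist = output_distribution(torch.tensor(0.7), torch.randn(6),
                           torch.tensor([0.9, 0.1]), torch.tensor([0, 4]))
```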

Guide Component
The guide component in our model is utilized to avoid table-column dependency and foreign key dependency errors. To solve mispredictions of table-column dependencies, we design an intermediate state of the SQL query, as shown in Figure 5, to predict tables first. Once the tables are predicted, we prune the columns that do not belong to the predicted tables to avoid table-column dependency errors. To avoid foreign key dependency errors, we use a simple filter method to prune the tables that are not connected to the predicted tables via foreign key constraints. We apply the guide component in the decoder and use a heuristic algorithm to obtain the prediction type at each decoding step. Once the prediction type is obtained, we modify $h^C$ by masking out the invalid schema entries, so that only valid tables and columns remain in the prediction space.
Original: SELECT language, COUNT(*) FROM tv_channel GROUP BY language ORDER BY COUNT(*) ASC LIMIT 1
Intermediate state: FROM tv_channel SELECT language, COUNT(*) GROUP BY language ORDER BY COUNT(*) ASC LIMIT 1
Figure 5: An example of the intermediate state of a SQL query; the FROM clause is moved to the first position in the SQL query.
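A minimal sketch of how the guide component can be realized as masking over the schema scores is given below; guide_mask and fk_reachable are illustrative helpers of our own, since the paper does not spell out the exact operation on $h^C$.

```python
# One plausible realization of the guide component: invalid schema entries get
# -inf scores before the softmax, so they can never be predicted.
import torch

def guide_mask(schema_scores, valid_ids):
    """schema_scores: (n_entries,) decoder scores over tables and columns
    valid_ids:     indices of entries the current prediction type allows,
                   e.g. the columns of the already-predicted tables"""
    mask = torch.full_like(schema_scores, float("-inf"))
    mask[valid_ids] = 0.0
    return schema_scores + mask

def fk_reachable(predicted_tables, foreign_keys, all_tables):
    # Foreign key filter: keep only tables connected to an already-predicted
    # table by a foreign key constraint (plus the predicted tables themselves).
    if not predicted_tables:
        return set(all_tables)
    reachable = set(predicted_tables)
    for a, b in foreign_keys:  # (table, table) pairs linked by a foreign key
        if a in predicted_tables:
            reachable.add(b)
        if b in predicted_tables:
            reachable.add(a)
    return reachable
```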

BERT Enhanced Embedding
We employ BERT (Devlin et al., 2019) to augment the embeddings of utterances, schemas and previous SQL queries in our model. Our use of BERT differs from previous models (Hwang et al., 2019; Zhang et al., 2019): we feed only the utterance tokens and the table representations obtained from the column-table linking in the Schema Embedding section into the pretrained small cased BERT model. The sequence fed into the pretrained BERT model is
$$[\text{CLS}],\ X_i,\ [\text{SEP}],\ c_{11}, \cdots, c_{1m}, t_1,\ [\text{SEP}],\ \cdots,\ c_{n1}, \cdots, c_{nm}, t_n,\ [\text{SEP}],$$
where [CLS] and [SEP] are special split tokens, $X_i$ is the utterance token sequence, and $c_{ij}$ is the $j$-th column of table $t_i$. We use the hidden states of the last BERT layer as the token embeddings.
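The following minimal sketch assembles such an input sequence under our reading of the format above; the exact separator layout is an assumption, and bert_input is our own helper name.

```python
# Build the BERT input token sequence: utterance first, then each table's
# column-table-linked representation, each segment closed by [SEP].
def bert_input(utterance_tokens, tables):
    """tables: list of (table_name, [column_names]) pairs."""
    seq = ["[CLS]"] + utterance_tokens + ["[SEP]"]
    for table_name, columns in tables:
        # Column-table linking: columns first, then the table name
        # (cf. the Schema Embedding section).
        for col in columns:
            seq += col.split() + ["."]
        seq += table_name.split() + ["[SEP]"]
    return seq

# Example: bert_input(["tell", "me", "the", "language"],
#                     [("tv channel", ["id", "series name"])])
```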

Experiment
In this section, we evaluate the effectiveness of our model on two large, complex cross-domain context-dependent text-to-SQL datasets: SParC contains 3034, 421 and 842 interactions, and CoSQL contains 2164, 292 and 551 interactions, for training, development and testing, respectively. As the test sets of SParC and CoSQL are unreleased, we only evaluate our model on their dev sets. We evaluate our model using question matching accuracy (the exact set matching score over all questions) and interaction matching accuracy (the exact set matching score over all interactions). For a fair comparison, we do not use any extra data to boost our model. In addition, we conduct an ablation study on the SParC dev set to analyze the contribution of each innovation.
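To make the two metrics concrete, here is a minimal sketch of how they can be computed; exact_set_match stands in for the official exact set matching scorer and is an assumption, not the official evaluation code.

```python
# Question matching averages exact-set matches over all questions; interaction
# matching credits an interaction only if every question in it matches.
def question_matching(interactions, exact_set_match):
    # interactions: list of interactions, each a list of (predicted, gold) SQL pairs.
    flags = [exact_set_match(pred, gold)
             for interaction in interactions
             for pred, gold in interaction]
    return sum(flags) / len(flags)

def interaction_matching(interactions, exact_set_match):
    flags = [all(exact_set_match(pred, gold) for pred, gold in interaction)
             for interaction in interactions]
    return sum(flags) / len(flags)
```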

Experimental Setup
Our implementation is based on PyTorch (Paszke et al., 2019) and we use Adam (Kingma and Ba, 2015) for optimization. The hyperparameter h is set to 5 in our model, and we use 50-dimensional position embeddings that are initialized from a random uniform distribution U[-0.1, 0.1] and kept fixed during training. In addition, we use the pretrained 300-dimensional GloVe word embeddings (Pennington et al., 2014) for the utterance embeddings, schema embeddings and keyword embeddings; the keyword embeddings are also fixed. The initial learning rate is 0.001 in PG-GSQL and is multiplied by 0.8 whenever the validation loss exceeds that of the previous epoch. When using BERT instead of GloVe, we set the learning rate of BERT to 1e-5. We use the official evaluation script to calculate question matching accuracy and interaction matching accuracy.
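As a rough illustration of this schedule, PyTorch's built-in ReduceLROnPlateau scheduler implements a close analogue (it compares the loss against the best value seen so far rather than the immediately previous epoch); model, train_one_epoch, evaluate and num_epochs below are placeholders, not code from the paper.

```python
# Sketch of the learning-rate decay described above, using PyTorch's built-in
# scheduler: multiply the rate by 0.8 whenever validation loss fails to improve.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # `model` assumed defined
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.8, patience=0)

for epoch in range(num_epochs):                             # `num_epochs` is a placeholder
    train_one_epoch(model, optimizer)                       # placeholder training loop
    val_loss = evaluate(model)                              # placeholder validation loop
    scheduler.step(val_loss)                                # decay lr when the loss stalls
```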

Model                              Question Matching   Interaction Matching
SyntaxSQL-con (Yu et al., 2019b)   18.5                4.3
CD-Seq2Seq (Yu et al., 2019b)      17.1                6.7
EditSQL (Zhang et al., 2019)       33.0                16.4
EditSQL(BERT)                      47.2                29.5
PG-GSQL                            34.0                19.0
PG-GSQL(BERT)                      53.1                34.7
Table 2: Question matching and interaction matching accuracy on the SParC dev set.

Table 3: Accuracy of question matching on SParC dev set in different hardness levels.

Table 1 shows the comparison of PG-GSQL with other models on the CoSQL dev set; the performance of EditSQL on CoSQL is obtained from the official page. As illustrated, our model achieves 41.2% question matching accuracy and 16.4% interaction matching accuracy when using BERT augmentation, outperforming EditSQL by 1.3% question matching accuracy and 4.1% interaction matching accuracy. Table 2 shows the question matching accuracy and interaction matching accuracy of PG-GSQL and previous models on the SParC dataset. PG-GSQL, which obtains 34.0% question matching accuracy and 19.0% interaction matching accuracy, outperforms EditSQL. When using BERT augmentation, PG-GSQL achieves 53.1% question matching accuracy and 34.7% interaction matching accuracy on the dev set. Compared with the previous state-of-the-art model EditSQL, our model improves question matching accuracy and interaction matching accuracy by 5.9% and 5.2%, respectively. Furthermore, we evaluate the performance of PG-GSQL at different hardness levels on the SParC dev set according to the official classification; there are 481, 441, 145 and 133 questions at the easy, medium, hard and extra hard levels, respectively. As shown in Table 3, compared to previous baseline models that report results at all four hardness levels, our model outperforms them by a large margin at every level. We further study the performance of PG-GSQL in different turns, again by the official classification; there are 421, 421, 269 and 89 questions in turns 1 to 4. As shown in Table 4, our model outperforms the baseline models in all turns on the dev set. In addition, we observe that utterances in later turns have greater dependencies on previous turns and greater risk of error propagation.

Ablation Study
We ablate the major innovations of PG-GSQL and PG-GSQL(BERT) to analyze their contributions on the SParC dev set. Specifically, we first replace the hybrid pointer-generator network with the base decoder; Table 2 shows that the performance drops by 5.9% in question matching accuracy and 9.0% in interaction matching accuracy. When using BERT embeddings, interaction matching accuracy drops by a large margin of 19.5%. This shows that the hybrid pointer-generator network effectively reuses the previous SQL query tokens. Then, we disable the guide component, which causes the performance to go down by 3.8% in question matching accuracy and 1.9% in interaction matching accuracy. When incorporating BERT, Table 2 shows that removing the guide component drops question matching accuracy significantly, to 35.8%. This demonstrates that the guide component effectively avoids dependency errors during the decoding phase. In addition, to study the performance of PG-GSQL and PG-GSQL(BERT) in detail, we evaluate the effects of the different innovations at different hardness levels and in different turns, as shown in Figure 6. Figure 6(a) shows that the pointer-generator network brings a significant improvement in predicting complex SQL queries, which occur in later turns as shown in Figure 6(b). Furthermore, Figure 6(b) illustrates that the guide component improves the question matching accuracy in turn 1, and the hybrid pointer-generator network depends on turn 1 to promote the interaction matching accuracy.

Conclusion
In this paper, we present PG-GSQL, with a hybrid pointer-generator network and a guide component, to address the cross-domain context-dependent text-to-SQL task. Experimental results show that the hybrid pointer-generator network is capable of reusing the previous SQL query tokens to significantly improve interaction matching accuracy. Furthermore, the guide component avoids table-column dependency and foreign key dependency errors during the decoding phase, and the column-table linking improves the prediction accuracy of tables. The ablation study shows that our model improves the performance not only in predicting simple queries, but also in predicting nested, complex queries over unseen databases.