IGSQL: Database Schema Interaction Graph Based Neural Model for Context-Dependent Text-to-SQL Generation

The context-dependent text-to-SQL task has drawn much attention in recent years. Previous models for this task concentrate only on utilizing historical user inputs. In this work, in addition to using encoders to capture historical information of user inputs, we propose a database schema interaction graph encoder to utilize historical information of database schema items. In the decoding phase, we introduce a gate mechanism to weigh the importance of different vocabularies and then predict SQL tokens. We evaluate our model on the benchmark SParC and CoSQL datasets, which are two large, complex, context-dependent, cross-domain text-to-SQL datasets. Our model outperforms the previous state-of-the-art model by a large margin and achieves new state-of-the-art results on the two datasets. The comparison and ablation results demonstrate the efficacy of our model and the usefulness of the database schema interaction graph encoder.


Introduction
The text-to-SQL task aims to translate natural language texts into SQL queries. Users who do not know SQL grammar can benefit from this task and acquire information from databases by simply inputting natural language texts. Previous works (Li and Jagadish, 2014; Xu et al., 2017; Yu et al., 2018a; Bogin et al., 2019b; Huo et al., 2019) focus on context-independent text-to-SQL generation. However, in practice, users usually interact with systems for several turns to acquire information, which extends the task to context-dependent text-to-SQL in a conversational scenario. Throughout the interaction, user inputs may omit information that appeared before. This phenomenon makes the context-dependent text-to-SQL task difficult.
Besides, two cross-domain context-dependent datasets, SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a), have been released. Cross-domain means that the databases in the test set differ from those in the training set, which is more challenging.
EditSQL (Zhang et al., 2019) is the previous state-of-the-art model on the SParC and CoSQL datasets, and it focuses on taking advantage of previous utterance texts and the previously predicted query to predict the query for the current turn. Table 1 shows the user inputs, ground-truth queries and predicted queries of EditSQL for an interaction. In the second turn, EditSQL views "Kacey" as the name of a dog owner. However, since the context of the interaction is about dogs, "Kacey" should be the name of a dog. This example shows that a model using only historical information of user inputs may fail to keep context consistency and maintain thematic relations.
According to Yu et al. (2019b) and Yu et al. (2019a), to maintain thematic relations, users may change constraints or ask for different attributes of the same topic in their next questions. Thus, database schema items (i.e., table.column) in the current turn should be related to items in the previous turn. For example, in Table 1, the second question x_2 adds a constraint on the name and asks for the age of a dog instead of the number of all dogs. The corresponding database schema items Dogs.age and Dogs.name in y_2 belong to the same table as Dogs.* in the previous query y_1. Therefore, we propose to take historical information about database schema items into consideration.
In particular, we first construct a graph based on the corresponding database, where graph nodes are database schema items and graph edges are primary-foreign keys and column affiliation. A short distance between graph nodes appearing in the previous query and the current query can reveal context consistency, since there is usually an edge between different attributes of the same topic. We then propose a database schema interaction graph encoder to model database schema items together with historical items. Empirical results on two large cross-domain context-dependent text-to-SQL datasets, SParC and CoSQL, show that our schema interaction graph encoder contributes to modeling context consistency and that our proposed model with the database schema interaction graph encoder substantially outperforms the state-of-the-art model. Our main contributions are summarized as follows:
• Previous models fail to keep context consistency when predicting queries in a conversational scenario. To remedy this, we propose a database schema interaction graph encoder for database schema encoding, which can keep context consistency for the context-dependent text-to-SQL task. Our implementations are publicly available.
• Our model with the database schema interaction graph encoder achieves new state-of-the-art performance on the development and test sets of two cross-domain context-dependent text-to-SQL datasets, SParC and CoSQL.

Related Work
Many studies have focused on the context-independent text-to-SQL task. Zhong et al. (2017) split the vocabulary and use reinforcement learning. Xu et al. (2017) propose a sketch-based model, which decomposes the token prediction process into SELECT-clause prediction and WHERE-clause prediction, aiming at taking previous predictions into consideration. Yu et al. (2018a) further employ a tree-based SQL decoder so as to decode SQL queries with the help of SQL grammar. In order to encode database schemas, schemas have been regarded as graphs and graph neural networks have been applied (Bogin et al., 2019a,b). Guo et al. (2019) design an intermediate representation to bridge the gap between natural language texts and SQL queries. Choi et al. (2020) utilize a sketch-based slot filling approach to synthesize SQL queries. Wang et al. (2019) attempt to align database columns and their mentions in user inputs by using relation-aware self-attention.
Recently, the context-dependent text-to-SQL task has drawn attention. The in-domain context-dependent benchmark ATIS (Suhr et al., 2018) has been proposed. For ATIS, Suhr et al. (2018) utilize a sequence-to-sequence framework. Besides, they introduce an interaction-level encoder for incorporating historical user inputs and a segment copy mechanism to reduce the length of generation. Later, two large and complex cross-domain context-dependent datasets, SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a), were proposed. In order to tackle the cross-domain context-dependent text-to-SQL task, Zhang et al. (2019) propose the EditSQL model, which captures features from historical user inputs, variant database schemas and the previously predicted SQL query. Liu et al. (2020) further evaluate context modeling methods and apply a grammar-based decoder. EditSQL achieves the state-of-the-art performance on the two cross-domain datasets. Compared to EditSQL, our work further explores a new way to employ historical information of database schemas.

Problem Setup
We define X as a series of natural language utterances of an interaction (i.e., user inputs), Y as the corresponding ground-truth SQL queries, S as the set of database schema items (table.column) and R as the set of relations between schema items (primary-foreign keys and column affiliation). Let X = {x^1, x^2, ..., x^{|X|}}, where |X| is the number of utterances, x^i is the i-th utterance and x^i_j is the j-th token of it. y^i is the i-th SQL query corresponding to x^i and y^i_j is the j-th token of y^i. S consists of schema items {S_1, ..., S_{|S|}}, where |S| is the number of database schema items. At turn i, the model should make use of the current and previous utterances {x^1, x^2, ..., x^i}, the database schema items S and their relations R to predict a SQL query ỹ^i. The objective of the model is to maximize the probability of generating the ground-truth query, i.e., p(ỹ^i = y^i | x^1, ..., x^i, S, R).

IGSQL Model
Our model adopts an encoder-decoder framework with an attention mechanism. Figure 1 shows the architecture of our model. The model has four main components: (1) a database schema interaction graph encoder, which consists of cross-turn schema interaction graph layers and intra-turn schema graph layers, (2) a text encoder that captures historical information of user inputs, (3) a co-attention module that updates the outputs of the text encoder and the database schema interaction graph encoder, and (4) a decoder with a gate mechanism to weigh the importance of different vocabularies. In addition, the model also uses BERT embeddings.
We will first introduce the BERT embedding in Section 4.1, and then introduce our database schema interaction graph encoder in Section 4.2, text encoder and co-attention module in Section 4.3 and decoder in Section 4.4.

BERT Embedding
BERT (Devlin et al., 2019) is a pre-trained language model. Employing BERT output as embeddings of user inputs and database schema items has proved effective in the context-dependent text-to-SQL task (Hwang et al., 2019; Guo et al., 2019; Wang et al., 2019; Choi et al., 2020). Therefore, we leverage BERT to obtain the embeddings of user inputs and database schema items, as other context-dependent text-to-SQL models do. We concatenate user inputs and database schema items, separating them with a "[SEP]" token following Hwang et al. (2019). The output of the BERT model is used as the embeddings of user inputs and schema items.
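The serialization above can be sketched as follows. Note this is an illustrative sketch only: the exact token layout (e.g., whether a final "[SEP]" is appended, or how a table.column item is tokenized) is an assumption, not the paper's verified format.

```python
def build_bert_input(utterance_tokens, schema_items):
    # Prepend [CLS], then append each schema item after a [SEP] separator,
    # closing with a final [SEP]. Each schema item is a "table.column" string.
    pieces = ["[CLS]"] + list(utterance_tokens)
    for item in schema_items:
        pieces += ["[SEP]", item]
    pieces.append("[SEP]")
    return pieces

tokens = build_bert_input(["How", "many", "dogs"], ["Dogs.*", "Dogs.name"])
# the resulting token list is fed to BERT; its output vectors serve as
# embeddings for both utterance tokens and schema items
```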

Database Schema Interaction Graph Encoder
As shown in Table 1, the previous model mistakes "Kacey" for the name of a dog owner. However, the interaction is all about dogs, so "Kacey" should be the name of a dog. This shows that the previous model does not perform well in modeling the context consistency of an interaction. For two database schema items appearing in two adjacent turns, a short distance between them in the graph can reveal context consistency. For example, the distance between Dogs.* and the correct item Dogs.name is 1, while the distance between Dogs.* and the wrong item Owners.name is 3.
Therefore, we propose a database schema interaction graph encoder based on the database schema graph, attempting to model context consistency by using historical schema representations. The database schema interaction graph encoder consists of L_1 cross-turn schema interaction graph layers and L_2 intra-turn schema graph layers (L_1 and L_2 are hyper-parameters). Cross-turn schema interaction graph layers update schema item representations by using those in the previous turn. Intra-turn schema graph layers further aggregate adjacent item representations within the same turn.
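The distance intuition above can be checked with a small breadth-first search over the schema graph. This is a hypothetical helper for illustration, not part of the paper's implementation; the edge list mirrors the dog/owner example (same-table links plus the owner_id foreign key).

```python
from collections import deque

def schema_distance(edges, src, dst):
    # Shortest-path distance between two schema items in the undirected
    # schema graph (nodes = table.column items, edges = foreign keys and
    # same-table affiliation). Returns None if dst is unreachable.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

edges = [
    ("Dogs.*", "Dogs.name"), ("Dogs.*", "Dogs.owner_id"),
    ("Dogs.name", "Dogs.owner_id"),            # same-table links
    ("Dogs.owner_id", "Owners.owner_id"),      # foreign-primary key pair
    ("Owners.owner_id", "Owners.name"),
]
schema_distance(edges, "Dogs.*", "Dogs.name")    # → 1 (correct item, close)
schema_distance(edges, "Dogs.*", "Owners.name")  # → 3 (wrong item, far)
```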

Graph Construction and Schema Item Encoding

We first introduce how we construct a graph based on the database schema. We use database schema items as nodes. Each node has an edge linking to itself. There is an undirected edge between node t and node j according to the relation set R if one of the following conditions is satisfied: 1) node t and node j form a foreign-primary key pair; 2) node t and node j belong to the same table. We define the edge set as E.
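A minimal sketch of this edge construction follows; function and argument names are hypothetical, introduced only for illustration.

```python
def build_schema_graph(schema_items, table_of, foreign_keys):
    # Build the undirected edge set E: self-loops, foreign-primary key
    # pairs, and same-table membership. Each undirected edge is stored
    # once as a sorted tuple (self-loops as (t, t)).
    edges = set()
    for t in schema_items:
        edges.add((t, t))  # self-loop
    for a, b in foreign_keys:
        edges.add(tuple(sorted((a, b))))  # foreign-primary key pair
    for a in schema_items:
        for b in schema_items:
            if a < b and table_of[a] == table_of[b]:
                edges.add((a, b))  # same-table affiliation
    return edges
```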
A schema item table.column is divided into "table", "." and "column". We use a BiLSTM with BERT embeddings to encode the tokens, and average the hidden state vectors of the BiLSTM as the embedding of the schema item. The embedding of the j-th schema item at the i-th turn is denoted r^i_j.
Cross-turn Schema Interaction Graph Layer

Figure 2 shows an example of the database schema interaction graph. The graph only allows node t in the previous turn to update node j in the current turn when the distance between node t and node j in the original graph constructed in Section 4.2.1 is less than or equal to 1. For example, to update the representation of Dogs.dog_id at turn i, we add edges linking Dogs.*, Dogs.name, Dogs.owner_id and Dogs.dog_id at turn i−1 to Dogs.dog_id at turn i. Note that we have L_1 cross-turn schema interaction graph layers for turn i. At the l-th layer, we obtain the updated representation z^{i,l}_t of the t-th schema item by using attention on the outputs of the L_2 intra-turn schema graph layers at the previous turn, {g^{i-1,L_2}_t}_{t=1}^{|S|} (which will be introduced in the next subsection), and the representations of the previous layer z^{i,l-1}_t. We use the item embedding r^i_t as the initial representation z^{i,0}_t. For simplicity, we omit the turn index i and layer index l in the formulas of the attention mechanism except for the inputs z^{i,l-1}_t, g^{i-1,L_2}_t and the output z^{i,l}_t. At the l-th layer, we first apply a feed-forward neural network with a leakyReLU activation function (denoted FFN) for non-linear transformation.
We then apply the attention mechanism as follows:

u_t = FFN(z^{i,l-1}_t)
α_{t,j} = (W_1 u_t)^T (W_2 g^{i-1,L_2}_j) / √d_1
α̃_{t,j} = exp(α_{t,j}) / Σ_{j'∈N(t)} exp(α_{t,j'})
ũ_t = Σ_{j∈N(t)} α̃_{t,j} (W_2 g^{i-1,L_2}_j)

where d_1 is the dimension of u_t, W_1 and W_2 are weight matrices, α_{t,j} and α̃_{t,j} are the unnormalized and normalized attention scores, N(t) is the set of previous-turn neighbors of node t in the interaction graph, and ũ_t is the t-th output vector of the attention.
Following Vaswani et al. (2017) and Veličković et al. (2017), we extend the attention mechanism to multi-head attention. We also add a feed-forward sub-layer with a residual connection, as in the Transformer.
where z^{i,l}_t is the final output of layer l. There are L_1 cross-turn schema interaction graph layers, and thus z^{i,L_1}_t is the final output of the cross-turn schema interaction graph layers for the t-th schema item.
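One cross-turn attention step can be sketched with NumPy. This is a single-head sketch under assumptions: scaled dot-product scoring, a leakyReLU in place of the full FFN, and no multi-head extension or residual sub-layer.

```python
import numpy as np

def cross_turn_layer(z_prev_layer, g_prev_turn, allowed, W1, W2):
    # z_prev_layer: (n, d) current-turn item vectors from layer l-1.
    # g_prev_turn:  (n, d) previous-turn outputs g^{i-1,L2}.
    # allowed[t]:   previous-turn indices j with graph distance <= 1 to t.
    # W1, W2:       (d, d) weight matrices.
    n, d1 = z_prev_layer.shape
    # leakyReLU as a stand-in for the FFN non-linear transformation
    u = np.where(z_prev_layer > 0, z_prev_layer, 0.01 * z_prev_layer)
    out = np.zeros_like(z_prev_layer)
    for t in range(n):
        js = list(allowed[t])
        # scaled dot-product scores against allowed previous-turn nodes
        scores = np.array([(W1 @ u[t]) @ (W2 @ g_prev_turn[j]) for j in js])
        scores = scores / np.sqrt(d1)
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()  # softmax over neighbors
        out[t] = sum(w * (W2 @ g_prev_turn[j]) for w, j in zip(weights, js))
    return out
```

Stacking L_1 such layers, with the previous-turn outputs held fixed, yields the cross-turn encoder described above.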

Intra-turn Schema Graph Layer
There are L_2 intra-turn schema graph layers following the cross-turn schema interaction graph layers. In each intra-turn schema graph layer, we use almost the same attention mechanism as in the cross-turn schema interaction graph layer, except that we use the original graph constructed in Section 4.2.1. Since the original graph does not contain nodes from the previous turn, the intra-turn schema graph layer can only update a node representation by aggregating adjacent node representations within the same turn.
Each intra-turn schema graph layer l of turn i takes the output vectors of the previous layer g^{i,l-1}_t as inputs and produces g^{i,l}_t; g^{i,0}_t is z^{i,L_1}_t. We then use the attention mechanism to aggregate information, and also add a sub-layer of FFN with a residual connection. For simplicity, we omit the turn index i and layer index l in the attention except for the input g^{i,l-1}_t and output g^{i,l}_t.
µ_t = FFN(g^{i,l-1}_t)
β_{t,j} = exp(µ_t^T W_3 µ_j / √d_2) / Σ_{j'∈N(t)} exp(µ_t^T W_3 µ_{j'} / √d_2)
µ̃_t = Σ_{j∈N(t)} β_{t,j} µ_j

where W_3 is a weight matrix, d_2 is the dimension of µ_t, β_{t,j} is the attention score of the j-th node with respect to the t-th node, N(t) is the set of neighbors of node t in the original graph, and µ̃_t is the attention output. g^{i,l}_t is the output for the t-th schema item at layer l of turn i. Besides, we also extend the attention to multi-head attention.
The final output of the intra-turn schema graph layers for the t-th schema item is g^{i,L_2}_t.

Text Encoder and Co-Attention Module
We use a BiLSTM with BERT embeddings to encode the tokens of an utterance text. In order to capture the interaction history, we add an LSTM as an interaction encoder and utilize turn-level attention, following Zhang et al. (2019). The final representation of the t-th token in utterance i is denoted h^i_t.
We also add a co-attention module between text tokens and schema items, following Zhang et al. (2019). The schema item vector g̃^i_t used in the decoding phase is the concatenation of g^{i,L_2}_t and its corresponding attention vector over the text. The representation of input text token h̃^i_t used in the decoding phase is the concatenation of h^i_t and its corresponding attention vector over schema items. Due to the page limit, we omit the details here; they can be found in Zhang et al. (2019).

Decoder
In the decoding phase, we first encode the previously predicted query with a BiLSTM. We then exploit an LSTM decoder with attention (Bahdanau et al., 2015) to capture features from the input text's token vectors, the schema item vectors and the previously predicted SQL query vectors. At the j-th time step, we apply attention over the text token vectors h̃^i_t, the database schema vectors g̃^i_t and the previously predicted SQL token vectors q_t, obtaining three context vectors. The final context vector c_j is the concatenation of these three context vectors.
We follow Suhr et al. (2018) in predicting SQL tokens based on SQL reserved words, database schema items and previously predicted SQL tokens. We add a gate mechanism to measure the importance of these three vocabularies. For simplicity, we omit the turn index i in the decoder steps except for ỹ^i_j.
where o_j is the j-th hidden vector of the LSTM decoder and c_j is the context vector. [;] is the concatenation operator and õ_j is the non-linear transformation of [o_j; c_j]. σ is the sigmoid function. res, sch and que represent SQL reserved words, database schema items and previously predicted SQL tokens respectively, and ζ_res, ζ_sch and ζ_que represent the importance of these three kinds of tokens.

Table 2: Statistics of SParC and CoSQL (both cross-domain).

                  SParC   CoSQL
Interactions       4298    3007
Train              3034    2164
Dev                 422     292
Test                842     551
User Questions    12726   15598
Databases           200     200
Tables             1020    1020
Vocab              3794    9585
Avg Turn            3.0     5.2
We then predict SQL tokens as follows.
where w is the one-hot vector of word w, and q_t and g̃^i_t are the query vector and schema item vector mentioned before. The final generation probability p(ỹ^i_j) is p_1(ỹ^i_j) + p_2(ỹ^i_j) + p_3(ỹ^i_j). Z is the normalization factor that ensures Σ_{v∈V} p(v) = 1, where V is the whole vocabulary. The loss function is Σ_i Σ_j −log p(y^i_j).
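The gated combination over the three vocabularies can be sketched as follows. This is a simplified assumption of the mechanism: the per-vocabulary scores are taken as given non-negative vectors, and each is scaled by its scalar gate value before a single normalization, mirroring p = (p_1 + p_2 + p_3)/Z.

```python
import numpy as np

def gated_token_distribution(p_res, p_sch, p_que, zeta_res, zeta_sch, zeta_que):
    # p_*: non-negative score vectors over SQL reserved words, schema
    # items, and previous-query tokens. zeta_*: sigmoid gate values in
    # (0, 1) measuring the importance of each vocabulary.
    scores = np.concatenate([zeta_res * p_res,
                             zeta_sch * p_sch,
                             zeta_que * p_que])
    Z = scores.sum()          # normalization factor
    return scores / Z         # one distribution over the joint vocabulary
```

The decoder then picks the next SQL token from this joint distribution (or scores the ground-truth token for the loss).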

Implementation Details
We use the Adam optimizer (Kingma and Ba, 2015) to optimize the loss function. The initial learning rate is 1e-3 for all parameters except the BERT model, whose initial learning rate is 1e-5. We use learning rate warmup over the first 1000 steps. The learning rate is multiplied by 0.8 if the loss on the development set increases and the token accuracy on the development set decreases. The number of cross-turn schema interaction graph layers L_1 is 2, while the number of intra-turn schema graph layers L_2 is 1. The dimensions d_1 and d_2 are both 300. For the encoder and decoder, the hidden sizes of the one-layer LSTM and BiLSTM are 300. Besides, we use batch re-weighting to reweigh the loss function, following Suhr et al. (2018). For BERT embeddings, following EditSQL, we use the pre-trained BERT-base model to make a fair comparison.
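The learning-rate rule above can be sketched as a small step function. The linear warmup shape is an assumption (the paper states only warmup over the first 1000 steps); the decay condition follows the text.

```python
def next_learning_rate(lr, step, dev_loss_up, dev_acc_down,
                       base_lr=1e-3, warmup_steps=1000):
    # Linear warmup over the first `warmup_steps` steps (assumed shape),
    # then multiply by 0.8 whenever dev loss increases AND dev token
    # accuracy decreases; otherwise keep the current rate.
    if step <= warmup_steps:
        return base_lr * step / warmup_steps
    if dev_loss_up and dev_acc_down:
        return lr * 0.8
    return lr
```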

Experiment Setup
Datasets. We conduct experiments on two large-scale cross-domain context-dependent SQL generation datasets, SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a). In comparison with the previous context-dependent dataset ATIS (Dahl et al., 1994), SParC and CoSQL are more complex since they contain more databases and adopt a cross-domain task setting, where the databases of the training set differ from those of the development and test sets. Statistics of SParC and CoSQL are shown in Table 2. Evaluation Metrics. Yu et al. (2018b) introduce exact set match accuracy to replace string match accuracy, treating queries with the same constraints but in different order as the same query. On SParC and CoSQL, we use question match accuracy and interaction match accuracy as evaluation metrics. Question match accuracy is the average exact set match accuracy over all questions, while interaction match accuracy is the average exact set match accuracy over all interactions. Baseline Models. We compare our model with the following baseline models.
• CD-Seq2Seq. This model originated in Suhr et al. (2018) for the ATIS dataset. Yu et al. (2019b) adapt it to the cross-domain setting by adding a BiLSTM to encode schema items and modifying the decoder to generate different schema items according to the database.
• SyntaxSQL-con. This model originated in Yu et al. (2018a), which utilizes SQL grammar in the decoder. Yu et al. (2019b) adapt it to the context-dependent setting by adding LSTM encoders for historical user inputs and historical SQL queries.
• EditSQL. This model is proposed by Zhang et al. (2019). In addition to modules for encoding historical user inputs and the corresponding SQL queries, it contains a copy mechanism to copy tokens from previous SQL queries.
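Returning to the evaluation metric above, the order-insensitive comparison behind exact set match can be illustrated with a toy normalizer. This is greatly simplified relative to the official SParC/CoSQL evaluator: it treats the items inside each clause as an unordered set, ignoring SQL structure such as nesting and operator semantics.

```python
def exact_set_match(pred, gold):
    # Toy exact set match: split each query into clauses keyed by their
    # leading SQL keyword, collect the remaining tokens of each clause
    # into a set (so constraint order does not matter), then compare.
    def normalize(sql):
        clauses, current = {}, None
        for tok in sql.upper().replace(",", " , ").split():
            if tok in ("SELECT", "FROM", "WHERE", "GROUP", "ORDER", "HAVING"):
                current = tok
                clauses.setdefault(current, set())
            elif current is not None and tok not in (",", "AND"):
                clauses[current].add(tok)
        return clauses
    return normalize(pred) == normalize(gold)
```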

Experiment Results
Results of these baseline models and our proposed IGSQL model are shown in Table 3. Table 4 shows the exact match accuracy of interactions with respect to different turn numbers. On both datasets, performance on interactions with one turn improves less. In SParC, performance on interactions with two turns and four turns improves the most, while in CoSQL, performance on interactions with two turns and more than four turns improves the most. These results demonstrate that our database schema interaction graph encoder contributes to modeling schema items in conversational scenarios. Table 5 lists the exact match accuracy with respect to different hardness levels. The results show that performance improves at each hardness level, indicating that capturing historical database schema information not only improves the accuracy on easy questions but also answers harder questions more accurately.

Ablation Study
In order to verify the usefulness of our database schema interaction graph encoder, we conduct several ablation experiments as follows.
• w/o cross-turn schema interaction graph layer. We discard the cross-turn schema interaction graph layers. In this setting, our model cannot encode historical database schema information.
• w/o intra-turn schema graph layer. We discard the intra-turn schema graph layers to examine whether these layers are useful.
• GRU interaction layer. One of the most common ways to employ historical information of database schema items is to update a node representation directly from the historical vector of the same node. For example, in Figure 2, we can use a GRU that takes the representation of Dogs.dog_id at turn i−1 and its BERT embedding at turn i as input; the output of the GRU is the vector of Dogs.dog_id at turn i. In this experiment, we use a GRU to replace the cross-turn schema interaction graph layers.
• Fully-connected interaction layer. To examine the effectiveness of our design of the schema interaction graph, we replace the schema interaction graph with a fully connected graph. Taking Figure 2 as an example, to update the representation of the blue node at turn i, there are edges connecting the blue node at turn i to all nodes at turn i−1.
Since the test sets of SParC and CoSQL are not public, we carry out the ablation experiments only on the development sets of these two datasets. Table 6 shows the results of the ablation experiments.

Table 7: An example of an interaction in CoSQL. x^i is the input sequence at the i-th turn and y^i is the corresponding ground-truth query. We show the predictions of EditSQL and IGSQL and mark the differences in red.
Our full model achieves about 2 points of improvement over the model without cross-turn schema interaction graph layers and the model with the GRU interaction layer. Besides, our model achieves about 1 point of improvement over the model without intra-turn schema graph layers. These results indicate that our cross-turn and intra-turn schema graph layers are very helpful. The difference between the cross-turn schema interaction graph layer and the fully-connected interaction layer lies in how edges are added between nodes at turn i−1 and turn i. Compared to the fully-connected interaction layer, the schema interaction graph introduces a distance restriction when adding edges. Our model with the schema interaction graph performs substantially better, which shows that our design of the schema interaction graph significantly helps the model keep context consistency.

Case Study
In Table 7, we show an interaction with four turns. We also provide the predictions of EditSQL and IGSQL and mark the differences in red. After the first turn, EditSQL confuses cartoon.original_air_date with tv_series.air_date. Our proposed IGSQL model successfully obtains the answers in the correct order by taking historical information of database schema items into account.

Conclusion and Future work
In this paper, we focus on the context-dependent cross-domain SQL generation task. We find that the previous state-of-the-art model only takes historical user inputs and the previously predicted query into consideration, ignoring the historical information of database schema items. Thus we propose a model named IGSQL to model database schema items in a conversational scenario. Empirical results demonstrate the efficacy of our model. We also conduct ablation experiments to reveal the significance of our database schema interaction graph encoder. For future work, we will explore methods for solving hard and extra-hard questions.