ShadowGNN: Graph Projection Neural Network for Text-to-SQL Parser

Given a database schema, Text-to-SQL aims to translate a natural language question into the corresponding SQL query. In the cross-domain setup, traditional semantic parsing models struggle to adapt to unseen database schemas. To improve the model generalization capability for rare and unseen schemas, we propose a new architecture, ShadowGNN, which processes schemas at abstract and semantic levels. By ignoring names of semantic items in databases, abstract schemas are exploited in a well-designed graph projection neural network to obtain delexicalized representations of question and schema. Based on the domain-independent representations, a relation-aware transformer is utilized to further extract logical linking between question and schema. Finally, a SQL decoder with context-free grammar is applied. On the challenging Text-to-SQL benchmark Spider, empirical results show that ShadowGNN outperforms state-of-the-art models. When the annotated data is extremely limited (only 10% of the training set), ShadowGNN achieves an absolute performance gain of over 5%, which shows its powerful generalization ability. Our implementation will be open-sourced at https://github.com/WowCZ/shadowgnn


Introduction
Recently, Text-to-SQL has drawn a great deal of attention from the semantic parsing community (Berant et al., 2013; Cao et al., 2019). The ability to query a database with natural language (NL) enables the majority of users, who are not familiar with SQL, to access large databases. A number of neural approaches have been proposed to translate questions into executable SQL queries. On public Text-to-SQL benchmarks (Zhong et al., 2017; Krishnamurthy et al., 2017), exact match accuracy even exceeds 80%. However, the cross-domain problem for Text-to-SQL is a practical challenge that is ignored by the prior datasets.
* The corresponding authors are Lu Chen and Kai Yu.
To clarify, a database schema is regarded as a domain. The domain information consists of two parts: the semantic information (e.g., the table name) of the schema components and the structure information (e.g., the primary-key relation between a table and a column) of the schema. The recently released dataset, Spider (Yu et al., 2018), hides the database schemas of the test set, which are totally unseen in the training set. In this cross-domain setup, domain adaptation is challenging for two main reasons. First, the semantic information of the domains in the test and development sets is unseen in the training set. On the given development set, 35% of the words in database schemas do not occur in the schemas of the training set. It is hard to match the domain representations in the question and the schema. Second, there is a considerable discrepancy among the structures of the database schemas. In particular, the database schemas always contain semantic information. It is difficult to get a unified representation of the database schema. Under the cross-domain setup, the essential challenge is to alleviate the impact of the domain information.
First, it is necessary to figure out which role the semantic information of the schema components plays when translating an NL question into a SQL query. Consider the example in Fig. 1(a): for the Text-to-SQL model, the basic task is to find all the mentioned columns (name) and tables (team, match season) by looking up the schema with semantic information (named the semantic schema). Once the mentioned columns and tables in the NL question are exactly matched with schema components, we can abstract the NL question and the semantic schema by replacing the specific schema components with their general component types. As shown in Fig. 1(b), we can still infer the structure of the SQL query using the abstract NL question and the schema structure. With the correspondence between the semantic schema and the abstract schema, we can restore the abstract query to an executable SQL query with domain information. Inspired by this phenomenon, we decompose the encoder of the Text-to-SQL model into two modules. First, we propose a Graph Projection Neural Network (GPNN) to abstract the NL question and the semantic schema, where the domain information is removed as much as possible. Then, we use the relation-aware transformer to get unified representations of the abstract NL question and the abstract schema.
Our approach, named ShadowGNN, is evaluated on the challenging cross-domain Text-to-SQL dataset, Spider. Contributions are summarized as:
• We propose ShadowGNN to alleviate the impact of the domain information by abstracting the representation of NL question and SQL query. The method can also be applied to similar cross-domain tasks.
• To validate the generalization capability of our proposed ShadowGNN, we conduct experiments with limited annotated data. The results show that ShadowGNN obtains an absolute accuracy gain of over 5% compared with the state-of-the-art model when the annotated data amounts to only 10% of the training set.
• The empirical results show that our approach outperforms state-of-the-art models (66.1% accuracy on test set) on the challenging Spider benchmark. The ablation studies further confirm that GPNN is important to abstract the representation of the NL question and the schema.

Background
In this section, we first introduce the relational graph convolution network (R-GCN) (Schlichtkrull et al., 2018), which is the basis of our proposed GPNN. Then, we introduce the relation-aware transformer, a transformer variant that considers relation information when calculating attention weights.

Relational Graph Convolution Network
Before describing the details of R-GCN, we first give the notation of a relational directed graph. We denote this kind of graph as $G = (V, E, R)$ with nodes (schema components) $v_i \in V$ and directed labeled edges $(v_i, r, v_j) \in E$, where $v_i$ is the source node, $v_j$ is the destination node and $r \in R$ is the edge type from $v_i$ to $v_j$. $N_i^r$ represents the set of the neighbor indices of node $v_i$ under relation $r$, where $v_i$ plays the role of the destination node.
Each node of the graph has an input feature $x_i$, which can be regarded as the initial hidden state $h_i^{(0)}$ of the R-GCN. The hidden state of each node in the graph is updated layer by layer with the following steps:
Sending Message At the $l$-th R-GCN layer, each edge $(v_i, r, v_j)$ of the graph sends a message from the source node $v_i$ to the destination node $v_j$. The message is calculated as below:
$$m_{ij}^{(l)} = W_r^{(l)} h_i^{(l-1)}, \tag{1}$$
where $r$ is the relation from $v_i$ to $v_j$ and $W_r^{(l)}$ is a linear transformation, which is a trainable matrix. Following Equation 1, the number of message-calculating parameters is proportional to the number of relation types. To increase the scalability, R-GCN regularizes the message-calculating parameter with the basis decomposition method, which is defined as below:
$$W_r^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)}, \tag{2}$$
where $B$ is the basis number and $a_{rb}^{(l)}$ is the coefficient of the basis transformation $V_b^{(l)}$.
Aggregating Message After the message sending process, all the incoming messages of each node are aggregated. Combining Equations 1 and 2, R-GCN simply averages these incoming messages as:
$$g_i^{(l)} = \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} \Big( \sum_{b=1}^{B} a_{rb}^{(l)} V_b^{(l)} \Big) h_j^{(l-1)}, \tag{3}$$
where $c_{i,r}$ equals $|N_i^r|$.
Updating State After aggregating messages, each node updates its hidden state from $h_i^{(l-1)}$ to $h_i^{(l)}$:
$$h_i^{(l)} = \sigma\big(g_i^{(l)} + W_0^{(l)} h_i^{(l-1)}\big), \tag{4}$$
where $\sigma$ is an activation function (i.e., ReLU) and $W_0^{(l)}$ is a weight matrix. For each layer of R-GCN, the update process can be simply denoted as:
$$Y = \text{R-GCN}(X, G), \tag{5}$$
where $X = \{h_i\}_{i=1}^{|G|}$ is the set of node hidden states, $|G|$ is the number of nodes and $G$ is the graph structure.
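The message, aggregation and update steps above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy graph, the relation labels "has" and "belongs_to", and the choice of identity and swap matrices as bases are all assumptions made for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def basis_weight(coeffs, bases):
    """Basis decomposition: W_r = sum_b a_rb * V_b."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(a * V[i][j] for a, V in zip(coeffs, bases)) for j in range(cols)]
            for i in range(rows)]


def rgcn_layer(h, edges, coeffs, bases, W0):
    """One R-GCN update with ReLU as the activation.

    h: list of node vectors; edges: list of (source, relation, destination);
    coeffs[relation]: basis coefficients; bases: list of basis matrices;
    W0: weight applied to the node's own previous state.
    """
    # Group source nodes by (destination, relation) so messages can be averaged.
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)

    aggregated = [[0.0] * len(W0) for _ in h]
    for (dst, rel), sources in incoming.items():
        W_r = basis_weight(coeffs[rel], bases)
        c = len(sources)  # normalizer c_{i,r} = |N_i^r|
        for src in sources:
            message = matvec(W_r, h[src])
            aggregated[dst] = [g + m / c for g, m in zip(aggregated[dst], message)]

    updated = []
    for i, state in enumerate(h):
        own = matvec(W0, state)
        updated.append([max(0.0, g + o) for g, o in zip(aggregated[i], own)])
    return updated


# Hypothetical toy graph: three nodes, two relation types, two bases.
identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]
hidden = [[1, 0], [0, 2], [3, 1]]
edges = [(0, "has", 1), (2, "has", 1), (1, "belongs_to", 0)]
coeffs = {"has": [1, 0], "belongs_to": [0, 1]}
updated = rgcn_layer(hidden, edges, coeffs, [identity, swap], identity)
print(updated)  # [[3.0, 0.0], [2.0, 2.5], [3.0, 1.0]]
```

Node 1 receives two "has" messages and averages them, while node 2 has no incoming edge and only keeps its own transformed state.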

Relation-aware Transformer
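With the success of large-scale language models, the transformer architecture has been widely used in natural language processing (NLP) tasks to encode the sequence $X = [x_i]_{i=1}^{n}$ with the self-attention mechanism. As introduced in Vaswani et al. (2017), a transformer is a stack of self-attention layers, where each layer transforms $x_i$ to $y_i$ with $H$ heads as follows:
$$e_{ij}^{(h)} = \frac{x_i W_Q^{(h)} \big(x_j W_K^{(h)}\big)^\top}{\sqrt{d_z / H}}, \tag{6}$$
$$\alpha_{ij}^{(h)} = \text{softmax}_j\{e_{ij}^{(h)}\}, \tag{7}$$
$$z_i^{(h)} = \sum_{j=1}^{n} \alpha_{ij}^{(h)} x_j W_V^{(h)}, \tag{8}$$
$$z_i = \text{Concat}(z_i^{(1)}, \ldots, z_i^{(H)}),$$
$$\bar{y}_i = \text{LayerNorm}(x_i + z_i),$$
$$y_i = \text{LayerNorm}(\bar{y}_i + \text{FC}(\text{ReLU}(\text{FC}(\bar{y}_i)))),$$
where $h$ is the head index, $d_z$ is the hidden dimension of $z_i$, $\alpha_{ij}^{(h)}$ is the attention probability, Concat denotes the concatenation operation, LayerNorm is layer normalization (Ba et al., 2016) and FC is a fully connected layer. The transformer function can be simply denoted as:
$$Y = \text{Transformer}(X),$$
where $Y = \{y_i\}_{i=1}^{|X|}$, $X = \{x_i\}_{i=1}^{|X|}$ and $|X|$ is the sequence length.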
Relation-aware transformer (RAT) (Shaw et al., 2018) is an important extension of the traditional transformer, which regards the input sequence as a labeled, directed, fully-connected graph. The pairwise relations between input elements are considered in RAT. RAT incorporates the relation information into Equation 6 and Equation 8. The edge from element $x_i$ to element $x_j$ is represented by vectors $r_{ij,K}$ and $r_{ij,V}$, which are incorporated as biases in the self-attention layer, as follows:
$$e_{ij}^{(h)} = \frac{x_i W_Q^{(h)} \big(x_j W_K^{(h)} + r_{ij,K}\big)^\top}{\sqrt{d_z / H}},$$
$$z_i^{(h)} = \sum_{j=1}^{n} \alpha_{ij}^{(h)} \big(x_j W_V^{(h)} + r_{ij,V}\big),$$
where $r_{ij,K}$ and $r_{ij,V}$ are shared across attention heads. For each layer of RAT, the update process can be simply represented as:
$$Y = \text{RAT}(X, R),$$
where $R = \{R_{ij}\}_{i=1,j=1}^{|X|,|X|}$ is the relation matrix among the sequence tokens and $R_{ij}$ is the relation type between the $i$-th token and the $j$-th token.
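To make the role of the relation biases concrete, the following is a minimal single-head sketch in pure Python. It assumes the inputs have already been projected (the products with the query, key and value matrices are taken as given), it omits the multi-head concatenation, layer normalization and feed-forward parts, and all names and toy values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_aware_attention(queries, keys, values, rel_k, rel_v):
    """Single-head relation-aware attention.

    queries, keys, values: projected token vectors (lists of equal length d).
    rel_k[i][j], rel_v[i][j]: bias vectors for the edge from token i to token j.
    Returns the attended outputs and the attention weights.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for i, query in enumerate(queries):
        scores = []
        for j, key in enumerate(keys):
            biased_key = [k + r for k, r in zip(key, rel_k[i][j])]
            scores.append(dot(query, biased_key) / math.sqrt(d))
        alpha = softmax(scores)
        output = [
            sum(alpha[j] * (values[j][t] + rel_v[i][j][t]) for j in range(len(values)))
            for t in range(d)
        ]
        outputs.append(output)
        weights.append(alpha)
    return outputs, weights


# Toy example with two tokens; zero biases reduce to ordinary attention.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
zero = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
_, plain = relation_aware_attention(queries, keys, values, zero, zero)

# A key bias on the edge 0 -> 1 pulls the attention of token 0 toward token 1.
biased_k = [[[0.0, 0.0], [2.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
_, biased = relation_aware_attention(queries, keys, values, biased_k, zero)
print(plain[0][1] < biased[0][1])
```

With zero biases, token 0 attends mostly to itself; with the added key bias on the edge from token 0 to token 1, the weight shifts toward token 1.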
Both R-GCN and RAT have been successfully applied to Text-to-SQL tasks. Bogin et al. (2019a) utilize R-GCN to encode the structure of the semantic schema to get the global representations of the nodes. RATSQL considers not only the schema structure but also the schema linking between the schema and the NL question, and proposes a unified framework to model the representation of the schema and the question with RAT. However, these approaches do not explicitly explore the impact of the domain information. In the next section, we introduce our proposed GPNN and explain how to use GPNN to get the abstract representation of the schema and the question.

Method
Text-to-SQL models take the NL question $Q = \{q_i\}_{i=1}^{n}$ and the semantic schema $G = \{s_j\}_{j=1}^{m}$ as the input. In our proposed ShadowGNN, the encoder is decomposed into two modules. The first module filters the specific domain information with a well-designed graph projection neural network (GPNN). The second module leverages the relation-aware transformer to further get unified representations of question and schema. This two-phase encoder of ShadowGNN simulates the inference process of a human when translating a question to a SQL query under the cross-domain setup: abstracting and inferring.

Graph Projection Neural Network
In this subsection, we introduce the structure of GPNN. As discussed, the schema consists of database structure information and domain semantic information. GPNN looks at the schema from these two perspectives. Thus, GPNN has three kinds of inputs: the abstract schema, the semantic schema, and the NL question. The input of the abstract schema is the type (table or column) of the schema nodes without any domain information, which can be regarded as a projection of the semantic schema. Each node in the abstract schema is represented by a one-hot vector $a_j$, which has two dimensions. For the semantic schema and the NL question, we first use the pretrained language model RoBERTa to initialize their representations. We directly concatenate the NL question and the semantic schema together, formatted as "[CLS] question [SEP] tables columns [SEP]". Each node name in the semantic schema may be tokenized into several sub-tokens or sub-words. We add an average pooling layer after the final layer of RoBERTa to align the sub-tokens to the corresponding node. We denote the initial representations of the NL question and the semantic schema as $q^{(0)}$ and $s^{(0)}$.
The main motivation of GPNN is to abstract the representations of question and schema. The abstract schema has been distilled from the semantic schema. The essential challenge lies in abstracting the question representation. There are two separate operations in each GPNN layer: Projection Attention and Character Encoding. The projection attention of GPNN takes the semantic schema as the bridge: the question updates its representation using the abstract schema, but the attention information is calculated with the vectors of the semantic schema. The character encoding augments the structure representation of the question sentence and the schema graph.
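The input construction described above (the concatenated input string, the averaging of sub-token vectors per schema node, and the two-dimensional one-hot type vector) can be sketched as follows. The helper names, the toy vectors, and the choice of which one-hot dimension stands for "table" are assumptions for illustration; the real system obtains sub-token vectors from RoBERTa, which is not called here.

```python
def build_input_text(question, tables, columns):
    """Format the encoder input as "[CLS] question [SEP] tables columns [SEP]"."""
    return "[CLS] " + question + " [SEP] " + " ".join(tables + columns) + " [SEP]"


def average_pool(subtoken_vectors, spans):
    """Average the sub-token vectors of each node.

    spans: one half-open (start, end) index range per schema node.
    """
    pooled = []
    for start, end in spans:
        chunk = subtoken_vectors[start:end]
        dim = len(chunk[0])
        pooled.append([sum(vec[t] for vec in chunk) / len(chunk) for t in range(dim)])
    return pooled


def abstract_one_hot(node_type):
    """Two-dimensional type vector; dimension 0 = table is an arbitrary choice."""
    return [1.0, 0.0] if node_type == "table" else [0.0, 1.0]


text = build_input_text(
    "what is the name of the team", ["team", "match season"], ["name"]
)
print(text)

# Hypothetical sub-token vectors: "match" and "season" form one node, "name" another.
vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]]
pooled = average_pool(vectors, [(0, 2), (2, 3)])
print(pooled)  # [[2.0, 4.0], [0.0, 2.0]]
```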
Projection Attention In each GPNN layer, there is first an attention operation between the NL question and the semantic schema, as follows:
$$e_{ij} = q_i^{(l)} W_Q^{(l)} \big(s_j^{(l)} W_K^{(l)}\big)^\top, \tag{17}$$
where $W_Q^{(l)}$ and $W_K^{(l)}$ are trainable parameters at the $l$-th projection layer and $e_{n \times m} = \{e_{ij}\}_{i=1,j=1}^{n,m}$ is the matrix of the weight scores. $n$ is the length of the question, and $m$ is the number of schema nodes.
Before applying the attention mechanism, inspired by Bogin et al. (2019a), we first calculate the maximum values $u$ of the attention probability, where the physical meaning of $u_j$ is the maximum probability that the $j$-th component of the schema is mentioned by the question. We distinguish the initial representation of the abstract schema by multiplying the $l$-th layer abstract schema representation $a^{(l)}$ by $u$ in an element-wise way, $\hat{a}^{(l)} = a^{(l)} \cdot u$.
When updating the question representation, we take the augmented abstract schema representation $\hat{a}^{(l)}$ as the key value of attention at the $l$-th layer of GPNN. The update uses a gate function gate(·) = sigmoid(Linear(·)) and a trainable weight $W_V^{(l)}$, and produces the intermediate question representation $\bar{q}^{(l+1)}$. When updating the semantic schema, we take the transpose of the above attention matrix as the attention from schema to question, $\hat{e}_{m \times n} = (e_{n \times m})^\top$. Similar to the update process of the question in Equations 17-21, the update process of the semantic schema $\bar{s}^{(l+1)}$ takes $\hat{e}_{m \times n}$ as the attention score and $q^{(l)}$ as the attention value. Note that only the augmented abstract schema is used to update the question representation. In this way, the domain information contained in the question representation is removed. The update process of the abstract schema $\bar{a}^{(l+1)}$ is the same as that of the semantic schema, where the attention weight $\hat{e}_{m \times n}$ on the question $q^{(l)}$ is shared. Note that the input of the attention operation for the abstract schema is the augmented abstract representation $\hat{a}$.
Character Encoding We have used the projection attention mechanism to update the three kinds of vectors. Then, we combine the characters of the schema and the NL question and continue encoding the schema and the question with the R-GCN(·) function and the Transformer(·) function respectively, as shown in Fig. 2:
$$a^{(l+1)} = \text{R-GCN}(\bar{a}^{(l+1)}, G),$$
$$s^{(l+1)} = \text{R-GCN}(\bar{s}^{(l+1)}, G),$$
$$q^{(l+1)} = \text{Transformer}(\bar{q}^{(l+1)}).$$
So far, we have introduced the projection layer. The graph projection neural network (GPNN) is a stack of projection layers. After the GPNN module, we get the abstract representations of the schema and the question, denoted as $a^{(N)}$ and $q^{(N)}$.
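The core idea of projection attention, scoring the question against the semantic schema but reading values from the abstract schema, can be sketched in a few lines of pure Python. This is a deliberately simplified illustration and not the paper's exact update: the trainable projections are omitted (treated as identity), the gate is left out, and the attention probability is assumed to be a softmax over schema nodes. All names and toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def projection_attention(question, semantic, abstract):
    """Simplified projection attention (no projections, no gate).

    question: n question-word vectors; semantic: m semantic schema vectors;
    abstract: m abstract (type) vectors aligned with the semantic schema.
    """
    n, m = len(question), len(semantic)
    # Scores come from the semantic schema ...
    scores = [[dot(q, s) for s in semantic] for q in question]
    alpha = [softmax(row) for row in scores]
    # ... u_j is the largest probability that node j is mentioned by any word ...
    u = [max(alpha[i][j] for i in range(n)) for j in range(m)]
    a_hat = [[value * u[j] for value in abstract[j]] for j in range(m)]
    # ... but the values read by the question come from the abstract schema.
    dim = len(abstract[0])
    new_question = [
        [sum(alpha[i][j] * a_hat[j][t] for j in range(m)) for t in range(dim)]
        for i in range(n)
    ]
    return new_question, alpha, u


# Word 0 resembles the table node, word 1 resembles the column node.
question = [[4.0, 0.0], [0.0, 4.0]]
semantic = [[1.0, 0.0], [0.0, 1.0]]
abstract = [[1.0, 0.0], [0.0, 1.0]]  # one-hot types: table, column
new_question, alpha, u = projection_attention(question, semantic, abstract)
print(new_question)
```

In this toy run the updated vector of word 0 is dominated by the "table" dimension and that of word 1 by the "column" dimension, so the updated question carries type information rather than the names themselves.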

Schema Linking
Schema linking (Guo et al., 2019; Lei et al., 2020) can be regarded as a kind of prior knowledge, where the related representation between question and schema is tagged according to the matching degree. There are 7 tags in total. The column values are stored in the databases. As described above, the schema linking can be represented as $D = \{d_{ij}\}_{i=1,j=1}^{n,m}$, where $d_{ij}$ denotes the match degree between the $i$-th word of the question and the $j$-th node name of the schema.
To integrate the schema linking information into the GPNN module, we calculate a prior attention score $p_{n \times m} = \text{Linear}(\text{Embedding}(d_{ij}))$, where $d_{ij}$ is the one-hot representation of the match type. The attention score in Equation 17 is updated as follows:
$$e_{ij} = q_i^{(l)} W_Q^{(l)} \big(s_j^{(l)} W_K^{(l)}\big)^\top + p_{ij},$$
where $p_{ij}$ is the prior score from $p_{n \times m}$. The prior attention score is shared among all the GPNN layers.
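As a rough illustration of how a schema-linking matrix and its prior score could be built, the sketch below uses a reduced, hypothetical tag set of three tags (no match, partial match, exact match) rather than the 7 tags of the paper, replaces Linear(Embedding(·)) with a fixed lookup of one scalar per tag, and assumes the prior is added to the attention score. None of these choices are taken from the original implementation.

```python
NO_MATCH, PARTIAL_MATCH, EXACT_MATCH = 0, 1, 2  # hypothetical reduced tag set


def match_tag(word, node_name):
    """Tag a question word against a schema node name by simple string matching."""
    name_words = node_name.lower().split()
    word = word.lower()
    if [word] == name_words:
        return EXACT_MATCH
    if word in name_words:
        return PARTIAL_MATCH
    return NO_MATCH


def linking_matrix(question_words, node_names):
    """D[i][j]: match tag between question word i and schema node j."""
    return [[match_tag(word, name) for name in node_names] for word in question_words]


def add_prior(scores, tags, tag_scores):
    """Add a per-tag prior score to every attention score (additive form assumed)."""
    return [
        [scores[i][j] + tag_scores[tags[i][j]] for j in range(len(scores[i]))]
        for i in range(len(scores))
    ]


words = ["name", "of", "season"]
nodes = ["name", "match season", "team"]
tags = linking_matrix(words, nodes)
print(tags)  # [[2, 0, 0], [0, 0, 0], [0, 1, 0]]

# Stand-in for Linear(Embedding(d_ij)): one scalar per tag.
tag_scores = [0.0, 0.5, 1.0]
raw_scores = [[0.0, 0.0, 0.0] for _ in words]
print(add_prior(raw_scores, tags, tag_scores))
```

Because the prior depends only on the tags, it can be computed once and reused by every layer, which mirrors the sharing described in the text.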

RAT
If we split the schema into the tables and the columns, there are three kinds of inputs: question, table, and column. RATSQL defines the relations $R$ among the three inputs and uses the RAT(·) function to get a unified representation of question and schema. The details of the defined relations among the three components are introduced in RATSQL. The schema linking relations are a subset of $R$. In this paper, we leverage RAT to further unify the abstract representations of question $q^{(N)}$ and schema $a^{(N)}$, which are generated by the previous GPNN module. We concatenate the sentence sequence $q^{(N)}$ and the schema sequence $a^{(N)}$ into a longer sequence representation, which is the initial input of the RAT module. The output of the RAT module is the final unified representation of question and schema.

Decoder with SemQL Grammar
To effectively constrain the search space during synthesis, IRNet (Guo et al., 2019) designed a context-free SemQL grammar as the intermediate representation between NL question and SQL, which is essentially an abstract syntax tree (AST). SemQL recovers the tree nature of SQL. To simplify the grammar tree, SemQL in IRNet did not cover all the keywords of SQL. For example, the columns contained in the GROUPBY clause can be inferred from the SELECT clause or the primary key of a table where an aggregate function is applied to one of its columns. In our system, we improve the SemQL grammar so that each keyword in a SQL sentence corresponds to a SemQL node. During the training process, the labeled SQL needs to be transformed into an AST. During the evaluation process, the AST needs to be recovered as the corresponding SQL. The recover success rate is the rate at which the recovered SQL exactly equals the labeled SQL. Our improved grammar raises the recover success rate from 89.6% to 99.9% on the dev set.
We leverage the coarse-to-fine approach (Dong and Lapata, 2018) to decompose the decoding process of a SemQL query into two stages, which is similar to IRNet. The first stage predicts a skeleton of the SemQL query with a skeleton decoder. Then, a detail decoder fills in the missing details in the skeleton by selecting columns and tables.

Experiments
In this section, we evaluate the effectiveness of our proposed ShadowGNN against other strong baselines. We further conduct experiments with limited annotated training data to validate the generalization capability of the proposed ShadowGNN. Finally, we ablate other design choices to understand their contributions.

Experiment Setup
Dataset & Metrics We conduct the experiments on Spider (Yu et al., 2018), which is a large-scale, complex and cross-domain Text-to-SQL benchmark. The databases in Spider are split into 146 training, 20 development and 40 test databases. The human-labeled question-SQL query pairs are divided into the corresponding training, development and test sets.
Baselines The main contribution of this paper lies in the encoder of the Text-to-SQL model. As for the decoder of our evaluated models, we improve the SemQL grammar of IRNet (Guo et al., 2019), where the recover success rate rises from 89.6% to 99.9%. The SQL query is first represented by an abstract syntax tree (AST) following the well-designed grammar. Then, the AST is flattened into a sequence (named SemQL query) by the depth-first search (DFS) method. During decoding, the sequence is still predicted token by token with an LSTM decoder. We also apply the coarse-to-fine approach to the decoder, as in IRNet. A skeleton decoder first outputs a skeleton of the SemQL query. Then, a detail decoder fills in the missing details in the skeleton by selecting columns and tables. R-GCN (Bogin et al., 2019a; Kelkar et al., 2020) and RATSQL are two other strong baselines, which improve the representation ability of the encoder.
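The flattening of an AST by depth-first search and the split between skeleton and details can be illustrated with a small sketch. The node labels below ("Root", "Select", "Agg", "Column:name", "Table:team") are hypothetical and do not reproduce the actual SemQL grammar; the skeleton here is simply the flattened sequence with the concrete column and table choices stripped.

```python
class Node:
    """A tree node with a label and an ordered list of children."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def flatten_dfs(node):
    """Flatten a tree into a label sequence by pre-order depth-first search."""
    sequence = [node.label]
    for child in node.children:
        sequence.extend(flatten_dfs(child))
    return sequence


def to_skeleton(sequence):
    """Drop the concrete schema item after ':' so only the node kind remains."""
    return [label.split(":")[0] for label in sequence]


# Hypothetical tree for a query selecting the column "name" from the table "team".
tree = Node("Root", [
    Node("Select", [
        Node("Agg", [Node("Column:name"), Node("Table:team")]),
    ]),
])

flat = flatten_dfs(tree)
skeleton = to_skeleton(flat)
print(flat)      # ['Root', 'Select', 'Agg', 'Column:name', 'Table:team']
print(skeleton)  # ['Root', 'Select', 'Agg', 'Column', 'Table']
```

In a coarse-to-fine decoder, the first stage would predict a sequence like `skeleton`, and the second stage would choose the concrete columns and tables for the remaining slots.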
Implementations We implement ShadowGNN and our baseline approaches with PyTorch (Paszke et al., 2019). We use the pretrained RoBERTa model from the PyTorch transformer repository (Wolf et al., 2019). We use Adam with default hyperparameters for optimization. The learning rate is set to 2e-4, with a weight decay of 0.1 for the learning rate of the pretrained model. The hidden sizes of the GPNN layer and the RAT layer are set to 512. The dropout rate is 0.3. The batch size is set to 16. The numbers of GPNN layers and RAT layers in the ShadowGNN encoder are both set to 4.

Experimental Results
For a fair comparison with our proposed ShadowGNN, we implement RATSQL with the same coarse-to-fine decoder and RoBERTa augmentation as the ShadowGNN model. We also report the performance of the GPNN encoder on the test set. The detailed implementations of these two baselines are as follows:
• RATSQL ♣ The RATSQL model replaces the four projection layers with another four relation-aware self-attention layers. There are eight relation-aware self-attention layers in total in the encoder, which is consistent with the original RATSQL setup.
• GPNN Compared with ShadowGNN, the GPNN model directly removes the relation-aware transformer. There are only four projection layers in the encoder, which gets better performance than eight layers.
Table 1 presents the exact match accuracy of recent models on the development set and the test set. Compared with the state-of-the-art RATSQL, our proposed ShadowGNN gets absolute 2.6% and 0.5% improvements on the development set and the test set with RoBERTa augmentation. Compared with our implemented RATSQL ♣, ShadowGNN still stays ahead, with absolute 2.1% and 2.1% improvements on the development set and the test set. ShadowGNN, which improves the encoder and the SemQL grammar of IRNet, obtains an absolute 11.1% accuracy gain.

Generalization Capability
We design an experiment to validate the effectiveness of the graph projection neural network (GPNN). Consider the preprocessed question "What is name and capacity of stadium with most concert after year ?", where "name" and "capacity" are column names. We exchange their positions and calculate the cosine similarity between the representations at the final GPNN layer of the ShadowGNN model. Interestingly, we find that "name" is most similar to "capacity", as shown in Figure 3. The semantic meaning of the two column names seems to be removed, so that the representations of the two column names depend only on their positions. This indicates that GPNN can get the abstract representation of the question.
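The similarity measure used in this probe is the standard cosine similarity. A minimal sketch is given below; the vectors are toy values chosen for illustration and are not representations from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction are maximally similar,
# orthogonal vectors have similarity zero.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

In the experiment above, each word representation from the original question would be compared against each representation from the position-swapped question, and the most similar pairs inspected.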
To further validate the generalization ability of our proposed ShadowGNN, we conduct experiments on limited annotated training datasets. The limited training datasets are sampled from the full training dataset with 10%, 50% and 100% sampling rates. As shown in Figure 4, there is a large performance gap between RATSQL and ShadowGNN when the annotated data is extremely limited, occupying only 10% of the full training dataset. ShadowGNN outperforms RATSQL and GPNN by over 5% accuracy on the development set. Under this limited training data setup, we find an interesting phenomenon: the convergence speed of ShadowGNN is much faster than that of the other two models. As described in Section 3, the two-phase encoder of ShadowGNN simulates the inference process of a human when translating a question to a SQL query: abstracting and inferring. The experiments on limited annotated training datasets show that these two phases are both necessary, which not only improves the performance but also speeds up the convergence.
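A limited-data setup of this kind can be reproduced with a simple seeded sampler. The sketch below is an assumption about how such subsets might be drawn (uniformly at random without replacement, with a fixed seed); the paper does not specify the sampling procedure, and the function name and placeholder examples are hypothetical.

```python
import random


def sample_training_subset(examples, rate, seed=0):
    """Draw a reproducible random subset containing `rate` of the examples."""
    rng = random.Random(seed)
    size = round(len(examples) * rate)
    return rng.sample(examples, size)


# Placeholder examples standing in for question-SQL pairs.
examples = [f"pair_{i}" for i in range(100)]
subsets = {rate: sample_training_subset(examples, rate) for rate in (0.1, 0.5, 1.0)}
print({rate: len(subset) for rate, subset in subsets.items()})
```

Fixing the seed keeps the subsets identical across the compared models, so that differences in accuracy are not caused by different training samples.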

Ablation Studies
We conduct ablation studies to analyze the contributions of the well-designed graph projection neural network (GPNN). Besides the RATSQL and GPNN models, we implement two other ablation models: R-GCN and R-GCN+RAT. First, we introduce the implementations of the ablation models.
• R-GCN ♣ We directly remove the projection part in the GPNN. When updating the question representation, we use the representation of semantic schema as attention value instead of abstract representation.
• R-GCN+RAT In this model, there are four R-GCN layers and four relation-aware selfattention layers. To be comparable, the initial input of R-GCN is the sum of semantic schema and abstract schema.
The decoder parts of these four ablation models are the same as the decoder of ShadowGNN. We present the accuracy of the ablation models at the four hardness levels on the development set, which are defined in Yu et al. (2018). As shown in Table 2, ShadowGNN gets the best performance at three hardness levels. Compared with R-GCN (Kelkar et al., 2020), our implemented R-GCN based on the SemQL grammar gets higher performance. Compared with the R-GCN+RAT model, ShadowGNN still gets better performance, although the initial input information is exactly the same. This indicates that it is necessary and effective to abstract the representation of question and schema explicitly.

Related Work
Text-to-SQL Recent models evaluated on Spider have pointed out several interesting directions for Text-to-SQL research. An AST-based decoder (Yin and Neubig, 2017) was first proposed for generating general-purpose programming languages. IRNet (Guo et al., 2019) used a similar AST-based decoder to decode a more abstracted intermediate representation (IR), which is then transformed into an SQL query. RAT-SQL introduced a relation-aware transformer encoder to improve the joint encoding of question and schema, and reached the best performance on the Spider (Yu et al., 2018) dataset. BRIDGE leverages the database content to augment the schema representation. RYANSQL (Choi et al., 2020) formulates the Text-to-SQL task as a slot-filling task to predict each SELECT statement. EditSQL, IGSQL (Cai and Wan, 2020) and R²SQL (Hui et al.) consider the dialogue context when translating the utterance into a SQL query. GAZP (Zhong et al., 2020) proposes a zero-shot method to adapt an existing semantic parser to new domains. PIIA proposes a human-in-the-loop method to enhance Text-to-SQL performance.
Graph Neural Network Graph neural networks (GNN) (Li et al., 2015) have been widely applied in various NLP tasks, such as text classification (Chen et al., 2020b; Lyu et al., 2021), text generation, dialogue state tracking (Chen et al., 2020a) and dialogue policy (Chen et al., 2018a,b, 2019, 2020c). They have also been used to encode the schema in a more structured way. Prior work (Bogin et al., 2019a) constructed a directed graph of foreign key relations in the schema and then got the corresponding schema representation with a GNN. Global-GNN (Bogin et al., 2019a) also employed a GNN to derive the representation of the schema and softly select a set of schema nodes that are likely to appear in the output query. Then, it discriminatively re-ranks the top-K queries output from a generative decoder. We proposed the Graph Projection Neural Network (GPNN), which is able to extract the abstract representation of the NL question and the semantic schema.
Generalization Capability To improve the compositional generalization of sequence-to-sequence models, the SCAN (Simplified version of the CommAI Navigation tasks) dataset (Lake and Baroni, 2018) has been published. The SCAN task requires models to generalize knowledge gained about the other primitive verbs ("walk", "run" and "look") to the unseen verb "jump". Russin et al. (2019) separate syntax from semantics in the question representation, where the attention weight is calculated based on syntax vectors but the hidden representation of the decoder is the weighted sum of the semantic vectors. Different from this work, we look at the semi-structured schema from two perspectives (schema structure and schema semantics). Our proposed GPNN aims to use the schema semantics as the bridge to get abstract representations of the question and schema.

Conclusion
In this paper, we propose a graph projection neural network (GPNN) to abstract the representation of question and schema in a simple attention-based way. We further unify the abstract representations of question and schema output by GPNN with a relation-aware transformer (RAT). The experiments demonstrate that our proposed ShadowGNN achieves excellent performance on the challenging Text-to-SQL task. Especially when the annotated training data is limited, ShadowGNN obtains larger gains in exact match accuracy and convergence speed. The ablation studies further indicate the effectiveness of our proposed GPNN. Recently, we notice that some Text2SQL-specific pretrained models have been proposed, e.g., TaBERT (Yin et al., 2020) and GraPPa. In future work, we will evaluate our proposed ShadowGNN with these adaptive pretrained models.