Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing

Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In \spider{}, a recently-released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network, and this representation is later used at both encoding and decoding time. Evaluation shows that encoding the schema structure improves our parser accuracy from 33.8% to 39.4%, dramatically above the current state of the art, which is at 19.7%.


Introduction
Semantic parsing (Zelle and Mooney, 1996;Zettlemoyer and Collins, 2005) has recently taken increased interest in parsing questions into SQL queries, due to the popularity of SQL as a query language for relational databases (DBs).
Work on parsing to SQL (Zhong et al., 2017;Iyer et al., 2017;Finegan-Dollak et al., 2018;Yu et al., 2018a) has either involved simple DBs that contain just one table, or had a single DB that is observed at both training and test time. Consequently, modeling the schema structure received little attention. Recently, Yu et al. (2018b) presented SPIDER, a text-to-SQL dataset, where at test time questions are executed against unseen and complex DBs. In this zero-shot setup, an informative representation of the schema structure is important. Consider the questions in Figure 1: while their language structure is similar, in the first query a 'join' operation is necessary because the information is distributed across three tables, while in the other query no 'join' is needed. In this work, we propose a semantic parser that strongly uses the schema structure. We represent the structure of the schema as a graph, and use graph neural networks (GNNs) to provide a global representation for each node (Li et al., 2016;Cao et al., 2018;Sorokin and Gurevych, 2018). We incorporate our schema representation into the encoder-decoder parser of , which was designed to parse questions into queries against unseen semi-structured tables. At encoding time we enrich each question word with a representation of the subgraph it is related to, and at decoding time we emit symbols from the schema that are related through the graph to previously decoded symbols.
We evaluate our parser on SPIDER, and show that encoding the schema structure improves accuracy from 33.8% to 39.4% (and from 14.6% to 26.8% on questions that involve multiple tables), well beyond 19.7%, the current stateof-the-art. We make our code publicly available at https://github.com/benbogin/ spider-schema-gnn.

Problem Setup
We are given a training set {(x (k) , y (k) , S (k) )} N k=1 , where x (k) is a natural language question, y (k) is its translation to a SQL query, and S (k) is the schema of the DB where y (k) is executed. Our goal is to learn a function that maps an unseen question-schema pair (x, S) to its correct SQL query. Importantly, the schema S was not seen at training time, that is, S = S (k) for all k.
A DB schema S includes: (a) The set of DB tables T (e.g., singer), (b) a set of columns C t for each t ∈ T (e.g., singer name), and (c) a set of foreign key-primary key column pairs F, where each (c f , c p ) ∈ F is a relation from a foreignkey c f in one table to a primary-key c p in another. We term all schema tables and columns as schema items and denote them by V = T ∪ {C t } t∈T .

A Neural Semantic Parser for SQL
We base our model on the parser of , along with a grammar for SQL provided by AllenNLP (Gardner et al., 2018), which covers 98.3% of the examples in SPIDER. This parser uses a linking mechanism for handling unobserved DB constants at test time. We review this model in the context of text-to-SQL parsing, focusing on components we expand upon in §4.
Linking schema items To handle unseen schema items,  learn a similarity score s link (v, x i ) between a word x i and a schema item v that has type τ . 1 The score is based on learned word embeddings and a few manually-crafted features.
The linking score is used to compute where V τ are all schema items of type τ and s link (∅, ·) = 0 for words that do not link to any schema item. The functions p link (·) and s link (·) will be used to decode unseen schema items.
Encoder A Bidirectional LSTM (Hochreiter and Schmidhuber, 1997) provides a contextualized representation h i for each question word x i . Importantly, the encoder input at time step i is [w x i ; l i ]: the concatenation of the word embedding for x i and l where r v is a learned embedding for the schema item v. 2 Thus, p link (v | x i ) augments every word x i with information on the schema items it should link to.
Decoder We use a grammar-based (Xiao et al., 2016;Cheng et al., 2017;Yin and Neubig, 2017;Rabinovich et al., 2017) LSTM decoder with attention on the input question. At each decoding step, a non-terminal of type τ is expanded using one of the grammar rules. Rules are either schema-independent and generate non-terminals or SQL keywords, or schema-specific and generate schema items. At each decoding step j, the decoding LSTM takes a vector g j as input, which is an embedding of the grammar rule decoded in the previous step, and outputs a vector o j . If this rule is schemaindependent, g j is a learned global embedding. If it is schema-specific, i.e., a schema item v was generated, g j is a learned embedding τ (v) of its type. An attention distribution a j over the input words is computed as usual (Bahdanau et al., 2015), as well as a weighted average of the input c j = i a j h j . Now a distribution over grammar rules is computed by: , where G legal , V legal are the number of legal rules (according to the grammar) that can be chosen at time step j for schema-independent and schemaspecific rules respectively. The score s glob j is computed with a feed-forward network, and the score s loc j is computed for all legal schema items by multiplying the matrix S link ∈ R V legal ×|x| , which contains the relevant linking scores s link (v, x i ), with the attention vector a j . Thus, decoding unseen schema items is done by first attending to the question words, which are linked to the schema items.

Modeling Schemas with GNNs
Schema structure is informative for predicting the SQL query. Consider a table with two columns, where each is a foreign key to two other tables (student semester table in Figure 2). Such a table is commonly used for describing a manyto-many relation between two other tables, which affects the output query. We now show how we represent this information in a neural parser and use it to improve predictions.
At a high-level our model has the following parts (Figure 2). (a) The schema is converted to a graph. (b) The graph is softly pruned conditioned on the input question. (c) A Graph neural network generates a representation for nodes that is aware of the global schema structure. (d) The encoder and decoder use the schema representation. We will now elaborate on each part.
Schema-to-graph To convert the schema S to a graph ( Figure 2, left), we define the graph nodes as the schema items V. We add three types of edges: for each column c t , we add edges (c t , t) and (t, c t ) to the edge set E ↔ (green edges). For each foreignprimary key column pair (c t 1 , c t 2 ) ∈ F, we add edges (c t 1 , c t 2 ) and (t 1 , t 2 ) to the edge set E → and edges (c t 2 , c t 1 ) and (t 2 , t 1 ) to E ← (dashed edges). Edge types are used by the graph neural network to capture different ways in which columns and tables relate to one another.
Question-conditioned relevance Each question refers to different parts of the schema, and thus, our representation should change conditioned on the question. For example, in Figure 2, the relation between the tables student semester and program is irrelevant. To model that, we re-use the distribution p link (·) from §3, and define a relevance score for a schema item v: ρ v = max i p link (v | x i ) -the maximum probability of v for any word x i . We use this score next to create a question-conditioned graph representation. Figure 2 shows relevant schema items in dark orange, and irrelevant items in light orange.
Neural graph representation To learn a node representation that considers its relevance score and the global schema structure, we use gated GNNs (Li et al., 2016). Each node v is given an initial embedding conditioned on the relevance score: h We then apply the GNN recurrence for L steps. At each step, each node re-computes its representation based on the representation of its neighbors in the previous step: using a standard GRU update (see Li et al. (2016)).
We denote the final representation of each schema item after L steps by ϕ v = h (L) v . We now show how this representation is used by the parser.
Encoder In §3, a weighted average over schema items l i was concatenated to every word x i . To enjoy the schema-aware representations, we compute l ϕ i = τ v∈Vτ ϕ v p(v | x i ), which is identical to l i , except ϕ v is used instead of r v . We concatenate l ϕ i to the output of the encoder h i , so that each word is augmented with the graph structure around the schema items it is linked to.
Decoder As mentioned ( §3), when a schema item v is decoded, the input in the next time step is its type τ (v). A first change is to replace τ (v) by ϕ v , which has knowledge of the structure around v. A second change is a self-attention mechanism that links to the schema, which we describe next.
When scoring a schema item, its score should depend on its relation to previously decoded schema items. E.g., in Figure 2, once the table semester has been decoded, it is likely to be joined to a related table. We capture this intuition with a self-attention mechanism.
For each decoding step j, we denote by u j the hidden state of the decoder, and byĴ = (i 1 , . . . , i |Ĵ| ) the list of time steps before j where a schema item has been decoded. We define the ma-trixÛ ∈ R d×|Ĵ| = [u i 1 , . . . , u i |Ĵ| ], which concatenates the hidden states from all these time steps. We now compute a self-attention distribution over these time steps, and score schema items based on this distribution (Figure 2, right): where the matrix S att ∈ R |Ĵ|×V legal computes a similarity between schema items that were previously decoded, and schema items that are legal according to the grammar: is a feed-forward network. Thus, the score of a schema item increases, if substantial attention is placed on schema items to which it bears high similarity.

Experiments and Results
Experimental setup We evaluate on SPIDER (Yu et al., 2018b), which contains 7,000/1,034/2,147 train/development/test examples. We pre-process examples to remove table aliases from the queries, and add table names to all columns (details in Appendix B). We maximize the log-likelihood of the gold sequence during training, and use beamsearch (of size 10) at test time, similar to prior work. We run the GNN for L = 2 steps. We use the official evaluation script from SPIDER to compute accuracy, i.e., whether the predicted query is equivalent to the gold query. Results Our full model (GNN) obtains 39.4% accuracy on the test set, substantially higher than prior state-of-the-art (SYNTAXSQLNET), which is at 19.7%. Removing the GNN from the parser (NO GNN), which results in the parser of , augmented with a grammar for SQL, obtains an accuracy of 33.8%, showing the importance of encoding the schema structure. 34.9% to 40.7%. Importantly, using schema structure specifically improves performance on questions with multiple tables from 14.6% to 26.8%. We ablate the major novel components of our model to assess their impact. First, we remove the self-attention component (NO SELF ATTEND). We observe that performance drops by 2 points, where SINGLE slightly improves, and MULTI drops by 6.5 points. Second, to verify that improvement is not only due to self-attention, we ablate all other uses of the GNN. Namely, We use a model identical to NO GNN, except it can access the GNN representations through the self-attention (ONLY SELF ATTEND). We observe a large drop in performance to 35.9%, showing that all components are important. Last, we ablate the relevance score by setting ρ v = 1 for all schema items (NO REL.). Indeed, accuracy drops to 37.0%.
To assess the ceiling performance possible with a perfect relevance score, we run an oracle experiment, where we set ρ v = 1 for all schema items that are in the gold query, and ρ v = 0 for all other schema items (GNN ORACLE REL.). We see that a perfect relevance score substantially improves performance to 54.3%, indicating substantial headroom for future research. join analysis For any model, we can examine the proportion of predicted queries with a join, where the structure of the join is "bad": (a) when the join condition clause uses the same table twice (ON t1.column1 = t1.column2), and (b) when the joined table are not connected through a primary-foreign key relation.
We find that NO GNN predicts such joins in 83.4% of the cases, while GNN does so in only 15.6% of cases. When automatically omitting from the beam candidates where condition (a) occurs, NO GNN predicts a "bad" join in 14.2% of the cases vs. 4.3% for GNN (total accuracy increases by 0.3% for both models). As an example, in Figure 2, s loc j scores the table student the highest, although it is not related to the previ-ously decoded table semester. Adding the selfattention score s att j corrects this and leads to the correct student semester, probably because the model learns to prefer connected tables.