A Tale of Two Linkings: Dynamically Gating between Schema Linking and Structural Linking for Text-to-SQL Parsing

In Text-to-SQL semantic parsing, selecting the correct entities (tables and columns) for the generated SQL query is both crucial and challenging; the parser is required to connect the natural language (NL) question and the SQL query to the structured knowledge in the database. We formulate two linking processes to address this challenge: schema linking which links explicit NL mentions to the database and structural linking which links the entities in the output SQL with their structural relationships in the database schema. Intuitively, the effectiveness of these two linking processes changes based on the entity being generated, thus we propose to dynamically choose between them using a gating mechanism. Integrating the proposed method with two graph neural network-based semantic parsers together with BERT representations demonstrates substantial gains in parsing accuracy on the challenging Spider dataset. Analyses show that our proposed method helps to enhance the structure of the model output when generating complicated SQL queries and offers more explainable predictions.


Introduction
Semantic parsing, which aims at mapping natural language (NL) utterances to computer understandable logic forms or programming languages, has been an active research topic in the field of natural language processing (NLP) for decades (Zettlemoyer and Collins, 2005;Liang et al., 2011). Although a variety of logic forms have been studied by researchers, Text-to-SQL has particularly attracted a large amount of attention due to the desire of natural language interfaces to database (NLIDB) (Warren and Pereira, 1982;Zelle and Mooney, 1996;Dong, 2019) for both scientific and industrial reasons. Recently, there is a growing interest in neural based Text-to-SQL semantic parsing, thanks to the development of new evaluation paradigms and datasets (Iyer et al., 2017;Zhong et al., 2017;Yu et al., 2018;. Text-to-SQL parsing requires strict structured prediction due to its application scenario where the output SQL will be sent to an executor program directly. To enhance the capacity of an auto-regressive model to capture structural information, current state-of-the-art semantic parsers usually adopt a grammar-based decoder (Xiao et al., 2016;Yin and Neubig, 2017;. Rather than directly generating the tokens in a traditional sequence-to-sequence manner, grammar-based decoders produce a sequence of production rules to construct an abstract syntax tree (AST) of the corresponding SQL. As the grammar constraints narrows down the search space to only grammatically valid ASTs, those parsers can usually generate well-formed SQL skeletons (Guo et al., 2019;Bogin et al., 2019a).
However, it is still difficult for current state-of-the-art models to fill in the skeletons with semantically correct entities, especially when they are required to generalize to unseen DB schemas (Yu et al., 2018;Suhr et al., 2020). To predict the correct entity, the model should have a database (DB) schema grounded understanding of the NL question, which means that the model should be able to jointly learn the semantics in the NL question and the structured knowledge in a given database. We formulate two types of entity generation problems, which can be addressed by the following two linking processes respectively. 1 Q: For each continent, list its id, name, and how many countries it has? SELECT t1 . contid , t1 . continent , COUNT( * ) FROM continents AS t1 JOIN countries AS t2 ON t1 . contid = t2 . continent GROUP BY t1 . contid ; 2 Q1: What is the average, minimum, and maximum age of all singers from :::::: France? Q2: What is the average, minimum, and maximum age for all :::::: French singers? SELECT AVG( age ) , MIN( age ) , MAX( age ) FROM singer WHERE country = 'France ' ; 3 Q: What is the first name and gender of the all the students who have more than one pet? SELECT t1 . fname , t1 . sex FROM student AS t1 JOIN has pet AS t2 ON t1 . stuid = t2 . stuid GROUP BY t1 . stuid HAVING COUNT( * ) > 1 underline indicates the mention can only be resolved by linking to a cell value or common sense reasoning.
Schema linking. Schema linking (Guo et al., 2019;Wang et al., 2020) is an instance of entity linking (Shen et al., 2014) in the context of linking to relational DB schema. Text-to-SQL semantic parsers should learn to recognize an entity mention in the NL question and link it to the corresponding unique entity in the DB schema. This task can be challenging due to the diversity and ambiguity NL mentions. However, in practice, the solution is often relatively easy when a particular entity is well realized with similar wording in both the NL question and DB schema. As shown in Table 1, in Spider (Yu et al., 2018), the underlined mentions can almost exactly match the corresponding schema entities. Therefore, current state-of-the-art parsers normally address this problem with simple string matching or embedding matching modules. Structural linking. While some entities are generated because they are mentioned in the NL question, others can be generated because of their role as special functional components in SQL, e.g., contid and stuid in the ON clauses of the first and third examples in Table 1. These entities usually cannot find their corresponding mentions in the NL question but can be induced by the structural constraints of SQL. Such phenomenons are generally referred to as the structural mismatch between NL and formal languages (Kwiatkowski et al., 2013;Berant and Liang, 2014). This process frequently occurs when generating complex SQL queries. We propose to treat this entity generation process as finding a structural link between current candidates and past generated entities. Although previous work has considered some simple structural constraints (Guo et al., 2019), to the best of our knowledge, we are the first to formally describe this process. While schema linking and structural linking can complement each other (e.g., the entity used by the GROUP BY clause needs to be a column in a previously selected table and also has its mention in the NL question), they actually address different types of problems. In most cases, e.g., the examples in Table 1 described before, a decoder may need to discriminate one from another for better generation performance.
In this work, we propose to use a dynamic gating mechanism to serve as a switch between the two linking processes. Schema linking can be implemented as the decoder attending to the encoded representations of the NL question to find an entity mention, then locating the corresponding entity from the schema. On the other hand, in structural linking, our decoder performs self-attention over those past decoder states where an entity was generated and further makes a decision between copying one of the previously generated entities or taking one of its linked entities (e.g., the foreign key of a previously joined table) based on the schema structure. From the model's perspective, it can also be viewed as a memory pointer network that enhances the structured prediction ability of an auto-regressive model, and the dynamic gating determines when to emphasize this enhancement. Our proposed method can be easily applied to most semantic parsers as long as their decoders explicitly or implicitly have two modules that deal with schema linking and structural linking.
We integrate the dynamic gating technique to two state-of-the-art Text-to-SQL parsing models (Bogin et al., 2019a;Bogin et al., 2019b) and further augment them with pretrained BERT (Devlin et al., 2019) word representations. We evaluate our model on the Spider dataset which is challenging because of its  Table Nodes Primary-Foreign Key Edges  Figure 1: An illustration of our proposed method when running the first example shown in Table 1. This figure shows two independent entity generation procedures, the top one favors schema linking while the bottom one favors structural linking. Some details are omitted for the sake of simplicity. Grammars are simplified to fit in the limited space, readers are encouraged to refer to  for details.
cross-domain setting where a model needs to generalize to not only complex SQL but also unseen DBs. Experimental results show that our proposed method consistently yields an improvement of more than 3% on exact set matching accuracy and sees the most benefits when generating complex SQL. Further analysis confirms that the models are dynamically switching between the two linking processes.

Approach
In this section, we first formulate the Text-to-SQL semantic parsing task in §2.1. We will then describe the details of our proposed method in §2.2.

Text-to-SQL Semantic Parsing
The task of Text-to-SQL semantic parsing is to predict a SQL query S based on input (Q, G) where Q = {q 1 , . . . , q |Q| } is the NL question and G = (V, E) is the DB schema being queried. In the schema, V = {(e 1 , t 1 ), . . . , (e |V| , t |V| )} is a set which usually contains two types of entities (i.e., tables and columns 1 ) and their textual descriptions (i.e., table names and column names), while |E| , l |E| )} contains the relations l between source entity e (s) and target entity e (t) , e.g., table-column relationships, foreign-primary key relationships, 2 etc. The output S = {a 1 , . . . , a |S| } is a sequence of decoder actions which further compose an AST of SQL.
Typical state-of-the-art Text-to-SQL parsers, consist of three components: a NL encoder, a schema encoder and a grammar decoder (Guo et al., 2019;Bogin et al., 2019a).
The NL encoder takes the NL question tokens Q as input, maps them to word embeddings E Q , then feeds them to a Bi-LSTM (Hochreiter and Schmidhuber, 1997). The hidden states of the Bi-LSTM serve as the contextual word representation of each token.
The schema encoder takes G as input and builds a relation-aware entity representation for every entity in the schema. The initial representation of an entity is a combination of its words embeddings and type information. Then self-attention Shaw et al., 2019) or graph-based models (Bogin et al., 2019a;Wang et al., 2020) are utilized to exploit the relational information between each pair of entities from the DB schema, thus produce the final representation of all entities H V ∈ R |V|×dim . We will detail this in §3.
Finally, a grammar decoder (Xiao et al., 2016;Yin and Neubig, 2017; generates an AST of output SQL in a depth-first order. The decoder is typically an auto-regressive model (e.g., LSTM) which estimates the probability of generating an action sequence.
There are two cases of using actions. Depending a specific case, an action is either (i) producing a new production rule to unfold the leftmost non-terminal node in the AST, or (ii) generating an entity (e.g., a table or a column) from the DB schema if it is required by last output production rule. In the former case, at step t, the decoder normally uses its hidden states h t to retrieve a context vector c t from the NL encoder. Then an action embedding a t is produced based on the concatenation of h t and c t . This action embedding will directly predict a production rule from the target vocabulary which is a subset of a fixed number of production rules. For the latter case, the decoder needs to estimate a probability distribution over a schema-specific vocabulary under grammatical constraints which come from the structure of both the output SQL and the DB schema, as well as semantic constraints implied in the NL question.

Dynamic Gating
In this paper, we focus on the decision made when the decoder is looking for an entity to fill in a slot (i.e., case (ii) in the last paragraph). Our decoder predicts the entity based on a mixed probability model consisting of two processes: • Schema linking. The decoder attends to the output of the NL encoder (which can be seen as selecting a most relevant NL mention), then finds the corresponding entity based on string-matching or embedding-matching results.
• Structural linking. The decoder self-attends to the output states from those previous decoding steps which have generated entities, then finds another entity which is structurally linked to the attended entity.
The choice between them is controlled by a gating mechanism called the link gate. Formally, the marginal probability of generating an entity e is defined as follows: where Pr(SCHM) and Pr(STRCT) are the probability of choosing schema linking and structural linking respectively. They are further computed as: Equation 2 stands for our proposed link gate which is computed by the action embedding a t . The reason for purely basing the gate value on a t is that intuitively the choice between the two processes is about the role of the current entity we want to generate. The role of an entity is determined by the SQL clause that contains it. Since a t is directly used to predict a production rule in case (i), it should be able to capture this information. The link gate allows the decoder to dynamically choose between information from our two linking processes, and prevents them from interfering each other.
In practice, we model the probability of the schema linking process generating an entity mentioned in the NL question, namely Pr(e|SCHM), as a multiplication of the attention weights λ ∈ R |Q| over the NL encoder outputs and a schema linking matrix M ∈ R |Q|×|V| . The probability of the structural linking process, namely Pr(e|STRCT), is similarly computed by multiplying decoder self-attention weights and a structural linking matrix T ∈ R |V|×|V| .
The structural linking matrix T captures the relationship between every pair of entities given the relational DB schema. Common structural links include relations between a table and its columns, a table and its primary/foreign key, a primary key and one of its linked foreign keys in other tables, etc. There are also multi-steps links which are the combinations of the one-step links listed above. Note that there may not be a unique link between every pair of entities and some entities may not have a link between them at all. Meanwhile, an entity can link to itself which can be considered to be a special zero-step structural link. It is the only structural link modeled in most current Text-to-SQL semantic parsers.
We compute the structural linking score between an entity e i and e j by an additive attention mechanism (Bahdanau et al., 2015), as follows: where e i and e j are the corresponding entity representations retrieved from H V , while v α and W α are both trainable parameters. Compared to the dot-product attention used in Bogin et al. (2019a) and Bogin et al. (2019b), the additive attention we used here is expected to capture more of the structural relationship between entity pairs rather than only the similarity of entity representations.
T is expected to capture all types of relationships between entities, but it can be overwhelmed by the large workload. So, we single out the zero-step relationship (i.e., copying) and address it by another structural linking matrix T copy = I, which is trivially an identity matrix in this case. To choose between copying and other types of links, a copy gate (ρ copy ) is obtained in the same manner as to how we compute the link gate in Equation 2.
We use decoder self-attention to find the past generated entity which could have structural constraints on the entity that we currently want to generate.
H m is a memory matrix consisting of the action embeddings from every past decoding step which has generated an entity. In this way, we compute two sets of attention weights β copy and β link for copying and linking separately using the additive attention (Bahdanau et al., 2015) again. The motivation for separate attention weights is that these two linking patterns might need to attend to different generated entities. Overall the probability of generating an entity via structural linking is modeled as: Pr(a t = e i |STRCT) = ρ copy (β copy T copy ) i + (1 − ρ copy )(β link T ) i Finally, this probability is mixed with the probability of schema linking controlled by the link gate.

Model Implementation
In this section, we describe how we integrate our proposed method into a grammar decoder and leverage the entity representation from a GNN module. We use the type constrained grammar decoder from . To predict a t at time step t, the decoder will first obtain the context vector c t from the NL encoder by performing dot-product attention (Luong et al., 2015). Then the action embedding is generated by a feed-forward network taking the concatenation of decoder hidden state and context vector as input.
a t is used to predict the production rule or estimate the gate values in the entity generation process. We adopt the idea from (Bogin et al., 2019a;Bogin et al., 2019b) to learn a schema relation-aware entity representation H V by a GNN module. 3 The initial embedding of each entity h (0) e is defined as a non-linear transformation of the combination of its type embedding and the average over the word embeddings of its neighbors in the schema graph. In later time steps, the hidden state is updated by a gated recurrent unit (Cho et al., 2014;Li et al., 2016) is defined as a weighted summation over the hidden states of its neighbor entities: They consider three edge types, i.e., bidirectional edges between a table and its contained columns ↔, unidirectional edges between a foreign key and a connected primary key ← and its reverse version →. Given a fixed GNN recurrence step L, we have the final hidden states of all the entities in the graph as the entity representation H V = {h (L) e |(e, t) ∈ V}. We also adopt their schema linking module to create a schema linking matrix M based on word embedding similarity and some simple manually design features (e.g., editing distance and lemma). In their GLOBALGNN (Bogin et al., 2019b), an additional GNN and an auxiliary training loss are added to filter out irrelevant nodes in the graph, thus producing a better entity representation.
To augment our model with pretrained BERT embeddings, we follow Hwang et al. (2019) and  to feed the concatenation of NL question and the textual descriptions of DB entities to BERT and use the top layer hidden states of BERT as the input embeddings.

Experiments
We evaluate the effectiveness of our proposed method by integrating it into two state-of-the-art semantic parsers on the Spider dataset and further ablate out some components to understand their contributions.

Experiment Setup
We implement our model using PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2018). For the GNN and GLOBALGNN models we revise and build upon the code released in (Bogin et al., 2019a;Bogin et al., 2019b). We re-ran the experiment and report the results on our re-implementation and found our results slightly improves upon their reported results. In BERT experiments, we use the base uncased BERT model with 768 hidden size provided by HuggingFace's Transformers library (Wolf et al., 2019). We follow the database split setting of Spider, where any databases that appear at testing time are ensured to be unseen at training time. Our code and models are available at https://github.com/ sanxing-chen/linking-tale.

Experimental Results
The experimental results in Table 3 show that our proposed gating mechanism leads a substantial improvement on all the GNN, GLOBALGNN, and BERT baselines. Spider questions are divided into different levels of difficulty (hardness). Most of the improvements come from gains in complicated (i.e., Medium, Hard and Extra Hard) SQL generation. Specially, we observe up to 13.8% gains in the Hard set when applying our method on the GLOBALGNN baseline. One major contribution comes from the partial matching F1 score of IUEN (i.e., SQL clauses INTERSECT, UNION, EXCEPT, NESTED which only appear in Hard and Extra Hard levels) increasing from 25.4% to 39.7%. We also notice that the   Table 3: Exact Set Matching Accuracy on SQL queries with different hardness levels in the development set of Spider. Greatest improvements in the Hard level; small fluctuation in Easy level due to gate bias.
SQL output well-formedness is improved. For instance, before applying our method the decoder would occasionally select the same columns twice to perform the ON clause. 4 After applying our dynamic gating, this issue is virtually eliminated (error rate from 2% to 0.2%). Link gate link Copy gate copy Figure 3: Value distributions of the link gate and copy gate measured in dev set of Spider using the GLOBALGNN model. Values of copy gate are considered when the corresponding link gate value is small (ρ link < 0.1). Figure 3, the values of both gates are polarized to 0 or 1, thus making the gating mechanism act as a binary gate. These statistics coincide with our hypothesis that most entity generation decisions in Text-to-SQL can be solely made by evidence from either schema linking or structural linking. In addition, among all the cases where structural linking is chosen and the copy gate takes control, fewer than 20% of cases favor copying. This suggests that there are lots of circumstances where different kinds of structural linking are adopted.

Alternative Approaches and Ablation
We also conduct several experiments to examine several design choices in our proposed method. Sharing Action Embedding. In the design of our gating mechanism, one critical decision is to use the action embedding a t to perform decoder self-attention and produce the gating values. This is based on our intuition that the action embedding captures the structural information of the output SQL at the current position. To verify this decision, we conduct an ablation experiment by using a dedicated embedding to produce the gating value. This dedicated embedding is produced in exactly same way as we generated a t in Equation 9, but uses a different set of parameters for the feed-forward network. As we can see from Figure 4 ("dedicated embed"), sharing the parameters with the action embedding is important. Keeping entities. Guo et al. (2019) also uses a memory-augmented pointer network to perform a copy mechanism which assists column selection. In contrast with our memory matrix consisting of action embeddings (Equation 7), their memory matrix consists of the entity embeddings of columns that have been selected previously. They further remove the columns from the candidates in the schema linking process once they are generated to prevent the decoder from repeatedly generating the same columns. To

NL
Show the stadium name and the number of concerts in each stadium.  Table 5: Sample predictions of our model. ρ link and ρ copy are abbreviated as ρ l and ρ c respectively. Gate values are not applicable (denoted by N/A) for the first entity since it has no previously generated entity. Some of the results reflect gate bias, see text for details. determine if our dynamic gating and structural linking module can add enough structural constraints to the decoding process to resolve this kind of problem, we conduct an experiment where we also remove the entities in the schema linking process after they are generated (i.e., "removing entity" in Figure 4) to see if it further improves the model. Our results shows that this change can actually hurts the model. Specifically, we observe a drop in accuracy of the WHERE and IUEN clauses, which suggests that in our context, the information about a specific entity in the schema linking process is still useful even after the entity has been generated once.  Table 4: Alternative approaches and ablation results.
Copy Gate. In addition, removing the copy gate and copy mechanism also harms the performance of our model (i.e., "without copy" in Figure 4). This result confirms that it is beneficial to handle different types of structural links separately. We hypothesize that different types of structural links conflict with each other, so they are hard to fit in one structural linking matrix. Overall, these two results further supports our claim that the copy gate can determine when to copy or link to an entity by itself.

Error Analysis and Discussion
Gate bias. Error analysis reveals that our gating mechanism is biased, e.g., for the first few entities being selected in a SQL query the link gate is trained to favor schema linking in most cases. But, such bias could sometimes be wrong. In such cases where structural linking is needed but absent, the model may select duplicate columns or the wrong table during decoding. Similarly, the copy gate might be biased toward copying an entity from memory in GROUP BY clauses. Out of all the SQL clause components, only the GROUP BY clause's partial matching F1 score drops (about 3%) due to this copy gate bias. It is true that the entity needed in the GROUP BY clause is usually selected, but the information from schema linking can still be beneficial, 6 e.g., in the second example of Table 5, the model wants to copy the wrong entity but the link gate rectifies it with schema linking information. Our gating mechanism only relies on the current action embedding (which can be seen as short-term structural information) to determine the gate values. We believe that introducing more global structural constraints is a promising direction to find a more flexible and accurate gating mechanism. Short attention spans. In our experiments, we notice that the action history the model usually attends to is very short, i.e., the model only utilizes the the output memory of the most recent three entities in 99% of cases. This coincides with similar findings in language modeling (Daniluk et al., 2017) where the augmented-memory was expected to facilitate the modeling of long-range dependencies but failed to do so. Although long-term context is important for language modeling, it is not as important in our Text-to-SQL scenario since most dependencies in programming languages like SQL lie within a short span. We are interested in exploring semantic parsing tasks which requires long-term structural constraints using our method in the future. Structural linking patterns. We have shown the effectiveness of our method in dealing with different types of structural links separately using different components of the model in the previous section. So far, the only special type of structural links we can explicitly model is the copy mechanism, and we treat all other types of links uniformly using additive attention in Equation 5. This might limit the model's ability to take advantage of the complicated relationships between entities. Currently, the entity representation provided by the GNN model is still difficult to explain, because the node representations contain a mix of information from different message-passing steps. One could imagine training GNNs with different message-passing steps each modeling a different level of structural linking, could lead to a more clear and expressive linking pattern.

Related Work
Semantic parsing. Semantic parsing research focus on mapping NL to formal languages like lambda calculus (Zettlemoyer and Collins, 2005;Kwiatkowksi et al., 2010;Liang et al., 2011;Dong and Lapata, 2016), Prolog-style queries (Zelle and Mooney, 1996;Tang and Mooney, 2000), and more recently to SQL (Warren and Pereira, 1982;Popescu et al., 2003;Giordani and Moschitti, 2009;Zhong et al., 2017;Iyer et al., 2017). It can also tackle the problem of parsing NL descriptions to complicated general-purpose programming language such as Python (Ling et al., 2016;Rabinovich et al., 2017;Yin and Neubig, 2017). Our proposed method is tested for Text-to-SQL parsing and can be adapted to other semantic parsing applications. Structural mismatch. Programming languages like SQL express the same intent in a completely different way from NL by design (Kate, 2008). The phenomenon called structural mismatch widely exists between NL and various programming language and is a major challenge in semantic parsing (Dong, 2019). To alleviate the structural mismatch problem, early approaches rely on linguistic formalisms like parsing results from flexible CCGs (Zettlemoyer and Collins, 2005;Zettlemoyer and Collins, 2007;Kwiatkowski et al., 2011;Kwiatkowski et al., 2013). Chen et al. (2016) proposed to use sentence rewriting to revise the NL question to a new question which has the same structure with the targeted logical form. Recently, Guo et al. (2019) proposed to first translate the NL question to an intermediate representation (IR) designed to bridge NL and SQL, then use a deterministic algorithm to convert the IR to SQL. In addition to taking a considerable amount of engineering effort, their designed IR is still unable to cover some SQL grammars like the self-join in the ON clause, and is more challenging to apply to other programming languages. We deal with this problem by explicitly modeling the prediction structure with external predefined structure (i.e., DB schema) by structural linking. Memory pointer network. Memory networks were first introduced in the context of the question answering task, where they served as a differentiable long-term knowledge base to enhance an autoregressive model's poor memory Sukhbaatar et al., 2015). Copy mechanisms use attention as a pointer to select and copy items from source text, thus addressing the problem of a variable output vocabulary size (Vinyals et al., 2015;See et al., 2017). Recent research has applied memory-augmented pointer networks to various NLP tasks, including task-oriented dialogue (Wu et al., 2019) and also semantic parsing (Liang et al., 2017;Guo et al., 2019). Our dynamic gating mechanism can also be seen a memory controller, except our memory is read-only and acts as both the query and key in a pointer network. Different from most current techniques, our pointer network does not only perform copying but can also point to a start point of structural linking.

Conclusion
In this paper, we formulated the entity generation process in Text-to-SQL semantic parsing as two kinds of linking problems, namely schema linking and structural linking. We further proposed a dynamic gating mechanism to explicitly model the decision between these two linking processes. Experimental results show the effectiveness of our proposed method and confirm our intuitions. In the future, we would like to apply our proposed method to other semantic parsing tasks, such as general purpose code generation, where structural constraints may be more important.