Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions

We focus on the cross-domain context-dependent text-to-SQL generation task. Based on the observation that adjacent natural language questions are often linguistically dependent and their corresponding SQL queries tend to overlap, we utilize the interaction history by editing the previous predicted query to improve the generation quality. Our editing mechanism views SQL as sequences and reuses generation results at the token level in a simple manner. It is flexible to change individual tokens and robust to error propagation. Furthermore, to deal with complex table structures in different domains, we employ an utterance-table encoder and a table-aware decoder to incorporate the context of the user utterance and the table schema. We evaluate our approach on the SParC dataset and demonstrate the benefit of editing compared with the state-of-the-art baselines which generate SQL from scratch. Our code is available at https://github.com/ryanzhumich/sparc_atis_pytorch.


Introduction
Generating SQL queries from user utterances is an important task to help end users acquire information from databases. In a real-world application, users often access information in a multi-turn interaction with the system by asking a sequence of related questions. As the interaction proceeds, the user often makes reference to the relevant mentions in the history or omits previously conveyed information assuming it is known to the system. Therefore, in the context-dependent scenario, the contextual history is crucial to understand the follow-up questions from users, and the system often needs to reproduce partial sequences generated in previous turns. Recently,  proposes a context-dependent text-to-SQL model including an interaction-level encoder and an attention mechanism over previous utterances. To reuse what has been generated, they propose to copy complete segments from the previous query. While their model is successful to reason about explicit and implicit references, it does not need to explicitly address different database schemas because the ATIS contains only the flight-booking domain. Furthermore, the model is confined to copy whole segments which are extracted by a rule-based procedure, limiting its capacity to utilize the previous query when only one or a few tokens are changed in the segment.
To exploit the correlation between sequentially generated queries and generalize the system to different domains, in this paper, we study an editing-based approach for cross-domain contextdependent text-to-SQL generation task. We propose query generation by editing the query in the previous turn. To this end, we first encode the previous query as a sequence of tokens, and the decoder computes a switch to change it at the token level. This sequence editing mechanism models token-level changes and is thus robust to error propagation. Furthermore, to capture the user utterance and the complex database schemas in different domains, we use an utterance-table encoder based on BERT to jointly encode the user utterance and column headers with co-attention, and adopt a table-aware decoder to perform SQL generation with attentions over both the user utterance and column headers. We evaluate our model on SParC (Yu et al., 2019b), a new large-scale dataset for cross-domain semantic parsing in context consisting of coherent question sequences annotated with SQL queries over 200 databases in 138 domains. Experiment results show that by generating from the previous query, our model delivers an improvement of 7% question match accuracy and 11% interaction Context Cross-Domain Interaction (train / dev / test) Question Turn Database   match accuracy over the previous state-of-the-art. Further analysis shows that our editing approach is more robust to error propagation than copying segments, and the improvement becomes more significant if the basic text-to-SQL generation accuracy (without editing) improves.

Datasets
We use SParC 1 (Yu et al., 2019b), a large-scale cross-domain context-dependent semantic parsing dataset with SQL labels, as our main evaluation benchmark. A SParC example is shown in Table  3. We also report performance on ATIS (Hemphill et al., 1990;Dahl et al., 1994a) for direct comparison to . In addition, we evaluate the cross-domain context-independent text-to-SQL ability of our model on Spider 2 (Yu et al.,1 https://yale-lily.github.io/sparc 2 https://yale-lily.github.io/spider 2018c), which SParC is built on. We summarize and compare the data statistics in Table 1 and Table 2. While the ATIS dataset has been extensively studied, it is limited to a particular domain. By contrast, SParC is both context-dependent and cross-domain. Each interaction in SParC is constructed using a question in Spider as the interaction goal, where the annotator asks inter-related questions to obtain information that completes the goal. SParC contains interactions over 200 databases and it follows the same database split as Spider where each database appears only in one of train, dev and test sets. In summary, SParC introduces new challenges to context-dependent text-to-SQL because it (1) contains more complex context dependencies, (2) has greater semantic coverage, and (3) adopts a crossdomain task setting.

Task Formulation
Let X denote a natural language utterance and Y denote the corresponding SQL query. Context-  independent semantic parsing considers individual (X, Y ) pairs and maps X to Y . In contextdependent semantic parsing, we consider an interaction I consisting of n utterance-query pairs in a sequence: At each turn t, the goal is to generate Y t given the current utterance X t and the interaction history Furthermore, in the cross-domain setting, each interaction is grounded to a different database. Therefore, the model is also given the schema of the current database as an input. We consider relational databases with multiple tables, and each table contains multiple column headers: where m is the number of column headers, and each c l consists of multiple words including its table name and column name ( § 3.1).

Methodology
We employ an encoder-decoder architecture with attention mechanisms (Sutskever et al., 2014;Luong et al., 2015) as illustrated in Figure 1. The framework consists of (1) an utterance-table encoder to explicitly encode the user utterance and table schema at each turn, (2) A turn attention incorporating the recent history for decoding, (3) a table-aware decoder taking into account the context of the utterance, the table schema, and the previously generated query to make editing decisions.

Utterance-Table Encoder
An effective encoder captures the meaning of user utterances, the structure of table schema, and the relationship between the two. To this end, we build an utterance-table encoder with co-attention between the two as illustrated in Figure 2. Figure 2b shows the utterance encoder. For the user utterance at each turn, we first use a bi-LSTM to encode utterance tokens. The bi-LSTM hidden state is fed into a dot-product attention layer (Luong et al., 2015) over the column header embeddings. For each utterance token embedding, we get an attention weighted average of the column header embeddings to obtain the most relevant columns (Dong and Lapata, 2018). We then concatenate the bi-LSTM hidden state and the column attention vector, and use a second layer bi-LSTM to generate the utterance token embedding h E . Figure 2c shows the table encoder. For each column header, we concatenate its table name and its column name separated by a special dot token (i.e., table name . column name). Each column header is processed by a bi-LSTM layer. To better capture the internal structure of the table schemas (e.g., foreign key), we then employ a selfattention (Vaswani et al., 2017) among all column headers. We then use an attention layer to capture the relationship between the utterance and the table schema. We concatenate the self-attention vector and the utterance attention vector, and use a second layer bi-LSTM to generate the column header embedding h C .
Note that the two embeddings depend on each other due to the co-attention, and thus the column header representation changes across different utterances in a single interaction.  Table Columns Bi LSTM Concatenation (b) Utterance Encoder.  Utterance- Table BERT Embedding. We consider two options as the input to the first layer bi-LSTM. The first choice is the pretrained word embedding. Second, we also consider the contextualized word embedding based on BERT (Devlin et al., 2019). To be specific, we follow Hwang et al. (2019) to concatenate the user utterance and all the column headers in a single sequence separated by the [SEP] token:

Self-Attention Among
This sequence is fed into the pretrained BERT model whose hidden states at the last layer is used as the input embedding.

Interaction Encoder with Turn Attention
To capture the information across different utterances, we use an interaction-level encoder  on top of the utterance-level encoder.
At each turn, we use the hidden state at the last time step from the utterance-level encoder as the utterance encoding. This is the input to a unidirectional LSTM interaction encoder: The hidden state of this interaction encoder h I encodes the history as the interaction proceeds. Turn Attention When issuing the current utterance, the user may omit or explicitly refer to the previously mentioned information. To this end, we adopt the turn attention mechanism to capture correlation between the current utterance and the utterance(s) at specific turn(s). At the current turn t, we compute the turn attention by the dot-product attention between the current utterance and previous utterances in the history, and then add the weighted average of previous utterance embeddings to the current utterance embedding: The c turn t summarizes the context information and the current user query and will be used as the initial decoder state as described in the following.

Table-aware Decoder
We use an LSTM decoder with attention to generate SQL queries by incorporating the interaction history, the current user utterance, and the table schema.
Denote the decoding step as k, we provide the decoder input as a concatenation of the embedding of SQL query token q k and a context vector c k : where h D is the hidden state of the decoder LSTM D , and the hidden state h D 0 is initialized by c turn t . When the query token is a SQL keyword, q k is a learned embedding; when it is a column header, we use the column header embedding given by the table-utterance encoder as q k . The context vector c k is described below. Context Vector with the Table and User Utterance. The context vector consists of attentions to both the table and the user utterance. First, at each step k, the decoder computes the attention between the decoder hidden state and the column header embedding.
where l is the index of column headers and h C l is its embedding. Second, it also computes the attention between the decoder hidden state and the utterance token embeddings: where i is the turn index, j is the token index, and h E i,j is the token embedding for the j-th token of i-th utterance. The context vector c k is a concatenation of the two: Output Distribution. In the output layer, our decoder chooses to generate a SQL keyword (e.g., SELECT, WHERE, GROUP BY, ORDER BY) or a column header. This is critical for the crossdomain setting where the table schema changes across different examples. To achieve this, we use separate layers to score SQL keywords and column headers, and finally use the softmax operation to generate the output probability distribution:

Query Editing Mechanism
In an interaction with the system, the user often asks a sequence of closely related questions to complete the final query goal. Therefore, the query generated for the current turn often overlaps significantly with the previous ones.
To empirically verify the usefulness of leveraging the previous query, we consider the process of generating the current query by applying copy and insert operations to the previous query 3 . Figure 3 shows the SQL query length and the number of copy and insert operations at different turns. As the interaction proceeds, the user question becomes more complicated as it requires longer SQL query to answer. However, more query tokens overlap with the previous query, and thus the number of new tokens remains small at the third turn and beyond.
Based on this observation, we extend our tableware decoder with a query editing mechanism. We first encode the previous query using another bi-LSTM, and its hidden states are the query token embeddings h Q i,j (i.e., the j -th token of the i-th query). We then extend the context vector with the attention to the previous query: where c query k is produced by an attention to query tokens h Q i,j in the same form as Equation 3.
At each decoding step, we predict a switch p copy to decide if we need copy from the previous query or insert a new token.
p copy = σ(c k W copy + b copy ) p insert = 1 − p copy (5) Then, we use a separate layer to score the query tokens at turn t − 1, and the output distribution is modified as the following to take into account the editing probability: P (y k ) = p copy · P prev SQL (y k ∈ prev SQL) +p insert · P SQL column (y k ∈ SQL column) While the copy mechanism has been introduced by Gu et al. (2016) and See et al. (2017), they focus on summarization or response generation applications by copying from the source sentences. By contrast, our focus is on editing the previously generated query while incorporating the context of user utterances and table schemas.

Related Work
Semantic parsing is the task of mapping natural language sentences into formal representations. It has been studied for decades including using linguistically-motivated compositional representations, such as logical forms (Zelle and Mooney, 1996;Clarke et al., 2010) and lambda calculus (Zettlemoyer and Collins, 2005;Artzi and Zettlemoyer, 2011), and using executable programs, such as SQL queries (Miller et al., 1996;Zhong et al., 2017) and other general-purpose programming languages (Yin and Neubig, 2017;. Most of the early studies worked on a few domains and small datasets such as GeoQuery (Zelle and Mooney, 1996) and Overnight (Wang et al., 2015).
Relatively less effort has been devoted to context-dependent semantic parsing on datasets including ATIS (Hemphill et al., 1990;Dahl et al., 1994b), SpaceBook (Vlachos and Clark, 2014), SCONE (Long et al., 2016;Guu et al., 2017;Fried et al., 2018;Huang et al., 2019), SequentialQA (Iyyer et al., 2017), SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a). On ATIS, Miller et al. (1996) maps utterances to semantic frames which are then mapped to SQL queries; Zettlemoyer and Collins (2009) starts with context-independent Combinatory Categorial Grammar (CCG) parsing and then resolves references to generate lambda-calculus logical forms for sequences of sentences. The most relevant to our work is , who generate ATIS SQL queries from interactions by incorporating history with an interaction-level encoder and copying segments of previously generated queries. Furthermore, SCONE contains three domains using stack-or list-like elements and most queries include a single binary predicate. Se-quentialQA is created by decomposing some complicated questions in WikiTableQuestions (Pasupat and Liang, 2015). Since both SCONE and Se-quentialQA are annotated with only denotations but not query labels, they don't include many questions with rich semantic and contextual types. For example, SequentialQA (Iyyer et al., 2017) requires that the answer to follow-up questions must be a subset of previous answers, and most of the questions can be answered by simple SQL queries with SELECT and WHERE clauses. Concurrent with our work, Yu et al. (2019a) introduced CoSQL, a large-scale cross-domain conversational text-to-SQL corpus collected under the Wizard-of-Oz setting. Each dialogue in CoSQL simulates a DB querying scenario with a crowd worker as a user and a college computer science student who is familiar with SQL as an expert. Question-SQL pairs in CoSQL reflect greater diversity in user backgrounds compared to other corpora and involve frequent changes in user intent between pairs or ambiguous questions that require user clarification. These features pose new chal-lenges for text-to-SQL systems.
Our work is also related to recently proposed approaches to code generation by editing (Hayati et al., 2018;Yin et al., 2019;Hashimoto et al., 2018). While they follow the framework of generating code by editing the relevant examples retrieved from training data, we focus on a contextdependent setting where we generate queries from the previous query predicted by the system itself.

Metrics
On both Spider and SParC, we use the exact set match accuracy between the gold and the predicted queries 4 . To avoid ordering issues, instead of using simple string matching, Yu et al. (2018c) decompose predicted queries into different SQL clauses such as SELECT, WHERE, GROUP BY, and ORDER BY and compute scores for each clause using set matching separately. On SparC, we report two metrics: question match accuracy which is the score average over all questions and interaction match accuracy which is average over all interactions.

Baselines
SParC. We compare with the two baseline models released by Yu et al. (2019b).
(1) Context-dependent Seq2Seq (CD-Seq2Seq): This model is adapted from . The original model was developed for ATIS and does not take the database schema as input hence cannot generalize well across domains. Yu et al. (2019b) adapt it to perform context-dependent SQL generation in multiple domains by adding a bi-LSTM database schema encoder which takes bag-of-words representations of column headers as input. They also modify the decoder to select between a SQL keyword or a column header.
(2) SyntaxSQL-con: This is adapted from the original context-agnostic SyntaxSQLNet (Yu et al., 2018b) by using bi-LSTMs to encode the interaction history including the utterance and the associated SQL query response. It also employs a column attention mechanism to compute representations of the previous question and SQL query.

Implementation Details
Our model is implemented in PyTorch (Paszke et al., 2017). We use pretrained 300-dimensional GloVe (Pennington et al., 2014) word embedding. All LSTM layers have 300 hidden size, and we use 1 layer for encoder LSTMs, and 2 layers for decoder LSTMs. We use the ADAM optimizer (Kingma and Ba, 2015) to minimize the tokenlevel cross-entropy loss with a batch size of 16. Model parameters are randomly initialized from a uniform distribution U [−0.1, 0.1]. The main model has an initial learning rate of 0.001 and it will be multiplied by 0.8 if the validation loss increases compared with the previous epoch. When using BERT instead of GloVe, we use the pretrained small uncased BERT model with 768 hidden size 5 , and we fine tune it with a separate constant learning rate of 0.00001. The training typically converges in 10 epochs.

Overall Results
Spider.  (Yu et al., 2019b) 18.5 20.2 4.3 5.2 CD-Seq2Seq (Yu et al., 2019b)    embedding gives significant improvement, achieving 57.6% on dev set and 53.4% on test set, which is comparable to the state-of-the-art results from IRNet with BERT. We attribute our BERT model's high performance to (1) the empirically powerful text understanding ability of pretrained BERT model and (2) the early interaction between utterances and column headers when they are concatenated in a single sequence as the BERT input.
SParC. Table 5 shows the results on SParC dataset. Similar to Spider, our model without previous query as input already outperforms SyntaxSQL-con, achieving 31.4% question matching accuracy and 14.7% interaction matching accuracy. In addition, compared with CD-Seq2Seq, our model enjoys the benefits of the table-utterance encoder, turn attention, and the joint consideration of utterances and table schemas during the decoding stage. This boosts the performance by 10% question accuracy and 6% interaction accuracy.
Furthermore, we also investigate the effect of copying segment. We use the same segment copy procedure as : first deterministically extract segments from the previous query and encode each segment using an LSTM, then generate a segment by computing its output probability based on its segment encoding. However, since the segment extraction from  is exclusively designed for the ATIS dataset, we implement our own segment extraction procedure by extracting SELECT, FROM, GROUP BY, ORDER BY clauses as well as different conditions in WHERE clauses. In this way, 3.9 segments can be extracted per SQL on average. We found that adding segment copying to CD-Seq2Seq gives a slightly lower performance on question matching and a small gain on interaction matching, while using segments extracted from the gold query can have much higher results. This demonstrates that segment copy is vulnerable to error propagation. In addition, it can only copy whole segments hence has difficulty capturing the changes of only one or a few tokens in the query.
To better understand how models perform as the interaction proceeds, Figure 4 (Left) shows the performance split by turns on the dev set. The questions asked in later turns are more difficult to answer given longer context history. While the baselines have lower performance as the turn number increases, our model still maintains 38%-48% accuracy for turn 2 and 3, and 20% at turn 4 or beyond. Similarly, Figure 4 (Right) shows the performance split by hardness levels with the frequency of examples. This also demonstrates our model is more competitive in answering hard and extra hard questions.

ATIS.
We also report our model performance  on ATIS in Table 6. Our model achieves 36.2% dev and 43.9% test string accuracy, comparable to . On ATIS, we only apply our editing mechanism and reuse their utterance encoder instead of the BERT utterance-table encoder, because ATIS is single domain.

Effect of Query Editing
We further investigate the effect of our query editing mechanism. To this end, we apply editing from both the gold query and the predicted query on our model with or without the utterance-table BERT embedding. We also perform an ablation study to validate the contribution of query attention and sequence editing separately.
As shown in Table 5, editing the gold query consistently improves both question match and interaction match accuracy. This shows the editing approach is indeed helpful to improve the generation quality when the previous query is the oracle.
Using the predicted query is a more realistic setting, and in this case, the model is affected by error propagation due to the incorrect queries produced by itself. For the model without the utterance-table BERT embedding, using the predicted query only gives around 1.5% improvement. As shown in Figure 5, this is because the editing mechanism is more helpful for turn 4 which is a small fraction of all question examples. For the model with the utterance-table BERT embedding, the query generation accuracy at each turn is significantly improved, thus reducing the error propagation effect. In this case, the editing approach delivers consistent improvements of 7% increase on question matching accuracy and 11% increase on interaction matching accuracy. Figure 5 also shows that query editing with BERT benefits all turns.
Finally, as an ablation study, Table 5 also reports the result with only query attention (use predicted query) on the dev set. This improves over our vanilla BERT model without query attention and achieves 42.7% question and 21.6% interaction matching accuracy. With query editing, our best model further improves to 47.2% question and 29.5% interaction matching accuracy. This demonstrates the effectiveness of our query attention and query editing separately, both of which are essential to make use of the previous query.

Conclusions
In this paper, we propose an editing-based encoder-decoder model to address the problem of context-dependent cross-domain text-to-SQL generation. While being simple, empirical results demonstrate the benefits of our editing mechanism. The approach is more robust to error propagation than copying segments, and its performance increases when the basic text-to-SQL generation quality (without editing) is better.