NL-EDIT: Correcting Semantic Parse Errors through Natural Language Interaction

We study semantic parsing in an interactive setting in which users correct errors with natural language feedback. We present NL-EDIT, a model for interpreting natural language feedback in the interaction context to generate a sequence of edits that can be applied to the initial parse to correct its errors. We show that NL-EDIT can boost the accuracy of existing text-to-SQL parsers by up to 20% with only one turn of correction. We analyze the limitations of the model and discuss directions for improvement and evaluation. The code and datasets used in this paper are publicly available at http://aka.ms/NLEdit.


Introduction
Major progress in natural language processing has been made towards fully automating challenging tasks such as question answering, translation, and summarization. On the other hand, several studies have argued that machine learning systems that can explain their own predictions (Doshi-Velez and Kim, 2017) and learn interactively from their end users (Amershi et al., 2014) can result in better user experiences and more effective learning systems. We develop NL-EDIT, an approach that employs both explanations and interaction in the context of semantic parsing.
Most existing systems frame semantic parsing as a one-shot translation from a natural language question to the corresponding logical form (e.g., SQL query) (Yu et al., 2018a; Guo et al., 2019; Wang et al., 2020, inter alia). A growing body of recent work demonstrates that semantic parsing systems can be improved by including users in the parsing loop, giving them the affordance to examine the parses, judge their correctness, and provide feedback accordingly. The feedback often comes in the form of a binary correct/incorrect signal (Iyer et al., 2017), answers to a multiple-choice question posed by the system (Gur et al., 2018; Yao et al., 2019), or suggestions of edits that can be applied to the parse.

* Most of the work was done while the first author was an intern at Microsoft Research.

Figure 1: Example human interaction with NL-EDIT to correct an initial parse through natural language feedback. In the Semantic Parsing Phase (top), an off-the-shelf parser generates an initial SQL query and provides an answer paired with an explanation of the generated SQL. In the Correction Phase (bottom), the user reviews the explanation and provides feedback that describes how the explanation should be corrected. The system parses the feedback as a set of edits that are applied to the initial parse to generate a corrected SQL. In the pictured example, the question is "What is the full name of the candidate with the most votes?" and the inferred edit is GROUP-BY: remove vote_id; GROUP-BY: add candidate_id; SELECT: add last_name.
Unlike other frameworks for interactive semantic parsing that typically expect users to judge the correctness of the execution result or induced logical form, Elgohary et al. (2020) introduced a framework for interactive text-to-SQL in which induced SQL queries are fully explained in natural language to users, who in turn can correct such parses through natural language feedback (Figure 1). They construct the SPLASH dataset and use it to evaluate baselines for the semantic parse correction with natural language feedback task they introduce.
We present a detailed analysis of the feedback and the differences between the initial (incorrect) and the correct parse. We argue that a correction model should be able to interpret the feedback in the context of other elements of the interaction (the original question, the schema, and the explanation of the initial parse). We observe from SPLASH that most feedback utterances tend to describe a few edits that the user desires to apply to the initial parse. As such, we pose the correction task as a semantic parsing problem that aims to convert natural language feedback to a sequence of edits that can be deterministically applied to the initial parse to correct it. We use the edit-based modeling framework to show that we can effectively generate synthetic data to pre-train the correction model leading to clear performance gains.
We make the following contributions: (1) we present a scheme for representing SQL query edits that benefits both the modeling and the analysis of the correction task; (2) we present NL-EDIT, an edit-based model for interactive text-to-SQL with natural language feedback, and show that it outperforms the baselines of Elgohary et al. (2020) by more than 16 points; (3) we demonstrate that we can generate synthetic data through the edit-based framing and that the model can effectively use this data to improve its accuracy; and (4) we present a detailed analysis of the model performance, including the effect of different components and generalization to errors of state-of-the-art parsers, and outline directions for future research.

Background
In the task of text-to-SQL parsing, the objective is, given a database schema (tables, columns, and primary-foreign key relations) and a natural language question, to generate a SQL query that answers the question when executed against the database. Several recent text-to-SQL models have been introduced (Yu et al., 2018a; Guo et al., 2019; Wang et al., 2020, inter alia) as a result of the availability of SPIDER (Yu et al., 2018b), a large dataset of schemas, questions, and gold parses spanning several databases in different domains.
The task of SQL parse correction with natural language feedback (Elgohary et al., 2020) aims to correct an erroneous parse based on natural language feedback collected from the user. Given a question, a database schema, an incorrect initial parse, and natural language feedback on the initial parse, the task is to generate a corrected parse.
To study this problem, Elgohary et al. (2020) introduced the SPLASH dataset. SPLASH was created by showing annotators questions and a natural language explanation of incorrect parses and asking them to provide feedback, in natural language, to correct the parse. The dataset contained 9,314 question-feedback pairs. Like the SPIDER dataset, it was split into train-dev-test sets by database to encourage the models to generalize to new unseen databases. They contrast the task with conversational semantic parsing (Suhr et al., 2018;Yu et al., 2019b,a;Andreas et al., 2020) and show that the two tasks are distinct and are addressing different aspects of utilizing context. They establish several baseline models and show that the task is challenging for state-of-the-art semantic parsing models. We use these as baselines for this work.

SQL Edits
We define a scheme for representing the edits required to transform one SQL query to another. We use that scheme both in our model and analysis. Our goal is to balance the granularity of the edits: too fine-grained edits result in complex structures that are challenging for models to learn, and too coarse-grained edits result in less compact structures that are harder for models to generate.
We view a SQL query as a set of clauses (e.g., SELECT, FROM, WHERE), where each clause has a sequence of arguments (Figure 2). We mirror the SQL clauses SELECT, FROM, WHERE, GROUP-BY, ORDER-BY, HAVING, and LIMIT. For subqueries, we define a clause SUBS whose arguments are recursively defined as sets of clauses. Subqueries can be linked to the main query in two ways. The first is through an IEU clause (mirroring SQL INTERSECT/EXCEPT/UNION) whose first argument is one of the keywords INTERSECT, EXCEPT, or UNION and whose second argument is a pointer to a subquery in SUBS. The second is through nested queries, where the arguments of some clauses (e.g., WHERE) can point at subqueries in SUBS (e.g., "id NOT IN SUBS_1").
With such a view of two queries $P_{source}$ and $P_{target}$, we define their edit $D_{source \to target}$ as the set of clause-level edits $\{D^{c}_{source \to target}\}$ for all types of clauses $c$ that appear in $P_{source}$ or $P_{target}$ (Figure 2). To compare two clauses of type $c$, we simply exact-match their arguments: unmatched arguments in the source (e.g., MAX(grade) in SELECT) are added as to-remove arguments to the corresponding edit clause, and unmatched arguments in the target (e.g., "id" in the ORDER-BY) are added as to-add arguments.

Figure 2: Edit for transforming the source query "SELECT id, MAX(grade) FROM assignments WHERE grade > 20 AND id NOT IN (SELECT id from graduates) GROUP BY id" to the target "SELECT id, AVG(grade) FROM assignments WHERE grade > 20 GROUP BY id ORDER BY id". The source and target are represented as sets of clauses (left and middle); for example, the source SELECT clause has the arguments arg_1: "id" and arg_2: "MAX(grade)". The set of edits and its linearized form (Section 4) are shown on the right; the linearized edit reads "<select> remove maximum grade </select> <select> add average grade </select> <where> remove id not one of </where> <orderby> add id </orderby>". Removing the condition "id NOT IN SUBS_1" makes the subquery unreferenced, hence pruned from the edit.
Our current implementation follows SPIDER's assumption that the number of subqueries is at most one, which implies that computing edits for different clauses can be done independently, even for the clauses that reference a subquery (e.g., WHERE in Figure 2). The edit of the SUBS clause is recursively computed as the edit between two queries (either of which can be empty): the subquery of the source and the subquery of the target, i.e., $D^{SUBS}_{source \to target} = D_{source:SUBS_1 \to target:SUBS_1}$. We keep track of the edits to the arguments that reference the subquery. After all edit clauses are computed, we prune the edits of the SUBS clause if the subquery will no longer be referenced (SUBS_1 in Figure 2). We follow the SPIDER evaluation and discard the values in WHERE/HAVING clauses.
Throughout this paper, we refer to the number of add/remove operations in an edit as the Edit Size, and we denote it as $|D_{source \to target}|$. For example, the edit in Figure 2 is of size four.
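The clause-level comparison described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the dictionary representation, the function names `clause_edit`, `compute_edit`, and `edit_size`, and the placeholder `value` for discarded WHERE values are all hypothetical, and the recursive SUBS handling and pruning step are left out (the pruned subquery is simply omitted from the example).

```python
from typing import Dict, List

Query = Dict[str, List[str]]        # clause name -> sequence of arguments
ClauseEdit = Dict[str, List[str]]   # {"remove": [...], "add": [...]}


def clause_edit(source_args: List[str], target_args: List[str]) -> ClauseEdit:
    """Exact-match the arguments of one clause type."""
    return {
        "remove": [a for a in source_args if a not in target_args],
        "add": [a for a in target_args if a not in source_args],
    }


def compute_edit(source: Query, target: Query) -> Dict[str, ClauseEdit]:
    """Collect non-empty clause edits for every clause in either query."""
    edit = {}
    for clause in sorted(set(source) | set(target)):
        ce = clause_edit(source.get(clause, []), target.get(clause, []))
        if ce["remove"] or ce["add"]:
            edit[clause] = ce
    return edit


def edit_size(edit: Dict[str, ClauseEdit]) -> int:
    """Number of add/remove operations in the edit."""
    return sum(len(ce["remove"]) + len(ce["add"]) for ce in edit.values())


# The Figure 2 example, with WHERE values replaced by a placeholder.
source = {
    "SELECT": ["id", "MAX(grade)"],
    "FROM": ["assignments"],
    "WHERE": ["grade > value", "id NOT IN SUBS_1"],
    "GROUP-BY": ["id"],
}
target = {
    "SELECT": ["id", "AVG(grade)"],
    "FROM": ["assignments"],
    "WHERE": ["grade > value"],
    "GROUP-BY": ["id"],
    "ORDER-BY": ["id"],
}
figure2_edit = compute_edit(source, target)
print(figure2_edit, edit_size(figure2_edit))
```

On this example the sketch yields two SELECT operations, one WHERE removal, and one ORDER-BY addition, which matches the edit size of four stated above.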

Model
We follow the task description in Section 2: the inputs to the model are the elements of the interaction, namely the question, the schema, an initial parse $\tilde{P}$, and the feedback. The model predicts a corrected parse $\bar{P}$. The gold parse $\hat{P}$ is available for training. Our model is based on integrating two key ideas in an encoder-decoder architecture. We start with a discussion of the intuitions behind the two ideas followed by the model details.

Intuitions
Interpreting feedback in context: The feedback is expected to link to all the other elements of the interaction (Figure 1). The feedback is provided in the context of the explanation of the initial parse, as a proxy to the parse itself. As such, the feedback tends to use the same terminology as the explanation. For example, the SQL explanations of Elgohary et al. (2020) express "group by" in simple language, as in "for each vote_id, find ...". As a result, human-provided feedback never uses "group by". We also notice that in several SPLASH examples, the feedback refers to particular steps in the explanation, as in the examples in Figure 1. Unlike existing models (Elgohary et al., 2020), we replace the initial parse with its natural language explanation. Additionally, the feedback usually refers to columns/tables in the schema, and could often be ambiguous when examined in isolation. Such ambiguities can usually be resolved by relying on the context provided by the question. For example, "find last name" in Figure 1 is interpreted as "find last name besides first name" rather than "replace first name with last name" because the question asks for the "full name". Our first key idea is based on grounding the elements of the interaction by combining the relations self-learned by transformer models (Vaswani et al., 2017) with hard-coded relations that we define according to the possible ways different elements can link to each other.
Feedback describes a set of edits: The difference between the erroneous parse and the correct one can mostly be described as a few edits that need to be applied to the initial parse to correct its errors (Section 7). Also, the feedback often only describes the edits to be made (Elgohary et al., 2020). As such, we can pose the task of correction with NL feedback as a semantic parsing task where we convert a natural language description of the edits to a canonical form that can be applied deterministically to the initial parse to generate the corrected one. We train our model to generate SQL Edits (Section 3) rather than SQL queries.

Encoder
Our encoder (Figure 3) starts by passing the concatenation of the feedback, explanation, question, and schema through BERT (Devlin et al., 2019). Following Wang et al. (2020), Suhr et al. (2018), and Scholak et al. (2020), we tokenize the column/table names and concatenate them in one sequence (Schema), starting with the tokens of the tables followed by the tokens of the columns. Then, we average the BERT embeddings of the tokens corresponding to each column (table) to obtain one representation for the column (table). Wang et al. (2020) study the text-to-SQL problem using the SPIDER dataset and show the benefit of injecting preexisting relations within the schema (column exists in a table, primary-foreign key) and between the question and schema items (column and table names) by: (1) name linking, which links a question token to a column/table if the token and the item name match, and (2) value linking, which links a question token to a column if the token appears as a value under that column. To incorporate such relations in their model, they use the relation-aware self-attention formulation of Shaw et al. (2018). The relation-aware transformer assigns a learned embedding to each relation type and combines such embeddings with the self-attention of the original transformer model (Vaswani et al., 2017): if a preexisting relation r holds between two tokens, the embedding of r is added as a bias term to the self-attention computation between the two tokens.
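For reference, the relation-aware self-attention of Shaw et al. (2018) can be written as follows for a single attention head. The symbols are assumptions introduced here for illustration: $x_i$ is the input representation of token $i$, $W^Q$, $W^K$, $W^V$ are the usual projection matrices, $d_z$ is the head dimension, and $r^K_{ij}$, $r^V_{ij}$ are the learned embeddings of the relation that holds between tokens $i$ and $j$.

```latex
e_{ij} = \frac{x_i W^Q \left( x_j W^K + r^K_{ij} \right)^{\top}}{\sqrt{d_z}},
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})},
\qquad
z_i = \sum_{j} \alpha_{ij} \left( x_j W^V + r^V_{ij} \right)
```

Expanding the numerator of $e_{ij}$ gives the standard dot-product score plus the term $x_i W^Q (r^K_{ij})^{\top}$, which is the relation-dependent bias referred to above.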
In addition to those relations, we define a new set of relations that aim at contextualizing the feedback with respect to the other elements of the interaction in our setup: (1) [Feedback-Schema] We link the feedback to the schema the same way the question is linked to the schema, via both name and value linking. (2) [Explanation-Schema] Columns and tables are mentioned with their exact names in the explanation, so we link the explanation to the schema only through exact name matching. (3) [Feedback-Question] We use partial (at the lemma level) and exact matching to link tokens in the feedback and the question. (4) [Feedback-Explanation] We link tokens in the feedback to tokens in the explanation through partial and exact token matching. Since the feedback often refers to particular steps, we link the feedback tokens to explanation tokens that occur in steps that are referred to in the feedback with a separate relation type that indicates step reference in the feedback. (5) [Explanation-Explanation] We link explanation tokens that occur within the same step. We use the same formulation of relation-aware self-attention as Wang et al. (2020) and add the relation-aware layers on top of BERT to integrate all relations into the model (Figure 3).
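To make the matching-based relations more concrete, the following Python sketch links feedback tokens to schema item names and detects step references. It is only an illustration under assumptions: the relation labels, the underscore-based splitting of schema names, the n-gram limit, and the "step N" surface pattern are hypothetical choices, and lemma-level matching and value linking are not covered.

```python
import re
from typing import Dict, List, Tuple


def name_link(tokens: List[str], schema_names: List[str],
              max_ngram: int = 3) -> Dict[Tuple[int, int], str]:
    """Map (token index, schema item index) to a match relation."""
    relations: Dict[Tuple[int, int], str] = {}
    lowered = [t.lower() for t in tokens]
    for j, name in enumerate(schema_names):
        name_tokens = name.lower().split("_")
        # Partial match: a single token occurs inside the item name.
        for i, tok in enumerate(lowered):
            if tok in name_tokens:
                relations[(i, j)] = "PARTIAL-MATCH"
        # Exact match: an n-gram spells out the whole name (overrides partial).
        n = len(name_tokens)
        if n <= max_ngram:
            for start in range(len(lowered) - n + 1):
                if lowered[start:start + n] == name_tokens:
                    for i in range(start, start + n):
                        relations[(i, j)] = "EXACT-MATCH"
    return relations


def referenced_steps(feedback: str) -> List[int]:
    """Explanation steps mentioned in the feedback, e.g. 'in step 2 ...'."""
    return [int(n) for n in re.findall(r"\bstep\s+(\d+)", feedback.lower())]


links = name_link(["find", "last", "name"],
                  ["first_name", "last_name", "candidate_id"])
print(links)
print(referenced_steps("In step 2, use candidate id instead of vote id"))
```

In a relation-aware encoder, each label returned by such a matcher would index a learned relation embedding for the corresponding token pair.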

Decoder
Using a standard teacher-forced cross-entropy loss, we train our model to generate linearized SQL Edits (Figure 2). At training time, we compute the reference SQL Edit $D_{\tilde{P} \to \hat{P}}$ of the initial parse $\tilde{P}$ and the gold parse $\hat{P}$ (Section 3). Then we linearize $D_{\tilde{P} \to \hat{P}}$ by listing the clause edits in a fixed order (FROM, WHERE, GROUP-BY, etc.). The argument of each clause, representing one add or remove operation, is formatted as <CLAUSE> ADD/REMOVE ARG </CLAUSE>. We express SQL operators in ARG with natural language explanations as in Elgohary et al. (2020). For example, the argument "AVG(grade)" is expressed as "average grade". At inference time, we generate a corrected parse $\bar{P}$ by applying the produced edit to the initial parse $\tilde{P}$.
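The linearization and the deterministic application of an edit can be sketched as follows. This is a hedged illustration, not the authors' code: the clause order in `CLAUSE_ORDER`, the tag spelling, the small aggregate phrase table (only "average" and "maximum" appear in the paper's examples), and the dictionary representation of queries are assumptions.

```python
import re
from typing import Dict, List

Query = Dict[str, List[str]]
Edit = Dict[str, Dict[str, List[str]]]

# Assumed fixed clause order; the exact order used by the model may differ.
CLAUSE_ORDER = ["SELECT", "FROM", "WHERE", "GROUP-BY", "ORDER-BY", "HAVING", "LIMIT"]
# Hypothetical phrase table for aggregate operators.
AGGREGATES = {"AVG": "average", "MAX": "maximum", "MIN": "minimum",
              "SUM": "sum of", "COUNT": "number of"}


def verbalize(arg: str) -> str:
    """Express an aggregate such as AVG(grade) as 'average grade'."""
    m = re.fullmatch(r"(\w+)\((.+)\)", arg)
    if m and m.group(1).upper() in AGGREGATES:
        return f"{AGGREGATES[m.group(1).upper()]} {m.group(2)}"
    return arg


def linearize(edit: Edit) -> str:
    """One <clause> op arg </clause> segment per add/remove operation."""
    parts = []
    for clause in CLAUSE_ORDER:
        if clause not in edit:
            continue
        tag = clause.lower().replace("-", "")
        for op in ("remove", "add"):
            for arg in edit[clause].get(op, []):
                parts.append(f"<{tag}> {op} {verbalize(arg)} </{tag}>")
    return " ".join(parts)


def apply_edit(query: Query, edit: Edit) -> Query:
    """Apply the removals and additions clause by clause."""
    result = {c: list(args) for c, args in query.items()}
    for clause, ce in edit.items():
        args = [a for a in result.get(clause, []) if a not in ce.get("remove", [])]
        args.extend(ce.get("add", []))
        if args:
            result[clause] = args
        else:
            result.pop(clause, None)
    return result


initial = {"SELECT": ["id", "MAX(grade)"], "FROM": ["assignments"], "GROUP-BY": ["id"]}
edit = {"SELECT": {"remove": ["MAX(grade)"], "add": ["AVG(grade)"]},
        "ORDER-BY": {"add": ["id"]}}
print(linearize(edit))
print(apply_edit(initial, edit))
```

Because the edit is applied deterministically, the decoder only has to produce the short linearized sequence rather than regenerate the whole query.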
We use a standard transformer decoder that either generates tokens from the output vocab or copies columns and tables from the encoder output. Since all editing operations should be directed by the feedback, we tried splitting the attention to the encoder into two phases: First, we attend to the feedback only and update the decoder state

Synthetic Feedback
In this section, we describe our process for automatically synthesizing additional examples for training the correction model. Recall that each example consists of a question about a given schema paired with a gold parse, an initial erroneous parse, and feedback. Starting with a seed of questions and their corresponding gold parses from SPIDER's training set (8,099 pairs), our synthesis process applies a sequence of SQL editing operations to the gold parse to reach an altered parse that we use as the initial parse (Algorithm 1). By manually inspecting the edits (Section 3) we induce for the initial and gold parses in the SPLASH training set, we define 26 SQL editors and pair each editor with its most frequent corresponding feedback template(s) (examples in Table 1). We also associate each editor with a set of constraints that determines whether it can be applied to a given SQL query (e.g., the "Remove-Limit" editor can only be applied to a query that has a limit clause).
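A rough Python sketch of such an editor-based synthesis loop is given below. Everything except the "Remove-Limit" editor name and its constraint is an assumption made for illustration: the second editor, the feedback templates, the uniform sampling of feasible editors, the cap of four edits per example, and the helper names `synthesize` and `build_dataset` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Query = Dict[str, List[str]]


@dataclass
class Editor:
    name: str
    can_apply: Callable[[Query], bool]   # constraint on the current parse
    apply: Callable[[Query], Query]      # corrupts the parse
    feedback_template: str               # hypothetical feedback text


def drop_clause(clause: str) -> Callable[[Query], Query]:
    return lambda q: {c: list(a) for c, a in q.items() if c != clause}


EDITORS = [
    Editor("Remove-Limit", lambda q: "LIMIT" in q,
           drop_clause("LIMIT"), "limit the results"),
    # Hypothetical second editor, only feasible once LIMIT is gone.
    Editor("Remove-Order-By", lambda q: "ORDER-BY" in q and "LIMIT" not in q,
           drop_clause("ORDER-BY"), "order the results"),
]


def synthesize(gold: Query, editors: List[Editor], rng: random.Random,
               max_edits: int = 4) -> Tuple[Query, List[str]]:
    """Corrupt the gold parse with a random number of feasible edits."""
    parse = {c: list(a) for c, a in gold.items()}
    feedback = []
    for _ in range(rng.randint(1, max_edits)):
        feasible = [e for e in editors if e.can_apply(parse)]
        if not feasible:
            break
        editor = rng.choice(feasible)     # uniform sampling is an assumption
        parse = editor.apply(parse)
        feedback.append(editor.feedback_template)
    return parse, feedback


def build_dataset(seeds, n_clones: int, editors: List[Editor], seed: int = 0):
    """Create n_clones synthetic examples per (question, gold parse) seed."""
    rng = random.Random(seed)
    data = []
    for question, gold in seeds:
        for _ in range(n_clones):
            initial, feedback = synthesize(gold, editors, rng)
            data.append({"question": question, "gold": gold,
                         "initial": initial, "feedback": " and ".join(feedback)})
    return data


gold = {"SELECT": ["name"], "FROM": ["singer"], "ORDER-BY": ["age"], "LIMIT": ["1"]}
examples = build_dataset([("Who is the youngest singer?", gold)], 3, EDITORS)
print(examples[0])
```

Each synthetic example pairs the corrupted parse with feedback that asks for the removed element to be restored, so the reference edit from the initial parse to the gold parse is known by construction.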
Algorithm 1 summarizes the synthesis process. We start by creating N clones of each seed example, where N controls the size of the dataset. Elgohary et al. (2020)'s analysis of SPLASH shows that multiple mistakes might be present in the initial SQL, hence we allow our synthesis process to introduce up to four edits (randomly decided in line:4) to each clone p. For each editing step, we sample a feasible edit for the current parse (line:5) with man-

Experiments
We use BERT-base-uncased (Devlin et al., 2019) in all our experiments. We set the number of layers in the relation-aware transformer to eight (Wang et al., 2020) and the number of decoder layers to two. We train with batches of size 24 using the Adam optimizer (Kingma and Ba, 2015). We freeze BERT parameters during the first 5,000 warm-up steps and update the rest of the parameters with a linearly increasing learning rate from zero to $5 \times 10^{-4}$. Then, we linearly decrease the learning rates from $5 \times 10^{-5}$ for BERT and $5 \times 10^{-4}$ for the other parameters to zero. (The learning rate schedule depends only on the step number, regardless of whether we are training on the synthetic data or SPLASH. We tried resetting the learning rates back to their maximum values after switching to SPLASH, but did not observe any improvement in accuracy.) We use beam search with a beam of size 20 and take the top-ranked beam that results in a valid SQL after applying the inferred edit.

Evaluation: We follow Elgohary et al. (2020) and use the correction accuracy as our main evaluation measure: each example in the SPLASH test set contains an initial parse $\tilde{P}$ and a gold parse $\hat{P}$. With a (corrected) parse $\bar{P}$ predicted by a correction model, they compute the correction accuracy using the exact-set-match (Yu et al., 2018b) between $\bar{P}$ and $\hat{P}$, averaged over all test examples. While useful, correction accuracy also has limitations. It expects models to be able to fully correct an erroneous parse with only one utterance of feedback; as such, it is defined in terms of the exact match between the corrected and the gold parse. We find (Table 2) that in several cases, models were still able to make progress by reducing the number of errors, as measured by the edit size (Section 3) after correction.

As such, we define another set of metrics to measure partial progress. We report (Edit ↓ and Edit ↑ in Table 2) the percentage of examples on which the size of the edit set strictly decreased/increased. To combine Edit ↓ and Edit ↑ in one measure and account for the relative reduction (increase) in the number of edits, we define

$$\mathrm{Progress}(S) = \frac{1}{|S|} \sum_{(\tilde{P}, \bar{P}, \hat{P}) \in S} \frac{|D_{\tilde{P} \to \hat{P}}| - |D_{\bar{P} \to \hat{P}}|}{|D_{\tilde{P} \to \hat{P}}|}$$

Given a test set S, the Progress of a correction model is computed as the average relative reduction in the edit size with respect to the gold parse $\hat{P}$ that is achieved by replacing the initial parse $\tilde{P}$ with the predicted correction $\bar{P}$. A perfect model that can fully correct all errors in the initial parse would achieve a 100% progress. A model can have a negative progress (e.g., Rule-based re-ranking in Table 2) when it frequently predicts corrections with more errors than those in the initial parse. Unlike correction accuracy, Progress is more aligned with user experience in an interactive environment, as it assigns partial credit for fixing a subset of the errors and it also penalizes models that predict an even more erroneous parse after receiving feedback.
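The partial-progress measures can be computed directly from edit sizes, as in this small Python sketch. The function names and the input format (pairs of edit sizes with respect to the gold parse, before and after correction) are assumptions for illustration; the sketch also assumes every initial parse is erroneous, so the initial edit size is never zero.

```python
from typing import List, Tuple

# Each pair is (edit size of initial parse to gold, edit size of corrected parse to gold).
EditSizes = List[Tuple[int, int]]


def progress(examples: EditSizes) -> float:
    """Average relative reduction in edit size, as a percentage."""
    return 100.0 * sum((init - corr) / init for init, corr in examples) / len(examples)


def edit_down_up(examples: EditSizes) -> Tuple[float, float]:
    """Percentage of examples whose edit size strictly decreased / increased."""
    n = len(examples)
    down = 100.0 * sum(corr < init for init, corr in examples) / n
    up = 100.0 * sum(corr > init for init, corr in examples) / n
    return down, up


# Hypothetical edit sizes: fully fixed, half fixed, and made worse.
sizes = [(2, 0), (4, 2), (1, 3)]
print(progress(sizes), edit_down_up(sizes))
```

The third pair in the example shows how a single correction that adds errors can pull the average below zero, which is the behavior that distinguishes Progress from correction accuracy.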
Results: We compare (Table 2) NL-EDIT to the two top-performing baselines in (Elgohary et al., 2020) and also to the beam re-ranking upper-bound they report. NL-EDIT significantly increases the correction accuracy over the top baseline (Edit-SQL+Feedback) by more than 16% and it also outperforms oracle re-ranking by around 5%. We also note that in 72.4% of the test examples, NL-EDIT was able to strictly reduce the number of errors in the initial parse (Edit ↓) which potentially indicates a more positive user experience than the other models. NL-EDIT achieves 37% Progress which indicates faster convergence to the fully corrected parse than all the other models.

Ablations
Following the same experimental setup in Section 6, we compare NL-EDIT to other variants with one ablated component at a time (Table 3). We ablate the feedback, the explanation, and the question from the encoder input. We also ablate the interaction relations (Section 4.2) that we incorporate in the relation-aware transformer module. We only ablate the new relations we introduce to model the interaction (shown in Figure 3), but we keep the Question-Schema and Schema-Schema relations introduced by Wang et al. (2020). For each such variant, we train for 20,000 steps on the synthetic dataset and then continue training on SPLASH until step 100,000. We also train an ablated variant that does not use the synthetic feedback, which we train for 100,000 steps only on SPLASH. For all variants, we choose the checkpoint with the largest correction accuracy on the dev set and report the accuracy on the SPLASH test set. The results in Table 3 confirm the effectiveness of each component in our model. We find that the model is able to correct 19.8% of the examples without the feedback. We noticed that the ablated-feedback model reaches almost that accuracy after training only on the synthetic data, with very minor improvement (< 1%) after training on SPLASH. Using only the question and the explanation, the model is able to learn about a set of systematic errors that parsers make and how they can be corrected (Gupta et al., 2017; Yin and Neubig, 2019).

Error Analysis
In Figure 4, we break down the correction accuracy by the feedback and explanation lengths (in number of tokens) and by the reference edit size (number of required edit operations to fully correct the initial parse). The accuracy drops significantly when the reference edit size exceeds two (Figure 4c), while it declines more gradually as the feedback and explanation increase in length. We manually inspected the examples with feedback longer than 24 tokens (examples in Table 4) and found that in 8% of them the feedback is long because it describes how to rewrite the whole query rather than being limited to only the edits to be made. In the remaining 92%, the initial query had several errors (edit size of 5.5 on average), with the corresponding feedback enumerating all of them. Figure 4d shows how the number of errors (measured in edit size) changes after correction. The figure shows that even for examples with a large number of errors (four and five), the model is still able to reduce the number of errors in most cases. We manually inspected the examples with only one error that the model failed to correct. We found that 15% of them have either wrong or non-editing feedback, and in 29% the model produced the correct edit but with additional irrelevant ones. The dominant source of error in the remaining examples is the failure to link the feedback to the schema (examples in Table 5).

Table 4 examples:
Long Feedback Not Describing an Edit: "you should determine the major record format from the orchestra table and make sure it is arranged in ascending order of number of rows that appear for each major record format."
Long Feedback Describing an Edit: "replace course id (both) with degree program id, first courses with student enrolment, course description with degree summary name, second courses with degree programs."

Cross-Parser Generalization
So far, we have been using SPLASH for both training and testing. The erroneous parses (and corresponding feedback) in SPLASH are based on the Seq2Struct parser (Shin, 2019). Recent progress in model architectures (Wang et al., 2020) and pretraining (Yin et al., 2020; Yu et al., 2021a) has led to parsers that already outperform Seq2Struct by more than 30% in parsing accuracy (see https://yale-lily.github.io/spider). Here, we ask whether NL-EDIT that we train on SPLASH (and synthetic feedback) can generalize to parsing errors made by more recent parsers without additional parser-specific training data. We follow the same crowdsourcing process used to construct SPLASH (Section 2) to collect three new test sets based on three recent text-to-SQL parsers: EditSQL, TaBERT (Yin et al., 2020), and RAT-SQL (Wang et al., 2020). Following Elgohary et al. (2020), we run each parser on the SPIDER dev set and only collect feedback for the examples with incorrect parses that can be explained using their SQL explanation framework. Table 6 (Top) summarizes the three new test sets and compares them to the SPLASH test set. We note that the four datasets are based on the same set of questions and databases (SPIDER dev).
Table 6 (Bottom) compares the parsing accuracy (measured by exact query match (Yu et al., 2018b)) of each parser when used by itself (No Interaction) to its accuracy when integrated with NL-EDIT. We report both the accuracy on the examples provided to NL-EDIT (Error Correction) and the End-to-End accuracy on the full SPIDER dev set. NL-EDIT significantly boosts the accuracy of all parsers, but with a notable drop in the gains as the accuracy of the parser improves. To explain that, we compare the distribution of reference edit size across the four datasets in Figure 5. The figure does not show any significant differences in the distributions that would lead to such a drop in accuracy gain. Likewise, the distributions of the feedback lengths are very similar (the mean is shown in Table 6). As parsers improve in accuracy, they tend to make most of their errors on complex SQL queries. Although the number of errors in each query does not significantly change (Figure 5), we hypothesize that localizing the errors in a complex initial parse, with a long explanation (Table 6), is the main generalization bottleneck that future work needs to address.

Related Work and Discussion
Natural language to SQL: Natural language interfaces to databases have been an active field of study for many years (Woods et al., 1972; Warren and Pereira, 1982; Popescu et al., 2003; Li and Jagadish, 2014). The development of new large-scale datasets, such as WikiSQL (Zhong et al., 2017) and SPIDER (Yu et al., 2018b), has reignited the interest in this area, with several new models introduced recently (Choi et al., 2020; Wang et al., 2020; Scholak et al., 2020). Another related line of work has focused on conversational semantic parsing, e.g., SParC (Yu et al., 2019b), CoSQL (Yu et al., 2019a), and SMCalFlow (Andreas et al., 2020), where parsers aim at modeling utterances sequentially and in the context of previous utterances.
Interactive Semantic Parsing: Several previous studies have looked at the problem of improving semantic parsers with feedback or human interactions (Clarke et al., 2010; Artzi and Zettlemoyer, 2013). Interactions are supported in multiple ways, including a binary correct/incorrect signal (Iyer et al., 2017), answers to a yes/no or a multiple-choice question posed by the system (Yao et al., 2019; Gur et al., 2018), or suggestions of edits that can be applied to the parse. Yao et al. (2019) and Gur et al. (2018) ask yes/no and multiple-choice questions and use the answers in generating the parse. Elgohary et al. (2020) introduce SPLASH (Section 2), a dataset for correcting semantic parses with natural language feedback. Using language as a medium for providing feedback enables humans to provide rich open-form feedback in their natural way of communication, giving them control and flexibility in specifying what is wrong and how it should be corrected. Our work uses SPLASH and proposes to pose the problem of semantic parse correction as a parse editing problem with natural language feedback as input. This is also related to recent work on casting text generation (e.g., summarization, grammatical error correction, sentence splitting, etc.) as a text editing task (Malmi et al., 2019; Panthaplackel et al., 2020; Stahlberg and Kumar, 2020), where target texts are reconstructed from inputs using several edit operations.

Table 6: Evaluating the zero-shot generalization of NL-EDIT to different parsers (EditSQL, TaBERT, and RAT-SQL) after training on SPLASH, which is constructed based on the Seq2Struct parser. Top: Summary of the dataset constructed based on each parser. Feedback and explanation length is the number of tokens. Bottom: The Error Correction accuracy on each test set and the end-to-end accuracy of each parser on the full SPIDER dev set with and without interaction. ∆ w/ Interaction is the gain in end-to-end accuracy with the interaction added.
Semantic Parsing with Synthetic Data: Semantic parsing systems have frequently used synthesized data to alleviate the challenge of labeled data scarcity. In their work on building a semantic parser overnight, Wang et al. (2015) proposed a method for training semantic parsers quickly in a new domain using synthetic data. They generate logical forms and canonical utterances and then paraphrase the canonical utterances via crowdsourcing. Several other approaches have demonstrated the benefit of adopting this approach to train semantic parsers in low-resource settings (Su et al., 2017; Zhong et al., 2017; Cheng et al., 2018; Xu et al., 2020). Most recently, synthetic data was used to continue to pre-train language models for semantic parsing tasks (Herzig et al., 2020; Yu et al., 2021a,b). We build on this line of work by showing that we can generate synthetic data automatically, without human involvement, to simulate edits between an erroneous parse and a correct one.

Conclusions and Future Work
We introduced a model, a data augmentation method, and analysis tools for correcting semantic parse errors in text-to-SQL through natural language feedback. Compared to previous models, our model improves the correction accuracy by 16% and boosts the end-to-end parsing accuracy by up to 20% with only one turn of feedback. Our work creates several avenues for future work: (1) improving the model by better modeling the interaction between the inputs and exploring different patterns for decoder-encoder attention, (2) evaluating existing methods for training with synthetic data (e.g., curriculum learning (Bengio et al., 2009)), (3) optimizing the correction model for better user experience using the progress measure we introduce, and (4) using the SQL edits scheme in other related tasks such as conversational text-to-SQL parsing.