An Imitation Game for Learning Semantic Parsers from User Interaction

Despite their widely successful applications, bootstrapping and fine-tuning semantic parsers remain a tedious process, with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstration when uncertain. In doing so, it also gets to imitate the user's behavior and continue improving itself autonomously, with the hope that eventually it may become as good as the user in interpreting their questions. To combat the sparsity of demonstration, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and re-trains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem.


Introduction
Semantic parsing has found tremendous applications in building natural language interfaces that allow users to query data and invoke services without programming (Woods, 1973; Zettlemoyer and Collins, 2005; Berant et al., 2013; Su et al., 2017; Yu et al., 2018). The lifecycle of a semantic parser typically consists of two stages: (1) bootstrapping, where we keep collecting labeled data via trained annotators and/or crowdsourcing for model training until it reaches commercial-grade performance (e.g., 95% accuracy on a surrogate test set), and (2) fine-tuning, where we deploy the system, analyze the usage, and collect and annotate new data to address the identified problems or emerging needs.

Figure 1: A semantic parser proactively interacts with the user in a friendly way to resolve its uncertainties. In doing so, it also gets to imitate the user's behavior and continue improving itself autonomously, with the hope that eventually it may become as good as the user in interpreting their questions.
However, this lifecycle poses several challenges for scaling up or building semantic parsers for new domains: (1) high bootstrapping cost, because mainstream neural parsing models are data-hungry and the annotation cost of semantic parsing data is relatively high; (2) high fine-tuning cost, from continuously analyzing usage and annotating new data; and (3) privacy risks arising from exposing private user conversations to annotators and developers (Lomas, 2019).
In this paper, we suggest an alternative methodology for building semantic parsers that could potentially address all the aforementioned problems.
The key is to involve human users in the learning loop. A semantic parser should be introspective of its uncertainties and proactively prompt for demonstration from the user, who knows the question best, to resolve them. In doing so, the semantic parser can accumulate targeted training data and continue improving itself autonomously without involving any annotators or developers, which also minimizes privacy risks. The bootstrapping cost could also be significantly reduced because an interactive system need not be near-perfectly accurate before deployment. On the other hand, such interaction opens up the black box and allows users to see more of the system's underlying reasoning and better interpret the final results (Su et al., 2018). A human-in-the-loop methodology like this also opens the door for domain adaptation and personalization.
This work builds on the recent line of research on interactive semantic parsing (Li and Jagadish, 2014; Chaurasia and Mooney, 2017; Gur et al., 2018; Yao et al., 2019b). Specifically, Yao et al. (2019b) provide a general framework, MISP (Model-based Interactive Semantic Parsing), which handles uncertainty modeling and natural language generation. We will leverage MISP for user interaction to prove the feasibility of the envisioned methodology. However, existing studies only focus on interacting with users to resolve uncertainties. None of them has answered the crucial question of how to learn from user interaction, which is the technical focus of this study.
One form of user interaction explored for learning semantic parsers is asking users to validate the execution results (Clarke et al., 2010; Iyer et al., 2017). While appealing, in practice it may be a difficult task for real users because they would not need to ask the question if they knew the answer in the first place. We instead aim to learn semantic parsers from fine-grained interaction where users only need to answer simple questions covered by their background knowledge (Figure 1). However, learning signals from such fine-grained interactions are bound to be sparse because the system needs to avoid asking too many questions and overwhelming the user, which poses a challenge for learning.
To this end, we propose a novel annotation-efficient imitation learning algorithm for learning semantic parsers from such sparse, fine-grained demonstration: The agent (semantic parser) only requests demonstration when it is uncertain about a state (parsing step). For the certain/confident states, the actions chosen by the current policy are deemed correct. The policy is updated iteratively in a Dataset Aggregation fashion (Ross et al., 2011). At each iteration, all the state-action pairs, demonstrated or confident, are included to form a new training set, on which a new policy is trained in a supervised way. Intuitively, using confident predictions for training mitigates the sparsity issue, but it may also introduce noise. We provide a theoretical analysis of the proposed algorithm and show that, under mild assumptions, the quality of the final policy is mainly determined by the quality of the initial policy and the confidence estimation accuracy.
Using simulated users, we also empirically compare our method with a number of baselines on the text-to-SQL parsing problem, including the powerful but costly baseline of full expert annotation. On the WikiSQL (Zhong et al., 2017) dataset, compared with the full annotation baseline, we show that, when bootstrapped using only 10% of the training data, our method can achieve almost the same test accuracy (2% absolute loss) while using less than 10% of the annotations, without even taking into account the different unit cost of annotation from users vs. domain experts. We also show that the quality of the final policy is largely determined by the quality of the initial policy, further confirming the theoretical analysis. Finally, we demonstrate that the system can generalize to more complicated semantic parsing tasks such as Spider (Yu et al., 2018).

Related Work
Interactive Semantic Parsing. Our work extends interactive semantic parsing, a recent idea that leverages system-user interactions to improve semantic parsing on the fly (Li and Jagadish, 2014; He et al., 2016; Chaurasia and Mooney, 2017; Su et al., 2018; Gur et al., 2018; Yao et al., 2019a,b). As an example, Gur et al. (2018) built a neural model to identify and correct error spans in a generated SQL query via dialogues. Yao et al. (2019b) further generalized the interaction framework by formalizing a model-based intelligent agent called MISP. Our system leverages MISP to support interactivity but focuses on developing an algorithm for continually improving the base parser from end user interactions, which has not been accomplished by previous work.

Feedback-based Interactive Learning.
Learning interactively from user feedback has been studied for machine translation (Nguyen et al., 2017;Petrushkov et al., 2018;Kreutzer and Riezler, 2019) and other NLP tasks (Sokolov et al., 2016;Gao et al., 2018;Hancock et al., 2019). Most relevant to this work, Hancock et al. (2019) constructed a chatbot that learns to request feedback when the user is unsatisfied with the system response, and then further improves itself periodically from the satisfied responses and feedback responses. The work reaffirms the necessity of human-in-the-loop autonomous learning systems like ours.
In the field of semantic parsing, Clarke et al. (2010) and Iyer et al. (2017) learned semantic parsers from user validation of query execution results. However, oftentimes it may not be very practical to expect end users to be able to validate answer correctness (e.g., consider validating an answer "103" for the question "how many students have a GPA higher than 3.5" against a massive table). Active learning has also been leveraged to selectively obtain gold labels for semantic parsing and save human annotation (Duong et al., 2018; Ni et al., 2020). Our work is complementary to this line of research, as we focus on learning interactively from end users (not "teachers").
Imitation Learning. Traditional imitation learning algorithms (Daumé et al., 2009; Ross and Bagnell, 2010; Ross et al., 2011; Ross and Bagnell, 2014) iteratively execute and train a policy by collecting expert demonstrations for every policy decision. Despite their efficacy, such algorithms demand costly annotations from experts. In contrast, we save expert effort by selectively requesting demonstrations. This idea is related to active imitation learning (Chernova and Veloso, 2009; Kim and Pineau, 2013; Judah et al., 2014; Zhang and Cho, 2017). For example, Judah et al. (2014) used active learning to select informative trajectories from an unlabeled data pool for expert demonstration. However, their setting assumes a "teacher" who intentionally provides labels and an unlabeled data pool, while our algorithm targets end users who are using the system. Similar to our approach, Chernova and Veloso (2009) solicited expert demonstrations only for uncertain states. However, their algorithm simply abandons confident policy actions, leading to sparse training data. Instead, our algorithm utilizes confident policy actions to combat the sparsity issue and additionally comes with a theoretical analysis.

Preliminaries
Formally, we assume the semantic parsing model generates a semantic parse by executing a sequence of actions a_t (parsing decisions), one at each time step t. In practice, the definition of an action depends on the specific semantic parsing model, as we will illustrate shortly. A state s_t is then defined as a tuple (q, a_{1:t-1}), where q is the initial natural language question and a_{1:t-1} = (a_1, ..., a_{t-1}) is the current partial parse. In particular, the initial state s_1 = (q, ∅) contains only the question. Denote a semantic parser as π̂, which is a policy function (Sutton and Barto, 2018) that takes a state s_t as input and outputs a probability distribution over the action space. The semantic parsing process can be formulated as sampling a trajectory τ by alternately observing a state and sampling an action from the policy, i.e., τ = (s_1, a_1 ∼ π̂(s_1), ..., s_T, a_T ∼ π̂(s_T)), assuming a trajectory length T. The probability of the generated semantic parse is

$p_{\hat{\pi}}(a_{1:T} \mid s_1) = \prod_{t=1}^{T} p_{\hat{\pi}}(a_t \mid s_t).$
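The parse-as-trajectory view can be sketched in a few lines of Python. This is a minimal illustration only: the `policy` callable (returning a dict of action probabilities) and the tuple-based state encoding are hypothetical stand-ins for a real parser's internals.

```python
import math
import random

def sample_trajectory(policy, question, T):
    """Roll out a policy: alternately observe a state and sample an action.

    `policy(state)` is assumed to return a dict mapping each action to its
    probability; a state is (question, partial_parse) as defined in the text.
    """
    actions, log_prob = [], 0.0
    for t in range(T):
        state = (question, tuple(actions))          # s_t = (q, a_{1:t-1})
        dist = policy(state)
        a_t = random.choices(list(dist), weights=dist.values())[0]
        log_prob += math.log(dist[a_t])             # accumulate log p(a_t | s_t)
        actions.append(a_t)
    return actions, log_prob                        # p(a_{1:T}|s_1) = exp(log_prob)
```

The returned log-probability is the sum over steps, matching the factorization of p(a_{1:T}|s_1) above.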
An interactive semantic parser typically follows the aforementioned definition and requests the user's validation of a specific action a_t. Based on the feedback, a correct action a*_t can be inferred to replace the original a_t, and the parsing process continues with a*_t afterwards. In this work, we adopt MISP (Yao et al., 2019b) as the back-end interactive semantic parsing framework, which enables system-user interaction via a policy-probability-based uncertainty estimator, a grammar-based natural language generator, and a multi-choice question-answering interaction design, as shown in Figure 1.
Example. Consider the SQLova parser (Hwang et al., 2019), which generates a query by filling "slots" in a pre-defined SQL sketch "SELECT Agg SCol WHERE WCol OP VAL". To complete the SQL query in Figure 1, it first takes three steps: SCol="School/Club Team" (a_1), Agg="COUNT" (a_2), and WCol="School/Club Team" (a_3). MISP detects that a_3 is uncertain because its probability is lower than a pre-specified threshold. It validates a_3 with the user and corrects it with WCol="Player" (a*_3). The parsing continues with OP="=" (a_4) and VAL="jalen rose" (a_5). The trajectory length is T = 5 in this case.
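The slot-filling trajectory for the Figure 1 example can be rendered as a toy sketch. The slot names follow the sketch "SELECT Agg SCol WHERE WCol OP VAL" from the text; the concrete (slot, value) action encoding is an illustrative assumption, not SQLova's actual interface.

```python
# Toy rendering of SQLova-style sketch filling for the Figure 1 example.
SKETCH = "SELECT {Agg}({SCol}) WHERE {WCol} {OP} {VAL}"

def render(actions):
    """Fill the sketch from a list of (slot, value) parsing actions."""
    slots = dict(actions)
    return SKETCH.format(**slots)

trajectory = [
    ("SCol", "School/Club Team"),   # a1
    ("Agg", "COUNT"),               # a2
    ("WCol", "Player"),             # a3*: corrected via user interaction
    ("OP", "="),                    # a4
    ("VAL", "'jalen rose'"),        # a5
]
```

Rendering this trajectory yields the corrected query of Figure 1 rather than the one with the erroneous WHERE column.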

Learning Semantic Parsers from User Interaction
In this section, we present an imitation learning algorithm for learning semantic parsers from user interactions. The algorithm is annotation-efficient and can train a parser without requiring a large amount of user feedback (or "annotations"), an important property for practical use in an end-user-facing system. Note that while we apply this algorithm to semantic parsing in this work, in principle it can be applied to other structured prediction tasks (e.g., text summarization or machine translation) as well.

An Imitation Learning Formulation
Under the interactive semantic parsing framework, a learning algorithm can intuitively aggregate (s_t, a*_t) pairs collected from user interactions and train the parser to enforce a*_t under the state s_t = (q, a_{1:t-1}). However, this is not achievable by conventional supervised learning, since training needs to be conducted in an interactive environment, where the partial parse a_{1:t-1} is generated by the parser itself.
Instead, we formulate it as an imitation learning problem (Daumé et al., 2009; Ross and Bagnell, 2010; Ross et al., 2011; Ross and Bagnell, 2014). Consider the user as a demonstrator; the derived action a*_t can then be viewed as an expert demonstration that is interactively sampled from the demonstrator's policy (or expert policy) π*, i.e., a*_t ∼ π*(s_t). The goal of our algorithm is thus to train the policy π̂ to imitate the expert policy π*. A general procedure is described in Algorithm 1, where π̂ is learned iteratively for every m user questions (Lines 1-9).

Annotation-efficient Imitation Learning
Consider parsing a user question and collecting training data using the parser π̂_i in the i-th iteration (Line 5). A standard imitation learning algorithm such as DAGGER (Ross et al., 2011) usually requests an expert demonstration a*_t for every state s_t in the sampled trajectory. However, this requires a considerable amount of user annotation, which may not be practical when interacting with end users.
We propose an annotation-efficient imitation learning algorithm, which saves user annotations by selectively requesting user intervention, as shown in function PARSE&COLLECT of Algorithm 1:

Algorithm 1 (fragment):
  ...
   7:  Train policy π̂_{i+1} on D using Eq. (1).
   8:  end for
   9:  return best π̂_i on validation.
  ...
  function PARSE&COLLECT
  ...
  12:  for t = 1 to T do
  13:    Preview action a_t = argmax_a π̂_i(s_t);
  14:    if p_{π̂_i}(a_t | s_t) ≥ µ then
  15:      Collect (s_t, a_t) into D_i;
  16:      Execute a_t;
  17:    else
  18:      Trigger user interaction and derive expert demonstration a*_t ∼ π*(s_t);
  19:      Collect (s_t, a*_t) into D_i;
  20:      Execute a*_t;
  21:    end if
  22:  end for
  ...
  25:  return D_i.
  26:  end function

Specifically, in each parsing step, the system first previews whether it is confident about its own decision a_t (Lines 13-14), which is the case when its probability is no less than a threshold, i.e., p_{π̂_i}(a_t | s_t) ≥ µ. In that case, the algorithm executes and collects the policy action a_t (Lines 15-16); otherwise, a system-user interaction is triggered, and the derived demonstration a*_t ∼ π*(s_t) is collected and executed to continue parsing (Lines 17-22).
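The selective-demonstration loop can be sketched as a short Python function. This is a simplified rendition: `policy` and `user_demonstration` are illustrative stand-ins (a probability-dict policy and a simulated expert), and real systems would also track validity flags for the collected pairs.

```python
def parse_and_collect(policy, question, user_demonstration, T, mu=0.95):
    """One pass of PARSE&COLLECT: execute confident actions, ask otherwise.

    `policy(state)` -> dict of action probabilities;
    `user_demonstration(state)` plays the expert policy pi* and is only
    invoked when the previewed action falls below the threshold `mu`.
    """
    dataset, actions = [], []
    for t in range(T):
        state = (question, tuple(actions))
        dist = policy(state)
        a_t = max(dist, key=dist.get)            # preview the argmax action
        if dist[a_t] >= mu:                      # confident: trust own action
            dataset.append((state, a_t))
            actions.append(a_t)
        else:                                    # uncertain: ask the user
            a_star = user_demonstration(state)
            dataset.append((state, a_star))
            actions.append(a_star)
    return dataset
```

Only the uncertain steps cost a user interaction; confident steps contribute self-labeled pairs for free, which is the source of the annotation savings.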
Denote a collected state-action pair as (s_t, ã_t), where ã_t can be a_t or a*_t depending on whether an interaction is requested. To train π̂_{i+1} (Line 7), our algorithm adopts a reduction-based approach similar to DAGGER and reduces imitation learning to iterative supervised learning. Formally, we define our training loss function as a weighted negative log-likelihood:

$\mathcal{L}(\hat{\pi}) = -\sum_{(s_t, \tilde{a}_t) \in D} w_t \log p_{\hat{\pi}}(\tilde{a}_t \mid s_t), \quad (1)$

where D is the aggregated training data over i iterations and w_t denotes the weight of (s_t, ã_t).
We consider assigning the weight w_t in three cases: (1) For confident actions a_t, we set w_t = 1. This essentially treats confident actions as gold decisions, which resembles self-training (Nigam and Ghani, 2000). (2) For user-confirmed decisions (valid demonstrations a*_t), such as enforcing a WHERE condition on "Player" in Figure 1, w_t is also set to 1 to encourage the parser to imitate the correct decisions from users. (3) For uncertain actions that cannot be addressed via human interactions (invalid demonstrations a*_t), we assign w_t = 0. This can happen when some incorrect precedent actions are not fixed. For example, in Figure 1, if the system missed correcting the WHERE condition on "School/Club Team", then whatever value it generates after "WHERE School/Club Team=" is wrong, and thus any action a*_t derived from human feedback would be invalid. A possible training strategy in this case is to set w_t to be negative, similar to Welleck et al. (2020). However, empirically we find that this strategy fails to train the parser to correct its mistake in generating "School/Club Team" and instead disturbs model training. To solve this problem, we directly set w_t = 0 to remove the impact of unaddressed actions. A similar solution is adopted by Petrushkov et al. (2018) and Kreutzer and Riezler (2019). As shown in Section 6, this weight assignment enables stable improvement in iterative model learning while requiring fewer user annotations.
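The weighted loss of Eq. (1) with the three-case weight assignment can be sketched as follows. The `prob_fn` callable is a hypothetical stand-in for the parser's per-step probability p_π̂(ã_t | s_t); in a real system this loss would be computed with autograd over model parameters.

```python
import math

def weighted_nll(dataset, prob_fn):
    """Weighted negative log-likelihood over aggregated state-action pairs.

    Each entry is (state, action, weight), with weight 1 for confident
    actions and valid demonstrations, and 0 for invalid demonstrations,
    mirroring the three cases in the text.
    """
    loss = 0.0
    for state, action, w in dataset:
        if w > 0:                        # w = 0 entries contribute nothing
            loss -= w * math.log(prob_fn(state, action))
    return loss
```

Setting w_t = 0 (rather than a negative weight) simply drops the invalid pairs from the objective, which is the stable choice reported in the text.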

Theoretical Analysis
While our system enjoys the benefit of learning from a small amount of user feedback, one crucial question is whether it can still achieve the same level of performance as a system trained on full expert annotations, if one could afford that and manage the privacy risk. In other words, what is the performance gap between our system and a fully supervised one? In this section, we answer this question by showing that the gap is mainly bounded by the learning policy's probability of trusting a confident action that turns out to be wrong, which can be controlled in practice.
In the analysis, we follow prior work (Ross and Bagnell, 2010; Ross et al., 2011) to assume a unified trajectory length T and focus the proof on the "infinite sample" case, which assumes an infinite number of samples in each iteration (i.e., m = ∞ in Algorithm 1), such that the state space can be fully explored by the current policy. An analysis of the "finite sample" case can be found in Appendix A.5.

Cost Function for Analysis
Different from typical imitation learning tasks (e.g., Super Tux Kart (Ross et al., 2011)), in semantic parsing there exists only one gold trajectory that is semantically identical to the question and returns correct execution results. Whenever a policy action differs from the gold one, the whole trajectory fails to yield the correct semantic meaning. Therefore, we analyze a policy's performance only when it is conditioned on a gold partial parse, i.e., s_t ∼ d^t_{π*}, where d^t_{π*} is the state distribution at step t when executing the expert policy π* for the first t-1 steps. Let ℓ(s, π̂) = 1 − p_π̂(a = a* | s) be the loss of π̂ making a mistake at state s. Summing a policy's expected loss over T steps, we define the cost of the policy as:

$J(\hat{\pi}) = \sum_{t=1}^{T} \mathbb{E}_{s \sim d^t_{\pi^*}}[\ell(s, \hat{\pi})] = T \, \mathbb{E}_{s \sim d_{\pi^*}}[\ell(s, \hat{\pi})], \quad (2)$

where $d_{\pi^*} = \frac{1}{T}\sum_{t=1}^{T} d^t_{\pi^*}$ denotes the average expert state distribution (assuming the time step t is a random variable uniformly sampled from 1 ∼ T). A detailed derivation is shown in Appendix A.1.
The better π̂ is, the smaller this cost becomes. Although it is not exactly the same as the objective evaluated in our experiments, which measures the correctness of a complete trajectory (rather than a single policy action) sampled from π̂, this simplified version makes theoretical analysis easier and reflects a consistent relative performance among algorithms. Next, we derive the bound of each policy's cost in order to compare their performance.

Cost Bound of Supervised Approach
A fully supervised system trains a parser on expert-annotated (q, a*_{1:T}) pairs, where the gold semantic parse a*_{1:T} can be viewed as generated by executing the expert policy π*. This gives the policy

$\hat{\pi}_{sup} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}[\ell(s, \pi)],$

where Π is the policy space induced by the model architecture. A detailed derivation in Appendix A.2 shows the cost bound of the supervised approach:

Theorem 5.1. For the supervised approach, let $\epsilon_N = \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}[\ell(s, \pi)]$; then $J(\hat{\pi}_{sup}) = T\epsilon_N$.
The theorem gives an exact bound (as shown by the equality) since, under the "infinite sample" assumption, the supervised approach trains a policy on the same state distribution d_{π*} as the one evaluated in the cost function (Eq. (2)). As we will show next, when adopting an annotation-efficient learning strategy, our proposed algorithm breaks this consistency and thus induces a performance gap compared with the supervised approach.

Cost Bound of Our Proposed Algorithm
During its iterative learning, Algorithm 1 produces a sequence of policies π̂_{1:N} = (π̂_1, π̂_2, ..., π̂_N), where N is the number of training iterations, and returns the one with the best test-time performance on validation as π̂ (Line 9). Recall that our algorithm samples a trajectory by executing actions from both the previously learned policy π̂_i and the expert policy π* (when an interaction is requested). Let π_i denote such a "mixture" policy. The cost of the learned policy π̂ can be bounded as:

$J(\hat{\pi}) \leq \min_{i \in 1:N} T \left( \mathbb{E}_{s \sim d_{\pi_i}}[\ell(s, \hat{\pi}_i)] + \ell_{max} \, \| d_{\pi_i} - d_{\pi^*} \|_1 \right).$

This bound comprises two parts. The first term, $\mathbb{E}_{s \sim d_{\pi_i}}[\ell(s, \hat{\pi}_i)]$, is the expected training loss of π̂_i. Notice that, in training, each trajectory is sampled from the mixture policy (s ∼ d_{π_i}), while in evaluation, we measure a policy's performance conditioned on a gold partial parse (s ∼ d_{π*} in Eq. (2)). This discrepancy, which does not exist in the supervised approach, explains the performance loss of our algorithm; it is captured by the second term, the L_1 distance between d_{π_i} and d_{π*}, weighted by the maximum loss value ℓ_max that π̂_i encounters over training. Bounding the two terms gives the following theorem:

Theorem 5.2. For our algorithm, let $\tilde{\epsilon}_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}[\ell(s, \pi)]$ denote the best expected policy loss in hindsight, and let e_i denote the probability that π̂_i does not query the expert policy (i.e., being confident) but its own action is wrong under d_{π*}; then the cost of the returned policy is bounded by $T\tilde{\epsilon}_N$ plus a term proportional to $\ell_{max} T \cdot \frac{1}{N}\sum_{i=1}^{N} e_i$. A detailed derivation can be found in Appendix A.3-A.4.

Remarks.
A comparison of Theorem 5.1 and Theorem 5.2 shows that the performance gap introduced by our algorithm is mainly bounded by $\frac{1}{N}\sum_{i=1}^{N} e_i$. Intuitively, this is because whenever a learning policy in our algorithm collects its own, but wrong, action as gold for training, it introduces noise that does not exist in the supervised approach's training set. This finding inspires us to restrict the gap by lowering the learning policy's error rate when it does not query the expert. Empirically, this can be achieved by ensuring:

• Accurate policy confidence estimation, such that actions regarded as confident are generally correct.

• Moderate model initialization, such that the policy is generally less likely to take wrong actions throughout the iterative training.

For the first point, we set a high confidence threshold µ, which has been demonstrated to be reliable for MISP (Yao et al., 2019b). In the future, it can even be replaced by a machine learning module (see the discussion in Section 7). We empirically validate the second point in our experiments.

Experiments
In this section, we conduct experiments to demonstrate the annotation efficiency of our algorithm (Section 4) and that it can train semantic parsers to reach high performance when the system is reasonably instantiated, consistent with our theoretical analysis in Section 5.

Experimental Setup
We test our system on the WikiSQL dataset (Zhong et al., 2017). The dataset contains a large number of annotated question-SQL pairs (56,355 pairs for training) and thus serves as a good resource for experimenting with iterative learning. For the base semantic parser, we choose SQLova (Hwang et al., 2019), one of the top-performing models on WikiSQL, to ensure a reasonable model capacity in terms of data utility along the iterative training.
To instantiate the proposed algorithm, we set a high confidence threshold µ = 0.95 following Yao et al. (2019b) and experiment with different initialization settings as suggested by our analysis in Section 5, using 10%, 5%, and 1% of the total training data. During iterative learning, questions from the remaining training data arrive in a random order to simulate user questions. The parser is trained with simulated user feedback (obtained by directly comparing the synthesized query with the gold one) iteratively for every m = 1,000 questions. We test systems under different numbers of training iterations N and report results averaged over three random runs. More implementation details are included in Appendix B.1.
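The simulated user can be sketched as a tiny helper built from the gold parse. The step-indexed interface below is an assumption for illustration; the actual simulator compares the synthesized query with the gold query inside the MISP interaction loop.

```python
def simulated_feedback(gold_actions):
    """Build a simulated user that validates/corrects a step against the gold parse.

    Returns a function usable as the expert policy pi*: given the step index
    and the parser's proposed action, it confirms a correct proposal or
    supplies the gold action instead.
    """
    def expert(t, proposed_action):
        gold = gold_actions[t]
        return proposed_action if proposed_action == gold else gold
    return expert
```

Because the simulator always answers with the gold decision, it plays a noise-free expert; real users would introduce feedback noise, as discussed in Section 7.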

System Comparison
We compare our system (denoted as MISP-L, since it builds a Learning algorithm upon MISP) with the traditional supervised approach (denoted as Full Expert). To investigate the skyline capability of our system, we also present a variant called MISP-L*, which is assumed to have perfect confidence measurement and interaction design, so that it can precisely identify and correct its mistakes during parsing. This is implemented by allowing the system to compare its synthesized query with the gold one. Note that this is not a realizable automatic system; we show its performance as an upper bound for MISP-L.
On the other hand, while the learning systems of Clarke et al. (2010) and Iyer et al. (2017), which request user validation of query execution results, may not be very practical for end users, we include them nonetheless in the interest of comprehensive comparison. This leads to two baseline systems. The Binary User system requests binary user feedback on whether executing the generated SQL query returns correct database results and collects only queries with correct execution results to further improve the parser, similar to Clarke et al. (2010). The Binary User+Expert system additionally collects full expert SQL annotations when the execution results of the generated SQL queries are wrong, similar to Iyer et al. (2017).

Experimental Results
We evaluate each system by answering the two research questions (RQs): • RQ1: Can the system learn a semantic parser without requiring a large amount of annotations?
• RQ2: For interactive systems, while requiring weaker supervision, can they train the parser to reach a performance comparable to the traditional supervised system?
For RQ1, we measure the number of user/expert annotations a system requires to train a parser. For Full Expert, this number equals the trajectory length of the gold query (e.g., 5 for the query in Figure 1); for MISP-L and MISP-L*, it is the number of user interactions during training. For Binary User(+Expert), it is hard to quantify "one annotation", whose cost varies with the actual database size and the query difficulty. In experiments, we approximate this number by calculating it in the same way as for Full Expert, under the assumption that validating an answer is in general as hard as validating the SQL query itself. More accurate metrics could be derived from a user study, as discussed in Section 7. Note that while we do not differentiate the actual cost (e.g., time and financial cost) of users vs. experts here, we emphasize that our system enjoys the additional benefit of collecting training examples from a much cheaper and more abundant source while serving end users' needs at the same time.
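The annotation accounting above can be made explicit in a small helper. The system names follow the paper's, but the function itself (and its interface) is purely illustrative.

```python
def annotation_cost(system, gold_len, num_interactions=0):
    """Approximate per-question annotation count, per the accounting above.

    Full Expert and Binary User(+Expert) are charged the gold trajectory
    length (answer validation is assumed as hard as query validation);
    MISP-L(*) is charged one unit per triggered interaction.
    """
    if system in ("full_expert", "binary_user", "binary_user+expert"):
        return gold_len
    if system in ("misp-l", "misp-l*"):
        return num_interactions
    raise ValueError(f"unknown system: {system}")
```

Under this accounting, a MISP-L question that triggers a single interaction (as in Figure 1) costs 1 annotation versus 5 for Full Expert.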
Figure 2 (top) shows each system's parsing accuracy on the WikiSQL test set after being trained on certain amounts of annotations. Consistently across all initialization conditions, MISP-L consumes a comparable or smaller amount of annotations to train the parser to the same parsing accuracy. As shown in Figure 5 in the Appendix, MISP-L requires on average no more than one interaction for most questions along the iterative training. Given the limited size of the WikiSQL training set, the simulation experiments can currently only show the system's performance under a small number of annotations. However, we expect this gain to continue as the system receives more user questions in long-term deployment.
To answer RQ2, Figure 2 (bottom) compares each system's parsing accuracy after being trained for the same number of iterations. The results demonstrate that when a semantic parser is moderately initialized (the 10%/5% initialization settings), MISP-L can further improve it to reach an accuracy comparable to Full Expert (0.776/0.761 vs. 0.794 in the last iteration). In the extremely weak 1% initialization setting (using only around 500 initial training examples), all interactive learning systems suffer from a huge performance loss. This is consistent with our finding in the theoretical analysis (Section 5). In Appendix C, we plot the value of e_i, the probability that π̂_i makes a confident but wrong decision given a gold partial parse, showing that a better-initialized policy generally obtains a smaller e_i throughout training and thus a tighter cost bound.

Figure 2: We experiment with three initialization settings, using 10%, 5%, and 1% of the training data, respectively.
For both RQ1 and RQ2, our system surpasses Binary User, the execution-feedback-based system. In experiments, we find that the inferior performance of Binary User is mainly due to the "spurious program" issue (Guu et al., 2017), i.e., a SQL query with correct execution results can still be incorrect in terms of semantics (see footnote 4). MISP-L circumvents this issue by directly validating the semantic meaning of intermediate parsing decisions.
Finally, when assumed to have a perfect interaction design and confidence estimator, MISP-L* shows striking superiority in both aspects. Since it always corrects wrong decisions immediately, MISP-L* can collect and derive the same training examples as Full Expert, and thus trains the parser to Full Expert's performance level. Meanwhile, it requires only 10% of the annotations that Full Expert consumes. These observations imply large room for MISP-L to be improved in the future.

Footnote 4: For example, contrast "WHERE C1=A" with "WHERE C1=A and C2=B". They can give the same execution results when all records satisfying "C1=A" also happen to meet "C2=B". However, semantically the latter includes an extra condition which may not be specified by the question.
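The spurious-program issue of footnote 4 is easy to reproduce concretely. The toy table below is made up for illustration; it shows two WHERE clauses that return identical results even though only one matches the question's semantics.

```python
import sqlite3

# Two queries whose execution results coincide on this particular table,
# though the second adds an extra condition the question may not specify.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c1 TEXT, c2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("A", "B"), ("A", "B"), ("X", "Y")])

r1 = conn.execute("SELECT COUNT(*) FROM t WHERE c1='A'").fetchone()
r2 = conn.execute("SELECT COUNT(*) FROM t WHERE c1='A' AND c2='B'").fetchone()
# On this table, every row with c1='A' also has c2='B', so r1 == r2:
# execution feedback alone cannot tell the two queries apart.
```

This is exactly why binary execution feedback can accept a semantically wrong query, whereas step-level validation of the WHERE clause would catch the extra condition.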

Generalize to Complex SQL Queries
Since queries in WikiSQL are generally simple and follow a pre-specified "SELECT...WHERE..." sketch, the last part of our experiments investigates whether our system can generalize to complex SQL queries. To this end, we test our system on the Spider dataset (Yu et al., 2018), where SQL queries can contain complicated keywords such as GROUP BY. For the base semantic parser, we choose EditSQL (Zhang et al., 2019), one of the open-sourced top models on Spider. Given the small size of Spider (7,377 question-SQL pairs for training after data cleaning), we only experiment with one initialization setting, using 10% of the training set. Since Spider models do not predict the specific values in a SQL query (e.g., "jalen rose" in Figure 1), we cannot execute the generated query to simulate the binary execution feedback. Therefore, we only compare our system with the Full Expert baseline. Parsers are evaluated on the Spider Dev set since the test set is not publicly available. We include all implementation details in Appendix B.2.

Figure 3 (top) shows that our system and its variant consistently achieve comparable or better annotation efficiency. We expect this advantage to continue as the system receives more questions and interactions from users beyond the Spider dataset. However, we also notice that the gain is smaller and MISP-L suffers from a larger performance loss compared with Full Expert (Figure 3, bottom), due to the poorer parser initialization and the complexity of the SQL queries. This can be addressed by adopting better interaction designs and more accurate confidence estimation, as shown by MISP-L*.

Conclusion and Future Work
In this paper, we explore building an interactive semantic parser that continually improves itself from end user interaction, without involving annotators or developers. To this end, we propose an annotation-efficient imitation learning algorithm to learn from the sparse, fine-grained demonstrations. We prove the quality of the algorithm theoretically and show its advantage over the traditional full expert annotation approach via experiments.
As a pilot study on this research topic, we train systems with simulated user feedback. One important future work is to conduct a large-scale user study and collect interactions from real users. This is not trivial and has to account for uncertainties such as noisy user feedback. By analyzing real users' statistics (e.g., average time spent on each question), we believe a more accurate and realistic formulation of user/expert annotation cost can be derived to guide future research.
Besides, we would like to explore more accurate confidence measurement to improve our system, as suggested by our theoretical analysis. In experiments, we observe that the two neural semantic parsers (especially the more complicated EditSQL) tend to be overconfident, and training them with more data does not mitigate this issue. To address that, future directions include neural network calibration (Guo et al., 2017) and using machine learning components (e.g., a reinforcement learning-based active selector (Fang et al., 2017)) to replace the confidence threshold.
Finally, the proposed annotation-efficient imitation learning algorithm can be generalized to other NLP tasks (Sokolov et al., 2016).

A Theoretical Analysis

In this section, we give a detailed theoretical analysis to derive the cost bound of the supervised approach and of our proposed annotation-efficient imitation learning algorithm in Section 4. Following Ross et al. (2011), we first focus the proof on the infinite sample case, which assumes an infinite number of samples to train a policy in each iteration (i.e., m = ∞ in Algorithm 1). As an overview, we start the analysis by introducing in Appendix A.1 the "cost function" used to analyze each policy, which represents an inverse quality of the policy. In Appendix A.2, we derive the cost bound of the supervised approach. Appendix A.3 and Appendix A.4 then discuss the cost bound of our proposed algorithm. Finally, in Appendix A.5, we show the cost bound of our algorithm in the finite sample case.

A.1 Cost Function for Analysis
In a semantic parsing task, whenever a policy action differs from the gold one, the whole trajectory cannot yield the correct semantic meaning. Therefore, we analyze a policy's performance only when it is conditioned on a gold partial parse, i.e., $s_t \sim d^t_{\pi^*}$, where $d^t_{\pi^*}$ is the state distribution at step $t$ when executing the expert policy $\pi^*$ for the first $t-1$ steps. Given a question $q$ and denoting by $a^*_{1:t}$ the gold partial trajectory sampled by the expert policy $\pi^*$, we define the cost of sampling a partial trajectory $a_{1:t} = (a_1, ..., a_t)$ as:
$$C(a_{1:t} \mid q) = \mathbb{1}[a_t \neq a^*_t], \quad \text{given } a_{1:t-1} = a^*_{1:t-1}.$$
Based on this definition, we further define the expected cost of $\hat\pi$ at a single time step $t$, given the question $q$ and the gold partial parse $a_{1:t-1} \sim \pi^*$, as:
$$C_t(\hat\pi \mid q) = \mathbb{E}_{a_t \sim \hat\pi}\left[C(a_{1:t} \mid q)\right] = 1 - p_{\hat\pi}(a_t = a^*_t),$$
where $p_{\hat\pi}(a_t = a^*_t)$ denotes the probability of sampling the gold action $a^*_t$ from the policy $\hat\pi$. By taking an expectation over all questions $q \in Q$, we have the following derivations:
$$\mathbb{E}_{q}\left[C_t(\hat\pi \mid q)\right] = \mathbb{E}_{q,\, a_{1:t-1} \sim \pi^*}\left[1 - p_{\hat\pi}(a_t = a^*_t)\right] = \mathbb{E}_{s_t \sim d^t_{\pi^*}}\left[1 - p_{\hat\pi}(a_t = a^*_t \mid s_t)\right].$$
The second equality holds by the definition $s_t = (q, a_{1:t-1})$. In this analysis, we follow Ross and Bagnell (2010) and Ross et al. (2011) to assume a unified decision length $T$. By summing the above expected cost over the $T$ steps, we define the total cost of executing policy $\hat\pi$ for $T$ steps as:
$$J(\hat\pi) = \sum_{t=1}^{T} \mathbb{E}_{s_t \sim d^t_{\pi^*}}\left[1 - p_{\hat\pi}(a_t = a^*_t \mid s_t)\right].$$
Denote by $\ell(s, \hat\pi) = 1 - p_{\hat\pi}(a = a^* \mid s)$, $a \sim \hat\pi(s)$, $a^* \sim \pi^*(s)$, the "loss function" in our analysis, which is bounded within $[0, 1]$. The cost of policy $\hat\pi$ can then be simplified as:
$$J(\hat\pi) = \sum_{t=1}^{T} \mathbb{E}_{s_t \sim d^t_{\pi^*}}\left[\ell(s_t, \hat\pi)\right] = T\, \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \hat\pi)\right], \quad (3)$$
where $d_{\pi^*} = \frac{1}{T}\sum_{t=1}^{T} d^t_{\pi^*}$ is the average expert state distribution, and the second equality treats the time step $t$ as a random variable under the uniform distribution $U(1, T)$.
The better a policy $\hat\pi$ is, the smaller this cost becomes. Our analysis thus compares policies by deriving bounds on their costs.
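As a numerical illustration of Eq. (3) with fabricated per-step probabilities (the values below are hypothetical, not from the experiments), the total cost equals $T$ times the average per-step loss under the expert state distribution:

```python
import random

random.seed(0)

# Monte-Carlo illustration of Eq. (3) with hypothetical numbers:
# J(pi) = sum_t E[1 - p(a_t = a*_t)] = T * E_{s ~ d_pi*}[loss(s, pi)].
T = 4
p_gold = [0.9, 0.8, 0.95, 0.7]  # assumed P(policy picks gold action) per step

analytic_J = sum(1.0 - p for p in p_gold)
avg_loss = analytic_J / T
assert abs(analytic_J - T * avg_loss) < 1e-12

# Estimate the same quantity by rolling out along gold prefixes: each
# state is conditioned on the gold partial parse, so errors at
# different steps are counted independently.
n, total_errors = 20000, 0
for _ in range(n):
    total_errors += sum(random.random() >= p for p in p_gold)
mc_J = total_errors / n
```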

A.2 Derivation of Cost Bound for Supervised Approach
In this section, we analyze the cost bound of the supervised approach. Recall that the supervised approach trains a policy $\hat\pi$ with the standard supervised learning algorithm, using supervision from $\pi^*$ at every decision step. On infinite samples, it therefore finds the best policy
$$\hat\pi_{sup} = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \pi)\right], \quad (4)$$
where $\Pi$ denotes the policy space induced by the model architecture, and the expectation over $s$ is taken over the whole state distribution $d_{\pi^*}$ because of the "infinite sample" assumption. By Eq. (3), the supervised approach thus obtains the following cost bound:
$$J(\hat\pi_{sup}) = T\, \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \hat\pi_{sup})\right] = T \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \pi)\right].$$
This gives the following theorem:

Theorem A.1. For the supervised approach, let $\epsilon_N = \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}[\ell(s, \pi)]$; then $J(\hat\pi_{sup}) = T \epsilon_N$.
The cost bound of the supervised approach represents its exact performance as implied by the equality. This is because the approach trains a policy (Eq. (4)) under the same state distribution d π * (given the "infinite sample" assumption) as in evaluation (Eq. (3)). As we will show next, the proposed annotation-efficient imitation learning algorithm breaks this consistency while enjoying the benefit of high annotation efficiency, which explains the performance gap.

A.3 No-regret Assumption
The derivation of the cost bound of our proposed annotation-efficient imitation learning algorithm leverages a "no-regret" assumption:

Assumption A.1 (No-regret). The sequence of policies $\hat\pi_{1:N}$ produced by the online training procedure is no-regret, i.e.,
$$\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}\left[\ell(s, \hat\pi_i)\right] - \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}\left[\ell(s, \pi)\right] \leq \gamma_N,$$
with $\gamma_N \to 0$ as $N \to \infty$.

Many no-regret algorithms (Hazan et al., 2007; Kakade and Tewari, 2009) that guarantee $\gamma_N \in O(\frac{1}{N})$ require convexity or strong convexity of the loss function. However, the loss function in our application, built on top of a deep neural network model, does not satisfy this requirement. In this analysis, we simplify the setting and directly make this assumption for convenience of the proof. Deriving a more accurate regret bound for non-convex neural networks is left for future research.
Another concern is that the collected online training labels come not only from the expert policy $\pi^*$ (when it is queried), but also from the learning policy $\hat\pi_i$ (when the agent has high confidence in its own action). Labels from the learning policy may introduce noise into fitting the expert policy. In practice, however, the impact of such noisy labels is limited when the confidence threshold $\mu$ is set to a high value (e.g., 0.95). In this case, labels from $\hat\pi_i$ are generally clean and lead to increasing performance during iterative training. Therefore, it is still safe to make the no-regret assumption.
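The label-collection scheme described above can be sketched as follows; the function names and the policy interface are hypothetical stand-ins, and only the 0.95 threshold mirrors the text:

```python
# Sketch of one round of confidence-thresholded label collection
# (hypothetical interfaces; not the authors' actual implementation).
MU = 0.95  # confidence threshold from the text

def collect_labels(policy, expert, states, mu=MU):
    """For each state, keep the policy's own action when it is confident;
    otherwise query the expert for a demonstration."""
    dataset = []
    for s in states:
        action, confidence = policy(s)
        if confidence >= mu:
            dataset.append((s, action))      # confident self-label
        else:
            dataset.append((s, expert(s)))   # expert demonstration
    return dataset

# Toy usage with fabricated policies over integer "states":
# the learner is confident (and correct) on s < 3, uncertain otherwise.
policy = lambda s: (s % 2, 0.99 if s < 3 else 0.5)
expert = lambda s: s % 2
data = collect_labels(policy, expert, range(5))
```

With a high threshold, the aggregated dataset mixes clean self-labels with expert demonstrations, which is the regime in which the no-regret assumption remains plausible.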

A.4 Derivation of Cost Bound for Our Proposed Algorithm
As shown in Algorithm 1, our algorithm produces a sequence of policies $\hat\pi_{1:N} = (\hat\pi_1, \hat\pi_2, ..., \hat\pi_N)$, where $N$ is the number of training iterations, and returns the one with the best test-time performance on validation as $\hat\pi$. In training, our algorithm executes actions from both the learning policy $\hat\pi_i$ (when the model is confident) and the expert policy $\pi^*$. We denote this "mixture" policy by $\pi_i$. For the best policy among the first $N$ iterations, we then have the cost bound:
$$J(\hat\pi) \leq \frac{1}{N}\sum_{i=1}^{N} J(\hat\pi_i) = \frac{T}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi^*}}\left[\ell(s, \hat\pi_i)\right] \leq \frac{T}{N}\sum_{i=1}^{N} \left( \mathbb{E}_{s \sim d_{\pi_i}}\left[\ell(s, \hat\pi_i)\right] + \ell_{max} \left\lVert d_{\pi_i} - d_{\pi^*} \right\rVert_1 \right). \quad (5)$$
From the last inequality, we can see that the cost bound of our algorithm is controlled by two terms. The first term, $\mathbb{E}_{s \sim d_{\pi_i}}[\ell(s, \hat\pi_i)]$, denotes the expected loss of $\hat\pi_i$ under the states induced by $\pi_i$ during training (under the "infinite sample" assumption mentioned at the beginning of the analysis). By the no-regret assumption (Assumption A.1), this term can be bounded as
$$\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}\left[\ell(s, \hat\pi_i)\right] \leq \gamma_N + \epsilon_N, \quad \text{where } \epsilon_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim d_{\pi_i}}\left[\ell(s, \pi)\right]$$
denotes the best expected training loss in hindsight.
The second term is the $L_1$ distance between the state distributions induced by $\pi_i$ and $\pi^*$, weighted by the maximum loss value $\ell_{max}$ that $\hat\pi_i$ encounters during training. Note that, unlike the supervised approach, our algorithm trains a policy under $d_{\pi_i}$, which differs from the state distribution $d_{\pi^*}$ used to evaluate the policy (Eq. (3)). This discrepancy explains the performance loss of our algorithm relative to the supervised approach and is bounded by the aforementioned $L_1$ distance. To further bound this term, we define $e_i$ as the probability that $\hat\pi_i$ takes a confident (i.e., without querying the expert policy) but wrong action under $d_{\pi^*}$, and introduce the following lemma:

Lemma A.1. $\lVert d_{\pi_i} - d_{\pi^*} \rVert_1 \leq 2 T e_i$.

Proof. Let $\beta_{it}$ be the probability of querying the expert policy under $d^t_{\pi^*}$, $\tilde\epsilon_{it}$ the error rate of $\hat\pi_i$ under $d^t_{\pi^*}$ when it does not query the expert, and $d$ any state distribution other than $d_{\pi^*}$. Since $\pi_i$ leaves the gold state distribution at step $t$ only when $\hat\pi_i$ takes a confident but wrong action, which happens with probability $\tilde\epsilon_{it}(1-\beta_{it})$, we can express $d^t_{\pi_i}$ as:
$$d^t_{\pi_i} = \prod_{t'=1}^{t-1}\left(1 - \tilde\epsilon_{it'}(1-\beta_{it'})\right) d^t_{\pi^*} + \left(1 - \prod_{t'=1}^{t-1}\left(1 - \tilde\epsilon_{it'}(1-\beta_{it'})\right)\right) d^t.$$
The distance between $d_{\pi_i}$ and $d_{\pi^*}$ thus becomes
$$\left\lVert d_{\pi_i} - d_{\pi^*} \right\rVert_1 \leq \frac{1}{T}\sum_{t=1}^{T} 2\left(1 - \prod_{t'=1}^{t-1}\left(1 - \tilde\epsilon_{it'}(1-\beta_{it'})\right)\right) \leq \frac{2}{T}\sum_{t=1}^{T}\sum_{t'=1}^{t-1} \tilde\epsilon_{it'}(1-\beta_{it'}) \leq 2\sum_{t=1}^{T} \tilde\epsilon_{it}(1-\beta_{it}) = 2 T e_i,$$
where the last equality uses $e_i = \frac{1}{T}\sum_{t=1}^{T} \tilde\epsilon_{it}(1-\beta_{it})$.
By applying Assumption A.1 and Lemma A.1 to Eq. (5), we derive the following inequality:
$$J(\hat\pi) \leq T\left(\epsilon_N + \gamma_N\right) + \frac{2 \ell_{max} T^2}{N}\sum_{i=1}^{N} e_i.$$
Given a large enough $N$ ($N \in \tilde O(T)$), by the no-regret assumption, we can further simplify the above to:
$$J(\hat\pi) \leq T \epsilon_N + \frac{2 \ell_{max} T^2}{N}\sum_{i=1}^{N} e_i + O(1),$$
which leads to our theorem:

Theorem A.2. For our proposed annotation-efficient imitation learning algorithm, when $N$ is $\tilde O(T)$, there exists a policy $\hat\pi \in \hat\pi_{1:N}$ such that $J(\hat\pi) \leq T \epsilon_N + \frac{2 \ell_{max} T^2}{N}\sum_{i=1}^{N} e_i + O(1)$.

Recall that MISP-L* is assumed to have perfect confidence estimation and interaction design, such that it precisely detects and corrects its intermediate mistakes during parsing. Therefore, MISP-L* presents an upper-bound performance (i.e., the tightest cost bound) of our algorithm. This can be interpreted theoretically: for MISP-L*, $e_i$ is always zero, since the system has ensured that its policy action is correct whenever it does not query the expert policy. In this case, $d_{\pi_i} = d_{\pi^*}$, so $\epsilon_N = \min_{\pi \in \Pi} \mathbb{E}_{s \sim d_{\pi^*}}[\ell(s, \pi)]$. According to Theorem A.2, MISP-L* therefore has a cost bound of:
$$J(\hat\pi) \leq T \epsilon_N + O(1).$$
Comparing this bound with the cost bound in Theorem A.1, we observe that MISP-L* shares the same cost bound as the supervised approach (up to the inequality relation and the constant). This is expected, since MISP-L* collects exactly the same training labels (from $\pi^*$) as the supervised approach.

A.5 Cost Bound of Our Proposed Algorithm in Finite Sample Case
The theorems in the previous sections hold when the algorithm observes infinite trajectories. In practice, however, our algorithm observes the training loss from only a finite set of $m$ trajectories at each iteration $i$ using $\pi_i$. In the following discussion, we therefore prove the cost bound of our proposed algorithm in the finite sample case. Let $D_i$ denote the $m$ trajectories collected in the $i$-th iteration. In every iteration, the algorithm observes the loss $\ell_i(\hat\pi_i) = \mathbb{E}_{s \sim D_i}[\ell(s, \hat\pi_i)]$. By the no-regret assumption (Assumption A.1), the average observed loss over the iterations can still be bounded by the following inequality:
$$\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim D_i}\left[\ell(s, \hat\pi_i)\right] \leq \gamma_N + \hat\epsilon_N,$$
where $\hat\epsilon_N = \min_{\pi \in \Pi} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{s \sim D_i}[\ell(s, \pi)]$ denotes the loss of the best policy on the finite samples.
Following Eq. (5), we need to switch the derivation from the expected loss of $\hat\pi_i$ over $d_{\pi_i}$ (i.e., $\mathbb{E}_{s \sim d_{\pi_i}}[\ell(s, \hat\pi_i)]$) to that over $D_i$ (i.e., $\mathbb{E}_{s \sim D_i}[\ell(s, \hat\pi_i)]$), the actual state distribution that $\hat\pi_i$ is trained on. To fill this gap, we introduce $Y_{ij}$ to denote the difference between the expected loss of $\hat\pi_i$ under $d_{\pi_i}$ and the average loss of $\hat\pi_i$ under the $j$-th trajectory sampled with $\pi_i$ at iteration $i$. The random variables $Y_{ij}$ over all $i \in \{1, 2, ..., N\}$ and $j \in \{1, 2, ..., m\}$ are zero-mean, bounded in $[-\ell_{max}, \ell_{max}]$, and form a martingale in the order $Y_{11}, Y_{12}, ..., Y_{1m}, Y_{21}, ..., Y_{Nm}$. By the Azuma-Hoeffding inequality (Azuma, 1967; Hoeffding, 1994), with probability at least $1 - \delta$,
$$\frac{1}{mN}\sum_{i=1}^{N}\sum_{j=1}^{m} Y_{ij} \leq \ell_{max} \sqrt{\frac{2 \log(1/\delta)}{mN}}.$$
Following the derivations in Eq. (5) and introducing $Y_{ij}$, with probability at least $1 - \delta$, we obtain by definition:
$$J(\hat\pi) \leq T\left(\hat\epsilon_N + \gamma_N + \ell_{max}\sqrt{\frac{2\log(1/\delta)}{mN}}\right) + \frac{2 \ell_{max} T^2}{N}\sum_{i=1}^{N} e_i.$$
Notice that we need $mN$ to be at least $O(T^2 \log(1/\delta))$, so that $T\gamma_N$ and $T\ell_{max}\sqrt{2\log(1/\delta)/(mN)}$ are negligible. This leads to the following theorem:

Theorem A.3. For our proposed annotation-efficient imitation learning algorithm, with probability at least $1 - \delta$, when $mN$ is $\tilde O(T^2 \log(1/\delta))$, there exists a policy $\hat\pi \in \hat\pi_{1:N}$ such that $J(\hat\pi) \leq T \hat\epsilon_N + \frac{2 \ell_{max} T^2}{N}\sum_{i=1}^{N} e_i + O(1)$.

The above shows that the cost of our algorithm can still be bounded in the finite sample setting. Comparing this bound with the bound under the infinite sample setting, we observe that it is still governed by $e_i$, the probability that $\hat\pi_i$ takes a confident but incorrect action under $d_{\pi^*}$.

B Implementation Details
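The Azuma-Hoeffding step can be checked numerically with fabricated zero-mean bounded differences (uniform noise here, not the actual training losses): the empirical average should exceed the deviation bound with probability at most $\delta$.

```python
import math
import random

random.seed(1)

# Sanity-check the Azuma-Hoeffding bound used above: for zero-mean
# differences bounded in [-l_max, l_max], the average of mN of them
# exceeds l_max * sqrt(2 log(1/delta) / (mN)) with probability <= delta.
l_max, m, N, delta = 1.0, 50, 40, 0.05
bound = l_max * math.sqrt(2 * math.log(1 / delta) / (m * N))

trials, violations = 2000, 0
for _ in range(trials):
    avg = sum(random.uniform(-l_max, l_max) for _ in range(m * N)) / (m * N)
    if avg > bound:
        violations += 1

# The observed violation rate should be well below delta.
assert violations / trials <= delta
```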

B.1 Interactive Semantic Parsing Framework
Our system assumes an interactive semantic parsing framework to collect user feedback. In experiments, this is implemented by adapting MISP (Yao et al., 2019b), an open-sourced framework that has demonstrated a strong ability to improve test-time parsing accuracy.6 In this framework, an agent comprises three components: a world model, which wraps the base semantic parser and a feedback incorporation module that interprets user feedback and updates the semantic parse; an error detector, which decides when to request user intervention; and an actuator, which delivers the agent's request by asking a natural language question that users without domain expertise can understand.
We follow MISP's instantiation for text-to-SQL tasks to adopt a probability-based uncertainty estimator as the error detector, which triggers user interactions when the probability of the current decision is lower than a threshold. 7 The actuator is instantiated by a grammar-based natural language generator. We use the latest version of MISP that allows multi-choice interactions to improve the system efficiency, i.e., when the parser's current decision is validated as wrong, the system presents multiple alternative options for user selection. An additional "None of the above options" option is included in case all top options from the system are wrong. Figure 1 shows an example of the user interaction. From there, the system can derive a correct decision to address its uncertainty (e.g., taking "Player" as a WHERE column).
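The probability-based error detector and multi-choice actuator described above might be sketched as follows; this is a simplified stand-in with hypothetical interfaces, not MISP's actual API, and the column names are illustrative:

```python
# Simplified sketch of MISP-style error detection with multi-choice
# interaction (hypothetical interfaces, not the framework's actual API).
NONE_OF_THE_ABOVE = "None of the above options"

def maybe_interact(decision, alternatives, threshold=0.95):
    """Trigger a user interaction when the parser's decision probability
    falls below the confidence threshold; otherwise keep the decision."""
    action, prob = decision
    if prob >= threshold:
        return action, False  # confident: no interaction needed
    # Uncertain: present the top alternatives plus an escape option.
    options = [a for a, _ in alternatives] + [NONE_OF_THE_ABOVE]
    return options, True

# Toy usage: the parser is unsure which column the WHERE clause uses.
options, asked = maybe_interact(("School/Club Team", 0.41),
                                [("Player", 0.38), ("Position", 0.21)])
```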
User Simulator. Our experiments train each system with simulated user feedback. To this end, we build a user simulator similar to the one used by Yao et al. (2019b), which has access to the ground-truth SQL queries. It gives yes/no answers or selects a choice by directly comparing the sampled policy action with the true one in the gold query.
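The comparison logic of such a simulator could look like the following minimal sketch (hypothetical interface; the actual simulator follows Yao et al., 2019b):

```python
# Minimal sketch of a user simulator with access to gold decisions
# (hypothetical interface, not the authors' implementation).
NONE_OF_THE_ABOVE = "None of the above options"

def simulate_feedback(policy_action, gold_action, options=None):
    """Answer a yes/no validation question, or pick from a multi-choice
    list, by comparing against the gold action."""
    if options is None:
        return policy_action == gold_action  # yes/no validation
    return gold_action if gold_action in options else NONE_OF_THE_ABOVE

# Toy usage on illustrative column names.
assert simulate_feedback("Player", "Player") is True
assert simulate_feedback("Position", "Player",
                         ["Position", "No. of seasons"]) == NONE_OF_THE_ABOVE
```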

B.2 EditSQL Experiment Details
In the data preprocessing step, EditSQL (Zhang et al., 2019) transforms each gold SQL query into a sequence of tokens, where the FROM clause is removed and each column Col is prepended with its paired table name, i.e., Tab.Col. However, we observe that this transformation is sometimes not reversible. For example, consider the question "what are the first name and last name of all candidates?" and its gold SQL query: "SELECT T2.first name , T2.last name FROM candidates AS T1 JOIN people AS T2 ON T1.candidate id = T2.person id". EditSQL transforms this query into: "select people.first name , people.last name".
The transformed sequence accidentally drops the information about the table candidates in the original SQL query, leading to a semantic meaning inconsistent with the question. When such erroneous sequences are used as gold targets in model training, we cannot simulate consistent user feedback; e.g., when the user is asked whether her query is relevant to the table candidates, the simulated user cannot give an affirmative answer based on the transformed sequence. To avoid inconsistent user feedback, we remove from the training data the question-SQL pairs whose transformed sequence is inconsistent with the original gold SQL query. This reduces the size of the training set from 8,421 to 7,377. The validation set is kept untouched for fair evaluation.
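The consistency check motivating this cleaning step could be approximated by comparing the tables referenced in the gold SQL against those surviving in the transformed "table.column" tokens; the heuristic below is a hypothetical illustration (with underscores in identifiers for simple tokenization), not EditSQL's actual preprocessing code:

```python
import re

# Flag a transformed sequence as lossy when a table referenced in the
# gold SQL no longer appears among the transformed "table.column" tokens.
# (Hypothetical heuristic for illustration only.)
def tables_in_gold(sql):
    """Tables bound via 'FROM/JOIN <table> AS <alias>'."""
    return {t.lower() for t in re.findall(r"(?:FROM|JOIN)\s+(\w+)\s+AS", sql, re.I)}

def tables_in_transformed(seq):
    return {tok.split(".")[0].lower() for tok in seq.split() if "." in tok}

def is_consistent(gold_sql, transformed):
    return tables_in_gold(gold_sql) <= tables_in_transformed(transformed)

gold = ("SELECT T2.first_name , T2.last_name FROM candidates AS T1 "
        "JOIN people AS T2 ON T1.candidate_id = T2.person_id")
# The table 'candidates' is lost in the transformed sequence, so this
# pair would be removed from training.
assert not is_consistent(gold, "select people.first_name , people.last_name")
```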
The implementation of interactive semantic parsing for EditSQL is the same as in Section B.1, except that, to cope with the complicated structure of Spider SQL queries, for columns in WHERE, GROUP BY, ORDER BY, and HAVING clauses, we additionally provide an option for the user to remove the clause, e.g., removing a WHERE clause by picking the "The system does not need to consider any conditions." option. The confidence threshold µ is set to 0.995, as we observe that EditSQL tends to be overconfident.

C Additional Experimental Results

C.1 MISP-L Results on WikiSQL

Figure 4 shows MISP-L's performance on the WikiSQL validation set. We also show in Figure 5 the average number of annotations (i.e., user interactions) per question during the iterative training. Overall, as the base parser is further trained, the system tends to request fewer user interactions. In most cases throughout the training, the system requests no more than one user interaction per question, demonstrating the annotation efficiency of our algorithm.

C.2 SQLova Results in Theoretical Analysis
As we proved in Section 5, the performance gap between our proposed algorithm and the supervised approach is mainly decided by $\frac{1}{N}\sum_{i=1}^{N} e_i$, the average probability that $\hat\pi_i$ makes a confident but wrong decision under $d_{\pi^*}$ (i.e., given a gold partial parse) over the $N$ training iterations. More specifically, from our proof of Lemma A.1, $e_i$ can be expressed as:
$$e_i = \frac{1}{T}\sum_{t=1}^{T} \tilde\epsilon_{it}\left(1 - \beta_{it}\right),$$
where $\tilde\epsilon_{it}$ denotes policy $\hat\pi_i$'s conditional error rate under $d^t_{\pi^*}$ when it does not query the expert (i.e., when it is confident about its own action) at step $t$, and $1 - \beta_{it}$ denotes the probability that $\hat\pi_i$ does not query the expert under $d^t_{\pi^*}$. The product $\tilde\epsilon_{it}(1 - \beta_{it})$ thus represents the joint probability that $\hat\pi_i$ takes a confident but wrong action under $d^t_{\pi^*}$ at step $t$. To connect our theoretical analysis with the experiments, we track the values of the following three variables during training: (1) $\tilde\epsilon_i = \frac{1}{T}\sum_{t=1}^{T} \tilde\epsilon_{it}$, the average value of $\tilde\epsilon_{it}$ over the $T$ time steps. A smaller $\tilde\epsilon_i$ implies a lower conditional error rate, and thus a smaller $e_i$ and a smaller performance gap. (2) $\bar\beta_i = \frac{1}{T}\sum_{t=1}^{T} \beta_{it}$, the average value of $\beta_{it}$ over the $T$ time steps. A smaller $\bar\beta_i$ (i.e., a larger $1 - \bar\beta_i$) means a higher probability that $\hat\pi_i$ acts without querying the expert (i.e., is more confident). This could lead to a larger $e_i$ and thus a larger performance gap.
(3) $e_i$ as defined above. A smaller $e_i$ indicates a smaller performance gap between our algorithm and the supervised approach.
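These three tracked quantities can be computed directly from per-step statistics, as the following sketch shows (the per-step numbers are fabricated for illustration):

```python
# Computing the three tracked quantities from per-step statistics
# (fabricated numbers; the real values come from training logs).
T = 3
eps_tilde = [0.10, 0.05, 0.20]  # conditional error rate when confident, per step
beta      = [0.60, 0.80, 0.50]  # probability of querying the expert, per step

eps_i  = sum(eps_tilde) / T                                  # average conditional error rate
beta_i = sum(beta) / T                                       # average query probability
e_i    = sum(e * (1 - b) for e, b in zip(eps_tilde, beta)) / T

# e_i is the average joint probability of a confident-but-wrong action,
# so it can never exceed the unconditional average error rate.
assert 0 <= e_i <= eps_i
```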
We plot the results in Figure 6. For all initialization settings, we observe that the base parser tends to take more confident actions under a gold partial parse (i.e., $\bar\beta_i$ decreases) as it is trained for more iterations. Meanwhile, the error rate of its confident actions under a gold partial parse is also reduced (i.e., $\tilde\epsilon_i$ decreases). Combining the two factors, $e_i$ keeps decreasing, implying that the more iterations the parser is trained for, the tighter its cost bound and the better its performance.
Finally, we notice that differently initialized parsers can end up with different performance. This is reasonable, since a better-initialized parser presumably has a better overall error rate. It is also consistent with our observations in the main experimental results (Section 6), where we experiment with three initialization settings, using 10%, 5%, and 1% of the training data, respectively.