Data-Anonymous Encoding for Text-to-SQL Generation

In text-to-SQL generation, the input utterance usually contains many tokens that are related to column names or cells in the table, called table-related tokens. These table-related tokens are troublesome for the downstream neural semantic parser because they bring complex semantics and hinder sharing across training examples. However, existing approaches either do not handle these tokens before the semantic parser or simply use deterministic approaches based on string match or word embedding similarity. In this work, we propose a more efficient approach to handle table-related tokens before the semantic parser. First, we formulate it as a sequential tagging problem and propose a two-stage anonymization model to learn the semantic relationship between tables and input utterances. Then, we leverage the implicit supervision from SQL queries by policy gradient to guide the training. Experiments demonstrate that our approach consistently improves the performance of different neural semantic parsers and significantly outperforms deterministic approaches.


Introduction
Interacting with relational databases or tables through natural language is an important and challenging problem (Androutsopoulos et al., 1995; Li and Jagadish, 2014; Pasupat and Liang, 2015; Zhong et al., 2017; Yu et al., 2018c). To tackle this problem, the neural semantic parser has been widely studied (Dong and Lapata, 2016; Jia and Liang, 2016; Herzig and Berant, 2017; Dong and Lapata, 2018). It automatically maps the input natural language into a logical form (e.g., a SQL query) following the typical encoder-decoder structure (Hochreiter and Schmidhuber, 1997; Bahdanau et al., 2015; Vinyals et al., 2015).
* This work was done when the author was visiting Microsoft Research Asia.
Semantic parsing usually tackles two kinds of problems (Goldman et al., 2018; Herzig and Berant, 2018), i.e., the lexical problem and the structural problem. In text-to-SQL generation, the lexical problem refers to mapping tokens in input utterances to constants in SQL queries, e.g., the column names or cells in SQL queries. The structural problem refers to mapping intentions conveyed by input utterances to operators in SQL queries, e.g., the aggregators or the existence of a WHERE clause in SQL queries. Intuitively, the lexical problem can be formulated as a sequential tagging problem, called anonymization, where each token in the input utterance is tagged as being related to a column name, a cell or nothing. For ease of reference, we call the tokens in input utterances related to column names or cells table-related tokens, and the tagged utterance the anonymous utterance.
If the lexical problem can be reduced before the neural semantic parsing, the training difficulties will be greatly alleviated. The reason is two-fold. First, by anonymizing table-related tokens in the input utterance before the neural semantic parser, we can conceal the complex semantics of table-related tokens. Second, different input utterances, which have different table-related tokens but the same structure, can be reduced to the same anonymous utterance before they are fed into the parser. This results in sharing across training data and thus alleviates training difficulties (Goldman et al., 2018; Herzig and Berant, 2018). For example, in Figure 1a and 1b, the input utterances (denoted as $x$) seemingly ask different questions about country and college team, but they can be reduced to a similar anonymous utterance (denoted as $\tilde{x}$) by replacing the token related to the column name with c and the token related to the cell with v.
However, reducing the lexical problem before the neural semantic parser is far from being well-studied in text-to-SQL generation, although it has been demonstrated to be helpful on other tasks (Goldman et al., 2018; Herzig and Berant, 2018). First, most approaches do not reduce the lexical problem at all (Zhong et al., 2017; Xu et al., 2017; Dong and Lapata, 2018; Shi et al., 2018; Wang et al., 2018a; Hwang et al., 2019). Without conducting anonymization before the neural semantic parser, these approaches cannot receive the aforementioned benefit, although they may have modules to predict the presence of column names and cells inside the parser. Second, a few studies have tried anonymization by deterministic approaches, which compare the similarity between the input utterance and the table based on string match or word embedding (Yu et al., 2018a). These approaches cannot fully understand the semantic relationship between input utterances and tables, and they ignore the relationship with components of SQL queries. For example, in Figure 1c, although the token 'player' is similar to the column name 'Player', it should not be tagged as a table-related token, because the token 'player' is just a meaningless demonstrative word here and the column 'Player' is not related to the SQL query.
To this end, we propose to use a learning-based approach to reduce the lexical problem, i.e., to anonymize the table-related tokens. First, we propose a two-stage anonymization model to learn the semantic relationship between input utterances and tables. Then, considering that there is no labeled data for the anonymization, we propose to extract the set of column names and cells appearing in the SQL query and use this set as the supervision. Another benefit of leveraging such implicit supervision is that we can make the anonymization model consider the relationship with the components of the SQL query when it learns the semantic relationship, and thus avoid the disadvantage of deterministic approaches. Furthermore, to bridge the gap that our model is a sequential tagging model while the supervision extracted from the SQL query is an unordered set, we leverage the policy gradient (Williams, 1992) to train the model. Moreover, we train the anonymization model and the neural semantic parser as a whole by a variational inference-like framework with the anonymous utterance as the hidden variable.
Experimental results demonstrate that our approach consistently improves the performance of different neural semantic parsers and significantly outperforms typical deterministic approaches on the anonymization problem.

Related Work
Semantic Parsing. Semantic parsing aims to map natural languages into executable programs. In the area of semantic parsing, the programs could be in various types, e.g., λ-calculus (Zettlemoyer and Collins, 2005), Python (Oda et al., 2015), SQL (Zhong et al., 2017), etc; the source of the knowledge can also be different, e.g., the knowledge base, the table, the image (Suhr et al., 2017), etc; and the supervision can take different forms as well, e.g., question-denotation pairs (Pasupat and Liang, 2015), question-program pairs, etc. In this work, we focus on text-to-SQL generation with the table as the source of the knowledge and with question-SQL query pairs as supervision.
Reducing Lexical Problem. On tasks other than text-to-SQL generation, some studies have tried to reduce the lexical problem by anonymizing tokens that are related to the program constants. Goldman et al. (2018) lift tokens in utterances to an abstract form by referring to fixed mappings on a visual reasoning task. Herzig and Berant (2018) employ a rule-based method to transform content words in utterances to an abstract representation. Although these studies inspire us to consider reducing the lexical problem in text-to-SQL generation, the rules they employed cannot be directly applied to text-to-SQL generation.
In text-to-SQL generation, a few studies have tried to anonymize table-related tokens, although they do not aim at reducing the lexical problem. To better understand rare entities, Yu et al. (2018a) used string match to recognize column names. To learn a domain-transferable semantic parser, AnnotatedSeq2Seq detected column names by measuring the closeness of edit distance and word embedding. As discussed in Section 1, these deterministic approaches cannot fully understand the semantic relationship between input utterances and tables, and they ignore the relationship with components of SQL queries.
Entity Linking. Our approach can be regarded as an implementation of entity linking on a concrete task, because general entity linking approaches (Shen et al., 2015; Kolitsas et al., 2018) fail to handle particular challenges in our scenario. On the one hand, many cases cannot be handled by deterministic entity linking methods, which rely only on measuring similarity; on the other hand, there is no labeled data for learning-based entity linking methods.

Problem Formulation
Denote the input utterance as $x = x_1 \ldots x_{|x|}$ and the corresponding SQL query as $y = y_1 \ldots y_{|y|}$. We formulate the lexical problem as a sequential tagging problem. Formally, for each token $x_t$ in the input utterance, we give it a tag $\tilde{x}_t \in \{\mathrm{COL}_1, \ldots, \mathrm{COL}_K, \mathrm{CELL}, \mathrm{UNK}\}$, where $\mathrm{COL}_k$ represents that $x_t$ is related to the k-th column name in the table, CELL represents that $x_t$ is related to a cell in the table and UNK represents that $x_t$ is not related to the table. Here $K$ is the number of column names in the table. Note that the indexes of column names cannot be ignored in the algorithm, because they are needed to leverage the implicit supervision from SQL queries (see Section 4.2), although they are omitted in the examples in Figure 1 and Table 4 for ease of reading. For ease of reference, we call this sequential tagging problem anonymization and the tagged sequence $\tilde{x} = \tilde{x}_1 \ldots \tilde{x}_{|x|}$ the anonymous utterance.
The typical neural semantic parser aims to estimate $p(y|x)$. In our work, we decompose $p(y|x)$ into two processes (as shown in Figure 2), i.e.,
$$p(y|x) = \sum_{\tilde{x}} p(y|x, \tilde{x})\, p(\tilde{x}|x). \tag{1}$$
Specifically, one process is to learn the anonymous utterance $\tilde{x}$ given the input utterance $x$ for the purpose of reducing the lexical problem, i.e.,
$$p(\tilde{x}|x), \tag{2}$$
while the other process is to learn the neural semantic parser with the anonymous utterance $\tilde{x}$ as the additional input, i.e.,
$$p(y|x, \tilde{x}). \tag{3}$$
In the following, we will discuss how to learn the anonymous utterance (Eqn (2)) and how to learn the neural semantic parser with the anonymous utterance as additional input (Eqn (1)) in detail.
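As a minimal illustration of the tagging formulation, the sketch below builds the tag set for a table with K columns and renders an anonymous utterance by replacing column-related tokens with c and cell-related tokens with v, in the spirit of the simplified examples of Figure 1. The function names, the example question and its tags are hypothetical and are not taken from the experiments.

```python
from typing import List


def build_tag_set(num_columns: int) -> List[str]:
    """Return the tag set {COL_1, ..., COL_K, CELL, UNK} for a table with K columns."""
    return [f"COL_{k}" for k in range(1, num_columns + 1)] + ["CELL", "UNK"]


def render_anonymous(tokens: List[str], tags: List[str]) -> List[str]:
    """Replace column-related tokens with 'c' and cell-related tokens with 'v'.

    Column indexes are dropped here only for display; a model would keep them.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    rendered = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("COL_"):
            rendered.append("c")
        elif tag == "CELL":
            rendered.append("v")
        else:
            rendered.append(token)  # UNK tokens are kept unchanged
    return rendered


# Hypothetical example: a 3-column table and a hand-tagged question.
tokens = ["which", "country", "has", "rank", "5", "?"]
tags = ["UNK", "COL_2", "UNK", "COL_1", "CELL", "UNK"]
tag_set = build_tag_set(3)
anonymous = render_anonymous(tokens, tags)
```

Keeping the column index in the tag (COL_1 versus COL_2) is what later allows the predicted tags to be compared against the columns of the SQL query.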

Anonymization Model
The anonymization is a structural problem by nature. First, we need to differentiate whether the token is related to column names, related to cells, or not related to the table. Second, we need to further recognize the concrete column/cell that the token is related to. Therefore, we design a model that uses two stages, i.e., the channel selection stage and the column/cell binding stage, to tackle these two subproblems respectively. The probability distributions produced by these two stages together determine the result,
$$p(\tilde{x}_t|x) = \sum_{a_t} p_{\mathrm{channel}}(a_t|x)\, p_{\mathrm{binding}}(\tilde{x}_t|a_t, x),$$
where $a_t \in \{\mathrm{COL}, \mathrm{CELL}, \mathrm{UNK}\}$ is the selected channel indicating that the token is related to column names, cells or nothing respectively, $p_{\mathrm{channel}}(\cdot)$ is the probability produced by the channel selection stage, and $p_{\mathrm{binding}}(\cdot)$ is the probability produced by the column/cell binding stage. For ease of reference, we call the two-stage model the anonymization model, and illustrate it in Figure 3. In the following, we introduce details of its different components.
Figure 3: Illustration of the anonymization model when the table content is not available. We take the t-th token in the input utterance as the example.
Input Encoder. We leverage BiLSTM or BERT (Devlin et al., 2018) to encode both the input utterance and the table. Due to privacy concerns, the table content is not allowed to be used on most text-to-SQL tasks (Zhong et al., 2017; Yu et al., 2018c). Therefore, we use the concatenation of the embedding of the tokens in the input utterance, the embedding of the separator and the embedding of the column names in the table as the input for BiLSTM or BERT. We denote the output of the input encoder as $\{Q_1, \ldots, Q_T, E_{[SEP]}, C_1, \ldots, C_K\}$, where $Q_t$, $t \in \{1, \ldots, T\}$, is the encoding of the t-th token in the input utterance, $E_{[SEP]}$ is the encoding of the separator, and $C_k$, $k \in \{1, \ldots, K\}$, is the encoding of the k-th column name in the table. Note that we run a BiLSTM or BERT between column names instead of over each column name. The reason is that Yu et al. (2018a) have shown that although the order of column names does not matter, running a BiLSTM or BERT between column names can capture relationships between them, which benefits both the accuracy and the training time.
Channel Selection. We implement a linear gate with encodings of the input utterance and aggregated encodings of column names as input:
$$p_{\mathrm{channel}}(a_t|x) = \mathrm{softmax}\left(W_{\mathrm{gate}}\,[Q_t; C^Q_t]\right),$$
where $W_{\mathrm{gate}}$ stands for learnable parameters, $C^Q_t$ is the aggregated encoding of column names, and $[Q_t; C^Q_t]$ is the concatenation of $Q_t$ and $C^Q_t$. Specifically, to obtain the most relevant column names for the t-th token in the input utterance, we leverage the attention mechanism (Bahdanau et al., 2015) over column names $\{C_1, \ldots, C_K\}$ to compute the aggregated encoding of the column names $C^Q_t$, i.e.,
$$u^t_k = v_1^\top \tanh(W_1 Q_t + W_2 C_k), \quad \alpha^t = \mathrm{softmax}(u^t), \quad C^Q_t = \sum_{k=1}^{K} \alpha^t_k C_k.$$
Here, $v_1$, $W_1$ and $W_2$ are learnable parameters.
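Column/Cell Binding. The probability distribution generated by this stage, i.e., $p_{\mathrm{binding}}(\cdot)$, can be categorized into three types, i.e., $p^{\mathrm{COL}}_{\mathrm{binding}}(\cdot)$, $p^{\mathrm{CELL}}_{\mathrm{binding}}(\cdot)$ and $p^{\mathrm{UNK}}_{\mathrm{binding}}(\cdot)$, corresponding to the three channels in the first stage respectively. Obviously, $p^{\mathrm{UNK}}_{\mathrm{binding}}(\tilde{x}_t|a_t, x) = 1$. For column names, we produce a probability distribution over the $K$ columns in the table through measuring the relevance between the t-th token and the k-th column name:
$$p^{\mathrm{COL}}_{\mathrm{binding}}(\tilde{x}_t = \mathrm{COL}_k \mid a_t = \mathrm{COL}, x) = \mathrm{softmax}_k\left(v_2^\top \tanh(W_3 Q_t + W_4 C_k)\right),$$
where $v_2$, $W_3$ and $W_4$ are learnable parameters.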
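The two stages can be put together in a small numerical sketch. It assumes additive (Bahdanau-style) scores of the form v^T tanh(W q + W' c) for both the attention and the column binding, and a softmax over a linear layer for the gate; these forms, the helper names and the toy parameter values are illustrative assumptions rather than the exact configuration used in the experiments. Plain Python lists stand in for encoder outputs.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def additive_scores(q: Vector, cols: List[Vector], v: Vector,
                    Wq: Matrix, Wc: Matrix) -> Vector:
    """Score every column k as v^T tanh(Wq q + Wc C_k)."""
    wq = matvec(Wq, q)
    scores = []
    for c in cols:
        hidden = [math.tanh(a + b) for a, b in zip(wq, matvec(Wc, c))]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    return scores


def tag_distribution(q: Vector, cols: List[Vector], params: Dict[str, list]) -> Dict[str, float]:
    """Distribution over {COL_1..COL_K, CELL, UNK} for one token encoding q."""
    # Attention over column encodings gives the aggregated encoding C^Q_t.
    alpha = softmax(additive_scores(q, cols, params["v1"], params["W1"], params["W2"]))
    c_q = [sum(a * c[i] for a, c in zip(alpha, cols)) for i in range(len(cols[0]))]
    # Channel selection on the concatenation [Q_t; C^Q_t]; rows ordered COL, CELL, UNK.
    p_col, p_cell, p_unk = softmax(matvec(params["W_gate"], q + c_q))
    # Column binding: which of the K columns the token refers to.
    bind = softmax(additive_scores(q, cols, params["v2"], params["W3"], params["W4"]))
    dist = {f"COL_{k + 1}": p_col * b for k, b in enumerate(bind)}
    dist["CELL"] = p_cell  # binding probability is 1 for CELL
    dist["UNK"] = p_unk    # binding probability is 1 for UNK
    return dist


# Toy parameters (hypothetical values, 2-dimensional encodings, 3 columns).
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {
    "v1": [1.0, -1.0], "W1": [[0.5, 0.0], [0.0, 0.5]], "W2": identity,
    "v2": [0.5, 0.5], "W3": identity, "W4": identity,
    "W_gate": [[0.1] * 4, [0.0] * 4, [-0.1] * 4],
}
q = [0.2, -0.4]
cols = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
dist = tag_distribution(q, cols, params)
```

Because the binding probabilities for CELL and UNK are fixed to 1 and the column binding sums to 1 over the K columns, the resulting dictionary is a proper distribution over all K + 2 tags.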
For cells, considering that table content is not available due to privacy concerns, we simply predict a substring from the input utterance. Specifically, we follow the longest match principle to merge consecutive tokens in the input utterance, which are labeled as CELL by the channel selection stage, into one cell. Therefore, the probability generated by the column/cell binding stage is simply set to 1, i.e., $p^{\mathrm{CELL}}_{\mathrm{binding}}(\tilde{x}_t|a_t, x) = 1$.
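The longest match principle for cells can be sketched as a single pass that groups each maximal run of consecutive CELL-tagged tokens into one cell string. The function name and the example question are hypothetical.

```python
from typing import List


def merge_cells(tokens: List[str], channels: List[str]) -> List[str]:
    """Merge each maximal run of consecutive CELL-tagged tokens into one cell."""
    cells: List[str] = []
    current: List[str] = []
    for token, channel in zip(tokens, channels):
        if channel == "CELL":
            current.append(token)
        elif current:
            # A non-CELL token closes the run that was being collected.
            cells.append(" ".join(current))
            current = []
    if current:  # a run may end at the last token
        cells.append(" ".join(current))
    return cells


tokens = ["who", "plays", "for", "new", "york", "as", "point", "guard"]
channels = ["UNK", "UNK", "UNK", "CELL", "CELL", "UNK", "CELL", "CELL"]
cells = merge_cells(tokens, channels)
```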

Implicit Supervision from SQL Queries
To train the anonymization model, the ground truth for the anonymous utterance is indispensable. However, there is no such labeled data, and manually labeling the whole training data for each text-to-SQL task is unrealistic, especially when the amount of training data is tremendous.
To tackle this problem, we propose to extract the set of column names and cells appearing in the SQL query and use this set as the supervision to guide the training of the anonymization model. Another benefit of this approach is that we can make our model consider the relationship with components of the SQL query, and thus avoid the shortcoming of deterministic approaches. Concretely, for each SQL query $y$, we denote the set of column names and cells appearing in it as $S_{\mathrm{SQL}}$, which consists of three parts, i.e., $S_{\mathrm{SQL}} = S^{\mathrm{sel}}_{\mathrm{col}} \cup S^{\mathrm{other}}_{\mathrm{col}} \cup S_{\mathrm{cell}}$:
1. $S^{\mathrm{sel}}_{\mathrm{col}}$ is the set of column names appearing in the SELECT clause;
2. $S^{\mathrm{other}}_{\mathrm{col}}$ is the set of column names appearing in other clauses;
3. $S_{\mathrm{cell}}$ is the set of cells appearing in the whole SQL query.
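The three sets above can be extracted mechanically once the SQL query is available in a structured form. The sketch below assumes a simplified query representation, a list of SELECT columns plus a list of (column, operator, value) conditions; this representation, the `SqlSupervision` type and the example query are assumptions made for illustration only.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class SqlSupervision(NamedTuple):
    sel_col: FrozenSet[str]    # column names in the SELECT clause
    other_col: FrozenSet[str]  # column names in the other clauses
    cell: FrozenSet[str]       # cells appearing anywhere in the query


def extract_supervision(select_columns: List[str],
                        conditions: List[Tuple[str, str, str]]) -> SqlSupervision:
    """Collect the unordered sets used as implicit supervision."""
    sel_col = frozenset(select_columns)
    other_col = frozenset(column for column, _, _ in conditions)
    cell = frozenset(value for _, _, value in conditions)
    return SqlSupervision(sel_col, other_col, cell)


# Hypothetical query:
# SELECT College WHERE Position = 'guard' AND Country = 'united states'
supervision = extract_supervision(
    ["College"],
    [("Position", "=", "guard"), ("Country", "=", "united states")],
)
```

Using sets rather than lists reflects that the supervision carries no information about where in the utterance each column or cell is mentioned.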

Policy Gradient for Sequential Tagging
Maximizing Expected Reward. Ideally, if we had the ground truth, we could train the anonymization model by minimizing the gap (e.g., KL divergence) between the predicted probabilities and the ground truth. Unfortunately, this is infeasible because the implicit supervision from SQL queries (i.e., $S_{\mathrm{SQL}}$) takes the form of an unordered set while the anonymization model faces a sequential tagging task. To address this problem, we propose to encourage the set of column names and cells appearing in the predicted anonymous utterance, denoted as $S_{\mathrm{pred}}$, to be similar to that extracted from the SQL query, i.e., $S_{\mathrm{SQL}}$. To this end, we define the reward of the predicted anonymous utterance $r(\tilde{x})$ as the similarity between $S_{\mathrm{pred}}$ and $S_{\mathrm{SQL}}$, and then train the anonymization model by maximizing the expected reward,
$$J_{\mathrm{ER}}(\theta) = \sum_{(x, y) \in D} \mathbb{E}_{\tilde{x} \sim p_\theta(\cdot|x)}\left[r(\tilde{x})\right], \tag{8}$$
where $D$ represents the set of training pairs. However, directly computing the expected reward requires traversing all possible anonymous utterances $\tilde{x}$, which is unrealistic. Therefore, we leverage a Monte Carlo estimate as the approximation of the expected reward. Concretely, we sample $N$ anonymous utterances following the probability distribution generated by the anonymization model, and then average the reward of the samples to estimate the expected reward,
$$J_{\mathrm{ER}}(\theta) \approx \sum_{(x, y) \in D} \frac{1}{N} \sum_{j=1}^{N} r(\tilde{x}_j),$$
where $\tilde{x}_j$ denotes the j-th sample and $\tilde{x}_j \sim p_\theta(\cdot|x)$.
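The Monte Carlo estimate for a single training pair can be sketched as follows. The sketch assumes that tags are sampled independently per token from the per-token distributions of the anonymization model, and it takes the reward as an arbitrary callable; the toy distributions and the toy reward are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence


def sample_tags(token_dists: Sequence[Dict[str, float]], rng: random.Random) -> List[str]:
    """Sample one tag per token from its categorical distribution."""
    sampled = []
    for dist in token_dists:
        tags = list(dist.keys())
        weights = list(dist.values())
        sampled.append(rng.choices(tags, weights=weights, k=1)[0])
    return sampled


def estimate_expected_reward(token_dists: Sequence[Dict[str, float]],
                             reward_fn: Callable[[List[str]], float],
                             num_samples: int,
                             rng: random.Random) -> float:
    """Average the reward of N sampled anonymous utterances."""
    total = 0.0
    for _ in range(num_samples):
        total += reward_fn(sample_tags(token_dists, rng))
    return total / num_samples


def toy_reward(tags: List[str]) -> float:
    """Hypothetical reward: +1 if the first column is tagged somewhere, else -1."""
    return 1.0 if "COL_1" in tags else -1.0


rng = random.Random(0)
token_dists = [{"COL_1": 0.5, "UNK": 0.5}, {"UNK": 1.0}]
estimate = estimate_expected_reward(token_dists, toy_reward, num_samples=200, rng=rng)
certain = estimate_expected_reward([{"COL_1": 1.0}], toy_reward, num_samples=5, rng=rng)
```

For the toy distributions the exact expectation is 0.5 * 1 + 0.5 * (-1) = 0, so the estimate fluctuates around zero and tightens as the number of samples grows.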
To maximize the above objective function by gradient-based optimization, we employ the REINFORCE (Williams, 1992) method, i.e.,
$$\nabla_\theta J_{\mathrm{ER}}(\theta) \approx \sum_{(x, y) \in D} \frac{1}{N} \sum_{j=1}^{N} r(\tilde{x}_j)\, \nabla_\theta \log p_\theta(\tilde{x}_j|x).$$
Measurement of Reward. For ease of reference, we decompose $S_{\mathrm{pred}}$ into two parts, i.e., $S_{\mathrm{pred}} = S^{\mathrm{pred}}_{\mathrm{col}} \cup S^{\mathrm{pred}}_{\mathrm{cell}}$, where $S^{\mathrm{pred}}_{\mathrm{col}}$ and $S^{\mathrm{pred}}_{\mathrm{cell}}$ represent the set of column names and the set of cells appearing in the predicted anonymous utterance respectively. The measurement of the reward $r(\tilde{x})$ is designed based on the similarity between $S_{\mathrm{pred}}$ and $S_{\mathrm{SQL}}$ by referring to the following principles:
1. The predicted anonymous utterance should contain the column names in the SELECT clause of the SQL query, i.e., $S^{\mathrm{sel}}_{\mathrm{col}} \subseteq S^{\mathrm{pred}}_{\mathrm{col}}$. The anonymization model will be punished when it misses the column names in $S^{\mathrm{sel}}_{\mathrm{col}}$. This is motivated by our preliminary analysis that almost every column name in the SELECT clause has a corresponding token in the input utterance.
2. The column names in the predicted anonymous utterance should be a subset of the column names appearing in the SQL query, i.e., $S^{\mathrm{pred}}_{\mathrm{col}} \subseteq S^{\mathrm{sel}}_{\mathrm{col}} \cup S^{\mathrm{other}}_{\mathrm{col}}$. Unlike those in $S^{\mathrm{sel}}_{\mathrm{col}}$, the column names appearing in other clauses of the SQL query may not have corresponding tokens in the input utterance. For example, in Figure 1a, the column name 'Position' does not have corresponding tokens in the input utterance. Therefore, it is unreasonable to strictly force column names in the predicted anonymous utterance to be the same as those appearing in the SQL query. Instead, it is better to punish the anonymization model when it predicts column names outside the set of column names appearing in the SQL query.
3. The cells in the predicted anonymous utterance should be the same as those appearing in the SQL query, i.e., $S^{\mathrm{pred}}_{\mathrm{cell}} = S_{\mathrm{cell}}$. If there is any missing or extra cell in $S^{\mathrm{pred}}_{\mathrm{cell}}$, the anonymization model will be punished.
According to the above principles, we design the reward $r(\tilde{x})$ as follows:
1. if $S^{\mathrm{sel}}_{\mathrm{col}} \subseteq S^{\mathrm{pred}}_{\mathrm{col}}$, $S^{\mathrm{pred}}_{\mathrm{cell}} = S_{\mathrm{cell}}$ and $S^{\mathrm{pred}}_{\mathrm{col}} \subseteq S^{\mathrm{sel}}_{\mathrm{col}} \cup S^{\mathrm{other}}_{\mathrm{col}}$ are all true, $r(\tilde{x}) = 1$;
2. if any one of $S^{\mathrm{sel}}_{\mathrm{col}} \nsubseteq S^{\mathrm{pred}}_{\mathrm{col}}$, $S^{\mathrm{pred}}_{\mathrm{cell}} \neq S_{\mathrm{cell}}$ and $S^{\mathrm{pred}}_{\mathrm{col}} \nsubseteq S^{\mathrm{sel}}_{\mathrm{col}} \cup S^{\mathrm{other}}_{\mathrm{col}}$ is true, $r(\tilde{x}) = -1$.
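The reward rules, and the way a sampled reward scales a REINFORCE-style gradient, can be sketched as below. The gradient function assumes that each token's tag distribution is a softmax over per-token logits, in which case the gradient of the log-probability of the sampled tag with respect to the logits is the one-hot vector minus the probabilities; this parameterization and all names are illustrative assumptions.

```python
import math
from typing import List, Set


def reward(pred_col: Set[str], pred_cell: Set[str],
           sel_col: Set[str], other_col: Set[str], cell: Set[str]) -> int:
    """Return 1 if all three principles hold, otherwise -1."""
    covers_select = sel_col <= pred_col                 # principle 1
    no_extra_columns = pred_col <= (sel_col | other_col)  # principle 2
    cells_match = pred_cell == cell                     # principle 3
    return 1 if (covers_select and no_extra_columns and cells_match) else -1


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_logit_gradient(logits: List[List[float]],
                             sampled: List[int], r: float) -> List[List[float]]:
    """Gradient of r * log p(sampled tags) with respect to each token's logits."""
    grads = []
    for token_logits, index in zip(logits, sampled):
        probs = softmax(token_logits)
        grads.append([r * ((1.0 if i == index else 0.0) - p)
                      for i, p in enumerate(probs)])
    return grads


# Hypothetical supervision and two predictions.
sel_col, other_col, cell = {"College"}, {"Position"}, {"guard"}
good = reward({"College"}, {"guard"}, sel_col, other_col, cell)
bad = reward({"College", "Player"}, {"guard"}, sel_col, other_col, cell)
grad = reinforce_logit_gradient([[0.0, 0.0]], sampled=[0], r=1.0)
```

A positive reward pushes probability mass toward the sampled tags, while a reward of -1 flips the sign of the same gradient and pushes mass away from them.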

Training and Inference
To train the anonymization model and the parser as a whole, we regard the anonymous utterance $\tilde{x}$ as the hidden variable for the neural semantic parser $p(y|x)$. Then, the log likelihood of $p(y|x)$ can be maximized through its Evidence Lower BOund (ELBO) (Kim et al., 2018), i.e.,
$$\log p(y|x) \geq \mathbb{E}_{q_\theta(\tilde{x}|x)}\left[\log p_\phi(y, \tilde{x}|x) - \log q_\theta(\tilde{x}|x)\right] = \mathrm{ELBO}(\theta, \phi).$$
One popular strategy for maximizing the ELBO is coordinate ascent. Specifically, it iteratively executes the following two steps (Neal and Hinton, 1998): 1) the variational E-step, which maximizes the ELBO w.r.t. $\theta$ keeping $\phi$ fixed, i.e., $\theta = \arg\min_\theta \mathrm{KL}\left(q_\theta(\tilde{x}|x) \,\|\, p_\phi(\tilde{x}|y, x)\right)$; and 2) the variational M-step, which maximizes the ELBO w.r.t. $\phi$ keeping $\theta$ fixed, i.e., $\phi = \arg\max_\phi \mathbb{E}_{q_\theta(\tilde{x}|x)}\left[\log p_\phi(y, \tilde{x}|x)\right]$.
In our scenario, $\phi$ and $\theta$ refer to the learnable parameters in the neural semantic parser and the anonymization model respectively. The variational E-step usually finds the best variational approximation to the true posterior (Neal and Hinton, 1998). As discussed in Section 4.2, the true posterior we can obtain is in the form of an unordered set. Thus, we actually maximize the expected reward, i.e., $J_{\mathrm{ER}}(\theta)$, instead of minimizing the KL divergence (see Section 4.3). For the variational M-step, to save training time, we simply sample one example greedily, which approximates maximizing the log likelihood of $p_\phi(y, \tilde{x}|x)$, i.e., $J_{\mathrm{MLE}}(\phi)$. Moreover, since performing coordinate ascent on the entire dataset is too expensive, the variational E-step and M-step are usually performed over mini-batches (Hoffman et al., 2013). To further improve time efficiency, we optimize the objectives of the variational E-step and M-step simultaneously instead of alternately. Thus, the actual objective jointly maximizes $J_{\mathrm{MLE}}(\phi)$ and $J_{\mathrm{ER}}(\theta)$, where $J_{\mathrm{MLE}}(\phi)$ is the log likelihood of the generated SQL query given the input utterance and the anonymous utterance, and $J_{\mathrm{ER}}(\theta)$ is the expected reward of the anonymous utterance given the input utterance (see Eqn (8) for details). Specifically, when optimizing $J_{\mathrm{MLE}}(\phi)$, we use the concatenation of the encoding of the input utterance (denoted as $e(x_t)$) and the encoding of the anonymous utterance (denoted as $h(\tilde{x}_t)$) as the input for the neural semantic parser, i.e., $g_t = [e(x_t); h(\tilde{x}_t)]$, $t \in \{1, \ldots, T\}$ (see Figure 2). The encoding $e(x_t)$ is determined by the parser itself. The encoding $h(\tilde{x}_t)$ is the concatenation of two parts: 1) the embedding of the channel name, i.e., COL, CELL or UNK, and 2) the embedding of the index $k$ when the channel is COL, indicating that the t-th token is related to the k-th column name in the table.
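The construction of the parser input can be sketched with plain lists. The sketch assumes that tokens whose channel is CELL or UNK receive a zero vector in place of the column-index embedding; this padding choice, the function names and the toy embedding values are assumptions, since only the COL case is specified above.

```python
from typing import Dict, List

Vector = List[float]


def encode_tag(tag: str, channel_emb: Dict[str, Vector],
               index_emb: Dict[int, Vector], index_dim: int) -> Vector:
    """Encode a tag as [channel embedding; column-index embedding]."""
    if tag.startswith("COL_"):
        k = int(tag.split("_")[1])
        return channel_emb["COL"] + index_emb[k]
    # Assumption: non-COL channels carry no index, so pad with zeros.
    return channel_emb[tag] + [0.0] * index_dim


def parser_input(e_x: Vector, tag: str, channel_emb: Dict[str, Vector],
                 index_emb: Dict[int, Vector], index_dim: int) -> Vector:
    """Concatenate the token encoding with the encoding of its tag."""
    return e_x + encode_tag(tag, channel_emb, index_emb, index_dim)


# Toy embeddings (hypothetical values).
channel_emb = {"COL": [1.0, 0.0], "CELL": [0.0, 1.0], "UNK": [0.0, 0.0]}
index_emb = {1: [0.1], 2: [0.2]}
e_x = [0.3, 0.4, 0.5]
g_col = parser_input(e_x, "COL_2", channel_emb, index_emb, index_dim=1)
g_unk = parser_input(e_x, "UNK", channel_emb, index_emb, index_dim=1)
```

Every position ends up with a vector of the same length, so the concatenated sequence can be fed to the parser's encoder without further changes.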
At test time, the prediction for input utterance $x$ is obtained by $\hat{\tilde{x}} = \arg\max_{\tilde{x}'} p(\tilde{x}'|x)$ and $\hat{y} = \arg\max_{y'} p(y'|x, \hat{\tilde{x}})$.

Experimental Setup
We conduct experiments on WikiSQL (Zhong et al., 2017) and Spider (Yu et al., 2018c). Each example consists of a natural language question, a SQL query and a table. On WikiSQL, one question is related to only one table, while on Spider, one question is usually related to multiple tables.
Our model is implemented in PyTorch (Paszke et al., 2017). The type of the input encoder in the anonymization model (i.e., BiLSTM or BERT) is set to be the same as that in the concrete parser. Embedding vectors of the anonymous utterance are initialized by GloVe (Pennington et al., 2014). We use the manually labeled training data (which is 10% of the whole training data) to initialize our anonymization model and then optimize the entire framework on the whole training data. All the other hyperparameters, e.g., the learning rate, hyperparameters in the ADAM optimizer, the number of training epochs, etc., are tuned on the dev set.

Performance of Neural Semantic Parsing
First, we show that reducing the lexical problem through the anonymization model can improve the performance of neural semantic parser.
To this end, we add the anonymization model to typical neural semantic parsers as presented in Section 4. We use '[A]+DAE' to denote the neural semantic parser with the anonymization model, where A stands for the original name of the concrete parser and DAE is the abbreviated name of our approach, i.e., data-anonymous encoding.
On WikiSQL, the performance is evaluated by query-match accuracy ($\mathrm{ACC}_{QM}$) and execution accuracy ($\mathrm{ACC}_{EX}$), which measure whether the canonical representation and the execution result, respectively, match between the predicted SQL and the ground truth (Yu et al., 2018a). Table 1 shows the results. First, we can observe that query-match accuracy on test data is improved by 6.4% at most and 1.1% at least. Furthermore, for TypeSQL, query-match accuracy can be further improved by 1.1% although it already uses a string-match based approach to anonymize table-related tokens. Moreover, we perform ablation studies by 1) removing the supervision for the anonymization model (denoted as '-Supervision' in Table 1), and 2) simply using the output of the trained anonymization model as the input for the parser without training them as a whole (denoted as '-Co-training' in Table 1). We can observe that the performance improvement is limited without supervision or co-training, indicating that both of them are indispensable.
On Spider, the performance is evaluated by exact-match accuracy on different difficulty levels of SQL queries, i.e., easy, medium, hard and extra hard (Yu et al., 2018c). Table 2 shows the results. First, the overall accuracy can be improved by 3.2% and 2.5% respectively. Furthermore, performances on medium, hard and extra hard SQL queries improve more than those on easy SQL queries, indicating that our approach is more helpful for solving complicated cases.

Performance of Reducing Input Utterances
To further demonstrate the effectiveness of reducing the lexical problem, we show that different input utterances can be reduced to the same anonymous utterance. To this end, we process input utterances and anonymous utterances by 1) converting the characters to lowercase, 2) lemmatizing the tokens and 3) removing the articles. Table 3 shows that by anonymizing table-related tokens, the number of distinct utterances is reduced from 8387 to 5488 on the dev set and from 15828 to 9680 on the test set. Furthermore, Table 4 shows the three anonymous utterances that are most frequent on the dev set and examples of corresponding input utterances. All of these indicate that although input utterances can hardly be identical, they often share the same anonymous utterance.
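The counting procedure can be sketched as follows. The lemmatizer is left as a pluggable function that defaults to the identity, because no specific lemmatizer is named above; that default, the article list and the two example questions are assumptions for illustration.

```python
from typing import Callable, Iterable, List, Tuple

ARTICLES = {"a", "an", "the"}


def normalize(tokens: List[str],
              lemmatize: Callable[[str], str] = lambda t: t) -> Tuple[str, ...]:
    """Lowercase, lemmatize and drop articles; returns a hashable tuple."""
    lowered = [t.lower() for t in tokens]
    lemmas = [lemmatize(t) for t in lowered]
    return tuple(t for t in lemmas if t not in ARTICLES)


def count_distinct(utterances: Iterable[List[str]]) -> int:
    """Number of distinct utterances after normalization."""
    return len({normalize(u) for u in utterances})


# Two hypothetical questions and their anonymous forms.
raw = [
    ["What", "is", "the", "country", "of", "Messi", "?"],
    ["what", "is", "the", "college", "of", "Jordan", "?"],
]
anonymous = [
    ["What", "is", "the", "c", "of", "v", "?"],
    ["what", "is", "the", "c", "of", "v", "?"],
]
distinct_raw = count_distinct(raw)
distinct_anonymous = count_distinct(anonymous)
```

In this toy case two different questions collapse into a single anonymous utterance, which is the effect the counts in Table 3 measure at scale.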

Performance of Anonymization Methods
In addition, we compare performances of different anonymization methods, including 1) TypeSQL, which uses exact string match to detect column names, 2) AnnotatedSeq2Seq, which detects column names by setting a threshold for the edit distance and the cosine similarity of word embedding, and 3) DAE, our learning-based approach. Anonymization methods are evaluated by the following metrics: 1) whether the column names in the SELECT clause are included in the pre- ). Here, $Z$ is the amount of test data, $I(\cdot)$ is the indicator function, and the superscript $i$ is the index of data. Table 5 shows that DAE significantly outperforms TypeSQL and AnnotatedSeq2Seq on all the evaluation metrics. First, for $\mathrm{ACC}_{SC}$, DAE outperforms TypeSQL and AnnotatedSeq2Seq by 16% and 3.5% on test data; for $\mathrm{ACC}_{OC}$, DAE outperforms TypeSQL and AnnotatedSeq2Seq by 0.8% and 28% on test data. Moreover, DAE achieves around 86% for $\mathrm{ACC}_{CE}$, while the other methods fail to recognize cells when the table content is not available due to privacy concerns.

Conclusion
In this work, we propose a learning-based approach to reduce the lexical problem before the neural semantic parser in text-to-SQL generation. Specifically, we propose a two-stage anonymization model and leverage implicit supervision from SQL queries by policy gradient to guide its training. In the future, we plan to improve the performance of the anonymization model by exploring a more efficient expected reward. In addition, we plan to extend our approach to tasks with question-denotation pairs as supervision.