Natural Language to Structured Query Generation via Meta-Learning

In conventional supervised training, a model is trained to fit all the training examples. However, having a monolithic model may not always be the best strategy, as examples could vary widely. In this work, we explore a different learning protocol that treats each example as a unique pseudo-task, by reducing the original learning problem to a few-shot meta-learning scenario with the help of a domain-dependent relevance function. When evaluated on the WikiSQL dataset, our approach leads to faster convergence and achieves 1.1%–5.4% absolute accuracy gains over the non-meta-learning counterparts.


Introduction
Conventional supervised training is a pervasive paradigm for NLP problems. In this setting, a model is trained to fit all the training examples and their corresponding targets. However, while sharing the same surface form of the prediction task, examples of the same problem may vary widely. For instance, recognizing textual entailment is a binary classification problem on whether the hypothesis follows a given textual statement, but the challenge datasets consist of a huge variety of inference categories and genres (Dagan et al., 2013;Williams et al., 2017). Similarly, for a semantic parsing problem that maps natural language questions to SQL statements, the number of conditions in a SQL query or the length of a question can vary substantially (Zhong et al., 2017).
The inherently high variety of the examples suggests an alternative training protocol: instead of learning a monolithic, one-size-fits-all model, it could be more effective to learn multiple models, where each one is designed for a specific "task" that covers a group of similar examples. How- * Work performed while XH was at Microsoft Research. ever, this strategy is faced with at least two difficulties. As the number of tasks increases, each task will have much fewer training examples for learning a robust model. In addition, the notion of "task", namely the group of examples, is typically not available in the dataset.
In this work, we explore this alternative learning setting and address the two difficulties by adapting the meta-learning framework. Motivated by the few-shot learning scenario (Andrychowicz et al., 2016;Ravi and Larochelle, 2016;Vinyals et al., 2016), meta-learning aims to learn a general model that can quickly adapt to a new task given very few examples without retraining the model from scratch (Finn et al., 2017). We extend this framework by effectively creating pseudo-tasks with the help of a relevance function. During training, each example is viewed as the test example of an individual "task", where its top-K relevant instances are used as training examples for this specific task. A general model is trained for all tasks in aggregation. Similarly during testing, instead of applying the general model directly, the top-K relevant instances (in the training set) to the given test example are first selected to update the general model, which then makes the final prediction. The overview of the proposed framework is shown in Figure 1.
When empirically evaluated on a recently proposed, large semantic parsing dataset, WikiSQL (Zhong et al., 2017), our approach leads to faster convergence and achieves 1.1%-5.4% absolute accuracy gain over the non-meta-learning counterparts, establishing a new state-of-the-art result. More importantly, we demonstrate how to design a relevance function to successfully reduce a regular supervised learning problem to a meta-learning problem. To the best of our knowledge, this is the first successful attempt in adapting meta-learning to a semantic task.  K from all training datapoints given a datapoint D j for constructing a pseudo-task T j as in the few-shot meta-learning setup. (Bottom) We optimize the model parameters θ such that the model can learn to adapt a new task with parameters θ j via a few gradient steps on the training examples of the new task. The model is updated by considering the test error on the test example of the new task. See Section 2 for detail.

Background: Meta-Learning
Our work is built on the recently proposed Model-Agnostic Meta-Learning (MAML) framework (Finn et al., 2017), which we describe briefly here. MAML aims to learn the learners (for the tasks) and the meta-learner in the few-shot metalearning setup (Vinyals et al., 2016;Andrychowicz et al., 2016;Ravi and Larochelle, 2016). Formally, it considers a model that is represented by a function f θ with parameters θ. When the model adapts to a new task T i , the model changes parameters θ to θ i , where a task contains K training examples and one or more test examples (K-shot learning). MAML updates the parameters θ i by one or a few rounds of gradient descent based on the training examples of task T i . For example, with one gradient update, where the step size α is a hyper-parameter; L T i (f θ ) is a loss function that evaluates the error between the prediction f θ (x (j) ) and target y (j) , where x (j) , y (j) are an input/output pair sampled from the training examples of task T i . Model parameters θ are trained to optimize the performance of f θ i on the unseen test examples from T i across tasks. The meta-objective is: The goal of MAML is to optimize the model parameters θ such that the model can learn to adapt new tasks with parameters θ i via a few gradient steps on the training examples of new tasks. The model is improved by considering how the test error on unseen test data from T i changes with respect to the parameters. The meta-objective across tasks is optimized using stochastic gradient descent (SGD). The model parameters θ are updated as follows: where β is the meta step size.

Approach
As discussed in Section 1, to reduce traditional supervised learning to a few-shot meta-learning problem, we introduce a relevance function, which effectively helps group examples to form pseudotasks. Because the relevance function is problemdependent, we first describe the semantic parsing problem below, followed by the design of our relevance function and the complete algorithm.

The Semantic Parsing Task
The specific semantic parsing problem we study in this work is to map a natural language question to a SQL query, which can be executed against a given table to find the answer to the original question. In particular, we use the currently largest natural language questions to SQL dataset, WikiSQL (Zhong et al., 2017), to develop our model and to conduct the experiments.

Relevance Function
The intuition behind the design of a relevance function is that examples of the same type should have higher scores. For the questions to SQL problem, we design a simple relevance function that depends on (1) the predicted type of the corresponding SQL query and (2) the question length.
There are five SQL types in the WikiSQL dataset: {Count, Min, Max, Sum, Avg, Select}. We train a SQL type classifier f sql using SVMs with bag-of-words features of the input question, which achieves 93.5% training accuracy and 88% test accuracy in SQL type prediction. Another soft indication on whether two questions can be viewed as belonging to the same "task" is their lengths, as they correlate to the lengths of the mapped SQL queries. The length of a question is the number of tokens in it after normal-Algorithm 1 Pseudo-Task MAML (PT-MAML) Require: Training Datapoints D = {x (j) , y (j) } Require: α, β: step size hyperparameters Require: K: support set size hyperparameter 1: Construct a task T j with training examples using a support set S (j) K and a test example D j = (x (j) , y (j) ). 2: Denote p(T ) as distribution over tasks 3: Randomly initialize θ 4: while not done do 5: Compute adapted parameters with gradient descent: using each D i from T i and L T i for the metaupdate 11: end while izing entity mentions to single tokens. 1 Our relevance function only considers examples of the same predicted SQL types. If examples x (i) and x (j) have the same SQL type, then their relevance score is 1 − |q len (x (i) ) − q len (x (j) )|, where q len calculates the question length. Notice that the relevance function does not need to be highly accurate as there is no formal definition on which examples should be grouped in the same pseudo-task. A heuristic-based function that encodes some domain knowledge typically works well based on our preliminary study. In principle, the relevance function can also be jointly learned with the metalearning model, which we leave for future work.

Algorithm
Given a relevance function, the adaptation of the meta-learning using the MAML framework can be summarized in Algorithm 1, called Pseudo-Task MAML (PT-MAML). For each training example x (j) , we create a pseudo-task T j using the top-K relevant examples as the support set S (j) K (Step 1). The remaining steps of the algorithm mimics the original MAML design, update task-level models (Step 8) and the meta-level, general model (Step 10) using gradient descent.

Experiments
In this section, we introduce the WikiSQL dataset and preprocessing steps, the learner model in our meta-learning setup, and the experimental results.

Dataset
We evaluate our model on the WikiSQL dataset (Zhong et al., 2017). We follow the data preprocessing in (Wang et al., 2017). Specifically, we first preprocess the dataset by running both tables and question-query pairs through Stanford Stanza  using the script included with the WikiSQL dataset, which normalizes punctuations and cases of the dataset. We further normalize each question based on its corresponding table: for table entries and columns occurring in questions or queries, we normalize their format to be consistent with the table. After preprocessing, we filter the training set by removing pairs whose ground truth solution contains constants not mentioned in the question, as our model requires the constants to be copied from the question. We train and tune our model only on the filtered training and filtered development set, but we report our evaluation on the full development and test sets. We obtain 59,845 (originally 61,297) training pairs, 8,928 (originally 9,145) development pairs and 17,283 test pairs (the test set is not filtered).

Learner Model
We use the model of Wang et al. (2017) as the learner in our meta-learning setup. The model is a grammar-aware Seq2Seq encoder-decoder model with attention . The encoder is a bidirectional LSTM, which takes the concatenation of the table header (column names) of the queried table and the question as input to learn a joint representation. The decoder is another LSTM with attention mechanism. There are three output layers corresponding to three decoding types, which restricts the vocabulary it can sample from at each decoding step. The three decoding types are defined as follows: • τ V (SQL operator): The output has to be a SQL operator, i.e., a terminal from V = {Select, From, Where, Id, Max, Min, Count, Sum, Avg, And, =, >, ≥, <, ≤, <END>, <GO>}. • τ C (column name): The output has to be a column name, which will be copied from either the table header or the query section of where Acc lf represents the logical form accuracy and Acc ex represents the SQL execution accuracy. "Pointer loss", "Max loss", and "Sum loss" are the non-metalearning counterpart from Wang et al. (2017). "Meta + X" denotes the meta-learning model with learner "X".
the input sequence. Note that the column required for the correct SQL output may or may not be mentioned explicitly in the question. • τ Q (constant value): The output is a constant that would be copied from the question section of the input sequence.
The grammar of SQL expressions in the the WikiSQL dataset can be described in regular expression as "Select f c From t Where (c op v) * " (f refers to an aggregation function, c refers to a column name, t refers to the table name, op refers an comparator and v refers to a value). The form can be represented by a decoding-type sequence τ V τ V τ C τ V τ C τ V (τ C τ V τ Q ) * , which will ensure only decoding-type corrected tokens can be sampled at each decoding step. Wang et al. (2017) propose three cross-entropy based loss functions: "Pointer loss", which is the cross-entropy between target index and the chosen index, "Max loss", which computes the probability of copying a token v in the input as the maximum probability of pointers that point to token v, and "Sum loss", which computes the probability of copying a token v in the input as the sum of probabilities of pointers that point to token v. See (Wang et al., 2017) for more detail.

Model Hyperparameters
We use the pre-trained n-gram embeddings by Hashimoto et al. (2017)  Accuracy meta_train train meta_dev dev Figure 2: Logical form accuracy comparison, where "meta train" and "meta dev" are the train and development set accuracy using the "Meta + Sum loss" model, "train" and "dev" are the train and development set accuracy using the "Sum loss" model (Wang et al., 2017).
batch size of 200 for 100 epochs. During training, we clip gradients at 5 and add gradient noise with η = 0.3, γ = 0.55 to stabilize training (Neelakantan et al., 2015). We found the meta-learning model is trained stably without back-propagating to second order gradients. We select the support set size K to be 2 based on the development set. Empirically, the performance does not improve when we use a larger K. We set the learning rates α = 0.001 and β = 0.1 based on the development set. The model is implemented in Tensorflow and trained using Adagrad (Duchi et al., 2011). Table 1 shows the experimental results of our model on the WikiSQL dataset. We select the model based on the best logical form accuracy on the development set, and compare our results to augmented pointer network and the Seq2SQL model (with RL) in (Zhong et al., 2017). Both logical form accuracy (denoted by Acc lf ) that compares the exact SQL syntax match, and the SQL execution results (denoted by Acc ex ) are reported. We compare our approach with its nonmeta-learning counterpart using "Pointer loss", "Max loss", and "Sum loss" losses from (Wang et al., 2017). Our model achieves 1.1%-5.3% and 1.2%-5.4% gains on the test set logical form and execution accuracy, respectively. We also investigate the training and development set logical form accuracy over different epochs by "Meta + Sum loss" and "Sum loss" models. The results are shown in Figure 2. One interesting observation is that the "Meta + Sum loss" model converges much faster than the "Sum loss" model especially in the first 10 epochs. We attribute this improvement to the ability to adapt to new tasks even with a small number of training examples.  Figure 3: Logical form accuracy comparison. "Meta + Sum loss (o), Sum loss (x)" indicates the generated SQL is incorrect by the "Sum loss" model and is correct by the "Meta + Sum loss" model. Similarly, "Meta + Sum loss (x), Sum loss (o)" indicates the generated SQL is incorrect by the "Meta + Sum loss" model and is correct by the "Sum loss" model.

Results
We compare the logical form error on the test set between the "Sum loss" model (Wang et al., 2017) and the proposed "Meta + Sum loss" model. Among the 17,283 test examples, there are 6,661 and 6,428 errors by the "Sum loss" and the "Meta + Sum loss", respectively. There are 5,190 common errors by both models. We examine the test examples where "Sum loss" is correct while "Meta + Sum loss" is not and vice versa, shown in Figure 3. We observe that the differences are mainly in ground truth SQL length = 7 and 10, where the "Meta + Sum loss" model outperforms "Sum loss" model by a large margin. We show some examples for the two cases in the supplementary material.

Related Work
Meta Learning One popular direction of metalearning (Thrun and Pratt, 1998;Schmidhuber, 1987;Naik and Mammone, 1992) is to train a meta-learner that learns how to update the parameters of the learners model (Bengio et al., 1992;Schmidhuber, 1992). This direction has been applied to learning to optimize deep neural networks (Hochreiter et al., 2001;Andrychowicz et al., 2016;Li and Malik, 2017;Ha et al., 2017). Fewshot learning methods have also adapted metalearning approaches for image recognition (Koch, 2015;Ravi and Larochelle, 2016;Vinyals et al., 2016) and reinforcement learning (Finn et al., 2017). Given that the few-shot learning setup cannot directly work in standard supervised learning problems, we explore reducing a regular supervised learning problem to the few-shot metalearning scenario by creating pseudo-tasks with a relevance function.
Semantic Parsing Mapping natural language to logic forms has been actively studied in natural language processing research (Zettlemoyer and Collins, 2005;Giordani and Moschitti, 2010;Artzi and Zettlemoyer, 2011;Berant et al., 2013;Vlachos and Clark, 2014;Yih et al., 2014Yih et al., , 2015Wang et al., 2015;Golub and He, 2016;Iyer et al., 2017;. However, unlike conventional approaches, which fit one model for all training examples, the proposed approach learns to adapt to new tasks. By using the support set based on the relevance function, the proposed model can adapt to a unique model for each example. Program Induction / Synthesis Program induction (Reed and De Freitas, 2016;Neelakantan et al., 2015;Graves et al., 2014;Yin et al., 2015;Devlin et al., 2017) aims to infer latent programs given input/output examples, while program synthesis models (Zhong et al., 2017;Parisotto et al., 2017) aim to generate explicit programs and then execute them to get output. The learner model we used in this work follows the line of program synthesis models and trains on pairs of natural language (question) and program (SQL) directly.

Conclusion
In this paper, we propose a new learning protocol that reduces a regular supervised learning problem to the few-shot meta-learning scenario. This is done by effectively creating pseudo-tasks with the help of a relevance function. When evaluated on the newly released, large semantic parsing dataset, WikiSQL, our approach leads to faster convergence and enjoys 1.1%-5.4% absolute accuracy gains over the non-meta-learning counterparts, achieving a new state-of-the-art result.
While the initial finding is encouraging, we believe the potential of this meta-learning framework has not yet been fully realized. In the future, we plan to explore more variations of the meta-learning setup, such as using different relevance functions, including the ones that are jointly learned. We also would like to understand this approach better by testing it on more natural language processing tasks.