Meta-Learning for Domain Generalization in Semantic Parsing

The importance of building semantic parsers which can be applied to new domains and generate programs unseen at training has long been acknowledged, and datasets testing out-of-domain performance are becoming increasingly available. However, little or no attention has been devoted to learning algorithms or objectives which promote domain generalization, with virtually all existing approaches relying on standard supervised learning. In this work, we use a meta-learning framework which targets zero-shot domain generalization for semantic parsing. We apply a model-agnostic training algorithm that simulates zero-shot parsing by constructing virtual train and test sets from disjoint domains. The learning objective capitalizes on the intuition that gradient steps that improve source-domain performance should also improve target-domain performance, thus encouraging a parser to generalize to unseen target domains. Experimental results on the (English) Spider and Chinese Spider datasets show that the meta-learning objective significantly boosts the performance of a baseline parser.


Introduction
Semantic parsing is the task of mapping natural language (NL) utterances to executable programs. While there has been much progress in this area, earlier work has primarily focused on evaluating parsers in-domain (e.g., tables or databases) and often with the same programs as those provided in training (Finegan-Dollak et al., 2018). A much more challenging goal is achieving domain generalization, i.e., building parsers which can be successfully applied to new domains and are able to produce complex unseen programs. Achieving this generalization goal would, in principle, let users query arbitrary (semi-)structured data on the Web and reduce the annotation effort required to build multi-domain NL interfaces (e.g., Apple Siri or Amazon Alexa). Current parsers struggle in this setting; for example, we show in Section 5 that a modern parser trained on the challenging Spider dataset (Yu et al., 2018b) has a gap of more than 25% in accuracy between in- and out-of-domain performance. While the importance of domain generalization has been previously acknowledged (Cai and Yates, 2013), and datasets targeting zero-shot (or out-of-domain) performance are becoming increasingly available (Pasupat and Liang, 2015; Wang et al., 2015; Zhong et al., 2017; Yu et al., 2018b), little or no attention has been devoted to studying learning algorithms or objectives which promote domain generalization.
Conventional supervised learning simply assumes that source- and target-domain data originate from the same distribution, and as a result struggles to capture this notion of domain generalization for zero-shot semantic parsing. Previous approaches (Guo et al., 2019b; Herzig and Berant, 2018) facilitate domain generalization by incorporating inductive biases in the model, e.g., designing linking features or functions which should be invariant under domain shifts. In this work, we take a different direction and improve the domain generalization of a semantic parser by modifying the learning algorithm and the objective. We draw inspiration from meta-learning (Finn et al., 2017) and use an objective that optimizes for domain generalization. That is, we consider a set of tasks, where each task is a zero-shot semantic parsing task with its own source and target domains. By optimizing towards better target-domain performance on each task, we encourage a parser to extrapolate from source-domain data and achieve better domain generalization. Specifically, we focus on text-to-SQL parsing where we aim at translating NL questions to SQL queries and conduct evaluations on unseen databases. Consider the example in Figure 1: a parser needs to process questions posed to a new database at test time. To simulate this scenario during training, we synthesize a set of virtual zero-shot parsing tasks by sampling disjoint source and target domains 1 for each task from the training domains. The objective we require is that gradient steps computed towards better source-domain performance would also be beneficial to target-domain performance. One can think of the objective as consisting of both the loss on the source domain (as in standard supervised learning) and a regularizer, equal to the dot product between gradients computed on source- and target-domain data.
Maximizing this regularizer favours finding model parameters that work not only on the source domain but also generalize to target-domain data. The objective is borrowed from  who adapt a Model-Agnostic Meta-Learning (MAML; Finn et al. 2017) technique for domain generalization in computer vision. In this work, we study the effectiveness of this objective in the context of semantic parsing. This objective is model-agnostic, simple to incorporate and does not require any changes in the parsing model itself. Moreover, it does not introduce new parameters for meta-learning.
Our contributions can be summarized as follows.
• We handle zero-shot semantic parsing by applying a meta-learning objective that directly optimizes for domain generalization.
• We propose an approximation of the meta-learning objective that is more efficient and allows more scalable training.
• We perform experiments on two text-to-SQL benchmarks: Spider and Chinese Spider. Our new training objectives obtain significant improvements in accuracy over a baseline parser trained with conventional supervised learning.
• We show that even when parsers are augmented with pre-trained models, e.g., BERT, our method can still effectively improve domain generalization in terms of accuracy.
1 We use the terms domain and database interchangeably.

Related Work
Zero-Shot Semantic Parsing Developing a parser that can generalize to unseen domains has attracted increased attention in recent years. Previous work has mainly focused on the sub-task of schema linking as a means of promoting domain generalization. In schema linking, we need to recognize which columns or tables are mentioned in a question. For example, a parser would decide to select the column Status because of the word statuses in Figure 1. However, in the setting of zero-shot parsing, columns or tables might be mentioned in a question without ever being observed during training. One line of work tries to incorporate inductive biases, e.g., domain-invariant n-gram matching features (Guo et al., 2019b), cross-domain alignment functions (Herzig and Berant, 2018), or auxiliary linking tasks, to improve schema linking. However, in the cross-lingual setting of Chinese Spider (Min et al., 2019), where questions and schemas are not in the same language, it is not obvious how to design inductive biases such as n-gram matching features. Another line of work relies on large-scale unsupervised pre-training on massive tables (Herzig et al., 2020; Yin et al., 2020) to obtain better representations for both questions and database schemas. Our work is orthogonal to these approaches and can be easily coupled with them. As an example, we show in Section 5 that our training procedure can improve the performance of a parser already enhanced with n-gram matching features (Guo et al., 2019b). Our work is similar in spirit to Givoli and Reichart (2019), who also attempt to simulate source and target domains during learning. However, their optimization updates on virtual source and target domains are loosely connected by a two-step training procedure where a parser is first pre-trained on virtual source domains and then fine-tuned on virtual target domains.
As we will show in Section 3, our training procedure does not fine-tune on virtual target domains but rather uses them to evaluate a gradient step (for every batch) on source domains. This is better aligned with what is expected of the parser at test time: there will be no fine-tuning on real target domains at test time, so there should not be any fine-tuning on simulated ones at train time either. Moreover, Givoli and Reichart (2019) treat the division of training domains into virtual train and test domains as a hyper-parameter, which is possible for a handful of domains, but problematic when dealing with hundreds of domains as is the case for text-to-SQL parsing.
Meta-Learning for NLP Meta-learning has been receiving soaring interest in the machine learning community. Unlike conventional supervised learning, meta-learning operates on tasks, instead of data points. Most previous work (Vinyals et al., 2016; Ravi and Larochelle, 2017; Finn et al., 2017) has focused on few-shot learning where meta-learning helps address the problem of learning to learn fast for adaptation to a new task or domain. Applications of meta-learning in NLP are cast in a similar vein and include machine translation (Gu et al., 2018) and relation classification (Obamuyide and Vlachos, 2019). The meta-learning framework, however, is more general, with the algorithms or underlying ideas applied, e.g., to continual learning, semi-supervised learning (Ren et al., 2018), multi-task learning (Yu et al., 2020) and, as in our case, domain generalization.
Very recently, there have been some applications of MAML to semantic parsing tasks (Huang et al., 2018;Guo et al., 2019a;Sun et al., 2019). These approaches simulate few-shot learning scenarios in training by constructing a pseudo-task for each example. Given an example, similar examples are retrieved from the original training set. MAML then encourages strong performance on the retrieved examples after an update on the original example, simulating test-time fine-tuning. Lee et al. (2019) use matching networks (Vinyals et al., 2016) to enable one-shot text-to-SQL parsing where tasks for meta-learning are defined by SQL templates, i.e., a parser is expected to generalize to a new SQL template with one example. In contrast, the tasks we construct for meta-learning aim to encourage generalization across domains, instead of adaptation to a new task with one (or few) examples. One clear difference lies in how meta-train and meta-test sets are constructed. In previous work (e.g., Huang et al. 2018), these come from the same domain whereas we simulate domain shift and sample different sets of domains for meta-train and meta-test.
Domain Generalization Although the notion of domain generalization has been less explored in semantic parsing, it has been studied in other areas such as computer vision (Ghifary et al., 2015; Zaheer et al., 2017). Recent work (Balaji et al., 2018) employed optimization-based meta-learning to handle domain shift issues in domain generalization. We employ the meta-learning objective originally proposed in , where they adapt MAML to encourage generalization in unseen domains (of images). Based on this objective, we propose a cheap alternative that only requires first-order gradients, thus alleviating the overhead of computing second-order derivatives required by MAML.

Meta-Learning for Domain Generalization
We first formally define the problem of domain generalization in the context of zero-shot text-to-SQL parsing. Then, we introduce DG-MAML, a training algorithm that helps a parser achieve better domain generalization. Finally, we propose a computationally cheap approximation thereof.

Problem Definition
Domain Generalization Given a natural language question $Q$ in the context of a relational database $D$, we aim at generating the corresponding SQL $P$. In the setting of zero-shot parsing, we have a set of source domains $\mathcal{D}_s$ where labeled question-SQL pairs are available. We aim at developing a parser that can perform well on a set of unseen target domains $\mathcal{D}_t$. We refer to this problem as domain generalization.
Parsing Model We assume a parameterized parsing model that specifies a predictive distribution $p_\theta(P \mid Q, D)$ over all possible SQLs. For domain generalization, a parsing model needs to properly condition on its input of questions and databases such that it can generalize well to unseen domains.

Conventional Supervised Learning
Assuming that question-SQL pairs from source domains and target domains are sampled i.i.d. from the same distribution, the typical training objective of supervised learning is to minimize the negative log-likelihood of the gold SQL query:
$$\mathcal{L}_{\mathcal{B}}(\theta) = -\frac{1}{N} \sum_{(Q, P, D) \in \mathcal{B}} \log p_\theta(P \mid Q, D) \tag{1}$$
where $N$ is the size of mini-batch $\mathcal{B}$. Since a mini-batch is randomly sampled from all training source domains $\mathcal{D}_s$, it usually contains question-SQL pairs from a mixture of different domains.
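As a small illustration of the supervised objective described above, the following sketch averages the negative log-likelihood over a mini-batch. The function name `nll_loss` and the probability values are hypothetical; in a real parser each value would be the probability the model assigns to the gold SQL query.

```python
import math

def nll_loss(gold_probs):
    """Mean negative log-likelihood over a mini-batch.

    gold_probs[i] stands in for p_theta(P_i | Q_i, D_i), the probability the
    parser assigns to the gold SQL of example i (hypothetical values here).
    """
    n = len(gold_probs)
    return -sum(math.log(p) for p in gold_probs) / n

# A mini-batch may mix examples from several databases; the loss treats
# them identically, which is the i.i.d. assumption discussed in the text.
batch = [0.9, 0.5, 0.25]
loss = nll_loss(batch)
```

Note that nothing in this loss distinguishes which domain an example came from.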
Distribution of Tasks Instead of treating semantic parsing as a conventional supervised learning problem, we take an alternative view based on meta-learning. Basically, we are interested in a learning algorithm that can benefit from a distribution of choices of source and target domains, denoted by $p(\tau)$, where $\tau$ refers to an instance of a zero-shot semantic parsing task that has its own source and target domains.
In practice, we usually have a fixed set of training source domains $\mathcal{D}_s$. We construct a set of virtual tasks $\tau$ by randomly sampling disjoint source and target domains from the training domains. Intuitively, we assume that divergences between the test and training domains during the learning phase are representative of differences between training and actual test domains. This is still an assumption, but considerably weaker compared to the i.i.d. assumption used in conventional supervised learning. Next, we introduce the training algorithm called DG-MAML motivated by this assumption.
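The construction of a virtual task can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper `sample_task`, the parameter `num_target`, and the database names are all assumptions made for the example.

```python
import random

def sample_task(train_domains, num_target, rng):
    """Split the training domains into disjoint virtual source/target sets.

    `num_target` (how many domains play the role of virtual targets) is an
    illustrative parameter, not a value taken from the paper.
    """
    shuffled = list(train_domains)
    rng.shuffle(shuffled)
    target = shuffled[:num_target]
    source = shuffled[num_target:]
    return source, target

rng = random.Random(0)
domains = ["db_1", "db_2", "db_3", "db_4", "db_5"]  # hypothetical database ids
source, target = sample_task(domains, num_target=2, rng=rng)
```

Each call yields a fresh split, so over many training steps every training database serves sometimes as a source and sometimes as a target.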

Learning to Generalize with DG-MAML
Having simulated source and target domains for each virtual task, we now need a training algorithm that encourages generalization to unseen target domains in each task. For this, we turn to optimization-based meta-learning algorithms (Finn et al., 2017; Nichol et al., 2018) and apply DG-MAML (Domain Generalization with Model-Agnostic Meta-Learning), a variant of MAML (Finn et al., 2017) for this purpose. Intuitively, DG-MAML encourages the optimization in the source domain to have a positive effect on the target domain as well.
During each learning episode of DG-MAML, we randomly sample a task $\tau$ which has its own source domain $\mathcal{D}_s^\tau$ and target domain $\mathcal{D}_t^\tau$. For the sake of efficiency, we randomly sample mini-batches of question-SQL pairs $\mathcal{B}_s$ and $\mathcal{B}_t$ from $\mathcal{D}_s^\tau$ and $\mathcal{D}_t^\tau$, respectively, for learning in each task. DG-MAML conducts optimization in two steps, namely meta-train and meta-test.
Meta-Train DG-MAML first optimizes parameters towards better performance in the virtual source domain $\mathcal{D}_s^\tau$ by taking one step of stochastic gradient descent (SGD) from the loss under $\mathcal{B}_s$:
$$\theta' \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \tag{2}$$
where $\alpha$ is a scalar denoting the learning rate of meta-train. This step resembles conventional supervised learning where we use stochastic gradient descent to optimize the parameters.

Meta-Test We then evaluate the resulting parameter $\theta'$ in the virtual target domain $\mathcal{D}_t^\tau$ by computing the loss under $\mathcal{B}_t$, which is denoted as $\mathcal{L}_{\mathcal{B}_t}(\theta')$.
Our final objective for a task $\tau$ is to minimize the joint loss on $\mathcal{D}_s^\tau$ and $\mathcal{D}_t^\tau$:
$$\mathcal{L}_\tau(\theta) = \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta') = \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)\big) \tag{3}$$
where we optimize towards better source- and target-domain performance simultaneously. Intuitively, the objective requires that the gradient step conducted in the source domains in Equation (2) be beneficial to the performance of the target domain as well. In comparison, conventional supervised learning, whose objective would be equivalent to $\mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta)$, does not pose any constraint on the gradient updates. As we will elaborate shortly, DG-MAML can be viewed as a regularization of gradient updates in addition to the objective of conventional supervised learning. We summarize our DG-MAML training process in Algorithm 1. Basically, it requires two steps of gradient update (Step 5 and Step 7). Note that $\theta'$ is a function of $\theta$ after the meta-train update. Hence, optimizing $\mathcal{L}_\tau(\theta)$ with respect to $\theta$ involves optimizing the gradient update in Equation (2) as well. That is, when we update the parameters $\theta$ in the final update of Step 7, the gradients need to back-propagate through the meta-train updates in Step 5. The update function in Step 7 could be based on any gradient descent algorithm. In this work we use Adam (Kingma and Ba, 2015).
Comment Note that DG-MAML is different from MAML (Finn et al., 2017) which is typically used in the context of few-shot learning. In our case, it encourages domain generalization during training, and does not require an adaptation phase.

Algorithm 1 DG-MAML Training Algorithm
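Require: Training databases $\mathcal{D}$
Require: Learning rate $\alpha$
1: for step ← 1 to T do
2:   Sample a task $\tau$ of $(\mathcal{D}_s^\tau, \mathcal{D}_t^\tau)$ from $\mathcal{D}$
3:   Sample mini-batch $\mathcal{B}_s$ from $\mathcal{D}_s^\tau$
4:   Sample mini-batch $\mathcal{B}_t$ from $\mathcal{D}_t^\tau$
5:   Meta-train update: $\theta' \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)$
6:   Compute meta-test objective: $\mathcal{L}_\tau(\theta) = \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta')$
7:   Final update: $\theta \leftarrow \text{Update}(\theta, \nabla_\theta \mathcal{L}_\tau(\theta))$
8: end for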
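To make the training loop concrete, here is a toy sketch of DG-MAML on a one-parameter linear regression problem, where the gradient and the Hessian are available in closed form. Everything in it is an assumption made for illustration: the toy domains, the function names, the hyperparameter values, and the use of plain SGD for the final update (the paper uses Adam). A real parser would rely on automatic differentiation instead of the analytic Hessian.

```python
import random

def loss(theta, batch):
    """Mean squared-error loss of the model y = theta * x on a batch."""
    return sum(0.5 * (x * theta - y) ** 2 for x, y in batch) / len(batch)

def grad(theta, batch):
    """Analytic gradient of `loss` with respect to theta."""
    return sum(x * (x * theta - y) for x, y in batch) / len(batch)

def hessian(batch):
    """Analytic second derivative of `loss` (constant for a quadratic)."""
    return sum(x * x for x, _ in batch) / len(batch)

def dg_maml_step(theta, batch_s, batch_t, alpha, lr):
    g_s = grad(theta, batch_s)
    theta_prime = theta - alpha * g_s             # meta-train update
    g_t = grad(theta_prime, batch_t)              # meta-test gradient at theta'
    # Back-propagate through the meta-train update:
    # d(theta')/d(theta) = 1 - alpha * Hessian of the source loss.
    meta_grad = g_s + (1.0 - alpha * hessian(batch_s)) * g_t
    return theta - lr * meta_grad                 # final update (plain SGD here)

def train(domains, steps, alpha, lr, batch_size, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    names = sorted(domains)
    for _ in range(steps):
        src, tgt = rng.sample(names, 2)           # disjoint virtual source/target
        batch_s = rng.sample(domains[src], batch_size)
        batch_t = rng.sample(domains[tgt], batch_size)
        theta = dg_maml_step(theta, batch_s, batch_t, alpha, lr)
    return theta

# Toy "domains": all follow y = 2x but cover different input ranges.
domains = {
    "db_a": [(x / 10, 2 * x / 10) for x in range(1, 11)],
    "db_b": [(x / 10, 2 * x / 10) for x in range(11, 21)],
    "db_c": [(x / 10, 2 * x / 10) for x in range(21, 31)],
}
theta = train(domains, steps=500, alpha=0.05, lr=0.05, batch_size=4)
final_loss = loss(theta, domains["db_c"])
```

The factor `1.0 - alpha * hessian(batch_s)` is where the second-order term enters; dropping it gives a first-order variant.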

Analysis of DG-MAML
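To give an intuition of the objective in Equation (3), we follow previous work (Nichol et al., 2018) and use the first-order Taylor series expansion to approximate it:
$$\begin{aligned} \mathcal{L}_\tau(\theta) &= \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta') \\ &= \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta)\big) \\ &\approx \mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta) - \alpha \big(\nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \cdot \nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta)\big) \end{aligned} \tag{4}$$
where in the last step we expand the function $\mathcal{L}_{\mathcal{B}_t}$ at $\theta$. The approximated objective sheds light on what DG-MAML optimizes. In addition to minimizing the losses from both source and target domains, which are $\mathcal{L}_{\mathcal{B}_s}(\theta) + \mathcal{L}_{\mathcal{B}_t}(\theta)$, DG-MAML further tries to maximize $\nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) \cdot \nabla_\theta \mathcal{L}_{\mathcal{B}_t}(\theta)$, the dot product between the gradients of source and target domain. That is, it encourages gradients to generalize between source and target domain within each task $\tau$.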
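The first-order approximation can be checked numerically on a toy problem. The sketch below, under the assumption of a two-parameter linear model with squared-error loss and hypothetical data, compares the exact objective with the approximation (sum of losses minus the scaled gradient dot product). For such a quadratic loss the gap should shrink quadratically as the meta-train learning rate decreases.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(theta, batch):
    """Squared-error loss of a linear model theta . x on a batch of (x, y)."""
    return sum(0.5 * (dot(theta, x) - y) ** 2 for x, y in batch) / len(batch)

def grad(theta, batch):
    g = [0.0] * len(theta)
    for x, y in batch:
        err = dot(theta, x) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / len(batch)
    return g

def exact_objective(theta, batch_s, batch_t, alpha):
    g_s = grad(theta, batch_s)
    theta_prime = [t - alpha * g for t, g in zip(theta, g_s)]
    return loss(theta, batch_s) + loss(theta_prime, batch_t)

def first_order_objective(theta, batch_s, batch_t, alpha):
    g_s, g_t = grad(theta, batch_s), grad(theta, batch_t)
    # Sum of both losses minus alpha times the gradient dot product.
    return loss(theta, batch_s) + loss(theta, batch_t) - alpha * dot(g_s, g_t)

# Hypothetical source and target mini-batches of (features, label) pairs.
batch_s = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)]
batch_t = [((0.0, 1.0), 2.0), ((2.0, 1.0), 4.0)]
theta = [0.5, -0.5]
gaps = [abs(exact_objective(theta, batch_s, batch_t, a)
            - first_order_objective(theta, batch_s, batch_t, a))
        for a in (0.1, 0.01)]
```

Here `gaps` holds the approximation error for the two learning rates; a tenfold smaller learning rate gives a roughly hundredfold smaller error on this toy problem.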

First-Order Approximation
The final update in Step 7 of Algorithm 1 requires second-order derivatives, which may be problematic, inefficient or unstable with certain classes of models (Mensch and Blondel, 2018). Hence, we propose an approximation that only requires computing first-order derivatives.
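First, the gradient of the objective in Equation (3) can be computed as:
$$\begin{aligned} \nabla_\theta \mathcal{L}_\tau(\theta) &= \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) + \nabla_\theta \theta' \, \nabla_{\theta'} \mathcal{L}_{\mathcal{B}_t}(\theta') \\ &= \nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) + \big(I - \alpha \nabla_\theta^2 \mathcal{L}_{\mathcal{B}_s}(\theta)\big) \nabla_{\theta'} \mathcal{L}_{\mathcal{B}_t}(\theta') \end{aligned} \tag{5}$$
where $I$ is an identity matrix and $\nabla_\theta^2 \mathcal{L}_{\mathcal{B}_s}(\theta)$ is the Hessian of $\mathcal{L}_{\mathcal{B}_s}$ at $\theta$. We consider the alternative of ignoring this second-order term and simply assume that $\nabla_\theta \theta' = I$. In this variant, we simply combine gradients from source and target domains, i.e., $\nabla_\theta \mathcal{L}_{\mathcal{B}_s}(\theta) + \nabla_{\theta'} \mathcal{L}_{\mathcal{B}_t}(\theta')$. We show in the Appendix that this objective can still be viewed as maximizing the dot product of gradients from source and target domain.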
The resulting first-order training objective, which we refer to as DG-FMAML, is inspired by Reptile, a first-order meta-learning algorithm (Nichol et al., 2018) for few-shot learning. A two-step Reptile would compute SGD on the same batch twice while DG-FMAML computes SGD on two different batches, $\mathcal{B}_s$ and $\mathcal{B}_t$, once. To put it differently, DG-FMAML tries to encourage cross-domain generalization while Reptile encourages in-domain generalization.
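The difference between the full and the first-order gradient can be illustrated on a one-parameter toy model. In this sketch the data, the function names, and the learning rate are assumptions; the only point is that the first-order variant drops the Hessian factor and simply adds the source gradient at the current parameters to the target gradient at the updated parameters.

```python
def grad(theta, batch):
    """Gradient of the mean squared-error loss for the model y = theta * x."""
    return sum(x * (x * theta - y) for x, y in batch) / len(batch)

def hessian(batch):
    """Second derivative of the same loss (constant for a quadratic)."""
    return sum(x * x for x, _ in batch) / len(batch)

def meta_gradients(theta, batch_s, batch_t, alpha):
    g_s = grad(theta, batch_s)
    theta_prime = theta - alpha * g_s                     # meta-train step
    g_t = grad(theta_prime, batch_t)                      # target gradient at theta'
    full = g_s + (1.0 - alpha * hessian(batch_s)) * g_t   # second-order version
    first_order = g_s + g_t                               # Hessian term ignored
    return full, first_order

batch_s = [(1.0, 2.0), (2.0, 4.0)]   # hypothetical source mini-batch
batch_t = [(3.0, 6.0), (4.0, 8.0)]   # hypothetical target mini-batch
full, first_order = meta_gradients(0.0, batch_s, batch_t, alpha=0.1)
```

On this example both gradients point in the same direction and differ only in magnitude, which is consistent with the first-order variant being a cheaper approximation rather than a different objective.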

Semantic Parser
In general, DG-MAML is model-agnostic and can be coupled with any semantic parser to improve its domain generalization. In this work, we use a base parser that is based on RAT-SQL, which currently achieves state-of-the-art performance on Spider. Formally, RAT-SQL takes as input question $Q$ and schema $S$ of its corresponding database. Then it produces a program which is represented as an abstract syntax tree $T$ in the context-free grammar of SQL (Yin and Neubig, 2018). RAT-SQL adopts the encoder-decoder framework for text-to-SQL parsing. It has three components: an initial encoder, a transformer-based encoder and an LSTM-based decoder. The initial encoder provides initial representations, denoted as $Q_{\text{init}}$ and $S_{\text{init}}$ for the question and the schema, respectively. A relation-aware transformer (RAT) module then takes the initial representations and further computes context-aware representations $Q_{\text{enc}}$ and $S_{\text{enc}}$ for the question and the schema, respectively. Finally, a decoder generates a sequence of production rules that constitute the abstract syntax tree $T$ based on $Q_{\text{enc}}$ and $S_{\text{enc}}$. To obtain $Q_{\text{init}}$ and $S_{\text{init}}$, the initial encoder could either be 1) LSTMs (Hochreiter and Schmidhuber, 1997) on top of pre-trained word embeddings, like GloVe (Pennington et al., 2014), or 2) pre-trained contextual embeddings like BERT (Devlin et al., 2019). In our work, we will test the effectiveness of our method for both variants.
As shown in , the encodings $Q_{\text{enc}}$ and $S_{\text{enc}}$, which are the output of the RAT module, heavily rely on schema-linking features. These features are extracted from a heuristic function that links question words to columns and tables based on n-gram matching, and they are readily available in the conventional mono-lingual setting of the Spider dataset. However, we hypothesize that the parser's over-reliance on these features is specific to Spider, where annotators were shown the database schema and asked to formulate queries. As a result, they were prone to re-using terms from the schema verbatim in their questions. This would not be the case in a real-world application where users are unfamiliar with the structure of the underlying database and free to use arbitrary terms which would not necessarily match column or table names (Suhr et al., 2020). Hence, we will also evaluate our parser in the cross-lingual setting where $Q$ and $S$ are not in the same language, and such features would not be available.

Experiments
To evaluate DG-MAML, we integrate it with a base parser and test it on zero-shot text-to-SQL tasks. By designing an in-domain benchmark, we also show that the out-of-domain improvement does not come at the cost of in-domain performance. We also present some analysis to show how DG-MAML affects domain generalization.

Datasets and Metrics
We evaluate DG-MAML on two zero-shot text-to-SQL benchmarks, namely, (English) Spider (Yu et al., 2018b) and Chinese Spider (Min et al., 2019). Chinese Spider is a Chinese version of Spider that translates all NL questions from English to Chinese and keeps the original English database. It introduces the additional challenge of encoding cross-lingual correspondences between Chinese and English. 3 In both datasets, we report exact set match accuracy, following Yu et al. (2018b). We also report execution accuracy in the Spider dataset.

Baselines
Two kinds of features are widely used in recent semantic parsers to boost domain generalization: schema-linking features (as mentioned in Section 4) and pre-trained embeddings such as BERT. To show that our method can still achieve additional improvements, we compare with strong baselines that are integrated with schema-linking features and pre-trained embeddings. In the analysis (Section 5.6), we will also show the effect of our method when both features are absent in the base parsers.
3 Please see the appendix for details of the datasets.

Implementation and Hyperparameters
Our base parser is based on RAT-SQL, which is implemented in PyTorch (Paszke et al., 2019). For English questions and schemas, we use GloVe (Pennington et al., 2014) and BERT-base (Devlin et al., 2019) as the pre-trained embeddings for encoding. For Chinese questions, we use Tencent embeddings and Multilingual-BERT (Devlin et al., 2019). In all experiments, we use a batch size of $|\mathcal{B}_s| = |\mathcal{B}_t| = 12$ and train for up to 20,000 steps. See the Appendix for details on other hyperparameters.

Main Results
Our main results on Spider and Chinese Spider are listed in Tables 1 and 2, respectively.
Non-BERT Models DG-MAML boosts the performance of non-BERT base parsers on Spider and Chinese Spider by 2.1% and 4.5% respectively, showing its effectiveness in promoting domain generalization. In comparison, the performance margin for DG-MAML is more significant in the cross-lingual setting of Chinese Spider. This is presumably due to the fact that heuristic schema-linking features, which help promote domain generalization for Spider, are not applicable in Chinese Spider. We will present more analysis on this in Section 5.6.
BERT Models Most importantly, improvements on both datasets are not cancelled out when the base parsers are augmented with pre-trained representations. On Spider, the improvements brought by DG-MAML remain roughly the same when the base parser is integrated with BERT-base. As a result, our base parser augmented with BERT-base and DG-MAML achieves the best execution accuracy compared with previous models. On Chinese Spider, DG-MAML helps the base parser with multilingual BERT achieve a substantial improvement. Overall, DG-MAML consistently boosts the performance of the base parser, and is complementary to using pre-trained representations.

Model                                  Dev    Test
Our Models
Base Parser + BERT-base                66.8   64.1
Base Parser + BERT-base + DG-MAML      69.3   66.1
Table 1: Accuracy (%) on the development and test sets of Spider. The first half shows set match accuracy for both non-BERT and BERT models; the second half shows execution accuracy of BERT models. Due to the number of model submissions constraint enforced by the Spider team, we only evaluate our BERT models on the test set for now. The number with ♦ is produced by running the code of .

In-Domain vs. Out-of-Domain
To confirm that the base parser struggles when applied out-of-domain, we construct an in-domain setting and measure the gap in performance. This setting also helps us address a natural question: does using DG-MAML hurt in-domain performance? This would not have been surprising as the parser is explicitly optimized towards better performance on unseen target domains.
To answer these questions, we create a new split of Spider. Specifically, for each database from the training and development set of Spider, we include 80% of its question-SQL pairs in the new training set and assign the remaining 20% to the new test set. As a result, the new split consists of 7702 training examples and 1991 test examples. When using this split, the parser is tested on databases that all have been seen during training. We evaluate the non-BERT parsers with the same metric of set match for evaluation.
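The in-domain split described above can be sketched as a per-database partition. This is an illustrative sketch only: the helper name `in_domain_split`, the random shuffling within each database, and the toy examples are assumptions, since the text specifies just the 80%/20% proportion per database.

```python
import random
from collections import defaultdict

def in_domain_split(examples, train_fraction=0.8, seed=0):
    """Split (db_id, question, sql) triples so each database is in both parts."""
    rng = random.Random(seed)
    by_db = defaultdict(list)
    for ex in examples:
        by_db[ex[0]].append(ex)
    train, test = [], []
    for db_id in sorted(by_db):
        items = by_db[db_id]
        rng.shuffle(items)                       # assumed: random assignment
        cut = int(round(train_fraction * len(items)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Hypothetical toy data: 3 databases with 10 question-SQL pairs each.
examples = [(f"db_{d}", f"question {i}", f"SELECT {i}")
            for d in range(3) for i in range(10)]
train, test = in_domain_split(examples)
```

Because the partition is done inside each database, every test database has also been seen in training, which is what makes the setting in-domain.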
Does the parser struggle out-of-domain? As in-domain and out-of-domain setting have differ-  Somewhat surprisingly, we instead observe a modest improvement (+1.1%) over the base parser. This suggests that DG-MAML, despite optimizing the model towards domain generalization, captures, to a certain degree, a more general notion of generalization or robustness, which appears beneficial even in the in-domain setting.

Additional Experiments and Analysis
We first discuss additional experiments on linking features and DG-FMAML, and then present further analysis probing how DG-MAML works. As the test sets for both datasets are not publicly available, we will use the development sets.
Linking Features As mentioned in Section 2, previous work addressed domain generalization by focusing on the sub-task of schema linking. For Spider, where questions and schemas are both in English,  leverage n-gram matching features which improve schema linking and significantly boost parsing performance. However, in Chinese Spider, it is neither easy nor obvious how to design such linking heuristics. Moreover, as pointed out by Suhr et al. (2020), the assumption  Specifically, we consider a variant of the base parser that does not use this feature, and train it with conventional supervised learning and with DG-MAML for Spider. As shown in Table 3, we confirm that those features have a big impact on the base parser. More importantly, in the absence of those features, DG-MAML boosts the performance of the base parser by a larger margin. This is consistent with the observation that DG-MAML is more beneficial for Chinese Spider than Spider, in the sense that the parser would need to rely more on DG-MAML when these heuristics are not integrated or not available for domain generalization.
Effect of DG-FMAML We investigate the effect of the first-order approximation in DG-FMAML to see if it would provide a reasonable performance compared with DG-MAML. We evaluate it on the development sets of the two datasets, see Table 3. DG-FMAML consistently boosts the performance of the base parser, although it lags behind DG-MAML. For a fair comparison, we use the same batch size for DG-MAML and DG-FMAML. However, because DG-FMAML uses less memory, it could potentially benefit from a larger batch size. In practice, DG-FMAML is about twice as fast to train as DG-MAML; see the Appendix for details.
Probing Domain Generalization Schema linking has been the focus of previous work on zero-shot semantic parsing. We take the opposite direction and use this task to probe the parser to see if it, at least to a certain degree, achieves domain generalization due to improving schema linking. We hypothesize that improving linking is the mechanism which prevents the parser from being trapped in overfitting the source domains.
We propose to use 'relevant column recognition' as a probing task. Specifically, relevant columns refer to the columns that are mentioned in SQL queries. For example, the SQL query "Select Status, avg(Population) From City Group By Status" in Figure 1 contains two relevant columns: 'Status' and 'Population'. We formalize this task as a binary classification problem. Given an NL question and a column from the corresponding database, a classifier should predict whether the column is mentioned in the gold SQL query. We hypothesize that representations from the DG-MAML parser will be more predictive of relevance than those of the baseline, and the probing classifier will detect this difference in the quality of the representations.
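The labels for this probing task can be derived from the gold SQL alone. The sketch below is a naive illustration, not the authors' procedure: it marks a column as relevant when its name appears as a token in the query, and the column list (including `City_ID` and `Area`) is hypothetical. It ignores aliases, table qualifiers and same-named columns in different tables.

```python
import re

def relevant_column_labels(sql, columns):
    """Return {column: 1 or 0}: 1 if the column name occurs as a token in the SQL."""
    tokens = {tok.lower() for tok in re.findall(r"[A-Za-z_][A-Za-z_0-9]*", sql)}
    return {col: int(col.lower() in tokens) for col in columns}

sql = "SELECT Status, AVG(Population) FROM City GROUP BY Status"
columns = ["City_ID", "Status", "Population", "Area"]  # hypothetical schema
labels = relevant_column_labels(sql, columns)
```

Each (question, column) pair then becomes one binary classification example for the probing classifier.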
We first obtain the representations for NL questions and schemas from the parsers and keep them fixed. The binary classifier is then trained based only on these representations. For classifier training we use the same split as the Spider dataset, i.e., the classifier is evaluated on unseen databases. Details of the classifier are provided in the Appendix. The results are shown in Table 4. The classifier trained on the parser with DG-MAML achieves better performance. This confirms our hypothesis that DG-MAML helps the parser learn better encodings of NL questions and database schemas and that this is one of the mechanisms the parsing model uses to ensure generalization.

Conclusions
The task of zero-shot semantic parsing has been gaining momentum in recent years. However, previous work has not proposed algorithms or objectives that explicitly promote domain generalization. We rely on the meta-learning framework to encourage domain generalization. Instead of learning from individual data points, DG-MAML learns from a set of virtual zero-shot parsing tasks. By optimizing towards better target-domain performance in each simulated task, DG-MAML encourages the parser to generalize better to unseen domains.
We conduct experiments on two zero-shot text-to-SQL parsing datasets. In both cases, using DG-MAML leads to a substantial boost in performance. Furthermore, we show that the faster first-order approximation DG-FMAML can also help a parser achieve better domain generalization.
hyperparameter sweep for non-BERT RAT-SQL, which partially explains why our non-BERT base parser is not as good as it in Spider. However, after the integration of BERT representations, our base parser slightly outperforms RAT-SQL, as shown in the main paper.
Preprocessing A major difference between our base parser and RAT-SQL  is the way of preprocessing. During preprocessing, input questions, column names and table names in schemas are tokenized and lemmatized by Stanza (Qi et al., 2020) which can handle both English and Chinese.
Learning Rates We use the learning rate of $\alpha = 5 \times 10^{-4}$ for meta-train. For the final update of parameters, we use Adam (Kingma and Ba, 2015) with the learning rate $6 \times 10^{-4}$. We manually search for the best meta-train learning rates from $1 \times 10^{-4}$ to $9 \times 10^{-4}$ with the step size $2 \times 10^{-4}$, based on performance on the development set. Other hyperparameters are not tuned. For the learning rate of final update (not $\alpha$ of meta-train), we use the same scheduler as . Specifically, during the first 500 steps, the learning rate linearly increases from 0 to $6 \times 10^{-4}$. Then, it is annealed to 0 with $6 \times 10^{-4} \left(1 - \frac{\text{step} - 500}{9500}\right)^{-0.5}$.
Hardware and Model Size Our non-BERT models are trained using NVIDIA GeForce RTX 2080, which has a memory size of 11GB. The base parser has around 10 million parameters, where around 1.5 million parameters are pre-trained embeddings that are mostly fixed during training. For BERT models, we first find the best hyperparameters using GeForce RTX 2080 with a small batch size; then we train them using V100 to save cost.
Average Runtime The average training times for the non-BERT base parser, DG-MAML and DG-FMAML are 10, 24 and 13 hours per run, respectively. For BERT models, the numbers are 36, 68 and 42 hours per run.

C.1 Loss Curve
In Figure 2, we show the loss curves of the models on the two datasets during training. In comparison, DG-MAML helps to reduce the gap between training and validation loss.

D Classifier for Probing Domain Generalization
The classifier takes the input of a pair of (column, question) and outputs a binary label indicating whether the column is relevant. As explained in the paper, we retrieve the representations of columns and questions from a pre-trained parser. We denote the representation of a column as $c \in \mathbb{R}^{k}$, and the representation of a question as $q \in \mathbb{R}^{n \times k}$ where $n$ is the number of words in the question and $k$ is the size of encoding.
For each pair of $(c, q)$, we first align the column $c$ softly with the question $q$ using an attention function, and obtain an aligned representation $t$ for the column. Then we compute a score of relevance based on the aligned representation. Finally, a probability $p$ of relevance is computed through a sigmoid function $\sigma$.
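One possible reading of this classifier is sketched below. The exact attention and scoring functions are not given here, so the sketch assumes dot-product attention with a softmax and a linear scoring layer; the names `relevance_probability`, `w` and `b` and all numeric values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def relevance_probability(c, q, w, b):
    """c: column vector (k), q: question word vectors (n x k), w: (k), b: scalar."""
    weights = softmax([dot(c, q_i) for q_i in q])        # soft alignment (assumed dot-product attention)
    t = [sum(a * q_i[j] for a, q_i in zip(weights, q))   # aligned representation of the column
         for j in range(len(c))]
    score = dot(w, t) + b                                # relevance score (assumed linear layer)
    return 1.0 / (1.0 + math.exp(-score))                # sigmoid

c = [1.0, 0.0]                                # hypothetical column encoding
q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # hypothetical question encodings
p = relevance_probability(c, q, w=[1.0, -1.0], b=0.0)
```

Only `w` and `b` would be trained in such a probe, since the encodings `c` and `q` are kept fixed.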