Cross-domain Semantic Parsing via Paraphrasing

Existing studies on semantic parsing mainly focus on the in-domain setting. We formulate cross-domain semantic parsing as a domain adaptation problem: train a semantic parser on some source domains and then adapt it to the target domain. Due to the diversity of logical forms in different domains, this problem presents unique and intriguing challenges. By converting logical forms into canonical utterances in natural language, we reduce semantic parsing to paraphrasing, and develop an attentive sequence-to-sequence paraphrase model that is general and flexible to adapt to different domains. We discover two problems, small micro variance and large macro variance, of pre-trained word embeddings that hinder their direct use in neural networks, and propose standardization techniques as a remedy. On the popular Overnight dataset, which contains eight domains, we show that both cross-domain training and standardized pre-trained word embeddings can bring significant improvement.


Introduction
Semantic parsing, which maps natural language utterances into computer-understandable logical forms, has drawn substantial attention recently as a promising direction for developing natural language interfaces to computers. Semantic parsing has been applied in many domains, including querying data/knowledge bases (Woods, 1973;Zelle and Mooney, 1996;Berant et al., 2013), controlling IoT devices (Campagna et al., 2017), and communicating with robots (Chen and Mooney, 2011;Tellex et al., 2011;Bisk et al., 2016).
Despite the wide applications, studies on semantic parsing have mainly focused on the indomain setting, where both training and testing data are drawn from the same domain. How to build semantic parsers that can learn across domains remains an under-addressed problem. In this work, we study cross-domain semantic parsing. We model it as a domain adaptation problem (Daumé III and Marcu, 2006), where we are given some source domains and a target domain, and the core task is to adapt a semantic parser trained on the source domains to the target domain ( Figure 1). The benefits are two-fold: (1) by training on the source domains, the cost of collecting training data for the target domain can be reduced, and (2) the data of source domains may provide information complementary to the data collected for the target domain, leading to better performance on the target domain. This is a very challenging task. Traditional domain adaptation (Daumé III and Marcu, 2006;Blitzer et al., 2006) only concerns natural languages, while semantic parsing concerns both natural and formal languages. Different domains often involve different predicates. In Figure 1, from the source BASKETBALL domain a semantic parser can learn the semantic mapping from natural language to predicates like team and season, but in the target SOCIAL domain it needs to handle predicates like employer instead. Worse still, even for the same predicate, it is legitimate to use arbitrarily different predicate symbols, e.g., other symbols like hired by or even predicate1 can also be used for the employer predicate, reminiscent of the symbol grounding problem (Harnad, 1990). Therefore, directly transferring the mapping from natural language to predicate symbols learned from source domains to the target domain may not be much beneficial. Inspired by the recent success of paraphrasing based semantic parsing (Berant and Liang, 2014;Wang et al., 2015), we propose to use natural language as an intermediate representation for crossdomain semantic parsing. As shown in Figure 1, logical forms are converted into canonical utterances in natural language, and semantic parsing is reduced to paraphrasing. It is the knowledge of paraphrasing, at lexical, syntactic, and semantic levels, that will be transferred across domains.
Still, adapting a paraphrase model to a new domain is a challenging and under-addressed problem. To give some idea of the difficulty, for each of the eight domains in the popular OVERNIGHT (Wang et al., 2015) dataset, 30% to 55% of the words never occur in any of the other domains, a similar problem observed in domain adaptation for machine translation (Daumé III and Jagarlamudi, 2011). The paraphrase model therefore can get little knowledge for a substantial portion of the target domain from the source domains. We introduce pre-trained word embeddings such as WORD2VEC (Mikolov et al., 2013) to combat the vocabulary variety across domains. Based on recent studies on neural network initialization, we conduct a statistical analysis of pre-trained word embeddings and discover two problems that may hinder their direct use in neural networks: small micro variance, which hurts optimization, and large macro variance, which hurts generalization. We propose to standardize pre-trained word embeddings, and show its advantages both analytically and experimentally.
On the OVERNIGHT dataset, we show that crossdomain training under the proposed framework can significantly improve model performance. We also show that, compared with directly using pretrained word embeddings or normalization as in previous work, the proposed standardization technique can lead to about 10% absolute improvement in accuracy.

Problem Definition
Unless otherwise stated, we will use u to denote input utterance, c for canonical utterance, and z for logical form. We denote U as the set of all possible utterances. For a domain, suppose Z is the set of logical forms, a semantic parser is a mapping f : U → Z that maps every input utterance to a logical form (a null logical form can be included in Z to reject out-of-domain utterances).
In cross-domain semantic parsing, we assume there are a set of K source domains It is in principle advantageous to model the source domains separately (Daumé III and Marcu, 2006), which retains the possibility of separating domaingeneral information from domain-specific information, and only transferring the former to the target domain. For simplicity, here we merge the source domains into a single domain Z s with train- The task is to learn a semantic parser f : U → Z t for a target domain Z t , for which we have a set of training examples . Some characteristics can be summarized as follows: • Z t and Z s can be totally disjoint.
• The input utterance distribution of the source and the target domains can be independent and differ remarkably.
In the most general and challenging case, Z t and Z s can be defined using different formal languages. Because of the lack of relevant datasets, here we restrain ourselves to the case where Z t and Z s are defined using the same formal language, e.g., λ-DCS (Liang, 2013) as in the OVERNIGHT dataset.

Framework
Our framework follows the research line of semantic parsing via paraphrasing (Berant and Liang, 2014;Wang et al., 2015). While previous work focuses on the in-domain setting, we discuss its applicability and advantages in the cross-domain setting, and develop techniques to address the emerging challenges in the new setting.
Canonical utterance. We assume a one-to-one mapping g : Z → C, where C ⊂ U is the set of canonical utterances. In other words, every logical form will be converted into a unique canonical utterance deterministically ( Figure 1). Previous work (Wang et al., 2015) has demonstrated how to design such a mapping, where a domaingeneral grammar and a domain-specific lexicon are constructed to automatically convert every logical form to a canonical utterance. In this work, we assume the mapping is given 1 , and focus on the subsequent paraphrasing and domain adaptation problems.
This design choice is worth some discussion. The grammar, or at least the lexicon for mapping predicates to natural language, needs to be provided by domain administrators. This indeed brings an additional cost, but we believe it is reasonable and even necessary for three reasons: (1) Only domain administrators know the predicate semantics the best, so it has to be them to reveal that by grounding the predicates to natural language (the symbol grounding problem (Harnad, 1990)). (2) Otherwise, predicate semantics can only be learned from supervised training data of each domain, bringing a significant cost on data collection. (3) Canonical utterances are understandable by average users, and thus can also be used for training data collection via crowdsourcing (Wang et al., 2015;Su et al., 2016), which can amortize the cost. Take comparatives as an example. In logical forms, comparatives can be legitimately defined using arbitrarily different predicates in different domains, e.g., <, smallerInSize, or even predicates with an ambiguous surface form, like lt. When converting logical form to canonical utterance, however, domain administrators have to choose common natural language expressions like "less than" and "smaller", providing a shared ground for cross-domain semantic parsing.
Paraphrase model. In the previous work based on paraphrasing (Berant and Liang, 2014;Wang et al., 2015), semantic parsers are implemented as log-linear models with hand-engineered domainspecific features (including paraphrase features). Considering the recent success of representation learning for domain adaptation (Glorot et al., 2011;Chen et al., 2012), we propose a paraphrase model based on the sequence-to-sequence (Seq2Seq) model (Sutskever et al., 2014), which can be trained end to end without feature engineering. We show that it outperforms the previous loglinear models by a large margin in the in-domain setting, and can easily adapt to new domains.
Pre-trained word embeddings. An advantage of reducing semantic parsing to paraphrasing is that external language resources become easier to incorporate. Observing the vocabulary variety across domains, we introduce pre-trained word embeddings to facilitate domain adaptation. For the example in Figure 1, the paraphrase model may have learned the mapping from "play for" to "whose team is" in a source domain. By acquiring word similarities ("play"-"work" and "team"-"employer") from pre-trained word embeddings, it can establish the mapping from "work for" to "whose employer is" in the target domain, even without in-domain training data. We analyze statistical characteristics of the pre-trained word embeddings, and propose standardization techniques to remedy some undesired characteristics that may bring a negative effect to neural models.
Domain adaptation protocol. We will use the following protocol: (1) train a paraphrase model using the data of the source domain, (2) use the learned parameters to initialize a model in the target domain, and (3) fine-tune the model using the training data of the target domain.

Prior Work
While most studies on semantic parsing so far have focused on the in-domain setting, there are a number of studies of particular relevance to this work. In the recent efforts of scaling semantic parsing to large knowledge bases like Freebase (Bollacker et al., 2008), researchers have explored several ways to infer the semantics of knowledge base relations unseen in training, which are often based on at least one (often both) of the following assumptions: (1) Distant supervision. Freebase entities can be linked to external text corpora, and serve as anchors for seeking semantics of Freebase relations from text. For example, Cai and Alexander (2013), among others (Berant et al., 2013;Xu et al., 2016), use sentences from Wikipedia that contain any entity pair of a Freebase relation as the support set of the relation.
(2) Self-explaining predicate symbols. Most Freebase relations are described using a carefully chosen symbol (surface form), e.g., place of birth, which provides strong cues for their semantics. For example, Yih et al. (2015) directly compute the similarity of input utterance and the surface form of Freebase relations via a convolutional neural network. Kwiatkowski et al. (2013) also extract lexical features from input utterance and the surface form of entities and relations. They have actually evaluated their model on Freebase subdomains not covered in training, and have shown impressive results. However, in the more general setting of cross-domain semantic parsing, we may have neither of these luxuries. Distant supervision may not be available (e.g., IoT devices involving no entities but actions), and predicate symbols may not provide enough cues (e.g., predicate1). In this case, seeking additional inputs from domain administrators is probably necessary.
In parallel of this work, Herzig and Berant (2017) have explored another direction of semantic parsing with multiple domains, where they use all the domains to train a single semantic parser, and attach a domain-specific encoding to the training data of each domain to help the semantic parser differentiate between domains. We pursue a different direction: we train a semantic parser on some source domains and adapt it to the target domain. Another difference is that their work directly maps utterances to logical forms, while ours is based on paraphrasing.
Cross-domain semantic parsing can be seen as a way to reduce the cost of training data collection, which resonates with the recent trend in semantic parsing. Berant et al. (2013) propose to learn from utterance-denotation pairs instead of utterance-logical form pairs, while Wang et al. (2015) and Su et al. (2016) manage to employ crowd workers with no linguistic expertise for data collection. Jia and Liang (2016) propose an interesting form of data augmentation. They learn a grammar from existing training data, and generate new examples from the grammar by recombining segments from different examples. We use natural language as an intermediate representation to transfer knowledge across domains, and assume the mapping from the intermediate representation (canonical utterance) to logical form can be done deterministically. Several other intermediate representations have also been used, such as combinatory categorial grammar (Kwiatkowski et al., 2013;Reddy et al., 2014), dependency tree (Reddy et al., , 2017, and semantic role structure (Goldwasser and Roth, 2013). But their main aim is to better represent input utterances with a richer structure. A separate ontology matching step is needed to map the intermediate representation to logical form, which requires domain-dependent training.
A number of other related studies have also used paraphrasing. For example, Fader et al. (2013) leverage question paraphrases to for question answering, while Narayan et al. (2016) generate paraphrases as a way of data augmentation.
Cross-domain semantic parsing can greatly benefit from the rich literature of domain adaptation and transfer learning (Daumé III and Marcu, 2006;Blitzer et al., 2006;Pan and Yang, 2010;Glorot et al., 2011). For example, Chelba and Acero (2004) use parameters trained in the source domain as prior to regularize parameters in the target domain. The feature augmentation technique from Daumé III (2009) can be very helpful when there are multiple source domains. We expect to see many of these ideas to be applied in the future.

Paraphrase Model
In this section we propose a paraphrase model based on the Seq2Seq model (Sutskever et al., 2014) with soft attention. Similar models have been used in semantic parsing (Jia and Liang, 2016;Dong and Lapata, 2016) but for directly mapping utterances to logical forms. We demon-strate that it can also be used as a paraphrase model for semantic parsing. Several other neural models have been proposed for paraphrasing (Socher et al., 2011;Hu et al., 2014;Yin and Schütze, 2015), but it is not the focus of this work to compare all the alternatives.
For an input utterance u = (u 1 , u 2 , . . . , u m ) and an output canonical utterance c = (c 1 , c 2 , . . . , c n ), the model estimates the conditional probability p(c|u) = n j=1 p(c j |u, c 1:j−1 ). The tokens are first converted into vectors via a word embedding layer φ. The initialization of the word embedding layer is critical for domain adaptation, which we will further discuss in Section 4.
The encoder, which is implemented as a bi-directional recurrent neural network (RNN), first encodes u into a sequence of state vectors (h 1 , h 2 , . . . , h m ). The state vectors of the forward RNN and the backward RNN are respectively computed as: where gated recurrent unit (GRU) as defined in (Cho et al., 2014) is used as the recurrence. We then concatenate the forward and backward state vectors, . . , m. We use an attentive RNN as the decoder, which will generate the output tokens one at a time. We denote the state vectors of the decoder RNN as (d 1 , d 2 , . . . , d n ). The attention takes a form similar to (Vinyals et al., 2015). For the decoding step j, the decoder is defined as follows: where W 0 , W 1 , W 2 , v and U are model parameters. The decoder first calculates normalized attention weights α ji over encoder states, and get a summary state h j . The summary state is then used to calculate the next decoder state d j+1 and the output probability distribution p(c j |u, c 1:j−1 ).

Training. Given a set of training examples
, we minimize the cross-entropy loss − 1 N N i=1 log p(c i |u i ), which maximizes the log probability of the correct canonical utterances. We apply dropout (Hinton et al., 2012) on both input and output of the GRU cells to prevent overfitting.
Testing. Given a domain {Z, C}, there are two ways to use a trained model. One is to use it to generate the most likely output utterance u given an input utterance u (Sutskever et al., 2014), In this case u can be any utterance permissable by the output vocabulary, and may not necessarily be a legitimate canonical utterance in C. This is more suitable for large domains with a lot of logical forms, like Freebase. An alternative way is to use the model to rank the legitimate canonical utterances (Kannan et al., 2016): which is more suitable for small domains having a limited number of logical forms, like the ones in the OVERNIGHT dataset. We will adopt the second strategy. It is also very challenging; random guessing leads to almost no success. It is also possible to first find a smaller set of candidates to rank via beam search (Berant et al., 2013;Wang et al., 2015).

Pre-trained Word Embedding for Domain Adaptation
Pre-trained word embeddings like WORD2VEC have a great potential to combat the vocabulary variety across domains. For example, we can use pre-trained WORD2VEC vectors to initialize the word embedding layer of the source domain, with the hope that the other parameters in the model will co-adapt with the word vectors during training in the source domain, and generalize better to the out-of-vocabulary words (but covered by WORD2VEC) in the target domain. However, deep neural networks are very sensitive to initialization (Erhan et al., 2010), and a statistical analysis of the pre-trained WORD2VEC vectors reveals some characteristics that may not be desired for initializing deep neural networks. In this section we present the analysis and propose a standardization technique to remedy the undesired characteristics.  Analysis. Our analysis will be based on the 300dimensional WORD2VEC vectors trained on the 100B-word Google News corpus 2 . It contains 3 million words, leading to a 3M-by-300 word embedding matrix. The "rule of thumb" to randomly initialize word embedding in neural networks is to sample from a uniform or Gaussian distribution with unit variance, which works well for a wide range of neural network models in general. We therefore use it as a reference to compare different word embedding initialization strategies. Given a word embedding matrix, we compute the L2 norm of each row and report the mean and the standard deviation. Similarly, we also report the variance of each row (denoted as micro variance), which indicates how far the numbers in the row spread out, and pair-wise cosine similarity, which indicates the word similarity captured by WORD2VEC. The statistics of the word embedding matrix with different initialization strategies are shown in Table 1. Compared with random initialization, two characteristics of the WORD2VEC vectors stand out: (1) Small micro variance. Both the L2 norm and the micro variance of the WORD2VEC vectors are much smaller. (2) Large macro variance. The variance of different WORD2VEC vectors, reflected by the standard deviation of L2 norm, is much larger (e.g., the maximum and the minimum L2 norm are 21.1 and 0.015, respectively). Small micro variance can make the variance of neuron activations starts off too small 3 , implying a poor starting point in the parameter space. On the other hand, because of the magnitude difference, large macro variance may make a model hard to gener-alize to words unseen in training.
Standardization. Based on the above analysis, we propose to do unit variance standardization (standardization for short) on pre-trained word embeddings. There are two possible ways, perexample standardization, which standardizes each row of the embedding matrix to unit variance by simply dividing by the standard deviation of the row, and per-feature standardization, which standardizes each column instead. We do not make the rows or columns zero mean. Per-example standardization enjoys the goodness of both random initialization and pre-trained word embeddings: it fixes the small micro variance problem as well as the large macro variance problem of pre-trained word embeddings, while still preserving cosine similarity, i.e., word similarity. Perfeature standardization does not preserve cosine similarity, nor does it fix the large macro variance problem. However, it enjoys the benefit of global statistics, in contrast to the local statistics of individual word vectors used in per-example standardization. Therefore, in problems where the testing and training vocabularies are similar, per-feature standardization may be more advantageous. Both standardizations lose vector magnitude information. Levy et al. (2015) have suggested per-example normalization 4 of pre-trained word embeddings for lexical tasks like word similarity and analogy, which do no involve deep neural networks. Making the word vectors unit length alleviates the large macro variance problem, but the small micro variance problem remains (Table 1).
Discussion. This is indeed a pretty simple trick, and per-feature standardization (with zero mean) is also a standard data preprocessing method. However, it is not self-evident that this kind of standardization shall be applied on pre-trained word embeddings before using them in deep neural networks, especially with the obvious downside of rendering the word embedding algorithm's loss function sub-optimal.
We expect this to be less of a issue for largescale problems with a large vocabulary and abundant training examples. For example, Vinyals et al. (2015) have found that directly using the WORD2VEC vectors for initialization can bring a consistent, though small, improvement in neural constituency parsing. However, for smaller-scale problems (e.g., an application domain of semantic parsing can have a vocabulary size of only a few hundreds), this issue becomes more critical. Initialized with the raw pre-trained vectors, a model may quickly fall into a poor local optimum and may not have enough signal to escape. Because of the large macro variance problem, standardization can be critical for domain adaptation, which needs to generalize to many words unseen in training.
The proposed standardization technique appears in a similar spirit to batch normalization (Ioffe and Szegedy, 2015). We notice two computational differences, that ours is applied on the inputs while batch normalization is applied on internal neuron activations, and that ours standardizes the whole word embedding matrix beforehand while batch normalization standardizes each mini-batch on the fly. In terms of motivation, the proposed technique aims to remedy some undesired characteristics of pre-trained word embeddings, and batch normalization aims to reduce the internal covariate shift. It is of interest to study the combination of the two in future work.

Data Analysis
The OVERNIGHT dataset (Wang et al., 2015) contains 8 different domains. Each domain is based on a separate knowledge base, with logical forms written in λ-DCS (Liang, 2013). Logical forms are converted into canonical utterances via a simple grammar, and the input utterances are collected by asking crowd workers to paraphrase the canonical utterances. Different domains are designed to stress different types of linguistic phenomena. For example, the CALENDAR domain requires a semantic parser to handle temporal language like "meetings that start after 10 am", while the BLOCKS domain features spatial language like "which block is above block 1".
Vocabularies vary remarkably across domains (Table 2). For each domain, only 45% to 70% of the words are covered by any of the other 7 domains. A model has to learn the out-of-vocabulary words from scratch using in-domain training data. The pre-trained WORD2VEC embedding covers most of the words of each domain, and thus can connect the domains to facilitate domain adaptation.
Words that are still missing are mainly stop words and typos, e.g., "ealiest".

Experiment Setup
We compare our model with all the previous methods evaluated on the OVERNIGHT dataset. Wang et al. (2015) use a log-linear model with a rich set of features, including paraphrase features derived from PPDB (Ganitkevitch et al., 2013), to rank logical forms. Xiao et al. (2016) use a multi-layer perceptron to encode the unigrams and bigrams of the input utterance, and then use a RNN to predict the derivation sequence of a logical form under a grammar. Similar to ours, Jia and Liang (2016) also use a Seq2Seq model with bi-directional RNN encoder and attentive decoder, but it is used to predict linearized logical forms. They also propose a data augmentation technique, which further improves the average accuracy to 77.5%. But it is orthogonal to this work and can be incorporated in any model including ours, therefore not included.
The above methods are all based on the indomain setting, where a separate parser is trained for each domain. In parallel of this work, Herzig and Berant (2017) have explored another direction of cross-domain training: they use all of the domains to train a single parser, with a special domain encoding to help differentiate between domains. We instead model it as a domain adaptation problem, where training on the source and the target domains are separate. Their model is the same as Jia and Liang (2016). It is the current best-performing method on the OVERNIGHT dataset.
We use the standard 80%/20% split of training and testing, and randomly hold out 20% of training for validation. In cross-domain experiments, for each target domain, all the other domains are combined as the source domain. Hyper-parameters are selected based on the validation set. State size of both the encoder and the decoder are set to 100, and word embedding size is set to 300. Input and output dropout rate of the GRU cells are 0.7 and 0.5, respectively, and mini-batch size is 512. We use Adam with the default parameters suggested in the paper for optimization. We use gradient clipping with a cap for global norm at 5.0 to alleviate the exploding gradients problem of recurrent neural networks. Early stopping based on the validation set is used to decide when to stop training. The selected model is retrained using the whole training set (training + validation). The    (Abadi et al., 2016), and the code can be found at https://github.com/ ysu1989/CrossSemparse.

Comparison with Previous Methods
The main experiment results are shown in Table 3. Our base model (Random + I) achieves an accuracy comparable to the previous best in-domain model (Jia and Liang, 2016). With our main novelties, cross-domain training and word embedding standardization, our full model is able to outperform the previous best model, and achieve the best accuracy on 6 out of the 8 domains. Next we examine the novelties separately.

Word Embedding Initialization
The in-domain results clearly show the sensitivity of model performance to word embedding initialization. Directly using the raw WORD2VEC vectors or with per-example normalization, the performance is significantly worse than random initialization (6.2% and 7.3%, respectively). Based on the previous analysis, however, one should not be too surprised. The small micro variance problem hurts optimization. In sharp contrast, both of the proposed standardization techniques lead to better in-domain performance than random initialization (1.4% and 2.5%, respectively), setting a new best in-domain accuracy (78.2%) on OVERNIGHT. The results show that the pre-trained WORD2VEC vectors can indeed provide useful information, but only when they are properly standardized.

Cross-domain Training
A consistent improvement from cross-domain training is observed across all word embedding initialization strategies. Even for raw WORD2VEC embedding or per-example normalization, crossdomain training helps the model escape the poor initialization, though still inferior to the alternative initializations. The best results are again obtained with standardization, with per-example standardization bringing a slightly larger improvement than per-feature standardization. We observe that the improvement from cross-domain training is correlated with the abundance of the in-domain training data of the target domain. To further examine this observation, we use the ratio between the number of examples (N ) and the vocabulary size (|V|) to indicate the data abundance of a domain (the higher, the more abundant), and compute the Pearson correlation coefficient between data abundance and accuracy improvement from cross-domain training (X−I). The results in Ta-  ble 4 show a consistent, moderate to strong negative correlation between the two variables. In other words, cross-domain training is more beneficial when in-domain training data is less abundant, which is reasonable because in that case the model can learn more from the source domain data that is missing in the training data of the target domain.

Using Downsampled Training Data
Compared with the vocabulary size and the number of logical forms, the in-domain training data in the OVERNIGHT dataset is indeed abundant. In cross-domain semantic parsing, we are more interested in the scenario where there is insufficient training data for the target domain. To emulate this scenario, we downsample the in-domain training data of each target domain, but still use all training data from the source domain (thus N t N s ). The results are shown in Figure 2. The gain of cross-domain training is most significant when indomain training data is scarce. As we collect more in-domain training data, the gain becomes smaller, which is expected. These results reinforce those from Table 4. It is worth noting that the effect of downsampling varies across domains. For domains with quite abundant training data like SO-CIAL, using only 30% of the in-domain training data, the model can achieve an accuracy almost as good as when using all the data.

Discussion
Scalability, including vertical scalability, i.e., how to scale up to handle more complex inputs and logical constructs, and horizontal scalability, i.e., how to scale out to handle more domains, is one of the most critical challenges semantic parsing is facing today. In this work, we took an early step towards horizontal scalability, and proposed a paraphrasing based framework for cross-domain semantic parsing. With a sequence-to-sequence paraphrase model, we showed that cross-domain training of semantic parsing can be quite effective under a domain adaptation setting. We also studied how to properly standardize pre-trained word embeddings in neural networks, especially for domain adaptation.
This work opens up a number of future directions. As discussed in Section 2.3, many conventional domain adaptation and representation learning ideas can find application in cross-domain semantic parsing. In addition to pre-trained word embeddings, other language resources like paraphrase corpora (Ganitkevitch et al., 2013) can be incorporated into the paraphrase model to further facilitate domain adaptation. In this work we require a full mapping from logical form to canonical utterance, which could be costly for large domains. It is of practical interest to study the case where only a lexicon for mapping schema items to natural language is available. We have restrained ourselves to the case where domains are defined using the same formal language, and we look forward to evaluating the framework on domains of different formal languages when such datasets with canonical utterances become available.