StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing

Semantic parsing is the task of transducing natural language (NL) utterances into formal meaning representations (MRs), commonly represented as tree structures. Annotating NL utterances with their corresponding MRs is expensive and time-consuming, and thus the limited availability of labeled data often becomes the bottleneck of data-driven, supervised models. We introduce StructVAE, a variational auto-encoding model for semi-supervised semantic parsing, which learns both from limited amounts of parallel data, and readily-available unlabeled NL utterances. StructVAE models latent MRs not observed in the unlabeled data as tree-structured latent variables. Experiments on semantic parsing on the ATIS domain and Python code generation show that with extra unlabeled data, StructVAE outperforms strong supervised models.


Introduction
Semantic parsing tackles the task of mapping natural language (NL) utterances into structured formal meaning representations (MRs). This includes parsing to general-purpose logical forms such as λ-calculus (Zettlemoyer and Collins, 2005, 2007) and the abstract meaning representation (AMR; Banarescu et al., 2013; Misra and Artzi, 2016), as well as parsing to computer-executable programs to solve problems such as question answering (Berant et al., 2013; Yih et al., 2015; Liang et al., 2017), or generation of domain-specific (e.g., SQL) or general-purpose programming languages (e.g., Python) (Quirk et al., 2015; Yin and Neubig, 2017; Rabinovich et al., 2017).
Code available at http://pcyin.me/struct_vae
While these models have a long history (Zelle and Mooney, 1996; Tang and Mooney, 2001), recent advances are largely attributed to the success of neural network models (Xiao et al., 2016; Ling et al., 2016; Dong and Lapata, 2016; Iyer et al., 2017; Zhong et al., 2017). However, these models are also extremely data hungry: optimizing them requires large amounts of parallel training data of NL utterances and manually annotated MRs, the creation of which can be expensive, cumbersome, and time-consuming. The limited availability of parallel data has therefore become the bottleneck of existing, purely supervised models. These data requirements can be alleviated with weakly-supervised learning, where the denotations (e.g., answers in question answering) of MRs (e.g., logical form queries) are used as indirect supervision (Clarke et al., 2010; Liang et al., 2011; Berant et al., 2013, inter alia), or with data-augmentation techniques that automatically generate pseudo-parallel corpora using hand-crafted or induced grammars (Jia and Liang, 2016; Wang et al., 2015).
In this work, we focus on semi-supervised learning, aiming to learn from both limited amounts of parallel NL-MR corpora and unlabeled but readily-available NL utterances. We draw inspiration from recent success in applying variational auto-encoding (VAE) models to semi-supervised sequence-to-sequence learning (Miao and Blunsom, 2016; Kociský et al., 2016), and propose STRUCTVAE, a principled deep generative approach for semi-supervised learning with tree-structured latent variables (Fig. 1). STRUCTVAE is based on a generative story where the surface NL utterances are generated from tree-structured latent MRs following the standard VAE architecture: (1) an off-the-shelf semantic parser functions as the inference model, parsing an observed NL utterance into latent meaning representations (§ 3.2); (2) a reconstruction model decodes the latent MR into the original observed utterance (§ 3.1). This formulation enables our model to perform both standard supervised learning, by optimizing the inference model (i.e., the parser) using parallel corpora, and unsupervised learning, by maximizing the variational lower bound of the likelihood of the unlabeled utterances (§ 3.3).
In addition to these contributions to semi-supervised semantic parsing, STRUCTVAE contributes to generative model research as a whole, providing a recipe for training VAEs with structured latent variables. Such a structural latent space is in contrast to existing VAE research using flat representations, such as continuous distributed representations (Kingma and Welling, 2013), discrete symbols (Miao and Blunsom, 2016), or hybrids of the two (Zhou and Neubig, 2017).
We apply STRUCTVAE to semantic parsing on the ATIS domain and Python code generation. As an auxiliary contribution, we implement a transition-based semantic parser, which uses Abstract Syntax Trees (ASTs, § 3.2) as intermediate MRs and achieves strong results on the two tasks. We then apply this parser as the inference model for semi-supervised learning, and show that with extra unlabeled data, STRUCTVAE outperforms its supervised counterpart. We also demonstrate that STRUCTVAE is compatible with different structured latent representations, applying it to a simple sequence-to-sequence parser which uses λ-calculus logical forms as MRs.

Semi-supervised Semantic Parsing
In this section we introduce the objectives for semi-supervised semantic parsing, and present high-level intuition in applying VAEs for this task.

Supervised and Semi-supervised Training
Formally, semantic parsing is the task of mapping utterance x to a meaning representation z. As noted above, there are many varieties of MRs that can be represented as either graph structures (e.g., AMR) or tree structures (e.g., λ-calculus and ASTs for programming languages). In this work we specifically focus on tree-structured MRs (see Fig. 2 for a running example Python AST), although application of a similar framework to graph-structured representations is also feasible.
Traditionally, purely supervised semantic parsers train a probabilistic model $p_\phi(z|x)$ using parallel data L of NL utterances and annotated MRs (i.e., $L = \{\langle x, z \rangle\}$). As noted in the introduction, one major bottleneck in this approach is the lack of such parallel data. Hence, we turn to semi-supervised learning, where the model additionally has access to a relatively large amount of unlabeled NL utterances $U = \{x\}$. Semi-supervised learning then aims to maximize the log-likelihood of examples in both L and U:

$$J = \underbrace{\sum_{\langle x, z \rangle \in L} \log p_\phi(z|x)}_{\text{supervised obj. } J_s} + \alpha \underbrace{\sum_{x \in U} \log p(x)}_{\text{unsupervised obj. } J_u} \qquad (1)$$

The joint objective consists of two terms: (1) a supervised objective $J_s$ that maximizes the conditional likelihood of annotated MRs, as in standard supervised training of semantic parsers; and (2) an unsupervised objective $J_u$, which maximizes the marginal likelihood $p(x)$ of unlabeled NL utterances U, controlled by a tuning parameter α. Intuitively, if the modeling of $p_\phi(z|x)$ and $p(x)$ is coupled (e.g., they share parameters), then optimizing the marginal likelihood $p(x)$ using the unsupervised objective $J_u$ would help the learning of the semantic parser $p_\phi(z|x)$ (Zhu, 2005). STRUCTVAE uses the variational auto-encoding framework to jointly optimize $p_\phi(z|x)$ and $p(x)$, as outlined in § 2.2 and detailed in § 3.

VAEs for Semi-supervised Learning
From Eq. (1), our semi-supervised model must be able to calculate the probability $p(x)$ of unlabeled NL utterances. To model $p(x)$, we use VAEs, which provide a principled framework for generative models using neural networks (Kingma and Welling, 2013). As shown in Fig. 1, VAEs define a generative story (bold arrows in Fig. 1, explained in § 3.1) to model $p(x)$, where a latent MR z is sampled from a prior, and then passed to the reconstruction model to decode into the surface utterance x. There is also an inference model $q_\phi(z|x)$ that allows us to infer the most probable latent MR z given the input x (dashed arrows in Fig. 1, explained in § 3.2). In our case, the inference process is equivalent to the task of semantic parsing if we set $q_\phi(\cdot) \triangleq p_\phi(\cdot)$. VAEs also provide a framework to compute an approximation of $p(x)$ using the inference and reconstruction models, allowing us to effectively optimize the unsupervised and supervised objectives in Eq. (1) in a joint fashion (Kingma et al., 2014; explained in § 3.3).

Generative Story
STRUCTVAE follows the standard VAE architecture, and defines a generative story that explains how an NL utterance is generated: a latent meaning representation z is sampled from a prior distribution $p(z)$ over MRs, which encodes the latent semantics of the utterance. A reconstruction model $p_\theta(x|z)$ then decodes the sampled MR z into the observed NL utterance x. Both the prior $p(z)$ and the reconstruction model $p_\theta(x|z)$ take tree-structured MRs as inputs.
To model such inputs with rich internal structures, we follow Konstas et al. (2017), and instead model the distribution over a sequential surface representation of z, denoted $z^s$. Specifically, we have $p(z) \triangleq p(z^s)$ and $p_\theta(x|z) \triangleq p_\theta(x|z^s)$. For code generation, $z^s$ is simply the surface source code of the AST z. For semantic parsing, $z^s$ is the linearized s-expression of the logical form. Linearization allows us to use standard sequence-to-sequence networks to model $p(z)$ and $p_\theta(x|z)$. As we will explain in § 4.3, we find these two components perform well with linearization.
Specifically, the prior is parameterized by a Long Short-Term Memory (LSTM) language model over $z^s$. The reconstruction model is an attentional sequence-to-sequence network (Luong et al., 2015), augmented with a copying mechanism (Gu et al., 2016), allowing an out-of-vocabulary (OOV) entity in $z^s$ to be copied to x (e.g., the variable name my_list in Fig. 1 and its AST in Fig. 2). We refer readers to Appendix B for details of the neural network architecture.
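To make the relation between a tree-structured MR z and its surface form $z^s$ concrete, the following is a minimal sketch using only Python's standard library. It is an illustration, not the pipeline used in this work: the expression `sorted(my_list, reverse=True)` is an assumed example, `ast.unparse` (Python 3.9+) stands in for the astor package, and the helper name `linearize` and the choice of lexical tokens as sequence units are assumptions.

```python
import ast
import io
import tokenize


def linearize(tree: ast.AST) -> list[str]:
    """Turn an AST (the MR z) into a token sequence over its source code (z^s)."""
    surface = ast.unparse(tree)  # surface source code of the AST
    reader = io.StringIO(surface).readline
    # Keep only tokens with visible text (drops NEWLINE / ENDMARKER).
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.string.strip()]


tree = ast.parse("sorted(my_list, reverse=True)")  # assumed example MR
tokens = linearize(tree)
print(tokens)
```

A sequence model such as an LSTM language model could then be trained over token lists of this kind, which is the sense in which linearization lets standard sequence networks consume tree-structured MRs.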

Inference Model
STRUCTVAE models the semantic parser $p_\phi(z|x)$ as the inference model $q_\phi(z|x)$ in VAE (§ 2.2), which maps NL utterances x into tree-structured meaning representations z. $q_\phi(z|x)$ can be any trainable semantic parser, with the corresponding MRs forming the structured latent semantic space. In this work, we primarily use a semantic parser based on the Abstract Syntax Description Language (ASDL) framework (Wang et al., 1997) as the inference model. The parser encodes x into ASTs (Fig. 2). ASTs are the native meaning representation scheme of source code in modern programming languages, and can also be adapted to represent other semantic structures, like λ-calculus logical forms (see § 4.2 for details). We remark that STRUCTVAE works with other semantic parsers with different meaning representations as well (e.g., using λ-calculus logical forms for semantic parsing on ATIS, explained in § 4.3).
Our inference model is a transition-based parser inspired by recent work in neural semantic parsing and code generation. The transition system is an adaptation of Yin and Neubig (2017) (hereafter YN17), which decomposes the generation process of an AST into sequential applications of tree-construction actions following the ASDL grammar, thus ensuring the syntactic well-formedness of generated ASTs. Different from YN17, where ASTs are represented as a Context Free Grammar learned from a parsed corpus, we follow Rabinovich et al. (2017) and use ASTs defined under the ASDL formalism (§ 3.2.1).

Generating ASTs with ASDL Grammar
First, we present a brief introduction to ASDL. An AST can be generated by applying typed constructors in an ASDL grammar, such as those in Fig. 3 for the Python ASDL grammar. Each constructor specifies a language construct, and is assigned to a particular composite type. For example, the constructor Call has type expr (expression), and it denotes function calls. Constructors are associated with multiple fields. For instance, the Call constructor has three fields: func, args and keywords. Like constructors, fields are also strongly typed. For example, the func field of Call has expr type. Fields with composite types are instantiated by constructors of the same type, while fields with primitive types store values (e.g., identifier names or string literals). Each field also has a cardinality (single, optional, or sequential), specifying the number of values the field holds.
Figure 2: Left: An example ASDL AST with its surface source code. Field names are labeled on upper arcs. Blue squares denote fields with sequential cardinality. Grey nodes denote primitive identifier fields, with annotated values. Fields are labeled with time steps at which they are generated. Right: Action sequences used to construct the example AST. Frontier fields are denoted by their signature (type name). Each constructor in the Action column refers to an APPLYCONSTR action.
Each node in an AST corresponds to a typed field in a constructor (except for the root node). Depending on the cardinality of the field, an AST node can be instantiated with one or multiple constructors. For instance, the func field in the example AST has single cardinality, and is instantiated with a Name constructor, while the args field with sequential cardinality could have multiple constructors (only one shown in this example).
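The constructor and field structure described above can be inspected directly, since Python's built-in `ast` module is generated from the Python ASDL grammar. The sketch below is only an illustration: the expression being parsed is an assumed example, and reading a Python list as "sequential cardinality" is an interpretation of how the `ast` module stores such fields.

```python
import ast

# Parse an assumed example expression and take its Call node.
call = ast.parse("sorted(my_list, reverse=True)", mode="eval").body

# The Call constructor declares three fields.
print(ast.Call._fields)

# Single-cardinality fields hold one node; sequential fields hold a list.
fields = {name: type(value).__name__ for name, value in ast.iter_fields(call)}
print(fields)

# The func field is instantiated by a Name constructor carrying an identifier.
func_name = call.func.id
keyword_args = [kw.arg for kw in call.keywords]
```

Here `func` holds a single `Name` node, while `args` and `keywords` are lists that may contain any number of constructors.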
Our parser employs a transition system to generate an AST using three types of actions. Fig. 2 (Right) lists the sequence of actions used to generate the example AST. The generation process starts from an initial derivation with only a root node of type stmt (statement), and proceeds according to the top-down, left-to-right traversal of the AST. At each time step, the parser applies an action to the frontier field of the derivation:
APPLYCONSTR[c] actions apply a constructor c to the frontier composite field, expanding the derivation using the fields of c. For fields with single or optional cardinality, an APPLYCONSTR action instantiates the empty frontier field using the constructor, while for fields with sequential cardinality, it appends the constructor to the frontier field. For example, at $t_2$ the Call constructor is applied to the value field of Expr, and the derivation is expanded using its three child fields.
REDUCE actions complete generation of a field with optional or multiple cardinalities. For instance, the args field is instantiated by Name at $t_5$, and then closed by a REDUCE action at $t_7$.
GENTOKEN[v] actions populate an empty primitive frontier field with token v. A primitive field whose value is a single token (e.g., identifier fields) can be populated with a single GENTOKEN action. Fields of string type can be instantiated using multiple such actions, with a final GENTOKEN[</f>] action to terminate the generation of field values.
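The three action types can be sketched as a small state machine. The code below is a simplified illustration under stated assumptions, not the parser used in this work: the grammar fragment, the `Field`/`Node` classes, and the helper names `frontier`, `step` and `to_source` are all hypothetical, multi-token string fields (and the `</f>` terminator) are omitted, and the action sequence is an assumed derivation of `sorted(my_list, reverse=True)`.

```python
from dataclasses import dataclass, field

# Toy fragment of an ASDL-style grammar (hypothetical encoding):
# constructor -> (composite type, [(field name, field type, cardinality)])
GRAMMAR = {
    "Expr": ("stmt", [("value", "expr", "single")]),
    "Call": ("expr", [("func", "expr", "single"),
                      ("args", "expr", "sequential"),
                      ("keywords", "keyword", "sequential")]),
    "Name": ("expr", [("id", "identifier", "single")]),
    "keyword": ("keyword", [("arg", "identifier", "single"),
                            ("value", "expr", "single")]),
}
PRIMITIVE_TYPES = {"identifier"}


@dataclass
class Field:
    name: str
    ftype: str
    cardinality: str
    values: list = field(default_factory=list)
    closed: bool = False


@dataclass
class Node:
    constructor: str
    fields: list


def frontier(f):
    """Return the next open field in top-down, left-to-right order, or None."""
    for value in f.values:
        if isinstance(value, Node):
            for child in value.fields:
                found = frontier(child)
                if found is not None:
                    return found
    return None if f.closed else f


def step(root, action, arg=None):
    """Apply one action to the current frontier field."""
    f = frontier(root)
    if f is None:
        raise ValueError("derivation is already complete")
    if action == "APPLYCONSTR":
        ctype, specs = GRAMMAR[arg]
        if ctype != f.ftype:  # typing keeps the tree well-formed
            raise ValueError(f"{arg} cannot fill a field of type {f.ftype}")
        f.values.append(Node(arg, [Field(*spec) for spec in specs]))
        f.closed = f.cardinality != "sequential"
    elif action == "GENTOKEN":
        if f.ftype not in PRIMITIVE_TYPES:
            raise ValueError("GENTOKEN needs a primitive field")
        f.values.append(arg)
        f.closed = True
    elif action == "REDUCE":
        if f.cardinality == "single":
            raise ValueError("REDUCE needs an optional or sequential field")
        f.closed = True
    else:
        raise ValueError(f"unknown action {action}")


def to_source(node):
    """Render the toy AST back to source code for inspection."""
    v = {f.name: f.values for f in node.fields}
    if node.constructor == "Expr":
        return to_source(v["value"][0])
    if node.constructor == "Name":
        return v["id"][0]
    if node.constructor == "keyword":
        return f"{v['arg'][0]}={to_source(v['value'][0])}"
    parts = [to_source(a) for a in v["args"] + v["keywords"]]
    return f"{to_source(v['func'][0])}({', '.join(parts)})"


ACTIONS = [  # assumed twelve-step derivation
    ("APPLYCONSTR", "Expr"), ("APPLYCONSTR", "Call"),
    ("APPLYCONSTR", "Name"), ("GENTOKEN", "sorted"),
    ("APPLYCONSTR", "Name"), ("GENTOKEN", "my_list"), ("REDUCE", None),
    ("APPLYCONSTR", "keyword"), ("GENTOKEN", "reverse"),
    ("APPLYCONSTR", "Name"), ("GENTOKEN", "True"), ("REDUCE", None),
]

root = Field("root", "stmt", "single")
for action, arg in ACTIONS:
    step(root, action, arg)
source = to_source(root.values[0])
print(source)
```

Because `step` rejects constructors whose type does not match the frontier field, any action sequence it accepts yields a tree that respects the toy grammar, which mirrors how grammar-constrained actions guarantee well-formed ASTs.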

Modeling q φ (z|x)
The probability of generating an AST z is naturally decomposed into the probabilities of the actions $\{a_t\}$ used to construct z:

$$q_\phi(z|x) = \prod_t p(a_t \mid a_{<t}, x)$$

Following YN17, we parameterize $q_\phi(z|x)$ using a sequence-to-sequence network with auxiliary recurrent connections following the topology of the AST. Interested readers are referred to Appendix B and Yin and Neubig (2017) for details of the neural network architecture.

Semi-supervised Learning
In this section we explain how to optimize the semi-supervised learning objective Eq. (1) in STRUCTVAE.
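Supervised Learning For the supervised learning objective, we modify $J_s$, and use the labeled data to optimize both the inference model (the semantic parser) and the reconstruction model:

$$J_s \triangleq \sum_{\langle x, z \rangle \in L} \big( \log q_\phi(z|x) + \log p_\theta(x|z) \big) \qquad (2)$$

Unsupervised Learning To optimize the unsupervised learning objective $J_u$ in Eq. (1), we maximize the variational lower bound of $\log p(x)$:

$$\log p(x) \geq \mathbb{E}_{z \sim q_\phi(z|x)} \big[ \log p_\theta(x|z) \big] - \lambda \cdot \mathrm{KL}\big[ q_\phi(z|x) \,\|\, p(z) \big] = \mathcal{L} \qquad (3)$$

where $\mathrm{KL}[q_\phi \| p]$ is the Kullback-Leibler (KL) divergence. Following common practice in optimizing VAEs, we introduce λ as a tuning parameter of the KL divergence to control the impact of the prior (Miao and Blunsom, 2016; Bowman et al., 2016).
To optimize the parameters of our model in the face of non-differentiable discrete latent variables, we follow Miao and Blunsom (2016), and approximate $\partial \mathcal{L} / \partial \phi$ using the score function estimator (a.k.a. REINFORCE; Williams, 1992):

$$\frac{\partial \mathcal{L}}{\partial \phi} = \frac{\partial}{\partial \phi} \mathbb{E}_{z \sim q_\phi(z|x)} \underbrace{\Big( \log p_\theta(x|z) - \lambda \big( \log q_\phi(z|x) - \log p(z) \big) \Big)}_{\text{learning signal}} \approx \frac{1}{|S(x)|} \sum_{z_i \in S(x)} l'(x, z_i) \frac{\partial \log q_\phi(z_i|x)}{\partial \phi} \qquad (4)$$

where we approximate the gradient using a set of samples $S(x)$ drawn from $q_\phi(\cdot|x)$. To ensure the quality of sampled latent MRs, we follow Guu et al. (2017) and use beam search. The term $l'(x, z) \triangleq \log p_\theta(x|z) - \lambda (\log q_\phi(z|x) - \log p(z))$ is defined as the learning signal (Miao and Blunsom, 2016). The learning signal weights the gradient for each latent sample z. In REINFORCE, to cope with the high variance of the learning signal, it is common to use a baseline $b(x)$ to stabilize learning, and re-define the learning signal as

$$l(x, z) \triangleq l'(x, z) - b(x) \qquad (5)$$

Specifically, in STRUCTVAE, we define

$$b(x) = a \times \log \tilde{p}(x) + c \qquad (6)$$

where $\log \tilde{p}(x)$ is given by a pre-trained LSTM language model. This is motivated by the empirical observation that $\log \tilde{p}(x)$ correlates well with the reconstruction score $\log p_\theta(x|z)$, hence with $l'(x, z)$.
Finally, for the reconstruction model, its gradient can be easily computed:

$$\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{1}{|S(x)|} \sum_{z_i \in S(x)} \frac{\partial \log p_\theta(x|z_i)}{\partial \theta}$$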
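As a worked illustration of the learning signal, the sketch below plugs in the scores reported for one DJANGO example in Tab. 3 (two sampled MRs for the utterance "join p and cmd into a file path, substitute it for f") with λ = 0.1. It assumes the raw signal has the form log p(x|z) − λ(log q(z|x) − log p(z)). The baseline value `BASELINE = -13.47` is not taken from the paper: it is an assumption, back-solved so that the computed signals match the reported values of 9.14 and −9.47, and the function name is hypothetical.

```python
LAMBDA = 0.1  # KL weight used in the experiments


def raw_learning_signal(log_recon, log_q, log_prior, lam=LAMBDA):
    """l'(x, z) = log p(x|z) - lam * (log q(z|x) - log p(z))."""
    return log_recon - lam * (log_q - log_prior)


# (sampled MR, log q(z|x), log p(x|z), log p(z)) as reported in Tab. 3
SAMPLES = [
    ("f = os.path.join(p, cmd)", -1.00, -2.00, -24.33),
    ("p = path.join(p, cmd)", -8.12, -20.96, -27.89),
]
BASELINE = -13.47  # assumed value of b(x), back-solved from the reported signals

signals = [
    raw_learning_signal(log_recon, log_q, log_prior) - BASELINE
    for _, log_q, log_recon, log_prior in SAMPLES
]

# Each sample's gradient of log q(z|x) is scaled by l(x, z) / |S(x)|.
weights = [s / len(SAMPLES) for s in signals]

for (mr, *_), s, w in zip(SAMPLES, signals, weights):
    print(f"{mr!r}: l(x, z) = {s:.2f}, gradient weight = {w:.2f}")
```

Under these assumptions the correct sample receives a positive weight and the incorrect one a negative weight, so the update raises the probability of the former and lowers that of the latter. Since $b(x)$ depends only on x, it shifts both signals by the same amount without changing their difference.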
Discussion Perhaps the most intriguing question here is why semi-supervised learning could improve semantic parsing performance. While the underlying theoretical exposition still remains an active research problem (Singh et al., 2008), in this paper we try to empirically test some likely hypotheses. In Eq. (4), the gradient received by the inference model from each latent sample z is weighted by the learning signal $l(x, z)$. $l(x, z)$ can be viewed as the reward function in REINFORCE learning. It can also be viewed as weights associated with pseudo-training examples $\{\langle x, z \rangle : z \in S(x)\}$ sampled from the inference model. Intuitively, a sample z with higher rewards should: (1) adequately encode the input, leading to a high reconstruction score $\log p_\theta(x|z)$; and (2) be succinct and natural, yielding a high prior probability. Let $z^*$ denote the gold-standard MR of x. Consider the ideal case where $z^* \in S(x)$ and $l(x, z^*)$ is positive, while $l(x, z')$ is negative for other imperfect samples $z' \in S(x)$, $z' \neq z^*$. In this ideal case, $\langle x, z^* \rangle$ would serve as a positive training example and other samples $\langle x, z' \rangle$ would be treated as negative examples. Therefore, the inference model would receive informative gradient updates, and learn to discriminate between gold and imperfect MRs. This intuition is similar in spirit to recent efforts in interpreting gradient update rules in reinforcement learning (Guu et al., 2017). We will present more empirical statistics and observations in § 4.3.

Datasets
In our semi-supervised semantic parsing experiments, it is of interest how STRUCTVAE could further improve upon a supervised parser with extra unlabeled data. We evaluate on two datasets:
Semantic Parsing We use the ATIS dataset, a collection of 5,410 telephone inquiries of flight booking (e.g., "Show me flights from ci0 to ci1").
Code Generation The DJANGO dataset (Oda et al., 2015) contains 18,805 lines of Python source code extracted from the Django web framework. Each line of code is annotated with an NL utterance. Source code in the DJANGO dataset exhibits a wide variety of real-world use cases of Python, including IO operation, data structure manipulation, class/function definition, etc. We use the pre-processed version released by Yin and Neubig (2017) and use the astor package to convert ASDL ASTs into Python source code.

Setup
Labeled and Unlabeled Data STRUCTVAE requires access to extra unlabeled NL utterances for semi-supervised learning. However, the datasets we use are not accompanied by such data. We therefore simulate the semi-supervised learning scenario by randomly sub-sampling K examples from the training split of each dataset as the labeled set L. To make the most use of the NL utterances in the dataset, we construct the unlabeled set U using all NL utterances in the training set.
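The simulated split can be sketched in a few lines. This is an illustration only: the function name `simulate_split`, the fixed seed, and the toy (utterance, MR) pairs are assumptions, not details of the actual experimental code.

```python
import random


def simulate_split(train_pairs, k, seed=0):
    """Sub-sample k labeled pairs; use every training utterance as unlabeled data."""
    rng = random.Random(seed)  # hypothetical fixed seed for reproducibility
    labeled = rng.sample(train_pairs, k)
    unlabeled = [utterance for utterance, _ in train_pairs]
    return labeled, unlabeled


# Toy stand-in for a training split of (NL utterance, MR) pairs.
train = [(f"utterance {i}", f"mr_{i}") for i in range(10)]
labeled, unlabeled = simulate_split(train, k=3)
```

Note that the utterances of the labeled pairs also appear in the unlabeled set, so when K equals the full training-set size there is no unlabeled data beyond what is already labeled.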
Training Procedure Optimizing the unsupervised learning objective Eq. (3) requires sampling structured MRs from the inference model $q_\phi(z|x)$. Due to the complexity of the semantic parsing problem, we cannot expect any valid samples from a randomly initialized $q_\phi(z|x)$. We therefore pre-train the inference and reconstruction models using the supervised objective Eq. (2) until convergence, and then optimize using the semi-supervised learning objective Eq. (1). Throughout all experiments we set α (Eq. (1)) and λ (Eq. (3)) to 0.1. The sample size $|S(x)|$ is 5. We observe that the variance of the learning signal could still be high when low-quality samples are drawn from the inference model $q_\phi(z|x)$. We therefore clip all learning signals lower than k = −20.0. Early stopping is used to avoid over-fitting. We also pre-train the prior $p(z)$ (§ 3.3) and the baseline function Eq. (6). Readers are referred to Appendix D for more details of the configurations.
Metric As standard in semantic parsing research, we evaluate by exact-match accuracy.

Main Results
Tab. 1 and Tab. 2 list the results on ATIS and DJANGO, respectively, with varying amounts of labeled data L. We also present results of training the transition-based parser using only the supervised objective (SUP., Eq. (2)). We also compare STRUCTVAE with self-training (SELFTRAIN), a semi-supervised learning baseline which uses the supervised parser to predict MRs for unlabeled utterances in U − L, and adds the predicted examples to the training set to fine-tune the supervised model. Results for STRUCTVAE are averaged over four runs to account for the additional fluctuation caused by REINFORCE training.

Supervised System Comparison To establish the reliability of our supervised baseline, we compare the supervised version of our parser with existing parsing models. On ATIS, our supervised parser trained on the full data is competitive with existing neural network based models, surpassing the SEQ2TREE model, and on par with the Abstract Syntax Network (ASN) without using extra supervision. On DJANGO, our model significantly outperforms the YN17 system, probably because the transition system used by our parser is defined natively to construct ASDL ASTs, reducing the number of actions for generating each example. On DJANGO, the average number of actions is 14.3, compared with 20.3 reported in YN17.
Semi-supervised Learning Next, we discuss our main comparison between STRUCTVAE and the supervised version of the parser (recall that the supervised parser is used as the inference model in STRUCTVAE, § 3.2). First, comparing our proposed STRUCTVAE with the supervised parser when there is extra unlabeled data (i.e., |L| < 4,434 for ATIS and |L| < 16,000 for DJANGO), semi-supervised learning with STRUCTVAE consistently achieves better performance. Notably, on DJANGO, our model registers results competitive with the previous state-of-the-art method (YN17) using only half the training data (71.5 when |L| = 8,000 vs. 71.6 for YN17). This demonstrates that STRUCTVAE is capable of learning from unlabeled NL utterances by inferring high quality, structurally rich latent meaning representations, further improving the performance of its supervised counterpart that is already competitive. Second, comparing STRUCTVAE with self-training, we find STRUCTVAE outperforms SELFTRAIN in eight out of ten settings, while SELFTRAIN under-performs the supervised parser in four out of ten settings. This shows self-training does not necessarily yield stable gains while STRUCTVAE does. Intuitively, STRUCTVAE would perform better since it benefits from the additional signal of the quality of MRs from the reconstruction model (§ 3.3), for which we present more analysis in our next set of experiments.
For the sake of completeness, we also report the results of STRUCTVAE when L is the full training set. Note that in this scenario there is no extra unlabeled data disjoint with the labeled set, and not surprisingly, STRUCTVAE does not outperform the supervised parser. In addition to the supervised objective Eq. (2) used by the supervised parser, STRUCTVAE has the extra unsupervised objective Eq. (3), which uses sampled (probably incorrect) MRs to update the model. When there is no extra unlabeled data, those sampled (incorrect) MRs add noise to the optimization process, causing STRUCTVAE to under-perform.

Study of Learning Signals
As discussed in § 3.3, in semi-supervised learning, the gradient received by the inference model from each sampled latent MR is weighted by the learning signal. Empirically, we would expect that on average, the learning signals of gold-standard samples $z^*$, $l(x, z^*)$, are positive, and larger than those of other (imperfect) samples $z'$, $l(x, z')$. We therefore study the statistics of $l(x, z^*)$ and $l(x, z')$ for all utterances x ∈ U − L, i.e., the set of utterances which are not included in the labeled set. The statistics are obtained by performing inference using trained models. Figures 4a and 4b depict the histograms of learning signals on DJANGO and ATIS, respectively. We observe that the learning signals for gold samples concentrate on positive intervals. We also show the mean and variance of the learning signals. On average, we have $l(x, z^*)$ being positive and $l(x, z')$ negative. Also note that the distribution of $l(x, z^*)$ has smaller variance and is more concentrated. Therefore the inference model receives informative gradient updates to discriminate between gold and imperfect samples.

Table 3 (excerpt): Examples of inferred latent MRs with their scores.
NL: join p and cmd into a file path, substitute it for f
z^s_1: f = os.path.join(p, cmd) | log q(z|x) = −1.00 | log p(x|z) = −2.00 | log p(z) = −24.33 | l(x, z) = 9.14
z^s_2: p = path.join(p, cmd) | log q(z|x) = −8.12 | log p(x|z) = −20.96 | log p(z) = −27.89 | l(x, z) = −9.47
NL: append i-th element of existing to child loggers

Next, we plot the distribution of the rank of $l(x, z^*)$ among the learning signals of all samples of x, $\{l(x, z_i) : z_i \in S(x)\}$. Results are shown in Fig. 5. We observe that the gold samples $z^*$ have the largest learning signals in around 80% of cases. We also find that when $z^*$ has the largest learning signal, its average difference with the learning signal of the highest-scoring incorrect sample is 1.27 and 0.96 on DJANGO and ATIS, respectively.
Finally, to study the relative contribution of the reconstruction score $\log p(x|z)$ and the prior $\log p(z)$ to the learning signal, we present examples of inferred latent MRs during training (Tab. 3). Examples 1 and 2 show that the reconstruction score serves as an informative quality measure of the latent MR, assigning the correct samples $z^s_1$ a high $\log p(x|z)$, leading to positive learning signals. This is in line with our assumption that a good latent MR should adequately encode the semantics of the utterance. Example 3 shows that the prior is also effective in identifying "unnatural" MRs (e.g., it is rare to add a function and a string literal, as in $z^s_2$). These results also suggest that the prior and the reconstruction model perform well with linearization of MRs. Finally, note that in Examples 2 and 3 the learning signals for the correct samples $z^s_1$ are positive even if their inference scores $q(z|x)$ are lower than those of $z^s_2$. This result further demonstrates that learning signals provide informative gradient weights for optimizing the inference model.
Generalizing to Other Latent MRs Our main results are obtained using a strong AST-based semantic parser as the inference model, with a copy-augmented reconstruction model and an LSTM language model as the prior. However, there are many other ways to represent and infer structure in semantic parsing (Carpenter, 1998; Steedman, 2000), and thus it is of interest whether our basic STRUCTVAE framework generalizes to other semantic representations. To examine this, we test STRUCTVAE using λ-calculus logical forms as latent MRs for semantic parsing on the ATIS domain. We use standard sequence-to-sequence networks with attention (Luong et al., 2015) as inference and reconstruction models. The inference model is trained to construct a tree-structured logical form using the transition actions defined in Cheng et al. (2017). We use a classical tri-gram Kneser-Ney language model as the prior. Tab. 4 lists the results for this STRUCTVAE-SEQ model.
We can see that even with this very different model structure STRUCTVAE still provides significant gains, demonstrating its compatibility with different inference/reconstruction networks and priors. Interestingly, compared with the results in Tab. 1, we found that the gains are especially large with few labeled examples: STRUCTVAE-SEQ achieves improvements of 8-10 points when |L| < 1000. These results suggest that semi-supervision is especially useful in improving a mediocre parser in low resource settings.
In § 3.3 we introduced a baseline function $b(x)$ in the learning signal (Eq. (4)) to stabilize learning, which is based on a language model (LM) over utterances (Eq. (6)). We compare this baseline with a commonly used one in REINFORCE training: the multi-layer perceptron (MLP). The MLP takes as input the last hidden state of the utterance given by the encoding LSTM of the inference model. Tab. 5 lists the results over sampled settings. We found that although STRUCTVAE with the MLP baseline sometimes registers better performance on ATIS, in most settings it is worse than our LM baseline, and could be even worse than the supervised parser. On the other hand, our LM baseline correlates well with the learning signal, yielding stable improvements over the supervised parser. This suggests the importance of using carefully designed baselines in REINFORCE learning, especially when the reward signal has a large range (e.g., log-likelihoods).
Impact of the Prior p(z) Fig. 6 depicts the performance of STRUCTVAE as a function of the KL term weight λ in Eq. (3). When STRUCTVAE degenerates to a vanilla auto-encoder without the prior distribution (i.e., λ = 0), it under-performs the supervised baseline. This is in line with our observation in Tab. 3 showing that the prior helps identify unnatural samples. The performance of the model also drops when λ > 0.1, suggesting that empirically controlling the influence of the prior on the inference model is important.
Fig. 7 illustrates the accuracies w.r.t. the size of unlabeled data. STRUCTVAE yields consistent gains as the size of the unlabeled data increases.

Related Works
Semi-supervised Learning for NLP Semi-supervised learning comes with a long history (Zhu, 2005), with applications in NLP from early work on self-training (Yarowsky, 1995) and graph-based methods (Das and Smith, 2011), to recent advances in auto-encoders (Cheng et al., 2016; Socher et al., 2011; Zhang et al., 2017) and deep generative methods (Xu et al., 2017). Our work follows the line of neural variational inference for text processing (Miao et al., 2016), and resembles Miao and Blunsom (2016), which uses VAEs to model summaries as discrete latent variables for semi-supervised summarization, while we extend the VAE architecture for more complex, tree-structured latent variables.
Semantic Parsing Most existing works alleviate issues of limited parallel data through weakly-supervised learning, using the denotations of MRs as indirect supervision (Reddy et al., 2014; Krishnamurthy et al., 2016; Neelakantan et al., 2016; Pasupat and Liang, 2015; Yin et al., 2016). For semi-supervised learning of semantic parsing, see Kate and Mooney (2007). There have also been efforts in unsupervised semantic parsing, which exploits external linguistic analysis of utterances (e.g., dependency trees) and the schema of target knowledge bases to infer the latent MRs (Poon and Domingos, 2009; Poon, 2013). Another line of research is domain adaptation, which seeks to transfer a semantic parser learned from a source domain to the target domain of interest, therefore alleviating the need for parallel data from the target domain (Su and Yan, 2017; Fan et al., 2017; Herzig and Berant, 2018).

Conclusion
We propose STRUCTVAE, a deep generative model with tree-structured latent variables for semi-supervised semantic parsing. We apply STRUCTVAE to semantic parsing and code generation tasks, and show it outperforms a strong supervised parser using extra unlabeled data.
STRUCTVAE is a generative model of natural language, and therefore can be used to sample latent MRs and the corresponding NL utterances. This amounts to drawing a latent MR z from the prior $p(z)$, and sampling an NL utterance x from the reconstruction model $p_\theta(x|z)$. Since we use the sequential representation $z^s$ in the prior, to guarantee the syntactic well-formedness of sampled MRs from $p(z)$, we use a syntactic checker and reject any syntactically-incorrect samples. Tab. 6 and Tab. 7 present samples from DJANGO and ATIS, respectively. These examples demonstrate that STRUCTVAE is capable of generating syntactically diverse NL utterances.

latent MR: def __init__(self, *args, **kwargs): pass
surface NL: Define the method __init__ with 3 arguments: self, unpacked list args and unpacked dictionary kwargs
latent MR: elif isinstance(target, six.string_types): pass
surface NL: Otherwise if target is an instance of six.string_types
latent MR: for k, v in unk.items(): pass
surface NL: For every k and v in return value of the method unk.items
latent MR: return cursor.fetchone()[0]
surface NL: Call the method cursor.fetchone, return the first element of the result
latent MR: sys.stderr.write(STR % e)
surface NL: Call the method sys.stderr, write with an argument STR formated with e
latent MR: opts = getattr(self, STR, None)
surface NL: Get the STR attribute of the self object, if it exists substitute it for opts, if not opts is None
Table 6: Sampled latent meaning representations (presented in surface source code) and NL utterances from DJANGO.

latent MR: (argmax $0 (and (flight $0) (meal $0 lunch:me) (from $0 ci0) (to $0 ci1)) (departure_time $0))
surface NL: Show me the latest flight from ci0 to ci1 that serves lunch
latent MR: (min $0 (exists $1 (and (from $1 ci0) (to $1 ci1) (day_number $1 dn0) (month $1 mn0) (round_trip $1) (= (fare $1) $0))))
surface NL: I want the cheapest round trip fare from ci0 to ci1 on mn0 dn0
latent MR: (lambda $0 e (and (flight $0) (from $0 ci0) (to $0 ci1) (weekday $0)))
surface NL: Please list weekday flight between ci0 and ci1
latent MR: (lambda $0 e (and (flight $0) (has_meal $0) (during_day $0 evening:pd) (from $0 ci1) (to $0 ci0) (day_number $0 dn0) (month $0 mn0)))
surface NL: What are the flight from ci1 to ci0 on the evening of mn0 dn0 that serves a meal
latent MR: (lambda $0 e (and (flight $0) (oneway $0) (class_type $0 first:cl) (from $0 ci0) (to $0 ci1) (day $0 da0)))
surface NL: Show me one way flight from ci0 to ci1 on a da0 with first class fare
latent MR: (lambda $0 e (exists $1 (and (rental_car $1) (to_city $1 ci0) (= (ground_fare $1) $0))))
surface NL: What would be cost of car rental car in ci0
Table 7: Sampled latent meaning representations (presented in surface λ-calculus expression) and NL utterances from ATIS. Verbs are recovered to their correct form instead of the lemmatized version as in the pre-processed dataset.
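
The rejection step applied when sampling MRs from the prior can be sketched for the Python case as follows. This is a hedged illustration, not the checker used in this work: it assumes that parsing with Python's built-in `ast.parse` is an acceptable notion of syntactic well-formedness, and the names `is_well_formed`, `rejection_sample`, and the `draw` callback (standing in for sampling $z^s$ from the prior) are hypothetical. λ-calculus logical forms would need a different checker.

```python
import ast


def is_well_formed(code: str) -> bool:
    """Return True if the string parses as Python source code."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False
    return True


def rejection_sample(draw, max_tries=100):
    """Call draw() until it yields a well-formed MR; None if none is found."""
    for _ in range(max_tries):
        candidate = draw()
        if is_well_formed(candidate):
            return candidate
    return None


# Stand-in for a prior: a fixed stream of candidate surface MRs.
candidates = iter([
    "sys.stderr.write(STR % e",        # unbalanced parenthesis, rejected
    "for k, v in unk.items(): pass",   # accepted
])
sample = rejection_sample(lambda: next(candidates))
print(sample)
```

An accepted sample would then be passed to the reconstruction model to generate the corresponding NL utterance.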