Jointly Learning Semantic Parser and Natural Language Generator via Dual Information Maximization

Semantic parsing aims to transform natural language (NL) utterances into formal meaning representations (MRs), whereas an NL generator achieves the reverse: producing an NL description for some given MRs. Despite this intrinsic connection, the two tasks are often studied separately in prior work. In this paper, we model the duality of these two tasks via a joint learning framework, and demonstrate its effectiveness in boosting the performance on both tasks. Concretely, we propose a novel method of dual information maximization (DIM) to regularize the learning process, where DIM empirically maximizes the variational lower bounds of expected joint distributions of NL and MRs. We further extend DIM to a semi-supervised setup (SemiDIM), which leverages unlabeled data of both tasks. Experiments on three datasets of dialogue management and code generation (and summarization) show that performance on both semantic parsing and NL generation can be consistently improved by DIM, in both supervised and semi-supervised setups.


Introduction
Semantic parsing studies the task of translating natural language (NL) utterances into formal meaning representations (MRs) (Zelle and Mooney, 1996; Tang and Mooney, 2000). NL generation models can be designed to learn the reverse: mapping MRs to their NL descriptions (Wong and Mooney, 2007). Generally speaking, an MR often takes a logical form that captures the semantic meaning, including λ-calculus (Zettlemoyer and Collins, 2005, 2007), Abstract Meaning Representation (AMR) (Banarescu et al., 2013; Misra and Artzi, 2016), and general-purpose computer programs such as Python (Yin and Neubig, 2017) or SQL (Zhong et al., 2017).

1 Code for this paper is available at: https://github.com/oceanypt/DIM

[Figure 1: Semantic parser and NL generator. We model the duality of the two tasks by matching the joint distributions p_e(x, y) (learned by the semantic parser) and p_d(x, y) (learned by the NL generator) to an underlying unknown distribution P(x, y).]

Recently, NL generation models have been proposed to automatically construct human-readable descriptions from MRs, for code summarization (Hu et al., 2018; Allamanis et al., 2016; Iyer et al., 2016), which predicts the function of code snippets, and for AMR-to-text generation (Song et al., 2018; Konstas et al., 2017; Flanigan et al., 2016). Specifically, a common objective that semantic parsers aim to estimate is p_θ(y|x), the conditional distribution between NL input x and the corresponding MR output y, as demonstrated in Fig. 1. Similarly, for NL generation from MRs, the goal is to learn a generator of q_φ(x|y). As demonstrated in Fig. 2, there is a clear duality between the two tasks, given that one task's input is the other task's output, and vice versa. However, such duality remains largely unstudied, even though joint modeling has been demonstrated effective in various NLP problems, e.g. question answering and generation (Tang et al., 2017), machine translation between paired languages (He et al., 2016), as well as sentiment prediction and subjective text generation.

DATA     EXAMPLE                                                                        Ave. Token
ATIS     x: can you list all flights from chicago to milwaukee                              10.6
         y: ( lambda $0 e ( and ( flight $0 ) ( from $0 chicago:ci ) ( to $0 milwaukee:ci ) ) )   26.5
DJANGO   x: convert max_entries into a string, substitute it for self.max_entries.          11.9
         y: self.max_entries = int(max_entries)                                              8.2
CONALA   x: more pythonic alternative for getting a value in range not using min and max     9.7
         y: a = 1 if x < 1 else 10 if x > 10 else x                                         14.1

Figure 2: Sample natural language utterances and meaning representations from datasets used in this work: ATIS for dialogue management; DJANGO (Oda et al., 2015) and CONALA (Yin et al., 2018a) for code generation and summarization.
In this paper, we propose to jointly model semantic parsing and NL generation by exploiting the interaction between the two tasks. Following previous work on dual learning (Xia et al., 2017), we leverage the joint distribution P(x, y) of NL and MR to represent the duality. Intuitively, as shown in Fig. 1, the joint distributions p_e(x, y) = p(x)p_θ(y|x), estimated by the semantic parser, and p_d(x, y) = q(y)q_φ(x|y), modeled by the NL generator, are both expected to approximate P(x, y), the unknown joint distribution of NL and MR.
To achieve this goal, we propose dual information maximization (DIM) (§3) to empirically optimize the variational lower bounds of the expected joint distributions p_e(x, y) and p_d(x, y). Concretely, the coupling of the two expected distributions is designed to capture the dual information, and both are optimized via variational approximation (Barber and Agakov, 2003). Furthermore, combined with the supervised learning objectives of semantic parsing and NL generation, DIM bridges the two tasks within one joint learning framework by serving as a regularization term (§2.2). Finally, we extend supervised DIM to the semi-supervised setup (SEMIDIM), where unsupervised learning objectives based on unlabeled data are also optimized (§3.3).
We experiment with three datasets from two different domains: ATIS for dialogue management; DJANGO and CONALA for code generation and summarization. Experimental results show that both the semantic parser and generator can be consistently improved with joint learning using DIM and SEMIDIM, compared to competitive comparison models trained for each task separately.
Overall, we make the following contributions in this work:
• We are the first to jointly study semantic parsing and natural language generation by exploiting the duality between the two tasks;
• We propose DIM to capture the duality and adopt variational approximation to maximize the dual information;
• We further extend supervised DIM to the semi-supervised setup (SEMIDIM).

Semantic Parsing and NL Generation
Formally, the task of semantic parsing is to map the input of NL utterances x to the output of structured MRs y, and NL generation learns to generate NL from MRs.

Learning Objective. Given a labeled dataset L = {⟨x_i, y_i⟩}, we aim to learn a semantic parser (x → y) by estimating the conditional distribution p_θ(y|x), parameterized by θ, and an NL generator (y → x) by modeling q_φ(x|y), parameterized by φ. The learning objective for each task is the conditional log-likelihood over L:

    L_parser(θ) = Σ_{⟨x,y⟩∈L} log p_θ(y|x)    (1)
    L_generator(φ) = Σ_{⟨x,y⟩∈L} log q_φ(x|y)    (2)

Frameworks. Sequence-to-sequence (seq2seq) models have achieved competitive results on both semantic parsing and generation (Dong and Lapata, 2016; Hu et al., 2018), and without loss of generality, we adopt them as the basic framework for both tasks in this work. Specifically, for both p_θ(y|x) and q_φ(x|y), we use a two-layer bidirectional LSTM (bi-LSTM) as the encoder and another one-layer LSTM as the decoder with attention mechanism (Luong et al., 2015). Furthermore, we leverage a pointer network (Vinyals et al., 2015) to copy tokens from the input to handle out-of-vocabulary (OOV) words. The structured MRs are linearized for the sequential encoder and decoder. More details of the parser and the generator can be found in Appendix A. Briefly speaking, our models differ from existing work as follows: PARSER: our architecture is similar to the one proposed in Jia and Liang (2016) for semantic parsing; GENERATOR: our model improves upon the DEEPCOM code summarization system (Hu et al., 2018) by 1) replacing the LSTM encoder with a bi-LSTM to better model context, and 2) adding a copying mechanism.
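As a concrete illustration, each supervised objective is just a sum of conditional log-likelihoods over labeled pairs. A minimal sketch, with a toy lookup table standing in for p_θ(y|x) (the probabilities and names here are illustrative assumptions, not the paper's model):

```python
import math

def supervised_objective(pairs, cond_logprob):
    """Sum of conditional log-likelihoods over labeled (source, target)
    pairs (Eq. 1 for the parser; Eq. 2 swaps the roles of x and y)."""
    return sum(cond_logprob(tgt, src) for src, tgt in pairs)

# Toy stand-in for p_theta(y|x): a lookup table of conditional probabilities
# (illustrative values only; a real parser would be a seq2seq model).
probs = {("list flights", "(flights)"): 0.8,
         ("book a flight", "(book flight)"): 0.5}

def parser_logprob(y, x):
    return math.log(probs.get((x, y), 1e-9))

data = [("list flights", "(flights)"), ("book a flight", "(book flight)")]
loss = supervised_objective(data, parser_logprob)  # objective to maximize
```

In training, `loss` would be maximized with gradient ascent over the model parameters rather than computed from a fixed table.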

Jointly Learning Parser and Generator
Our joint learning framework is designed to model the duality between a parser and a generator. To incorporate the duality into our learning process, we design the framework to encourage the expected joint distributions p_e(x, y) and p_d(x, y) to both approximate the unknown joint distribution of x and y (shown in Fig. 1). To achieve this, we introduce dual information maximization (DIM) to empirically optimize the variational lower bounds of both p_e(x, y) and p_d(x, y), in which the coupling of expected distributions is captured as dual information (detailed in §3.1) and maximized during learning. Our joint learning objective takes the form of:

    L = L_parser(θ) + L_generator(φ) + λ · L_DIM    (3)

where L_DIM is the variational lower bound of the two expected joint distributions; specifically,

    L_DIM = L^e_DIM + L^d_DIM    (4)

where L^e_DIM and L^d_DIM are the lower bounds over p_e(x, y) and p_d(x, y) respectively. The hyperparameter λ trades off between the supervised objectives and dual information learning. With the objective of Eq. 3, we jointly learn a parser and a generator, as well as maximize the dual information between the two. L_DIM serves as a regularization term to influence the learning process; the detailed algorithm is described in §3.
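The structure of the joint objective can be sketched in a few lines; this is only the scalar combination step (the function name and toy inputs are ours), assuming the four loss terms have already been computed elsewhere:

```python
def joint_objective(l_parser, l_generator, l_dim_e, l_dim_d, lam=0.1):
    """Combine supervised log-likelihoods with the lambda-weighted
    dual-information lower bound L_DIM = L^e_DIM + L^d_DIM, which acts as
    a regularizer on joint training."""
    l_dim = l_dim_e + l_dim_d
    return l_parser + l_generator + lam * l_dim
```

With λ = 0 the objective reduces to independent supervised training of the two models; the paper's experiments set λ = 0.1.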
Our method of DIM is model-independent: as long as the learning objectives for the semantic parser and NL generator take the form of Eq. 1 and Eq. 2, we can adopt DIM to conduct joint learning. Beyond the most commonly used seq2seq models for the parser and generator, more complex tree and graph structures have been adopted to model MRs (Dong and Lapata, 2016; Song et al., 2018). In this paper, without loss of generality, we study our joint-learning method on the widely used seq2seq frameworks mentioned above (§2.1).

Dual Information Maximization
In this section, we first introduce dual information in §3.1, followed by its maximization ( §3.2). §3.3 discusses its extension with semi-supervision.

Dual Information
As discussed above, we treat semantic parsing and NL generation as dual tasks and exploit the duality between them for joint learning. With conditional distributions p_θ(y|x) for the parser and q_φ(x|y) for the generator, the joint distributions can be estimated as p_e(x, y) = p(x)p_θ(y|x) and p_d(x, y) = q(y)q_φ(x|y), where p(x) and q(y) are marginals. The dual information I_{p_{e,d}(x,y)} between the two distributions is defined as:

    I_{p_{e,d}(x,y)} = E_{p_e(x,y)} log p_e(x, y) + E_{p_d(x,y)} log p_d(x, y)    (5)

which is the combination of the two joint distribution expectations.
To leverage the duality between the two tasks, we drive the learning of the model parameters θ and φ by optimizing I_{p_{e,d}(x,y)}, so that the expectations of the joint distributions p_e(x, y) and p_d(x, y) are both maximized and approximate the latent joint distribution P(x, y), a procedure similar to joint distribution matching (Gan et al., 2017). By exploiting the inherent probabilistic connection between the two distributions, we hypothesize that this enhances the learning of both p_θ(y|x) for parsing and q_φ(x|y) for generation. Besides, in approaching the same distribution P(x, y), the expected joint distributions learn to be close to each other, coupling the dual models.

Maximizing Dual Information
Here, we present the method for optimizing I_{p_e(x,y)}; the same method applies to I_{p_d(x,y)}. In contrast to the parameter sharing techniques in most multi-task learning work (Collobert et al., 2011; Ando and Zhang, 2005), the parameter θ for the parser and the parameter φ for the generator are independent in our framework. In order to jointly train the two models and bridge the learning of θ and φ, during the optimization of I_{p_e(x,y)}, where the parser is the primal model, we utilize the distributions of the dual task (i.e. the generator) to estimate I_{p_e(x,y)}. In this way, θ and φ can both be improved during the update of I_{p_e(x,y)}. Specifically, we rewrite E_{p_e(x,y)} log p_e(x, y) as E_{p_e(x,y)} log p_e(y)p_e(x|y), where p_e(y) and p_e(x|y) are referred to as the dual task distributions. However, directly optimizing this objective is impractical since both p_e(y) and p_e(x|y) are unknown. Our solution is detailed below.

[Figure 3: The pipeline of calculating lower bounds, illustrated with the utterance "is there ground transportation in st. louis". We first use the parser or generator to sample MR or NL targets; the sampled candidates then go through the dual model and a language model to obtain the lower bounds.]

Lower Bounds of Dual Information. To provide a principled approach to optimizing I_{p_e(x,y)}, we follow Barber and Agakov (2003) and adopt variational approximation to deduce a lower bound, which we maximize instead. The lower bound is deduced as follows:

    E_{p_e(x,y)} log p_e(y)p_e(x|y)
      = E_{p_e(x,y)} log q(y)q_φ(x|y)
        + E_{p_e(y)} KL(p_e(x|y) ‖ q_φ(x|y))
        + E_{p_e(x|y)} KL(p_e(y) ‖ q(y))
      ≥ E_{p_e(x,y)} log q(y)q_φ(x|y) ≜ L^e_DIM(θ, φ)    (6)

where KL(· ‖ ·) (≥ 0) is the Kullback-Leibler (KL) divergence. Therefore, to maximize I_{p_e(x,y)}, we can instead maximize its lower bound L^e_DIM, which is computed using q_φ(x|y) and q(y) as approximations of p_e(x|y) and p_e(y). Moreover, L^e_DIM is a function of both θ and φ, so in the process of learning L^e_DIM, the parser and the generator are both optimized.
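The lower bound in Ineq. 6 is estimated in practice by sampling from the parser and scoring with the generator and the MR language model. A minimal Monte Carlo sketch (the three callables are hypothetical interfaces standing in for the parser's sampler, the generator, and the language model):

```python
def dim_lower_bound(x, sample_y, gen_logprob, marginal_logprob, n=32):
    """Monte Carlo estimate of the variational lower bound:
    L^e_DIM ~= mean over y ~ p_theta(.|x) of
               [log q_phi(x|y) + log q(y)]."""
    total = 0.0
    for _ in range(n):
        y = sample_y(x)                  # y ~ p_theta(.|x)
        total += gen_logprob(x, y) + marginal_logprob(y)
    return total / n
```

The paper replaces i.i.d. sampling with beam search over p_θ(·|x) to obtain a candidate pool; the averaging structure is the same.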
As illustrated in Fig. 3, during training, to calculate the lower bound L^e_DIM, we first use the parser being trained to sample MR candidates for a given NL utterance. The sampled MRs then go through the generator and a marginal model (i.e., a language model of MRs) to obtain the final lower bound.
To learn the lower bound L^e_DIM, we calculate its gradients as follows.

Gradient Estimation. We adopt Monte Carlo sampling with the REINFORCE policy gradient (Williams, 1992) to approximate the gradient of L^e_DIM(θ, φ) with respect to θ:

    ∇_θ L^e_DIM ≈ (1/|S|) Σ_{y∈S} [l(x, y; φ) − b] · ∇_θ log p_θ(y|x),
    with l(x, y; φ) = log q(y)q_φ(x|y)    (7)

l(x, y; φ) can be seen as the learning signal from the dual model, similar to the reward in reinforcement learning algorithms (Guu et al., 2017; Paulus et al., 2017). To handle the high variance of the learning signals, we adopt a baseline function b, the empirical average of the signals, to stabilize the learning process (Williams, 1992). With prior p_θ(·|x), we use beam search to generate a pool S of MR candidates y for the input x.
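The baseline subtraction in Eq. 7 is a standard variance-reduction trick: each candidate's signal is centered by the pool average before weighting its log-probability gradient. A minimal sketch of just that centering step (function name and toy signal values are ours):

```python
def reinforce_weights(signals):
    """Centered REINFORCE weights l(x,y;phi) - b for the candidates in the
    beam pool S; the baseline b is the empirical mean of the learning
    signals (Williams, 1992). The gradient estimate is then the weighted
    sum of grad log p_theta(y|x) over the pool."""
    b = sum(signals) / len(signals)
    return [l - b for l in signals]

# Learning signals (log q_phi(x|y) + log q(y)) for three sampled MRs:
weights = reinforce_weights([0.2, 0.5, 0.8])
```

Candidates scoring above the pool average get positive weight (their probability is pushed up), those below get negative weight, and the weights always sum to zero.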
The gradient with respect to φ is then calculated as:

    ∇_φ L^e_DIM ≈ (1/|S|) Σ_{y∈S} ∇_φ log q_φ(x|y)    (8)

The above maximization procedure for L^e_DIM is analogous to the EM algorithm. Step 1: freeze φ and find the optimal θ* = arg max_θ L^e_DIM(θ, φ) with Eq. 7. Step 2: based on Eq. 8, with θ* frozen, find the optimal φ* = arg max_φ L^e_DIM(θ, φ). The two steps are repeated until convergence.
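The EM-style alternation can be sketched as a simple loop; `step_theta` and `step_phi` are hypothetical one-step optimizers (in the paper they would be gradient steps on Eqs. 7 and 8, respectively):

```python
def alternate_maximize(theta, phi, step_theta, step_phi, iters=50):
    """EM-style alternation for L^e_DIM: improve theta with phi frozen
    (Step 1), then improve phi with the new theta frozen (Step 2),
    repeating until convergence."""
    for _ in range(iters):
        theta = step_theta(theta, phi)   # Step 1: phi frozen
        phi = step_phi(theta, phi)       # Step 2: theta frozen
    return theta, phi
```

With contractive toy update steps, the two parameters converge toward a common fixed point, mirroring how the coupled models are driven together during joint training.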
According to the gradient estimation in Eq. 7, when updating θ for the parser, we receive the learning signal l(x, y; φ) from the generator, and this learning signal can be seen as a reward from the generator: if the parser p_θ(y|x) predicts high-quality MRs, the reward is high; otherwise, the reward is low. This implies that the generator guides the parser to generate high-quality MRs, through which the lower bound for the expected joint distribution gets optimized. The same holds when we treat the generator as the primal model and the parser as the dual model.
The lower bound of I_{p_d(x,y)} can be derived in a similar way:

    L^d_DIM(θ, φ) = E_{p_d(x,y)} log p(x)p_θ(y|x)    (9)

which can be optimized with the same gradient estimators as in Eqs. 7 and 8.

Marginal Distributions. To obtain the marginal distributions p(x) and q(y), we separately train an LSTM-based language model (Mikolov et al., 2010) for NL and for MRs on each training set; structured MRs are linearized into sequences. Details on learning the marginal distributions can be found in Appendix B.

Joint Learning Objective. Our final joint learning objective becomes:

    L = L_parser(θ) + L_generator(φ) + λ · (L^e_DIM + L^d_DIM)    (10)

According to this learning objective, after picking a data pair ⟨x, y⟩, we first calculate the supervised learning loss, then sample MR candidates with p_θ(·|x) and NL candidates with q_φ(·|y) to obtain the corresponding lower bounds over I_{p_e(x,y)} and I_{p_d(x,y)}.

Semi-supervised DIM (SEMIDIM)
We further extend DIM with semi-supervised learning. We denote the unlabeled NL dataset as U_x = {x_i} and the unlabeled MR dataset as U_y = {y_j}. To leverage U_x, we maximize the unlabeled objective E_{x∼U_x} log p(x). Our goal is to involve the model parameters in the optimization of E_{x∼U_x} log p(x), so that the unlabeled data can facilitate parameter learning.

Lower Bounds of Unsupervised Objective. The lower bound of E_{x∼U_x} log p(x) follows the deduction in Ineq. 6:

    E_{x∼U_x} log p(x) ≥ E_{x∼U_x, y∼p_θ(·|x)} [log q_φ(x|y) + log q(y)]    (11)

Comparing Ineq. 11 to Ineq. 6, the unsupervised objective E_{x∼U_x} log p(x) and I_{p_e(x,y)} share the same lower bound, so the same optimization method from Eq. 7 and Eq. 8 can be used for learning the lower bound of E_{x∼U_x} log p(x).

Analysis. The lower bound of the unsupervised objective E_{x∼U_x} log p(x) is a function of θ and φ; updating this unsupervised objective therefore jointly optimizes the parser and the generator. From the updating algorithm in Eq. 7, the parser p_θ(y|x) is learned from pseudo pairs (x, ŷ) where ŷ is sampled from p_θ(·|x). This updating process resembles the popular semi-supervised learning algorithm of self-training, which predicts pseudo labels for unlabeled data (Lee, 2013) and attaches the predicted labels to the unlabeled data as additional training data. In our algorithm, the pseudo sample (x, ŷ) is weighted by the learning signal l(x, ŷ; φ), which decreases the impact of low-quality pseudo samples. Furthermore, from Eq. 8, the generator q_φ(x|y) is updated using the pseudo sample (x, ŷ), which is similar to the semi-supervised method of back-boost, adapted from back-translation as widely used in neural machine translation for low-resource language pairs (Sennrich et al., 2016): given a target-side corpus, it generates pseudo sources to construct pseudo samples, which are added for model training.
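The contrast with plain self-training is that each pseudo pair carries a weight from the dual model. A minimal sketch of that weighting step (`predict` and `signal` are hypothetical stand-ins for the parser and the generator-plus-language-model scorer):

```python
def weighted_pseudo_pairs(unlabeled_xs, predict, signal):
    """Self-training-style use of unlabeled NL under SEMIDIM: predict a
    pseudo MR y_hat for each x, then attach the dual model's learning
    signal l(x, y_hat; phi) as a weight, so low-quality pseudo pairs
    contribute less to the update."""
    pairs = []
    for x in unlabeled_xs:
        y_hat = predict(x)
        pairs.append((x, y_hat, signal(x, y_hat)))
    return pairs
```

Plain self-training corresponds to the special case where every weight is 1.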
Similarly, to leverage the unlabeled data U_y for semi-supervised learning, following Ineq. 11, we have the lower bound for E_{y∼U_y} log p(y):

    E_{y∼U_y} log p(y) ≥ E_{y∼U_y, x∼q_φ(·|y)} [log p_θ(y|x) + log p(x)]    (12)

which is the same as the lower bound of I_{p_d(x,y)}.
Semi-supervised Joint Learning Objective. From the above discussion, the lower bounds for the unsupervised objectives coincide with the lower bounds of the dual information. We thus have the following semi-supervised joint-learning objective:

    L = L_parser(θ) + L_generator(φ)
        + λ · [ E_{x∼D_x, y∼p_θ(·|x)} (log q_φ(x|y) + log q(y))
              + E_{y∼D_y, x∼q_φ(·|y)} (log p_θ(y|x) + log p(x)) ]    (13)

where D_x = U_x ∪ L_x and D_y = U_y ∪ L_y. In this work, we weight the dual information and the unsupervised objectives equally for simplicity, so their shared lower bounds, computed over both labeled and unlabeled data, are combined for joint optimization.

Datasets
Experiments are conducted on three datasets, with sample pairs shown in Fig. 2. NL utterances are lowercased and tokenized, and the tokens in code snippets are separated with spaces. Statistics of the datasets are summarized in Table 1.

Experimental Setups
Joint-learning Setup. Before jointly learning the models, we pre-train the parser and the generator separately on the labeled dataset, to enable the sampling of valid candidates with beam search when optimizing the lower bounds of dual information (Eqs. 7 and 8). The beam size is tuned from {3, 5}. The parser and the generator are pre-trained until convergence. We also learn the language models for NL and MRs on the training sets beforehand; these are not updated during joint learning. Joint learning stops when the parser or the generator fails to improve for 5 consecutive iterations. λ is set to 0.1 for all the experiments. Additional descriptions of our setup are provided in Appendix C.
For the semi-supervised setup, since ATIS and DJANGO have no additional unlabeled corpus and in-domain NL utterances and MRs are hard to obtain, we create a new partial training set from the original training set via subsampling, and the rest is used as the unlabeled corpus. For CONALA, we subsample data from the full training set to construct the new training set and the unlabeled set, instead of sampling from the low-quality corpus, which would greatly inflate the data volume.

Evaluation Metrics. Accuracy (Acc.) based on exact match is reported for parser evaluation, and BLEU-4 is adopted for generator evaluation. For the code generation task on CONALA, we use BLEU-4 following the setup in Yin et al. (2018a).

Baselines. We compare our methods of DIM and SEMIDIM with the following baselines: SUPER: train the parser or generator separately without joint learning; the models for the parser and generator are the same as in DIM. SELFTRAIN (Lee, 2013): use the pre-trained parser or generator to generate pseudo labels for the unlabeled sources; the constructed pseudo samples are then mixed with the labeled data to fine-tune the pre-trained parser or generator. BACKBOOST: adapted from the back-translation method of Sennrich et al. (2016), which generates sources from unlabeled targets; the training process is the same as for SELFTRAIN. In addition to the above baselines, we also compare with popular supervised methods for each task, shown in the corresponding result tables.
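For concreteness, exact-match accuracy is the fraction of predicted MRs identical to the reference. A minimal sketch, assuming whitespace normalization before comparison (the paper's exact matching convention may differ):

```python
def exact_match_accuracy(predictions, references):
    """Exact-match accuracy for parser evaluation: the fraction of
    predicted MRs identical to the reference after whitespace
    normalization (our simplifying assumption)."""
    norm = lambda s: " ".join(s.split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

BLEU-4 for the generator would typically be computed with a standard toolkit implementation rather than reimplemented.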
For ATIS, we use unlabeled NL samples following Yin et al. (2018b); for DJANGO and CONALA, unlabeled code snippets are utilized. We first note the consistent advantage of DIM over SUPER across all datasets and proportions of training samples used for learning. This indicates that DIM is able to exploit the interaction between the dual tasks, and further improves the performance on both semantic parsing and NL generation.
For semi-supervised scenarios, SEMIDIM, which employs unlabeled samples for learning, delivers stronger performance than DIM, which only uses labeled data. Moreover, SEMIDIM outperforms both SELFTRAIN and BACKBOOST, the two semi-supervised learning baselines. This is attributed to SEMIDIM's strategy of re-weighting pseudo samples based on the learning signals, which are indicative of their quality, whereas SELFTRAIN and BACKBOOST treat all pseudo samples equally during learning. Additionally, we study the effect of pre-training on CONALA. As can be seen in Table 4, pre-training further improves the performance of SUPER and DIM on both code generation and summarization.

Model Analysis. Here we study whether DIM helps enhance the lower bounds of the expected joint distributions of NL and MRs. Specifically, lower bounds are calculated as in Eqs. 6 and 9 on the full training set for SUPER and DIM. As displayed in Fig. 4, DIM better optimizes the lower bounds of both the parser and the generator, with significantly higher average lower bounds on the full data. These results further suggest that when the lower bound of the primal model is improved, it produces high-quality learning signals for the dual model, leading to better performance on both tasks. As conjectured above, SEMIDIM outperforms SELFTRAIN in almost all setups because SEMIDIM re-weights the pseudo data with learning signals from the dual model. To demonstrate this, given the gold labels for the unlabeled corpus, we rank the learning signal of the gold label among the sampled set using the semi-trained model; e.g., on ATIS, given an NL utterance x from the dataset used as the unlabeled corpus, we consider the position of the learning signal l(x, y*; φ) of the gold-standard sample among all samples {l(x, ŷ_i; φ) | ŷ_i ∈ S}. As seen in Fig. 5, the gold candidates are almost always top-ranked, indicating that SEMIDIM is effective at separating pseudo samples of high and low quality.

Ablation Study.
We conduct ablation studies by training DIM with the parameters of the parser or the generator frozen. The results are presented in Table 5. As anticipated, for both parsing and generation, when the dual model is frozen, the performance of the primal model degrades. This again demonstrates DIM's effectiveness in jointly optimizing both tasks. Intuitively, jointly updating both the primal and dual models allows a better-learned dual model to provide high-quality learning signals, leading to an improved lower bound for the primal model. Consequently, freezing the parameters of the dual model hurts the quality of the learning signals, which in turn affects the learning of the primal model.
Effect of λ. λ controls the trade-off between the supervised objectives and the dual-information and unsupervised objectives. Fig. 6 shows that the best model performance is obtained when λ is within 0.1 ∼ 1. When λ is set to 0, joint training only employs labeled samples, and performance decreases significantly. A minor drop is observed at λ = 0.01, which we attribute to the variance of the learning signals derived from the REINFORCE algorithm.

Correlation between Parser and Generator. We further study the performance correlation between the coupled parser and generator. Using the model outputs shown in Fig. 6, we run linear regressions of generator performance on parser performance, and observe a high correlation between them (Fig. 7).

Related Work
Semantic Parsing and NL Generation. Neural sequence-to-sequence models have achieved promising results on semantic parsing (Dong and Lapata, 2016; Jia and Liang, 2016; Dong and Lapata, 2018) and natural language generation (Iyer et al., 2016; Konstas et al., 2017; Hu et al., 2018). To better model structured MRs, tree structures and more complicated graphs have been explored for both parsing and generation (Dong and Lapata, 2016; Rabinovich et al., 2017; Yin and Neubig, 2017; Song et al., 2018; Cheng et al., 2017; Alon et al., 2018). Semi-supervised learning has been widely studied for semantic parsing (Yin et al., 2018b; Jia and Liang, 2016). Similar to our work, Chen and Zhou (2018) and Allamanis et al. (2015) study code retrieval and code summarization jointly to enhance both tasks. Here, we focus on the more challenging task of code generation instead of retrieval, and we also target general-purpose MRs.

Joint Learning in NLP. There has been growing interest in leveraging related NLP problems to enhance primal tasks (Collobert et al., 2011; Peng et al., 2017), e.g. sequence tagging (Collobert et al., 2011), dependency parsing (Peng et al., 2017), and discourse analysis. Among those, multi-task learning (MTL) (Ando and Zhang, 2005) is a common method for joint learning, especially for neural networks where parameter sharing is utilized for representation learning. We follow the recent work on dual learning (Xia et al., 2017) to train dual tasks, where their interactions can be employed to enhance both models. Dual learning has been successfully applied to NLP and computer vision problems, such as neural machine translation (He et al., 2016), question generation and answering (Tang et al., 2017), and image-to-image translation (Yi et al., 2017; Zhu et al., 2017). Different from Xia et al. (2017), which minimizes the divergence between the two expected joint distributions, we aim to learn the expected distributions in a way similar to distribution matching (Gan et al., 2017).
Furthermore, our method can be extended to the semi-supervised scenario, in contrast to Xia et al. (2017)'s approach, which can only be applied in the supervised setup. We deduce the variational lower bounds of the expected distributions via information maximization (Barber and Agakov, 2003); DIM optimizes the dual information rather than two separate mutual information terms studied in prior work.

Conclusion
In this work, we propose to jointly train a semantic parser and an NL generator by exploiting the structural connection between them. We introduce DIM to exploit the duality and provide a principled way to optimize the dual information. We further extend supervised DIM to the semi-supervised scenario (SEMIDIM). Extensive experiments demonstrate the effectiveness of the proposed methods.
To overcome the scarcity of high-quality labeled corpora for semantic parsing, automatically mined datasets have been proposed, e.g. CONALA (Yin et al., 2018a) and STAQC (Yao et al., 2018). However, these datasets are noisy, and it is hard to train robust models on them. In the future, we will further apply DIM to learning semantic parsers and NL generators from such noisy datasets.

A Model Details for the Parser and Generator
The parser and the generator share the same seq2seq framework; we take the parser as the example. Given the NL utterance x and the linearized MR y, we use a bi-LSTM to encode x into context vectors, and then an LSTM decoder generates y from the context vectors. The parser p_θ(y|x) is formulated as:

    p_θ(y|x) = Π_t p(y_t | y_<t, x)

where y_<t = y_1 · · · y_{t−1}. The hidden state vector at time t from the encoder is the concatenation of the forward hidden vector →h_t and the backward one ←h_t: h_t = [→h_t; ←h_t]. With the LSTM unit f_LSTMe of the encoder, we have →h_t = f_LSTMe(x_t, →h_{t−1}) and ←h_t = f_LSTMe(x_t, ←h_{t+1}). On the decoder side, using the decoder LSTM unit f_LSTMd, the hidden state vector at time t is s_t = f_LSTMd(y_{t−1}, s_{t−1}). The global attention mechanism (Luong et al., 2015) is applied to obtain the context vector c_t at time t:

    c_t = Σ_i α_{t,i} h_i

where α_{t,i} is the attention weight, specified as:

    α_{t,i} = exp(s_t^T W_att h_i) / Σ_j exp(s_t^T W_att h_j)

where W_att are learnable parameters. At time t, with hidden state s_t in the decoder and context vector c_t from the encoder, the prediction probability for y_t is:

    p_vocab(y_t | y_<t, x) = f_softmax(W_d1 · tanh(W_d2 [s_t; c_t]))

where W_d1 and W_d2 are learnable parameters.
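The attention computation above can be sketched numerically. A minimal version, assuming a plain dot-product score in place of the learned bilinear form s_t^T W_att h_i (dropping W_att is our simplification):

```python
import math

def global_attention(s_t, hs):
    """Luong-style global attention with a dot-product score:
    alpha_{t,i} = softmax_i(s_t . h_i), c_t = sum_i alpha_{t,i} h_i.
    s_t is the decoder state; hs are the encoder hidden states."""
    scores = [sum(a * b for a, b in zip(s_t, h)) for h in hs]
    m = max(scores)                          # stabilize the softmax
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    c_t = [sum(al * h[j] for al, h in zip(alphas, hs))
           for j in range(len(hs[0]))]
    return alphas, c_t
```

The weights sum to one, and encoder states more aligned with the decoder state receive larger weights in the context vector c_t.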
We further apply the pointer network (Vinyals et al., 2015) to copy tokens from the input to alleviate the out-of-vocabulary (OOV) issue. We adopt the calculation flow of the copying mechanism from Yin et al. (2018b); readers can refer to that paper for further details.

B Marginal Distributions
To estimate the marginal distributions p(x) and q(y), we learn LSTM language models over the NL utterances and the linearized MRs. Given the NL x = {x_i}^{|x|}_{i=1}, the learning objective is:

    max Σ_{i=1}^{|x|} log p(x_i | x_<i)

where x_<i = x_1 · · · x_{i−1}. At time t, the probability of predicting x_t is:

    p(x_t | x_<t) = f_softmax(W · h_t)

where W is a learnable output projection and h_t is estimated with the LSTM network:

    h_t = f_LSTM(x_{t−1}, h_{t−1})

The same marginal distribution estimation for NL is also applied to linearized MRs.
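To make the role of the marginal model concrete, here is a count-based bigram language model used as a simple stand-in for the paper's LSTM language model; both play the same role of scoring the marginal log p(x) of a token sequence (the add-one smoothing and `<s>` token are our assumptions):

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Add-one-smoothed bigram LM: returns a callable scoring the
    marginal log-probability of a whitespace-tokenized sentence."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        uni.update(toks[:-1])
        bi.update(zip(toks[:-1], toks[1:]))
    vocab = len(uni)
    def logprob(sent):
        toks = ["<s>"] + sent.split()
        return sum(math.log((bi[(a, b)] + 1) / (uni[a] + vocab))
                   for a, b in zip(toks[:-1], toks[1:]))
    return logprob

lm = train_bigram_lm(["list flights", "list fares"])
```

As expected, sequences resembling the training corpus score higher than implausible orderings, which is exactly the signal q(y) and p(x) contribute inside the learning signal l(x, y; φ).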

C.1 Marginal Distribution
We pre-train the language models on the full training set before joint learning, and the language models are fixed in the subsequent experiments. The embedding size is selected from {128, 256} and the hidden size is tuned from {256, 512}, both evaluated on the validation set. We use SGD to update the models. Early stopping is applied: training stops if the perplexity does not decrease for 5 consecutive evaluations.

C.2 Model Configuration
To conduct joint learning with DIM and SEMIDIM, we first train the parser and the generator separately, referred to as the method SUPER.
To pre-train the parser and generator, we tune the embedding size from {125, 150, 256} and the hidden size from {256, 300, 512}. The batch size is selected from {10, 16}, varying over the datasets. Early stopping is applied with a patience of 5. The initial learning rate is 0.001, and Adam is adopted to optimize the models. The parser and generator are trained until convergence.
After pre-training, we conduct joint learning based on the pre-trained parser and generator. The learning rate is reduced to 0.00025. The beam size for sampling is tuned from {3, 5}.