Semantic Parsing with Dual Learning

Semantic parsing converts natural language queries into structured logical forms. The scarcity of training data remains one of the most serious problems in this area. In this work, we develop a semantic parsing framework with the dual learning algorithm, which enables a semantic parser to make full use of data (labeled and even unlabeled) through a dual-learning game. This game between a primal model (semantic parsing) and a dual model (logical form to query) forces them to regularize each other, and can obtain feedback signals from prior knowledge. By utilizing prior knowledge of logical form structures, we propose a novel reward signal at the surface and semantic levels, which encourages the parser to generate complete and reasonable logical forms. Experimental results show that our approach achieves new state-of-the-art performance on the ATIS dataset and competitive performance on the OVERNIGHT dataset.


Introduction
Semantic parsing is the task of mapping a natural language query into a logical form (Zelle and Mooney, 1996;Wong and Mooney, 2007;Zettlemoyer and Collins, 2007;Lu et al., 2008;Zettlemoyer and Collins, 2005). A logical form is one type of meaning representation understood by computers, which usually can be executed by an executor to obtain the answers.
The successful application of recurrent neural networks (RNN) in a variety of NLP tasks (Bahdanau et al., 2014; Sutskever et al., 2014; Vinyals et al., 2015) has provided strong impetus to treat semantic parsing as a sequence-to-sequence (Seq2seq) problem (Jia and Liang, 2016; Dong and Lapata, 2016). This approach generates a logical form based on the input query in an end-to-end manner but still leaves two main issues: (1) lack of labeled data and (2) constrained decoding.
Firstly, semantic parsing relies on sufficient labeled data. However, data annotation for semantic parsing is labor-intensive and time-consuming. In particular, logical forms are difficult for humans to annotate.
Secondly, unlike natural language sentences, a logical form is strictly structured. For example, the lambda expression of "show flight from ci0 to ci1" is ( lambda $0 e ( and ( from $0 ci0 ) ( to $0 ci1 ) ( flight $0 ) ) ). If we make no constraint on decoding, the generated logical form may be invalid or incomplete at surface and semantic levels.
Surface The generated sequence should be structured as a complete logical form. For example, left and right parentheses should be matched to force the generated sequence to be a valid tree.
Semantic Although the generated sequence is a legal logical form at surface level, it may be meaningless or semantically ill-formed. For example, the predefined binary predicate from takes no more than two arguments. The first argument must represent a flight and the second argument should be a city.
To avoid producing incomplete or semantically ill-formed logical forms, the output space must be constrained.
In this paper, we introduce a semantic parsing framework (see Figure 1) by incorporating dual learning (He et al., 2016) to tackle the problems mentioned above. In this framework, we have a primal task (query to logical form) and a dual task (logical form to query). They can form a closed loop, and generate informative feedback signals to train the primal and dual models even without supervision. In this loop, the primal and dual models restrict or regularize each other by generating intermediate output in one model and then checking it in the other. Actually, it can be viewed as a method of data augmentation. Thus it can leverage unlabeled data (queries or synthesized logical forms) in a more effective way which helps alleviate the problem of lack of annotated data.
In the dual learning framework, the primal and dual models are represented as two agents and they teach each other through a reinforcement learning process. To force the generated logical form to be complete and well-formed, we propose a validity reward that checks the output of the primal model at the surface and semantic levels.
We evaluate our approach on two standard datasets: ATIS and OVERNIGHT. The results show that our method obtains significant improvements over strong baselines on both datasets with fully labeled data, and even outperforms state-of-the-art results on ATIS. With additional logical forms synthesized from rules or templates, our method is competitive with state-of-the-art systems on OVERNIGHT. Furthermore, our method is compatible with various semantic parsing models. We also conduct extensive experiments in semi-supervised settings to understand why the framework works.
The main contributions of this paper are summarized as follows:
• An innovative semantic parsing framework based on dual learning is introduced, which can fully exploit data (labeled or unlabeled) and incorporate various kinds of prior knowledge as feedback signals. To the best of our knowledge, we are the first to introduce dual learning in semantic parsing.
• We further propose a novel validity reward focusing on the surface and semantics of logical forms, which is a feedback signal indicating whether the generated logical form is well-formed. It involves the prior-knowledge about structures of logical forms predefined in a domain.
• We conduct extensive experiments on the ATIS and OVERNIGHT benchmarks. The results show that our method achieves new state-of-the-art performance (test accuracy 89.1%) on the ATIS dataset and competitive performance on the OVERNIGHT dataset.

Primal and Dual Tasks of Semantic Parsing
Before discussing the dual learning algorithm for semantic parsing, we first present the primal and dual tasks in detail. Both tasks are modeled with attention-based Encoder-Decoder architectures (i.e. Seq2seq), which have been successfully applied in neural semantic parsing (Jia and Liang, 2016; Dong and Lapata, 2016). We also include a copy mechanism (Gulcehre et al., 2016; See et al., 2017) to tackle unknown tokens.

Primal Task
The primal task is semantic parsing, which converts queries into logical forms (Q2LF). Let $x = x_1 \cdots x_{|x|}$ denote the query, and $y = y_1 \cdots y_{|y|}$ denote the logical form. An encoder encodes the query $x$ into vector representations, and a decoder learns to generate the logical form $y$ depending on the encoding vectors.
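Encoder Each word $x_i$ is mapped to a fixed-dimensional vector by a word embedding function $\psi(\cdot)$ and then fed into a bidirectional LSTM (Hochreiter and Schmidhuber, 1997). The hidden vectors are recursively computed at the i-th time step via:

$$\overrightarrow{h}_i = f_{\mathrm{LSTM}}(\psi(x_i), \overrightarrow{h}_{i-1}), \quad i = 1, \dots, |x|$$
$$\overleftarrow{h}_i = f_{\mathrm{LSTM}}(\psi(x_i), \overleftarrow{h}_{i+1}), \quad i = |x|, \dots, 1$$
$$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$$

where $[\cdot; \cdot]$ denotes vector concatenation, $h_i \in \mathbb{R}^{2n}$, $n$ is the number of hidden cells and $f_{\mathrm{LSTM}}$ is the LSTM function.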
Decoder The decoder is a unidirectional LSTM with the attention mechanism (Luong et al., 2015). The hidden vector at the t-th time step is computed by $s_t = f_{\mathrm{LSTM}}(\phi(y_{t-1}), s_{t-1})$, where $\phi(\cdot)$ is the token embedding function for logical forms and $s_t \in \mathbb{R}^{n}$. The hidden vector of the first time step is initialized as $s_0 = \overleftarrow{h}_1$. The attention weight for the current step $t$ of the decoder, with the i-th step in the encoder, is

$$u_i^t = v^{\top} \tanh(W_1 h_i + W_2 s_t + b_a)$$
$$a_i^t = \frac{\exp(u_i^t)}{\sum_{j=1}^{|x|} \exp(u_j^t)}$$

where $v, b_a \in \mathbb{R}^{n}$, and $W_1 \in \mathbb{R}^{n \times 2n}$, $W_2 \in \mathbb{R}^{n \times n}$ are parameters. Then we compute the vocabulary distribution $P_{gen}(y_t | y_{<t}, x)$ by

$$c_t = \sum_{i=1}^{|x|} a_i^t h_i$$
$$P_{gen}(y_t | y_{<t}, x) = \mathrm{softmax}(W_o [s_t; c_t] + b_o)$$

where $W_o \in \mathbb{R}^{|V_y| \times 3n}$, $b_o \in \mathbb{R}^{|V_y|}$ and $|V_y|$ is the output vocabulary size. Generation ends once an end-of-sequence token "EOS" is emitted.

Figure 1: An overview of the dual semantic parsing framework. The primal model (Q2LF) and dual model (LF2Q) can form a closed cycle. But there are two different directed loops, depending on whether they start from a query or a logical form. Validity reward is used to estimate the quality of the middle generation output, and reconstruction reward is exploited to avoid information loss. The primal and dual models can be pre-trained and fine-tuned with labeled data to keep the models effective.

Copy Mechanism We also include a copy mechanism to improve model generalization, following the implementation of See et al. (2017), a hybrid between Nallapati et al. (2016) and the pointer network (Gulcehre et al., 2016). The predicted token comes from either a fixed output vocabulary $V_y$ or the raw input words $x$. We use a sigmoid gate function $\sigma$ to make a soft decision between generation and copy at each step $t$:

$$g_t = \sigma(v_g^{\top} [s_t; c_t; \phi(y_{t-1})] + b_g)$$
$$P(y_t | y_{<t}, x) = g_t P_{gen}(y_t | y_{<t}, x) + (1 - g_t) P_{copy}(y_t | y_{<t}, x) \quad (8)$$

where $g_t \in [0, 1]$ is the balance score, $v_g$ is a weight vector and $b_g$ is a scalar bias. The distribution $P_{copy}(y_t | y_{<t}, x)$ is described as follows.

Entity Mapping Although the copy mechanism can deal with unknown words, many raw words cannot be directly copied to be part of a logical form. For example, kobe bryant is represented as en.player.kobe_bryant in OVERNIGHT (Wang et al., 2015). It is common that entities are identified by a Uniform Resource Identifier (URI, Klyne and Carroll, 2006) in a knowledge base. Thus, a mapping from raw words to URIs is included after copying. Mathematically, $P_{copy}$ in Eq. 8 is calculated by:

$$P_{copy}(y_t | y_{<t}, x) = \sum_{i,j:\, KB(x_{i:j}) = y_t} \; \sum_{k=i}^{j} a_k^t$$

where $i < j$, $a_k^t$ is the attention weight of position $k$ at decoding step $t$, and $KB(\cdot)$ is a dictionary-like function mapping a specific noun phrase to the corresponding entity token in the vocabulary of logical forms.
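As a rough illustration of how copying with entity mapping combines with generation, the sketch below computes a copy distribution from attention weights and mixes it with a generation distribution through a gate. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the `max_span` limit, the toy KB dictionary and the decision to also allow single-word spans are all assumptions made for illustration, and the copy mass is not renormalized.

```python
from typing import Dict, List, Tuple


def copy_distribution(
    words: List[str],
    attention: List[float],
    kb: Dict[Tuple[str, ...], str],
    max_span: int = 3,  # hypothetical cap on noun-phrase length
) -> Dict[str, float]:
    """Every input span x[i..j] that the KB maps to an output token
    contributes the attention mass of the positions it covers."""
    p_copy: Dict[str, float] = {}
    n = len(words)
    for i in range(n):
        for j in range(i, min(i + max_span, n)):
            token = kb.get(tuple(words[i : j + 1]))
            if token is not None:
                p_copy[token] = p_copy.get(token, 0.0) + sum(attention[i : j + 1])
    return p_copy


def mix_generate_and_copy(
    p_gen: Dict[str, float], p_copy: Dict[str, float], gate: float
) -> Dict[str, float]:
    """Soft decision: gate * P_gen + (1 - gate) * P_copy for every token."""
    tokens = set(p_gen) | set(p_copy)
    return {
        t: gate * p_gen.get(t, 0.0) + (1.0 - gate) * p_copy.get(t, 0.0)
        for t in tokens
    }


if __name__ == "__main__":
    words = ["games", "by", "kobe", "bryant"]
    attention = [0.1, 0.1, 0.5, 0.3]  # made-up attention weights
    kb = {("kobe", "bryant"): "en.player.kobe_bryant"}
    p_copy = copy_distribution(words, attention, kb)
    print(mix_generate_and_copy({"(": 0.6, "player": 0.4}, p_copy, gate=0.25))
```

In this toy run the two attended positions of the noun phrase pool their attention mass onto the single entity token, which is the effect the entity mapping is meant to achieve.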

Dual Model
The dual task (LF2Q) is the inverse of the primal task: it aims to generate a natural language query given a logical form. We can also exploit the attention-based Encoder-Decoder architecture (with or without the copy mechanism) to build the dual model.
Reverse Entity Mapping Unlike the primal task, we reversely map every possible KB entity $y_t$ of a logical form to the corresponding noun phrase, $KB^{-1}(y_t)$, before query generation. Since each KB entity may have multiple aliases in the real world, e.g. kobe bryant has the nickname the black mamba, we make different selections in two cases:
• For paired data, we select the noun phrase from $KB^{-1}(y_t)$ that exists in the query.
• For unpaired data, we randomly select one from $KB^{-1}(y_t)$.
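The two selection cases could be sketched as follows. This is an illustrative sketch only: the function name `choose_alias`, the alias table, the substring test used to decide whether a phrase "exists in the query", and the fallback to random choice when no alias matches are assumptions, not details given in the text.

```python
import random
from typing import Dict, List, Optional


def choose_alias(
    entity: str,
    aliases: Dict[str, List[str]],  # hypothetical inverse KB: entity -> noun phrases
    query: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the noun phrase that replaces a KB entity before query generation."""
    candidates = aliases[entity]
    if query is not None:
        # Paired data: prefer the alias that actually occurs in the query.
        for phrase in candidates:
            if phrase in query:
                return phrase
    # Unpaired data (or no alias found, an assumed fallback): pick at random.
    return (rng or random).choice(candidates)


if __name__ == "__main__":
    table = {"en.player.kobe_bryant": ["kobe bryant", "the black mamba"]}
    print(choose_alias("en.player.kobe_bryant", table,
                       query="how many points did the black mamba score"))
    print(choose_alias("en.player.kobe_bryant", table, rng=random.Random(0)))
```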

Dual learning for Semantic Parsing
In this section, we present a semantic parsing framework with dual learning. We use one agent to represent the model of the primal task (Q2LF) and another agent to represent the model of the dual task (LF2Q), then design a two-agent game in a closed loop which can provide quality feedback to the primal and dual models even if only queries or logical forms are available. As the feedback reward may be non-differentiable, a reinforcement learning algorithm (Sutton and Barto, 2018) based on policy gradient (Sutton et al., 2000) is applied for optimization. The two agents, Q2LF and LF2Q, participate in the collaborative game with two directed loops as illustrated in Figure 1.
One loop, query->logical_form->query, starts from a query, generates possible logical forms by agent Q2LF and tries to reconstruct the original query by LF2Q.
The other loop logical_form->query->logical_form starts from the opposite side. Each agent will obtain quality feedback depending on reward functions defined in the directed loops.

Learning algorithm
Suppose we have a fully labeled dataset $T = \{\langle x, y \rangle\}$, an unlabeled dataset $Q$ with only queries if available, and an unlabeled dataset $LF$ with only logical forms if available. We first pre-train the primal model Q2LF and the dual model LF2Q on $T$ by maximum likelihood estimation (MLE). Let $\Theta_{Q2LF}$ and $\Theta_{LF2Q}$ denote all the parameters of Q2LF and LF2Q respectively. Our learning algorithm in each iteration consists of three parts:

Loop starts from a query
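As shown in Figure 1 (a), we sample a query $x$ from $Q \cup T$ randomly. Given $x$, the Q2LF model generates $k$ possible logical forms $y_1, y_2, \cdots, y_k$ via beam search ($k$ is the beam size). For each $y_i$, we obtain a validity reward $R^{val}_q(y_i)$ (a scalar) computed by a specific reward function which will be discussed in Section 3.2.1. After feeding $y_i$ into LF2Q, we finally get a reconstruction reward $R^{rec}_q(x, y_i)$, which forces the generated query to be as similar to $x$ as possible and will be discussed in Section 3.2.2.
A hyper-parameter $\alpha$ is exploited to balance these two rewards in

$$r_i^q = \alpha R^{val}_q(y_i) + (1 - \alpha) R^{rec}_q(x, y_i).$$

By utilizing policy gradient (Sutton et al., 2000), the stochastic gradients of $\Theta_{Q2LF}$ and $\Theta_{LF2Q}$ are computed as:

$$\nabla_{\Theta_{Q2LF}} \hat{E}[r] = \frac{1}{k} \sum_{i=1}^{k} r_i^q \, \nabla_{\Theta_{Q2LF}} \log P(y_i | x; \Theta_{Q2LF})$$
$$\nabla_{\Theta_{LF2Q}} \hat{E}[r] = \frac{1 - \alpha}{k} \sum_{i=1}^{k} \nabla_{\Theta_{LF2Q}} \log P(x | y_i; \Theta_{LF2Q})$$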

Loop starts from a logical form
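As shown in Figure 1 (b), we sample a logical form $y$ from $LF \cup T$ randomly. Given $y$, the LF2Q model generates $k$ possible queries $x_1, x_2, \cdots, x_k$ via beam search. For each $x_i$, we obtain a validity reward $R^{val}_{lf}(x_i)$ (a scalar) which will also be discussed in Section 3.2.1. After feeding $x_i$ into Q2LF, we also get a reconstruction reward $R^{rec}_{lf}(y, x_i)$, which forces the generated logical form to be as similar to $y$ as possible and will be discussed in Section 3.2.2.
A hyper-parameter $\beta$ is also exploited to balance these two rewards by

$$r_i^{lf} = \beta R^{val}_{lf}(x_i) + (1 - \beta) R^{rec}_{lf}(y, x_i).$$

By utilizing policy gradient, the stochastic gradients of $\Theta_{Q2LF}$ and $\Theta_{LF2Q}$ are computed as:

$$\nabla_{\Theta_{LF2Q}} \hat{E}[r] = \frac{1}{k} \sum_{i=1}^{k} r_i^{lf} \, \nabla_{\Theta_{LF2Q}} \log P(x_i | y; \Theta_{LF2Q})$$
$$\nabla_{\Theta_{Q2LF}} \hat{E}[r] = \frac{1 - \beta}{k} \sum_{i=1}^{k} \nabla_{\Theta_{Q2LF}} \log P(y | x_i; \Theta_{Q2LF})$$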
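Both loops share the same pattern: mix the two rewards with a balance weight, then use the mixed reward to weight the log-likelihood gradient of the model that generated the candidates, while the reconstructing model is updated through the reconstruction term alone. The sketch below computes only these per-candidate weights, assuming the standard policy-gradient estimate averaged over the k beam candidates. The function name and the example numbers are hypothetical; in a real system the weights would multiply gradients of log-probabilities produced by an autodiff framework.

```python
from typing import List, Tuple


def loop_weights(
    validity: List[float],        # validity reward of each of the k candidates
    reconstruction: List[float],  # reconstruction reward (log-likelihood) of each
    balance: float,               # alpha for the query loop, beta for the LF loop
) -> Tuple[List[float], List[float]]:
    """Return the gradient weights for (generating model, reconstructing model)."""
    k = len(validity)
    rewards = [
        balance * v + (1.0 - balance) * r
        for v, r in zip(validity, reconstruction)
    ]
    # Generator: weight r_i / k on grad log P(candidate_i | input).
    generator_weights = [r / k for r in rewards]
    # Reconstructor: weight (1 - balance) / k on grad log P(input | candidate_i).
    reconstructor_weights = [(1.0 - balance) / k] * k
    return generator_weights, reconstructor_weights


if __name__ == "__main__":
    gen_w, rec_w = loop_weights([1.0, 0.0], [-2.0, -4.0], balance=0.5)
    print(gen_w, rec_w)
```

Because the reconstruction reward is a log-likelihood, the mixed reward is often negative; the well-formed candidate in the example still receives the less negative weight.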

Supervisor guidance
The previous two stages are unsupervised learning processes, which need no labeled data. If there is no supervision for the primal and dual models after pre-training, the two models may degrade, especially when $T$ is limited.
To keep the learning process stable and prevent the models from collapsing, we randomly select samples from $T$ to fine-tune the primal and dual models by maximum likelihood estimation (MLE). Details about the dual learning algorithm for semantic parsing are provided in Appendix A.
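The three parts of one iteration could be arranged as in the skeleton below. It only fixes the order of operations and the sampling pools; the three callables (`query_loop`, `lf_loop`, `mle_step`) are hypothetical placeholders for the reinforcement-learning updates of the two loops and the supervised update, and sampling a single example per part is a simplification of mini-batch training.

```python
import random
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (query, logical form)


def dual_learning_iteration(
    labeled: Sequence[Pair],
    queries: Sequence[str],
    logical_forms: Sequence[str],
    query_loop: Callable[[str], None],  # hypothetical: RL update starting from a query
    lf_loop: Callable[[str], None],     # hypothetical: RL update starting from a logical form
    mle_step: Callable[[Pair], None],   # hypothetical: supervised update of both models
    rng: random.Random,
) -> None:
    # Part 1: loop starting from a query sampled from Q ∪ T.
    query_pool = list(queries) + [x for x, _ in labeled]
    query_loop(rng.choice(query_pool))
    # Part 2: loop starting from a logical form sampled from LF ∪ T.
    lf_pool = list(logical_forms) + [y for _, y in labeled]
    lf_loop(rng.choice(lf_pool))
    # Part 3: supervisor guidance with a labeled pair from T.
    mle_step(rng.choice(list(labeled)))


if __name__ == "__main__":
    log = []
    dual_learning_iteration(
        [("q1", "lf1")], ["q2"], ["lf2"],
        lambda x: log.append(("query_loop", x)),
        lambda y: log.append(("lf_loop", y)),
        lambda p: log.append(("mle", p)),
        random.Random(0),
    )
    print(log)
```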

Reward design
As mentioned in Section 3.1, there are two types of reward functions in each loop: validity reward ($R^{val}_q$, $R^{val}_{lf}$) and reconstruction reward ($R^{rec}_q$, $R^{rec}_{lf}$). Each type of reward function may take a different form in each loop.

Validity reward
Validity reward is used to evaluate the quality of intermediate outputs in a loop (see Figure 1). In the loop that starts from a query, it indicates whether the generated logical forms are well-formed at the surface and semantic levels. In the loop that starts from a logical form, it indicates how natural and fluent the intermediate queries are.
Loop starts from a query: We estimate the quality of the generated logical forms at two levels, i.e. surface and semantics. First, we check whether the logical form is a complete tree without mismatched parentheses. As for semantics, we check whether the logical form is understandable without errors such as type inconsistency. It can be formulated as

$$R^{val}_q(y) = \mathrm{grammar\_error\_indicator}(y) \quad (9)$$

which returns 1 when $y$ has no error at the surface and semantic levels, and returns 0 otherwise. If an executing program or search engine exists for logical form $y$, e.g. for the OVERNIGHT dataset (Wang et al., 2015), grammar_error_indicator(·) is already included in it.
Otherwise, we should construct a grammar error indicator based on the ontology of the corresponding dataset. For example, a specification of ATIS can be extracted by clarifying all (1) entities paired with corresponding types and (2) unary/binary predicates with argument constraints (see Table 1). Accordingly, Algorithm 1 abstracts the procedure of checking the surface and semantics for a logical form candidate $y$ based on the specification; it returns 1 if $y$ passes both checks and 0 otherwise.
Loop starts from a logical form: We apply length-normalization (Wu et al., 2016) to the language model score to make a fair competition between short and long queries:

$$R^{val}_{lf}(x) = \frac{\log LM_q(x)}{\mathrm{Length}(x)} \quad (10)$$

where $LM_q(\cdot)$ is a language model pre-trained on all the queries of $Q \cup T$ (referred to in Section 3.1).
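The two validity rewards could be sketched as follows. The first function is a toy surface and semantic check for lambda-expression logical forms; the second is a length-normalized language model score. Everything specific here is an assumption made for illustration: the small `ENTITY_TYPES` and `PREDICATE_ARGS` tables are hypothetical stand-ins for a dataset specification (they are not Table 1), variables such as `$0` are left untyped, and the normalization simply divides the summed token log-probabilities by the number of tokens.

```python
from typing import List, Union

Tree = Union[str, List["Tree"]]

# Hypothetical toy specification, not the actual ATIS ontology.
ENTITY_TYPES = {"ci0": "city", "ci1": "city", "al0": "airline"}
PREDICATE_ARGS = {
    "flight": ("flight",),
    "from": ("flight", "city"),
    "to": ("flight", "city"),
}


def parse(tokens: List[str]) -> Tree:
    """Surface check: build a nested list, or raise ValueError if the
    tokens do not form exactly one complete, balanced tree."""
    stack: List[List[Tree]] = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unmatched right parenthesis")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    if len(stack) != 1 or len(stack[0]) != 1 or not isinstance(stack[0][0], list):
        raise ValueError("not a single complete tree")
    return stack[0][0]


def well_typed(tree: Tree) -> bool:
    """Semantic check: known predicates need the right arity, and entity
    arguments need the expected type (variables are not checked here)."""
    if isinstance(tree, str):
        return True
    if not tree:
        return False
    head, args = tree[0], tree[1:]
    if isinstance(head, str) and head in PREDICATE_ARGS:
        expected = PREDICATE_ARGS[head]
        if len(args) != len(expected):
            return False
        for arg, typ in zip(args, expected):
            if isinstance(arg, str) and arg in ENTITY_TYPES and ENTITY_TYPES[arg] != typ:
                return False
    return all(well_typed(child) for child in tree)


def grammar_error_indicator(logical_form: str) -> int:
    """Return 1 if the logical form passes both checks, 0 otherwise."""
    try:
        tree = parse(logical_form.split())
    except ValueError:
        return 0
    return 1 if well_typed(tree) else 0


def length_normalized_lm_reward(token_log_probs: List[float]) -> float:
    """Validity reward for a query: LM log-probability divided by its length."""
    return sum(token_log_probs) / len(token_log_probs)


if __name__ == "__main__":
    ok = "( lambda $0 e ( and ( from $0 ci0 ) ( to $0 ci1 ) ( flight $0 ) ) )"
    print(grammar_error_indicator(ok))
    print(grammar_error_indicator("( from $0 ci0 ci1 )"))
```

A hard 0/1 check of this kind needs no training data, whereas the language model score depends on how well the language model has been trained.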

Reconstruction reward
Reconstruction reward is used to estimate how similar the output of one loop is to its input. We take the log likelihood as the reconstruction reward for both the loop that starts from a query and the loop that starts from a logical form. Thus,

$$R^{rec}_q(x, y_i) = \log P(x | y_i; \Theta_{LF2Q})$$
$$R^{rec}_{lf}(y, x_i) = \log P(y | x_i; \Theta_{Q2LF})$$

where $y_i$ and $x_i$ are intermediate outputs.
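Since the reconstruction reward is a log likelihood, it decomposes over the tokens of the sequence being reconstructed. As a small worked example with made-up numbers (natural logarithm assumed): if the original query has three tokens and the LF2Q model, conditioned on the candidate $y_i$, assigns them probabilities 0.9, 0.8 and 0.5, then

```latex
\begin{aligned}
R^{rec}_q(x, y_i) &= \log P(x \mid y_i; \Theta_{LF2Q})
  = \sum_{t=1}^{3} \log P(x_t \mid x_{<t}, y_i; \Theta_{LF2Q}) \\
 &= \log 0.9 + \log 0.8 + \log 0.5 = \log 0.36 \approx -1.02 .
\end{aligned}
```

The reward is at most 0 and approaches 0 only when the dual model can recover the original query almost deterministically from the candidate.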

Experiment
In this section, we evaluate our framework on the ATIS and OVERNIGHT datasets.

Dataset
ATIS We use the preprocessed version provided by Dong and Lapata (2018), where natural language queries are lowercased and stemmed with NLTK (Loper and Bird, 2002), and entity mentions are replaced by numbered markers. Following Jia and Liang (2016), we also leverage an external lexicon that maps word phrases (e.g., first class) to entities (e.g., first:cl).
OVERNIGHT It contains natural language paraphrases paired with logical forms across eight domains. We follow the traditional 80%/20% train/valid splits as Wang et al. (2015) to choose the best model during training.
Neither ATIS nor OVERNIGHT provides unlabeled queries. To test our method in semi-supervised learning, we keep a part of the training set as fully labeled data and treat the rest as unpaired queries and logical forms, which simulate unlabeled data.
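One way to simulate this setting is sketched below: a fraction of the pairs stays labeled, and the remaining pairs are split into a pool of queries and a separately shuffled pool of logical forms so that their alignment is lost. The function name, the fixed seed and the rounding of the cut point are assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (query, logical form)


def simulate_semi_supervised(
    pairs: Sequence[Pair], labeled_ratio: float, seed: int = 0
) -> Tuple[List[Pair], List[str], List[str]]:
    """Split a labeled training set into T (labeled), Q (queries only)
    and LF (logical forms only)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(round(labeled_ratio * len(shuffled)))
    labeled, rest = shuffled[:cut], shuffled[cut:]
    queries = [x for x, _ in rest]
    logical_forms = [y for _, y in rest]
    rng.shuffle(logical_forms)  # break the pairing with `queries`
    return labeled, queries, logical_forms


if __name__ == "__main__":
    data = [(f"query {i}", f"( lf {i} )") for i in range(10)]
    T, Q, LF = simulate_semi_supervised(data, labeled_ratio=0.5)
    print(len(T), len(Q), len(LF))
```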

Synthesis of logical forms
Although most semantic parsing benchmarks provide no unlabeled queries, logical forms are easy to synthesize. Since a logical form is strictly structured, it can be modified from an existing one or created from simple grammars, which is much cheaper than collecting queries. Our synthesized logical forms are publicly available.

Modification based on ontology
On ATIS, we randomly sample a logical form from the training set, and select one entity or predicate for replacement according to the specification in Table 1. If the new logical form after replacement is valid and has never been seen, it is added to the unsupervised set. In total, 4592 new logical forms are created for ATIS. An example is shown in Figure 2.
On OVERNIGHT, Wang et al. (2015) proposed an underlying grammar to generate logical forms along with their corresponding canonical utterances, which can be found in SEMPRE. We reorder the entity instances (e.g., ENTITYNP) of one type (e.g., TYPENP) in the grammar files to generate new logical forms. New entity instances could be included to obtain more unseen logical forms, but we did not do so in this work. Finally, we get about 500 new logical forms for each domain on average. More examples can be found in Appendix B.
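The ontology-based modification used for ATIS could be sketched roughly as follows. This simplified version replaces only entities (not predicates) and keeps validity by swapping an entity for another one of the same type; the function name and the small type table are hypothetical and do not reproduce the specification in Table 1.

```python
import random
from typing import Dict, List, Optional, Set


def modify_logical_form(
    tokens: List[str],
    entity_types: Dict[str, str],  # hypothetical spec: entity -> type
    seen: Set[str],                # logical forms already known
    rng: random.Random,
) -> Optional[str]:
    """Replace one entity by a different entity of the same type.
    Return the new logical form, or None if nothing new can be produced."""
    positions = [i for i, tok in enumerate(tokens) if tok in entity_types]
    if not positions:
        return None
    i = rng.choice(positions)
    same_type = [
        e for e, typ in entity_types.items()
        if typ == entity_types[tokens[i]] and e != tokens[i]
    ]
    if not same_type:
        return None
    new_tokens = list(tokens)
    new_tokens[i] = rng.choice(same_type)
    candidate = " ".join(new_tokens)
    # Keep only logical forms that have never been seen.
    return None if candidate in seen else candidate


if __name__ == "__main__":
    types = {"ci0": "city", "ci1": "city", "ci2": "city", "al0": "airline"}
    lf = "( lambda $0 e ( and ( flight $0 ) ( from $0 ci0 ) ) )".split()
    print(modify_logical_form(lf, types, seen=set(), rng=random.Random(0)))
```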

Experimental settings

Base models
We use 200 hidden units and 100-dimensional word vectors for all encoders and decoders of the Q2LF and LF2Q models. All LSTMs are single-layer. Word embeddings on the query side are initialized with Glove6B (Pennington et al., 2014). Out-of-vocabulary words are replaced with a special token ⟨unk⟩. Other parameters are initialized by uniformly sampling within the interval [−0.2, 0.2]. The language model is also a single-layer LSTM with 200 hidden units and a 100-dimensional word embedding layer.

Training and decoding
We individually pre-train the Q2LF and LF2Q models using only labeled data, and the language model $LM_q$ using both labeled and unlabeled queries. The language model is kept fixed when calculating the reward. The hyper-parameters $\alpha$ and $\beta$ are selected according to the validation set (0.5 is used), and the beam size $k$ is selected from {3, 5}. The batch size is selected from {10, 20}. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 for all experiments. Finally, we evaluate the primal model (Q2LF, semantic parsing) and report test accuracy on each dataset.

Results and analysis
We implement a PSEUDO baseline following the setup in Sennrich et al. (2016) and Guo et al. (2018). The pre-trained LF2Q or Q2LF model is used to generate pseudo ⟨query, logical form⟩ pairs from unlabeled logical forms or unlabeled queries, which extends the training set. When we train Q2LF by supervised training, the pseudo-labeled data is down-weighted with a discount factor (0.5) in the loss function (Lee, 2013).
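The discounting of pseudo-labeled data can be pictured as a weighted sum of per-example losses. The sketch below assumes the loss is a sum of negative log-likelihoods and that the discount multiplies the pseudo-labeled part as a whole; the function name is hypothetical and the exact form of the loss used for the baseline is not specified in the text.

```python
from typing import Sequence


def discounted_loss(
    labeled_nll: Sequence[float],  # negative log-likelihoods of gold pairs
    pseudo_nll: Sequence[float],   # negative log-likelihoods of pseudo pairs
    discount: float = 0.5,
) -> float:
    """Supervised loss in which pseudo-labeled examples count less."""
    return sum(labeled_nll) + discount * sum(pseudo_nll)


if __name__ == "__main__":
    print(discounted_loss([1.0, 2.0], [4.0]))
```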

Main results
The results are shown in Tables 2 and 3. ATT and ATTPTR denote that the primal/dual models are attention-based Seq2seq and attention-based Seq2seq with copy mechanism, respectively. We train models with the dual learning algorithm if DUAL is included; otherwise we only train the primal model by supervised training. LF refers to the synthesized logical forms. PSEUDO uses the LF2Q model and LF to generate pseudo-labeled data. From the overall results, we can see that:
1) Even without the additional synthesized logical forms, the dual learning based semantic parser can outperform our baselines with supervised training, e.g., "ATT + DUAL" performs much better than "ATT + PSEUDO(LF)" in Tables 2 and 3. We think the Q2LF and LF2Q models can teach each other in dual learning: one model sends informative signals to help regularize the other model. It can also be explained as a data augmentation procedure, e.g., Q2LF can generate samples utilized by LF2Q and vice versa. In contrast, PSEUDO greatly depends on the quality of pseudo-samples even if a discount factor is considered.
2) By involving the synthesized logical forms LF in the dual learning for each domain respectively, the performance is improved further. We achieve state-of-the-art performance (89.1%) on ATIS as shown in Table 3. On the OVERNIGHT dataset, we achieve a competitive performance on average (80.2%). The best average accuracy is from Su and Yan (2017), which benefits from cross-domain training. We believe our method could obtain further improvements with stronger primal models (e.g. with domain adaptation), since it is compatible with various models.
3) The copy mechanism remarkably improves accuracy on ATIS, but not on OVERNIGHT, where the average accuracy even decreases from 80.2% to 79.9% when the copy mechanism is used. We argue that the OVERNIGHT dataset contains so few distinct entities that copying is not essential, and it contains fewer training samples than ATIS. This phenomenon is also observed in Jia and Liang (2016).

Ablation study
Semi-supervised learning We randomly keep a part of the training set as labeled data T and leave the rest as unpaired queries (Q) and logical forms (LF) to validate our method in a semi-supervised setting. The ratio of labeled data is 50%. PSEUDO here uses the Q2LF model and Q to generate pseudo-labeled data, as well as the LF2Q model and LF. From Table 4, we can see that the dual learning method substantially outperforms the PSEUDO baseline on both datasets. The dual learning method exploits unlabeled data more effectively. In general, both unpaired queries and logical forms could boost the performance of semantic parsers with dual learning.
Table 4: Semi-supervised learning experiments. We keep 50% of the training set as labeled data randomly, and leave the rest as unpaired queries (Q) and logical forms (LF) to simulate an unsupervised dataset.
Different ratios To investigate the efficiency of our method in semi-supervised learning, we vary the ratio of labeled data kept on ATIS from 1% to 90%. In Figure 3, we can see that the dual learning strategy enhances semantic parsing over all proportions. The most prominent gap appears when the ratio is between 0.2 and 0.4. Generally, the more unlabeled data we have, the more remarkable the leap is. However, if the labeled data is really limited, less supervision can be exploited to keep the primal and dual models reasonable. For example, when the ratio of labeled data is only 1% to 10%, the improvement is less obvious.
Does more unlabeled data give better results? We also fix the ratio of labeled data at 30%, and vary the proportion of the remaining data that is used as unlabeled samples on ATIS, as illustrated in Figure 4. Even without unlabeled data (i.e. the ratio of unlabeled data is zero), the dual learning based semantic parser can outperform our baselines. However, the performance of our method does not improve consistently as the amount of unlabeled data increases. We think the power of the primal and dual models is constrained by the limited amount of labeled data. When some complex queries or logical forms come, the two models may converge to an equilibrium where the intermediate value loses some implicit semantic information, but the rewards are high.
Choice for validity reward We conduct another experiment by replacing the validity reward in Eq. 9 with a length-normalized LM score (i.e. a language model of logical forms) like Eq. 10. Results (Table 5) show that the "hard" surface/semantic check is more suitable than the "soft" probability of a logical form LM. We think that simple language models may suffer from long-dependency and data imbalance issues, and that it is hard for a sequential model to capture the inner structures of logical forms.
Table 5: Test accuracies on ATIS and OVERNIGHT in the semi-supervised learning setting (the ratio of labeled data is 50%). On OVERNIGHT, we average across all eight domains. LM lf means using a logical form language model for validity reward, while "grammar check" means using the surface and semantic check.

Related Work
Lack of data A semantic parser can be trained from labeled logical forms or weakly supervised samples (Krishnamurthy and Mitchell, 2012; Berant et al., 2013; Liang et al., 2017; Goldman et al., 2018). Yih et al. (2016) demonstrate that logical forms can be collected efficiently and are more useful than mere answers to queries. Wang et al. (2015) construct a semantic parsing dataset starting from grammar rules and then crowdsourcing paraphrases. Jia and Liang (2016) induce a synchronous context-free grammar (SCFG) and create new "recombinant" examples accordingly. Su and Yan (2017) use multiple source domains to reduce the cost of collecting data for the target domain. Guo et al. (2018) pre-train a question generation model to produce pseudo-labeled data as a supplement.
In this paper, we introduce dual learning to make full use of data (both labeled and unlabeled). A variational auto-encoding model has also been introduced for semi-supervised semantic parsing. Beyond semantic parsing, semi-supervised and adaptive learning are also typical in natural language understanding (Tur et al., 2005; Bapna et al., 2017; Zhu et al., 2014).
Constrained decoding To avoid invalid parses, additional restrictions must be considered in decoding. Dong and Lapata (2016) propose the SEQ2TREE method to ensure the matching of parentheses, which can generate syntactically valid output. Cheng et al. (2017) and Dong and Lapata (2018) both decode in two steps, hierarchically from a coarse sketch to a finer structure. Krishnamurthy et al. (2017) define a grammar of production rules such that only well-typed logical forms can be generated. Yin and Neubig (2017) and Chen et al. (2018a) both transform the generation of logical forms into query graph construction. Zhao et al. (2019) propose a hierarchical parsing model following the structure of semantic representations, which is predefined by domain developers. We introduce a validity reward at the surface and semantic levels in the dual learning algorithm as a constraint signal.
Dual learning The dual learning framework was first proposed to improve neural machine translation (NMT) (He et al., 2016). The primal and dual tasks are symmetric in NMT, but not in semantic parsing. The idea of dual learning has been applied in various tasks (Xia et al., 2017), such as Question Answering/Generation (Tang et al., 2017, 2018), Image-to-Image Translation (Yi et al., 2017) and Open-domain Information Extraction/Narration (Sun et al., 2018). To the best of our knowledge, we are the first to introduce dual learning in semantic parsing.

Conclusions and Future Work
In this paper, we develop a semantic parsing framework based on the dual learning algorithm, which enables a semantic parser to fully utilize labeled and even unlabeled data through a dual-learning game between the primal and dual models. We also propose a novel reward function at the surface and semantic levels by utilizing prior knowledge of logical form structures. Thus, the primal model tends to generate complete and reasonable semantic representations. Experimental results show that semantic parsing based on dual learning improves performance across datasets.
In the future, we want to incorporate this framework with more refined primal and dual models, and design more informative reward signals to make the training more efficient. It would be appealing to apply graph neural networks (Chen et al., 2018b, 2019) ...

Appendix A: dual learning algorithm for semantic parsing
Input:
 2: Pre-trained models on T: Q2LF model P(y|x; Θ_Q2LF), LF2Q model P(x|y; Θ_LF2Q);
 3: Pre-trained model on Q and queries of T: language model for queries LM_q;
 4: Indicator that performs the surface and semantic check for a logical form: grammar_error_indicator(·);
 5: Beam search size k, hyper-parameters α and β, learning rate η_1 for Q2LF and η_2 for LF2Q;
Output: Parameters Θ_Q2LF of the Q2LF model
 6: repeat
 7:   (The reinforcement learning process uses unlabeled data and also reuses labeled data)
 8:   Sample a query x from Q ∪ T;
 9:   Q2LF model generates k logical forms y_1, y_2, · · · , y_k via beam search;
10:   for each possible logical form y_i do
11:     Obtain validity reward for y_i as R^val_q(y_i) = grammar_error_indicator(y_i)
12:     Get reconstruction reward for y_i as R^rec_q(x, y_i) = log P(x|y_i; Θ_LF2Q)