Unsupervised Dual Paraphrasing for Two-stage Semantic Parsing

One daunting problem for semantic parsing is the scarcity of annotation. Aiming to reduce nontrivial human labor, we propose a two-stage semantic parsing framework, where the first stage utilizes an unsupervised paraphrase model to convert an unlabeled natural language utterance into its canonical utterance. The downstream naive semantic parser accepts the intermediate output and returns the target logical form. Furthermore, the entire training process is split into two phases: pre-training and cycle learning. Three tailored self-supervised tasks are introduced throughout training to activate the unsupervised paraphrase model. Experimental results on the benchmarks Overnight and GeoGranno demonstrate that our framework is effective and compatible with supervised training.


Introduction
Semantic parsing is the task of converting natural language utterances into structured meaning representations, typically logical forms (Zelle and Mooney, 1996; Wong and Mooney, 2007; Zettlemoyer and Collins, 2007; Lu et al., 2008). One prominent approach to build a semantic parser from scratch follows this procedure (Wang et al., 2015): a) (canonical utterance, logical form) pairs are automatically generated according to a domain-general grammar and a domain-specific lexicon; b) researchers use crowdsourcing to paraphrase those canonical utterances into natural language utterances (the upper part of Figure 1); c) a semantic parser is built upon the collected (natural language utterance, logical form) pairs.

* The corresponding author is Kai Yu.

Figure 1: Two-stage semantic parsing framework, which is composed of an unsupervised paraphrase model and a naive neural semantic parser.
Canonical utterances are pseudo-language utterances automatically generated from grammar rules, which can be understandable to people, but do not sound natural. Though effective, the paraphrasing paradigm suffers from two drawbacks: (1) dependency on nontrivial human labor and (2) low utilization of canonical utterances.
Annotators may struggle to understand the exact meanings of canonical utterances, and some canonical utterances are even ambiguous, which increases the difficulty of annotation. Furthermore, Wang et al. (2015) and Herzig and Berant (2019) only exploit canonical utterances during data collection. Once the semantic parsing dataset is constructed, they are thrown away, leading to insufficient utilization. While Berant and Liang (2014) and Su and Yan (2017) have reported the effectiveness of leveraging canonical utterances as intermediate outputs, they experiment in a completely supervised way, where human annotation is indispensable.
In this work, inspired by unsupervised neural machine translation (Lample et al., 2017; Artetxe et al., 2017), we propose a two-stage semantic parsing framework. The first stage uses a paraphrase model to convert natural language utterances into corresponding canonical utterances. The paraphrase model is trained in an unsupervised way. Then a naive neural semantic parser is built upon auto-generated (canonical utterance, logical form) pairs using traditional supervised training. These two models are concatenated into a pipeline (Figure 1).
Paraphrasing aims to perform semantic normalization and reduce the diversity of expression, trying to bridge the gap between natural language and logical forms. The naive neural semantic parser learns inner mappings between canonical utterances and logical forms, as well as the structural constraints.
The unsupervised paraphrase model consists of one shared encoder and two separate decoders for natural language and canonical utterances. In the pre-training phase, we design three types of noise (Section 3.1) tailored for the sentence-level denoising auto-encoder (Vincent et al., 2008) task to warm up the paraphrase model without any parallel data. This task aims to reconstruct the raw input utterance from its corrupted version. After obtaining a good initialization point, we further incorporate back-translation (Sennrich et al., 2015) and dual reinforcement learning (Section 2.2.2) tasks during the cycle learning phase. In this phase, each encoder-decoder model acts as the environment for the other, providing pseudo-samples and reward signals.
We conduct extensive experiments on benchmarks OVERNIGHT and GEOGRANNO, both in unsupervised and semi-supervised settings. The results show that our method obtains significant improvements over various baselines in unsupervised settings. With full labeled data, we achieve new state-of-the-art performances (80.1% on OVERNIGHT and 74.5% on GEOGRANNO), not considering additional data sources.
The main contributions of this work can be summarized as follows: • A two-stage semantic parser framework is proposed, which casts parsing into paraphrasing. No supervision is provided in the first stage between input natural language utterances and intermediate output canonical utterances.
• In unsupervised settings, experimental results on datasets OVERNIGHT and GEOGRANNO demonstrate the superiority of our model over various baselines, including the supervised method in Wang et al. (2015) on OVERNIGHT (60.7% compared to 58.8%).
• The framework is also compatible with traditional supervised training and achieves the new state-of-the-art performances on datasets OVERNIGHT (80.1%) and GEOGRANNO (74.5%) with full labeled data.
Our Approach

Problem Definition
For the rest of our discussion, we use x to denote natural language utterance, z for canonical utterance, and y for logical form. X , Z and Y represent the set of all possible natural language utterances, canonical utterances, and logical forms respectively.
The underlying mapping function f : Z −→ Y is dominated by grammar rules. We can train a naive neural semantic parser P nsp using attention (Luong et al., 2015) based Seq2Seq model (Sutskever et al., 2014). The labeled samples {(z, y), z ∈ Z, y ∈ Y} can be automatically generated by recursively applying grammar rules. P nsp can be pre-trained and saved for later usage.
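The automatic generation of labeled (z, y) samples by recursively applying grammar rules can be illustrated with a toy grammar. The rule set and templates below are invented for illustration and are far simpler than the SEMPRE grammar actually used in the paper:

```python
import itertools

# Hypothetical rules: each non-terminal pairs a canonical-utterance
# realization with a logical-form realization.
RULES = {
    "$Property": [("population", "population"), ("area", "area")],
    "$State": [("california", "CA"), ("texas", "TX")],
}
# One toy derivation template: (canonical side, logical-form side).
TEMPLATE = ("what is the {p} of state {s}", "( {p} ( state {s} ) )")

def generate_pairs():
    """Enumerate (canonical utterance, logical form) pairs by expanding
    every combination of rule applications."""
    pairs = []
    for (p_can, p_lf), (s_can, s_lf) in itertools.product(
            RULES["$Property"], RULES["$State"]):
        can = TEMPLATE[0].format(p=p_can, s=s_can)
        lf = TEMPLATE[1].format(p=p_lf, s=s_lf)
        pairs.append((can, lf))
    return pairs
```

The resulting pairs can then be used to train P_nsp with ordinary supervised Seq2Seq learning.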
As for the paraphrase model (see Figure 1), it consists of one shared encoder E and two independent decoders: D x for natural language utterances and D z for canonical utterances. The symbol • denotes module composition. Detailed model implementations are omitted here since they are not the main focus (Appendix A.1 for reference).
Given an input utterance x ∈ X, the paraphrase model D_z • E converts it into a possible canonical utterance ẑ = D_z • E(x); then ẑ is passed into the pre-trained naive parser P_nsp to obtain the predicted logical form ŷ = P_nsp • D_z • E(x). The other paraphrase model, D_x • E, is only used as an auxiliary tool during training.

Unsupervised training procedures
To train an unsupervised paraphrase model with no parallel data between X and Z, we split the entire training procedure into two phases: pre-training and cycle learning. D_x • E and D_z • E are first pre-trained as denoising auto-encoders (DAE). This initialization phase plays a significant part in accelerating convergence, given the ill-posed nature of the paraphrasing task. Next, in the cycle learning phase, we employ both back-translation (BT) and dual reinforcement learning (DRL) strategies for self-training and exploration.

Pre-training phase
In this phase, we initialize the paraphrase model via the denoising auto-encoder task. All auxiliary models involved in calculating rewards (see Section 3.2) are also pre-trained.

Denoising auto-encoder Given a natural language utterance x, we forward it through a noisy channel N_x(·) (see Section 3.1) and obtain its corrupted version. Then, model D_x • E tries to reconstruct the original input x from N_x(x); see Figure 2. Symmetrically, model D_z • E tries to reconstruct the original canonical utterance z from its corrupted input N_z(z). The training objective can be formulated as

L_DAE = E_{x∼X}[− log P(x | N_x(x); Θ_{D_x•E})] + E_{z∼Z}[− log P(z | N_z(z); Θ_{D_z•E})]   (Eq. 1)

where Θ_{D_x•E} and Θ_{D_z•E} are the parameters of the system.

Cycle learning phase
The training framework up to this point is merely a noisy-copying model. To improve upon it, we adopt two schemes in the cycle learning phase, back-translation (BT) and dual reinforcement learning (DRL); see Figure 3.
Back-translation In this task, the shared encoder E aims to map input utterances of different types into the same latent space, and the decoders need to decompose this representation into an utterance of the other type. More concretely, given a natural language utterance x, we use the paraphrase model D_z • E in evaluation mode with greedy decoding to convert x into a canonical utterance ẑ, obtaining a pseudo training sample (ẑ, x) for the paraphrase model D_x • E. Similarly, a pair (x̂, z) can be synthesized from model D_x • E given a canonical utterance z. Next, we train the paraphrase models on these pseudo-parallel samples and update parameters by minimizing

L_BT = E_{x∼X}[− log P_{D_x•E}(x | ẑ)] + E_{z∼Z}[− log P_{D_z•E}(z | x̂)]   (Eq. 2)

The updated models will generate better paraphrases as the process iterates.
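The pseudo-sample construction can be sketched in a few lines of Python. Here `nl2cn` and `cn2nl` are stand-ins for the greedy-decoding paraphrase models D_z • E and D_x • E run in evaluation mode:

```python
def synthesize_bt_samples(nl_batch, cn_batch, nl2cn, cn2nl):
    """Build pseudo-parallel training pairs for back-translation.

    Returns (pairs for training D_x • E, pairs for training D_z • E).
    """
    # (ẑ, x): D_x • E learns to reconstruct x from the predicted canonical ẑ
    pairs_for_dx = [(nl2cn(x), x) for x in nl_batch]
    # (x̂, z): D_z • E learns to reconstruct z from the predicted natural x̂
    pairs_for_dz = [(cn2nl(z), z) for z in cn_batch]
    return pairs_for_dx, pairs_for_dz
```

The pairs are then fed to ordinary cross-entropy training, which is what Eq. 2 expresses in expectation.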
Dual reinforcement learning Back-translation focuses on exploiting what has already been learned by the dual model, which may lead to a local optimum. To encourage more exploratory trials during cycle learning, we introduce the dual reinforcement learning strategy and optimize the system through policy gradient (Sutton et al., 2000).
Starting from a natural language utterance x, we sample one canonical utterance z̃ through D_z • E. Then, we evaluate the quality of z̃ from different aspects (see Section 3.2) and obtain a reward R_x(z̃). Similarly, we calculate a reward R_z(x̃) for a sampled natural language utterance x̃. To cope with the high variance of the reward signals, we increase the sample size to K and re-center the rewards with a baseline b(·) to stabilize learning (take z̃_k as an example):

R̃_x(z̃_k) = R_x(z̃_k) − (1/K) Σ_{j=1}^{K} R_x(z̃_j)

We investigated different baseline choices (such as a running mean, the cumulative mean over history, and the reward of the greedy decoding prediction); using the average reward over the K samples of each input performs best, especially with a larger sample size. The training objective is the negative expected reward:

L_DRL = − E_{z̃∼D_z•E(·|x)}[R̃_x(z̃)] − E_{x̃∼D_x•E(·|z)}[R̃_z(x̃)]   (Eq. 3)

The gradient is calculated with the REINFORCE (Williams, 1992) algorithm:

∇L_DRL ≈ − (1/K) Σ_{k=1}^{K} [ R̃_x(z̃_k) ∇ log P(z̃_k | x) + R̃_z(x̃_k) ∇ log P(x̃_k | z) ]

The complete loss function in the cycle learning phase is the sum of the cross-entropy loss and the policy gradient loss: L_Cycle = L_BT + L_DRL. The entire training procedure is summarized in Algorithm 1.

Algorithm 1: Training procedure
Pre-training phase
1: Pre-train all auxiliary models: language models LM_x and LM_z, the naive neural semantic parser P_nsp, and the utterance discriminator P_dis
2: Pre-train the paraphrase models D_x • E and D_z • E with the DAE objective (Eq. 1)
Cycle learning phase
3: for i = 0 to M − 1 do
4:   Sample a natural language utterance x ∼ X
5:   Sample a canonical utterance z ∼ Z
     Back-translation
6:   Generate ẑ = D_z • E(x) and x̂ = D_x • E(z) with greedy decoding
7:   Use (ẑ, x) and (x̂, z) as pseudo samples; calculate loss L_BT based on Eq. 2
     Dual reinforcement learning
8:   Sample z̃ ∼ D_z • E(x) and x̃ ∼ D_x • E(z); compute rewards R_x(z̃) and R_z(x̃)
9:   Given R_x(z̃) and R_z(x̃), calculate loss L_DRL based on Eq. 3
     Update model parameters
10:  Calculate total loss L_Cycle = L_BT + L_DRL
11:  Update model parameters, obtaining new models D_x • E and D_z • E
12: end for
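The baseline-subtracted reward and the resulting policy-gradient surrogate loss can be sketched in plain Python. This is a simplified per-input view: `logprobs` are the sequence log-probabilities of the K sampled utterances under the current policy, and `rewards` the corresponding raw rewards:

```python
def shape_rewards(rewards):
    """Center the K per-sample rewards with their mean, i.e. the
    baseline b(.) that performed best in the paper's comparison."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]

def drl_loss(logprobs, rewards):
    """Monte-Carlo REINFORCE surrogate: differentiating this loss
    w.r.t. the log-probabilities yields -R_shaped * grad log P."""
    shaped = shape_rewards(rewards)
    return -sum(lp * r for lp, r in zip(logprobs, shaped)) / len(rewards)
```

In a real implementation `logprobs` would be autograd tensors, so minimizing `drl_loss` performs the policy-gradient update of Eq. 3.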

Training details
In this section, we elaborate on different types of noise used in our experiment and the reward design in dual reinforcement learning.

Noisy channel
We introduce three types of noise to deliberately corrupt the input utterance in the DAE task.
Importance-aware word dropping Traditional word dropping (Lample et al., 2017) discards each word in the input utterance with equal probability p_wd. During reconstruction, the decoder needs to recover those words based on the context. We further inject a bias towards dropping the more frequent words in the corpus (such as function words) rather than the less frequent ones (such as content words); see Table 1 for an example. Each word x_i in the natural language utterance is dropped with a probability that grows with w(x_i), the word count of x_i in X, capped at the maximum dropout rate p_max (p_max = 0.2 in our experiment). We apply word dropping to canonical utterances in the same way.
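A minimal sketch of this channel in Python. The exact probability schedule is not reproduced above, so the linear, capped schedule p(w) = p_max · count(w) / max_count below is an illustrative assumption:

```python
import random

def importance_aware_drop(tokens, counts, p_max=0.2, rng=None):
    """Drop each token with a probability that increases with its
    corpus frequency, capped at p_max.  `counts` maps word -> corpus
    count; the linear schedule is an illustrative choice."""
    rng = rng or random.Random(0)
    max_count = max(counts.values())
    kept = [t for t in tokens
            if rng.random() >= p_max * counts.get(t, 0) / max_count]
    return kept or tokens  # never emit an empty utterance
```

With this schedule, frequent function words are dropped far more often than rare content words, forcing the decoder to re-insert them from context.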
Mixed-source addition Any given raw input is either a natural language utterance or a canonical utterance. This observation discourages the shared encoder E from learning a common representation space. Thus, we propose to insert extra words from the other source into the input utterance. For the noisy channel N_x(·), which corrupts a natural language utterance, we first select one candidate canonical utterance z; next, 10%-20% of the words in z are randomly sampled and inserted into arbitrary positions in x; see Table 2 for an example.
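The addition step can be sketched as follows; candidate selection is handled separately, so here `z_tokens` is the already-chosen candidate from the other source:

```python
import random

def mixed_source_addition(x_tokens, z_tokens, lo=0.10, hi=0.20, rng=None):
    """Insert lo-hi (10%-20%) of the candidate's words from the other
    source into random positions of the input (the N_x side is shown;
    N_z is symmetric)."""
    rng = rng or random.Random(0)
    # at least one word, 10%-20% of the candidate length otherwise
    k = max(1, round(rng.uniform(lo, hi) * len(z_tokens)))
    corrupted = list(x_tokens)
    for w in rng.sample(z_tokens, k):
        corrupted.insert(rng.randrange(len(corrupted) + 1), w)
    return corrupted
```

The original words keep their relative order; only foreign words from the other source are interleaved.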
To pick a candidate z with higher relevance, we use a heuristic method: C canonical utterances are randomly sampled as candidates (C = 50), and we choose the z with the minimum Word Mover's Distance to x (WMD; Kusner et al., 2015). The addition operation is exactly symmetric for the noisy channel N_z.

Bigram shuffling We also use word shuffling (Lample et al., 2017) in the noisy channels; it has been proven useful in preventing the encoder from relying too much on word order. Instead of shuffling individual words, we split the input utterance into n-grams first and shuffle at the n-gram level (bigrams in our experiment). To account for the words inserted from the other source, we shuffle the entire utterance after the addition operation (see Table 3).
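The n-gram-level shuffle reduces to a few lines:

```python
import random

def bigram_shuffle(tokens, n=2, rng=None):
    """Split the utterance into n-grams (bigrams by default) and
    shuffle at the n-gram level, so words inside each n-gram stay
    adjacent."""
    rng = rng or random.Random(0)
    grams = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(grams)
    return [w for g in grams for w in g]
```

Compared with word-level shuffling, this keeps local collocations intact while still destroying global word order.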

Reward design
In order to provide more informative reward signals and promote the performance in the DRL task, we introduce various rewards from different aspects.
Fluency The fluency of an utterance is evaluated by a length-normalized language model. We use individual language models (LM_x and LM_z) for each type of utterance. For a sampled natural language utterance x̃, the fluency reward is

R^flu(x̃) = log LM_x(x̃) / |x̃|

For sampled canonical utterances, we also include an additional 0/1 reward from the downstream naive semantic parser, indicating whether the sampled canonical utterance z̃ is well-formed as input for P_nsp, judged from its greedy prediction ŷ = arg max_y P_nsp(y | z̃).

Style Natural language utterances are diverse, casual, and flexible, whereas canonical utterances are generally rigid, regular, and restricted to specific forms induced by the grammar rules. To distinguish their characteristics, we incorporate another reward signal that determines the style of the sampled utterance. This is implemented by a CNN discriminator (Kim, 2014):

R^sty(z̃) = P_dis(z̃)

where P_dis(·) is a pre-trained sentence classifier that estimates the probability of the input utterance being a canonical utterance.
Relevance The relevance reward measures how much content is preserved after paraphrasing. We follow the common practice of taking the log-likelihood from the dual model.
Some other options include computing the cosine similarity of sentence vectors or the BLEU score (Papineni et al., 2002) between the raw input and the reconstructed utterance; nevertheless, we find log-likelihood to perform better in our experiments. The total reward for a sampled canonical utterance z̃ (and symmetrically for a natural language utterance x̃) is the sum of the fluency, style, and relevance rewards above.
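A schematic combination in Python. The unweighted sum mirrors the description above; treating the three aspects as equally weighted is an assumption:

```python
def fluency_reward(token_logprobs):
    """Length-normalized language-model score."""
    return sum(token_logprobs) / len(token_logprobs)

def total_reward(token_logprobs, style_prob, relevance_loglik,
                 well_formed=None):
    """Sum of fluency, style, and relevance rewards; the optional 0/1
    well-formedness signal only applies to canonical utterances."""
    r = fluency_reward(token_logprobs) + style_prob + relevance_loglik
    if well_formed is not None:
        r += 1.0 if well_formed else 0.0
    return r
```

Here `token_logprobs` come from the respective language model, `style_prob` from P_dis, and `relevance_loglik` from the dual paraphrase model.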

Experiment
In this section, we evaluate our system on the benchmarks OVERNIGHT and GEOGRANNO in both unsupervised and semi-supervised settings. Our implementations are publicly available 2.
OVERNIGHT It contains natural language paraphrases paired with logical forms over 8 domains.
We follow the traditional 80%/20% train/valid split to choose the best model during training. Canonical utterances are generated with the tool SEMPRE 3 and paired with target logical forms (Wang et al., 2015). Due to the limited number of grammar rules and their coarse-grained nature, there is only one canonical utterance for each logical form, but on average 8 natural language paraphrases per canonical utterance. For example, to describe the concept of "larger", natural language utterances use many synonyms, such as "more than", "higher", and "at least", while in canonical utterances the expression is restricted by the grammar.
GEOGRANNO Due to the language mismatch problem (Herzig and Berant, 2019), annotators are prone to reuse the same phrase or word while paraphrasing. GEOGRANNO was therefore created via detection instead of paraphrasing: natural language utterances are first collected from query logs, and crowd workers are required to select the correct canonical utterance for each input from a candidate list (provided by an incrementally trained scoring function). We follow exactly the same split (train/valid/test 487/59/278) as the original paper (Herzig and Berant, 2019).

Experiment setup
Throughout the experiments, unless otherwise specified, word vectors are initialized with Glove6B (Pennington et al., 2014), with 93.3% coverage on average, and allowed to fine-tune. Out-of-vocabulary words are replaced with <unk>. Batch size is fixed to 16 and the sample size K in the DRL task is 6. During evaluation, the beam search size is 5. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 for all experiments. All auxiliary models are pre-trained and fixed for later usage. We report the denotation-level accuracy of logical forms in different settings.
Supervised settings This is the traditional scenario, where labeled (x, y) pairs are used to train a one-stage parser directly, while (x, z) and (z, y) pairs are used to train the respective parts of a two-stage parser.
Unsupervised settings We split all methods into two categories: one-stage and two-stage. In the one-stage category, the EMBED semantic parser is merely trained on (z, y) pairs but evaluated on natural language utterances; contextual embeddings ELMo (Peters et al., 2018) and BERT-base-uncased (Devlin et al., 2018) are also used to replace the original embedding layer. The WMDSAMPLES method labels each input x with the most similar logical form (one-stage) or canonical utterance (two-stage) based on WMD (Kusner et al., 2015) and treats these faked samples in a supervised way. MULTITASKDAE adds another decoder for natural language utterances to the one-stage parser to perform the same DAE task discussed before. The two-stage COMPLETEMODEL can share the encoder or not (-SHAREDENCODER), and include the cycle learning tasks or not (-CYCLELEARNING). The downstream parser P_nsp of the two-stage system is EMBED + GLOVE6B and is fixed after pre-training.
Semi-supervised settings To further validate our framework, based on the complete model in the unsupervised settings, we also conduct semi-supervised experiments by gradually adding part of the labeled paraphrases, with supervised training, into the training process (both the pre-training and cycle learning phases).

Results and analysis
As Tables 4 and 5 demonstrate, in unsupervised settings: (1) the two-stage semantic parser is superior to the one-stage one, as it bridges the vast discrepancy between natural language utterances and logical forms by utilizing canonical utterances. Even in supervised experiments, this pipeline remains competitive (76.4% compared to 76.0%, 71.6% to 71.9%).
(2) Not surprisingly, model performance is sensitive to the word embedding initialization. On OVERNIGHT, directly using raw Glove6B word vectors, the performance is the worst among all baselines (19.7%). Benefiting from pre-trained embeddings ELMo or Bert, the accuracy is dramatically improved (26.2% and 32.7%).
(3) When we share the encoder module in a one-stage parser for multi-tasking (MULTITASKDAE), the performance is not remarkably improved, and is even slightly lower than EMBED+BERT (31.9% compared to 32.7%, 38.1% to 40.7%). We hypothesize that a semantic parser utilizes the input utterance in a way different from that of a denoising auto-encoder, thus focusing on different zones in the representation space. In a paraphrasing model, however, since the input and output utterances are exactly symmetric, sharing the encoder is more suitable and attains excellent performance (from 57.5% to 60.7% on OVERNIGHT, from 59.0% to 63.7% on GEOGRANNO). Furthermore, the effectiveness of the DAE pre-training task (44.9% and 44.6% accuracy on the target task) can be explained in part by the proximity of natural language and canonical utterances. (4) The WMDSAMPLES method is easy to implement but has poor generalization and an obvious upper bound, while our system can self-train through cycle learning, improving from the initial 44.9% to 60.7% on OVERNIGHT and outperforming the traditional supervised method (Wang et al., 2015) by 1.9 points.
As for the semi-supervised results: (1) when only 5% labeled data is added, the performance is dramatically improved, from 60.7% to 68.4% on OVERNIGHT and from 63.7% to 69.4% on GEOGRANNO. (2) With 30% annotation, our system (75.0%/71.6%) is competitive with the neural network model trained on all data with supervision.
(3) Compared with the previous result reported in Cao et al. (2019) on dataset OVERNIGHT with 50% parallel data, our system surpasses it by a large margin (4%) and achieves the new state-of-the-art performance on both datasets when using all labeled data (80.1%/74.5%), not considering results using additional data sources or cross-domain benefits.
From the experimental results and Figure 4, we can safely summarize that (1) our proposed method resolves the daunting problem of cold start when we train a semantic parser without any parallel data.
(2) It is also compatible with traditional supervised training and can easily scale up to handle more labeled data.

Ablation study
In this section, we analyze the influence of each noise type in the DAE task and of different combinations of schemes in the cycle learning phase on the dataset OVERNIGHT. According to the results in Table 6, (1) it is interesting that even without any noise, in which case the denoising auto-encoder degenerates into a simple copying model, the paraphrase model still succeeds in making some useful predictions (26.9%). This observation may be attributed to the shared encoder for the two utterance types.
(2) When we gradually complicate the DAE task by increasing the number of noise types, the generalization capability continues to improve. (3) Generally speaking, importance-aware drop and mixed-source addition are more useful than bigram shuffling in this task.

Strategies in the cycle learning
The most striking observation arising from Table 7 is that the performance decreases by 1.5 points when we add the DAE task into the cycle learning phase (BT+DRL). A possible explanation is that the model has already reached its bottleneck on the DAE task after pre-training, so the task contributes nothing further to the cycle learning process. Another likely factor is the contradictory goals of the different tasks: continuing to add the DAE regularization term may hinder the exploratory trials of the DRL task. By decoupling the three types of rewards in DRL, we discover that the style and relevance rewards are more informative than the fluency reward.

Case study
In Table 8, we compare intermediate canonical utterances generated by our unsupervised paraphrase model with those created by the baseline WMDSAMPLES. In domain BASKETBALL, our system succeeds in paraphrasing the constraint into "at least 3", an alias of "3 or more". This finding supports the assumption that our model can learn fine-grained semantics such as phrase alignments. In domain GEOGRANNO, our model rectifies the errors of the baseline system, where the constraint "borders state" is missing and the subject "state" is stealthily replaced with "population". As for domain CALENDAR, the baseline system fails to identify the query object and returns "meeting" instead of "person". Although our model correctly understands the purpose, it does some unnecessary work: the requirement "attendee of weekly standup" is repeated. This may be caused by the uncontrolled process during cycle learning, in which we encourage the model to take risky steps towards better solutions.

Related Work
Annotation for Semantic Parsing Semantic parsing is always data-hungry; however, its annotation is not user-friendly. Many researchers have attempted to relieve the burden of human annotation, such as training from weak supervision (Krishnamurthy and Mitchell, 2012; Berant et al., 2013; Liang et al., 2017; Goldman et al., 2018), semi-supervised learning (Guo et al., 2018; Cao et al., 2019; Zhu et al., 2014), on-line learning (Iyer et al., 2017; Lawrence and Riezler, 2018), and relying on multi-lingual (Zou and Lu, 2018) or cross-domain datasets (Herzig and Berant, 2017; Zhao et al., 2019). In this work, we avoid the heavy annotation work by utilizing canonical utterances as intermediate results and constructing an unsupervised paraphrase model.
Unsupervised Learning for Seq2Seq Models Seq2Seq (Sutskever et al., 2014; Zhu and Yu, 2017) models have been successfully applied in unsupervised tasks such as neural machine translation (NMT) (Lample et al., 2017; Artetxe et al., 2017), text simplification (Zhao et al., 2020), spoken language understanding (Zhu et al., 2018), and text style transfer (Luo et al., 2019). Unsupervised NMT relies heavily on pre-trained cross-lingual word embeddings for initialization, as Lample et al. (2018) pointed out, and it mainly focuses on learning phrase alignments or word mappings. In this work, we instead dive into sentence-level semantics and adopt the dual structure of an unsupervised paraphrase model to improve semantic parsing.

Conclusion
In this work, aiming to reduce annotation, we propose a two-stage semantic parsing framework. The first stage utilizes the dual structure of an unsupervised paraphrase model to rewrite the input natural language utterance into a canonical utterance. Three self-supervised tasks, namely denoising auto-encoder, back-translation, and dual reinforcement learning, are introduced to iteratively improve our model through the pre-training and cycle learning phases. Experimental results show that our framework is effective and compatible with supervised training.

A.1 Model Implementations
In this section, we give a full discussion of all models used in our two-stage semantic parsing framework.
Unsupervised paraphrase model We use a traditional attention-based (Luong et al., 2015) Seq2Seq model. Different from previous work, we remove the transition function of hidden states between encoder and decoder; the initial hidden states of the decoders are set to 0-vectors. Take the D_z • E paraphrase model as an example: (1) the shared encoder encodes the input utterance x into a sequence of contextual representations h through a bi-directional single-layer LSTM (Hochreiter and Schmidhuber, 1997) network (ψ is the embedding function); (2) on the decoder side, a traditional LSTM language model at the bottom models the dependencies in the target utterance z (φ is the embedding function on the target side):

s_t = f_LSTM(φ(z_{t−1}), s_{t−1}),  s_0 = 0-vector

(3) the output state s_t at each time-step t is then fused with the encoded contexts h to obtain the features for the final softmax classifier (v, W_* and b_* are model parameters).

In both the pre-training and cycle learning phases, the unsupervised paraphrase model is trained for 50 epochs. To select the best model during unsupervised training, inspired by Lample et al. (2017), we use a surrogate criterion, since we have no access to labeled (x, z) pairs even at validation time. For a natural language utterance x, we pass it into the model D_z • E and obtain a canonical utterance ẑ via greedy decoding; ẑ is then forwarded into the dual paraphrase model D_x • E. By measuring the BLEU score between the raw input x and the reconstructed utterance x̂, we obtain one metric, BLEU(x, x̂). In the reverse path, we obtain another metric by calculating the overall accuracy between the raw canonical utterance z and its reconstructed version ẑ through the naive semantic parser P_nsp. The overall metric for model selection combines the two (λ is a scaling hyper-parameter, set to 4 in our experiments).

Auxiliary models The naive semantic parser P_nsp is another Seq2Seq model with exactly the same architecture as D_z • E.
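The surrogate model-selection criterion can be sketched as follows. `nl2cn` and `cn2nl` stand in for D_z • E and D_x • E with greedy decoding, and `bleu` and `parser_acc` are corpus-level scoring callables; the λ-weighted additive combination is an assumed reading of the formula elided above:

```python
def surrogate_selection_metric(nl_valid, cn_valid, nl2cn, cn2nl,
                               bleu, parser_acc, lam=4.0):
    """Unsupervised model selection: round-trip each validation
    utterance and score the reconstruction against the raw input."""
    # x -> ẑ -> x̂, scored with BLEU(x, x̂)
    nl_round = [cn2nl(nl2cn(x)) for x in nl_valid]
    # z -> x̂ -> ẑ, scored via accuracy through the naive parser
    cn_round = [nl2cn(cn2nl(z)) for z in cn_valid]
    return bleu(nl_valid, nl_round) + lam * parser_acc(cn_valid, cn_round)
```

The checkpoint with the highest combined score is kept, mirroring the validation loop described in the paragraph above.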
We do not incorporate a copy mechanism, because it has been shown to be unhelpful on the dataset OVERNIGHT (Jia and Liang, 2016). The language models LM_x and LM_z are single-layer unidirectional LSTM networks. For the style discriminator P_dis, we use a CNN-based sentence classifier (Kim, 2014) with rectified linear units and filter windows of 3, 4, 5 with 10, 20, 30 feature maps respectively. All auxiliary models are trained for at most 100 epochs. For all models discussed above, the embedding dimension is set to 100, the hidden size to 200, and the dropout rate between layers to 0.5. All parameters except the embedding layers are initialized by uniformly sampling within the interval [−0.2, 0.2].