Unified Semantic Parsing with Weak Supervision

Semantic parsing over multiple knowledge bases enables a parser to exploit structural similarities of programs across the multiple domains. However, the fundamental challenge lies in obtaining high-quality annotations of (utterance, program) pairs across various domains needed for training such models. To overcome this, we propose a novel framework to build a unified multi-domain enabled semantic parser trained only with weak supervision (denotations). Weakly supervised training is particularly arduous as the program search space grows exponentially in a multi-domain setting. To solve this, we incorporate a multi-policy distillation mechanism in which we first train domain-specific semantic parsers (teachers) using weak supervision in the absence of the ground truth programs, followed by training a single unified parser (student) from the domain specific policies obtained from these teachers. The resultant semantic parser is not only compact but also generalizes better, and generates more accurate programs. It further does not require the user to provide a domain label while querying. On the standard Overnight dataset (containing multiple domains), we demonstrate that the proposed model improves performance by 20% in terms of denotation accuracy in comparison to baseline techniques.


Introduction
Semantic parsing is the task of converting natural language utterances into machine executable programs such as SQL or lambda logical forms (Liang, 2013). This has been a classical area of research in natural language processing (NLP), with earlier works primarily utilizing rule based approaches (Woods, 1973) or grammar based approaches (Lafferty et al., 2001; Kwiatkowski et al., 2011; Collins, 2005, 2007). Recently, there has been a surge in neural encoder-decoder techniques which are trained with input utterances and corresponding annotated output programs (Dong and Lapata, 2016; Jia and Liang, 2016). However, the performance of these strongly supervised methods is restricted by the size and the diversity of the training data, i.e. natural language utterances and their corresponding annotated logical forms. This has motivated work on weak supervision based approaches (Clarke et al., 2010; Liang et al., 2017; Neelakantan et al., 2016; Chen et al., 2018), which use denotations, i.e. the final answers obtained upon executing a program on the knowledge base, and use REINFORCE (Williams, 1992; Norouzi et al., 2016) to guide the network to learn its semantic parsing policy (see Figure 3(a)). Another line of work (Goldman et al., 2018; Cheng and Lapata, 2018) aims to improve the efficiency of weakly supervised parsers by applying a two-stage approach of first learning to generate program templates, followed by exact program generation. It is important to note that this entire body of work on weakly supervised semantic parsing has been restricted to building a parser over a single domain only (i.e. a single dataset).
Figure 1: Examples for natural language utterances with linguistic variations in two different domains that share structural regularity (Source: OVERNIGHT dataset). Note that in this setup, we do not use ground truth parses for training the semantic parser.
arXiv:1906.05062v1 [cs.CL] 12 Jun 2019
Moving beyond a single domain to multiple domains, Herzig and Berant (2017) proposed semantic parsing networks trained by combining the datasets corresponding to multiple domains into a single pool. Consider the example in Figure 1 illustrating utterances from two domains, RECIPES and PUBLICATIONS, of the OVERNIGHT dataset. The utterances contain the linguistic variations "most" and "maximum number", both of which correspond to the shared program token argmax. Herzig and Berant (2017) show that leveraging such structural similarities in language by combining these different domains leads to improved performance. However, as with many single-domain techniques, their approach also requires strong supervision in the form of program annotations corresponding to the utterances. Obtaining such high quality annotations across multiple domains is challenging, which makes it expensive to scale to newer domains.
To overcome these limitations, in this work, we focus on the problem of developing a semantic parser for multiple domains in the weak supervision setting using denotations. Note that this combined multi-domain task entails a larger set of answers and a more complex search space than the individual domain tasks. Therefore, the existing multi-domain semantic parsing models (Herzig and Berant, 2017) fail when trained in the weak supervision setting. See Section 6 for a detailed analysis.
To address this challenge, we propose a multi-policy distillation framework for multi-domain semantic parsing. This framework splits the training into the following two stages: 1) Learn a domain expert (teacher) policy for each domain using weak supervision. This allows the individual models to focus on learning the semantic parsing policy for their corresponding single domains; 2) Train a unified compressed semantic parser (student) using distillation from these expert policies. This enables the unified student to gain supervision from the trained expert policies and thus learn the shared semantic parsing policy for all the domains. This two-stage framework is inspired by policy distillation (Rusu et al., 2016), which transfers the policy of a reinforcement learning (RL) agent to train a student network that is more compact and efficient. In our case, weakly supervised domain teachers serve as RL agents. For inference, only the compressed student model is used; it takes as input the user utterance from any domain and outputs the corresponding parse program. It is important to note that the domain identifier input is not required by our model. The generated program is then executed over the corresponding KB to retrieve denotations that are provided as responses to the user.
To the best of our knowledge, we are the first to propose a unified multiple-domain parsing framework which does not assume the availability of ground truth programs. Additionally, it allows inference to be multi-domain enabled and does not require the user to provide domain identifiers corresponding to the input utterance. In summary, we make the following contributions:
• We build a unified neural framework to train a single semantic parser for multiple domains in the absence of ground truth parse programs. (Section 3)
• We show the effectiveness of multi-policy distillation in learning a semantic parser using independent weakly supervised experts for each domain. (Section 4)
• We perform an extensive experimental study in multiple domains to understand the efficacy of the proposed system against multiple baselines. We also study the effect of the availability of a small labeled corpus in the distillation setup. (Section 5)

Related Work
This work is related to three different areas: semantic parsing, policy learning and knowledge distillation. Figure 2 illustrates the placement of our proposed framework of unified semantic parsing in the space of the key related works done in each of these three areas. Semantic parsing has been an extensively studied problem, with the first study dating back to Woods (1973). Much of the work has been directed towards exploiting annotated programs for natural language utterances to build single-domain semantic parsers using various methods. Zettlemoyer and Collins (2007) and Kwiatkowski et al. (2011) propose to learn probabilistic combinatory categorial grammars, while Kate et al. (2005) learn transformations from the syntactic parse tree of a natural language utterance to a formal parse tree. Andreas et al. (2013) model the task of semantic parsing as machine translation. Recently, Dong and Lapata (2016) introduced the use of neural sequence-to-sequence models for the task of semantic parsing.
Due to the cost of obtaining annotated programs, there has been increasing interest in weak supervision based methods (Clarke et al., 2010; Liang et al., 2017; Neelakantan et al., 2016; Chen et al., 2018; Goldman et al., 2018), which use denotations, i.e. the final answers obtained on executing a program on the knowledge base, for training.
The problem of semantic parsing has been primarily studied in a single-domain setting employing supervised and weakly supervised techniques. However, the task of building a semantic parser in the multi-domain setting is relatively new. Herzig and Berant (2017) propose semantic parsing models using supervised learning in a multi-domain setup; this is the work closest to ours. However, none of the existing works examine the problem of multi-domain semantic parsing in a weak supervision setting.
Knowledge distillation was first presented by Hinton et al. (2015) and has been widely used for model compression of convolutional neural networks in computer vision tasks (Yu et al., 2017). Kim and Rush (2016) and Chen et al. (2017) applied knowledge distillation to recurrent neural networks for the task of machine translation and showed improved performance with a much more compressed student network. Policy distillation, the method on which our framework is based, was first introduced by Rusu et al. (2016); it builds on the principle of knowledge distillation and applies it to reinforcement learning agents. Variants of the policy distillation framework have also been proposed (Teh et al., 2017). To the best of our knowledge, our work is the first to apply policy distillation to a sequence-to-sequence learning task. We anticipate that the framework described in this paper can be applied to learn unified models for other tasks as well.

Proposed Framework
In this section, we first present a high level overview of the framework for the proposed unified semantic parsing using multi-policy distillation and then describe the models employed for each component of the framework.
We focus on the setting of $K$ domains, each with an underlying knowledge base $B^1, \cdots, B^K$. We have a training set of utterances $X^k$ and the corresponding final denotations $Y^k$ for each domain $k \in \{1, \cdots, K\}$. Unlike existing works (Herzig and Berant, 2017), we do not assume availability of ground truth programs corresponding to the utterances in the training data. Our goal is to learn a unified semantic parsing model which takes as input a user utterance $x^k_i = \{x^k_{i1}, \cdots, x^k_{in}\} \in X^k$ from any domain $k$ and produces the corresponding program $z^k_i = \{z^k_{i1}, \cdots, z^k_{im}\}$ which, when executed on the corresponding knowledge base $B^k$, should return the denotation $y^k_i \in Y^k$. In this setup, we rely only on the weak supervision from the final denotations $Y^k$ for training this model. Moreover, the domain identifier $k$ is not needed by this unified model.
We use a multi-policy distillation framework for the task of learning a unified semantic parser. Figure 3 summarizes the proposed architecture. We first train parsing models (teachers) for each domain using weak supervision to learn domain-specific teacher policies. We use REINFORCE for training, similar to prior work on Neural Symbolic Machines (Liang et al., 2017), described briefly in Section 4.1. Next, we distill the learnt teacher policies to train a unified semantic parser enabled over multiple domains (described in Section 4.2). Note that: (1) Our teachers are trained with weak supervision from denotations instead of actual parses and hence are weaker than completely supervised semantic parsers. (2) State-of-the-art sequence distillation works (Kim and Rush, 2016; Chen et al., 2017) have focused on a single teacher-student setting. Figure 3(a) demonstrates the training of the experts $E^k$ using weak supervision on the denotation corresponding to the input utterance. Once we train all the domain experts $E^1, \cdots, E^K$ for the $K$ domains, we use the probability distributions of the parses generated by these experts to train the student, thereby distilling the domain policies learnt by the teachers to the student as shown in Figure 3(b).

Model
In this section, we describe the architecture of the semantic parsing model used for both the teacher and the student networks. We use a standard sequence-to-sequence model (Sutskever et al., 2014) with attention, similar to Dong and Lapata (2016), for this task. Each parsing model (the domain-specific teachers $E^1, \ldots, E^K$ and the unified student $S$) is composed of an $L$-layer encoder LSTM (Hochreiter and Schmidhuber, 1997) for encoding the input utterances and an $L$-layer attention-based decoder LSTM (Bahdanau et al., 2014) for producing the program sequences. Note that in this section, we omit the domain id superscript $k$. Given a user utterance $x$, the aim of the semantic parsing model is to generate an output program $z$ which should ultimately result in the true denotations $y$. The user utterance $x = \{x_1, \ldots, x_n\}$ is input to the encoder, which maps each word in the input sequence to an embedding $e = \{e_1, \ldots, e_n\}$ and uses this embedding to update its hidden states $h = \{h_1, \ldots, h_n\}$ using $h_t = \mathrm{LSTM}(e_t, h_{t-1}; \theta_{enc})$, where $\theta_{enc}$ are the parameters of the encoder LSTM. The last hidden state $h_n$ is input to the decoder's first state. The decoder updates its hidden state $s_t$ using $s_t = \mathrm{LSTM}(c_{t-1}, s_{t-1}; \theta_{dec})$, where $c_{t-1}$ is the embedding of the output program token $z_{t-1}$ from the previous step $t-1$ and $\theta_{dec}$ are the decoder LSTM parameters. The output program $\{z_1, \ldots, z_m\}$ is generated token-wise by applying a softmax over the vocabulary weights derived by transforming the corresponding hidden state $s_t$.
Further, we employ beam search during decoding, which generates a set of parses $B$ for every utterance. At each decoding step $t$, a beam $B_t$ containing partial parses of length $t$ is maintained. The next-step beam $B_{t+1}$ consists of the $|B|$ highest scoring expansions of the programs in the beam $B_t$.
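The beam update described above can be illustrated with a minimal sketch. It assumes a hypothetical `step_log_probs` callable that stands in for the decoder softmax and returns next-token log-probabilities for a partial parse; the end-of-sequence token and the choice to carry finished parses forward unchanged are also assumptions of this sketch, not details taken from the paper.

```python
import math
from typing import Callable, Dict, List, Tuple

Parse = Tuple[str, ...]


def beam_search(
    step_log_probs: Callable[[Parse], Dict[str, float]],
    beam_size: int,
    max_len: int,
    eos: str = "</s>",
) -> List[Tuple[Parse, float]]:
    """Keep the beam_size highest scoring expansions at every decoding step."""
    beam: List[Tuple[Parse, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Parse, float]] = []
        for prefix, score in beam:
            if prefix and prefix[-1] == eos:
                # Finished parses are carried over without expansion (assumption).
                candidates.append((prefix, score))
                continue
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # The next-step beam holds the top scoring expansions of the current beam.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
        if all(p and p[-1] == eos for p, _ in beam):
            break
    return beam


def toy_model(prefix: Parse) -> Dict[str, float]:
    """Hypothetical stand-in for the decoder: a fixed distribution per step."""
    if len(prefix) == 0:
        return {"SW.filter": math.log(0.6), "SW.argmax": math.log(0.4)}
    if len(prefix) == 1:
        return {"en.recipe": math.log(0.7), "en.article": math.log(0.3)}
    return {"</s>": 0.0}


if __name__ == "__main__":
    for parse, score in beam_search(toy_model, beam_size=2, max_len=5):
        print(" ".join(parse), round(math.exp(score), 2))
```

In a real parser the callable would run one decoder step conditioned on the encoded utterance; the toy distribution here only serves to make the pruning behaviour visible.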

Training
In this section, we describe the training mechanism employed for the proposed multi-domain policy distillation framework for semantic parsing. The training process in our proposed framework has the following two components (Figure 3): (i) weakly supervised training for the domain-specific semantic parsing experts $E^1, \ldots, E^K$, and (ii) distilling multiple domain policies to the unified student $S$. We next describe each of these two components.

Domain-specific Semantic Parsing Policy
As described in the previous section, an individual domain-specific semantic parsing model generates the program $z = \{z_1, \ldots, z_m\}$, which is executed on the knowledge base $B$ to return the denotation $y$. For brevity, we omit the domain identifier $k$ and the instance id $i$ in this section. In our setting, since labeled programs are not available for training, we use weak supervision from the final denotations $y$, similar to Liang et al. (2017), for each domain expert. As the execution of a parse program is a non-differentiable operation on the KB, we use REINFORCE (Williams, 1992; Norouzi et al., 2016) for training, which maximizes the expected reward. The reward $R(x, z)$ for a prediction $z$ on an input $x$ is defined as the match score between the true denotations $y$ for utterance $x$ and the denotations obtained by executing the predicted program $z$. The overall objective to maximize the expected reward is as follows:
$$\max_{\theta} \sum_{x} \mathbb{E}_{P_\theta(z|x)}\left[R(x, z)\right] \approx \max_{\theta} \sum_{x} \sum_{z \in B} P_\theta(z|x)\, R(x, z)$$
where $\theta = (\theta_{enc}, \theta_{dec})$ are the policy parameters, $B$ is the output beam containing the top scoring programs (described in Section 3.1), and $P_\theta(z|x)$ is the likelihood of parse $z$:
$$P_\theta(z|x) = \prod_{t} P_\theta(z_t \mid x, z_{1:t-1})$$
To reduce the variance in gradient estimation, we use the baseline $b(x) = \frac{1}{|B|} \sum_{z \in B} R(x, z)$, i.e. the average reward for the beam corresponding to the input instance $x$. See WEAK-INDEPENDENT in Table 2 for the performance achieved for individual domains with this training objective.
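The role of the beam-average baseline can be made concrete with a small numeric sketch. It assumes that each beam program's log-likelihood gradient is weighted by its probability times its baseline-subtracted reward; the function names are hypothetical, and the exact weighting used by the authors is not spelled out in the text.

```python
import math
from typing import List


def beam_baseline(rewards: List[float]) -> float:
    """Average reward over the beam, used as the variance-reducing baseline b(x)."""
    return sum(rewards) / len(rewards)


def expected_reward(log_probs: List[float], rewards: List[float]) -> float:
    """Beam approximation of the expected reward: sum over z of P(z|x) * R(x, z)."""
    return sum(math.exp(lp) * r for lp, r in zip(log_probs, rewards))


def reinforce_weights(log_probs: List[float], rewards: List[float]) -> List[float]:
    """Coefficient multiplying grad log P(z|x) for each program in the beam.

    Assumption of this sketch: weight = P(z|x) * (R(x, z) - b(x)).
    """
    b = beam_baseline(rewards)
    return [math.exp(lp) * (r - b) for lp, r in zip(log_probs, rewards)]


if __name__ == "__main__":
    # Three programs in the beam; only the first one executes to the right answer.
    log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
    rewards = [1.0, 0.0, 0.0]
    print("baseline:", beam_baseline(rewards))
    print("expected reward:", expected_reward(log_probs, rewards))
    print("weights:", reinforce_weights(log_probs, rewards))
```

With the baseline subtracted, programs that score below the beam average receive negative weights, so the update pushes probability mass away from them instead of leaving them untouched.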
Note that the primary challenge with this weakly supervised training is the sparsity of the reward signal: given the large search space, only a few predictions have a non-zero reward. This can be seen in Table 2 (WEAK-COMBINED): when the entire set of domains is pooled into one, the numbers drop severely due to the exponential increase in the search space.

Unified Model for multiple domains
For the unified semantic parser, we use the same sequence-to-sequence model described in Section 3.1. The hyper-parameter settings differ from those of the domain-specific models, as detailed in Section 5.3. We use the multi-task policy distillation method of Rusu et al. (2016) to train this unified parser for multiple domains. The individual domain experts $E^1, \ldots, E^K$ are trained independently as described in Section 4.1. This distillation framework enables transfer of knowledge from the experts $E^1, \ldots, E^K$ to a single student model $S$ that operates as a multi-domain parser, even in the absence of any domain indicator with the input utterance during the test phase. Each expert $E^k$ provides a transformed training dataset $D^k = \{(x^k_i, (p^k_\theta)_i)\}_{i=1}^{|X^k|}$ to the student, where $(p^k_\theta)_i$ is the expert's probability distribution over the entire program space w.r.t. the input utterance $x^k_i$. Concretely, given that $m$ is the decoding sequence length and $V$ is the vocabulary combined across domains, $(p^k_\theta)_i \in [0, 1]^{m \times |V|}$ denotes expert $E^k$'s probabilities that output token $z_{ij}$ equals vocabulary token $v$, for all time steps $j \in \{1, \ldots, m\}$ and all $v \in V$.
The student takes the probability outputs from the experts as the ground truth and is trained in a supervised manner to minimize the cross-entropy loss $L$ w.r.t. the teachers' probability distributions:
$$L(\theta^S; \{\theta^k\}_{k=1}^{K}) = -\sum_{k=1}^{K} \sum_{i=1}^{|X^k|} \sum_{j=1}^{m} \sum_{v \in V} p^k_\theta(z_{ij} = v; x^k_i, z_{i\{1:j-1\}}) \log p^S_\theta(z_{ij} = v; x^k_i, z_{i\{1:j-1\}})$$
where $\{\theta^k\}_{k=1}^{K}$ are the policy parameters of the experts and $\theta^S$ are the student model parameters; similarly, $p^S_\theta(z_{ij} = v; x^k_i, z_{i\{1:j-1\}})$ is the probability assigned to output token $z_{ij}$ by the student $S$. This training objective enables the unified parser to learn domain-specific parsing strategies from individual domains as well as leverage structural variations across domains. Therefore, the combined multi-domain policy $S$ is refined and compressed during the distillation process, making it more effective at parsing for each of the domains.
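As an illustration of this objective, the sketch below computes the cross-entropy between teacher and student token distributions for one example and sums it over examples pooled from several domains. The data layout (one list of per-step distributions per example) and the function names are assumptions made for the example, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple

Dist = Sequence[float]  # probabilities over the combined vocabulary at one step


def sequence_distillation_loss(
    teacher: List[Dist], student: List[Dist], eps: float = 1e-12
) -> float:
    """Cross-entropy of the student w.r.t. the teacher, summed over decoding steps."""
    loss = 0.0
    for p_teacher, p_student in zip(teacher, student):
        for pt, ps in zip(p_teacher, p_student):
            # eps guards against log(0); it is a numerical convenience only.
            loss -= pt * math.log(ps + eps)
    return loss


def multi_domain_loss(examples: List[Tuple[List[Dist], List[Dist]]]) -> float:
    """Sum the per-example loss over (teacher, student) pairs from all domains.

    No domain identifier is used: examples from every expert are simply pooled.
    """
    return sum(sequence_distillation_loss(t, s) for t, s in examples)


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.5, 0.5]]
    matching_student = [[1.0, 0.0], [0.5, 0.5]]
    uniform_student = [[0.5, 0.5], [0.5, 0.5]]
    print(sequence_distillation_loss(teacher, matching_student))
    print(sequence_distillation_loss(teacher, uniform_student))
```

The loss is smallest when the student reproduces the teacher's distribution at every step, which is why soft teacher outputs carry more signal than a single sampled program.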
In this section, we provide details on the data and model used for the experimental analysis. We further elaborate on the baselines used.

Data
We use the OVERNIGHT semantic parsing dataset (Wang et al., 2015), which contains multiple domains. Each domain has utterances (questions) and corresponding parses in λ-DCS form that are executable on a domain-specific knowledge base. Every domain is designed to focus on a specific linguistic phenomenon, for example, CALENDAR on temporal knowledge and BLOCKS on spatial queries. In this work, we use seven domains from the dataset, as listed in Table 1.
We would like to highlight that we do not use the parses available in the dataset during the training of our unified semantic parser. Our weakly supervised setup uses denotations to navigate the program search space and learn the parsing policy. This search space is a function of the decoder (program) length and the vocabulary size. Originally, the parses have 45 tokens on average, with a combined vocabulary of 182 distinct tokens across the domains. To reduce the decoder search space, we normalize the data to have shortened parses with an average length of 11 tokens and a combined vocabulary size of 147. We reduce the sequence length by using a set of template normalization functions and reduce the vocabulary size by masking named entities for each domain. An example of a normalization function is the following: an entity in the query, say of type recipe, is programmed by first creating a single-valued list with the entity type, i.e. (en.recipe), and then extracting that property:
(call SW.getProperty ( call SW.singleton en.recipe ) ( string ! type ))
resulting in 14 tokens. We replace this complex phrasing by directly substituting the entity type under consideration, i.e. (en.recipe) (1 token). Next, we show an example for a complete utterance: what recipes posting date is at least the same as rice pudding. Its original parse is:
(call SW.listValue (call SW.filter (call SW.getProperty (call SW.singleton en.recipe) (string ! type)) (call SW.ensureNumericProperty (string posting_date)) (string >=) (call SW.ensureNumericEntity (call SW.getProperty en.recipe.rice_pudding (string posting_date))))).
Our normalized query is what recipes posting date is at least the same as e0, where the entity rice pudding is substituted by the entity identifier e0. The normalized parse is as follows:
SW.filter en.recipe SW.ensureNumericProperty posting_date >= (SW.ensureNumericEntity SW.getProperty e0 posting_date)
It is important to note that this normalization function is reversible. During the test phase, we apply the reverse function to convert the normalized parses to their original forms for computing the denotations. Table 1 shows the domain-wise statistics of the original and normalized data. The normalization script is applicable for template reduction of any λ-DCS form.
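The entity-masking half of this normalization, and its reversal at test time, can be sketched as follows. This is only an illustration: the function names, the dictionary that maps surface forms to knowledge-base identifiers, and the shortened parse used in the demo are all hypothetical, and the template-reduction functions are not modelled.

```python
import re
from typing import Dict, Tuple


def mask_entities(
    utterance: str, parse: str, entities: Dict[str, str]
) -> Tuple[str, str, Dict[str, str]]:
    """Replace named entities with placeholders e0, e1, ... in utterance and parse.

    `entities` maps a surface form to its KB identifier (hypothetical input).
    The returned mapping is what makes the step reversible.
    """
    mapping: Dict[str, str] = {}
    for idx, (surface, kb_id) in enumerate(entities.items()):
        placeholder = f"e{idx}"
        utterance = utterance.replace(surface, placeholder)
        parse = parse.replace(kb_id, placeholder)
        mapping[placeholder] = kb_id
    return utterance, parse, mapping


def unmask_entities(parse: str, mapping: Dict[str, str]) -> str:
    """Restore the KB identifiers so the parse can be executed on the KB."""
    return re.sub(
        r"\be\d+\b", lambda m: mapping.get(m.group(0), m.group(0)), parse
    )


if __name__ == "__main__":
    utterance = "what recipes posting date is at least the same as rice pudding"
    parse = "SW.filter en.recipe posting_date >= (SW.getProperty en.recipe.rice_pudding posting_date)"
    entities = {"rice pudding": "en.recipe.rice_pudding"}

    masked_utt, masked_parse, mapping = mask_entities(utterance, parse, entities)
    print(masked_utt)
    print(masked_parse)
    print(unmask_entities(masked_parse, mapping) == parse)
```

Keeping the placeholder-to-identifier mapping alongside each example is what allows a predicted parse to be restored before execution.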
We report hard denotation accuracy, i.e. the proportion of questions for which the top prediction and the ground truth program yield matching answer sets, as the evaluation metric. For computing the rewards during training, we use soft denotation accuracy, i.e. the F1 score between the predicted and ground truth answer sets. Table 2 shows the accuracy with strongly supervised training (SUPERVISED). The average denotation accuracy (with beam width 1) is 70.6%, which is comparable to the state-of-the-art (Jia and Liang, 2016) denotation accuracy of 75.6% (with beam width 5). This additionally suggests that the data normalization process does not alter the task complexity.
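The two metrics can be sketched on answer sets as below. The function names are hypothetical, and the handling of empty answer sets (score 1 only when both are empty) is an assumption of this sketch, since the text does not specify it.

```python
from typing import List, Set


def soft_denotation_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 between predicted and gold answer sets (used as the training reward)."""
    if not predicted or not gold:
        # Assumption: two empty answer sets match, anything else scores zero.
        return 1.0 if predicted == gold else 0.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def hard_denotation_accuracy(
    predictions: List[Set[str]], golds: List[Set[str]]
) -> float:
    """Fraction of questions whose predicted answer set equals the gold set exactly."""
    matches = sum(1 for p, g in zip(predictions, golds) if p == g)
    return matches / len(golds)


if __name__ == "__main__":
    print(soft_denotation_f1({"a", "b"}, {"a"}))
    print(hard_denotation_accuracy([{"a"}, {"b"}], [{"a"}, {"c"}]))
```

The soft score gives partial credit to programs that retrieve some of the right answers, which is helpful as a reward when exact matches are rare early in training.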

Baselines
In the absence of any work on multi-domain parsers trained without ground truth programs, we compare the performance of the proposed unified framework against the following baselines:
1. Independent Domain Experts (WEAK-INDEPENDENT): These are the set of weakly supervised semantic parsers, trained independently for each domain using the REINFORCE algorithm as described in Section 4.1. Note that these are the teachers in our multi-policy distillation framework.
2. Combined Weakly Supervised Semantic Parser (WEAK-COMBINED): As per the recommendation in Herzig and Berant (2017), we pool the datasets of all domains into one and train a single semantic parser with weak supervision.

3. Independent Policy Distillation (DISTILL-INDEPENDENT): We also experiment with independent policy distillation for each domain. The setup is similar to the one described in Section 4.2 but is used to learn K student parsing models, one for each individual domain. Each student model uses the respective expert model as the only teacher.
Following the above naming convention, we term our proposed framework DISTILL-COMBINED. For the sake of completeness, we also compute the skyline SUPERVISED, i.e. the sequence-to-sequence model described in Section 3.1 trained with ground truth parses.

Model Setting
We use the original train-test split provided in the dataset. We further split the training set of each domain into training (80%) and validation [...]

[...] unified semantic parser using multi-policy distillation (DISTILL-COMBINED) (as described in Section 3) has, on average, the highest performance in predicting programs under the weak supervision setup. The DISTILL-COMBINED approach increases performance by ∼20% on average in comparison to the individual domain-specific teachers (WEAK-INDEPENDENT). The largest gain is in the HOUSING domain, with a ∼47% increase in denotation accuracy.

Results and Discussion
Effectiveness of Multi-Policy Distillation: Finally, we evaluate the effectiveness of the overall multi-policy distillation process in comparison to training a combined model with data merged from all the domains (WEAK-COMBINED) in the weak supervision setup. We observe that, due to the weak signal strength and the enlarged search space from multiple domains, the WEAK-COMBINED model performs poorly across domains. This further reinforces the need for the distillation process. As discussed earlier, the SUPERVISED model is trained using strong supervision from ground truth parses and hence is not considered a comparable baseline but rather a skyline for our proposed model.

Effect of Small Parallel Corpus
We show that our model can greatly benefit from the availability of a limited amount of parallel data where semantic parses are available. Figure 4 plots the performance of the WEAK-INDEPENDENT and DISTILL-INDEPENDENT models for the RECIPES domain when initialized with a pre-trained SUPERVISED model trained on 10% and 30% of the parallel training data. As can be seen, adding 10% parallel data brings an improvement of about 5 points, while increasing the parallel corpus size to only 30% brings an improvement of about 11 points. This large boost in performance is encouraging, given that a small parallel corpus is available in most real-world scenarios.

Conclusions and Future Work
In this work, we addressed the challenge of training a semantic parser for multiple domains without strong supervision, i.e. in the absence of ground truth programs corresponding to input utterances. We proposed a novel unified neural framework using a multi-policy distillation mechanism with two stages of training through weak supervision from denotations, i.e. the final answers corresponding to utterances. The resultant multi-domain semantic parser is compact and more precise, as demonstrated on the OVERNIGHT dataset. We believe that the proposed framework has wide applicability to any sequence-to-sequence model. We also showed that a small parallel corpus with annotated programs boosts performance. We plan to explore whether further fine-tuning of the distilled model using denotation-based training can lead to improvements in the unified parser. We also plan to investigate the possibility of augmenting the parallel corpus by bootstrapping from shared templates across domains. This would further make it feasible to perform transfer learning on a new domain. An interesting direction would be to enable domain experts to identify and actively request program annotations given the knowledge shared by other domains. We would also like to explore whether guiding the decoder through syntactic and domain-specific constraints helps in reducing the search space for the weakly supervised unified parser.