Semantic Parsing with Semi-Supervised Sequential Autoencoders

We present a novel semi-supervised approach for sequence transduction and apply it to semantic parsing. The unsupervised component is based on a generative model in which latent sentences generate the unpaired logical forms. We apply this method to a number of semantic parsing tasks focusing on domains with limited access to labelled training data and extend those datasets with synthetically generated logical forms.


Introduction
Neural approaches, in particular attention-based sequence-to-sequence models, have shown great promise and obtained state-of-the-art performance for sequence transduction tasks including machine translation (Bahdanau et al., 2015), syntactic constituency parsing (Vinyals et al., 2015), and semantic role labelling (Zhou and Xu, 2015). A key requirement for effectively training such models is an abundance of supervised data.
In this paper we focus on learning mappings from input sequences x to output sequences y in domains where the latter are easily obtained, but annotation in the form of (x, y) pairs is sparse or expensive to produce, and propose a novel architecture that accommodates semi-supervised training on sequence transduction tasks. To this end, we augment the transduction objective (x → y) with an autoencoding objective where the input sequence is treated as a latent variable (y → x → y), enabling training from both labelled pairs and unpaired output sequences. This is common in situations where we encode natural language into a logical form governed by some grammar or database.
While such an autoencoder could in principle be constructed by stacking two sequence transducers, modelling the latent variable as a series of discrete symbols drawn from multinomial distributions creates serious computational challenges, as it requires marginalising over the space of latent sequences $\Sigma_x^*$. To avoid this intractable marginalisation, we introduce a novel differentiable alternative for draws from a softmax which can be used with the reparametrisation trick of Kingma and Welling (2014). Rather than drawing a discrete symbol in $\Sigma_x$ from a softmax, we draw a distribution over symbols from a logistic-normal distribution at each time step. These serve as continuous relaxations of discrete samples, providing a differentiable estimator of the expected reconstruction log likelihood.
We demonstrate the effectiveness of our proposed model on three semantic parsing tasks: the GEOQUERY benchmark (Zelle and Mooney, 1996; Wong and Mooney, 2006), the SAIL maze navigation task (MacMahon et al., 2006) and the Natural Language Querying corpus (Haas and Riezler, 2016) on OpenStreetMap. As part of our evaluation, we introduce simple mechanisms for generating large amounts of unsupervised training data for two of these tasks.
In most settings, the semi-supervised model outperforms the supervised model, both when trained on additional generated data as well as on subsets of the existing data.

Dataset   Example
GEO       what are the high points of states surrounding mississippi
          answer(high_point_1(state(next_to_2(stateid('mississippi')))))
NLMAPS    Where are kindergartens in Hamburg?
          query(area(keyval('name','Hamburg')),nwr(keyval('amenity','kindergarten')),qtype(latlong))
SAIL      turn right at the bench into the yellow tiled hall
          (1, 6, 90) FORWARD - FORWARD - RIGHT - STOP (3, 6, 180)

Table 1: Examples of natural language x and logical form y from the three corpora and tasks used in this paper. Note that the SAIL corpus requires additional information in order to map from the instruction to the action sequence.
Figure 1: SEQ4 model with attention-sequence-to-sequence encoder and decoder. Circle nodes represent random variables.

Model
Our sequential autoencoder is shown in Figure 1. At a high level, it can be seen as two sequence-to-sequence models with attention (Bahdanau et al., 2015) chained together. More precisely, the model consists of four LSTMs (Hochreiter and Schmidhuber, 1997), hence the name SEQ4. The first, a bidirectional LSTM, encodes the sequence y; next, an LSTM with stochastic output, described below, draws a sequence of distributions $\tilde{x}$ over words in vocabulary $\Sigma_x$. The third LSTM encodes these distributions for the last one to attend over and reconstruct y as ŷ. We now give the details of these parts.

Encoding y
The first LSTM of the encoder half of the model reads the sequence y, represented as a sequence of one-hot vectors over the vocabulary $\Sigma_y$, using a bidirectional RNN into a sequence of vectors $h^y_{1:L_y}$, where $L_y$ is the sequence length of y,

$$h^y_t = \left( f^{\rightarrow}_y(y_t, h^{y,\rightarrow}_{t-1});\; f^{\leftarrow}_y(y_t, h^{y,\leftarrow}_{t+1}) \right) \tag{1}$$

where $f^{\rightarrow}_y$, $f^{\leftarrow}_y$ are non-linear functions applied at each time step to the current token $y_t$ and their recurrent states $h^{y,\rightarrow}_{t-1}$, $h^{y,\leftarrow}_{t+1}$, respectively. Both the forward and backward functions project the one-hot vector into a dense vector via an embedding matrix, which serves as input to an LSTM.

Predicting a Latent Sequence $\tilde{x}$
Subsequently, we wish to predict x. Predicting a discrete sequence of symbols through draws from multinomial distributions over a vocabulary is not an option, as we would not be able to backpropagate through this discrete choice. Marginalising over the possible latent strings or estimating the gradient through naïve Monte Carlo methods would be a prohibitively high variance process because the number of strings is exponential in the maximum length (which we would have to manually specify) with the vocabulary size as base. To allow backpropagation, we instead predict a sequence of distributions $\tilde{x}$ over the symbols of $\Sigma_x$ with an RNN attending over $h^y = h^y_{1:L_y}$, which will later serve to reconstruct y:

$$q(x|y) = \prod_{t=1}^{L_x} q(\tilde{x}_t \mid \{\tilde{x}_1, \cdots, \tilde{x}_{t-1}\}, h^y) \tag{2}$$

where q(x|y) models the mapping y → x. We define $q(\tilde{x}_t \mid \{\tilde{x}_1, \cdots, \tilde{x}_{t-1}\}, h^y)$ in the following way: Let the vector $\tilde{x}_t$ be a distribution over the vocabulary $\Sigma_x$ drawn from a logistic-normal distribution, the parameters of which, $\mu_t, \log(\sigma^2)_t \in \mathbb{R}^{|\Sigma_x|}$, are predicted by an LSTM attending over the outputs of the encoder (Equation 2), where $|\Sigma_x|$ is the size of the vocabulary $\Sigma_x$. The use of a logistic-normal distribution serves to regularise the model in the semi-supervised learning regime, which is described at the end of this section. Formally, this process, depicted in Figure 2, is as follows:

$$h^{\tilde{x}}_t = f_{\tilde{x}}(\tilde{x}_{t-1}, h^{\tilde{x}}_{t-1}, h^y) \tag{3}$$
$$\mu_t, \log(\sigma^2_t) = l(h^{\tilde{x}}_t) \tag{4}$$
$$\epsilon \sim \mathcal{N}(0, I) \tag{5}$$
$$\gamma_t = \mu_t + \sigma_t \epsilon \tag{6}$$
$$\tilde{x}_t = \mathrm{softmax}(\gamma_t) \tag{7}$$

where the $f_{\tilde{x}}$ function is an LSTM and $l$ a linear transformation to $\mathbb{R}^{2|\Sigma_x|}$. We use the reparametrisation trick from Kingma and Welling (2014) to draw from the logistic normal, allowing us to backpropagate through the sampling process.

Figure 2: Unsupervised case of the SEQ4 model.
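The sampling step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the parameter values `mu` and `log_var` are hypothetical stand-ins for what the attending LSTM would predict, and no gradients are computed here. The point is only that the noise is drawn independently of the parameters, so the resulting distribution over symbols is a deterministic function of the mean and log-variance once the noise is fixed.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_logistic_normal(mu, log_var, rng):
    """Draw one distribution over symbols with the reparametrisation trick.

    The noise eps does not depend on (mu, log_var), so gamma is a
    deterministic function of the parameters given eps.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    gamma = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return softmax(gamma)


rng = random.Random(0)
mu = [2.0, 0.0, -1.0, 0.5]          # hypothetical means, 4-symbol vocabulary
log_var = [-2.0, -2.0, -2.0, -2.0]  # hypothetical log-variances
x_tilde = sample_logistic_normal(mu, log_var, rng)
```

Each draw is a point on the probability simplex rather than a single symbol. As the variance shrinks, the draw approaches the plain softmax of the mean.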

Encoding x
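Moving on to the decoder part of our model, in the third LSTM, we embed and encode $\tilde{x}$:

$$h^{\tilde{x}}_t = \left( f^{\rightarrow}_x(\tilde{x}_t, h^{\tilde{x},\rightarrow}_{t-1});\; f^{\leftarrow}_x(\tilde{x}_t, h^{\tilde{x},\leftarrow}_{t+1}) \right) \tag{8}$$

When x is observed, during supervised training and also when making predictions, instead of the distribution $\tilde{x}$ we feed the one-hot encoded x to this part of the model.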
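Since the same part of the model accepts either a distribution over symbols or a one-hot vector, one natural reading is that the embedding step is a product with an embedding matrix. The sketch below illustrates that reading under this assumption; the matrix `E` and the input vectors are made-up toy values.

```python
def embed(symbol_weights, embedding_matrix):
    """Return the sum of embedding rows weighted by symbol_weights."""
    dim = len(embedding_matrix[0])
    out = [0.0] * dim
    for weight, row in zip(symbol_weights, embedding_matrix):
        for j in range(dim):
            out[j] += weight * row[j]
    return out


# Toy embedding matrix: 3 symbols, 2 dimensions (hypothetical values).
E = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

one_hot = [0.0, 1.0, 0.0]    # observed symbol: selects a single row
soft = [0.5, 0.25, 0.25]     # latent distribution: mixes all rows

hard_embedding = embed(one_hot, E)
soft_embedding = embed(soft, E)
```

With a one-hot input the product picks out one row of the matrix; with a distribution it yields the probability-weighted average of all rows, so both cases share the same parameters.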

Reconstructing y
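In the final LSTM, we decode into y:

$$p(\hat{y} \mid \tilde{x}) = \prod_{t=1}^{L_y} p(\hat{y}_t \mid \{\hat{y}_1, \cdots, \hat{y}_{t-1}\}, h^{\tilde{x}}) \tag{9}$$

Equation 9 is implemented as an LSTM attending over $h^{\tilde{x}}$ producing a sequence of symbols ŷ based on recurrent states $h^{\hat{y}}$, aiming to reproduce input y:

$$h^{\hat{y}}_t = f_{\hat{y}}(\hat{y}_{t-1}, h^{\hat{y}}_{t-1}, h^{\tilde{x}}) \tag{10}$$
$$\hat{y}_t \propto \exp(l'(h^{\hat{y}}_t)) \tag{11}$$

where $f_{\hat{y}}$ is the non-linear function, and the actual probabilities are given by a softmax function after a linear transformation $l'$ of $h^{\hat{y}}$. At training time, rather than $\hat{y}_{t-1}$ we feed the ground truth $y_{t-1}$.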

Loss function
The complete model described in this section gives a reconstruction function y → ŷ. We define a loss on this reconstruction which accommodates the unsupervised case, where x is not observed in the training data, and the supervised case, where (x, y) pairs are available. Together, these allow us to train the SEQ4 model in a semi-supervised setting, which experiments will show provides some benefits over a purely supervised training regime.
Unsupervised case When x isn't observed, the loss we minimise during training is the reconstruction loss on y, expressed as the negative log-likelihood $NLL(\hat{y}, y)$ of the true labels y relative to the predictions ŷ. To this, we add as a regularising term the KL divergence $KL[q(\gamma|y) \,\|\, p(\gamma)]$, which effectively penalises the mean and variance of $q(\gamma|y)$ for diverging from those of a prior $p(\gamma)$, which we model as a diagonal Gaussian $\mathcal{N}(0, I)$. This has the effect of smoothing the logistic-normal distribution from which we draw the distributions over symbols of x, guarding against overfitting of the latent distributions over x to symbols seen in the supervised case discussed below. The unsupervised loss is therefore formalised as

$$\mathcal{L}^{unsup} = NLL(\hat{y}, y) + \alpha\, KL[q(\gamma|y) \,\|\, p(\gamma)] \tag{12}$$

with the regularising factor α tuned on validation, and

$$KL[q(\gamma|y) \,\|\, p(\gamma)] = \sum_{i=1}^{L_x} KL[q(\gamma_i|y) \,\|\, p(\gamma)].$$

We use a closed form of these individual KL divergences, described by Kingma and Welling (2014).
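For reference, the well-known closed form of the KL divergence between a diagonal Gaussian with mean µ and variance σ² and the standard normal prior is 0.5 · Σ (µ² + σ² − 1 − log σ²). The sketch below applies it per time step and adds the weighted sum to a reconstruction loss. The function names, the example parameters and the value of `alpha` are illustrative assumptions, not values from the paper.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def unsupervised_loss(nll_y, step_params, alpha):
    """Reconstruction NLL plus alpha times the KL summed over time steps.

    step_params holds one (mu, log_var) pair per latent time step.
    """
    kl_total = sum(kl_to_standard_normal(mu, lv) for mu, lv in step_params)
    return nll_y + alpha * kl_total


# Hypothetical two-step latent sequence over a 1-symbol toy vocabulary.
params = [([1.0], [0.0]), ([0.0], [0.0])]
loss = unsupervised_loss(nll_y=2.0, step_params=params, alpha=0.1)
```

The KL term is zero exactly when the predicted mean is zero and the variance is one, which is how it pulls the latent distributions towards the prior.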
Supervised case When x is observed, we additionally minimise the prediction loss on x, expressed as the negative log-likelihood $NLL(\tilde{x}, x)$ of the true labels x relative to the predictions $\tilde{x}$, and do not impose the KL loss. The supervised loss is thus

$$\mathcal{L}^{sup} = NLL(\tilde{x}, x) + NLL(\hat{y}, y) \tag{13}$$

In both the supervised and unsupervised case, because of the continuous relaxation on generating $\tilde{x}$ and the reparameterisation trick, the gradient of the losses with regard to the model parameters is well defined throughout SEQ4.
Semi-supervised training and inference We train with a weighted combination of the supervised and unsupervised losses described above. Once trained, we simply use the x → y decoder segment of the model to predict y from sequences of symbols x represented as one-hot vectors. When the decoder is trained without the encoder in a fully supervised manner, it serves as our supervised sequence-to-sequence baseline model under the name S2S.
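A rough sketch of how the two losses might be combined over a mixed batch is shown below. The text does not specify the weighting scheme, so the single scalar `unsup_weight` is an assumption made for illustration, as are the function names and the toy distributions.

```python
import math


def nll(predicted, target_ids):
    """Negative log-likelihood of target ids under per-step distributions."""
    return -sum(math.log(dist[t]) for dist, t in zip(predicted, target_ids))


def supervised_loss(x_tilde, x_ids, y_hat, y_ids):
    """Prediction loss on x plus reconstruction loss on y, with no KL term."""
    return nll(x_tilde, x_ids) + nll(y_hat, y_ids)


def semi_supervised_loss(sup_losses, unsup_losses, unsup_weight):
    """Weighted combination; unsup_weight is a hypothetical hyperparameter."""
    return sum(sup_losses) + unsup_weight * sum(unsup_losses)


# Toy per-step distributions over a 2-symbol vocabulary.
x_tilde = [[0.5, 0.5], [0.25, 0.75]]
y_hat = [[0.5, 0.5]]
sup = supervised_loss(x_tilde, [0, 1], y_hat, [1])
total = semi_supervised_loss([sup], [2.0], unsup_weight=0.5)
```

At prediction time none of this is needed: only the x → y half of the model is run.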

Tasks and Data Generation
We apply our model to three tasks outlined in this section. Moreover, we explain how we generated additional unsupervised training data for two of these tasks. Examples from all datasets are in Table 1.

GeoQuery
The first task we consider is the prediction of a query on the GEO corpus which is a frequently used benchmark for semantic parsing. The corpus contains 880 questions about US geography together with executable queries representing those questions. We follow the approach established by Zettlemoyer and Collins (2005) and split the corpus into 600 training and 280 test cases. Following common practice, we augment the dataset by referring to the database during training and test time. In particular, we use the database to identify and anonymise variables (cities, states, countries and rivers) following the method described in Dong and Lapata (2016).
Most prior work on the GEO corpus relies on standard semantic parsing methods together with custom heuristics or pipelines for this corpus. The recent paper by Dong and Lapata (2016) is of note, as it uses a sequence-to-sequence model for training which is the unidirectional equivalent to S2S, and also to the decoder part of our SEQ4 network.

Open Street Maps
The second task we tackle with our model is the NLMAPS dataset by Haas and Riezler (2016). The dataset contains 1,500 training and 880 testing instances of natural language questions with corresponding machine readable queries over the geographical OpenStreetMap database. The dataset contains natural language questions in both English and German but we focus only on single language semantic parsing, similar to the first task in Haas and Riezler (2016). We use the data as it is, with the only pre-processing step being the tokenization of both natural language and query form.

Navigational Instructions to Actions
The SAIL corpus and task were developed to train agents to follow free-form navigational route instructions in a maze environment (MacMahon et al., 2006; Chen and Mooney, 2011). It consists of a small number of mazes containing features such as objects, wall and floor types. These mazes come together with a large number of human instructions paired with the required actions to reach the goal state described in those instructions.
We use the sentence-aligned version of the SAIL route instruction dataset containing 3,236 sentences (Chen and Mooney, 2011). Following previous work, we accept an action sequence as correct if and only if the final position and orientation exactly match those of the gold data. We do not perform any pre-processing on this dataset.
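The acceptance criterion can be sketched as follows. The state format (x, y, heading) and the action names follow the SAIL example in Table 1, where two FORWARD steps at heading 90 move the agent from x = 1 to x = 3 and RIGHT changes the heading from 90 to 180. The direction of travel for headings 0 and 180 is an assumption, and walls are ignored for simplicity.

```python
# Assumed heading convention: 90 means +x (consistent with the Table 1
# example); the y direction for headings 0 and 180 is a guess.
STEP = {0: (0, -1), 90: (1, 0), 180: (0, 1), 270: (-1, 0)}


def execute(start, actions):
    """Apply an action sequence to (x, y, heading) and return the end state."""
    x, y, heading = start
    for action in actions:
        if action == "FORWARD":
            dx, dy = STEP[heading]
            x, y = x + dx, y + dy
        elif action == "RIGHT":
            heading = (heading + 90) % 360
        elif action == "LEFT":
            heading = (heading - 90) % 360
        elif action == "STOP":
            break
        else:
            raise ValueError(f"unknown action: {action}")
    return (x, y, heading)


def is_correct(start, predicted_actions, gold_end):
    """Correct iff final position and orientation both match the gold state."""
    return execute(start, predicted_actions) == gold_end


end_state = execute((1, 6, 90), ["FORWARD", "FORWARD", "RIGHT", "STOP"])
```

Under this criterion a prediction counts as correct even if its action sequence differs from the gold one, as long as it ends in the same position and orientation.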

Data Generation
As argued earlier, we are focusing on tasks where aligned data is sparse and expensive to obtain, while it should be cheap to get unsupervised, monomodal data. While that is a reasonable assumption for real world data, the datasets considered have no such component, thus the approach taken here is to generate random database queries or maze paths, i.e. the machine readable side of the data, and train a semi-supervised model. The alternative not explored here would be to generate natural language questions or instructions instead, but that is more difficult to achieve without human intervention. For this reason, we generate the machine readable side of the data for the GEOQUERY and SAIL tasks (see footnote 5).
For GEOQUERY, we fit a 3-gram Kneser-Ney (Chen and Goodman, 1999) model to the queries in the training set and sample about 7 million queries from it. We ensure that the sampled queries are different from the training queries, but do not enforce validity. This intentionally simplistic approach is to demonstrate the applicability of our model.
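The generation procedure can be sketched with a much simpler stand-in: an unsmoothed maximum-likelihood trigram model instead of the Kneser-Ney model used in the paper. The two training queries below are toy examples in the style of the tokenised GEOQUERY logical forms, and all function names are hypothetical. As in the text, samples that match a training query are rejected and validity is not checked.

```python
import random
from collections import defaultdict

BOS, EOS = "<s>", "</s>"


def fit_trigrams(sequences):
    """Collect next-token lists per two-token context (unsmoothed MLE)."""
    successors = defaultdict(list)
    for tokens in sequences:
        padded = [BOS, BOS] + list(tokens) + [EOS]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            successors[(a, b)].append(c)
    return successors


def sample_query(model, rng, max_len=50):
    """Sample tokens until the end marker or max_len is reached."""
    context, out = (BOS, BOS), []
    while len(out) < max_len:
        token = rng.choice(model[context])
        if token == EOS:
            break
        out.append(token)
        context = (context[1], token)
    return out


def sample_novel(model, training, rng, n, max_tries=10000):
    """Keep sampled queries that differ from every training query."""
    seen = {tuple(t) for t in training}
    novel = []
    for _ in range(max_tries):
        if len(novel) >= n:
            break
        query = sample_query(model, rng)
        if tuple(query) not in seen:
            novel.append(query)
    return novel


train = [
    ["answer", "city", "loc_2", "state", "stateid", "STATE"],
    ["answer", "smallest", "city", "loc_2", "countryid", "COUNTRY"],
]
model = fit_trigrams(train)
samples = sample_novel(model, train, random.Random(0), n=5)
```

Even on this tiny example the model recombines fragments of the two training queries into new ones, which is the effect the sampling relies on.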
The SAIL dataset has only three mazes. We added a fourth one and over 150k random paths, including duplicates. The new maze is larger (21 × 21 grid) than the existing ones, and seeks to approximately replicate the key statistics of the other three mazes (maximum corridor length, distribution of objects, etc). Paths within that maze are created by randomly sampling start and end positions.
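A possible way to produce such paths is sketched below. The text only states that start and end positions are sampled at random; connecting them with a breadth-first shortest path is an assumption made here, and the small cross-shaped maze is a toy layout rather than the 21 × 21 maze from the paper.

```python
import random
from collections import deque


def shortest_path(open_cells, start, goal):
    """Breadth-first search over 4-connected open cells; None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in open_cells and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


def sample_path(open_cells, rng):
    """Sample distinct start and end cells and return a path between them."""
    cells = sorted(open_cells)
    while True:
        start, goal = rng.sample(cells, 2)
        path = shortest_path(open_cells, start, goal)
        if path is not None:
            return path


# Toy maze: one horizontal and one vertical corridor crossing at (2, 0).
maze = {(x, 0) for x in range(5)} | {(2, y) for y in range(5)}
path = sample_path(maze, random.Random(1))
```

Turning such a cell sequence into FORWARD, LEFT and RIGHT actions would additionally require a heading convention, which is left out here.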

Experiments
We evaluate our model on the three tasks in multiple settings. First, we establish a supervised baseline to compare the S2S model with prior work. Next, we train our SEQ4 model in a semi-supervised setting on the entire dataset with the additional monomodal training data described in the previous section. Finally, we perform an "ablation" study where we discard some of the training data and compare S2S to SEQ4. S2S is trained solely on the reduced data in a supervised manner, while SEQ4 is once again trained semi-supervised on the same reduced data plus the machine readable part of the discarded data (SEQ4-) or on the extra generated data (SEQ4+).
Footnote 5: Our randomly generated unsupervised datasets can be downloaded from http://deepmind.com/publications
Training We train the model using standard gradient descent methods. As none of the datasets used here contain development sets, we tune hyperparameters by cross-validating on the training data. In the case of the SAIL corpus we train on three folds (two mazes for training and validation, one for test each) and report weighted results across the folds following prior work (Mei et al., 2016).
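Weighting fold results by test-set size amounts to the small computation below. The accuracies and fold sizes in the example are made-up numbers for illustration and are not results from the paper.

```python
def weighted_accuracy(folds):
    """Average per-fold accuracy weighted by the number of test examples.

    folds is a list of (accuracy, test_size) pairs.
    """
    total = sum(size for _, size in folds)
    return sum(acc * size for acc, size in folds) / total


# Hypothetical fold results: (accuracy, number of test sentences).
example = weighted_accuracy([(0.5, 100), (1.0, 300)])
```

This is equivalent to pooling all test examples across folds and computing a single accuracy.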

GeoQuery
The evaluation metric for GEOQUERY is the accuracy of exactly predicting the machine readable query. As results in Table 2 show, our supervised S2S baseline model performs slightly better than the comparable model by Dong and Lapata (2016). The semi-supervised SEQ4 model with the additional generated queries improves on it further.
The ablation study in Table 3 demonstrates a widening gap between supervised and semi-supervised as the amount of labelled training data gets smaller. This suggests that our model can leverage unlabelled data even when only a small amount of labelled data is available.

Open Street Maps
We report results for the NLMAPS corpus in Table 4, comparing the supervised S2S model to the results posted by Haas and Riezler (2016). While their model used a semantic parsing pipeline including alignment, stemming, language modelling and CFG inference, the strong performance of the S2S model demonstrates the strength of fairly vanilla attention-based sequence-to-sequence models. It should be pointed out that the previous work reports the number of correct answers when queries were executed against the dataset, while we evaluate on the strict accuracy of the generated queries. While we expect these numbers to be nearly equivalent, our evaluation is strictly harder as it does not allow for reordering of query arguments and similar relaxations.
We investigate the SEQ4 model only via the ablation study in Table 5 and find little gain through the semi-supervised objective. Our attempt at cheaply generating unsupervised data for this task was not successful, likely due to the complexity of the underlying database.

Navigational Instructions to Actions
Model extension The experiments for the SAIL task differ slightly from the other two tasks in that the language input does not suffice for choosing an action. While a simple instruction such as 'turn left' can easily be translated into the action sequence LEFT-STOP, more complex instructions such as 'Walk forward until you see a lamp' require knowledge of the agent's position in the maze.
To accomplish this we modify the model as follows. First, when encoding action sequences, we concatenate each action with a representation of the maze at the given position, representing the maze state akin to Mei et al. (2016) with a bag-of-features vector. Second, when decoding action sequences, the RNN outputs an action which is used to update the agent's position and the representation of that new position is fed into the RNN as its next input.
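The decoding loop described in the second modification can be sketched as below, with a hand-written rule standing in for the RNN. Everything here is illustrative: the heading convention, the `features_at` function and the `walk_until_lamp` policy are hypothetical, and the toy maze has a single feature at one position.

```python
# Assumed heading convention: 90 means +x; walls are ignored.
STEP = {0: (0, -1), 90: (1, 0), 180: (0, 1), 270: (-1, 0)}


def move(state, action):
    """Update (x, y, heading) after one non-terminal action."""
    x, y, heading = state
    if action == "FORWARD":
        dx, dy = STEP[heading]
        return (x + dx, y + dy, heading)
    if action == "RIGHT":
        return (x, y, (heading + 90) % 360)
    if action == "LEFT":
        return (x, y, (heading - 90) % 360)
    raise ValueError(f"unknown action: {action}")


def decode(choose_action, features_at, start, max_steps=20):
    """Alternate between picking an action and observing the new position."""
    state, actions = start, []
    for _ in range(max_steps):
        # The features of the current position are the next decoder input.
        action = choose_action(features_at(state), actions)
        actions.append(action)
        if action == "STOP":
            break
        state = move(state, action)
    return actions, state


def features_at(state):
    """Toy bag of features: a lamp is visible only at position (3, 0)."""
    return {"lamp"} if state[:2] == (3, 0) else set()


def walk_until_lamp(features, history):
    """Stand-in for the RNN: 'Walk forward until you see a lamp'."""
    return "STOP" if "lamp" in features else "FORWARD"


actions, final_state = decode(walk_until_lamp, features_at, (0, 0, 90))
```

The example shows why the instruction alone is not enough: the number of FORWARD steps depends on where the lamp sits relative to the starting position.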
Training regime We cross-validate over the three mazes in the dataset and report overall results weighted by test size (cf. Mei et al. (2016)). Both our supervised and semi-supervised model perform worse than the state-of-the-art (see Table 6), but the latter enjoys a comfortable margin over the former. As the S2S model broadly reimplements the work of Mei et al. (2016), we put the discrepancy in performance down to the particular design choices that we did not follow in order to keep the model here as general as possible and comparable across tasks.
The ablation studies (Table 7) show little gain for the semi-supervised approach when only using data from the original training set, but substantial improvement with the additional unsupervised data.

Discussion
Supervised training The prediction accuracies of our supervised baseline S2S model are mixed with respect to prior results on their respective tasks. For GEOQUERY, S2S performs significantly better than the most similar model from the literature (Dong and Lapata, 2016), mostly due to the fact that y and x are encoded with bidirectional LSTMs. With a unidirectional LSTM we get similar results to theirs. On the SAIL corpus, S2S performs worse than the state of the art. As the models are broadly equivalent we attribute this difference to a number of task-specific choices and optimisations (see footnote 7) made in Mei et al. (2016) which we did not reimplement for the sake of using a common model across all three tasks.
For NLMAPS, S2S performs much better than the state-of-the-art, exceeding the previous best result by 11% despite a very simple tokenization method and a lack of any form of entity anonymisation.
Footnote 7: In particular we don't use beam search and ensembling.

Input from unsupervised data (y)                    Generated latent representation (x)
answer smallest city loc_2 state stateid STATE      what is the smallest city in the state of STATE </S>
answer city loc_2 state next_to_2 stateid STATE     what are the cities in states which border STATE </S>
answer mountain loc_2 countryid COUNTRY             what is the lakes in COUNTRY </S>
answer state next_to_2 state all                    which states longer states show peak states to </S>

Table 8: Positive and negative examples of latent language together with the randomly generated logical form from the unsupervised part of the GEOQUERY training. Note that the natural language (x) does not occur anywhere in the training data in this form.

Model                       Accuracy
Chen and Mooney (2011)      54.40
Kim and Mooney (2012)       57.22
Andreas and Klein (2015)    59.60
Kim and Mooney (2013)       62.81
Artzi et al. (2014)         64.36
[...]                       65.[...]
Semi-supervised training In both the case of GEOQUERY and the SAIL task we found the semi-supervised model to convincingly outperform the fully supervised model. The effect was particularly notable in the case of the SAIL corpus, where performance increased from 58.60% accuracy to 63.25% (see Table 6). It is worth remembering that the supervised training regime consists of three folds of tuning on two maps with subsequent testing on the third map, which carries a risk of overfitting to the training maps. The introduction of the fourth unsupervised map clearly mitigates this effect. Table 8 shows some examples of unsupervised logical forms being transformed into natural language, which demonstrate how the model can learn to sensibly ground unsupervised data.

Ablation performance
The experiments with additional unsupervised data prove the feasibility of our approach and clearly demonstrate the usefulness of the SEQ4 model for the general class of sequence-to-sequence tasks where supervised data is hard to come by. To analyse the model further, we also look at the performance of both S2S and SEQ4 when reducing the amount of supervised training data available to the model. We compare three settings: the supervised S2S model with reduced training data, SEQ4- which uses the removed training data in an unsupervised fashion (throwing away the natural language) and SEQ4+ which uses the randomly generated unsupervised data described in Section 3. The S2S model behaves as expected on all three tasks, its performance dropping with the size of the training data. The performance of SEQ4- and SEQ4+ requires more analysis.
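The data preparation for the S2S and SEQ4- settings can be sketched as follows. How the retained subset was actually chosen is not stated in the text, so the random shuffle here is an assumption, and the function name and toy pairs are hypothetical.

```python
import random


def ablation_split(pairs, labelled_fraction, rng):
    """Split (x, y) pairs into a labelled part and unpaired y sequences.

    The labelled part is what S2S trains on; the unpaired part keeps only
    the logical forms, which is the extra data used in the SEQ4- setting.
    """
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * labelled_fraction)
    labelled = shuffled[:cut]
    unpaired_y = [y for _, y in shuffled[cut:]]
    return labelled, unpaired_y


# Toy corpus of 20 (question, query) pairs.
corpus = [(f"question {i}", f"query {i}") for i in range(20)]
labelled, unpaired_y = ablation_split(corpus, 0.25, random.Random(0))
```

For SEQ4+ the unpaired part would instead come from the separately generated data, leaving the labelled part unchanged.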
In the case of GEOQUERY, having unlabelled data from the true distribution (SEQ4-) is a good thing when there is enough of it, as clearly seen when only 5% of the original dataset is used for supervised training and the remaining 95% is used for unsupervised training. The gap shrinks as the amount of supervised data is increased, which is as expected. On the other hand, using a large amount of extra, generated data from an approximating distribution (SEQ4+) does not help as much initially when compared with the unsupervised data from the true distribution. However, as the size of the unsupervised dataset in SEQ4- becomes the bottleneck this gap closes and eventually the model trained on the extra data achieves higher accuracy.
For the SAIL task the semi-supervised models do better than the supervised results throughout, with the model trained on randomly generated additional data consistently outperforming the model trained only on the original data. This gives further credence to the risk of overfitting to the training mazes already mentioned above.
Finally, in the case of the NLMAPS corpus, the semi-supervised approach does not appear to help much at any point during the ablation. These indistinguishable results are likely due to the task's complexity: at the lower percentages the ablation experiments have too little supervised data to sufficiently ground the latent space and make use of the unsupervised data, and at the higher percentages they have too little unsupervised data to meaningfully improve the model.

Related Work
Semantic parsing The tasks in this paper all broadly belong to the domain of semantic parsing, which describes the process of mapping natural language to a formal representation of its meaning. This is extended in the SAIL navigation task, where the formal representation is a function of both the language instruction and a given environment.
While a large body of relevant literature focuses on defining the grammar of the logical forms (Zettlemoyer and Collins, 2005), other models learn purely from aligned pairs of text and logical form (Berant and Liang, 2014), or from more weakly supervised signals such as question-answer pairs together with a database (Liang et al., 2011). Recent work of Jia and Liang (2016) induces a synchronous context-free grammar and generates additional training examples (x, y), which is one way to address data scarcity issues. The semi-supervised setup proposed here offers an alternative solution to this issue.
Discrete autoencoders Very recently there has been some related work on discrete autoencoders for natural language processing (Suster et al., 2016; Marcheggiani and Titov, 2016, i.a.). This work presents a first approach to using effectively discretised sequential information as the latent representation without resorting to draconian assumptions (Ammar et al., 2014) to make marginalisation tractable. While our model is not exactly marginalisable either, the continuous relaxation makes training far more tractable. A related idea was recently presented in Gülçehre et al. (2015), who use monolingual data to improve machine translation by fusing a sequence-to-sequence model and a language model.

Conclusion
We described a method for augmenting a supervised sequence transduction objective with an autoencoding objective, thereby enabling semi-supervised training where previously a scarcity of aligned data might have held back model performance. Across multiple semantic parsing tasks we demonstrated the effectiveness of this approach, improving model performance by training on randomly generated unsupervised data in addition to the original data. Going forward it would be interesting to further analyse the effects of sampling from a logistic-normal distribution as opposed to a softmax in order to better understand how this impacts the distribution in the latent space. While we focused on tasks with little supervised data and additional unsupervised data in y, it would be straightforward to reverse the model to train it with additional labelled data in x, i.e. on the natural language side. A natural extension would also be a formulation where semi-supervised training was performed in both x and y.
For instance, machine translation lends itself to such a formulation where for many language pairs parallel data may be scarce while there is an abundance of monolingual data.