Coarse-to-Fine Decoding for Neural Semantic Parsing

Semantic parsing aims at mapping natural language utterances into structured meaning representations. In this work, we propose a structure-aware neural architecture which decomposes the semantic parsing process into two stages. Given an input utterance, we first generate a rough sketch of its meaning, where low-level information (such as variable names and arguments) is glossed over. Then, we fill in missing details by taking into account the natural language input and the sketch itself. Experimental results on four datasets characteristic of different domains and meaning representations show that our approach consistently improves performance, achieving competitive results despite the use of relatively simple decoders.


Introduction
Semantic parsing maps natural language utterances onto machine interpretable meaning representations (e.g., executable queries or logical forms). The successful application of recurrent neural networks to a variety of NLP tasks (Bahdanau et al., 2015; Vinyals et al., 2015) has provided strong impetus to treat semantic parsing as a sequence-to-sequence problem (Jia and Liang, 2016; Dong and Lapata, 2016; Ling et al., 2016). The fact that meaning representations are typically structured objects has prompted efforts to develop neural architectures which explicitly account for their structure. Examples include tree decoders (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017), decoders constrained by a grammar model (Xiao et al., 2016; Yin and Neubig, 2017; Krishnamurthy et al., 2017), or modular decoders which use syntax to dynamically compose various submodels (Rabinovich et al., 2017).
In this work, we propose to decompose the decoding process into two stages. The first decoder focuses on predicting a rough sketch of the meaning representation, which omits low-level details, such as arguments and variable names. Example sketches for various meaning representations are shown in Table 1. Then, a second decoder fills in missing details by conditioning on the natural language input and the sketch itself. Specifically, the sketch constrains the generation process and is encoded into vectors to guide decoding.
We argue that there are at least three advantages to the proposed approach. Firstly, the decomposition disentangles high-level from low-level semantic information, which enables the decoders to model meaning at different levels of granularity. As shown in Table 1, sketches are more compact and as a result easier to generate compared to decoding the entire meaning structure in one go. Secondly, the model can explicitly share knowledge of coarse structures for the examples that have the same sketch (i.e., basic meaning), even though their actual meaning representations are different (e.g., due to different details). Thirdly, after generating the sketch, the decoder knows what the basic meaning of the utterance looks like, and the model can use it as global context to improve the prediction of the final details.

Related Work
Various models have been proposed over the years to learn semantic parsers from natural language expressions paired with their meaning representations (Tang and Mooney, 2000; Ge and Mooney, 2005; Zettlemoyer and Collins, 2007; Wong and Mooney, 2007; Lu et al., 2008; Kwiatkowski et al., 2011; Andreas et al., 2013; Zhao and Huang, 2015). These systems typically learn lexicalized mapping rules and scoring models to construct a meaning representation for a given input.
More recently, neural sequence-to-sequence models have been applied to semantic parsing with promising results (Dong and Lapata, 2016; Jia and Liang, 2016; Ling et al., 2016), eschewing the need for extensive feature engineering. Several ideas have been explored to enhance the performance of these models such as data augmentation (Kočiský et al., 2016; Jia and Liang, 2016), transfer learning (Fan et al., 2017), sharing parameters for multiple languages or meaning representations (Susanto and Lu, 2017; Herzig and Berant, 2017), and utilizing user feedback signals (Iyer et al., 2017). There are also efforts to develop structured decoders that make use of the syntax of meaning representations. Dong and Lapata (2016) and Alvarez-Melis and Jaakkola (2017) develop models which generate tree structures in a top-down fashion. Xiao et al. (2016) and Krishnamurthy et al. (2017) employ the grammar to constrain the decoding process. Cheng et al. (2017) use a transition system to generate variable-free queries. Yin and Neubig (2017) design a grammar model for the generation of abstract syntax trees (Aho et al., 2007) in depth-first, left-to-right order. Rabinovich et al. (2017) propose a modular decoder whose submodels are dynamically composed according to the generated tree structure.
Our own work also aims to model the structure of meaning representations more faithfully. The flexibility of our approach enables us to easily apply sketches to different types of meaning representations, e.g., trees or other structured objects. Coarse-to-fine methods have been popular in the NLP literature, and are perhaps best known for syntactic parsing (Charniak et al., 2006; Petrov, 2011). Artzi and Zettlemoyer (2013) and Zhang et al. (2017) use coarse lexical entries or macro grammars to reduce the search space of semantic parsers. Compared with coarse-to-fine inference for lexical induction, sketches in our case are abstractions of the final meaning representation.
The idea of using sketches as intermediate representations has also been explored in the field of program synthesis (Solar-Lezama, 2008; Zhang and Sun, 2013; Feng et al., 2017). Yaghmazadeh et al. (2017) use SEMPRE (Berant et al., 2013) to map a sentence into SQL sketches which are completed using program synthesis techniques and iteratively repaired if they are faulty.

Problem Formulation
Our goal is to learn semantic parsers from instances of natural language expressions paired with their structured meaning representations.

Figure 1: We first generate the meaning sketch a for natural language input x. Then, a fine meaning decoder fills in the missing details (shown in red) of meaning representation y. The coarse structure a is used to guide and constrain the output decoding.
Let $x = x_1 \cdots x_{|x|}$ denote a natural language expression, and $y = y_1 \cdots y_{|y|}$ its meaning representation. We wish to estimate $p(y|x)$, the conditional probability of meaning representation $y$ given input $x$. We decompose $p(y|x)$ into a two-stage generation process:

$$p(y|x) = p(y|x, a)\, p(a|x) \qquad (1)$$

where $a = a_1 \cdots a_{|a|}$ is an abstract sketch representing the meaning of $y$. We defer detailed description of how sketches are extracted to Section 4. Suffice it to say that the extraction amounts to stripping off arguments and variable names in logical forms, schema specific information in SQL queries, and substituting tokens with types in source code (see Table 1). As shown in Figure 1, we first predict sketch $a$ for input $x$, and then fill in missing details to generate the final meaning representation $y$ by conditioning on both $x$ and $a$. The sketch is encoded into vectors which in turn guide and constrain the decoding of $y$. We view the input expression $x$, the meaning representation $y$, and its sketch $a$ as sequences. The generation probabilities are factorized as:

$$p(a|x) = \prod_{t=1}^{|a|} p(a_t \mid a_{<t}, x) \qquad (2)$$

$$p(y|x, a) = \prod_{t=1}^{|y|} p(y_t \mid y_{<t}, x, a) \qquad (3)$$

where $a_{<t} = a_1 \cdots a_{t-1}$ and $y_{<t} = y_1 \cdots y_{t-1}$. In the following, we will explain how $p(a|x)$ and $p(y|x, a)$ are estimated.
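The two-stage factorization can be illustrated with a small numeric sketch. The per-token probabilities below are made-up values and `sequence_log_prob` is a hypothetical helper, not part of the released code; the snippet only shows how the sketch term and the meaning term combine in log space.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Log of a product of per-token probabilities (the form of Eqs. (2)-(3))."""
    return sum(math.log(p) for p in token_probs)


# Made-up per-token probabilities for a single example (illustration only).
sketch_token_probs = [0.9, 0.8, 0.95]        # p(a_t | a_<t, x)
meaning_token_probs = [0.9, 0.7, 0.85, 0.9]  # p(y_t | y_<t, x, a)

log_p_sketch = sequence_log_prob(sketch_token_probs)    # log p(a | x)
log_p_meaning = sequence_log_prob(meaning_token_probs)  # log p(y | x, a)

# Two-stage decomposition: log p(y | x) = log p(y | x, a) + log p(a | x).
log_p_joint = log_p_meaning + log_p_sketch
```

Working in log space keeps the product of many small probabilities numerically stable, which is also why the training objective is stated as a sum of log-likelihoods.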

Sketch Generation
An encoder is used to encode the natural language input x into vector representations. Then, a decoder learns to compute p(a|x) and generate the sketch a conditioned on the encoding vectors.
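Input Encoder Every input word is mapped to a vector via $\mathbf{x}_t = \mathbf{W}_x \mathbf{o}(x_t)$, where $\mathbf{W}_x \in \mathbb{R}^{n \times |V_x|}$ is an embedding matrix, $|V_x|$ is the vocabulary size, and $\mathbf{o}(x_t)$ a one-hot vector. We use a bi-directional recurrent neural network with long short-term memory units (LSTM, Hochreiter and Schmidhuber 1997) as the input encoder. The encoder recursively computes the hidden vectors at the t-th time step via:

$$\overrightarrow{\mathbf{e}}_t = f_{\mathrm{LSTM}}(\overrightarrow{\mathbf{e}}_{t-1}, \mathbf{x}_t), \quad t = 1, \cdots, |x| \qquad (4)$$

$$\overleftarrow{\mathbf{e}}_t = f_{\mathrm{LSTM}}(\overleftarrow{\mathbf{e}}_{t+1}, \mathbf{x}_t), \quad t = |x|, \cdots, 1 \qquad (5)$$

$$\mathbf{e}_t = [\overrightarrow{\mathbf{e}}_t, \overleftarrow{\mathbf{e}}_t] \qquad (6)$$

where $[\cdot, \cdot]$ denotes vector concatenation, $\mathbf{e}_t \in \mathbb{R}^n$, and $f_{\mathrm{LSTM}}$ is the LSTM function.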
Coarse Meaning Decoder The decoder's hidden vector at the t-th time step is computed by $\mathbf{d}_t = f_{\mathrm{LSTM}}(\mathbf{d}_{t-1}, \mathbf{a}_{t-1})$, where $\mathbf{a}_{t-1} \in \mathbb{R}^n$ is the embedding of the previously predicted token.
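The hidden states of the first time step in the decoder are initialized by the concatenated encoding vectors $\mathbf{d}_0 = [\overrightarrow{\mathbf{e}}_{|x|}, \overleftarrow{\mathbf{e}}_1]$. Additionally, we use an attention mechanism (Luong et al., 2015) to learn soft alignments. We compute the attention score for the current time step t of the decoder, with the k-th hidden state in the encoder as:

$$s_{t,k} = \exp\{\mathbf{d}_t \cdot \mathbf{e}_k\} / Z_t \qquad (7)$$

where $Z_t = \sum_{j=1}^{|x|} \exp\{\mathbf{d}_t \cdot \mathbf{e}_j\}$ is a normalization term. Then we compute $p(a_t \mid a_{<t}, x)$ via:

$$\mathbf{e}^d_t = \sum_{k=1}^{|x|} s_{t,k}\, \mathbf{e}_k \qquad (8)$$

$$\mathbf{d}^{att}_t = \tanh(\mathbf{W}_1 \mathbf{d}_t + \mathbf{W}_2 \mathbf{e}^d_t) \qquad (9)$$

$$p(a_t \mid a_{<t}, x) = \mathrm{softmax}_{a_t}(\mathbf{W}_o \mathbf{d}^{att}_t + \mathbf{b}_o) \qquad (10)$$

where $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{n \times n}$, $\mathbf{W}_o \in \mathbb{R}^{|V_a| \times n}$, and $\mathbf{b}_o \in \mathbb{R}^{|V_a|}$ are parameters. Generation terminates once an end-of-sequence token "</s>" is emitted.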
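The attention score and its normalization term can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: vectors are lists of floats, `attention` is a hypothetical helper name, and the weighted sum of encoder states it also returns is the usual context vector of Luong-style attention. Subtracting the maximum logit before exponentiating is a numerical-stability detail added here and leaves the scores unchanged.

```python
import math
from typing import List, Sequence, Tuple


def attention(d_t: Sequence[float],
              encoder_states: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Dot-product attention of one decoder state over all encoder states."""
    # Unnormalized scores d_t . e_k for every encoder position k.
    logits = [sum(a * b for a, b in zip(d_t, e_k)) for e_k in encoder_states]
    # Softmax over encoder positions (max-shifted for numerical stability).
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    z_t = sum(exps)
    scores = [v / z_t for v in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(s * e_k[i] for s, e_k in zip(scores, encoder_states))
               for i in range(dim)]
    return scores, context


# Toy example: the decoder state is closer to the first encoder state.
scores, context = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy call the first encoder state receives the larger weight because its dot product with the decoder state is larger.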

Meaning Representation Generation
Meaning representations are predicted by conditioning on the input x and the generated sketch a.
The model uses the encoder-decoder architecture to compute p (y|x, a), and decorates the sketch a with details to generate the final output.
Sketch Encoder As shown in Figure 1, a bi-directional LSTM encoder maps the sketch sequence a into vectors $\{\mathbf{v}_k\}_{k=1}^{|a|}$ as in Equation (6), where $\mathbf{v}_k$ denotes the vector of the k-th time step.

Fine Meaning Decoder
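The final decoder is based on recurrent neural networks with an attention mechanism, and shares the input encoder described in Section 3.1. The decoder's hidden states $\{\mathbf{h}_t\}_{t=1}^{|y|}$ are computed via:

$$\mathbf{i}_t = \begin{cases} \mathbf{v}_k & \text{if } y_{t-1} \text{ is determined by } a_k \\ \mathbf{y}_{t-1} & \text{otherwise} \end{cases} \qquad (11)$$

$$\mathbf{h}_t = f_{\mathrm{LSTM}}(\mathbf{h}_{t-1}, \mathbf{i}_t)$$

where $\mathbf{h}_0 = [\overrightarrow{\mathbf{e}}_{|x|}, \overleftarrow{\mathbf{e}}_1]$, and $\mathbf{y}_{t-1}$ is the embedding of the previously predicted token. Apart from using the embeddings of previous tokens, the decoder is also fed with $\{\mathbf{v}_k\}_{k=1}^{|a|}$. If $y_{t-1}$ is determined by $a_k$ in the sketch (i.e., there is a one-to-one alignment between $y_{t-1}$ and $a_k$), we use the corresponding token's vector $\mathbf{v}_k$ as input to the next time step.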
The sketch constrains the decoding output. If the output token $y_t$ is already in the sketch, we force $y_t$ to conform to the sketch. In some cases, sketch tokens will indicate what information is missing (e.g., in Figure 1, token "flight@1" indicates that an argument is missing for the predicate "flight"). In other cases, sketch tokens will not reveal the number of missing tokens (e.g., "STRING" in DJANGO) but the decoder's output will indicate whether missing details have been generated (e.g., if the decoder emits a closing quote token for "STRING"). Moreover, type information in sketches can be used to constrain generation. In Table 1, sketch token "NUMBER" specifies that a numeric token should be emitted.
For the missing details, we use the hidden vector $\mathbf{h}_t$ to compute $p(y_t \mid y_{<t}, x, a)$, analogously to Equations (7)-(10).

Training and Inference
The model's training objective is to maximize the log likelihood of the generated meaning representations given natural language expressions:

$$\max \sum_{(x, a, y) \in \mathcal{D}} \log p(y|x, a) + \log p(a|x)$$

where $\mathcal{D}$ represents training pairs. At test time, the prediction for input $x$ is obtained via $\hat{a} = \arg\max_{a'} p(a'|x)$ and $\hat{y} = \arg\max_{y'} p(y'|x, \hat{a})$, where $a'$ and $y'$ represent coarse- and fine-grained meaning candidates. Because probabilities $p(a|x)$ and $p(y|x, a)$ are factorized as shown in Equations (2)-(3), we can approximate the best results by using greedy search to generate tokens one by one, rather than iterating over all candidates.
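The two-step greedy inference described above can be sketched as follows. This is an illustrative outline, not the released implementation: `sketch_model` and `fine_model` are hypothetical callables that return a next-token distribution as a dictionary, and the sketch-based output constraints of the fine decoder are deliberately left out for brevity.

```python
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "</s>"  # end-of-sequence token

# Hypothetical interface: a step function maps a prefix to next-token probabilities.
StepFn = Callable[[Sequence[str]], Dict[str, float]]


def greedy_decode(next_token_probs: StepFn, max_len: int = 50) -> List[str]:
    """Pick the most probable token at every step until EOS or max_len."""
    prefix: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        prefix.append(token)
    return prefix


def coarse_to_fine(x: Sequence[str], sketch_model, fine_model,
                   max_len: int = 50) -> Tuple[List[str], List[str]]:
    """Stage 1: decode the sketch from x. Stage 2: decode y from x and the sketch."""
    sketch = greedy_decode(lambda prefix: sketch_model(x, prefix), max_len)
    meaning = greedy_decode(lambda prefix: fine_model(x, sketch, prefix), max_len)
    return sketch, meaning
```

Greedy search commits to a single sketch before the second stage starts, so an error in the sketch propagates to the final output; this is the price paid for not enumerating all candidates.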

Semantic Parsing Tasks
In order to show that our framework applies across domains and meaning representations, we developed models for three tasks, namely parsing natural language to logical form, to Python source code, and to SQL query.For each of these tasks we describe the datasets we used, how sketches were extracted, and specify model details over and above the architecture presented in Section 3.

Natural Language to Logical Form
For our first task we used two benchmark datasets, namely GEO (880 language queries to a database of U.S. geography) and ATIS (5,410 queries to a flight booking system). Examples are shown in Table 1 (see the first and second block). We used standard splits for both datasets: 600 training and 280 test instances for GEO (Zettlemoyer and Collins, 2005); 4,480 training, 480 development, and 450 test examples for ATIS. Meaning representations in these datasets are based on λ-calculus (Kwiatkowski et al., 2011). We use brackets to linearize the hierarchical structure. The first element between a pair of brackets is an operator or predicate name, and any remaining elements are its arguments.
Algorithm 1 shows the pseudocode used to extract sketches from λ-calculus-based meaning representations. We strip off arguments and variable names in logical forms, while keeping predicates, operators, and composition information. We use the symbol "@" to denote the number of missing arguments in a predicate. For example, we extract "from@2" from the expression "(from $0 dallas:ci)" which indicates that the predicate "from" has two arguments. We use "?" as a placeholder in cases where only partial argument information can be omitted. We also omit variable information defined by the lambda operator and quantifiers (e.g., exists, count, and argmax). We use the symbol "#" to denote the number of omitted tokens. For the example in Figure 1, "lambda $0 e" is reduced to "lambda#2".
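The extraction rules described above can be approximated with the short sketch below. It is a simplified reading of the prose and not a reproduction of Algorithm 1: the function names, the `VARIABLE_BINDERS` set, and the rule that every bare argument next to a nested expression becomes "?" are assumptions made for illustration.

```python
from typing import List, Union

Node = Union[str, List["Node"]]

# Assumed set of operators whose variable tokens are dropped and counted with "#".
VARIABLE_BINDERS = {"lambda", "exists", "count", "argmax"}


def parse(text: str) -> Node:
    """Parse a bracketed logical form into nested lists (assumes balanced brackets)."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack: List[List[Node]] = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(tok)
    return stack[0][0]


def to_sketch(node: Node) -> str:
    if isinstance(node, str):
        return "?"  # a bare argument next to nested structure is only partly omitted
    head, args = node[0], node[1:]
    if head in VARIABLE_BINDERS:
        omitted = 0  # leading variable tokens, e.g. "$0 e" after lambda
        while omitted < len(args) and isinstance(args[omitted], str):
            omitted += 1
        rest = [to_sketch(arg) for arg in args[omitted:]]
        return "(" + " ".join([f"{head}#{omitted}"] + rest) + ")"
    if all(isinstance(arg, str) for arg in args):
        return f"{head}@{len(args)}"  # predicate with all arguments stripped
    return "(" + " ".join([head] + [to_sketch(arg) for arg in args]) + ")"


def extract_sketch(logical_form: str) -> str:
    return to_sketch(parse(logical_form))


print(extract_sketch("(from $0 dallas:ci)"))  # from@2
print(extract_sketch("(lambda $0 e (and (flight $0) (from $0 dallas:ci)))"))
# (lambda#2 (and flight@1 from@2))
```

The second input is a hypothetical logical form assembled from the fragments quoted in the text; it shows how composition information (the "and" bracket) survives while arguments and variables are removed.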
The meaning representations of these two datasets are highly compositional, which motivates us to utilize the hierarchical structure of λ-calculus. A similar idea is also explored in the tree decoders proposed in Dong and Lapata (2016) and Yin and Neubig (2017) where parent hidden states are fed to the input gate of the LSTM units. In contrast, in our model parent hidden states serve as input to the softmax classifiers of both fine and coarse meaning decoders.
Parent Feeding Taking the meaning sketch "(and flight@1 from@2)" as an example, the parent of "from@2" is "(and". Let $p_t$ denote the parent of the t-th time step in the decoder. Compared with Equation (10), we use the vector $\mathbf{d}^{att}_t$ and the hidden state of its parent $\mathbf{d}_{p_t}$ to compute the probability $p(a_t \mid a_{<t}, x)$ via:

$$p(a_t \mid a_{<t}, x) = \mathrm{softmax}_{a_t}(\mathbf{W}_o [\mathbf{d}^{att}_t, \mathbf{d}_{p_t}] + \mathbf{b}_o)$$

where $[\cdot, \cdot]$ denotes vector concatenation. The parent feeding is used for both decoding stages.

Natural Language to Source Code
Our second semantic parsing task used DJANGO (Oda et al., 2015), a dataset built upon the Python code of the Django library. The dataset contains lines of code paired with natural language expressions (see the third block in Table 1) and exhibits a variety of use cases, such as iteration, exception handling, and string manipulation. The original split has 16,000 training, 1,000 development, and 1,805 test instances. We used the built-in lexical scanner of Python (footnote 1) to tokenize the code and obtain token types. Sketches were extracted by substituting the original tokens with their token types, except delimiters (e.g., "[", and ":"), operators (e.g., "+", and "*"), and built-in keywords (e.g., "True", and "while"). For instance, the expression "if s[:4].lower() == 'http':" becomes "if NAME [ : NUMBER ] . NAME ( ) == STRING :", with details about names, values, and strings being omitted.
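The token-type substitution can be reproduced approximately with Python's standard `tokenize` and `keyword` modules. The sketch below is an illustration rather than the authors' preprocessing script: `code_sketch` is a hypothetical helper, it runs on Python 3 (whose keyword list may differ from the interpreter used for the original data), and it silently drops layout tokens such as newlines and indentation.

```python
import io
import keyword
import tokenize


def code_sketch(line: str) -> str:
    """Replace names, numbers and strings by their token types; keep the rest."""
    pieces = []
    reader = io.StringIO(line + "\n").readline
    for tok in tokenize.generate_tokens(reader):
        if tok.type == tokenize.NAME:
            # Keywords such as "if", "while" and "True" are kept verbatim.
            pieces.append(tok.string if keyword.iskeyword(tok.string) else "NAME")
        elif tok.type == tokenize.NUMBER:
            pieces.append("NUMBER")
        elif tok.type == tokenize.STRING:
            pieces.append("STRING")
        elif tok.type == tokenize.OP:
            # Delimiters and operators are kept verbatim.
            pieces.append(tok.string)
        # NEWLINE, INDENT, DEDENT, COMMENT and ENDMARKER tokens are ignored.
    return " ".join(pieces)


print(code_sketch("if s[:4].lower() == 'http':"))
# if NAME [ : NUMBER ] . NAME ( ) == STRING :
```

Because the scanner works purely at the lexical level, a single line such as an `if` header without a body can be tokenized even though it is not a complete statement.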
DJANGO is a diverse dataset, spanning various real-world use cases and as a result models are often faced with out-of-vocabulary (OOV) tokens (e.g., variable names, and numbers) that are unseen during training. We handle OOV tokens with a copying mechanism (Gu et al., 2016; Gulcehre et al., 2016; Jia and Liang, 2016), which allows the fine meaning decoder (Section 3.2) to directly copy tokens from the natural language input.
Copying Mechanism Recall that we use a softmax classifier to predict the probability distribution $p(y_t \mid y_{<t}, x, a)$ over the pre-defined vocabulary. We also learn a copying gate $g_t \in [0, 1]$ to decide whether $y_t$ should be copied from the input or generated from the vocabulary. We compute the modified output distribution via:

$$g_t = \mathrm{sigmoid}(\mathbf{w}_g \cdot \mathbf{h}_t + b_g)$$

$$\tilde{p}(y_t \mid y_{<t}, x, a) = (1 - g_t)\, p(y_t \mid y_{<t}, x, a) + \mathbb{1}_{[y_t \notin V_y]}\, g_t \sum_{k: x_k = y_t} s_{t,k}$$

where $\mathbf{w}_g \in \mathbb{R}^n$ and $b_g \in \mathbb{R}$ are parameters, and the indicator function $\mathbb{1}_{[y_t \notin V_y]}$ is 1 only if $y_t$ is not in the target vocabulary $V_y$; the attention score $s_{t,k}$ (see Equation (7)) measures how likely it is to copy $y_t$ from the input word $x_k$.

Footnote 1: https://docs.python.org/3/library/tokenize
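The gated mixture of generating and copying can be sketched for a single candidate token as below. The function name and the dictionary-based vocabulary distribution are hypothetical, and the gate value is passed in directly instead of being computed from the decoder state; the sketch assumes the form described in the text, where copying only contributes for tokens outside the target vocabulary.

```python
from typing import Dict, Sequence


def copy_adjusted_prob(token: str,
                       vocab_probs: Dict[str, float],
                       gate: float,
                       input_words: Sequence[str],
                       attention: Sequence[float]) -> float:
    """Score of `token` after mixing vocabulary generation with input copying."""
    # Generation term: (1 - g_t) * p(y_t | ...); zero for out-of-vocabulary tokens.
    generate = (1.0 - gate) * vocab_probs.get(token, 0.0)
    if token in vocab_probs:
        return generate  # the indicator is 0 for in-vocabulary tokens
    # Copy term: gate times the attention mass on input positions holding this word.
    copy_mass = sum(s for word, s in zip(input_words, attention) if word == token)
    return generate + gate * copy_mass


# Toy example with made-up numbers: "my_var" is not in the target vocabulary.
words = ["check", "if", "my_var", "is", "none"]
attn = [0.1, 0.1, 0.6, 0.1, 0.1]
vocab = {"if": 0.6, "NAME": 0.4}
p_copy = copy_adjusted_prob("my_var", vocab, gate=0.5, input_words=words, attention=attn)
p_gen = copy_adjusted_prob("if", vocab, gate=0.5, input_words=words, attention=attn)
```

An unseen variable name can thus receive a non-zero score only through the attention weights on the matching input word.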

Natural Language to SQL
The WIKISQL (Zhong et al., 2017) dataset contains 80,654 examples of questions and SQL queries distributed across 24,241 tables from Wikipedia. The goal is to generate the correct SQL query for a natural language question and table schema (i.e., table column names), without using the content values of tables (see the last block in Table 1 for an example). The dataset is partitioned into a training set (70%), a development set (10%), and a test set (20%). Each table is present in one split to ensure generalization to unseen tables.
WIKISQL queries follow the format "SELECT agg_op agg_col WHERE (cond_col cond_op cond) AND ...", which is a subset of the SQL syntax. SELECT identifies the column that is to be included in the results after applying the aggregation operator agg_op to column agg_col. WHERE can have zero or multiple conditions, which means that column cond_col must satisfy the constraints expressed by the operator cond_op and the condition value cond. Sketches for SQL queries are simply the (sorted) sequences of condition operators cond_op in WHERE clauses. For example, in Table 1, sketch "WHERE > AND =" has two condition operators, namely ">" and "=".
The generation of SQL queries differs from our previous semantic parsing tasks, in that the table schema serves as input in addition to natural language. We therefore modify our input encoder in order to render it table-aware, so to speak. Furthermore, due to the formulaic nature of the SQL query, we only use our decoder to generate the WHERE clause (with the help of sketches).
The SELECT clause has a fixed number of slots (i.e., aggregation operator agg_op and column agg_col), which we straightforwardly predict with softmax classifiers (conditioned on the input). We briefly explain how these components are modeled below.
As shown in Figure 2, we use bi-directional LSTMs to encode the whole sequence. Next, for column $c_k$, the LSTM hidden states at positions $c_{k,1}$ and $c_{k,|c_k|}$ are concatenated. Finally, the concatenated vectors are used as the encoding vectors $\{\mathbf{c}_k\}_{k=1}^{M}$ for table columns.
As mentioned earlier, the meaning representations of questions are dependent on the tables. As shown in Figure 2, we encode the input question $x$ into $\{\mathbf{e}_t\}_{t=1}^{|x|}$ using LSTM units. At each time step t, we use an attention mechanism towards table column vectors $\{\mathbf{c}_k\}_{k=1}^{M}$ to obtain the most relevant columns for $\mathbf{e}_t$. The attention score from $\mathbf{e}_t$ to $\mathbf{c}_k$ is computed via $u_{t,k} \propto \exp\{\alpha(\mathbf{e}_t) \cdot \alpha(\mathbf{c}_k)\}$, where $\alpha(\cdot)$ is a one-layer neural network and the scores are normalized over the $M$ table columns. The resulting table-aware representation $\tilde{\mathbf{e}}$ of the input (Equation (12)) is obtained analogously to Equations (4)-(6).
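SELECT Clause We feed the question vector $\tilde{\mathbf{e}}$ into a softmax classifier to obtain the aggregation operator agg_op. If agg_col is the k-th table column, its probability is computed with the scoring network $\sigma(\cdot)$ of Equation (13) via:

$$p(\text{agg\_col} = k \mid x) \propto \exp\{\sigma([\tilde{\mathbf{e}}, \mathbf{c}_k])\} \qquad (14)$$

WHERE Clause We first generate sketches whose details are subsequently decorated by the fine meaning decoder described in Section 3.2. As the number of sketches in the training set is small (35 in total), we model sketch generation as a classification problem. We treat each sketch $a$ as a category, and use a softmax classifier to compute $p(a|x)$:

$$p(a|x) = \mathrm{softmax}_a(\mathbf{W}_a \tilde{\mathbf{e}} + \mathbf{b}_a)$$

where $\mathbf{W}_a \in \mathbb{R}^{|V_a| \times n}$, $\mathbf{b}_a \in \mathbb{R}^{|V_a|}$ are parameters, and $\tilde{\mathbf{e}}$ is the table-aware input representation defined in Equation (12).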
Once the sketch is predicted, we know the condition operators and number of conditions in the WHERE clause which follows the format "WHERE (cond_col cond_op cond) AND ...". As shown in Figure 3, our generation task now amounts to populating the sketch with condition columns cond_col and their values cond.
Let $\{\mathbf{h}_t\}_{t=1}^{|y|}$ denote the LSTM hidden states of the fine meaning decoder, and $\{\mathbf{h}^{att}_t\}_{t=1}^{|y|}$ the vectors obtained by the attention mechanism as in Equation (9). The condition column $\text{cond\_col}_{y_t}$ is selected from the table's headers. For the k-th column in the table, we compute $p(\text{cond\_col}_{y_t} = k \mid y_{<t}, x, a)$ as in Equation (14), but use different parameters and compute the score via $\sigma([\mathbf{h}^{att}_t, \mathbf{c}_k])$. If the k-th table column is selected, we use $\mathbf{c}_k$ for the input of the next LSTM unit in the decoder.
Condition values are typically mentioned in the input questions. These values are often phrases with multiple tokens (e.g., Mikhail Snitko in Table 1). We therefore propose to select a text span from input $x$ for each condition value $\text{cond}_{y_t}$ rather than copying tokens one by one. Let $x_l \cdots x_r$ denote the text span from which $\text{cond}_{y_t}$ is copied. We factorize its probability as:

$$p(\text{cond}_{y_t} = x_l \cdots x_r \mid y_{<t}, x, a) = p(l^L_{y_t} \mid y_{<t}, x, a)\; p(r^R_{y_t} \mid y_{<t}, x, a, l^L_{y_t})$$

where $l^L_{y_t}$ / $r^R_{y_t}$ denotes that the first/last copying index of $\text{cond}_{y_t}$ is $l$ / $r$, the probabilities are normalized to 1, and $\sigma(\cdot)$ is the scoring network defined in Equation (13). Notice that we use different parameters for the scoring networks $\sigma(\cdot)$. The copied span is represented by the concatenated vector $[\tilde{\mathbf{e}}_l, \tilde{\mathbf{e}}_r]$, which is fed into a one-layer neural network and then used as the input to the next LSTM unit in the decoder.

Experiments
We present results on the three semantic parsing tasks discussed in Section 4. Our implementation and pretrained models are available at https://github.com/donglixp/coarse2fine.

Experimental Setup
Preprocessing For GEO and ATIS, we used the preprocessed versions provided by Dong and Lapata (2016), where natural language expressions are lowercased and stemmed with NLTK (Bird et al., 2009), and entity mentions are replaced by numbered markers. We combined predicates and left brackets that indicate hierarchical structures to make meaning representations compact. We employed the preprocessed DJANGO data provided by Yin and Neubig (2017), where input expressions are tokenized by NLTK, and quoted strings in the input are replaced with placeholders. WIKISQL was preprocessed by the script provided by Zhong et al. (2017), where inputs were lowercased and tokenized by Stanford CoreNLP (Manning et al., 2014).

Results and Analysis
We compare our model (COARSE2FINE) against several previously published systems as well as various baselines. Specifically, we report results with a model which decodes meaning representations in one stage (ONESTAGE) without leveraging sketches. We also report the results of several ablation models, i.e., without a sketch encoder and without a table-aware input encoder.
Again we observe that the sketch encoder is beneficial and that there is an 8.9 point difference in accuracy between COARSE2FINE and the oracle.
Results on WIKISQL are shown in Table 4. Our model is superior to ONESTAGE as well as to previous best performing systems. COARSE2FINE's accuracies on aggregation agg_op and agg_col are 90.2% and 92.0%, respectively, which is comparable to SQLNET (Xu et al., 2017). So the most gain is obtained by the improved decoder of the WHERE clause. We also find that a table-aware input encoder is critical for doing well on this task, since the same question might lead to different SQL queries depending on the table schemas. Consider the question "how many presidents are graduated from A". The SQL query over table "President College" is "SELECT COUNT(President) WHERE (College = A)", but the query over table "College Number of Presidents" would be "SELECT Number of Presidents WHERE (College = A)".
We also examine the predicted sketches themselves in Table 5. We compare sketches generated by COARSE2FINE against ONESTAGE. The latter model generates meaning representations without an intermediate sketch generation stage. Nevertheless, we can extract sketches from the output of ONESTAGE following the procedures described in Section 4. Sketches produced by COARSE2FINE are more accurate across the board. This is not surprising because our model is trained explicitly to generate compact meaning sketches. Taken together (Tables 2-4), our results show that better sketches bring accuracy gains on GEO, ATIS, and DJANGO. On WIKISQL, the sketches predicted by COARSE2FINE are marginally better compared with ONESTAGE. Performance improvements on this task are mainly due to the fine meaning decoder. We conjecture that by decomposing decoding into two stages, COARSE2FINE can better match table columns and extract condition values without interference from the prediction of condition operators. Moreover, the sketch provides a canonical order of condition operators, which is beneficial for the decoding process (Vinyals et al., 2016; Xu et al., 2017).

Conclusions
In this paper we presented a coarse-to-fine decoding framework for neural semantic parsing. We first generate meaning sketches which abstract away from low-level information such as arguments and variable names and then predict missing details in order to obtain full meaning representations. The proposed framework can be easily adapted to different domains and meaning representations. Experimental results show that coarse-to-fine decoding improves performance across tasks. In the future, we would like to apply the framework in a weakly supervised setting, i.e., to learn semantic parsers from question-answer pairs and to explore alternative ways of defining meaning sketches.
(x,a,y)∈D log p (y|x, a) + log p (a|x) where D represents training pairs.

$\sum_{k=1}^{M} u_{t,k} = 1$. Then we compute the context vector $\mathbf{c}^e_t = \sum_{k=1}^{M} u_{t,k}\, \mathbf{c}_k$ to summarize the relevant columns for $\mathbf{e}_t$. We feed the concatenated vectors $\{[\mathbf{e}_t, \mathbf{c}^e_t]\}_{t=1}^{|x|}$ into a bi-directional LSTM encoder, and use the new encoding vectors $\{\tilde{\mathbf{e}}_t\}_{t=1}^{|x|}$ to replace $\{\mathbf{e}_t\}_{t=1}^{|x|}$ in other model components. We define the vector representation of input $x$ as $\tilde{\mathbf{e}}$ (Equation (12)).

Figure 3: Fine meaning decoder of the WHERE clause used for WIKISQL.

Table schema: Pianist | Conductor | Record Company | Year of Recording | Format
x: What record company did conductor Mikhail Snitko record for after 1996?
y: SELECT Record Company WHERE (Year of Recording > 1996) AND (Conductor = Mikhail Snitko)
a: WHERE > AND =

Table 1: Examples of natural language expressions x, their meaning representations y, and meaning sketches a. The average number of tokens is shown in the second column.

Table 2: Accuracies on GEO and ATIS.

and α(·) in Equation (13) was set to 64. Word embeddings were initialized by GloVe (Pennington et al., 2014), and were shared by table encoder and input encoder in Section 4.3. We appended 10-dimensional part-of-speech tag vectors to embeddings of the question words in WIKISQL. The part-of-speech tags were obtained by the spaCy toolkit. We used the RMSProp optimizer (Tieleman and Hinton, 2012) to train the models. The learning rate was selected from {0.002, 0.005}. The batch size was 200 for WIKISQL, and was 64 for other datasets. Early stopping was used to determine the number of epochs.
Table 2 presents our results on GEO and ATIS.Overall, we observe that COARSE2FINE outperforms ONESTAGE, which suggests that disentangling high-level from low-level information dur-

Table 5: Sketch accuracy. For ONESTAGE, sketches are extracted from the meaning representations it generates.
COUNT(President) WHERE (College = A)", but the query over table " College Number of Presidents " would be "SELECT Number of Presidents WHERE (College = A)".