Language to Logical Form with Neural Attention

Semantic parsing aims at mapping natural language to machine interpretable meaning representations. Traditional approaches rely on high-quality lexicons, manually-built templates, and linguistic features which are either domain- or representation-specific. In this paper we present a general method based on an attention-enhanced encoder-decoder model. We encode input utterances into vector representations, and generate their logical forms by conditioning the output sequences or trees on the encoding vectors. Experimental results on four datasets show that our approach performs competitively without using hand-engineered features and is easy to adapt across domains and meaning representations.


Introduction
Semantic parsing is the task of translating text to a formal meaning representation such as logical forms or structured queries. There has recently been a surge of interest in developing machine learning methods for semantic parsing (see the references in Section 2), due in part to the existence of corpora containing utterances annotated with formal meaning representations. Figure 1 shows an example of a question (left-hand side) and its annotated logical form (right-hand side), taken from JOBS (Tang and Mooney, 2001), a well-known semantic parsing benchmark. In order to predict the correct logical form for a given utterance, most previous systems rely on predefined templates and manually designed features, which often render the parsing model domain- or representation-specific. In this work, we aim to use a simple yet effective method to bridge the gap between natural language and logical form with minimal domain knowledge.

Figure 1: Input utterances and their logical forms are encoded and decoded with neural networks. An attention layer is used to learn soft alignments.
Encoder-decoder architectures based on recurrent neural networks have been successfully applied to a variety of NLP tasks, ranging from syntactic parsing (Vinyals et al., 2015a) and machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) to image description generation (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015b). As shown in Figure 1, we adapt the general encoder-decoder paradigm to the semantic parsing task. Our model learns from natural language descriptions paired with meaning representations; it encodes sentences and decodes logical forms using recurrent neural networks with long short-term memory (LSTM) units. We present two model variants: the first treats semantic parsing as a vanilla sequence transduction task, whereas our second model is equipped with a hierarchical tree decoder which explicitly captures the compositional structure of logical forms. We also introduce an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015b) allowing the model to learn soft alignments between natural language and logical forms, and present an argument identification step to handle rare mentions of entities and numbers.
Evaluation results demonstrate that, compared to previous methods, our model achieves similar or better performance across datasets and meaning representations, despite using no hand-engineered domain- or representation-specific features.

Related Work
Our work synthesizes two strands of research, namely semantic parsing and the encoder-decoder architecture with neural networks.
Our model learns from natural language descriptions paired with meaning representations. Most previous systems rely on high-quality lexicons, manually-built templates, and features which are either domain- or representation-specific. We instead present a general method that can be easily adapted to different domains and meaning representations. We adopt the general encoder-decoder framework based on neural networks which has recently been repurposed for various NLP tasks such as syntactic parsing (Vinyals et al., 2015a), machine translation (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), image description generation (Karpathy and Fei-Fei, 2015; Vinyals et al., 2015b), question answering (Hermann et al., 2015), and summarization (Rush et al., 2015). Mei et al. (2016) use a sequence-to-sequence model to map navigational instructions to actions.
Our model works on better-defined meaning representations (such as Prolog and lambda calculus) and is conceptually simpler; it does not employ bidirectionality or multi-level alignments. Grefenstette et al. (2014) propose a different architecture for semantic parsing based on the combination of two neural network models. The first model learns shared representations from pairs of questions and their translations into knowledge base queries, whereas the second model generates the queries conditioned on the learned representations. However, they do not report empirical evaluation results.

Problem Formulation
Our aim is to learn a model which maps natural language input q = x_1 ⋯ x_|q| to a logical form representation of its meaning a = y_1 ⋯ y_|a|. The conditional probability p(a|q) is decomposed as:

$$p(a|q) = \prod_{t=1}^{|a|} p\left(y_t \mid y_{<t}, q\right) \qquad (1)$$

where y_{<t} = y_1 ⋯ y_{t-1}. Our method consists of an encoder which encodes natural language input q into a vector representation and a decoder which learns to generate y_1, ⋯, y_|a| conditioned on the encoding vector. In the following we describe two models varying in the way in which p(a|q) is computed.
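The decomposition in Equation (1) means that the log-probability of a whole logical form is the sum of per-token log-probabilities. A minimal sketch of this scoring, assuming a hypothetical decoder_step helper that returns a distribution over output tokens given the encoded input and the generated prefix:

```python
import math

def sequence_log_prob(decoder_step, enc_state, target_tokens):
    """log p(a|q) = sum_t log p(y_t | y_<t, q), following Equation (1).

    decoder_step(enc_state, prefix) is an assumed helper returning a
    dict mapping each candidate output token to its probability.
    """
    total = 0.0
    prefix = []
    for y in target_tokens:
        dist = decoder_step(enc_state, prefix)  # p(. | y_<t, q)
        total += math.log(dist[y])
        prefix.append(y)
    return total
```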

Sequence-to-Sequence Model
This model regards both input q and output a as sequences. As shown in Figure 2, the encoder and decoder are two different L-layer recurrent neural networks with long short-term memory (LSTM) units which recursively process tokens one by one. The first |q| time steps belong to the encoder, while the following |a| time steps belong to the decoder. Let h^l_t ∈ R^n denote the hidden vector at time step t and layer l. h^l_t is then computed by:

$$h^l_t = \mathrm{LSTM}\left(h^l_{t-1}, h^{l-1}_t\right) \qquad (2)$$

where LSTM refers to the LSTM function being used. In our experiments we follow the architecture described in Zaremba et al. (2015); however, other types of gated activation functions are possible (e.g., Cho et al. (2014)). For the encoder, h^0_t = W_q e(x_t) is the word vector of the current input token, with W_q ∈ R^{n×|V_q|} being a parameter matrix and e(·) the one-hot index vector of the corresponding token. For the decoder, h^0_t = W_a e(y_{t-1}) is the word vector of the previously predicted word, where W_a ∈ R^{n×|V_a|}. Notice that the encoder and decoder have different LSTM parameters.
Once the tokens of the input sequence x_1, ⋯, x_|q| are encoded into vectors, they are used to initialize the hidden states of the first time step in the decoder. Next, the hidden vector of the topmost LSTM h^L_t in the decoder is used to predict the t-th output token as:

$$p\left(y_t \mid y_{<t}, q\right) = \mathrm{softmax}\left(W_o h^L_t\right)^{\top} e(y_t) \qquad (3)$$

where W_o ∈ R^{|V_a|×n} is a parameter matrix, and e(y_t) ∈ {0, 1}^{|V_a|} a one-hot vector for computing y_t's probability from the predicted distribution.
We augment every sequence with a "start-of-sequence" <s> and "end-of-sequence" </s> token. The generation process terminates once </s> is predicted. The conditional probability of generating the whole sequence p(a|q) is then obtained using Equation (1).
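To make the architecture concrete, the sketch below shows a sequence-to-sequence parser of this kind in PyTorch. It is an illustrative reimplementation rather than the released code, and the layer sizes, batching, and teacher-forcing details are assumptions:

```python
import torch
import torch.nn as nn

class Seq2SeqParser(nn.Module):
    """Illustrative L-layer LSTM encoder-decoder without attention."""

    def __init__(self, src_vocab, tgt_vocab, dim=200, layers=1):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)   # W_q
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)   # W_a
        self.encoder = nn.LSTM(dim, dim, layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, layers, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)          # W_o

    def forward(self, src_ids, tgt_ids):
        # Encode the input utterance; the final encoder states
        # initialise the hidden states of the decoder.
        _, state = self.encoder(self.src_emb(src_ids))
        # Teacher forcing: condition each step on the previous gold token.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids[:, :-1]), state)
        return self.out(dec_out)  # per-step scores over the output vocabulary
```

During training the scores at position t are compared against the gold token y_{t+1}; at test time the gold prefix is replaced by the model's own greedy predictions.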

Sequence-to-Tree Model
The SEQ2SEQ model has a potential drawback in that it ignores the hierarchical structure of logical forms. As a result, it needs to memorize various pieces of auxiliary information (e.g., bracket pairs) to generate well-formed output. In the following we present a hierarchical tree decoder which is more faithful to the compositional nature of meaning representations. A schematic description of the model is shown in Figure 3.
The present model shares the same encoder with the sequence-to-sequence model described in Section 3.1 (essentially it learns to encode input q as vectors). However, its decoder is fundamentally different as it generates logical forms in a top-down manner. In order to represent tree structure, we define a "nonterminal" <n> token which indicates subtrees. As shown in Figure 3, we preprocess the logical form "lambda $0 e (and (> (departure_time $0) 1600:ti) (from $0 dallas:ci))" to a tree by replacing tokens between pairs of brackets with nonterminals. Special tokens <s> and <(> denote the beginning of a sequence and a nonterminal sequence, respectively (omitted from Figure 3 due to lack of space). Token </s> represents the end of a sequence.
After encoding input q, the hierarchical tree decoder uses recurrent neural networks to generate tokens at depth 1 of the subtree corresponding to parts of logical form a. If the predicted token is <n>, we decode the sequence of the subtree by conditioning on the nonterminal's hidden vector. This process terminates when no more nonterminals are emitted. In other words, a sequence decoder is used to hierarchically generate the tree structure.
In contrast to the sequence decoder described in Section 3.1, the current hidden state does not depend only on the hidden state of the previous time step. In order to better utilize the parent nonterminal's information, we introduce a parent-feeding connection where the hidden vector of the parent nonterminal is concatenated with the input and fed into the LSTM.
As an example, Figure 4 shows the decoding tree corresponding to the logical form "A B (C)", where y_1 ⋯ y_6 are predicted tokens, and t_1 ⋯ t_6 denote different time steps. Span "(C)" corresponds to a subtree. Decoding in this example has two steps: once input q has been encoded, we first generate y_1 ⋯ y_4 at depth 1 until </s> is predicted; next, we generate y_5, y_6 by conditioning on nonterminal t_3's hidden vectors. The probability p(a|q) is the product of these two sequence decoding steps:

$$p(a|q) = p\left(y_1 y_2 y_3 y_4 \mid q\right) \, p\left(y_5 y_6 \mid y_{\le 3}, q\right) \qquad (4)$$

where Equation (3) is used for the prediction of each output token.
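One way to realise the parent-feeding connection is to widen the decoder input so that the parent nonterminal's hidden vector rides along with the token embedding. The PyTorch cell below is an illustrative sketch of this idea; dimensions and naming are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class ParentFeedingCell(nn.Module):
    """One tree-decoder step: the parent nonterminal's hidden vector is
    concatenated with the current input embedding before the LSTM cell."""

    def __init__(self, emb_dim=200, hid_dim=200):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + hid_dim, hid_dim)

    def forward(self, prev_token_emb, parent_hidden, state):
        x = torch.cat([prev_token_emb, parent_hidden], dim=-1)
        h, c = self.cell(x, state)
        return h, c
```

When a subtree rooted at an <n> token is expanded, parent_hidden is the hidden vector produced at the time step that emitted that <n>.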

Attention Mechanism
As shown in Equation (3), the hidden vectors of the input sequence are not directly used in the decoding process. However, it makes intuitive sense to consider relevant information from the input to better predict the current token. Following this idea, various techniques have been proposed to integrate encoder-side information (in the form of a context vector) at each time step of the decoder (Bahdanau et al., 2015; Luong et al., 2015b; Xu et al., 2015).
As shown in Figure 5, in order to find relevant encoder-side context for the current hidden state h^L_t of the decoder, we compute its attention score with the k-th hidden state in the encoder as:

$$s^t_k = \frac{\exp\{h^L_t \cdot h^L_k\}}{\sum_{j=1}^{|q|} \exp\{h^L_t \cdot h^L_j\}} \qquad (5)$$

where h^L_1, ⋯, h^L_|q| are the top-layer hidden vectors of the encoder. Then, the context vector is the weighted sum of the hidden vectors in the encoder:

$$c^t = \sum_{k=1}^{|q|} s^t_k \, h^L_k \qquad (6)$$

In lieu of Equation (3), we further use this context vector, which acts as a summary of the encoder, to compute the probability of generating y_t as:

$$h^{att}_t = \tanh\left(W_1 h^L_t + W_2 c^t\right) \qquad (7)$$

$$p\left(y_t \mid y_{<t}, q\right) = \mathrm{softmax}\left(W_o h^{att}_t\right)^{\top} e(y_t) \qquad (8)$$

where W_o ∈ R^{|V_a|×n} and W_1, W_2 ∈ R^{n×n} are three parameter matrices, and e(y_t) is a one-hot vector used to obtain y_t's probability.

Figure 5: Attention scores are computed by the current hidden vector and all the hidden vectors of the encoder. Then, the encoder-side context vector c^t is obtained in the form of a weighted sum, which is further used to predict y_t.
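The following sketch spells out Equations (5)-(8) for a single decoding step; the shapes and the explicit parameter tensors passed as arguments are illustrative assumptions:

```python
import torch

def attend(dec_hidden, enc_hiddens):
    """Dot-product attention, Equations (5)-(6).
    dec_hidden:  (n,)      top-layer decoder state h^L_t
    enc_hiddens: (|q|, n)  top-layer encoder states h^L_1 .. h^L_|q|
    """
    scores = torch.softmax(enc_hiddens @ dec_hidden, dim=0)  # s^t_k
    context = scores @ enc_hiddens                            # c^t
    return context, scores

def predict_token(dec_hidden, context, W1, W2, Wo):
    """Attention-enhanced output distribution, Equations (7)-(8)."""
    h_att = torch.tanh(W1 @ dec_hidden + W2 @ context)        # h^att_t
    return torch.softmax(Wo @ h_att, dim=0)                   # p(y_t | y_<t, q)
```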

Model Training
Our goal is to maximize the likelihood of the generated logical forms given natural language utterances as input. So the objective function is:

$$\text{minimize} \;\; -\sum_{(q,a) \in D} \log p(a|q) \qquad (9)$$

where D is the set of all natural language-logical form training pairs, and p(a|q) is computed as shown in Equation (1). The RMSProp algorithm (Tieleman and Hinton, 2012) is employed to solve this non-convex optimization problem. Moreover, dropout is used for regularizing the model (Zaremba et al., 2015). Specifically, dropout operators are used between different LSTM layers and for the hidden layers before the softmax classifiers. This technique can substantially reduce overfitting, especially on datasets of small size.
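A compact sketch of this training objective for the Seq2SeqParser sketched earlier, using RMSProp and gradient clipping; the learning rate and epoch count are placeholders, not values from the paper:

```python
import torch
import torch.nn as nn

def train(model, pairs, epochs=30, lr=0.01):
    """Minimise the negative log-likelihood of Equation (9)."""
    opt = torch.optim.RMSprop(model.parameters(), lr=lr, alpha=0.95)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for src_ids, tgt_ids in pairs:          # mini-batches of index tensors
            logits = model(src_ids, tgt_ids)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tgt_ids[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            # Clip gradients to alleviate the exploding gradient problem.
            torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
            opt.step()
```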

Inference
At test time, we predict the logical form for an input utterance q by:

$$\hat{a} = \underset{a'}{\arg\max} \; p\left(a' \mid q\right) \qquad (10)$$

where a′ represents a candidate output. However, it is impractical to iterate over all possible results to obtain the optimal prediction. According to Equation (1), we decompose the probability p(a|q) so that we can use greedy search (or beam search) to generate tokens one by one.
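A greedy decoder of this kind keeps only the single most probable token at each step; a beam search variant would keep the k best prefixes instead. The sketch reuses the hypothetical decoder_step helper introduced earlier:

```python
def greedy_decode(decoder_step, enc_state, max_len=100):
    """Greedy inference for Equation (10): extend the prefix with the
    arg-max token until </s> is emitted or max_len is reached."""
    prefix = ["<s>"]
    while len(prefix) < max_len:
        dist = decoder_step(enc_state, prefix)   # dict: token -> probability
        y = max(dist, key=dist.get)
        if y == "</s>":
            break
        prefix.append(y)
    return prefix[1:]                            # drop the <s> marker
```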
Algorithm 1 describes the decoding process for SEQ2TREE. The time complexity of both decoders is O(|a|), where |a| is the length of the output. The extra computation of SEQ2TREE compared with SEQ2SEQ is maintaining the nonterminal queue, which can be ignored because most of the time is spent on matrix operations. We implement the hierarchical tree decoder in a batch mode, so that it can fully utilize GPUs. Specifically, as shown in Algorithm 1, at each step we pop multiple nonterminals from the queue and decode these nonterminals in one batch.
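The sketch below shows the queue-driven control flow for a single example (the actual algorithm pops and decodes whole batches of nonterminals at once). Here encode and decode_sequence are assumed helpers; the latter greedily emits tokens up to (but not including) </s> and also returns the hidden vector at each emitted position:

```python
from collections import deque

def decode_tree(encode, decode_sequence, utterance):
    """SEQ2TREE greedy inference with a nonterminal queue (illustrative).
    Each <n> token opens a child subtree that is later decoded by
    conditioning on the hidden vector produced when <n> was emitted."""
    root = []                               # nested lists represent the tree
    queue = deque([(root, encode(utterance))])
    while queue:                            # terminates when no <n> is left
        node, condition = queue.popleft()
        tokens, hiddens = decode_sequence(condition)
        for tok, hid in zip(tokens, hiddens):
            if tok == "<n>":
                child = []
                node.append(child)
                queue.append((child, hid))
            else:
                node.append(tok)
    return root
```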

Argument Identification
The majority of semantic parsing datasets have been developed with question-answering in mind. In the typical application setting, natural language questions are mapped into logical forms and executed on a knowledge base to obtain an answer. Due to the nature of the question-answering task, many natural language utterances contain entities or numbers that are often parsed as arguments in the logical form. Some of them are unavoidably rare or do not appear in the training set at all (this is especially true for small-scale datasets). Conventional sequence encoders simply replace rare words with a special unknown word symbol (Luong et al., 2015a; Jean et al., 2015), which would be detrimental for semantic parsing.
We have developed a simple procedure for argument identification. Specifically, we identify entities and numbers in input questions and replace them with their type names and unique IDs. For instance, we pre-process the training example "jobs with a salary of 40000" and its logical form "job(ANS), salary_greater_than(ANS, 40000, year)" as "jobs with a salary of num_0" and "job(ANS), salary_greater_than(ANS, num_0, year)". We use the pre-processed examples as training data. At inference time, we also mask entities and numbers with their types and IDs. Once we obtain the decoding result, a post-processing step recovers all the markers type_i to their corresponding logical constants.
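A simplified sketch of this masking and post-processing step, handling only numbers (entity masking would work analogously; the marker names follow the num_0 convention above):

```python
import re

def mask_numbers(question, logical_form):
    """Replace each number with a typed marker (num_0, num_1, ...) in both
    the question and its logical form, keeping a map for post-processing."""
    mapping = {}
    def repl(match):
        marker = "num_{}".format(len(mapping))
        mapping[marker] = match.group(0)
        return marker
    masked_q = re.sub(r"\d+", repl, question)
    masked_lf = logical_form
    for marker, value in mapping.items():
        masked_lf = masked_lf.replace(value, marker)
    return masked_q, masked_lf, mapping

def unmask(prediction, mapping):
    """Recover the original constants after decoding."""
    for marker, value in mapping.items():
        prediction = prediction.replace(marker, value)
    return prediction
```

For the example above, mask_numbers("jobs with a salary of 40000", "job(ANS), salary_greater_than(ANS, 40000, year)") yields the masked question-form pair together with {"num_0": "40000"}, which unmask substitutes back into the predicted logical form.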

Experiments
We compare our method against multiple previous systems on four datasets. We describe these datasets below, and present our experimental settings and results. Finally, we conduct model analysis in order to understand what the model learns. The code is available at https://github.com/donglixp/lang2logic.

Datasets
Our model was trained on the following datasets, covering different domains and using different meaning representations. Examples for each domain are shown in Table 1.
JOBS This benchmark dataset contains 640 queries to a database of job listings. Specifically, questions are paired with Prolog-style queries. We used the same training-test split as Zettlemoyer and Collins (2005).

Table 1: Examples of natural language descriptions and their meaning representations from four datasets. The average length of input and output sequences is shown in the second column. One IFTTT example:
"Turn on heater when temperature drops below 58 degree"
TRIGGER: Weather - Current temperature drops below - ((Temperature (58)) (Degrees in (f)))
ACTION: WeMo Insight Switch - Turn on - ((Which switch? ("")))

IFTTT This dataset (Quirk et al., 2015) contains recipes from the IFTTT website. Recipes are simple programs with exactly one trigger and one action which users specify on the site. Whenever the conditions of the trigger are satisfied, the action is performed. Actions typically revolve around home security (e.g., "turn on my lights when I arrive home"), automation (e.g., "text me if the door opens"), well-being (e.g., "remind me to drink water if I've been at a bar for more than two hours"), and so on. Triggers and actions are selected from different channels (160 in total) representing various types of services, devices (e.g., Android), and knowledge sources (such as ESPN or Gmail). In the dataset, there are 552 trigger functions from 128 channels, and 229 action functions from 99 channels. We used Quirk et al.'s (2015) original split which contains 77,495 training, 5,171 development, and 4,294 test examples. The IFTTT programs are represented as abstract syntax trees and are paired with natural language descriptions provided by users (see Table 1). Here, numbers and URLs are identified.

Settings
Natural language sentences were lowercased; misspellings were corrected using a dictionary based on the Wikipedia list of common misspellings. Words were stemmed using NLTK (Bird et al., 2009). For IFTTT, we filtered tokens, channels and functions which appeared less than five times in the training set. For the other datasets, we filtered input words which did not occur at least two times in the training set, but kept all tokens in the logical forms. Plain string matching was employed to identify arguments as described in Section 3.6. More sophisticated approaches could be used; however, we leave this to future work.
Model hyper-parameters were cross-validated on the training set for JOBS and GEO. We used the standard development sets for ATIS and IFTTT. We used the RMSProp algorithm (with batch size set to 20) to update the parameters. The smoothing constant of RMSProp was 0.95. Gradients were clipped at 5 to alleviate the exploding gradient problem (Pascanu et al., 2013). Parameters were randomly initialized from a uniform distribution U(−0.08, 0.08). A two-layer LSTM was used for IFTTT, while a one-layer LSTM was employed for the other domains. The dropout rate was selected from {0.2, 0.3, 0.4, 0.5}. Dimensions of the hidden vector and word embedding were selected from {150, 200, 250}. Early stopping was used to determine the number of epochs. Input sentences were reversed before feeding into the encoder (Sutskever et al., 2014). We used greedy search to generate logical forms during inference.
Notice that two decoders with shared word embeddings were used to predict triggers and actions for IFTTT, and two softmax classifiers were used to classify channels and functions.
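The input-side preprocessing described above amounts to a small pipeline; a simplified sketch (spelling correction and rare-word filtering are omitted):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(sentence):
    """Lowercase, stem, and reverse the input tokens before encoding."""
    tokens = [stemmer.stem(tok) for tok in sentence.lower().split()]
    return list(reversed(tokens))
```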

Results
We first discuss the performance of our model on JOBS, GEO, and ATIS, and then examine our results on IFTTT. Tables 2-4 present comparisons against a variety of systems previously described in the literature. We report results with the full models (SEQ2SEQ, SEQ2TREE) and two ablation variants, i.e., without an attention mechanism (−attention) and without argument identification (−argument). We report accuracy, which is defined as the proportion of input sentences that are correctly parsed to their gold standard logical forms. Notice that DCS+L, KCAZ13 and GUSP output answers directly, so accuracy in this setting is defined as the percentage of correct answers.

Overall, SEQ2TREE is superior to SEQ2SEQ. This is to be expected since SEQ2TREE explicitly models compositional structure. On the JOBS and GEO datasets, which contain logical forms with nested structures, SEQ2TREE outperforms SEQ2SEQ by 2.9% and 2.5%, respectively. SEQ2TREE achieves better accuracy than SEQ2SEQ on ATIS too; however, the difference is smaller, since ATIS is a simpler domain without complex nested structures. We find that adding attention substantially improves performance on all three datasets.

We illustrate examples of alignments produced by SEQ2SEQ in Figures 6a and 6b. Alignments produced by SEQ2TREE are shown in Figures 6c and 6d. Matrices of attention scores are computed using Equation (5) and are represented in grayscale. Aligned input words and logical form predicates are enclosed in (same-color) rectangles whose overlapping areas contain the attention scores. Also notice that attention scores are computed by LSTM hidden vectors which encode context information rather than just the words in their current positions. The examples demonstrate that the attention mechanism can successfully model the correspondence between sentences and logical forms, capturing reordering (Figure 6b), many-to-many (Figure 6a), and many-to-one alignments (Figures 6c,d).
For IFTTT, we follow the same evaluation protocol introduced in Quirk et al. (2015). The dataset is extremely noisy and measuring accuracy is problematic since predicted abstract syntax trees (ASTs) almost never exactly match the gold standard. Quirk et al. view an AST as a set of productions and compute balanced F1 instead, which we also adopt. The first column in Table 5 shows the percentage of channels selected correctly for both triggers and actions. The second column measures accuracy for both channels and functions. The last column shows balanced F1 against the gold tree over all productions in the proposed derivation.

We compare our model against posclass, the method introduced in Quirk et al., and several of their baselines. posclass is reminiscent of KRISP (Kate and Mooney, 2006); it learns distributions over productions given input sentences represented as a bag of linguistic features. The retrieval baseline finds the closest description in the training data based on character string-edit-distance and returns the recipe for that training program. The phrasal method uses phrase-based machine translation to generate the recipe, whereas sync extracts synchronous grammar rules from the data, essentially recreating WASP (Wong and Mooney, 2006). Finally, they use a binary classifier to predict whether a production should be present in the derivation tree corresponding to the description.

Quirk et al. (2015) report results on the full test data and smaller subsets after noise filtering, e.g., when non-English and unintelligible descriptions are removed (Tables 5a and 5b). They also ran their system on a high-quality subset of description-program pairs which were found in the gold standard and at least three humans managed to independently reproduce (Table 5c). Across all subsets our models outperform posclass and related baselines. Again we observe that SEQ2TREE consistently outperforms SEQ2SEQ, albeit with a small margin. Compared to the previous datasets, the attention mechanism and our argument identification method yield less of an improvement. This may be due to the size of the Quirk et al. (2015) dataset and the way it was created: user-curated descriptions are often of low quality, and thus align very loosely to their corresponding ASTs.
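As a rough sketch of the production-overlap metric described above (one plausible reading of balanced F1 over the production sets of predicted and gold ASTs; the official evaluation may differ in details):

```python
def balanced_f1(pred_productions, gold_productions):
    """F1 between two sets of grammar productions extracted from ASTs."""
    pred, gold = set(pred_productions), set(gold_productions)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```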

Error Analysis
Finally, we inspected the output of our model in order to identify the most common causes of errors which we summarize below.
Under-Mapping The attention model used in our experiments does not take the alignment history into consideration. So, some question words, especially in longer questions, may be ignored in the decoding process. This is a common problem for encoder-decoder models and can be addressed by explicitly modelling the decoding coverage of the source words (Tu et al., 2016; Cohn et al., 2016). Keeping track of the attention history would help adjust future attention and guide the decoder towards untranslated source words.
Argument Identification Some mentions are incorrectly identified as arguments. For example, the word "may" is sometimes identified as a month when it is simply a modal verb. Moreover, some argument mentions are ambiguous. For instance, 6 o'clock can be used to express either 6 am or 6 pm. We could disambiguate arguments based on contextual information. The execution results of logical forms could also help prune unreasonable arguments.
Rare Words Because the data size of JOBS, GEO, and ATIS is relatively small, some question words are rare in the training set, which makes it hard to estimate reliable parameters for them. One solution would be to learn word embeddings on unannotated text data, and then use these as pretrained vectors for question words.

Conclusions
In this paper we presented an encoder-decoder neural network model for mapping natural language descriptions to their meaning representations. We encode natural language utterances into vectors and generate their corresponding logical forms as sequences or trees using recurrent neural networks with long short-term memory units. Experimental results show that enhancing the model with a hierarchical tree decoder and an attention mechanism improves performance across the board. Extensive comparisons with previous methods show that our approach performs competitively, without recourse to domain- or representation-specific features. Directions for future work are many and varied. For example, it would be interesting to learn a model from question-answer pairs without access to target logical forms. Beyond semantic parsing, we would also like to apply our SEQ2TREE model to related structured prediction tasks such as constituency parsing.

Figure 4: A SEQ2TREE decoding example for the logical form "A B (C)".

Figure 6: Alignments (same-color rectangles) produced by the attention mechanism (darker color represents a higher attention score). Input sentences are reversed and stemmed. Model output is shown for SEQ2SEQ (a, b) and SEQ2TREE (c, d).

Table 2: Evaluation results on JOBS.

Method  Accuracy
COCKTAIL (Tang and Mooney, 2001)  79.4
PRECISE (Popescu et al., 2003)  88.0
ZC05 (Zettlemoyer and Collins, 2005)  79.3
DCS+L (Liang et al., 2013)  90.7
TISP (Zhao and Huang, 2015)  85.0

Table 3: Evaluation results on GEO. 10-fold cross-validation is used for the systems shown in the top half of the table. The standard split of ZC05 is used for all other systems.

Method  Accuracy
SCISSOR (Ge and Mooney, 2005)  72.3
KRISP (Kate and Mooney, 2006)  71.7
WASP (Wong and Mooney, 2006)  74.8
λ-WASP (Wong and Mooney, 2007)  86.6
LNLZ08 (Lu et al., 2008)  81.8
ZC05 (Zettlemoyer and Collins, 2005)  79.3
ZC07 (Zettlemoyer and Collins, 2007)  86.1
UBL (Kwiatkowski et al., 2010)  87.9
FUBL (Kwiatkowski et al., 2011)  88.6
KCAZ13 (Kwiatkowski et al., 2013)  89.0
DCS+L (Liang et al., 2013)  87.9
TISP (Zhao and Huang, 2015)  88.9

Table 4: Evaluation results on ATIS.

Method  Accuracy
ZC07 (Zettlemoyer and Collins, 2007)  84.6
UBL (Kwiatkowski et al., 2010)  71.4
FUBL (Kwiatkowski et al., 2011)  82.8
GUSP-FULL (Poon, 2013)  74.8
GUSP++ (Poon, 2013)  83.5
TISP (Zhao and Huang, 2015)  84.2