Data Recombination for Neural Semantic Parsing

Modeling crisp logical regularities is crucial in semantic parsing, making it difficult for neural models with no task-specific prior knowledge to achieve good results. In this paper, we introduce data recombination, a novel framework for injecting such prior knowledge into a model. From the training data, we induce a high-precision synchronous context-free grammar, which captures important conditional independence properties commonly found in semantic parsing. We then train a sequence-to-sequence recurrent network (RNN) model with a novel attention-based copying mechanism on datapoints sampled from this grammar, thereby teaching the model about these structural properties. Data recombination improves the accuracy of our RNN model on three semantic parsing datasets, leading to new state-of-the-art performance on the standard GeoQuery dataset for models with comparable supervision.

Meanwhile, recurrent neural networks (RNNs) what are the major cities in utah ? what states border maine ?

Original Examples
Train Model

Synchronous CFG
Induce Grammar what are the major cities in [states border [maine]] ? what are the major cities in [states border [utah]] ? what states border [states border [maine]] ? what states border [states border [utah]] ?

Recombinant Examples
Figure 1: An overview of our system. Given a dataset, we induce a high-precision synchronous context-free grammar. We then sample from this grammar to generate new "recombinant" examples, which we use to train a sequence-to-sequence RNN.
have made swift inroads into many structured prediction tasks in NLP, including machine translation (Sutskever et al., 2014;Bahdanau et al., 2014) and syntactic parsing Dyer et al., 2015). Because RNNs make very few domain-specific assumptions, they have the potential to succeed at a wide variety of tasks with minimal feature engineering. However, this flexibility also puts RNNs at a disadvantage compared to standard semantic parsers, which can generalize naturally by leveraging their built-in awareness of logical compositionality.
ing prior knowledge into a domain-general structured prediction model. In data recombination, prior knowledge about a task is used to build a high-precision generative model that expands the empirical distribution by allowing fragments of different examples to be combined in particular ways. Samples from this generative model are then used to train a domain-general model. In the case of semantic parsing, we construct a generative model by inducing a synchronous context-free grammar (SCFG), creating new examples such as those shown in Figure 1; our domain-general model is a sequence-to-sequence RNN with a novel attention-based copying mechanism. Data recombination boosts the accuracy of our RNN model on three semantic parsing datasets. On the GEO dataset, data recombination improves test accuracy by 4.3 percentage points over our baseline RNN, leading to new state-of-the-art results for models that do not use a seed lexicon for predicates.

Problem statement
We cast semantic parsing as a sequence-tosequence task. The input utterance x is a sequence of words x 1 , . . . , x m ∈ V (in) , the input vocabulary; similarly, the output logical form y is a sequence of tokens y 1 , . . . , y n ∈ V (out) , the output vocabulary. A linear sequence of tokens might appear to lose the hierarchical structure of a logical form, but there is precedent for this choice:  showed that an RNN can reliably predict tree-structured outputs in a linear fashion. We evaluate our system on three existing semantic parsing datasets. Figure 2 shows sample input-output pairs from each of these datasets.  Zettlemoyer and Collins (2007).
• Overnight (OVERNIGHT) contains logical forms paired with natural language paraphrases across eight varied subdomains.  constructed the dataset by generating all possible logical forms up to some depth threshold, then getting multiple natural language paraphrases for each logical form from workers on Amazon Mechanical Turk. We evaluate on the same train/test splits as .
In this paper, we only explore learning from logical forms. In the last few years, there has an emergence of semantic parsers learned from denotations (Clarke et al., 2010;Liang et al., 2011;Berant et al., 2013;Artzi and Zettlemoyer, 2013b). While our system cannot directly learn from denotations, it could be used to rerank candidate derivations generated by one of these other systems.

Sequence-to-sequence RNN Model
Our sequence-to-sequence RNN model is based on existing attention-based neural machine translation models (Bahdanau et al., 2014;Luong et al., 2015a), but also includes a novel attention-based copying mechanism. Similar copying mechanisms have been explored in parallel by Gu et al. (2016) and Gulcehre et al. (2016).

Basic Model
Encoder. The encoder converts the input sequence x 1 , . . . , x m into a sequence of context-sensitive embeddings b 1 , . . . , b m using a bidirectional RNN (Bahdanau et al., 2014). First, a word embedding function φ (in) maps each word x i to a fixed-dimensional vector. These vectors are fed as input to two RNNs: a forward RNN and a backward RNN. The forward RNN starts with an initial hidden state h F 0 , and generates a sequence of hidden states h F 1 , . . . , h F m by repeatedly applying the recurrence The recurrence takes the form of an LSTM (Hochreiter and Schmidhuber, 1997). The backward RNN similarly generates hidden states h B m , . . . , h B 1 by processing the input sequence in reverse order. Finally, for each input position i, we define the context-sensitive embedding b i to be the concatenation of h F i and h B i Decoder. The decoder is an attention-based model (Bahdanau et al., 2014;Luong et al., 2015a) that generates the output sequence y 1 , . . . , y n one token at a time. At each time step j, it writes y j based on the current hidden state s j , then updates the hidden state to s j+1 based on s j and y j . Formally, the decoder is defined by the following equations: . (4) When not specified, i ranges over {1, . . . , m} and j ranges over {1, . . . , n}. Intuitively, the α ji 's define a probability distribution over the input words, describing what words in the input the decoder is focusing on at time j. They are computed from the unnormalized attention scores e ji . The matrices W (s) , W (a) , and U , as well as the embedding function φ (out) , are parameters of the model.

Attention-based Copying
In the basic model of the previous section, the next output word y j is chosen via a simple softmax over all words in the output vocabulary. However, this model has difficulty generalizing to the long tail of entity names commonly found in semantic parsing datasets. Conveniently, entity names in the input often correspond directly to tokens in the output (e.g., "iowa" becomes iowa in Figure 2). 1 To capture this intuition, we introduce a new attention-based copying mechanism. At each time step j, the decoder generates one of two types of actions. As before, it can write any word in the output vocabulary. In addition, it can copy any input word x i directly to the output, where the probability with which we copy x i is determined by the attention score on x i . Formally, we define a latent action a j that is either Write[w] for some w ∈ V (out) or Copy[i] for some i ∈ {1, . . . , m}. We then have The decoder chooses a j with a softmax over all these possible actions; y j is then a deterministic function of a j and x. During training, we maximize the log-likelihood of y, marginalizing out a.
Attention-based copying can be seen as a combination of a standard softmax output layer of an attention-based model (Bahdanau et al., 2014) and a Pointer Network (Vinyals et al., 2015a); in a Pointer Network, the only way to generate output is to copy a symbol from the input.
Our approach generalizes data augmentation, which is commonly employed to inject prior knowledge into a model. Data augmentation techniques focus on modeling invariancestransformations like translating an image or adding noise that alter the inputs x, but do not change the output y. These techniques have proven effective in areas like computer vision  and speech recognition (Jaitly and Hinton, 2013).
In semantic parsing, however, we would like to capture more than just invariance properties. Consider an example with the utterance "what states border texas ?". Given this example, it should be easy to generalize to questions where "texas" is replaced by the name of any other state: simply replace the mention of Texas in the logical form with the name of the new state. Underlying this phenomenon is a strong conditional independence principle: the meaning of the rest of the sentence is independent of the name of the state in question. Standard data augmentation is not sufficient to model such phenomena: instead of holding y fixed, we would like to apply simultaneous transformations to x and y such that the new x still maps to the new y. Data recombination addresses this need.

General Setting
In the general setting of data recombination, we start with a training set D of (x, y) pairs, which defines the empirical distributionp(x, y). We then fit a generative modelp(x, y) top which generalizes beyond the support ofp, for example by splicing together fragments of different examples. We refer to examples in the support ofp as recombinant examples. Finally, to train our actual model p θ (y | x), we maximize the expected value of log p θ (y | x), where (x, y) is drawn fromp.

SCFGs for Semantic Parsing
For semantic parsing, we induce a synchronous context-free grammar (SCFG) to serve as the backbone of our generative modelp. An SCFG consists of a set of production rules X → α, β , where X is a category (non-terminal), and α and β are sequences of terminal and non-terminal symbols. Any non-terminal symbols in α must be aligned to the same non-terminal symbol in β, and vice versa. Therefore, an SCFG defines a set of joint derivations of aligned pairs of strings. In our case, we use an SCFG to represent joint deriva-tions of utterances x and logical forms y (which for us is just a sequence of tokens). After we induce an SCFG G from D, the corresponding generative modelp(x, y) is the distribution over pairs (x, y) defined by sampling from G, where we choose production rules to apply uniformly at random.
It is instructive to compare our SCFG-based data recombination with WASP (Wong and Mooney, 2006;Wong and Mooney, 2007), which uses an SCFG as the actual semantic parsing model. The grammar induced by WASP must have good coverage in order to generalize to new inputs at test time. WASP also requires the implementation of an efficient algorithm for computing the conditional probability p(y | x). In contrast, our SCFG is only used to convey prior knowledge about conditional independence structure, so it only needs to have high precision; our RNN model is responsible for boosting recall over the entire input space. We also only need to forward sample from the SCFG, which is considerably easier to implement than conditional inference.
Below, we examine various strategies for inducing a grammar G from a dataset D. We first encode D as an initial grammar with rules ROOT → x, y for each (x, y) ∈ D. Next, we will define each grammar induction strategy as a mapping from an input grammar G in to a new grammar G out . This formulation allows us to compose grammar induction strategies (Section 4.3.4).

Abstracting Entities
Our first grammar induction strategy, ABSENTI-TIES, simply abstracts entities with their types. We assume that each entity e (e.g., texas) has a corresponding type e.t (e.g., state), which we infer based on the presence of certain predicates in the logical form (e.g. stateid). For each grammar rule X → α, β in G in , where α contains a token (e.g., "texas") that string matches an entity (e.g., texas) in β, we add two rules to G out : (i) a rule where both occurrences are replaced with the type of the entity (e.g., state), and (ii) a new rule that maps the type to the entity (e.g., STATEID → "texas", texas ; we reserve the category name STATE for the next section). Thus, G out generates recombinant examples that fuse most of one example with an entity found in a second example. A concrete example from the GEO domain is given in Figure 3.

Abstracting Whole Phrases
Our second grammar induction strategy, ABSW-HOLEPHRASES, abstracts both entities and whole phrases with their types. For each grammar rule X → α, β in G in , we add up to two rules to G out . First, if α contains tokens that string match to an entity in β, we replace both occurrences with the type of the entity, similarly to rule (i) from AB-SENTITIES. Second, if we can infer that the entire expression β evaluates to a set of a particular type (e.g. state) we create a rule that maps the type to α, β . In practice, we also use some simple rules to strip question identifiers from α, so that the resulting examples are more natural. Again, refer to Figure 3 for a concrete example.
This strategy works because of a more general conditional independence property: the meaning of any semantically coherent phrase is conditionally independent of the rest of the sentence, the cornerstone of compositional semantics. Note that this assumption is not always correct in general: for example, phenomena like anaphora that involve long-range context dependence violate this assumption. However, this property holds in most existing semantic parsing datasets.

Concatenation
The final grammar induction strategy is a surprisingly simple approach we tried that turns out to work. For any k ≥ 2, we define the CONCAT-k strategy, which creates two types of rules. First, we create a single rule that has ROOT going to a sequence of k SENT's. Then, for each rootlevel rule ROOT → α, β in G in , we add the rule SENT → α, β to G out . See Figure 3 for an example.
Unlike ABSENTITIES and ABSWHOLE-PHRASES, concatenation is very general, and can be applied to any sequence transduction problem. Of course, it also does not introduce additional information about compositionality or independence properties present in semantic parsing. However, it does generate harder examples for the attention-based RNN, since the model must learn to attend to the correct parts of the now-longer input sequence. Related work has shown that training a model on more difficult examples can improve generalization, the most canonical case being dropout Wager et al., 2013).
number of examples to sample n) Induce grammar G from D Initialize RNN parameters θ randomly for each iteration t = 1, . . . , T do Compute current learning rate ηt Initialize current dataset Dt to D for i = 1, . . . , n do Sample new example (x , y ) from G Add (x , y ) to Dt end for Shuffle Dt for each example (x, y) in Dt do θ ← θ + ηt∇ log p θ (y | x) end for end for end function Figure 4: The training procedure with data recombination. We first induce an SCFG, then sample new recombinant examples from it at each epoch.

Composition
We note that grammar induction strategies can be composed, yielding more complex grammars. Given any two grammar induction strategies f 1 and f 2 , the composition f 1 • f 2 is the grammar induction strategy that takes in G in and returns f 1 (f 2 (G in )). For the strategies we have defined, we can perform this operation symbolically on the grammar rules, without having to sample from the intermediate grammar f 2 (G in ).

Experiments
We evaluate our system on three domains: GEO, ATIS, and OVERNIGHT. For ATIS, we report logical form exact match accuracy. For GEO and OVERNIGHT, we determine correctness based on denotation match, as in Liang et al. (2011) and , respectively.

Choice of Grammar Induction Strategy
We note that not all grammar induction strategies make sense for all domains. In particular, we only apply ABSWHOLEPHRASES to GEO and OVERNIGHT. We do not apply ABSWHOLE-PHRASES to ATIS, as the dataset has little nesting structure.

Implementation Details
We tokenize logical forms in a domain-specific manner, based on the syntax of the formal language being used. On GEO and ATIS, we disallow copying of predicate names to ensure a fair comparison to previous work, as string matching between input words and predicate names is not commonly used. We prevent copying by prepending underscores to predicate tokens; see Figure 2 for examples.
On ATIS alone, when doing attention-based copying and data recombination, we leverage an external lexicon that maps natural language phrases (e.g., "kennedy airport") to entities (e.g., jfk:ap). When we copy a word that is part of a phrase in the lexicon, we write the entity associated with that lexicon entry. When performing data recombination, we identify entity alignments based on matching phrases and entities from the lexicon.
We run all experiments with 200 hidden units and 100-dimensional word vectors. We initialize all parameters uniformly at random within the interval [−0.1, 0.1]. We maximize the loglikelihood of the correct logical form using stochastic gradient descent. We train the model for a total of 30 epochs with an initial learning rate of 0.1, and halve the learning rate every 5 epochs, starting after epoch 15. We replace word vectors for words that occur only once in the training set with a universal <unk> word vector. Our model is implemented in Theano (Bergstra et al., 2010).
When performing data recombination, we sam- At test time, we use beam search with beam size 5. We automatically balance missing right parentheses by adding them at the end. On GEO and OVERNIGHT, we then pick the highest-scoring logical form that does not yield an executor error when the corresponding denotation is computed. On ATIS, we just pick the top prediction on the beam.

Impact of the Copying Mechanism
First, we measure the contribution of the attentionbased copying mechanism to the model's overall GEO Table 2: Test accuracy using different data recombination strategies on GEO and ATIS. AE is AB-SENTITIES, AWP is ABSWHOLEPHRASES, C2 is CONCAT-2, and C3 is CONCAT-3. performance. On each task, we train and evaluate two models: one with the copying mechanism, and one without. Training is done without data recombination. The results are shown in Table 1. On GEO and ATIS, the copying mechanism helps significantly: it improves test accuracy by 10.4 percentage points on GEO and 6.4 points on ATIS. However, on OVERNIGHT, adding the copying mechanism actually makes our model perform slightly worse. This result is somewhat expected, as the OVERNIGHT dataset contains a very small number of distinct entities. It is also notable that both systems surpass the previous best system on OVERNIGHT by a wide margin.
We choose to use the copying mechanism in all subsequent experiments, as it has a large advantage in realistic settings where there are many distinct entities in the world. The concurrent work of Gu et al. (2016) and Gulcehre et al. (2016), both of whom propose similar copying mechanisms, provides additional evidence for the utility of copying on a wide range of NLP tasks.

Main Results
2 The method of Liang et al. (2011) is not comparable to For our main results, we train our model with a variety of data recombination strategies on all three datasets. These results are summarized in Tables 2  and 3. We compare our system to the baseline of not using any data recombination, as well as to state-of-the-art systems on all three datasets.
We find that data recombination consistently improves accuracy across the three domains we evaluated on, and that the strongest results come from composing multiple strategies. Combining ABSWHOLEPHRASES, ABSENTITIES, and CONCAT-2 yields a 4.3 percentage point improvement over the baseline without data recombination on GEO, and an average of 1.7 percentage points on OVERNIGHT. In fact, on GEO, we achieve test accuracy of 89.3%, which surpasses the previous state-of-the-art, excluding Liang et al. (2011), which used a seed lexicon for predicates. On ATIS, we experiment with concatenating more than 2 examples, to make up for the fact that we cannot apply ABSWHOLEPHRASES, which generates longer examples. We obtain a test accuracy of 83.3 with ABSENTITIES composed with CONCAT-3, which beats the baseline by 7 percentage points and is competitive with the state-of-theart.
Data recombination without copying. For completeness, we also investigated the effects of data recombination on the model without attention-based copying. We found that recombination helped significantly on GEO and ATIS, but hurt the model slightly on OVERNIGHT. On GEO, the best data recombination strategy yielded test accuracy of 82.9%, for a gain of 8.3 percentage points over the baseline with no copying and no recombination; on ATIS, data recombination gives test accuracies as high as 74.6%, a 4.7 point gain over the same baseline. However, no data recombination strategy improved average test accuracy on OVERNIGHT; the best one resulted in a 0.3 percentage point decrease in test accuracy. We hypothesize that data recombination helps less on OVERNIGHT in general because the space of possible logical forms is very limited, making it more like a large multiclass classification task. Therefore, it is less important for the model to learn good compositional representations that generalize to new logical forms at test time.
ours, as they as they used a seed lexicon mapping words to predicates. We explicitly avoid using such prior knowledge in our system.  Table 3: Test accuracy using different data recombination strategies on the OVERNIGHT tasks.
Depth-2 (same length) x: "rel:12 of rel:17 of ent:14" y:  to simulate the acquisition of more training data). We constructed a simple world containing a set of entities and a set of binary relations. For any n, we can generate a set of depth-n examples, which involve the composition of n relations applied to a single entity. Example data points are shown in Figure 5. We train our model on various datasets, then test it on a set of 500 randomly chosen depth-2 examples. The model always has access to a small seed training set of 100 depth-2 examples. We then add one of four types of examples to the training set:
• Same length, recombinant: Depth-2 examples sampled from the grammar induced by applying ABSENTITIES to the seed dataset.
• Longer, recombinant: Depth-4 examples sampled from the grammar induced by applying ABSWHOLEPHRASES followed by AB-SENTITIES to the seed dataset.
To maintain consistency between the independent and recombinant experiments, we fix the recombinant examples across all epochs, instead of resampling at every epoch. In Figure 6, we plot accuracy on the test set versus the number of additional examples added of each of these four types. As expected, independent examples are more helpful than the recombinant ones, but both help the model improve considerably. In addition, we see that even though the test dataset only has short examples, adding longer examples helps the model more than adding shorter ones, in both the independent and recombinant cases. These results underscore the importance training on longer, harder examples.

Discussion
In this paper, we have presented a novel framework we term data recombination, in which we generate new training examples from a highprecision generative model induced from the original training dataset. We have demonstrated its effectiveness in improving the accuracy of a sequence-to-sequence RNN model on three semantic parsing datasets, using a synchronous context-free grammar as our generative model. There has been growing interest in applying neural networks to semantic parsing and related tasks. Dong and Lapata (2016) concurrently developed an attention-based RNN model for semantic parsing, although they did not use data recombination. Grefenstette et al. (2014) proposed a non-recurrent neural model for semantic parsing, though they did not run experiments. Mei et al. (2016) use an RNN model to perform a related task of instruction following.
Our proposed attention-based copying mechanism bears a strong resemblance to two models that were developed independently by other groups. Gu et al. (2016) apply a very similar copying mechanism to text summarization and singleturn dialogue generation. Gulcehre et al. (2016) propose a model that decides at each step whether to write from a "shortlist" vocabulary or copy from the input, and report improvements on machine translation and text summarization. Another piece of related work is Luong et al. (2015b), who train a neural machine translation system to copy rare words, relying on an external system to generate alignments.
Prior work has explored using paraphrasing for data augmentation on NLP tasks. Zhang et al. (2015) augment their data by swapping out words for synonyms from WordNet. Wang and Yang (2015) use a similar strategy, but identify similar words and phrases based on cosine distance between vector space embeddings. Unlike our data recombination strategies, these techniques only change inputs x, while keeping the labels y fixed. Additionally, these paraphrasing-based transformations can be described in terms of grammar induction, so they can be incorporated into our framework.
In data recombination, data generated by a highprecision generative model is used to train a second, domain-general model. Generative oversampling (Liu et al., 2007) learns a generative model in a multiclass classification setting, then uses it to generate additional examples from rare classes in order to combat label imbalance. Uptraining (Petrov et al., 2010) uses data labeled by an accurate but slow model to train a computationally cheaper second model.  generate a large dataset of constituency parse trees by taking sentences that multiple existing systems parse in the same way, and train a neural model on this dataset. Some of our induced grammars generate examples that are not in the test distribution, but nonetheless aid in generalization. Related work has also explored the idea of training on altered or out-of-domain data, often interpreting it as a form of regularization. Dropout training has been shown to be a form of adaptive regularization Wager et al., 2013). Guu et al. (2015) showed that encouraging a knowledge base completion model to handle longer path queries acts as a form of structural regularization.
Language is a blend of crisp regularities and soft relationships. Our work takes RNNs, which excel at modeling soft phenomena, and uses a highly structured tool-synchronous context free grammars-to infuse them with an understanding of crisp structure. We believe this paradigm for simultaneously modeling the soft and hard aspects of language should have broader applicability beyond semantic parsing.