Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing

In this paper, we present an approach to incorporate retrieved datapoints as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment. Our approach naturally combines a retrieval model and a meta-learner, where the former learns to find similar datapoints from the training data, and the latter considers retrieved datapoints as a pseudo task for fast adaptation. Specifically, our retriever is a context-aware encoder-decoder model with a latent variable which takes the context environment into consideration, and our meta-learner learns to utilize retrieved datapoints in a model-agnostic meta-learning paradigm for fast adaptation. We conduct experiments on the CONCODE and CSQA datasets, where the context refers to the class environment in Java code and the conversational history, respectively. We use a sequence-to-action model as the base semantic parser, which achieves state-of-the-art accuracy on both datasets. Results show that both the context-aware retriever and the meta-learning strategy improve accuracy, and our approach performs better than retrieve-and-edit baselines.


Introduction
Context-dependent semantic parsing aims to map a natural language utterance to a structural logical form (e.g., source code) conditioned on a given context (e.g., class environment) (Ling et al., 2016; Long et al., 2016; Iyyer et al., 2017; Iyer et al., 2018; Suhr et al., 2018; Suhr and Artzi, 2018). Standard approaches typically learn a one-size-fits-all model on the entire training dataset, which is fed with each example individually in the training phase and makes predictions for each test example in the inference phase. However, taking code generation as an example, programmers usually do not write code from scratch in the real world. When they write a piece of code in a particular environment, they typically leverage past experience of writing or reading code in similar situations as guidance. Meanwhile, datapoints for a task may vary widely (Huang et al., 2018a), so it is desirable to learn a "personalized" model for the target datapoint. In this work, we study how to automatically retrieve similar datapoints in a context-dependent scenario and use them as supporting evidence to facilitate semantic parsing.
There are recent attempts at exploiting retrieved examples to improve the generation of logical forms and text. Retrieve-and-edit approaches (Hashimoto et al., 2018; Huang et al., 2018b; Wu et al., 2018; Gu et al., 2017) typically first use a context-independent retriever to find the most relevant datapoint, and then use it as an additional input to the editing model. However, a context-aware retriever is very important for the task of context-dependent semantic parsing. For example, as shown in Figure 1, the class environment can help the retriever decide whether the desired code for "Increment this vector" is generated by directly calling add() or by iterating over the vecElements array to increment each element. Furthermore, retrieve-and-edit approaches typically consider only one similar example to edit. In semantic parsing, the pattern of a structural output may come from different retrieved examples. There also exist works that utilize multiple examples to guide the semantic parser (Hayati et al., 2018; Huang et al., 2018a). However, these approaches either exploit the retrieved logical form heuristically, such as by increasing the probability of actions (Hayati et al., 2018), or use a relevance function designed and learned based on expertise about the target logical form (Huang et al., 2018a). When we consider the context environment, it is nontrivial to design a context-aware relevance function, since the form of the context environment varies widely.
In this work, we propose to retrieve similar examples by taking into account the context environment, and to use meta-learning to utilize retrieved examples to guide the generation of a logical form. Our retriever is a context-aware encoder-decoder model that takes the context environment into consideration. Specifically, the model is based on the variational auto-encoder framework (Kingma and Welling, 2013; Rezende et al., 2014), and encodes a natural language utterance with the context environment into a latent variable that can produce the correct logical form. We adopt the meta-learning framework (Finn et al., 2017) to train a general semantic parser that can quickly adapt to a new (pseudo) task via few-shot learning, where multiple retrieved examples are viewed as the support set of a pseudo task. Our approach naturally makes use of multiple similar examples to guide the semantic parser in the current task.
We evaluate our approach on the CONCODE (Iyer et al., 2018) and CSQA (Saha et al., 2018) datasets, where the tasks are generating source code conditioned on the class environment in Java code, and answering conversational questions over a knowledge graph conditioned on the conversational history. Results show that our approach achieves state-of-the-art performance on both datasets. We show that coupling retrieval and meta-learning performs better than two retrieve-and-edit baselines. Further analysis shows that both the context-aware retriever and the meta-learning strategy improve the performance.

Task Definition and Datasets
Context-dependent semantic parsing aims to map a natural language utterance to a structural logical form conditioned on the context environment. In this section, we introduce the two tasks we study, namely code generation and conversational question answering, and the datasets we use.

Context-dependent Code Generation
Figure 1 shows an example of code generation. Given a natural language (NL) description x, the goal is to generate source code y conditioned on the class environment c. Formally, the class environment comprises two kinds of context: (1) class variables v, composed of variable names and their data types (e.g., double[] vecElements), and (2) class methods m, including method names with their return types (e.g., void add()). We conduct experiments on the CONCODE dataset (Iyer et al., 2018). The dataset is built from about 33,000 public Java projects on GitHub and contains NL descriptions and code together with class environment information.

Conversational Question Answering
This task aims to answer questions in conversations based on a knowledge base (KB). We tackle the problem in a context-dependent semantic parsing manner. Specifically, the task aims to map the question x, conditioned on the conversational history c, into a logical form y, which is executed on the KB to produce the answer. The conversational history refers to the preceding questions {q_1, q_2, ..., q_{i−1}}. In particular, we use the CSQA dataset (Saha et al., 2018) to develop our model and to conduct the experiments. The dataset is created based on Wikidata with 12.8M entities, including 152K/16K/28K dialogs for training/development/testing.

Overview of the Approach
We present our approach in this section, which first retrieves supporting datapoints from the training dataset using a context-aware retriever, and then considers the retrieved datapoints as a pseudo task for fast adaptation in a model-agnostic meta-learning paradigm (Finn et al., 2017). Figure 2 gives an overview of our approach. First, we sample a batch of examples D′ from the training dataset D. In meta-learning, there are two optimization steps, namely the meta-train step (Step 1 in Figure 2), which learns a task-specific learner M_θ′ based on the current parameter θ, and the meta-test step (Step 2 in Figure 2), which updates the parameter θ based on the evaluation of M_θ′. In this work, D′ is used for the meta-test step, and the examples S′ retrieved by the context-aware retriever are used for the meta-train step. In the inference phase, we consider the prediction of each test example as a new task, given retrieved examples from the training data as the supporting evidence. Instead of applying the general model M_θ directly, the retrieved examples are used to update the model, and the updated model is used to make predictions. The approach is summarized in Algorithm 1.
The details about the context-aware retriever and the semantic parser model will be introduced in Sections 4 and 5, respectively.

Figure 2: An overview of our approach that couples context-aware retriever and meta-learning. Step 1 (meta-train): θ′ ← θ − α∇_θ L(M_θ). Step 2 (meta-test).
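The two optimization steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a toy one-parameter regression model y = θ·x in place of the semantic parser, and the names `adapt`, `meta_gradient`, `meta_update`, `alpha` and `beta` are hypothetical. The toy loss is quadratic, so the gradient through the inner step can be written exactly with the chain rule.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y) for a toy 1-D regression task


def loss(theta: float, data: List[Example]) -> float:
    """Mean squared error of the toy model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta: float, data: List[Example]) -> float:
    """Derivative of the loss with respect to theta."""
    return sum(2.0 * x * (theta * x - y) for x, y in data) / len(data)


def hessian(data: List[Example]) -> float:
    """Second derivative of the loss (constant for this quadratic loss)."""
    return sum(2.0 * x * x for x, _ in data) / len(data)


def adapt(theta: float, support: List[Example], alpha: float) -> float:
    """Meta-train step: one gradient step on the retrieved support set S'."""
    return theta - alpha * grad(theta, support)


def meta_gradient(theta: float, support: List[Example],
                  batch: List[Example], alpha: float) -> float:
    """Gradient of the batch loss at theta' with respect to the original theta."""
    theta_prime = adapt(theta, support, alpha)
    # Chain rule through the inner step: d(theta')/d(theta) = 1 - alpha * hessian.
    return grad(theta_prime, batch) * (1.0 - alpha * hessian(support))


def meta_update(theta: float, support: List[Example],
                batch: List[Example], alpha: float, beta: float) -> float:
    """Meta-test step: update theta using the loss of the adapted model on D'."""
    return theta - beta * meta_gradient(theta, support, batch, alpha)


# Hypothetical data: SUPPORT plays the role of S', BATCH the role of D'.
SUPPORT = [(1.0, 2.0), (2.0, 4.2)]
BATCH = [(1.5, 3.0), (3.0, 6.3)]

theta = 0.0
for _ in range(200):  # training loop over meta-updates
    theta = meta_update(theta, SUPPORT, BATCH, alpha=0.05, beta=0.05)

# Inference: adapt on the retrieved examples first, then predict with theta'.
theta_test = adapt(theta, SUPPORT, alpha=0.05)
prediction = theta_test * 2.5
```

At inference time only `adapt` is run, which mirrors the description that the retrieved examples update the model before it makes a prediction.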

Context-Aware Retriever
In this section, we present the model architecture of our context-aware retriever, the way to use the model to retrieve similar examples using a distance metric in the latent space, and how to effectively train the model. The retriever encodes the natural language x with the context environment c into a latent variable z that can predict the output y.

Algorithm 1 (excerpt):
Evaluate ∇_θ L(M_θ) using S′, and compute adapted parameters with gradient descent: θ′ = θ − α∇_θ L(M_θ)
Update θ ← θ − β∇_θ L(M_θ′) using D′ for meta-update
end while
Encoder. We use bidirectional RNNs with LSTM (Hochreiter and Schmidhuber, 1997) as encoders to compute the representation h_x of the natural language x and the representation h_c of the context environment c, where h_x is the hidden state of the NL encoder at the last token. Details about h_c for the CONCODE and CSQA datasets are provided in Appendix A.

Latent variable
We have two latent variables: z_x for the current utterance and z_c for the context. We use the concatenation of z_x and z_c as the embedding of the natural language with the context, namely z = [z_x; z_c].
We describe how to map the natural language x into a latent variable z_x here. The calculation of z_c is analogous. Following Hashimoto et al. (2018), we choose z_x to follow a von Mises-Fisher (vMF) distribution over unit vectors centered on µ_x, where both z_x and µ_x are unit vectors, and Z_κ is a normalization constant depending only on the constant κ and the dimension d of z_x. The direction µ_x is calculated by a linear layer followed by an activation function, with h_x as the input.
$$p(z_x \mid x) = \mathrm{vMF}(z_x; \mu_x, \kappa) = Z_\kappa^{-1} \exp(\kappa \mu_x^\top z_x) \tag{1}$$
Other distributions such as the Gaussian distribution can also be used to represent latent variables, but we choose the vMF in this paper since the KL divergence between two vMF distributions with the same κ and d is proportional to the squared Euclidean distance between their respective direction vectors µ. This property will be used in the next section.
Decoder. During decoding, we first sample z from p(z|x, c) using the re-parametrization trick (Kingma and Welling, 2014), and then use an additional linear layer over z to obtain the initial hidden state of the decoder. We use an LSTM as the decoder. At each time-step t, the current hidden state s_t of the decoder is used to predict a word from the vocabulary. In order to ensure that the target y is inferred only from the latent variable z, we do not incorporate an attention or copying mechanism. This strategy is also used in Hashimoto et al. (2018).

Retrieve Examples
We use the KL divergence as the distance metric to retrieve similar examples in the latent space. In particular, the KL divergence between two vMF distributions with the same concentration parameter κ is calculated as follows, where µ_1, µ_2 ∈ R^{d−1} and C_κ is calculated as in Equation 3.
$$\mathrm{KL}(\mathrm{vMF}(\mu_1, \kappa) \,\|\, \mathrm{vMF}(\mu_2, \kappa)) = C_\kappa \lVert \mu_1 - \mu_2 \rVert_2^2 \tag{2}$$
$$C_\kappa = \frac{\kappa I_{d/2}(\kappa)}{2 I_{d/2-1}(\kappa)} \tag{3}$$
I_d stands for the modified Bessel function of the first kind at order d. Since C_κ only depends on κ and d, the KL divergence is proportional to the squared Euclidean distance between the respective direction vectors µ for the same κ and d. More details about the proof of this proposition can be found in Hashimoto et al. (2018).
Given two examples (x, c) and (x′, c′), the KL divergence between their distributions of latent variables (i.e., p(z|x, c) and p(z|x′, c′)) is equivalent to the distance given in Equation 4. The retriever finds the top-K nearest examples according to this distance.
$$\mathrm{KL}(p(z \mid x, c) \,\|\, p(z \mid x', c')) = C_\kappa \left( \lVert \mu_x - \mu_{x'} \rVert_2^2 + \lVert \mu_c - \mu_{c'} \rVert_2^2 \right) \tag{4}$$
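The retrieval step can be sketched as a nearest-neighbour search over direction vectors. This is an illustration under stated assumptions, not the authors' code: each example is assumed to be represented by a pair of unit vectors (one for the utterance, one for the context), the constant C_κ is dropped because it scales every distance equally and so does not change the ranking, and the function names are hypothetical.

```python
import heapq
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Encoded = Tuple[Vector, Vector]  # (mu_x, mu_c): utterance and context directions


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length, as required for vMF direction vectors."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [a / norm for a in v]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def latent_distance(query: Encoded, other: Encoded) -> float:
    """Sum of the utterance and context distances (the constant factor is omitted)."""
    return (squared_distance(query[0], other[0])
            + squared_distance(query[1], other[1]))


def retrieve_top_k(query: Encoded, index: Sequence[Encoded], k: int) -> List[int]:
    """Return the positions of the k training examples closest to the query."""
    return heapq.nsmallest(
        k, range(len(index)), key=lambda i: latent_distance(query, index[i])
    )


# Hypothetical 2-D direction vectors for a query and four training examples.
QUERY: Encoded = ([1.0, 0.0], [0.0, 1.0])
INDEX: List[Encoded] = [
    ([1.0, 0.0], [0.0, 1.0]),    # same utterance and context directions
    ([0.0, 1.0], [0.0, 1.0]),    # same context, different utterance
    ([1.0, 0.0], [0.6, 0.8]),    # same utterance, nearby context
    ([-1.0, 0.0], [0.0, -1.0]),  # opposite in both components
]
nearest = retrieve_top_k(QUERY, INDEX, k=2)
```

A linear scan is enough for a sketch; for a training set of realistic size an approximate nearest-neighbour index would be the natural substitute.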

Training
Our entire approach corresponds to the following generative process. Given an example (x, c), we first use the retriever p_ret(S|x, c) to find similar examples S as a support set, and then generate an output y with a meta-learner model p_m(y|x, c, S) based on S. Therefore, the probability distribution over targets y is formulated in Equation 5.
$$p(y \mid x, c) = \sum_{S} p_{ret}(S \mid x, c)\, p_m(y \mid x, c, S) \tag{5}$$
A basic idea for learning the retriever might be to maximize the marginal likelihood by joint learning, but this is computationally intractable. Instead, we train the retriever in isolation, assuming that the semantic parser provides the true conditional distribution over the target y given the context c and the retrieved examples S under the joint distribution p_ret(S|x, c) p_data(x, c, y). Then, we optimize a lower bound for the marginal likelihood under this semantic parser (Hashimoto et al., 2018), which decomposes into the reconstruction term and the KL divergence as follows.
$$\log p(y \mid x, c) \geq \mathbb{E}_{z \sim p(z \mid x, c)} \log p(y \mid z) - \mathrm{KL}(p(z \mid x, c) \,\|\, p(z \mid x', c'))$$
According to Equation 4, the upper bound of KL(p(z|x, c) || p(z|x′, c′)) is 8C_κ. Therefore, we can maximize this worst-case lower bound, where C_κ is constant in our case. This lower bound objective is analogous to the recently proposed hyperspherical variational autoencoder (Davidson et al., 2018; Xu and Durrett, 2018).
Thus, we optimize the context-aware retriever by maximizing E_{z∼p(z|x,c)} log p(y|z).
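The constant 8C_κ can be traced with a short worked bound. This is a sketch that assumes the distance in Equation 4 is the sum of two vMF divergences, one for z_x and one for z_c, each of the form C_κ‖µ − µ′‖² as the text describes; primed symbols denote the retrieved example.

```latex
\begin{align*}
\lVert \mu - \mu' \rVert_2^2
  &= \lVert \mu \rVert_2^2 + \lVert \mu' \rVert_2^2 - 2\mu^\top \mu'
   = 2 - 2\mu^\top \mu' \le 4
  && \text{(unit vectors, } \mu^\top \mu' \ge -1\text{)} \\
\mathrm{KL}\big(p(z \mid x, c) \,\|\, p(z \mid x', c')\big)
  &= C_\kappa \big( \lVert \mu_x - \mu_{x'} \rVert_2^2
     + \lVert \mu_c - \mu_{c'} \rVert_2^2 \big)
   \le C_\kappa (4 + 4) = 8 C_\kappa
\end{align*}
```

Once κ and d are fixed, 8C_κ does not depend on the model parameters, so subtracting it from the bound leaves the reconstruction term as the only quantity to optimize.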

Semantic Parser
Recently, sequence-to-action models (Yin and Neubig, 2017; Chen et al., 2018; Iyer et al., 2018; Guo et al., 2018) have achieved strong performance in semantic parsing. They consider the generation of a logical form as the prediction of a sequence of actions (e.g., derivation rules in a defined grammar). We use two context-dependent sequence-to-action models (Iyer et al., 2018; Guo et al., 2018) as the base semantic parsers, both of which take a natural language utterance with the context environment as the input and output an action sequence. The two models achieve state-of-the-art results on the CONCODE and CSQA datasets, respectively.
In the task of code generation, the Java abstract grammar contains a set of production rules, each composed of a non-terminal and multiple symbols (e.g., Statement → return Expression). We represent source code as an Abstract Syntax Tree (AST) built by applying several production rules (Aho et al., 2007). The sequence of production rules applied to generate an AST is viewed as an action sequence a_1, ..., a_n, where an action a refers to a production rule. To access the context environment, we introduce several special actions. For example, two actions, IdentifierNT → ClassMethod and ClassMethod → constant, are used to invoke class methods. The former action means that the identifier comes from the class methods, and the latter is an action used for instantiating ClassMethod by a copying mechanism (Gu et al., 2016). In the task of conversational question answering, we convert the logical form into an action sequence using a grammar similar to the one defined in Guo et al. (2018).
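To make the action-sequence view concrete, the sketch below replays a sequence of production rules to rebuild a token sequence. It is only an illustration: the toy grammar is hypothetical apart from the rule names quoted above, the choice to always expand the leftmost non-terminal is an assumption rather than something the text specifies, and `apply_actions` is not a function from the paper.

```python
from typing import List, Set, Tuple

# An action is a production rule: (left-hand non-terminal, right-hand symbols).
Action = Tuple[str, Tuple[str, ...]]

# Hypothetical non-terminals of a toy grammar fragment.
NON_TERMINALS: Set[str] = {"Statement", "Expression", "IdentifierNT", "ClassMethod"}


def apply_actions(start: str, actions: List[Action],
                  non_terminals: Set[str]) -> List[str]:
    """Expand the leftmost non-terminal with each action, in order."""
    symbols = [start]
    for lhs, rhs in actions:
        position = next(
            (i for i, s in enumerate(symbols) if s in non_terminals), None
        )
        if position is None:
            raise ValueError("no non-terminal left to expand")
        if symbols[position] != lhs:
            raise ValueError(
                f"action expands {lhs!r} but the next non-terminal is "
                f"{symbols[position]!r}"
            )
        # Replace the non-terminal with the right-hand side of the rule.
        symbols[position:position + 1] = list(rhs)
    return symbols


# The last action instantiates ClassMethod with the constant "add", standing
# in for a class method name copied from the environment.
ACTIONS: List[Action] = [
    ("Statement", ("return", "Expression", ";")),
    ("Expression", ("IdentifierNT", "(", ")")),
    ("IdentifierNT", ("ClassMethod",)),
    ("ClassMethod", ("add",)),
]
tokens = apply_actions("Statement", ACTIONS, NON_TERMINALS)
```

The check on the left-hand side is the same legality condition that the decoder enforces when it restricts its predictions to actions matching the current non-terminal.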
Specifically, a bidirectional LSTM takes the source sentence as input and feeds the concatenation of the hidden states at both ends to the decoder as its initial state. The decoder uses another LSTM to generate the action sequence step by step. At each time-step t, the decoder calculates the current hidden state s^dec_t as in Equation 8, where s^dec_{t−1} is the previous hidden state, n_t is the current non-terminal to be expanded, and y_{t−1} is the previously predicted action. p_nt and s_nt are the parent action and the corresponding decoder hidden state, respectively, which produce the current non-terminal. If the previously predicted action is an instantiated action, the embedding y_{t−1} is the representation of the selected constant.
In order to generate a valid logical form, the model incorporates an action-constrained grammar to filter out illegitimate actions. An action is legitimate if its left-hand non-terminal is the same as the current non-terminal to be expanded. Let us denote the set of legitimate actions at time step t as A_t = {a_1, ..., a_N}. The probability distribution over the set is calculated as in Equation 9, where v_i is the one-hot indicator vector for a_i, W_a is a model parameter, and a_{<t} stands for the actions preceding the t-th time step.
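The legitimacy filter can be sketched as a softmax restricted to the actions whose left-hand side matches the current non-terminal. This is an assumption-laden illustration: the raw scores stand in for whatever Equation 9 computes from the decoder state, and the action list and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Action = Tuple[str, Tuple[str, ...]]  # (left-hand non-terminal, right-hand symbols)


def legitimate_actions(actions: Sequence[Action], current_nt: str) -> List[int]:
    """Indices of actions whose left-hand side equals the current non-terminal."""
    return [i for i, (lhs, _) in enumerate(actions) if lhs == current_nt]


def constrained_distribution(scores: Sequence[float],
                             actions: Sequence[Action],
                             current_nt: str) -> Dict[int, float]:
    """Softmax over the legitimate actions only; all others get no probability."""
    legal = legitimate_actions(actions, current_nt)
    if not legal:
        raise ValueError(f"no action can expand {current_nt!r}")
    top = max(scores[i] for i in legal)  # subtract the max for numerical stability
    weights = {i: math.exp(scores[i] - top) for i in legal}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


# Hypothetical grammar fragment and unnormalized scores for one decoding step.
ACTIONS: List[Action] = [
    ("Statement", ("return", "Expression", ";")),
    ("Expression", ("IdentifierNT", "(", ")")),
    ("Expression", ("IdentifierNT",)),
]
SCORES = [5.0, 1.0, 1.0]
distribution = constrained_distribution(SCORES, ACTIONS, "Expression")
```

Even though the Statement rule has the highest raw score, it receives no probability mass here because it cannot expand an Expression.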
For instantiated actions (e.g., ClassMethod → constant), the probability of a constant m being instantiated at time-step t is calculated as in Equation 10, where W is a model parameter and v_m is the embedding of the constant.
Please see Appendix B for more details about the model hyperparameters.

Model Comparisons on CONCODE
Table 1 reports the results of different approaches on the CONCODE dataset. We use exact match accuracy (Exact) as the major evaluation metric, which measures whether the generated program is exactly correct. Following Iyer et al. (2018), we also report the BLEU-4 score (Papineni et al., 2002) between the reference and the generated code. The approaches are divided into three groups. The first group is retrieval only, which directly returns the top-ranked retrieved example. From the first group, we can see that directly using the retrieved output has an extremely low Exact score because of mismatched environment variables and methods, which means that its meaning is incorrect. Yet the BLEU score is acceptable, which means that some constituents might be useful. In the second group, we compare parsing-based methods without retrieved examples. As we can see, our Seq2Action model outperforms the other models, resulting in state-of-the-art accuracy without using retrieved examples. In the third group, we implement two retrieval-augmented methods for comparison. Retrieve-and-edit uses a copying mechanism to copy tokens from the retrieved example (Hashimoto et al., 2018). Edit vector calculates an edit vector by considering lexical differences between a prototype context and the current context, and uses the edit vector as an extra feature (Wu et al., 2018). We can see that applying the MAML framework to the Seq2Action model achieves a gain of 1.35% in exact match accuracy. Results also show that our context-aware retriever performs better than the context-independent retriever in various settings.

Model Comparisons on CSQA
We follow the experiment protocol of Guo et al. (2018). To make the comparison clearer, we use the F1 score as the evaluation metric for questions whose answers are sets of entities. Accuracy is used to measure the performance on questions that produce boolean and numerical answers. Table 2 shows the results of different methods on the CSQA dataset. More detailed numbers are provided in Appendix C. HRED+KVmem (Saha et al., 2018) is an encoder-decoder model with a key-value memory network (Miller et al., 2016) that directly produces answers. D2A (Guo et al., 2018) is a sequence-to-action model described in Section 5. Since the dataset does not provide an annotated action sequence for each question, we follow Guo et al. (2018) and use a breadth-first-search algorithm to obtain action sequences that lead to correct answers. However, some of the action sequences are spurious (Guu et al., 2017), in the sense that they do not represent the meaning of the questions but still yield the correct answers. We use examples retrieved by our context-aware model to filter out spurious action sequences: we choose the action sequence most similar to the retrieved action sequences, measured by edit distance. We denote the model learned in this way as S2A.
Table 2 shows that filtering out spurious action sequences brings an improvement of about 5 percentage points on boolean and Quantitative Reasoning (Count) questions. Results also show that applying the MAML framework performs better than both retrieve-and-edit approaches, namely RAndE (Hashimoto et al., 2018) and EditVec (Wu et al., 2018), on the majority of question types, especially on complex questions.
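The filtering of spurious action sequences by edit distance can be sketched as follows. The text does not say how distances to several retrieved sequences are combined, so taking the minimum over them is an assumption made here, and the action names and function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two action sequences, counted in actions."""
    previous = list(range(len(b) + 1))
    for i, action_a in enumerate(a, start=1):
        current = [i]
        for j, action_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                          # delete from a
                current[j - 1] + 1,                       # insert into a
                previous[j - 1] + (action_a != action_b), # substitute or match
            ))
        previous = current
    return previous[-1]


def pick_candidate(candidates: Sequence[Sequence[str]],
                   retrieved: Sequence[Sequence[str]]) -> Sequence[str]:
    """Keep the candidate closest to any retrieved action sequence."""
    if not candidates or not retrieved:
        raise ValueError("need at least one candidate and one retrieved sequence")
    return min(
        candidates,
        key=lambda c: min(edit_distance(c, r) for r in retrieved),
    )


# Both candidates are assumed to execute to the correct answer; the retrieved
# sequences suggest which one reflects the meaning of the question.
CANDIDATES: List[List[str]] = [
    ["find", "count"],
    ["find", "filter", "count"],
]
RETRIEVED: List[List[str]] = [
    ["find", "filter", "count"],
    ["find", "filter", "max"],
]
chosen = pick_candidate(CANDIDATES, RETRIEVED)
```

Summing or averaging the distances instead of taking the minimum would be an equally plausible reading and changes only the `key` function.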

Model Analysis
We study how the amount of training data and the number of retrieved examples impact the overall performance on CONCODE. From Figure 4, we can see that S2A+MAML performs better than S2A when more than 20% of the supervised datapoints are available to retrieve from. From Figure 5, we can see that the accuracy increases as the number of retrieved examples grows. This is consistent with our intuition that the performance of the semantic parser is improved by utilizing multiple retrieved examples, since the pattern of a logical form may come from different retrieved examples. We did not try larger numbers of retrieved examples due to the memory limit of our GPU device. However, excessive retrieved examples may introduce noise, which hurts the performance of the semantic parser. Therefore, we need to choose an appropriate number of retrieved examples.

Case Study
We give a case study to illustrate the results retrieved by our context-aware retriever, with a comparison to the context-independent retriever. Results are given in Figure 6. We can see that our retriever can capture semantic content. For example, in the second row, the current question (i.e., "who is the spouse of that one") has the same meaning as the example found by the context-aware retriever (i.e., "which person is married to that one"), which demonstrates that our retriever learns the meaning of "spouse" and "married" in the retrieval process. Compared with the context-independent retriever, incorporating the context environment can improve the performance of the retrieval. Taking the first row as an example, although the NL of the input (i.e., "Does the set contain a particular item") has a similar meaning to that of the example found by the context-independent retriever (i.e., "Does the set contain the given key"), the source codes differ greatly because the types of the sets are different (i.e., HashMap and Node, respectively). Our context-aware retriever can find the example with similar source code by considering the context environment (both HashMap and Map have the same containsKey function).

Error Analysis
We analyze a randomly selected set of 100 wrongly predicted instances on the CONCODE dataset. We observe that 44% of the examples do not correctly copy class members, and the majority of these lack information about the class member (e.g., the effect of the class method get). This problem might be mitigated by encoding the source code of class methods or incorporating descriptions of class members. 24% of the examples fail to invoke functions of the library and member classes (e.g., a model is required to know that there exists a size() function in the List class in order to invoke list.size()). A potential direction to mitigate the problem is to incorporate definitions of the external or system classes, which requires an updated version of the dataset. Among the other 32% of the examples, the major problem is that some of the retrieved examples are incorrect. Incorporating more signals to measure the usefulness of retrieved examples might alleviate this problem.

Related Work
Neural encoder-decoder models have proved effective in semantic parsing (Neelakantan et al., 2015; Dong and Lapata, 2016; Yin and Neubig, 2017; Herzig and Berant, 2018). One direction is to employ a sequence-to-sequence model by modeling semantic parsing as a sentence-to-logical-form translation problem (Dong and Lapata, 2016; Jia and Liang, 2016; Ling et al., 2016; Xiao et al., 2016). However, regarding a logical form as a sequence cannot guarantee the grammatical correctness of the generated output. Sequence-to-action approaches (Yin and Neubig, 2017; Krishnamurthy et al., 2017; Iyyer et al., 2017; Chen et al., 2018) treat semantic parsing as the prediction of an action sequence that can construct logical forms, which not only guarantees the grammatical correctness of outputs, but also leverages the strength of sequence-to-sequence models in learning sequential transformations.
There are recent attempts at exploiting retrieved examples to improve the generation of logical forms. Hashimoto et al. (2018) propose a retrieve-and-edit approach, including an encoder-decoder based retrieval model learned in a task-dependent way without relying on a hand-crafted metric, and an editing model with a copying mechanism to replicate tokens from the retrieved example. Hayati et al. (2018) increase the probability of actions that can derive the retrieved subtrees. Huang et al. (2018a) also use MAML and treat each example as a new task. Their relevance function for retrieving examples is based on the predicted type of the SQL query and the question length. Different from these three works, we focus on context-dependent semantic parsing, and our context-aware retriever is learned from the dataset without the help of a hand-crafted relevance function. Different from Hashimoto et al. (2018), our approach naturally makes use of multiple similar examples to improve the semantic parser.
Retrieval-augmented models have also been studied in text generation (Gu et al., 2017; Huang et al., 2018b; Guu et al., 2018; Wu et al., 2018). Gu et al. (2017) use the retrieved sentence pairs as extra inputs to the NMT model. Wu et al. (2018) calculate an edit vector by considering lexical differences between a prototype context and the current context, which is used as an extra feature.

Conclusion
In this paper, we present an approach which combines a context-aware retrieval model and model-agnostic meta-learning (MAML) to utilize multiple retrieved examples for context-dependent semantic parsing. We show that both the context-aware retriever and MAML are useful on the CONCODE and CSQA datasets. Our approach achieves state-of-the-art performance and outperforms two retrieve-and-edit baselines.

Figure 1: Code generation based on the class environment and a natural language documentation (NL). (a) shows an example of code generation by applying the class function add(), while (b) iterates over the vecElements array to increment each element.

4.1
Figure 3 illustrates an overview of the retrieval model in the task of generating source code. Following Hashimoto et al. (2018), our retriever is an encoder-decoder model based on the variational auto-encoder framework.

Figure 3: An overview of our context-dependent retriever.

Figure 4: Comparison between S2A and S2A+MAML with different portions of supervised data.

Figure 6: Examples from the CONCODE dataset (first row) and the CSQA dataset (second row). The retrieved examples found by the context-aware retriever (center panels) and the context-independent retriever (right panels) follow the input (left panels).

Table 1: Performance of different approaches on the CONCODE dataset. The second group reports numbers of existing systems and our base model Seq2Action, none of which use retrieved examples. Models in the last group utilize retrieved examples.

Table 2: Performance of different approaches on the CSQA dataset.