Zero-Shot Semantic Parsing for Instructions

We consider a zero-shot semantic parsing task: parsing instructions into compositional logical forms, in domains that were not seen during training. We present a new dataset with 1,390 examples from 7 application domains (e.g. a calendar or a file manager), each example consisting of a triplet: (a) the application’s initial state, (b) an instruction, to be carried out in the context of that state, and (c) the state of the application after carrying out the instruction. We introduce a new training algorithm that aims to train a semantic parser on examples from a set of source domains, so that it can effectively parse instructions from an unknown target domain. We integrate our algorithm into the floating parser of Pasupat and Liang (2015), and further augment the parser with features and a logical form candidate filtering logic, to support zero-shot adaptation. Our experiments with various zero-shot adaptation setups demonstrate substantial performance gains over a non-adapted parser.


Introduction
The idea of interacting with machines via natural language instructions and queries has fascinated researchers for decades (Winograd, 1971). Recent years have seen an increasing number of applications that have a natural language interface, either in the form of chatbots or via "intelligent personal assistants" such as Alexa (Amazon), Google Assistant, Siri (Apple), and Cortana (Microsoft).
In the near future, we may find ourselves in a world where even more functionality can be accessed via a natural language user interface (NLUI). If so, we had better seek answers to the following questions: Will every development team need to hire NLP experts to develop an NLUI for their specific application? Can we hope for a general framework that, once trained on annotated data from a set of domains, does not require annotated data from a newly presented domain? Previous work on tasks related to NLUI for applications mostly relied on in-domain data (e.g. Long et al. (2016)), and papers that did not rely on in-domain data did not attempt to parse instructions into compositional logical forms (Kim et al., 2016).
Our code and data are available at: https://github.com/givoli/TechnionNLI.
To fill this gap, we address the task of zero-shot semantic parsing for instructions: training a parser so that it can parse instructions into compositional logical forms, where the instructions are from domains that were not seen during training. Formally, our task assumes a set $D = \{d_1, \ldots, d_n\}$ of source domains, each corresponding to a simple application (e.g. a calendar or a file manager) and an application program interface (API) consisting of a set of interface methods. Each interface method is augmented with a list of description phrases that are expected to be used by the users of the application to ask for the invocation of that method. These instructions are to be parsed into logical forms that denote a method call with specific arguments.
We collected a new dataset of 1,390 examples from 7 domains. Each example in the dataset is a triplet consisting of (a) the application's initial state, (b) an instruction, to be carried out in the context of that state, and (c) the state of the application after carrying out the instruction, also referred to as the desired state. The instructions were provided by MTurk workers, one for each pair of initial and desired states. Figure 1 shows examples from two of the domains in our dataset.
We present a new training algorithm for zero-shot semantic parsing, which involves learning the weights in two steps, such that in each step different source domains are used. Our training algorithm is motivated by the goal of optimizing the weights for unseen domains rather than for the source domains, and is integrated into the floating parser (Pasupat and Liang, 2015): a parser that was designed for question answering, but is easy to adjust to instruction execution (see § 4.1).
Figure 1: Two examples from our dataset. Annotators are presented with a visualization of initial and desired states, and are asked to write an instruction that will transfer the system from the former state to the latter.
To further assist the parser in dealing with the zero-shot setup, we extract additional features, mostly based on co-occurrence of primitive logical forms and the description phrases that are provided for the interface methods. We also use the application logic to dismiss candidate logical forms that represent a method call which does not modify the application state or results in an exception being thrown from the application logic.
Our training algorithm yields an averaged accuracy of 44.5%, compared to 39.1% of the parser when trained with its original AdaGrad training algorithm (Duchi et al., 2011), but with our features and filtering logic. Further exclusion of our features and filtering logic decreases accuracy to 28.3%. We demonstrate that, relative to the baseline, our training algorithm yields smaller weights to some features in a way that can be expected to benefit previously unseen domains.

Previous Work
Previous work on executable semantic parsing can be classified as either work on question answering (e.g. Clarke et al. (2010); Pasupat and Liang (2015); Krishnamurthy et al. (2017)) or instruction parsing (e.g. Long et al. (2016)). The result of executing a logical form is either an answer or a change in some state, respectively. To our knowledge, our work is the first to address the semantic parsing task of mapping natural language instructions into compositional logical forms in zero-shot settings.
While in our task each example contains a single-sentence instruction, there are works on semantic parsing for instruction sequences (MacMahon et al., 2006; Long et al., 2016), but not in a zero-shot setup. We leave zero-shot parsing of instruction sequences for future research.
A lot of work has been done on slot tagging and goal-oriented dialog (Kim et al., 2016; Gašić et al., 2017; Zhao and Eskenazi, 2018) which, similarly to our work, involves automatically enabling an NLUI for a given system. Unlike our task, the tasks investigated in those papers do not require mapping natural language to a meaning representation over a space of compositional logical forms. Other tasks related to ours include program synthesis (Raza et al., 2015; Desai et al., 2016) and mapping natural language to bash code (Lin et al., 2018), but these also did not consider zero-shot setups and did not synthesize code in the context of an application state.

Semantic Parsing with In-domain Data
Among the semantic parsing work that relied on in-domain data, many relied on a domain-specific lexicon (Kwiatkowski et al., 2010; Gerber and Ngomo, 2011; Krishnamurthy and Mitchell, 2012; Zettlemoyer and Collins, 2005; Cai and Yates, 2013) which maps natural language phrases to primitive logical forms. Many of these works automatically constructed a domain-specific lexicon using some additional domain-specific resources that are associated with the entities and relations of the given domain. Such a resource can be either a very large corpus (Gerber and Ngomo, 2011; Krishnamurthy and Mitchell, 2012), search results from the Web (Cai and Yates, 2013) or pairs of a sentence and an associated logical form (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010). In our task none of the above resources is assumed to be available; instead, we use the description phrases of the interface methods.

Cross-domain and Zero-shot Semantic Parsing
Previous semantic parsers use supervised training, either with (Zettlemoyer and Collins, 2005; Kwiatkowski et al., 2010) or without (Clarke et al., 2010; Berant et al., 2013) logical form annotation, in addition to unsupervised training (Goldwasser et al., 2011). We take the supervised approach with no logical form annotations.
While most semantic parsing work trains on in-domain data, there are some exceptions. Cai and Yates (2013) and Kwiatkowski et al. (2013) introduced semantic parsers for question answering that can parse utterances from Free917 (Cai and Yates, 2013) such that no Freebase entities or relations appear in both training and test examples. We also note the relevance of the dataset presented in Pasupat and Liang (2015) which contains questions about Wikipedia tables, such that the context of each question is a single table. They evaluate a parser on questions about tables that have not been observed during training. Their work does not fully constitute zero-shot semantic parsing due to table columns across the train/test split that share column headers (which correspond to primitive logical forms that represent relations). Our parser is based on the floating parser introduced in that paper, and the space of logical forms we use is very similar to theirs (see § 4.1).
Recently, Herzig and Berant (2017) and Su and Yan (2017) experimented with the Overnight dataset (Wang et al., 2015) in cross-domain settings. These papers did not experiment with zero-shot setups (i.e. training without any data from the target domain), and they both observed that the less in-domain training data was used, the more valuable training data from other domains became. More recently, Herzig and Berant (2018) explored zero-shot semantic parsing with the Overnight dataset. Their framework, unlike ours, requires logical form annotation, and is designed for question answering rather than instruction parsing.
Another zero-shot semantic parsing task was introduced in Yu et al. (2018a). The task requires mapping natural language questions to SQL queries, and includes a setting in which no databases appear in both the training and test sets (as attempted in Yu et al. (2018b)).

Task and Data
We now describe our task and dataset.

Task
Our task involves parsing a natural language instruction, in the context of a small application, into a method call that corresponds to the application's API. For example, the LIGHTING domain corresponds to a lighting control system application that allows the user to turn the lights on and off in each room in their house.
Formally, a domain has a set of interface methods (e.g. turnLightOn and turnLightOff) that can be invoked with some arguments. Each argument is a set of entities (e.g. a set of Room entities). There are two kinds of entity types: domain-specific (e.g. Room) and non domain-specific (Integer, String). Each interface method is augmented with 1-3 description phrases that correspond to its functionality. For example, the interface method removeEvents from the CALENDAR domain has the description phrases remove and cancel.
A state defines a knowledge base, consisting of a set of $(e_1, r, e_2)$ triplets, where $e_1$ and $e_2$ are entities and $r$ is a relation. For example, a knowledge base in the LIGHTING domain might contain the triplet (room3, floor, 2) which indicates that room3 is on the second floor of the house. Figure 1 (b) shows two possible states in the LIGHTING domain. In the first one, there is a bedroom on the second floor with the lights turned on. If that room is represented in the state $s$ by the entity room1, the following triplets will be in the knowledge base of $s$: (room1, name, bedroom), (room1, floor, 2) and (room1, lightMode, ON).
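As a minimal sketch of this representation, the bedroom example can be written as a set of triplets in Python. The `Triple` type and the `subjects` helper are illustrative names introduced here, not part of the released code; only the entity and relation names come from the text.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    """One (e1, r, e2) fact in the knowledge base of a state."""
    e1: str
    relation: str
    e2: object


# The three triplets listed above for the bedroom on the second floor.
state: Set[Triple] = {
    Triple("room1", "name", "bedroom"),
    Triple("room1", "floor", 2),
    Triple("room1", "lightMode", "ON"),
}


def subjects(kb, relation, value):
    """Return every entity e1 such that (e1, relation, value) is in kb."""
    return {t.e1 for t in kb if t.relation == relation and t.e2 == value}


rooms_on_floor_2 = subjects(state, "floor", 2)
```

A lookup such as `subjects(state, "floor", 2)` plays the role of a primitive query from which argument sets (e.g. a set of Room entities) can be built.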
Our task is limited to mapping an utterance into a single method call. A method call formally consists of an [interface method, argument list] pair. The invocation of the method call changes the application state according to the deterministic application logic.
Our dataset consists of examples from 7 application domains. Each example is a triplet $(s, x, s')$, where $s$ is an initial application state, $x$ is a natural language instruction and $s'$ is a desired application state, resulting from carrying out the instruction $x$ on the state $s$. The task is to train a parser with examples from a given subset of domains (the source domains), so that it can effectively parse instructions from a different domain (the target domain), which is unseen at training.

Data
Our task requires a dataset that consists of examples from multiple domains, such that each example corresponds to an instruction in the context of an application state. Following Long et al. (2016), we constructed the dataset by presenting human annotators with visualizations of initial and desired state pairs. The annotators were then asked to write an English instruction that can be executed in order to transfer the application from the initial state to the desired state (see figure 1).
Given a domain and an interface method, we randomly generate a state pair with the following steps:
1. Randomly generating an initial state. For example, in the LIGHTING domain (see figure 1 (b)), we randomly select the number of floors, the number of rooms on each floor, and for each room we randomly select a name (e.g. bedroom) from a list of possible names, and a light mode (either ON or OFF).
2. Randomly selecting arguments for the interface method. For example, in the LIGHTING domain we randomly select a set of rooms as an argument for the interface method (turnLightOn or turnLightOff).
3. Invoking the interface method with the selected arguments on the initial state. If the result is a state that is identical to the initial state, or if an error occurred during execution, we go back to step 2. After 1,000 failed attempts we deem the random initial state problematic and go back to step 1.
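The three steps above can be sketched as a rejection-sampling loop. This is a simplified illustration under stated assumptions: the state is reduced to a mapping from room to light mode, and `random_initial_state`, `turn_light_on` and `generate_state_pair` are hypothetical stand-ins for the actual domain definitions. Only the retry budget of 1,000 attempts is taken from the text.

```python
import random

MAX_ATTEMPTS = 1000  # retry budget stated in the text


def random_initial_state(rng):
    """Hypothetical simplified LIGHTING state: room id -> light mode."""
    n_rooms = rng.randint(1, 4)
    return {f"room{i}": rng.choice(["ON", "OFF"]) for i in range(1, n_rooms + 1)}


def turn_light_on(state, rooms):
    """Hypothetical application logic for turnLightOn."""
    new_state = dict(state)
    for room in rooms:
        new_state[room] = "ON"
    return new_state


def generate_state_pair(method, rng):
    while True:
        initial = random_initial_state(rng)               # step 1
        for _ in range(MAX_ATTEMPTS):
            k = rng.randint(1, len(initial))
            rooms = rng.sample(sorted(initial), k)         # step 2
            try:
                desired = method(initial, rooms)           # step 3
            except Exception:
                continue                                   # error: retry step 2
            if desired != initial:
                return initial, desired
        # Every attempt failed: discard this initial state and go back to step 1.


initial, desired = generate_state_pair(turn_light_on, random.Random(0))
```

The outer loop matters for states such as "all lights already on", where no argument choice for turnLightOn can change the state.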
With this process, we collected 1,390 examples from 7 domains (Table 1). We used the MTurk platform and recruited annotators located in the US with at least 1,000 approved tasks and a task approval rate of at least 95%. Our dataset contains utterances written by 53 unique annotators. Throughout the dataset construction we blocked 16 annotators that generated utterances that did not correspond to our instructions (mostly, referring to irrelevant details of the provided figure which are not part of the domain represented by the visualized states). Annotators were paid 15-23 cents per task (i.e. per utterance they write given a state pair).
Each initial and desired state pair was given to a single annotator. We filtered out examples with utterances that consist of more than one sentence (we instructed annotators to write only one sentence). The average instruction length in the training set is 8.1 words.
We consider 7 domains: (a) CALENDAR: removing calendar events and setting their color; (b) CONTAINER: loading, unloading and removing shipping containers; (c) FILE: removing files and moving them from one directory to another; (d) LIGHTING: turning lights on and off in rooms inside a house; (e) LIST: removing elements and moving an element to the beginning/end of a list; (f) MESSENGER: creating/deleting chat groups and muting/unmuting them; and (g) WORKFORCE: assigning employees to a new manager, firing employees, assigning an employee to a new position and updating an employee's salary.
Our choice of domains aims to include a variety of linguistic phenomena. These include superlatives (e.g. remove the longest container in CONTAINER, figure 1 (a)), spatial language (e.g. turn off the light in the bedroom on floor 2 in LIGHTING, figure 1 (b)), and temporal language (e.g. delete my last two appointments on Thursday, from CALENDAR). Also, the domains are chosen to be rich enough to allow utterances with highly compositional logical forms.

Zero-Shot Semantic Parsing For Instructions
We modify the floating parser (henceforth denoted with FParser), to address zero-shot learning in three ways: (a) presenting a new training algorithm; (b) filtering logical form candidates based on the application logic; and (c) adding new features. We begin with a brief description of the FParser and then go on to describe our approach.

The Floating Parser
The FParser was designed to handle unseen predicates, in the context of answering questions about Wikipedia tables that did not appear during training. It is hence a natural starting point for our zero-shot setup. We now describe the FParser and its inference algorithm (with the model modifications necessary to support instruction parsing).
For each inference, the input of the parser is an initial application state $s$, a set of interface methods and their description phrases, and a natural language instruction $x$. The state $s$ defines a knowledge base $K_s$ of (entity, relation, entity) triples. The parser generates a set of logical form candidates $Z_x$ that can be executed over the knowledge base $K_s$ to produce a method call $c$, formulated as an (interface method, argument list) pair. The method call $c$ can be invoked in the context of $s$ with the provided application logic, producing the denotation $y = \llbracket z \rrbracket_s$, the resulting state. For each logical form $z \in Z_x$ the parser extracts a feature vector $\phi(x, s, z)$. The probability assigned to a logical form candidate $z \in Z_x$ is defined by a log-linear model:
$$p_\theta(z \mid x, s) = \frac{\exp\left(\theta^\top \phi(x, s, z)\right)}{\sum_{z' \in Z_x} \exp\left(\theta^\top \phi(x, s, z')\right)}$$
where $\theta$ is the weight vector. The logical form with maximal probability is chosen as the predicted logical form, and its denotation is the predicted desired state.
Our logical form space is based on λ-DCS (Liang, 2013), as in the original FParser, but we use an additional derivation rule that derives the logical form $f(z_1, \ldots, z_n)$, denoting a method call, given the primitive logical form $f$ (denoting an interface method) and the logical forms $z_1, \ldots, z_n$ (each denoting a set of entities that correspond to an argument of $f$).
The objective function is the L1 regularized log-likelihood of the correct denotations across the training data:
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(y_i \mid x_i, s_i) - \lambda \lVert \theta \rVert_1 \qquad (1)$$
where $p_\theta(y \mid x, s)$ is the sum of the probabilities assigned to all the candidate logical forms with the denotation $y$, $N$ is the number of training examples, and $\lambda$ is the L1 regularization coefficient.
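To make the scoring concrete, the following sketch computes the log-linear probabilities of a few candidates and then sums them per denotation. It assumes sparse feature dictionaries; the feature names, weights and denotation labels are invented for illustration, and the functions are not the FParser implementation.

```python
import math
from collections import defaultdict


def candidate_probs(theta, features):
    """Softmax over candidates; features[i] is a sparse dict for candidate i."""
    scores = [sum(theta.get(f, 0.0) * v for f, v in phi.items()) for phi in features]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def denotation_probs(probs, denotations):
    """Sum candidate probabilities over logical forms sharing a denotation."""
    out = defaultdict(float)
    for p, y in zip(probs, denotations):
        out[y] += p
    return dict(out)


# Toy weights and three candidates; candidates 0 and 2 yield the same state.
theta = {"a": 1.0, "b": -1.0}
features = [{"a": 1.0}, {"b": 1.0}, {"a": 1.0, "b": 1.0}]
denotations = ["s1", "s2", "s1"]

probs = candidate_probs(theta, features)
by_state = denotation_probs(probs, denotations)
```

Training on denotations rather than logical forms means that only `by_state` enters the log-likelihood, so several distinct logical forms can share credit for the same desired state.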

Zero-Shot Parsing
We now present our modified FParser. We start with our training algorithm, and proceed to the application logic filtering and our new features.

Training: Gap Minimization via Domain Partitioning (GMDP)
We start with some notation and definitions. Let us denote the set of training domains with $D = \{d_1, \ldots, d_n\}$. Let $D = D_1 \cup D_2$ be some partition of the set $D$. The target domain is denoted with $d_{n+1}$.
Let us now describe the training algorithm, formulated in figure 2. The GMDP algorithm consists of two steps. In the first step, an initial estimate of the model parameters, denoted with $\theta_{D_1}$, is learned on the training examples from the source domain subset $D_1$. $\theta_{D_1}$ is then used as an initialization for the second step, in which the parser is re-trained, this time on the training examples of the domains in $D_2$.
In each of the two steps we update the weights with AdaGrad (Duchi et al., 2011), the training algorithm of the original FParser, using the objective function in equation 1. Since this objective is non-convex and hence sensitive to its starting point (parameter initialization), the weights learned at the first step ($\theta_{D_1}$) have an impact on the final parameters of the parser ($\theta_{D_1,D_2}$).
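The two-step procedure can be sketched as two AdaGrad runs in which the second is initialized with the output of the first. This is a toy illustration, not the FParser trainer: the gradient function is a stand-in, L1 regularization is omitted, and resetting the AdaGrad accumulators between the two steps is an assumption made here rather than something the text specifies.

```python
import math


def adagrad_train(theta_init, examples, grad_fn, step_size=0.1, iterations=5):
    """Plain AdaGrad (gradient ascent) starting from theta_init."""
    theta = dict(theta_init)
    sum_sq = {}  # per-feature sum of squared gradients
    for _ in range(iterations):
        for example in examples:
            for feature, g in grad_fn(theta, example).items():
                if g == 0.0:
                    continue
                sum_sq[feature] = sum_sq.get(feature, 0.0) + g * g
                update = step_size * g / math.sqrt(sum_sq[feature])
                theta[feature] = theta.get(feature, 0.0) + update
    return theta


def gmdp_train(data_by_domain, d1, d2, grad_fn):
    """Step 1 trains on the domains in d1; step 2 continues on those in d2."""
    step1 = [ex for d in d1 for ex in data_by_domain[d]]
    step2 = [ex for d in d2 for ex in data_by_domain[d]]
    theta_d1 = adagrad_train({}, step1, grad_fn)
    theta_d1_d2 = adagrad_train(theta_d1, step2, grad_fn)
    return theta_d1, theta_d1_d2


def toy_grad(theta, example):
    """Stand-in gradient: pulls the weight of one feature toward a target."""
    feature, target = example
    return {feature: target - theta.get(feature, 0.0)}


data = {"A": [("x", 1.0)], "B": [("y", -1.0)], "C": [("x", -1.0)]}
theta_d1, theta_final = gmdp_train(data, ["A", "B"], ["C"], toy_grad)
```

In the toy run, the weight of `x` learned in the first step is pulled back in the second step because domain C contradicts domain A, while the weight of `y`, which the second-step domain never touches, is left unchanged.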
The motivation of GMDP training is simple. A good zero-shot parser should perform well on examples from domains that have not been available to it during training. To address this challenge, this two-step method first estimates its parameters with respect to one set of domains ($D_1$) and then adjusts those parameters to fit a second set of domains ($D_2$) that have not been available at the first training step. We refer to this adjustment process as gap minimization. The parser parameters learned by GMDP strongly depend on the domains included in $D_1$ and $D_2$, and on the extent to which the adaptation from $D_1$ to $D_2$ mimics the adaptation from $D = D_1 \cup D_2$ to the target domain $d_{n+1}$. In this paper we treat the division of domains into the $D_1$ and $D_2$ subsets as a hyper-parameter and tune it together with the other hyper-parameters of the parser. Because this tuning process determines the important division of the training domains into $D_1$ and $D_2$, we detail it here as part of the description of the algorithm.
For every target domain $d_{n+1}$ we iterate over the training domains $D = \{d_1, \ldots, d_n\}$ in a leave-one-out manner, each time holding out one of the domains $d_i \in D$, training on the other domains ($D_{\neg i}$) with various hyper-parameter configurations and testing on the training data of $d_i$. We consider as a hyper-parameter the order of the domains in $D$, assigning the first $M$ domains to $D_1$ and the rest to $D_2$ ($d_i$ is excluded from the ordered list), where $M$ is another hyper-parameter. The hyper-parameter configuration that works best in those $n$ iterations (achieves the highest average accuracy on the held-out domains) is the one selected for the training of the final parser for $d_{n+1}$. When a parser for $d_{n+1}$ is then trained, we increase by one the size of either $D_1$ or $D_2$, whichever is larger, because this way the ratio between the sizes of $D_1$ and $D_2$ is kept as similar as possible to the ratio during the hyper-parameter tuning.
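The selection of the domain partition can be sketched as follows. Several things here are assumptions for illustration: the exhaustive enumeration of all orderings (the text only says that configurations are tuned by search), the `evaluate` callback, which stands for "train with GMDP on the given split and return accuracy on the held-out domain", and the tie-breaking in `final_split` when both subsets have the same size during tuning.

```python
from itertools import permutations


def select_partition(source_domains, evaluate):
    """Pick the (order, M) pair with the best average held-out accuracy.

    evaluate(d1, d2, held_out) is a hypothetical callback returning accuracy.
    """
    n = len(source_domains)
    best, best_score = None, -1.0
    for order in permutations(source_domains):
        for m in range(1, n - 1):  # keep both D1 and D2 non-empty
            total = 0.0
            for held_out in source_domains:
                rest = [d for d in order if d != held_out]
                total += evaluate(rest[:m], rest[m:], held_out)
            average = total / n
            if average > best_score:
                best, best_score = (order, m), average
    return best, best_score


def final_split(order, m):
    """Split all n source domains, growing the subset that was larger in tuning."""
    n = len(order)
    if m > (n - 1) - m:  # D1 was larger; ties grow D2 (an assumption)
        m += 1
    return list(order[:m]), list(order[m:])


def toy_evaluate(d1, d2, held_out):
    """Toy scorer that rewards alphabetically ordered splits."""
    return 1.0 if d1[0] == min(d1 + d2) else 0.0


best, score = select_partition(["A", "B", "C"], toy_evaluate)
d1, d2 = final_split(*best)
```

With six source domains, exhaustive enumeration of orderings would require thousands of training runs, so a practical implementation would presumably sample or restrict the candidate orderings.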

Logical Form Filtering
The FParser is a bottom-up beam-search parser, in which a dynamic programming table is filled with derivations. Each cell in the table corresponds to a derivation size and a logical form category, where the size is defined as the number of rules applied when generating the logical form.
We add an additional stage to the inference step of the parser. In this stage we filter logical form candidates based on the application logic, which is part of the domain definition. This filtering stage dismisses incorrect candidate logical forms when they represent a method call c that either does not modify the application state or results in an exception being thrown. To do that, c is invoked on the initial state s and if the result is a state identical to s, or if an exception has been thrown by the application logic, we dismiss the candidate logical form. This added stage is especially important for zero-shot settings, in which the application logic of the target domain does not have any impact on the learned weights.
As an example of application-logic-based filtering, consider the LIGHTING domain in which the lights in some rooms can be turned on and off. A method call that turns off the lights in rooms where the lights are already off does not change the application state, and in such cases the corresponding logical form will be dismissed. In the WORKFORCE domain, attempting to assign employees to report to an employee who is not a manager results in an exception being thrown.
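A minimal sketch of this filtering stage is given below. The `invoke` function is a hypothetical, simplified stand-in for the LIGHTING application logic (the state is reduced to a mapping from room to light mode), and a method call is represented as a (method name, rooms) pair.

```python
def invoke(state, call):
    """Hypothetical LIGHTING logic: apply (method name, rooms) to a state."""
    method, rooms = call
    new_state = dict(state)
    for room in rooms:
        if room not in new_state:
            raise KeyError(room)  # the application rejects unknown rooms
        new_state[room] = "ON" if method == "turnLightOn" else "OFF"
    return new_state


def filter_candidates(candidates, state, invoke_fn):
    """Keep only method calls that execute without error and change the state."""
    kept = []
    for call in candidates:
        try:
            result = invoke_fn(state, call)
        except Exception:
            continue  # dismissed: the application logic threw an exception
        if result != state:
            kept.append(call)  # dismissed otherwise: the state is unchanged
    return kept


state = {"room1": "ON", "room2": "OFF"}
candidates = [
    ("turnLightOff", ("room2",)),  # no effect: light is already off
    ("turnLightOff", ("room1",)),  # changes the state
    ("turnLightOn", ("room9",)),   # unknown room: raises an exception
]
kept = filter_candidates(candidates, state, invoke)
```

The filter never consults the instruction, so it needs no training data from the target domain.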

Features
Given a state, an instruction and a logical form, we extract the relevant features of the FParser (phrase-predicate co-occurrence features and missing-predicate features) and add our own features. We extract features based on the description phrases: the phrases that are provided for each interface method as part of the domain definition. The description phrases are used to extract additional phrase-predicate co-occurrence features and missing-predicate features. For example, consider the utterance Delete the largest file from the FILE domain, with the logical form:

R[sizeInBytes]))
The interface method removeFiles has the description phrase delete, which matches the phrase Delete in the utterance, resulting in the extraction of the corresponding co-occurrence features. Conversely, when the parser considers logical forms that contain the method moveFiles instead of removeFiles, it will extract features indicating that a match between the unigram Delete and a primitive logical form is possible but does not occur in the candidate logical form. We note the resemblance of this technique to the way Tafjord et al. (2018) handled properties that were unseen during training, using a list of provided words that are associated with the property.
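The Delete example can be sketched as follows. This is a deliberately simplified illustration: the feature naming scheme and the `description_features` function are invented here, the description phrase move for moveFiles is an assumption, and the actual FParser features are richer than a single indicator per match.

```python
def description_features(utterance, candidate_methods, description_phrases):
    """Indicator features from matches between the utterance and description phrases."""
    text = " " + " ".join(utterance.lower().split()) + " "
    features = {}
    for method, phrases in description_phrases.items():
        for phrase in phrases:
            if f" {phrase.lower()} " not in text:
                continue  # the phrase does not occur in the utterance
            if method in candidate_methods:
                # The phrase and its interface method co-occur.
                features[f"cooccur:{phrase}:{method}"] = 1.0
            else:
                # A match was possible but the method is missing from the candidate.
                features[f"missing:{phrase}"] = 1.0
    return features


phrases = {"removeFiles": ["delete"], "moveFiles": ["move"]}
utterance = "Delete the largest file"

with_remove = description_features(utterance, {"removeFiles"}, phrases)
with_move = description_features(utterance, {"moveFiles"}, phrases)
```

Because the features are keyed on the match itself rather than on training vocabulary, they can fire for interface methods of a domain that was never seen during training.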
We also extract features that correspond to the size of a candidate logical form (the number of derivation rules applied). The extracted features indicate that the logical form size is larger than n, for any n ≥ 2. These features capture a domain-independent preference for simplicity.
Notice that in some real-world settings our application logic filtering, which requires invoking interface methods hundreds of times per inference, might be impractical (e.g. if executing the interface methods is computationally intensive). This motivates us to consider the results of the ablated model that does not use the application logic.
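The size features can be sketched as a set of threshold indicators. The feature names and the upper bound of 15 (chosen here to match the limit on rule applications per derivation used in our experiments) are assumptions for illustration.

```python
def size_features(size, max_n=15):
    """One indicator per threshold n >= 2 that the logical form size exceeds."""
    return {f"size>{n}": 1.0 for n in range(2, max_n + 1) if size > n}


features_for_size_4 = size_features(4)
```

A logical form of size 4 activates the indicators for n = 2 and n = 3, so negative weights on these features accumulate as derivations grow.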

Experimental Setup
Experiments In each experiment we train the parsers on examples from 6 application domains and test them on the remaining domain. Our evaluation metric is accuracy: the fraction of the test examples where a correct denotation (desired state) is predicted. For examples where multiple logical forms achieve maximum score, we consider the fraction that yields the desired state.
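The metric, including the fractional credit given when several logical forms tie for the maximum score, can be sketched as below. The function names and the representation of an example as (scores, denotations, gold denotation) are illustrative choices, not taken from the released code.

```python
def example_credit(scores, denotations, gold):
    """Fraction of top-scoring candidates whose denotation is the desired state."""
    best = max(scores)
    top = [y for score, y in zip(scores, denotations) if score == best]
    return sum(1 for y in top if y == gold) / len(top)


def accuracy(examples):
    """Average credit over (scores, denotations, gold) examples."""
    return sum(example_credit(*example) for example in examples) / len(examples)


examples = [
    ([1.0, 1.0, 0.5], ["a", "b", "a"], "a"),  # two-way tie, one correct: 0.5
    ([2.0, 1.0], ["a", "b"], "a"),            # unique correct prediction: 1.0
]
result = accuracy(examples)
```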
While in our main results we report the accuracy of the parsers on the target domain's test set, for the error and qualitative analyses we report the accuracy on the target domain's training set. We do that in order to avoid multiple runs on the test sets; we do not use the target domain's training set for other purposes (e.g. hyper-parameter tuning). The average number of training and test examples per domain is 101 and 97.6, respectively.
Hyper-parameter tuning We use a grid search and leave-one-out cross-validation over the source domains to tune the hyper-parameters. We tune the following hyper-parameters: the L1 regularization coefficient, the initial step-size, the number of training iterations (for the GMDP algorithm: the number of training iterations in the second step), and for the GMDP algorithm also: the number and identity of training domains used in the first (D 1 ) and second (D 2 ) steps, and the number of training iterations during the first step. We use a beam size of 200 and limit the number of rule applications per derivation to 15. We provide more details about the values of the hyper-parameters we consider in the appendix.

Results
Our results are summarized in Table 2. GMDP outperforms ADAGRAD−FA, the original FParser, in all the domains, and by 16.2% on average accuracy.
We next analyze the importance of each of our zero-shot components: the training algorithm, new features and application logic filtering.
Training Algorithms GMDP (our full model) yields an average accuracy of 44.5%, outperforming ADAGRAD, which is identical to our full model except that training is performed with AdaGrad, by 5.4%. In four domains GMDP outperforms ADAGRAD by more than 4%. The gap is most notable in the LIST and LIGHTING domains where GMDP outperforms ADAGRAD by 14.3% and 12.4%, respectively, but the improvements on MESSENGER and WORKFORCE are also substantial (8.4% and 4.1%, respectively). In the other three domains, GMDP and ADAGRAD perform identically (CONTAINER) or demonstrate differences of up to 2% (CALENDAR and FILE). Interestingly, for the CALENDAR domain, performing GMDP training without the features and filtering (GMDP−FA) yields the best accuracy.
As shown in Table 2, the accuracy of ADAGRAD with in-domain training is 15.2% higher than that of GMDP with zero-shot training (59.7% vs. 44.5%), despite the smaller number of training examples. A comparison between ADAGRAD and ADAGRAD−FA reveals that in the in-domain setup, our new features and filtering logic yield only a modest performance gain of 3.2% on average (59.7% vs. 56.5%). This is another indication of the relevance of our zero-shot components to zero-shot adaptation.
The word longest did not appear in any example in the source domains, and thus none of the relevant lexicalized features associated with argmax were useful. In future work, we hence plan to extend our parser to take word similarity into account.

Features and Application Logic
Moreover, we found that 12.8% of the error is due to incorrect parsing of utterances that reference an entity by its index. An example of such an error is the mapping of the utterance unload the container in terminal four to the logical form unloadContainers(R[length].4) instead of unloadContainers(R[index].4).
Qualitative analysis. We observe that GMDP yields smaller weights for features that can be expected to correlate with incorrect logical forms due to the zero-shot setup. For example, in the CONTAINER domain annotators often referred to entities by their index (e.g. Remove the container in the third terminal), while in the LIST domain annotators mostly refer to entities (integers) by their numeric value. When LIST is the target domain, we observe that in GMDP the lexicalized phrase-predicate features that indicate co-occurrence between the logical form R[index] and phrases that do not indicate an index-based reference receive smaller weights than in ADAGRAD. For example, we find that the feature that indicates a co-occurrence between R[index] and the phrase in corresponds to the largest decrease in weight percentile rank: 88.7 points. At the same time, the feature that indicates a co-occurrence with the phrase the first (which should correlate with R[index] being in the logical form) corresponds to the largest increase in weight percentile rank: 86.1 points.
As a result of this change in feature weighting, for a LIST utterance such as Remove the number 2 from the list, GMDP tends to yield correct logical forms (e.g. remove(R[value.2])), unlike ADAGRAD, which tends to query entities by their index instead of by their value (e.g. remove(R[index.2])).

Conclusion
We presented a novel task of zero-shot semantic parsing for instructions, and introduced a new dataset. We proposed a new training algorithm as well as features and filtering logic that should enhance zero-shot learning, and integrated them into the FParser (Pasupat and Liang, 2015). Our new parser substantially outperforms the original parser and we further show that each of our zero-shot components is vital for this improvement.
We hope this work will inspire readers to use our framework for collecting a larger dataset and experimenting with more approaches. Our framework is designed to allow the definition of new domains and collecting examples with minimal effort. Promising future directions include experimenting with our zero-shot adaptation methods in the context of neural semantic parsing (after increasing the number of examples per domain) and extending the dataset to include more complicated applications and multi-utterance instructions.