Programming in Natural Language with fuSE: Synthesizing Methods from Spoken Utterances Using Deep Natural Language Understanding

The key to effortless end-user programming is natural language. We examine how to teach intelligent systems new functions, expressed in natural language. As a first step, we collected 3168 samples of teaching efforts in plain English. Then we built fuSE, a novel system that translates English function descriptions into code. Our approach is three-tiered and each task is evaluated separately. We first classify whether an intent to teach new functionality is present in the utterance (accuracy: 97.7% using BERT). Then we analyze the linguistic structure and construct a semantic model (accuracy: 97.6% using a BiLSTM). Finally, we synthesize the signature of the method, map the intermediate steps (instructions in the method body) to API calls and inject control structures (F1: 67.0% with information retrieval and knowledge-based methods). In an end-to-end evaluation on an unseen dataset fuSE synthesized 84.6% of the method signatures and 79.2% of the API calls correctly.


Introduction
Intelligent systems have become rather smart lately. One easily arranges appointments by talking to a virtual assistant or controls a smart home through a conversational interface. Instructing a humanoid robot in this way no longer seems futuristic. For the time being, users can only access built-in functionality. However, they will soon expect to add new functionality themselves. For humans, the most natural way to communicate is natural language. Thus, future intelligent systems must be programmable in everyday language.
Today's systems that claim to offer programming in natural language enable laypersons to issue single commands or construct short scripts (e.g. Mihalcea et al. (2006); Rabinovich et al. (2017)); usually no new functionality is learned. Only a few approaches address learning new functionality from natural language instructions (e.g. Le et al. (2013); Markievicz et al. (2017)). However, even recent approaches either restrict the language or are (over-)fitted to a certain domain or application.
We propose to apply deep natural language understanding to the task of synthesizing methods from spoken utterances. Our approach combines modern machine learning techniques with information retrieval and knowledge-based methods to grasp the user's intent. As a first step, we performed a user study to investigate how laypersons teach new functionality with nothing but natural language. In a second step, we developed fuSE (Function Synthesis Executor). fuSE translates teaching efforts into code. On the basis of the gathered data we constructed a three-tiered approach. We first determine whether an utterance comprises an explicitly stated intent to teach a new skill. Then, we decompose these teaching efforts into distinct semantic parts. We synthesize methods by transferring these semantic parts into a model that represents the structure of method definitions. Finally, we construct signatures, map instructions of the body to API calls, and inject control structures.

Related Work
The objective of programming in natural language was approached from different perspectives over the years. Quite a few approaches are natural language interfaces to code editors (Price et al., 2000; Begel, 2004; Begel and Graham, 2005; Désilets et al., 2006). However, they assume that users literally dictate source code. Thus, these approaches are intended for developers rather than laypersons. Other approaches such as Voxelurn by Wang et al. (2017) aim to naturalize programming languages to lower the hurdle for programming novices.
Figure 1: Schematic overview of fuSE's three-tiered approach.
Approaches for end-user programming in natural language take up the challenge of bridging the semantic gap between informal spoken or written descriptions in everyday language and formal programming languages. Early systems were syntax-based (Winograd, 1972; Ballard and Biermann, 1979; Biermann and Ballard, 1980; Biermann et al., 1983; Liu and Lieberman, 2005). Some were already capable of synthesizing short scripts including control structures and comments, e.g. NLP for NLP by Mihalcea et al. (2006). Others take the user in the loop and create scripts with a dialog-driven approach (Le et al., 2013). In further developments intelligent assistants offer their services to assist with programming (Azaria et al., 2016). Often these assistants support multi-modal input, e.g. voice and gestures (Campagna et al., 2017, 2019). Others combine programming in natural language with other forms of end-user programming, such as programming by example (Manshadi et al., 2013) or programming by demonstration (Li et al., 2018). Some authors, such as Landhäußer et al. (2017) and Atzeni and Atzori (2018a,b), take a knowledge-based approach by integrating domain and environmental information in the form of ontologies.
Suhr and Artzi (2018) employ a neural network to learn a situational context model that integrates the system environment and the human-system interaction, i.e. the dialog. Many recent approaches integrate semantic parsing in the transformation process (Guu et al., 2017; Rabinovich et al., 2017; Chen et al., 2018; Dong and Lapata, 2018). Even though the natural language understanding capabilities are often impressive, the synthesized scripts are still (semantically) erroneous in most cases. Additionally, approaches of this category have so far not covered learning new functionality.
Programming in natural language is of particular interest in the domain of humanoid robotics (Lauria et al., 2001, 2002; She et al., 2014; Mei et al., 2016). People expect to teach robots as they teach human co-workers. Therefore, some authors, e.g. Markievicz et al. (2017), use task descriptions that were intended to instruct humans to benchmark their approach. However, the assumed vocabulary is often rather technical (Lincoln and Veres, 2012). Thus, the usability for laypersons is limited.

Approach
The goal of our work is to provide a system for programming in (spoken) natural language. Laypersons shall be enabled to create new functionality in terms of method definitions by using natural language only. We offer a general approach, i.e. we do not restrict the natural language regarding wording and length. Since spontaneous language often comprises grammatical flaws, disfluencies, and the like, our work must be resilient to these issues.
We decompose the task into three consecutive steps. The rationale behind this decision is as follows. On the one hand, we can implement more focused (and precise) approaches for each task, e.g. using machine learning for one and information retrieval for another. On the other hand, we are able to evaluate and optimize each approach individually. The stages of our three-tiered approach are the following (see Figure 1 for an example): 1. Classification of teaching efforts: Determine whether an utterance comprises an explicitly stated teaching intent or not.
2. Classification of the semantic structure: Analyze (and label) the semantic parts of a teaching sequence. Teaching sequences are composed of a declarative and a specifying part as well as superfluous information.
3. Method synthesis: Build a model that represents the structure of methods from syntactic information and classification results. Then, map the actions of the specifying part to API calls and inject control structures to form the body; synthesize the method signature.
The first two stages are classification problems. Thus, we apply various machine learning techniques. The first stage is a sequence-to-single-label task, while the second is a typical sequence-to-sequence task. For the first we compare classical machine learning techniques, such as logistic regression and support vector machines, with neural network approaches including the pre-trained language model BERT (Devlin et al., 2019). For the second task we narrow down to neural networks and BERT. A more detailed description of the first two stages may be found in (Weigelt et al., 2020). The implementation of the third stage is a combination of syntactic analysis, knowledge-based techniques, and information retrieval. We use semantic role labeling, coreference analysis, and a context model (Weigelt et al., 2017) to infer the semantic model. Afterwards, we synthesize method signatures heuristically and map instructions from the body to API calls using ontology search methods and datatype analysis. Additionally, we inject control structures, which we infer from keywords and syntactic structures. To cope with spontaneous (spoken) language, our approach relies on shallow NLP techniques only.

Dataset
We carried out a study to examine how laypersons teach new functionality to intelligent systems. The study consists of four scenarios in which a humanoid robot should be taught a new skill: greeting someone, preparing coffee, serving drinks, and setting a table for two. All scenarios take place in a kitchen setting but involve different objects and actions. Subjects were supposed to teach the robot using nothing but natural language descriptions. We told the subjects that a description ideally comprises a declaration of intent to teach a new skill, a name for the skill, and an explanation of intermediate steps. However, we did not force the subjects into predefined wording or sentence structure. Instead, we encouraged them to vary the wording and to 'speak' freely. We also instructed them to imagine that they were standing next to the robot. After the short introduction, we successively presented the scenarios to the subjects. Finally, we requested some personal information in a short questionnaire.
We used the online micro-tasking platform Prolific. In less than three days, 870 participants completed the study. The share of male and female participants is almost equal (50.5% vs. 49.5%); more than 60% are native English speakers. Most of them (70%) had no programming experience at all. An analysis of the dataset revealed that there is barely any difference in the language used by subjects who are inexperienced in programming compared to more experienced subjects (except for a few subjects who used rather technical language). The age of the participants ranges from 18 to 76, with more than half being 30 or younger.
The collected data comprises 3,168 descriptions with more than 109,000 words altogether (1,469 unique words); the dataset statistics are depicted in Table 1. We provide a set of six descriptions from the dataset in Table 13 (Appendix A). A thorough analysis of the dataset revealed that a notable share (37%) lacks an explicitly stated intent to teach a skill, even though we count even phrases such as "to prepare lunch" as teaching intent. Regarding the semantic structure, we observed that the distinct parts can be clearly separated in almost all cases. However, the respective parts occur in varying order and are frequently non-continuous.
The data was jointly labeled by two of the authors. We first attached the binary labels teaching and non-teaching. These labels correspond to the classification task of the first stage. Then we added the ternary labels (declaration, specification, and miscellaneous) to all words in descriptions that were classified as teaching efforts in the first step. This label set is used for the second stage. The distribution of the labels is depicted in Table 2.
Both label sets are unequally distributed, which may cause the machine learning models to overfit in favor of the dominating label. This mainly affects the ternary classification task. One may argue that speech recordings would have been more natural than written descriptions. However, from previous studies we learned that subjects more willingly write texts than speak. Besides, the audio quality of recordings is often poor when subjects use ordinary microphones.

First Stage: Teaching Intents
The first step of fuSE is discovering teaching intents in utterances. An utterance can either be an effort to teach new functionality or merely a description of a sequence of actions. This problem is a typical sequence-to-single-label task, where the words of the utterance are the sequential input and the output is either teaching or non-teaching.
To train, validate, and test classifiers we split up the dataset in two ways. The first is the common approach to randomly split the set in an 80-to-20 ratio, where 80% of the data is used for training and 20% for testing. As usual, we again split the training set 80-to-20 into training and validation data. However, we felt that this approach does not reflect realistic set-ups, where a model is learned from historical data and then applied to new unseen data that is semantically related but (potentially) different. Therefore, we introduced an additional so-called scenario-based split, in which we separate the data according to the scenarios. We use three of the four scenarios for training and the remaining one for testing. Note that we again use an 80-20 split to divide training and validation sets.
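To make the two set-ups concrete, the following sketch builds both splits; it is illustrative only (the sample layout and the scikit-learn usage are assumptions, not fuSE's actual code):

    from sklearn.model_selection import train_test_split

    # Each sample: (description text, scenario id, label); the layout is illustrative.
    samples = [
        ("robot, greeting someone means you wave and say hello", 1, "teaching"),
        ("wave your hand and say hi", 1, "non-teaching"),
        ("making coffee means you fill water and press the button", 2, "teaching"),
        ("put a cup under the dispenser", 2, "non-teaching"),
        ("serving drinks means you bring the bottle to the table", 3, "teaching"),
        ("setting the table means you place two plates and cutlery", 4, "teaching"),
        # ... remaining descriptions from the user study
    ]

    # Random split: 80% training, 20% testing; the training part is split 80/20
    # again into training and validation data.
    train_val, test = train_test_split(samples, test_size=0.2, random_state=42)
    train, val = train_test_split(train_val, test_size=0.2, random_state=42)

    # Scenario-based split: three scenarios for training/validation, one held out for testing.
    held_out = 4
    train_val = [s for s in samples if s[1] != held_out]
    test = [s for s in samples if s[1] == held_out]
    train, val = train_test_split(train_val, test_size=0.2, random_state=42)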
We applied classical machine learning and neural network approaches to the task. The classical techniques are: decision trees, random forests, support vector machines, logistic regression, and Naïve Bayes. As baseline for the classification accuracy we use the so-called Zero-Rule classifier (ZeroR); it always predicts the majority class of the training set, i.e. teaching in this case.
We transform the words to bag-of-words vectors and use tri- and quadrigrams as additional features. The measured accuracy of each classifier on the random and scenario-based data is depicted in Table 3; the validation set accuracy is given in parentheses and the test set accuracy without.
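As an illustration of this feature set, a minimal scikit-learn sketch of the logistic-regression baseline (the concrete texts and parameter values are assumptions):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import FeatureUnion, Pipeline

    # Bag-of-words vectors plus tri- and quadrigrams as additional features.
    features = FeatureUnion([
        ("bow", CountVectorizer(ngram_range=(1, 1))),
        ("tri_quad", CountVectorizer(ngram_range=(3, 4))),
    ])

    classifier = Pipeline([
        ("features", features),
        ("logreg", LogisticRegression(max_iter=1000)),
    ])

    train_texts = ["making coffee means you fill water and press the button",
                   "go to the fridge and fetch a bottle of water"]
    train_labels = ["teaching", "non-teaching"]
    classifier.fit(train_texts, train_labels)
    print(classifier.predict(["serving drinks means you bring the bottle to the table"]))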
On the random set all classifiers exceed the baseline. Thus, the (slightly) imbalanced dataset does not seem to affect the classifiers much. Logistic regression performs surprisingly well. However, on the scenario-based split the accuracy of all classifiers decreases drastically. While the accuracies on the validation set remain stable, these classifiers are unable to generalize to unseen input. Logistic regression remains the best classifier; however, its accuracy decreases to 71.9%.
These results reinforced our intuition that deep learning is more appropriate for this task. We implemented a broad range of neural network architectures: artificial neural networks, convolutional networks, and recurrent networks, including LSTMs and GRUs and their bidirectional variants. We experimented with additional layers, which we systematically added to the networks, such as dropout (DO), dense (D), or global max pooling (GMax). We altered all hyper-parameters in reasonable ranges of values. We present only the best performing configurations, i.e. architecture and hyper-parameter combinations, in Table 4. Detailed information on the tested hyper-parameter values and further results may be found in Appendices B and C. The words from the input are represented as fastText word embeddings (Bojanowski et al., 2017; Joulin et al., 2017); we use pre-trained 300-dimensional embeddings. Moreover, we use Google's pre-trained language model BERT (base-uncased), which we equipped with a flat binary output layer.
The results attest that the deep learning approaches clearly outperform the best classical technique (logistic regression). In particular, the accuracies show smaller differences between the random and the scenario-based split. This suggests that the classification is more robust. The best accuracy on the scenario test set is achieved by a bidirectional GRU: 93.2%. Using BERT, the accuracy increases by more than 4%, with a peak at 97.7% using 300 training epochs. However, the ten-epoch version is a feasible choice, since the accuracy loss is negligible and the savings in training time are immense.
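For illustration, a sketch of such a recurrent architecture in Keras: a bidirectional GRU over (pre-trained) fastText embeddings with a sigmoid output. The layer sizes and dropout rate are placeholders, not the exact configuration reported in Table 4.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size, emb_dim = 20000, 300
    # The real model loads the pre-trained fastText vectors; random values stand in here.
    embedding_matrix = np.random.rand(vocab_size, emb_dim)

    model = keras.Sequential([
        layers.Embedding(vocab_size, emb_dim,
                         embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                         trainable=False, mask_zero=True),
        layers.Bidirectional(layers.GRU(128)),       # BiGRU over the word sequence
        layers.Dropout(0.3),                         # dropout (DO) layer
        layers.Dense(1, activation="sigmoid"),       # teaching vs. non-teaching
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])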

Second Stage: Semantic Structures
The second stage, detecting the semantic parts in teaching efforts, is a typical sequence-to-sequence labeling task with the labels declaration, specification, and miscellaneous. Even though these semantic structures correspond to phrases from a grammatical point of view, we decided to use per-word labels. For this task we only use neural network approaches and BERT. The remaining set-up is similar to the first stage. We again use fastText embeddings and vary the network architectures and hyper-parameters. Except for a ternary output layer, we use the same configuration for BERT as in the first stage.
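A corresponding sketch for the per-word labeling: a bidirectional LSTM whose time-distributed softmax assigns one of the three labels to every token (again, sizes and rates are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    vocab_size, emb_dim = 20000, 300
    num_labels = 3   # declaration, specification, miscellaneous

    model = keras.Sequential([
        layers.Embedding(vocab_size, emb_dim, mask_zero=True),   # fastText weights in the real set-up
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Dropout(0.3),
        layers.TimeDistributed(layers.Dense(num_labels, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])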
The results for both the random and the scenario-based split are reported in Table 5. Recurrent architectures are the clear choice for this task; accuracy values are consistently high. Most encouragingly, the decline on the scenario data is negligible (less than 1%). Apparently, the models generalize well and are thus resilient to a change in vocabulary. For the second stage the use of BERT is of no advantage; the results even fall behind the best RNN configurations.

Third Stage: Method Synthesis
During stage three we first transfer the natural language utterances into a model that represents both method definitions and scripts. Afterwards, we synthesize methods (or scripts) from this model. We create a method signature and map instructions in the body to API calls; to synthesize scripts we only map the instructions and inject control structures. Before we can transfer natural language utterances to the semantic model, we must perform a few NLP pre-processing steps that enrich the input with syntactic and semantic information. To obtain parts of speech (PoS), we apply a joint tagging approach; we consolidate the PoS tags produced by the Stanford Log-linear Part-Of-Speech Tagger (Toutanova et al., 2003) and SENNA (Collobert et al., 2011). The Stanford Tagger also provides us with word lemmas. Then we detect individual events in terms of clauses. Since our approach is supposed to cope with spoken language, we are unable to make use of punctuation. Instead, we split the input into a continuous sequence of instructions based on heuristics that make use of PoS tags and keywords. However, the instructions do not necessarily span complete clauses. Thus, we cannot apply common parsers. Instead, we use the shallow parser BIOS, which provides us with chunks. To obtain semantic roles for each instruction, we again use SENNA, which employs the semantic role label set defined in the CoNLL-2004 and CoNLL-2005 shared tasks. Using WordNet (Fellbaum, 1998), we can also make use of synonyms.
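For the synonym look-up, a minimal sketch with NLTK's WordNet interface (assuming the WordNet corpus has been downloaded; fuSE's actual implementation may differ):

    from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

    def synonyms(word, pos=wn.VERB):
        # Collect the lemma names of all synsets of the word, e.g. "close" -> {"shut", ...}.
        result = set()
        for synset in wn.synsets(word, pos=pos):
            for lemma in synset.lemmas():
                result.add(lemma.name().replace("_", " "))
        result.discard(word)
        return result

    print(synonyms("close"))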
We use ontologies to model the target systems, i.e. APIs. An ontology represents the classes, methods, parameters, data types, and values (resp. value ranges) of an API (similar to the ontologies used by Landhäußer et al. (2017) and Atzeni and Atzori (2018a,b)). The basic ontology structure is depicted in Table 6. If the system is supposed to interact with an environment, we employ additional ontologies that model the environment, including objects and their states (see Table 7). Environment ontologies are merged into system ontologies by copying concepts to the respective placeholders.
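A strongly simplified stand-in for this ontology structure (the actual ontologies are richer and also model values and value ranges; the names used here are hypothetical):

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class DataType:
        name: str
        parent: Optional["DataType"] = None   # e.g. Mug is a Graspable

    @dataclass
    class Parameter:
        name: str
        data_type: DataType

    @dataclass
    class Method:
        name: str
        parameters: list = field(default_factory=list)

    # Fragment of a household-robot API with a merged environment concept.
    graspable = DataType("Graspable")
    mug = DataType("Mug", parent=graspable)
    grasp = Method("grasp", [Parameter("what", graspable)])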
To bridge the semantic gap between natural and programming language we introduce a semantic model, as depicted in Figure 2. The model resembles the basic structure of method definitions. However, the leaves are composed of natural language phrases. To determine the phrases that will make up the model elements, we first smooth the classification results provided by the second stage. fuSE maps all phrases of an instruction to the same second-level model element, i.e. either the method signature or an instruction of the body. Therefore, we unify the second-stage classification labels for each instruction using majority decision. Afterwards, we map phrases to leaf elements. Roughly speaking, we use the roles provided by semantic role labeling (SRL) and map predicates to names and arguments to parameters. If we detect a coreference, we substitute the referring expression with the referent, e.g. it with the cup. We also add a lemmatized variant of the phrase and all synonyms. Note that the parameters are a list of phrases.
The first step to create method definitions is signature synthesis. To construct a meaningful name, we heuristically clean up the phrase, e.g. remove auxiliary verbs and stop words, and concatenate the remaining words. The parameters are either mapped to data types to infer formal parameters or, if no mapping is found, they are attached to the name. For instance, assume the declarative instruction is "serving wine means". Then fuSE extracts serve as the first part of the name and tries to map wine to an ontology individual (as discussed later). Assume it finds the individual RedWineBottle, which is an instance of the concept Graspable in the environment ontology. If the system ontology supports the data type Graspable, fuSE synthesizes the signature serve(serve.what : Graspable). Otherwise, the method signature serveWine() is created.
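A heuristic sketch of the signature synthesis described above; the word lists, the look-up callback, and the output formatting are assumptions that merely mimic the serve(serve.what : Graspable) example (fuSE operates on the lemmatized phrase).

    STOP_WORDS = {"the", "a", "an", "to", "please", "you"}
    AUXILIARY_VERBS = {"be", "is", "are", "do", "does", "mean", "means"}

    def synthesize_signature(declaration_lemmas, find_individual, supported_types):
        # Heuristic clean-up: drop auxiliary verbs and stop words, keep content words.
        content = [w for w in declaration_lemmas
                   if w.lower() not in STOP_WORDS | AUXILIARY_VERBS]
        verb, rest = content[0].lower(), content[1:]

        name_parts, params = [verb], []
        for word in rest:
            individual = find_individual(word)   # ontology search, e.g. wine -> RedWineBottle
            if individual is not None and individual.data_type in supported_types:
                params.append(f"{verb}.what : {individual.data_type}")   # formal parameter
            else:
                name_parts.append(word.capitalize())                     # attach word to the name
        return "".join(name_parts) + "(" + ", ".join(params) + ")"

    class Individual:                 # minimal stand-in for an ontology individual
        def __init__(self, name, data_type):
            self.name, self.data_type = name, data_type

    lookup = {"wine": Individual("RedWineBottle", "Graspable")}
    print(synthesize_signature(["serve", "wine", "mean"],
                               lambda w: lookup.get(w.lower()), {"Graspable"}))
    # -> serve(serve.what : Graspable); without the mapping: serveWine()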
The instructions in the method body are mapped to API calls. To this end, we first query the ontologies for each leaf element individually. For the queries we use three sets of words we create from the original phrase, the lemmatized version, and the synonyms. We then build the power sets and all permutations of each set, before we concatenate the words to construct a query set. For instance, for the phrase is closed, we produce the query strings isclosed, closedis, beclose, closebe, closed, is, . . . The ontology search returns all individuals whose Jaro-Winkler score (Winkler, 1990) or fuzzy score exceeds a low threshold (.15 in the case of the fuzzy score). We decided on these comparatively low thresholds, since we see them as lightweight filters that let numerous generally valid candidates pass. Since an individual may be returned more than once with different scores, we set the score of the individual to the maximum of its scores.
Afterwards, we construct API calls from the model structure and rate each candidate. We start with the method name candidates. For each candidate we query the ontology for formal parameters. Then, we try to satisfy the parameters with the candidates returned by the individual ontology search. Note that we perform type checking for the parameters (including inheritance if applicable). For instance, for the instruction take the cup we may have found the individual grasp as a candidate for the method name and the parameter candidates Mug (type Graspable) and Cupboard (type Location). The ontology indicates that the method grasp has one parameter of type Graspable. The type check then ensures that fuSE creates the call candidate grasp(Mug) but not grasp(Cupboard). The score is composed of the individual scores of the method names and parameters, the share of mapped words of the query string to all words in the query, the ratio of mapped parameters to (expected) formal parameters, and the number of additional (superfluous) parameters. In Appendix D we give a more formal introduction to our scoring approach.
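A sketch of the query-set construction described above: power sets of the word variants, then all permutations of each subset, concatenated into query strings (the similarity filtering with Jaro-Winkler and fuzzy scores would follow and is omitted here):

    from itertools import chain, combinations, permutations

    def query_strings(words):
        # All non-empty subsets of the words ...
        subsets = chain.from_iterable(
            combinations(words, r) for r in range(1, len(words) + 1))
        # ... and every permutation of each subset, concatenated into one query string.
        return {"".join(perm) for subset in subsets for perm in permutations(subset)}

    # One query set per word-set variant: original phrase, lemmas, and synonyms.
    print(sorted(query_strings(["is", "closed"])))   # ['closed', 'closedis', 'is', 'isclosed']
    print(sorted(query_strings(["be", "close"])))    # ['be', 'beclose', 'close', 'closebe']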
The result of the scoring process is a ranked list of candidates for each instruction. For the time being, we simply use the top-ranked candidates to synthesize the method body. However, re-ranking the candidates based on other semantic resources is promising future work. In a last step, we inject control structures, i.e. conditional branching, various types of loops, and concurrency (Weigelt et al., 2018b,c). The approach is rule-based. We use key phrases, such as in case, until, and at the same time. Proceeding from these anchor points we look for structures that fit into the respective control structure. Here, we apply heuristics on the syntax (based on PoS tags and chunks) and coreference. Utterances that were labeled as non-teaching in the first stage also run through the third stage, except for signature synthesis. Thus, we only construct scripts for this type of utterance.
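A much simplified sketch of the keyword-based anchor detection for control structures; the key-phrase lists are examples, and determining the extent of the respective blocks additionally relies on PoS tags, chunks, and coreference, which is omitted here.

    KEY_PHRASES = {
        "conditional": ("if", "in case", "when"),
        "loop": ("until", "while", "for each"),
        "concurrency": ("at the same time", "meanwhile", "in parallel"),
    }

    def control_anchors(instructions):
        # Return (instruction index, structure type, key phrase) anchor points.
        anchors = []
        for i, instruction in enumerate(instructions):
            text = instruction.lower()
            for structure, phrases in KEY_PHRASES.items():
                for phrase in phrases:
                    if phrase in text:
                        anchors.append((i, structure, phrase))
        return anchors

    print(control_anchors(["take the cup", "in case the machine is empty fill it up"]))
    # -> [(1, 'conditional', 'in case')]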
We determine the quality of the third-stage approach based on utterances from scenarios one, two, and three, since we used scenario four during development. The assessment is partly manual. Hence, we randomly drew 25 utterances from each scenario to reduce the effort. For each description we used the manual labels of the first-stage and second-stage classifications and prepared a gold standard for the API calls in the method body. Table 9 depicts the dataset. We did not prepare reference solutions for the signatures, since plenty of valid solutions are imaginable. Instead, we reviewed the synthesized signatures manually afterwards. Of the 52 synthesized method names we assessed eight as inappropriate. A name is inappropriate if it is either off-topic or contains unrelated terms, e.g. askSpeaker or prepareCoffeeFriend for the scenario How to prepare coffee. Moreover, fuSE correctly mapped 23 parameters without any false positives.
The API ontology used in our setting (household robot) comprises 92 methods, 59 parameters, and 20 data types. To represent the environment of the robot (a kitchen), we used another ontology with 70 objects of six types and six states. Table 8 details the results for the method body synthesis. Besides precision, recall, and F1, it shows the average rank at which the correct element is to be found. Since semantic role labeling introduces a vast amount of errors on spoken utterances and our approach heavily depends on it, we also determine recall and F1 excluding SRL errors. The results are encouraging. We achieve an F1 value of 76.7% for the individuals and 62.0% for entire calls; in both cases the precision is slightly ahead of the recall. If we exclude SRL errors, the overall performance increases (about 7% for individuals and 5% for calls).
Besides the SRL, missing and inappropriate synonyms are a major source of errors. If WordNet lacks a synonym for an important word in the utterance, fuSE's API mapping may be unable to determine the correct ontology individual. Conversely, if WordNet provides an inappropriate synonym, fuSE may produce an incorrect (superfluous) mapping. In other cases, our language model is unable to capture the semantics of the utterance properly. For example, fuSE creates two method calls for the phrase "make sure you close it": close(. . . ) and make(. . . ). It may also produce superfluous mappings for explanatory phrases, such as "the machine fills cups", if the second stage did not classify them as miscellaneous. Regarding the composition of API calls (methods plus arguments), the majority of errors is introduced by the arguments. In addition to the afore-mentioned error sources, arguments are often ambiguous. For instance, the phrase "open the door" leaves it up to interpretation which door was intended to be opened. Even though fuSE makes use of an elaborate context model, some ambiguities are impossible to resolve (see section 5). A related issue is the incorrect resolution of coreferences; each mistake leads to a misplaced argument. Most of these error sources can be eliminated if the pre-processing improves. However, many difficulties simply arise from erroneous or ambiguous descriptions. Still, fuSE interprets most of them correctly. Most encouragingly, the average rank of the correct element is near 1. Thus, our scoring mechanism succeeds in placing the right elements at the top of the list.

Evaluation
To measure the performance of fuSE on unseen data, we set up an end-to-end evaluation. We created two new scenarios. They take place in the kitchen setting again, but involve different actions and objects. In the first, subjects are supposed to teach the robot how to start the dishwasher and in the second, how to prepare cereals. Once more we used Prolific to collect the data and set the number of participants to 110. However, we accepted only 101 submissions, i.e. 202 descriptions. We randomly drew 50 descriptions per scenario. Since the evaluation of the overall approach entails the same output as the third stage, we prepared the gold standard as in subsection 3.4 and used the same ontologies. Table 11 details the dataset used in the end-to-end evaluation. Additionally, we provide five exemplary descriptions from the dataset in Table 14 (Appendix A).
In the end-to-end evaluation our approach synthesized 73 method signatures; five were missed due to an incorrect first-stage classification. Out of the 73 synthesized methods we assessed seven as inappropriate. Additionally, 36 parameters were mapped correctly and no false positives were created. Except for the missing method signatures, the results are in line with the third-stage evaluation.
The results for the method body synthesis are depicted in Table 10; they improve upon the third-stage evaluation. Excluding SRL errors again increases the performance; however, the effect is smaller here. Moreover, the average rank is also closer to the optimum (1.0) in both cases. Since the first two stages of fuSE are based on neural networks, it is difficult to say why the results in the end-to-end evaluation improve. However, we believe the main cause is the introduction of a new test dataset, which has two consequences. First, the models used in the first two stages are learned on all four scenarios instead of three, i.e. the models are trained on a larger dataset, which (presumably) makes them more robust. Second, the new task may be simpler to describe. Consequently, the descriptions comprise simpler wordings and become easier to handle. In summary, the results show that fuSE generalizes to different settings (at least in the same domain) and is only marginally degraded by error propagation.
To assess how well fuSE generalizes to truly spoken utterances, we evaluated it on another dataset. It is a collection of recordings from multiple recent projects. The setting (instructing a humanoid robot in a kitchen) is the same. However, none of the scenarios involved teaching new functionality. Thus, we can only measure fuSE's ability to construct scripts. The descriptions in this dataset comprise control structures to a much larger extent. Altogether the dataset comprises 234 recordings and manual transcriptions. The 108 subjects were mostly undergraduate and graduate students.
On the transcripts we assess the mapping of methods and parameters individually. The results for both and for entire calls are depicted in Table 12. Even though the spoken samples comprise a vast number of disfluencies and grammatical flaws, fuSE maps more calls correctly. This counter-intuitive effect may be explained by the lower complexity and brevity of the spoken descriptions. Regarding the control structures, 27.4% were injected correctly. Note that correctly means an appropriate condition plus a block with the correct extent. If we lower the standards for condition correctness, the share of correct structures is 71.23%.

Conclusion
We have presented fuSE, a system for programming in natural language. More precisely, we aim to enable laypersons to teach an intelligent system new functionality with nothing but spoken instructions. Our approach is three-tiered. First, we classify whether a natural language description entails an explicitly stated intent to teach new functionality. If an intent is spotted, we use a second classifier to separate the input into semantically disjoint parts; we identify declarative and specifying parts and filter out superfluous information. Finally, we synthesize method signatures from the declarative parts and method bodies from the specifying parts. Method bodies contain instructions and control structures. Instructions are mapped to API calls. We implemented the first two steps using classical machine learning and neural networks. Teaching intents are identified with an accuracy of 97.7% (using BERT). The classification of the semantics is correct in 97.6% of the cases (using a BiLSTM).
We evaluated fuSE on 100 descriptions obtained from a user study. The results are promising; fuSE correctly synthesized 84.6% of the method signatures. The mapping of instructions in the body to API calls achieved an F1-score of 66.9%. In a second evaluation on a speech corpus the F1-score for API calls is 79.2%.
We plan to evaluate fuSE in other domains. It will be interesting to see whether we can reuse (or transfer) the machine learning models as well as the rest of the approach. Future additions to fuSE will include the integration of a dialog component. We may query the user in case of ambiguous statements or missing parameters. We have implemented an extensible dialog module and shown that it can be used to resolve ambiguous references, word recognition errors, and missing conditions (Weigelt et al., 2018a). However, we still have to figure out how to query users properly if an API mapping is ambiguous or parameters are missing. Another improvement concerns the analysis of verb references. Humans often refer to previous actions, which may cause superfluous instructions. We will also implement a sanity check that considers the feasibility and meaningfulness of the sequence of actions in the method body. The latter may involve a feedback mechanism via the dialog component. Giving feedback on newly learned method definitions, which may be lengthy and therefore unhandy to repeat as a whole, is an interesting challenge.

A Dataset Examples
The dataset includes descriptions of varying quality. Some texts have syntactical flaws such as typos and grammar mistakes. They also vary in terms of descriptiveness and style; the latter ranges from full sentences to notes. Table 13 shows six examples from the preliminary study (scenarios one to four) and Table 14 five examples from the end-to-end evaluation (scenarios five and six). Most of the descriptions contain errors. For instance, description 2180 contains typos, such as "ring some beverage".

B Architectures and Hyper-parameters
We applied a broad range of machine learning approaches to the classification tasks. Table 15 shows the types, architectures, and hyper-parameters we tested in the process. We also experimented with self-trained and pre-trained fastText embeddings. Table 16 shows representative configurations for the first stage of fuSE (binary classification); for the neural networks we altered the hyper-parameters systematically to give an intuition of their effects. There are general trends: classifiers perform better on randomly split data, a batch size of 100 is better than 300, and pre-trained embeddings outperform the self-trained ones in almost all cases. Overall, BERT-based classifiers achieve the best results. However, some neural network configurations come close (e.g. RNN 6.0); classical machine learning techniques are inadequate. For the second stage (ternary classification) we show interesting results in Table 17. The trends are as follows: the preferable batch size is 32, pre-trained embeddings again outperform the self-trained ones, and RNNs perform best.

D Call Candidate Scoring
In subsection 3.4 we only discuss the rationale behind our call candidate scoring mechanism. Subsequently, we give a formal introduction. A call candidate is an API method with arguments (extracted from the natural language input). The arguments are of either primitive, composite (strings or enumerations), or previously defined types (e.g. objects from the environment). The arguments adhere to the formal definition of the API method. For each call candidate c, fuSE calculates the score S(c) as follows:
S(c) = φ · P(c) · S_M(c) + (1 − φ) · WS_P(c)    (1)
The score is composed of two components: the method score S_M(c) and the weighted parameter score WS_P(c). The impact of the latter on the final score can be adjusted with the weight φ. Further, S_M(c) is scaled by the perfect match bonus P(c):
P(c) = τ if M(c) > 0.9, and P(c) = 1 otherwise    (2)
The perfect match bonus P(c) allows us to prefer call candidates with a method name score M(c) above 0.9. The scaling factor τ is configurable (τ ≥ 1). The method score S_M(c) is computed as follows:
S_M(c) = M(c) − (β / |W(c)|) · (1 − |W_m(c)| / |W(c)|)    (3)
Here, W(c) denotes the words of the natural language chunk and W_m(c) the subset of those words that can be found in the method name. The method name score M(c) is the maximal similarity of the natural language chunk that represents the action (or event) and the (API) method name. We use Jaro-Winkler and fuzzy score as similarity measures. To obtain the method score S_M(c), the method name score M(c) is reduced by a subtrahend that indicates how well the method name represents the words in the original natural language chunk. The subtrahend is composed of two factors. The second is one minus the fraction of the words in the chunk that can be found in the method name and the total number of words in the chunk; i.e., this factor is the share of unmapped words. The other factor scales it by a configurable parameter β, which is divided by the length of the chunk. The rationale behind this is as follows. In short chunks each word is important. Therefore, unmapped words are strongly penalized. With an increasing number of words in the chunk, it is increasingly unlikely to map all words. However, in longer chunks many words are semantically irrelevant. Therefore, we reduce the subtrahend with the length of the chunk. The weighted parameter score WS_P(c) in Equation 1 is calculated as follows:
WS_P(c) = S_P(c) − ω · Pen(c)    (4)
The score is composed of the parameter score S_P(c) and a penalty value Pen(c); the latter is weighted by the configurable factor ω. The parameter score S_P(c) is calculated as follows:
S_P(c) = (|P_M| / |P_O(c)|) · Σ_{p_i ∈ P_M} P_i(c)    (5)
P_M is the set of all parameters p_i (extracted from the natural language input) that were mapped to formal method parameters. Each p_i has a similarity score P_i(c). Thus, S_P(c) is the sum of all similarity scores of mapped parameters, multiplied with the share of mapped parameters (P_M) and expected formal parameters as defined in the ontology (P_O(c)). To calculate WS_P(c) (see Equation 4), S_P(c) is reduced by the penalty value Pen(c), which is calculated as follows:
Pen(c) = |P_E \ P_M| / |P_E|    (6)
P_E is the set of parameters that were extracted from the natural language input (see Figure 2). Thus, Pen(c) is the number of parameters in the input that were not mapped to a formal method parameter, normalized by the total number of extracted (natural language) parameters.
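A compact sketch of this scoring in code; the composition follows the formulas above (which are reconstructed from the prose), the default weights are the ones used in the evaluation, and ω is left configurable.

    def score_candidate(name_score, mapped_word_share, chunk_length,
                        mapped_param_scores, expected_params, extracted_params,
                        phi=0.6, tau=1.5, beta=0.5, omega=0.0):
        # Perfect match bonus P(c): prefer candidates whose name score exceeds 0.9.
        p = tau if name_score > 0.9 else 1.0

        # Method score S_M(c): name similarity reduced by the share of unmapped
        # chunk words, scaled by beta and damped by the chunk length.
        s_m = name_score - (beta / chunk_length) * (1.0 - mapped_word_share)

        # Parameter score S_P(c): similarity of mapped parameters, scaled by the
        # ratio of mapped to expected formal parameters.
        s_p = (sum(mapped_param_scores) * len(mapped_param_scores) / expected_params
               if expected_params else 0.0)

        # Penalty Pen(c): share of extracted parameters that could not be mapped.
        pen = ((extracted_params - len(mapped_param_scores)) / extracted_params
               if extracted_params else 0.0)

        ws_p = s_p - omega * pen                      # WS_P(c)
        return phi * p * s_m + (1.0 - phi) * ws_p     # S(c)

    # Example: "take the cup" -> grasp(Mug), one of one expected parameter mapped.
    print(score_candidate(name_score=0.8, mapped_word_share=0.5, chunk_length=3,
                          mapped_param_scores=[0.9], expected_params=1,
                          extracted_params=2))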
For the evaluation of the third stage of fuSE and the end-to-end evaluation we set the method score weight φ to 0.6, the perfect match multiplier τ to 1.5, the search string coverage weight β to 0.5, and the penalty factor ω to 0.
Table 16: Classification accuracy obtained on the validation (in parentheses) and the test set for the first stage (binary classification). The best results (per classifier category) are printed in bold type. The basic structure of each neural network includes an embedding layer and an output layer (dense layer).