Retrieval-Based Neural Code Generation

In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce RECODE, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model. First, we retrieve sentences that are similar to input sentences using a dynamic-programming-based sentence similarity scoring method. Next, we extract n-grams of action sequences that build the associated abstract syntax tree. Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code. We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU.


Introduction
Natural language to code generation, a subtask of semantic parsing, is the problem of converting natural language (NL) descriptions to code (Ling et al., 2016; Yin and Neubig, 2017; Rabinovich et al., 2017). This task is challenging because it has a well-defined structured output and the input structure and output structure are in different forms.
A number of neural network approaches have been proposed to solve this task. Sequential approaches (Ling et al., 2016; Jia and Liang, 2016; Locascio et al., 2016) convert the target code into a sequence of symbols and apply a sequence-to-sequence model, but this approach does not ensure that the output will be syntactically correct. Tree-based approaches (Yin and Neubig, 2017; Rabinovich et al., 2017) represent code as Abstract Syntax Trees (ASTs), which has proven effective in improving accuracy as it enforces the well-formedness of the output code. However, representing code as a tree is not a trivial task, as the number of nodes in the tree often greatly exceeds the length of the NL description. As a result, tree-based approaches are often incapable of generating correct code for phrases in the corresponding NL description that have low frequency in the training data.
Code available at https://github.com/sweetpeach/ReCode
In machine translation (MT) problems (Gu et al., 2018; Amin Farajian et al., 2017; Li et al., 2018), hybrid methods combining retrieval of salient examples and neural models have proven successful in dealing with rare words. Following the intuition of these models, we hypothesize that our model can benefit from querying pairs of NL descriptions and AST structures from training data.
In this paper, we propose RECODE, an adaptation of a retrieval-based neural MT method to the code generation problem, expanding it to apply to the generation of tree structures. Our main contribution is to introduce the use of retrieval methods in neural code generation models. We also propose a dynamic-programming-based sentence-to-sentence alignment method that can be applied to similar sentences to perform word substitution and enable retrieval of imperfect matches. These contributions allow us to improve on previous state-of-the-art results.

Syntactic Code Generation
Given an NL description $q$, our purpose is to generate code (e.g. Python) represented as an AST $a$. In this work, we start with the syntactic code generation model by Yin and Neubig (2017), which uses sequences of actions to generate the AST before converting it to surface code. Formally, we want to find the best generated AST $\hat{a}$ given by:
$$\hat{a} = \arg\max_{a} p(a \mid q), \qquad p(a \mid q) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, q)$$
where $y_t$ is the action taken at time step $t$, $y_{<t} = y_1 \ldots y_{t-1}$, and $T$ is the total number of time steps of the whole action sequence resulting in AST $a$.
We have two types of actions to build an AST: APPLYRULE and GENTOKEN. APPLYRULE(r) expands the current node in the tree by applying production rule r from the abstract syntax grammar to the current node. GENTOKEN(v) populates terminal nodes with the variable v, which can be generated from the vocabulary or by COPYing variable names or values from the NL description. The generation process follows a preorder traversal starting with the root node. Figure 1 shows an action tree for the example code: the nodes correspond to actions per time step in the construction of the AST.
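The two action types and the preorder expansion can be illustrated with a small sketch. It is not the model's implementation: the grammar, the node types, the `Node` class, and the helper names are hypothetical, and GENTOKEN is simplified to emit exactly one token per terminal node.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical terminal types of a toy grammar (not the real Python grammar).
TERMINALS = {"identifier", "int"}


@dataclass
class Node:
    type: str
    children: List["Node"] = field(default_factory=list)
    token: Optional[str] = None


def build_ast(root_type, actions):
    """Replay APPLY_RULE / GEN_TOKEN actions in preorder to build a tree."""
    root = Node(root_type)
    frontier = [root]  # stack: the top is the next unexpanded node in preorder
    for kind, arg in actions:
        node = frontier.pop()
        if kind == "APPLY_RULE":
            head, child_types = arg
            if head != node.type:
                raise ValueError(f"rule for {head} applied to {node.type}")
            node.children = [Node(t) for t in child_types]
            # Push children reversed so the leftmost child is expanded first.
            frontier.extend(reversed(node.children))
        elif kind == "GEN_TOKEN":
            if node.type not in TERMINALS:
                raise ValueError(f"{node.type} is not a terminal type")
            node.token = arg
        else:
            raise ValueError(f"unknown action kind: {kind}")
    if frontier:
        raise ValueError("action sequence ended before the tree was complete")
    return root


def terminal_tokens(node):
    """Collect terminal tokens from left to right."""
    if node.token is not None:
        return [node.token]
    return [tok for child in node.children for tok in terminal_tokens(child)]


# Action sequence for the toy statement `x = 1`.
ACTIONS = [
    ("APPLY_RULE", ("Assign", ["Name", "Num"])),
    ("APPLY_RULE", ("Name", ["identifier"])),
    ("GEN_TOKEN", "x"),
    ("APPLY_RULE", ("Num", ["int"])),
    ("GEN_TOKEN", "1"),
]
tree = build_ast("Assign", ACTIONS)
print(terminal_tokens(tree))
```

Because every action is checked against the type of the node it expands, any action sequence that replays without error yields a well-formed tree, which is the property that motivates tree-based decoding.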
Interested readers can refer to Yin and Neubig (2017) for more details on the neural model, which consists of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) encoder-decoder with action embeddings, context vectors, parent feeding, and a copy mechanism using pointer networks.

RECODE: Retrieval-Based Neural Code Generation
We propose RECODE, a method for retrieval-based neural syntactic code generation, using retrieved action subtrees. Following a retrieval-guided method for neural machine translation, these retrieved subtrees act as templates that bias the generation of output code. Our pipeline at test time is as follows:
• retrieve from the training set the NL descriptions that are most similar to our input sentence (§3.1),
• extract n-gram action subtrees from these retrieved sentences' corresponding target ASTs (§3.2),
• alter the copying actions in these subtrees, by substituting words of the retrieved sentence with corresponding words in the input sentence (§3.3), and
• at every decoding step, increase the probability of actions that would lead to having these subtrees in the produced tree (§3.4).

Retrieval of Training Instances
For every retrieved NL description $q_m$ from the training set (or retrieved sentence for short), we compute its similarity with the input $q$, using a sentence similarity formula (Gu et al., 2016):
$$\mathrm{sim}(q, q_m) = 1 - \frac{d(q, q_m)}{\max(|q|, |q_m|)}$$
where $d$ is the edit distance. We retrieve only the top $M$ sentences according to this metric, where $M$ is a hyperparameter. These scores will later be used to increase action probabilities accordingly.
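A minimal sketch of this retrieval step follows. It assumes the length-normalized form sim(q, q_m) = 1 − d(q, q_m) / max(|q|, |q_m|) with a word-level edit distance d; the function names and the example sentences are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(q, q_m):
    """Assumed similarity: 1 - d(q, q_m) / max(|q|, |q_m|)."""
    longest = max(len(q), len(q_m))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(q, q_m) / longest


def retrieve_top_m(query, training_sentences, m):
    """Return the m most similar training sentences with their scores."""
    scored = [(similarity(query, s), s) for s in training_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:m]


# Hypothetical tokenized descriptions.
query = "sort params in reverse order".split()
train = [
    "sort list lst in reverse order".split(),
    "return the length of lst".split(),
    "call the function foo".split(),
]
for score, sentence in retrieve_top_m(query, train, m=2):
    print(round(score, 3), " ".join(sentence))
```

Scanning the whole training set costs one dynamic program per training sentence, which is feasible for datasets of the size used here; a larger corpus would likely need a pre-filtering step, which this sketch omits.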

Extracting N-gram Action Subtrees
In retrieval-guided NMT, n-grams are collected from the output side of the retrieved sentences, and the model is encouraged to generate these n-grams. Word n-grams are obvious candidates when generating a sequence of words as output, as in NMT. However, in syntax-based code generation, the generation target is an AST with no obvious linear structure. To resolve this problem, we instead use retrieved pieces of n-gram subtrees from the target code corresponding to the retrieved NL descriptions. Though we could select successive nodes in the AST as retrieved pieces, such as [assign; expr* (targets); expr] from Figure 1, we would miss important structural information from the rules that are used. Thus, we choose to exploit actions in the generation model rather than AST nodes themselves as candidates for our retrieved pieces.
In the action tree (Figure 1), we consider only successive actions, that is, subtrees where each node has one or no children, to avoid overly rigid structures or a combinatorial explosion in the number of retrieved pieces the model has to consider.
As each node in the action tree holds structural information about its children, we set the subtrees to have a fixed depth, which keeps the number of extracted subtrees linear in the size of the tree. These can be considered "n-grams of actions", emphasizing the comparison with machine translation, which uses n-grams of words. n is a hyperparameter to be tuned.
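Extraction of such action n-grams can be sketched as follows. The representation is an assumption: the action tree is given as a list of action labels plus a parallel list of parent indices, and every chain of 1 to `n_max` successive parent-to-child actions is collected (whether unigrams are kept in practice is not stated in the text). The labels are hypothetical.

```python
def action_ngrams(actions, parents, n_max):
    """Collect chains of up to n_max successive actions along parent links.

    actions[i] is the label of action i; parents[i] is the index of its
    parent action in the action tree, or None for the root.
    """
    ngrams = set()
    for i in range(len(actions)):
        chain = []
        j = i
        # Walk upward from action i, recording each chain that ends at i.
        while j is not None and len(chain) < n_max:
            chain.append(actions[j])
            ngrams.add(tuple(reversed(chain)))  # ordered from ancestor to descendant
            j = parents[j]
    return ngrams


# Toy action tree for `x = 1` (hypothetical labels).
ACTION_LABELS = [
    "Assign -> Name Num",
    "Name -> identifier",
    "GenToken[x]",
    "Num -> int",
    "GenToken[1]",
]
PARENT_INDEX = [None, 0, 1, 0, 3]

for ngram in sorted(action_ngrams(ACTION_LABELS, PARENT_INDEX, n_max=2)):
    print(ngram)
```

Each action ends at most `n_max` chains, so the number of extracted pieces grows linearly with the size of the action tree rather than combinatorially.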

Word Substitution in Copy Actions
Using the retrieved subtree without modification is problematic if it contains at least one node corresponding to a COPY action, because copied tokens from the retrieved sentence may be different from those in the input. Figure 1 shows an example where the input and retrieved sentence have four common words, but the object names are different. The extracted action n-gram would contain the rule that copies the second word ("lst") of the retrieved sentence, while we want to copy the first word ("params") from the input. By computing word-based edit distance between the input description and the retrieved sentence, we implement a one-to-one sentence alignment method that infers correspondences between uncommon words. We then alter all COPY rules in the extracted n-grams to copy the aligned counterpart in the input instead, such as "params" in place of "lst". If a copied word has no aligned counterpart, we delete the n-gram subtree, as it is not likely to be relevant in the predicted tree. Thus, in the example in Figure 1, the GENTOKEN(LST) action in t5 will not be executed.
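One way to realize this alignment is to backtrace the edit-distance table and treat every match or substitution as an aligned word pair. The sketch below makes several assumptions: COPY actions are represented as hypothetical `("COPY", index)` tuples, substitutions are preferred over deletions when the backtrace has a tie, and the example sentences are invented rather than taken from Figure 1.

```python
def align(retrieved, query):
    """Map retrieved-word indices to query-word indices via edit-distance backtrace."""
    n, m = len(retrieved), len(query)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dist[i - 1][j - 1] + (retrieved[i - 1] != query[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    alignment = {}
    i, j = n, m
    while i > 0 and j > 0:
        diag = dist[i - 1][j - 1] + (retrieved[i - 1] != query[j - 1])
        if dist[i][j] == diag:
            alignment[i - 1] = j - 1  # match or substitution: aligned pair
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1                    # retrieved word has no counterpart
        else:
            j -= 1                    # query word has no counterpart
    return alignment


def adapt_ngram(ngram, alignment):
    """Redirect COPY actions to the query; return None if a copied word is unaligned."""
    adapted = []
    for action in ngram:
        if isinstance(action, tuple) and action[0] == "COPY":
            if action[1] not in alignment:
                return None           # drop the whole n-gram
            adapted.append(("COPY", alignment[action[1]]))
        else:
            adapted.append(action)
    return tuple(adapted)


retrieved = "sort list lst in reverse order".split()
query = "sort params in reverse order".split()
alignment = align(retrieved, query)

# "lst" (index 2) aligns with "params" (index 1); "list" (index 1) is unaligned.
print(adapt_ngram(("Name -> identifier", ("COPY", 2)), alignment))
print(adapt_ngram(("Name -> identifier", ("COPY", 1)), alignment))
```

The first call redirects the copy to the position of "params" in the input, while the second returns `None` because the copied word has no counterpart, so that n-gram would be discarded.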

Retrieval-Guided Code Generation
N-gram subtrees from all retrieved sentences are assigned a score, based on the best similarity score of all instances where they appeared. We normalize the scores for each input sentence by subtracting the average over the training dataset.
At decoding time, we incorporate these retrieval-derived scores into beam search: for a given time step, all actions that would result in one of the retrieved n-grams $u$ being in the prediction tree have their log probability $\log p(y_t \mid y_1^{t-1})$ increased by $\lambda \cdot \mathrm{score}(u)$, where $\lambda$ is a hyperparameter and $\mathrm{score}(u)$ is the maximal $\mathrm{sim}(q, q_m)$ over the retrieved sentences from which $u$ is extracted. The probability distribution is then renormalized.
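The score adjustment for a single decoding step can be sketched as below. This is an illustration under assumptions: the dictionaries, action names, and the `rescore` helper are hypothetical, each action receives at most one bonus (how several matching n-grams combine is left open here), and the beam search around this step is omitted.

```python
import math


def rescore(log_probs, bonuses, lam):
    """Add lam * score(u) to boosted actions, then renormalize in log space.

    log_probs: {action: log p(action)} for the current decoding step.
    bonuses:   {action: score(u)} for actions that would complete a
               retrieved n-gram u; other actions get no bonus.
    """
    boosted = {a: lp + lam * bonuses.get(a, 0.0) for a, lp in log_probs.items()}
    # Log-sum-exp with the maximum factored out for numerical stability.
    top = max(boosted.values())
    log_z = top + math.log(sum(math.exp(v - top) for v in boosted.values()))
    return {a: v - log_z for a, v in boosted.items()}


step_log_probs = {
    "ApplyRule[Call]": math.log(0.5),
    "ApplyRule[Name]": math.log(0.3),
    "ApplyRule[Num]": math.log(0.2),
}
adjusted = rescore(step_log_probs, {"ApplyRule[Name]": 0.6}, lam=3.0)
for action, lp in adjusted.items():
    print(action, round(math.exp(lp), 3))
```

Renormalizing keeps the step distribution summing to one, so the boost shifts probability mass toward the retrieved pieces while the relative odds among the remaining actions stay unchanged.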

Datasets and Evaluation Metrics
We evaluate RECODE on the Hearthstone (HS) (Ling et al., 2016) and Django (Oda et al., 2015) datasets, as preprocessed by Yin and Neubig (2017). HS consists of Python classes that implement Hearthstone card descriptions, while Django contains pairs of Python source code and English pseudo-code from the Django web framework. Table 1 summarizes dataset statistics.
For evaluation metrics, we use accuracy of exact match and the BLEU score following Yin and Neubig (2017).

Experiments
For the neural code generation model, we use the settings explained in Yin and Neubig (2017). For the retrieval method, we tuned hyperparameters and achieved the best results with n_max = 4 and λ = 3 for both datasets. We set M = 3 for HS and M = 10 for Django. We compare our model with the model of Yin and Neubig (2017), which we call YN17 for brevity, and a sequence-to-sequence (SEQ2SEQ) model that we implemented. SEQ2SEQ is an attention-enabled encoder-decoder model (Bahdanau et al., 2015). The encoder is a bidirectional LSTM and the decoder is an LSTM.
Table 2 shows that RECODE outperforms the baselines in both BLEU and accuracy, providing evidence for the effectiveness of incorporating retrieval methods into tree-based approaches. We ran statistical significance tests for RECODE and YN17, using bootstrap resampling with N = 10,000. For the BLEU scores of both datasets, p < 0.001. For the exact match accuracy, p < 0.001 for the Django dataset, but for Hearthstone, p > 0.3, indicating that the retrieval-based model is on par with YN17 on this metric. It is worth noting, though, that HS consists of long and complex code, and that generating exact matches is very difficult, making exact match accuracy a less reliable metric.
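The text does not specify which bootstrap variant was used. As an illustration only, the sketch below assumes a paired bootstrap over per-example scores such as exact match (0 or 1); the function name and the toy scores are hypothetical. A corpus-level metric such as BLEU would need to be recomputed on each resample instead of summed.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often system A fails to beat system B on resampled test sets.

    scores_a[i] and scores_b[i] are per-example scores of the two systems
    on the same test example. The returned fraction acts as a p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Draw a test set of the same size, with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in sample)
        total_b = sum(scores_b[i] for i in sample)
        if total_a <= total_b:
            not_better += 1
    return not_better / n_resamples


# Toy exact-match outcomes for two systems on ten examples.
system_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
system_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(paired_bootstrap(system_a, system_b, n_resamples=1000))
```

Resampling example indices jointly for both systems keeps the comparison paired, so per-example difficulty affects both systems equally.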

Results
We also compare RECODE with Rabinovich et al. (2017)'s Abstract Syntax Networks with supervision (ASN+SUPATT) which is the state-of-the-art system for HS. RECODE exceeds ASN without extra supervision though ASN+SUPATT has a slightly better result. However, ASN+SUPATT is trained with supervised attention extracted through heuristic exact word matches while our attention is unsupervised.

Discussion and Analysis
From our observation and as mentioned in Rabinovich et al. (2017), HS contains classes with similar structure, so the code generation task could be simply matching the tree structure and filling the terminal tokens with correct variables and values. However, when the code consists of complex logic, partial implementation errors occur, leading to low exact match accuracy (Yin and Neubig, 2017). Analyzing our result, we find this intuition to be true not only for HS but also for Django.
Examining the generated output for the Django dataset in Table 3, we can see that in the first example, our retrieval model successfully generates the correct code where YN17 fails. This difference suggests that our retrieval model benefits from the action subtrees from the retrieved sentences. In the second example, although our generated code does not perfectly match the reference code, it has a higher BLEU score than the output of YN17 because our model predicts part of the code (timesince(d, now, reversed)) correctly. The third example shows a case where our method fails to apply the correct action, as it cannot cast s to the str type, while YN17 can at least cast s into a type (bool). Another common type of error that we found in RECODE's generated outputs is incorrect variable copying, similar to what is discussed in Yin and Neubig (2017) and Rabinovich et al. (2017). Table 4 presents a result on the HS dataset. We can see that our retrieval model can handle complex code more effectively.
Table 3: Django examples of correct code (Gold) and code predicted with retrieval (RECODE) and without retrieval (YN17).
Example 1
Input: "if offset is lesser than integer 0, sign is set to '-', otherwise sign is '+' "
YN17: sign = offset < 0 or '-'
RECODE: sign = '-' if offset < 0 else '+'
Gold: sign = '-' if offset < 0 else '+'
Example 2
Input: "evaluate the function timesince with d, now and reversed set to boolean true as arguments, return the result."
YN17: return reversed(d, reversed=now)
RECODE: return timesince(d, now, reversed=now)
Gold: return timesince(d, now, reversed=True)
Example 3
Input: "return an instance of SafeText , created with an argument s converted into a string ."
YN17: return SafeText(bool(s))
RECODE: return SafeText(s)
Gold: return SafeString(str(s))

Related Work
Several works on code generation focus on domain-specific languages (Raza et al., 2015; Kushman and Barzilay, 2013). For general-purpose code generation, some data-driven work has been done for predicting input parsers (Lei et al., 2013) or a set of relevant methods (Raghothaman et al., 2016). Some attempts using neural networks have used sequence-to-sequence models (Ling et al., 2016) or tree-based architectures (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017). Ling et al. (2016), Jia and Liang (2016), and Locascio et al. (2016) treat semantic parsing as a sequence generation task by linearizing trees. The closest works to ours are Yin and Neubig (2017) and Rabinovich et al. (2017), which represent code as an AST. Another close work is Dong and Lapata (2018), which uses a two-stage structure-aware neural architecture. They initially generate a low-level sketch and then fill in the missing information using the NL and the sketch.
Recent works on retrieval-guided neural machine translation have been presented by Gu et al. (2018), Amin Farajian et al. (2017), and Li et al. (2018). Gu et al. (2018) use the retrieved sentence pairs as extra inputs to the NMT model. A simpler and faster retrieval method has also been employed to guide neural MT, where translation pieces are n-grams from retrieved target sentences. We modify this method from textual n-grams to n-grams over subtrees to exploit the structural similarity of code, and propose methods to deal with complex statements and rare words.
In addition, some previous works have used subtrees in structured prediction tasks. For example, Galley et al. (2006) used them in syntax-based translation models. In their approach, subtrees of the input sentence's parse tree are associated with corresponding words in the output sentence.

Conclusion
We proposed an action subtree retrieval method applied at test time on top of an AST-driven neural model for generating general-purpose code. The predicted surface code is syntactically correct, and the retrieval component improves the performance of a previously state-of-the-art model. Our successful result suggests that our idea of retrieval-based generation can potentially be applied to other tree-structured prediction tasks.