Mapping Language to Code in Programmatic Context

Source code is rarely written in isolation. It depends significantly on the programmatic context, such as the class that the code would reside in. To study this phenomenon, we introduce the task of generating class member functions given English documentation and the programmatic context provided by the rest of the class. This task is challenging because the desired code can vary greatly depending on the functionality the class provides (e.g., a sort function may or may not be available when we are asked to “return the smallest element” in a particular member variable list). We introduce CONCODE, a new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develop a new encoder-decoder architecture that models the interaction between the method documentation and the class environment. We also present a detailed error analysis suggesting that there is significant room for future work on this task.


Introduction
Natural language can be used to define complex computations that reuse the functionality of rich, existing code bases. However, existing approaches for automatically mapping natural language (NL) to executable code have considered limited language or code environments. They either assume fixed code templates (i.e., generate only parts of a method with a predefined structure; Quirk et al., 2015), a fixed context (i.e., generate the body of the same method within a single fixed class; Ling et al., 2016), or no context at all (i.e., generate code tokens from the text alone; Oda et al., 2015). In this paper, we introduce new data and methods for learning to map language to source code within the context of a real-world programming environment, with application to generating member functions from documentation for automatically collected Java class environments.
The presence of rich context provided by an existing code environment better approximates the way programmers capitalize on code re-use, and also introduces new language understanding challenges. Models must (a) map the NL to environment variables, library API calls and user-defined methods found elsewhere in the class based on their names, types and signatures, and (b) decide on the structure of the resulting code. For example, in Figure 1, to generate the method inc() from the corresponding NL, "Increment this vector", it is crucial to know of the existence of the class method add(). This helps us decide whether the code should directly call add() or generate the method from scratch by iterating through the vecElements array and incrementing each element. Similarly, for generating the add() method, the code needs to use the class variable vecElements correctly. Overall, the code environment provides rich information relating to the intent of the developer, and can be used to considerably reduce ambiguity in the NL documentation.
To learn such a code generator, we use a specialized neural encoder-decoder model that (a) encodes the NL together with representations based on sub-word units for environment identifiers (member variables, methods) and data types, and (b) decodes the resulting code using an attention mechanism with multiple steps, by first attending to the NL, and then to the variables and methods, thus also learning to copy variables and methods. This two-step attention helps the model to match words in the NL with representations of the identifiers in the environment. Rather than directly generating the output source code tokens (Dong and Lapata, 2016; Iyer et al., 2017), the decoder generates production rules from the grammar of the target programming language, similar to Rabinovich et al. (2017), Yin and Neubig (2017), and Krishnamurthy et al. (2017), and therefore guarantees the syntactic well-formedness of the output.
To train our model, we collect and release CONCODE, a new dataset comprising over 100,000 (class environment, NL, code) tuples by gathering Java files containing method documentation from public Github repositories. This is an order of magnitude larger than existing datasets that map NL to source code for a general purpose language (MTG from Ling et al. (2016) has 13k examples), contains a larger variety of output code templates than existing datasets built for a specific domain, and is the first to condition on the environment of the output code. Also, by design, it contains examples from several domains, thus introducing open-domain challenges of new identifiers during test time (e.g., some class environments are LookupCommand, ColumnFileReader and ImageSequenceWriter). Our model achieves an exact match accuracy of 8.6% and a BLEU score (a metric for partial credit; Papineni et al., 2002) of 22.11, outperforming retrieval and recent neural methods. We also provide an extensive ablative analysis, quantifying the contributions that come from the context and the model, and suggesting that our work opens up various areas for future investigation.

Task Definition
We introduce the task of generating source code from NL documentation, conditioned on the class environment the code resides in. The environment comprises two lists of entities: (1) class member variable names with their data types (for example, double[] vecElements as seen in Figure 2), and (2) member function names together with their return types (for example, void inc()). Formally, let $q^{(i)}$ and $a^{(i)}$ denote the NL and source code respectively for the $i$-th training example, where $a^{(i)}$ is a sequence of production rules that forms the derivation of its underlying source code. The environment comprises a list of variable names $v^{(i)}_{1..|v^{(i)}|}$ and their corresponding types $t^{(i)}_{1..|t^{(i)}|}$, as well as method names $m^{(i)}_{1..|m^{(i)}|}$ and their return types $r^{(i)}_{1..|r^{(i)}|}$. Our goal is to generate the derivation of $a^{(i)}$ given $q^{(i)}$ and the environment (see Figure 2).
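As a concrete illustration of this tuple, the following sketch encodes one example as plain Python data. The class and field names (`Environment`, `Example`, `variables`, `methods`) are hypothetical and do not reflect the released dataset format; the type of weights, the return type of add, and the single production rule shown are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Environment:
    # (type, name) pairs for class member variables, e.g. ("double[]", "vecElements")
    variables: List[Tuple[str, str]] = field(default_factory=list)
    # (return type, name) pairs for the other member methods of the class
    methods: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Example:
    nl: str                # q: the natural language documentation
    env: Environment       # the class environment
    derivation: List[str]  # a: production rules deriving the target code


example = Example(
    nl="Increment this vector",
    env=Environment(
        # The type of weights is assumed here; only vecElements is typed in the text.
        variables=[("double[]", "vecElements"), ("double[]", "weights")],
        # The return type of add is assumed here.
        methods=[("void", "add")],
    ),
    # Placeholder rule; a real derivation is a long sequence of grammar rules.
    derivation=["MemberDeclaration -> MethodDeclaration"],
)

print(len(example.env.variables), len(example.env.methods))
```

Keeping types and names as separate fields mirrors the task definition, where variable names, variable types, method names and return types are four distinct lists.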

Models
We evaluate a number of encoder-decoder models that generate source code derivations from NL and the class environment. Our best model encodes all environment components, broken down into sub-word units (Sennrich et al., 2016), separately using Bi-LSTMs, and decodes these contextual representations to produce a sequence of valid production rules that derive syntactically valid source code. The decoder also uses a two-step attention mechanism to match words in the NL with environment components, and then uses a supervised copy mechanism (Gu et al., 2016a) to incorporate environment elements in the resulting code. We describe this architecture below.

Encoder
The encoder computes contextual representations of the NL and each component in the environment. Each word of the NL, $q_i$, is embedded into a high-dimensional space using the identifier matrix $I$ (denoted as $\mathbf{q}_i$), followed by the application of an n-layer bidirectional LSTM (Hochreiter and Schmidhuber, 1997). The hidden states of the last layer ($h_1, \cdots, h_z$) are passed on to the attention layer, while the hidden states at the last token are used to initialize the decoder.
To encode variables and methods, the variable types ($t_i$) and method return types ($r_i$) are embedded using a type matrix $T$ (denoted as $\mathbf{t}_i$ and $\mathbf{r}_i$). To encode the variable and method names ($v_i$, $m_i$), they are first split based on camel-casing, and each component is embedded using $I$, represented as $\mathbf{v}_{i1}, \ldots, \mathbf{v}_{ij}$ and $\mathbf{m}_{i1}, \ldots, \mathbf{m}_{ik}$. The encoded representation of the variable and method names is the final hidden state of the last layer of a Bi-LSTM over these embeddings ($\mathbf{v}_i$ and $\mathbf{m}_i$). Finally, a 2-step Bi-LSTM is executed on the concatenation of the variable type embedding and the variable name encoding. The corresponding hidden states form the final representations of the variable type and the variable name ($\hat{t}_i$, $\hat{v}_i$) and are passed on to the attention mechanism. Method return types and names are processed identically using the same Bi-LSTMs and embedding matrices ($\hat{r}_i$, $\hat{m}_i$).
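The camel-case splitting step can be sketched as follows. This is a minimal illustration, not the preprocessing code used for the experiments: the function name `split_camel_case` and the particular regular expression (including its treatment of acronyms and digits) are assumptions, and the lower-casing follows the description in the experimental setup.

```python
import re
from typing import List


def split_camel_case(identifier: str) -> List[str]:
    """Split an identifier into lower-cased sub-word units at camel-case boundaries."""
    # Alternatives: an acronym run (e.g. "XML"), a capitalized or lower-case word, or digits.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [part.lower() for part in parts]


print(split_camel_case("vecElements"))
print(split_camel_case("ImageSequenceWriter"))
```

Each resulting unit would then be looked up in the identifier embedding matrix before the name-level Bi-LSTM is applied.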
Figure 3 shows an example of the encoder.

Decoder
We represent the source code to be produced as a sequence of production rules ($a_t$ at step $t$), with a non-terminal $n_t$ on the left-hand side and a combination of terminal and non-terminal symbols on the right-hand side (see Figure 2). The first non-terminal is MemberDeclaration. Subsequently, every non-terminal is expanded in a depth-first, left-to-right fashion, similar to Yin and Neubig (2017). The probability of a source code snippet is decomposed as a product of the conditional probabilities of generating each step in the sequence of rules, conditioned on the previously generated rules. Our decoder is an LSTM-based RNN that produces a context vector $c_t$ at each time step, which is used to compute a distribution over next actions.
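The distribution over next actions, referred to later as equation 1, can be sketched as a softmax restricted to the rules that expand $n_t$. The form below is a reconstruction from the definitions of $W^{n_t}$ and $c_t$ in the surrounding text and should be read as an assumption rather than a verbatim equation:

```latex
p(a_t \mid a_{<t}) \propto \exp\left( W^{n_t} c_t \right) \tag{1}
```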
Here, $W^{n_t}$ is a $|n_t| \times H$ matrix, where $|n_t|$ is the total number of unique production rules that $n_t$ can be expanded to. The context vector $c_t$ is computed using the hidden state $s_t$ of an n-layer decoder LSTM cell and attention vectors over the NL and the context ($z_t$ and $e_t$), as described below.
Decoder LSTM The decoder uses an n-layer LSTM whose hidden state $s_t$ is computed based on the current non-terminal $n_t$ to be expanded, the previous production rule $a_{t-1}$, the parent production rule $\mathrm{par}(n_t)$ that produced $n_t$, the previous decoder LSTM state $s_{t-1}$, and the decoder state of the LSTM cell that produced $n_t$, denoted as $s_{n_t}$.
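Given the dependencies just listed, the state update can be written compactly as below. The name $\mathrm{LSTM}_f$ is a label assumed here for the n-layer decoder cell; only the list of arguments is taken from the text:

```latex
s_t = \mathrm{LSTM}_f\left( n_t,\; a_{t-1},\; \mathrm{par}(n_t),\; s_{t-1},\; s_{n_t} \right)
```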
We use an embedding matrix $N$ to embed $n_t$ and a matrix $A$ to embed $a_{t-1}$ and $\mathrm{par}(n_t)$. If $a_{t-1}$ is a rule that generates a terminal node that represents an identifier or literal, it is represented using a special rule IdentifierOrLiteral to collapse all these rules into a single previous rule.
Two-step Attention At time step $t$, the decoder first attends to every token in the NL representation, $h_i$, using the current decoder state, $s_t$, to compute a set of attention weights $\alpha_t$, which are used to combine $h_i$ into an NL context vector $z_t$. We use a general attention mechanism (Luong et al., 2015), extended to perform multiple steps.
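For the first attention step, the general (bilinear) scoring function of Luong et al. (2015) gives the following form. The weight matrix symbol $F$ is an assumption introduced here for illustration; the remaining symbols follow the text:

```latex
\alpha_{t,i} = \frac{\exp\left( s_t^{\top} F h_i \right)}{\sum_{i'} \exp\left( s_t^{\top} F h_{i'} \right)},
\qquad
z_t = \sum_{i} \alpha_{t,i} \, h_i
```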
The context vector $z_t$ is used to attend over every type (return type) and variable (method) name in the environment, to produce attention weights $\beta_t$ that are used to combine the entire context $x = [t : v : r : m]$ (where ":" denotes concatenation) into an environment context vector $e_t$. Finally, $c_t$ is computed using the decoder state and both context vectors $z_t$ and $e_t$.
Supervised Copy Mechanism Since the class environment at test time can belong to previously unseen new domains, our model needs to learn to copy variables and methods into the output. We use the copying technique of Gu et al. (2016a) to compute a copy probability at every time step $t$ using a vector $b$ of dimensionality $H$.
Since we only require named identifiers or user-defined types to be copied, both of which are produced by production rules with $n_t$ as IdentifierNT, we make use of this copy mechanism only in this case. Identifiers can be generated by directly generating derivation rules (see equation 1), or by copying from the environment. The probability of copying an environment token $x_j$ is set to be the attention weight $\beta_{t,j}$ computed earlier, which attends exactly on the environment types and names that we wish to be able to copy. The copying process is supervised by preprocessing the production rules to recognize identifiers that can be copied from the environment, and the generation and copy probabilities are weighted by $1 - \mathrm{copy}(t)$ and $\mathrm{copy}(t)$ respectively. The LSTM decoder with attention mechanism is illustrated in Figure 4.
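The two attention steps and the copy weighting can be sketched in plain Python as below. This is a toy illustration under stated assumptions: the function names, the weight matrices `f` and `g`, and the small hand-picked vectors are all hypothetical, and the copy probability is passed in as a number rather than computed from $c_t$ and $b$ as in the model.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def attend(query: Vector, keys: List[Vector], w: Matrix) -> Tuple[Vector, Vector]:
    """General attention: score_i = query^T W key_i, context = weighted sum of keys."""
    weights = softmax([dot(query, matvec(w, k)) for k in keys])
    dim = len(keys[0])
    context = [sum(a * k[d] for a, k in zip(weights, keys)) for d in range(dim)]
    return weights, context


def two_step_attention(s_t, nl_states, env_states, f, g):
    alpha, z_t = attend(s_t, nl_states, f)   # step 1: decoder state attends over NL tokens
    beta, e_t = attend(z_t, env_states, g)   # step 2: NL context attends over the environment
    return alpha, z_t, beta, e_t


def mix_identifier_probs(gen_probs: Vector, beta: Vector, copy_prob: float):
    """Weight generation by 1 - copy(t) and copying (attention weights beta) by copy(t)."""
    generated = [(1.0 - copy_prob) * p for p in gen_probs]
    copied = [copy_prob * b for b in beta]
    return generated, copied


identity = [[1.0, 0.0], [0.0, 1.0]]
s_t = [1.0, 0.0]
nl_states = [[1.0, 0.0], [0.0, 1.0]]
env_states = [[2.0, 0.0], [0.0, 2.0]]
alpha, z_t, beta, e_t = two_step_attention(s_t, nl_states, env_states, identity, identity)
generated, copied = mix_identifier_probs([0.5, 0.5], beta, copy_prob=0.3)
print(alpha, beta)
```

Because the second query is $z_t$ rather than $s_t$, the environment weights depend on which NL words were attended to, which is the interaction the two-step design is meant to capture.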

Baseline Models
Retrieval We evaluate a retrieval baseline, where the output source code for a test example is the training example source code whose NL is closest in terms of cosine similarity to the test NL using a tf-idf representation. We then replace all occurrences of environment variables and methods in the chosen training source code with similarly typed variables and methods from the environment of the test example, and we break ties randomly.
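A minimal sketch of the nearest-neighbor lookup is shown below. The function names and the smoothed idf formula are assumptions, the member substitution step is omitted, and the sketch returns the first best match instead of breaking ties randomly. The code strings are placeholders.

```python
import math
from collections import Counter
from typing import Dict, List


def build_idf(train_docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency computed on the training NL."""
    n = len(train_docs)
    df = Counter(tok for doc in train_docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def vectorize(doc: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    # Tokens unseen in training carry no weight.
    return {tok: tf * idf[tok] for tok, tf in Counter(doc).items() if tok in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    num = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0


def retrieve(test_nl: List[str], train_nls: List[List[str]], train_code: List[str]) -> str:
    idf = build_idf(train_nls)
    query = vectorize(test_nl, idf)
    sims = [cosine(query, vectorize(doc, idf)) for doc in train_nls]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_code[best]


train_nls = [["increment", "this", "vector"], ["return", "the", "smallest", "element"]]
train_code = ["code_a", "code_b"]  # placeholders for training method bodies
result = retrieve(["increment", "the", "vector"], train_nls, train_code)
print(result)
```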

Seq2seq We evaluate a Seq2seq baseline by representing the NL and context as a sequence formed by the concatenation of the NL, the variables and the methods, with separators between them. The variables (methods) are represented with the type (return type) followed by the name, with a different separator between them. The encoder is an n-layer LSTM which initializes an LSTM-based decoder using its final hidden states. The decoder uses an attention mechanism (Luong et al., 2015) over the encoder states to produce a conditional distribution over the next source code token (not production rule) given all the previous tokens. We replace UNK tokens in the output with the source tokens having the most attention weight. We also attempted to evaluate the Seq2Tree model of Dong and Lapata (2016), but the redundancy in the model resulted in extremely long output sequences which did not scale. Experiments on a smaller dataset gave comparable results to Seq2seq.
Seq2prod This baseline corresponds to the action sequence model by Yin and Neubig (2017), with a BiLSTM over a sequence representation of the NL and context (same as Seq2seq), and a decoder that learns to generate a derivation of the AST of the underlying source code, similar to our model. The decoder uses the same attention mechanism as Seq2seq; however, it uses supervised copying from the entire input sequence to handle unknown words encountered during testing.

CONCODE
We built a new dataset (CONCODE) from public Java projects on Github that contains environment information together with NL (Javadoc-style method comments) and code tuples. We gather Java files from these repositories, and every method that contains a Javadoc comment is treated as a training example. The Javadoc comment forms the NL, the method body is the target code to be generated, and all member variables, as well as other member method signatures, are extracted and stored as part of the context. The Javadoc is preprocessed by stripping special symbols such as @link, @code, @inheritDoc and special fields such as @params. Methods that do not parse are eliminated, and the rest are preprocessed by renaming locally defined variables canonically, beginning at loc0, and similarly for arguments, starting with arg0. We also replace all method names with the word function, since this does not affect the semantics of the resulting program. Generating informative method names has been studied by Allamanis et al. (2015a). We replace all string literals in the code with constants, as they are often debug messages. Finally, we use an ANTLR Java grammar that is post-processed by adding additional non-terminals and rules to eliminate wildcard symbols in the grammar, in order to convert the source code into a sequence of production rules.
The resulting dataset contains 100,000 examples for training, and 2000 examples each for development and testing. Table 1 summarizes the various statistics. We observe that on average, an environment contains ∼5 variables and ∼11 methods. Around 68% of the target code uses class member variables, and 16% uses member methods, from the environment. Based on a frequency cutoff of 2 on the training set, we find that 7% of the types in the development set code are unknown, hence they need to be copied from the environment. Since CONCODE is extracted from a diverse set of online repositories, there is a high variety of code templates in the dataset compared to existing datasets. For example, a random baseline on the Hearthstone card game dataset (Ling et al., 2016) gives a BLEU score of 40.3, but only a score of 10.2 on CONCODE. We plan to release all code and data from this work.

Experimental Setup
We restrict all models to examples where the length of the combination of the NL and the context is at most 200 tokens and the length of the output source code is at most 150 tokens. Source NL tokens are lower-cased; camel-cased identifiers are split and lower-cased, and are used together with the original version. The vocabulary for identifiers uses a frequency threshold of 7, resulting in a vocabulary of 32,600 tokens. The types vocabulary uses a threshold of 2, resulting in 22,324 types. We include all 153 non-terminals and 342 previous rules. We use a threshold of 2 for output production rules to filter out the long tail of rules creating identifiers and literals, resulting in 18,135 output rules. Remaining values are replaced with the UNK symbol.
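The frequency-threshold filtering can be sketched as follows. The helper names are hypothetical, and treating the threshold as inclusive (a token is kept when its count is at least the threshold) is an assumption, since the text does not say whether the cutoff is inclusive.

```python
from collections import Counter
from typing import Iterable, List, Set

UNK = "UNK"


def build_vocab(tokens: Iterable[str], threshold: int) -> Set[str]:
    """Keep tokens whose training frequency is at least `threshold` (assumed inclusive)."""
    counts = Counter(tokens)
    return {tok for tok, count in counts.items() if count >= threshold}


def apply_vocab(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Replace out-of-vocabulary tokens with the UNK symbol."""
    return [tok if tok in vocab else UNK for tok in tokens]


type_vocab = build_vocab(["int", "int", "String", "ExecutionDataStore"], threshold=2)
mapped = apply_vocab(["int", "ExecutionDataStore"], type_vocab)
print(mapped)
```

Types that fall below the cutoff, such as ExecutionDataStore in this toy input, are exactly the ones the copy mechanism must recover from the environment.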
Hyperparameters We use an embedding size H of 1024 for identifiers and types. All LSTM cells use 2 layers and a hidden dimensionality of 1024 (512 in each direction for BiLSTMs). We use an embedding size of 512 for encoding non-terminals and rules in the decoder. We use dropout with p = 0.5 in between LSTM layers and at the output of the decoder over $c_t$. We train our model for 30 epochs using mini-batch gradient descent with a batch size of 20, and we use Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001 for optimization. We decay our learning rate by 80% based on performance on the development set after every epoch.

Inference and Metrics
Inference is done by first encoding the NL and context of the test example. We maintain a stack of symbols starting with the non-terminal MemberDeclaration, and at each step, a non-terminal (terminals are added to the output) is popped off the stack to run a decoding step to generate the next set of symbols to push onto the stack. The set of terminals generated along the way forms the output source code. We use beam search and maintain a ranked list of partial derivations (or code tokens for Seq2seq) at every step to explore alternate high-probability derivations. We use a beam size of 3 for all neural models. We copy over source tokens whenever preferred by the model output. We restrict the output to 150 tokens or 500 production rules.
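The stack-based expansion can be illustrated with a toy grammar. Everything in this sketch apart from the start symbol MemberDeclaration and the 500-rule limit is hypothetical: the grammar rules are invented for illustration, the `choose` callback stands in for the decoder's rule distribution, and the loop is greedy rather than a beam search.

```python
from typing import Callable, Dict, List

# Toy grammar: non-terminal -> candidate right-hand sides (hypothetical rules).
GRAMMAR: Dict[str, List[List[str]]] = {
    "MemberDeclaration": [["Type", "Name", "(", ")", "Block"]],
    "Type": [["void"], ["int"]],
    "Name": [["function"]],
    "Block": [["{", "}"]],
}


def decode(choose: Callable[[str, List[List[str]]], int],
           start: str = "MemberDeclaration",
           max_rules: int = 500) -> List[str]:
    stack = [start]
    output: List[str] = []
    rules_used = 0
    while stack:
        symbol = stack.pop()
        if symbol not in GRAMMAR:
            output.append(symbol)        # terminals go straight to the output
            continue
        if rules_used >= max_rules:
            break                        # stop once the rule budget is exhausted
        options = GRAMMAR[symbol]
        rhs = options[choose(symbol, options)]
        rules_used += 1
        stack.extend(reversed(rhs))      # reversed so the leftmost symbol is popped first
    return output


tokens = decode(lambda non_terminal, options: 0)  # always pick the first rule
print(" ".join(tokens))
```

Pushing the right-hand side in reverse order is what yields the depth-first, left-to-right expansion described for the decoder.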
To evaluate the quality of the output, we use exact match accuracy between the reference and generated code. As a measure of partial credit, we also compute the BLEU score (Papineni et al., 2002), following recent work on code generation (Ling et al., 2016; Yin and Neubig, 2017). BLEU is an n-gram precision-based metric that will be higher when more subparts of the predicted code match the provided reference.
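To make the two metrics concrete, the sketch below computes exact match and the clipped n-gram precision that BLEU is built from. It is not a full BLEU implementation: BLEU additionally combines the precisions for several n-gram orders with a geometric mean and applies a brevity penalty. The function names and the token sequences are illustrative assumptions.

```python
from collections import Counter
from typing import List


def exact_match(pred: List[str], ref: List[str]) -> bool:
    return pred == ref


def modified_ngram_precision(pred: List[str], ref: List[str], n: int) -> float:
    """Clipped n-gram precision: each predicted n-gram counts at most as often as in the reference."""
    pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(pred_ngrams.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_ngrams[gram]) for gram, count in pred_ngrams.items())
    return clipped / total


ref = "return vecElements [ 0 ] ;".split()
pred = "return weights [ 0 ] ;".split()
p1 = modified_ngram_precision(pred, ref, 1)
p2 = modified_ngram_precision(pred, ref, 2)
print(exact_match(pred, ref), p1, p2)
```

In this toy case a single wrong identifier gives zero exact match credit while the n-gram precisions remain high, which is why BLEU is reported as a measure of partial credit.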

Results
We present results for the context-based code generation task on the test and dev sets in Table 2. Our model outperforms all baselines and achieves a gain of 1.95% exact match accuracy and 0.82 BLEU points with respect to the next best models. The combination of independently encoding sub-word units and applying a two-step attention mechanism helps the model learn to better associate the correct variables/methods from the context and the language in the NL. Figure 5 (a) shows an example output of our model, which produces code structure intermixed with member variables (tags). In (b), our model learns to call the method addUnderscores (an UNK in the vocabulary) with its correct return type (String). Similarly, in (d), our model also successfully learns to use a previously unseen type (ExecutionDataStore) when making use of the corresponding variable. (c) is an example where the NL does not directly refer to the variable to be used. The mismatch between dev and test results arises because we ensure that the dev and test examples come from non-overlapping Github repositories, resulting in different distributions.
Using a constrained decoder that generates syntax tree rules instead of tokens leads to significant gains in terms of exact match score (6.65 for Seq2prod vs. 3.2 for Seq2seq), and shows that this is an important component of code generation systems. Seq2prod, however, fails on examples (b)-(d), since it is harder to learn to match the NL tokens with environment elements. Finally, all neural models outperform the retrieval baseline with member substitution.
To understand which components of the data and the model contribute most to the output, we perform an ablation study on the development set (Table 3). Removing the variables leads to a significant hit in exact match accuracy since 68% of examples (Table 1) refer to class variables; a similar but smaller reduction is incurred by removing methods. The presence of these components makes this task more challenging and also more aligned with programming scenarios in practice. Removing the two-step attention mechanism leads to a 1.3% drop in accuracy since the attention on the NL is unable to interact with the attention on the environment variables/methods. Removing camel-case encoding leads to a small drop, mainly because many variable (method) names are single words.

Error Analysis
Subfigures 5(e)-(i) show cases where our model output did not exactly match the reference. In (e)-(g), the model output is semantically equivalent to the reference and is a very reasonable prediction in a practical setting. For example, in (e) the only difference between the prediction and the reference is the string concatenations to the url. Interestingly, in example (f) the prediction is a cleaner way to achieve the same effect as the reference, and this is a great example of the application of these models for suggesting standard and efficient code. Unfortunately, our model is penalized by the exact match metric here. Similarly, in (g), the model uses a generic list (List<?>) in place of the specific type (Transformer[]). Example (h) demonstrates a case where the model is unaware of methods that can be called on class members (specifically that evictAll is a member of the TimestampsRegion class), and requires augmenting the environment with additional member type documentation, which we believe will be an important area for future work. Example (i) calls for richer encoder representations, since our model incorrectly uses the values variable instead of register, as it is unable to associate the word "registry" with the right elements.
We further perform a qualitative analysis of 100 predictions on our development set (Table 4) and find that there is significant room for improvement, with 71% of the predictions differing significantly from their references. 16% of the predictions are very close to their references, with the difference being 1-2 tokens, while 11% are exactly correct. 2% of the predictions were semantically equivalent but not exactly equal to their references.

Related Work
There is significant existing research on mapping NL directly to executable programs in the form of logical forms (Zettlemoyer and Collins, 2005), λ-DCS (Liang et al., 2013), regular expressions (Kushman and Barzilay, 2013; Locascio et al., 2016), database queries (Iyer et al., 2017; Zhong et al., 2017) and general purpose programs (Balog et al., 2016; Allamanis et al., 2015b). Ling et al. (2016) generate Java and Python source code from NL for card games, conditioned on categorical card attributes. Manshadi et al. (2013) generate code based on input/output examples for applications such as database querying. Gu et al. (2016b) use neural models to map NL queries to a sequence of API calls, and Neelakantan et al. (2015) augment neural models with a small set of basic arithmetic and logic operations to generate more meaningful programs. In this work, we introduce a new task of generating programs from NL based on the environment in which the generated code resides, following the frequently occurring pattern in large repositories where the code depends on the types and availability of variables and methods in the environment.
Neural encoder-decoder models have proved effective in mapping NL to logical forms and also for directly producing general purpose programs. Ling et al. (2016) use a sequence-to-sequence model with attention and a copy mechanism to generate source code. Instead of directly generating a sequence of code tokens, recent methods focus on constrained decoding mechanisms to generate syntactically correct output using a decoder that is either grammar-aware or has a dynamically determined modular structure paralleling the structure of the abstract syntax tree (AST) of the code (Dong and Lapata, 2016; Rabinovich et al., 2017; Krishnamurthy et al., 2017; Yin and Neubig, 2017).
Our model also uses a grammar-aware decoder similar to Yin and Neubig (2017) to generate syntactically valid parse trees, augmented with a two-step attention mechanism (Chen et al., 2016), followed by a supervised copying mechanism (Gu et al., 2016a) over the class environment.
Recent models for mapping NL to code have been evaluated on datasets containing highly templated code for card games (Hearthstone & MTG; Ling et al., 2016), or manually labeled per-line comments (DJANGO; Oda et al., 2015). These datasets contain ∼20,000 programs with short textual descriptions possibly paired with categorical data, whose values need to be copied onto the resulting code from a single domain. In this work, we collect a new dataset of over 100,000 NL and code pairs, together with the corresponding class environment. Each environment and NL describe a specific domain, and the dataset comprises thousands of different domains, which poses additional challenges. Having an order of magnitude more data than existing datasets makes training deep neural models very effective, as we saw in the experimental evaluation. While massive amounts of Github code have been used before for creating datasets on source code only (Allamanis and Sutton, 2013, 2014; Allamanis et al., 2016), we instead extract from Github a dataset of NL and code with an emphasis on context, in order to learn to map NL to code within a class.

Conclusion
In this paper, we introduce new data and methods for learning to generate source code from language within the context of a real-world code base. To train models for this task, we collect and release CONCODE, a large new dataset of NL, code, and context tuples from online repositories, featuring code from a variety of domains. We also introduce a new encoder-decoder model with a specialized context encoder, which outperforms strong neural baselines by 1.95% exact match accuracy. Finally, analysis suggests that even richer models of programmatic context could further improve these results.

Figure 1: Code generation based on the class environment and method documentation. The figure shows a class where the programmer wants to automatically generate the add method from documentation, assuming the rest of the class is already written. The system needs to understand that vecElements is the vector to be augmented, and that the method must take in a scalar parameter as the element to be added. The model also needs to disambiguate between the member variables vecElements and weights.

Figure 2: Our task involves generating the derivation of the source code of a method based on the NL documentation, class member variables (names and data types), and other class member methods (method names and return types), which form the code environment.

Figure 3: The encoder creates contextual representations of the NL (a), the variables and the methods (b). Variable (method) names are split based on camel-casing and encoded using a BiLSTM. The variable (method) type and name are further contextualized using another BiLSTM.

Figure 4: The hidden state $s_t$ of our decoder is a function of the previous hidden state, current non-terminal, previous production rule, parent rule, and the parent hidden state. $s_t$ is used to attend on the NL and compress it into $z_t$, which is then used to attend over the environment variables and methods to generate $e_t$. The decoder uses all these context vectors to produce a distribution over valid right-hand side values of the current non-terminal, and also learns to copy from the environment.

Figure 5: Analysis of our model output on development set examples. Some environment variables and methods are omitted for space. (a)-(d) represent cases where the model exactly produced the reference output. (e)-(g) are cases where the model output is very reasonable for a practical setting. In (f), the model produces a better solution than the reference. In (h), the context lacks information to produce the reference code. The model chooses the wrong element in (i) and could be improved by better encoder representations.

Table 1: Statistics of our dataset of (NL, context and code)

Table 2: Exact match accuracy and BLEU score (for partial credit) on the test (development) set, comprising 2000 examples from previously unseen repositories.

Table 3: Ablation of model features on the development set.

Table 4: Qualitative distribution of errors on the development set.