Unified Pre-training for Program Understanding and Generation

Code summarization and generation empower conversion between programming language (PL) and natural language (NL), while code translation supports the migration of legacy code from one PL to another. This paper introduces PLBART, a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks. PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding. Experiments on code summarization in English, code generation, and code translation in seven programming languages show that PLBART outperforms or rivals state-of-the-art models. Moreover, experiments on discriminative tasks, e.g., program repair, clone detection, and vulnerable code detection, demonstrate PLBART's effectiveness in program understanding. Furthermore, analysis reveals that PLBART learns program syntax, style (e.g., identifier naming conventions), and logical flow (e.g., an "if" block inside an "else" block is equivalent to an "else if" block), all of which are crucial to program semantics, and thus excels even with limited annotations.


Introduction
Engineers and developers write software programs in a programming language (PL), such as Java or Python, and often use natural language (NL) to communicate with each other. The use of NL in software engineering ranges from writing documentation, commit messages, and bug reports to seeking help in different forums (e.g., Stack Overflow). Automating different software engineering applications, such as source code summarization, generation, and translation, relies heavily on understanding PL and NL; we collectively refer to these as PLUG (Program and Language Understanding and Generation) applications or tasks.

* Equal contribution.

[Figure 1: Example motivating the need to understand the association of program and natural languages for code summarization, generation, and translation. The figure shows Java and Python implementations with the summary "sort a list of tuples by first element".]
Note that the use of NL in software development is quite different from colloquially written and spoken language. For example, NL in software development often contains domain-specific jargon; e.g., when software engineers say "Code Smell", they mean a potential problem in code (something other than "smell" in everyday English). In this work, our goal is to develop a general-purpose model that can be used in various PLUG applications. Recent advancements in deep learning and the availability of large-scale PL and developer NL data ushered in the automation of PLUG applications. One important aspect of PLUG applications is that they demand a profound understanding of program syntax and semantics and of the mutual dependencies between PL and NL. For example, Figure 1 shows two implementations of the same algorithm (sorting) in two PLs and the corresponding NL summary. An automatic translation tool must understand that the function sorted in Python acts similarly to Arrays.sort in Java, and that the lambda operation in Python is equivalent to instantiating a Comparator object in Java. Similarly, a tool that summarizes either of these programs must understand that x[0] in Python or Tuple.get(0) in Java refers to the first element in the tuple list.
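The Python side of Figure 1 can be sketched as follows; the Java version achieves the same behavior by instantiating a Comparator that compares o1.get(0) and o2.get(0). The helper name is ours, for illustration only.

```python
def sort_by_first(tuples):
    # key=lambda x: x[0] plays the role of Java's Comparator over the first element
    return sorted(tuples, key=lambda x: x[0])

print(sort_by_first([(3, "c"), (1, "a"), (2, "b")]))  # → [(1, 'a'), (2, 'b'), (3, 'c')]
```

A summarization model must map both this lambda and the equivalent Java Comparator to the same NL concept, "sort by first element".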
Most of the available data in PL and NL are unlabeled and cannot be trivially used to acquire PLUG task-specific supervision. However, PLUG tasks have a common prerequisite: understanding PL and NL syntax and semantics. Representations of PL and NL learned by pre-training a model on unlabeled data can be transferred across PLUG tasks. This approach reduces the requirement of large-scale annotations for task-specific fine-tuning. In recent years, we have seen a colossal effort to pre-train models on massive amounts of unlabeled data (e.g., text, images, videos) (Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2020; Li et al., 2019) to transfer representation encoders across a wide variety of applications. There are a few research efforts in learning general-purpose PL-NL representation encoders, such as CodeBERT and GraphCodeBERT, that are pre-trained on small-scale bimodal data (code-text pairs). Such models have been found effective for PLUG tasks, including code search, code completion, etc.
Language generation tasks such as code summarization are modeled as sequence-to-sequence learning, where an encoder learns to encode the input code and a decoder generates the target summary. Despite the effectiveness of existing methods, they do not have a pre-trained decoder for language generation; therefore, they still require a large amount of parallel data to train the decoder. To overcome this limitation, prior work proposed denoising sequence-to-sequence pre-training, where a Transformer (Vaswani et al., 2017) learns to reconstruct an original text that is corrupted using an arbitrary noise function. Very recently, denoising pre-training has been studied on a large-scale source code collection, aiming at unsupervised program translation, and the approach was found useful. This raises a natural question: can we unify pre-training for programming and natural language? Presumably, to facilitate such pre-training, we need unlabeled NL text that is relevant to software development. Note that unlike other bimodal scenarios (e.g., vision and language), PL and associated NL text share the same alphabet or use anchor tokens (e.g., "sort", "list", "tuple", as shown in Figure 1) that can help to learn alignment between semantic spaces across languages.

[Table 1: Statistics of the data used to pre-train PLBART. "Nb of documents" refers to the number of functions in Java and Python collected from GitHub and the number of posts (questions and answers) in natural language (English) from StackOverflow.]
We introduce PLBART (Program and Language BART), a bidirectional and autoregressive transformer pre-trained on unlabeled data across PL and NL to learn multilingual representations applicable to a broad spectrum of PLUG applications. We evaluate PLBART on code summarization, generation, translation, program repair, clone detection, and vulnerability detection tasks. Experimental results show that PLBART outperforms or rivals state-of-the-art methods, e.g., CodeBERT and GraphCodeBERT, demonstrating its promise on program understanding and generation. We perform a thorough analysis to demonstrate that PLBART learns program syntax and logical data flow, which are indispensable to program semantics, and excels even when limited annotations are available. We release our code to foster future research.

PLBART
PLBART uses denoising sequence-to-sequence pretraining to utilize unlabeled data in PL and NL. Such pre-training lets PLBART reason about language syntax and semantics. At the same time, PLBART learns to generate language coherently.

Denoising Pre-training
Data & pre-processing We pre-train PLBART on a large collection of Java and Python functions and natural language descriptions gathered from GitHub and StackOverflow, respectively. We download all the GitHub repositories associated with the Java and Python languages available on Google BigQuery. Data statistics are presented in Table 1. We tokenize all the data with a SentencePiece model (Kudo and Richardson, 2018) learned on one-fifth of the pre-training data, with a vocabulary of 50,000 subword tokens.
One key challenge in aggregating data from different modalities is that some modalities have far more data than others; for example, we have 14 times more data in PL than in NL. Therefore, we mix and up/down-sample the data following Conneau and Lample (2019) to alleviate the bias towards PL. We sample instances for pre-training according to a multinomial distribution with probabilities (q_1, q_2, ..., q_N):

q_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}}, \qquad p_i = \frac{n_i}{\sum_{k=1}^{N} n_k},

where N is the total number of languages and n_i is the total number of instances in language i. We set the smoothing parameter α to 0.3.
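The sampling scheme above can be sketched in a few lines; the language counts below are hypothetical, and only the smoothing exponent α = 0.3 comes from the paper.

```python
def sampling_probs(counts, alpha=0.3):
    """Multinomial sampling probabilities q_i with exponent smoothing
    (Conneau and Lample, 2019)."""
    total = sum(counts.values())
    p = {lang: n / total for lang, n in counts.items()}    # raw shares p_i
    w = {lang: p_i ** alpha for lang, p_i in p.items()}    # smoothed weights p_i^alpha
    z = sum(w.values())
    return {lang: w_i / z for lang, w_i in w.items()}      # normalized q_i

# Hypothetical counts where PL data dwarfs NL data:
q = sampling_probs({"java": 900_000, "python": 500_000, "english": 100_000})
```

Because α < 1 flattens the distribution, the low-resource NL modality is up-sampled relative to its raw share, which is exactly the bias correction described above.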
Architecture PLBART uses the same architecture as BART-base: the sequence-to-sequence Transformer architecture (Vaswani et al., 2017), with 6 encoder layers and 6 decoder layers, a model dimension of 768, and 12 attention heads (∼140M parameters). The only exception is that we include an additional layer-normalization layer on top of both the encoder and decoder, which is found to stabilize training with FP16 precision.
Noise function, f In denoising autoencoding, a model learns to reconstruct an input text that is corrupted by a noise function. Reconstructing the original input requires the model to learn language syntax and semantics. In this work, we use three noising strategies: token masking, token deletion, and token infilling. In the first two strategies, random tokens are sampled and either replaced with a mask token or deleted from the input sequence. In token infilling, a number of text spans are sampled, and each span is replaced with a single mask token. The span lengths are drawn from a Poisson distribution (λ = 3.5). We mask 35% of the tokens in each instance. (The StackOverflow data dump is available at https://archive.org/download/stackexchange.)
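The noising procedure can be sketched as below. This is a simplified illustration, not the exact pre-training implementation: it only performs span infilling, treating span length 1 as the token-masking case, and uses a stdlib Poisson sampler.

```python
import math
import random

def sample_poisson(lam):
    # Knuth's algorithm for drawing a Poisson-distributed span length.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= random.random()
        if p <= L:
            return k - 1

def corrupt(tokens, mask_rate=0.35, lam=3.5, mask_token="<mask>"):
    # Replace Poisson(3.5)-length spans with a single <mask> token until
    # roughly 35% of the tokens have been corrupted.
    tokens = list(tokens)
    budget = int(len(tokens) * mask_rate)
    while budget > 0 and tokens:
        span = max(1, min(sample_poisson(lam), budget, len(tokens)))
        start = random.randrange(len(tokens) - span + 1)
        tokens[start:start + span] = [mask_token]  # infilling: whole span -> one mask
        budget -= span
    return tokens
```

Token deletion is the analogous operation with the span dropped instead of replaced.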
Input/Output Format The input to the encoder is a noisy text sequence, while the input to the decoder is the original text with one position offset. A language id symbol (e.g., <java>, <python>) is appended and prepended to the encoder and decoder inputs, respectively. We provide a few examples in Table 2. The input instances are truncated if they exceed a maximum sequence length of 512.
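Concretely, the format can be sketched as follows; the function name and the token-list representation are ours, while the language-id placement and the 512-token limit come from the paper.

```python
MAX_LEN = 512  # maximum sequence length used in pre-training

def format_instance(noisy_tokens, original_tokens, lang_id):
    # The language id (e.g., "<java>", "<python>") is appended to the encoder
    # input and prepended to the decoder input (the one-position offset);
    # overlong instances are truncated.
    encoder_input = (noisy_tokens + [lang_id])[:MAX_LEN]
    decoder_input = ([lang_id] + original_tokens)[:MAX_LEN]
    return encoder_input, decoder_input
```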
Learning PLBART is pre-trained on N languages (in our case, N = 3), where each language i has a collection of unlabeled instances D_i = {x_1, ..., x_{n_i}}. Each instance is corrupted using the noise function f, and we train PLBART to predict the original instance x from f(x). Formally, PLBART is trained to maximize L_θ:

L_θ = \sum_{i=1}^{N} \sum_{j=1}^{m_i} \log P(x_j \mid f(x_j); θ),

where m_i is the number of sampled instances in language i and the likelihood P is estimated following standard sequence-to-sequence decoding.
Optimization We train PLBART on 8 Nvidia GeForce RTX 2080 Ti GPUs for 100K steps, maintaining an effective batch size of 2048 instances. We use Adam (ε = 1e-6, β_2 = 0.98) with a linear learning rate decay schedule for optimization. We start training with a dropout rate of 0.1, reduce it to 0.05 at 50K steps, and to 0 at 80K steps; this is done to help the model better fit the data. In total, pre-training takes 276 hours (11.5 days). All experiments are done using the Fairseq library.

[Table 3: Example inputs to the encoder and decoder for fine-tuning PLBART on sequence generation tasks: source code summarization (S), generation (G), and translation (T).]

Fine-tuning PLBART
We fine-tune PLBART for two broad categories of downstream applications.
Sequence Generation PLBART has an encoder-decoder architecture where the decoder is capable of generating target sequences autoregressively. Therefore, we can directly fine-tune PLBART on sequence generation tasks, such as code summarization, generation, and translation. Unlike in denoising pre-training, the source sequence is given as input to the encoder during fine-tuning, and the decoder generates the target sequence. The source and target sequences can each be a piece of code or a text sequence. Table 3 shows a few example inputs and outputs for PLBART on different generation tasks. Note that PLBART prepends a language id to the decoded sequence; this enables fine-tuning PLBART in a multilingual setting (e.g., code generation in multiple languages).

Sequence Classification We fine-tune PLBART on sequence classification tasks as follows. The input sequence is fed into both the encoder and decoder. For a pair of inputs, we concatenate them with a special token ("</s>") inserted between them. Another special token is added at the end of the input sequence, and this last token's representation from the final decoder layer is fed into a linear classifier for prediction.
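The classification input format described above can be sketched as follows; the helper name is ours.

```python
def format_pair(tokens_a, tokens_b, sep="</s>"):
    # Concatenate a pair of inputs with "</s>" between them and append a final
    # special token; the final decoder layer's representation of that last
    # token is fed to a linear classifier.
    return tokens_a + [sep] + tokens_b + [sep]
```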
Optimization We fine-tune PLBART for a maximum of 100K steps on all the downstream tasks, with 2500 warm-up steps. We set the maximum learning rate, effective batch size, and dropout rate to 3e-5, 32, and 0.1, respectively. The final models are selected based on validation BLEU (in generation tasks) or accuracy (in classification tasks). We do not perform multilingual fine-tuning in this work. Fine-tuning is carried out on one Nvidia GeForce RTX 2080 Ti GPU.

Experiment Setup
To understand PLBART's performance in a broader context, we evaluate PLBART on several tasks. Our evaluation focuses on assessing PLBART's ability to capture rich semantics in source code and associated natural language text.

Evaluation Tasks
We divide the evaluation tasks into four categories; the evaluation task datasets are summarized in Table 4. We use the public datasets provided by CodeXGLUE and the corresponding train-validation-test splits for all the tasks.
Code Summarization refers to the task of generating a natural language (English) summary from a piece of code. We fine-tune PLBART on summarizing source code written in six different programming languages, namely, Ruby, Javascript, Go, Python, Java, and PHP.
Code Generation is exactly the opposite of code summarization: it refers to the task of generating code (in a target PL) from its NL description. We fine-tune PLBART on the Concode dataset, where the input is text describing a Java class member function along with the class environment, and the output is the target function.
Code Translation requires a model to generate an equivalent code in the target PL from input code written in the source PL. Note that the source and target PL can be the same; hence, we consider two types of tasks in this category. The first is a typical PL translation task: translating code from Java to C#, and vice versa. In this task, the semantic meaning of the translated code should exactly match that of the input code; thus, it evaluates PLBART's understanding of program semantics and syntax across PLs. The second task is program repair, where the input is a buggy code and the output is a modified version of the same code that fixes the bug. This task helps us assess PLBART's ability to understand code semantics and apply semantic changes to code.
Code Classification aims at predicting a target label given a single source code or a pair of source codes. We evaluate PLBART on two classification tasks. The first is clone detection: given a pair of codes, the goal is to determine whether they are clones of each other (similar to paraphrase detection in NLP). The second is detecting whether a piece of code is vulnerable. This task helps us gauge PLBART's effectiveness in program understanding in an unseen PL, since the code examples in this task are written in C/C++.

Evaluation Metrics
BLEU computes the n-gram overlap between a generated sequence and a collection of references. We use the corpus-level BLEU (Papineni et al., 2002) score for all the generation tasks, except for code summarization, where we use the smoothed BLEU-4 score (Lin and Och, 2004) following Feng et al. (2020).
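A minimal sentence-level smoothed BLEU-4 in the spirit of Lin and Och (2004) can be sketched as follows; this is our simplified re-implementation (add-one smoothing on higher-order n-gram precisions), not the exact evaluation script used in the experiments.

```python
import math
from collections import Counter

def smoothed_bleu4(hypothesis, reference):
    # Inputs are token lists; returns a score in [0, 1].
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, 5):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        match = sum(min(c, ref[g]) for g, c in hyp.items())
        total = max(sum(hyp.values()), 1)
        if n > 1:                       # add-one smoothing for n > 1
            match, total = match + 1, total + 1
        if match == 0:
            return 0.0
        log_precisions.append(math.log(match / total))
    # Brevity penalty discourages overly short hypotheses.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(sum(log_precisions) / 4)
```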
CodeBLEU is a metric for measuring the quality of the synthesized code (Ren et al., 2020). Unlike BLEU, CodeBLEU also considers grammatical and logical correctness based on the abstract syntax tree and the data-flow structure.
Exact Match (EM) evaluates if a generated sequence exactly matches the reference.
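At the corpus level, EM reduces to a simple ratio; the helper name is ours.

```python
def exact_match(hypotheses, references):
    # Fraction of generated sequences identical to their references.
    assert len(hypotheses) == len(references)
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)
```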

Baseline Methods
We compare PLBART with several state-of-the-art models and broadly divide them into two categories: first, models that are trained on the evaluation tasks from scratch, and second, models that are pre-trained on unlabeled corpora and then fine-tuned on the evaluation tasks.

Training from Scratch
Seq2Seq (Luong et al., 2015) is an LSTM-based sequence-to-sequence model with an attention mechanism. Its vocabulary is constructed using byte-pair encoding.
Transformer (Vaswani et al., 2017) is the base architecture of PLBART and the other pre-trained models. The Transformer baseline has the same number of parameters as PLBART; hence, a comparison with this baseline demonstrates the direct usefulness of pre-training PLBART.

Pre-trained Models
As described in Section 2, PLBART consists of an encoder and an autoregressive decoder. We compare PLBART against two categories of pre-trained models. First, encoder-only models (e.g., RoBERTa, CodeBERT, and GraphCodeBERT) that are combined with a randomly initialized decoder for task-specific fine-tuning. The second category of baselines includes decoder-only models (CodeGPT) that can perform generation autoregressively. CodeBERT combines masked language modeling (MLM) (Devlin et al., 2019) with a replaced token detection objective (Clark et al., 2020) to pre-train a Transformer encoder.
GraphCodeBERT is concurrent work with this research that improves CodeBERT by modeling the data flow edges between code tokens. We report GraphCodeBERT's performance directly from the paper, since its implementation is not publicly available yet.

Results & Analysis
We aim to address the following questions:
1. Does PLBART learn strong program and language representations from unlabeled data?
2. Does PLBART learn program characteristics, e.g., syntax, style, and logical data flow?
3. How does PLBART perform in an unseen language with limited annotations?

On code summarization, the significant performance improvement indicates that PLBART learns better generic program semantics. In contrast, PLBART performs poorly on the PHP language; the potential reason is a syntax mismatch between the pre-trained languages and PHP. Surprisingly, RoBERTa performs better than PLBART on PHP. We suspect that since RoBERTa is pre-trained on natural language only, it does not suffer from the syntax mismatch issue. Overall, in comparison to the Transformer baseline, PLBART improves by an average of 2.76 BLEU-4, and we credit this improvement to the pre-training step.

Table 6 shows the evaluation results on code generation from NL descriptions. PLBART outperforms all the baselines in terms of BLEU and CodeBLEU. While CodeGPT-adapted achieves the best Exact Match (EM) score, PLBART outperforms it by a large margin in terms of CodeBLEU. This result implies that PLBART generates significantly more syntactically and logically correct code than all the baselines. Figure 2 shows an example of code generated by PLBART; the difference between the reference code and the generated code lies in line 6 onward.

[Table 6: Results on text-to-code generation task using the CONCODE dataset.]

[Table 7: Results on source code translation using the Java and C# language dataset. PBSMT refers to phrase-based statistical machine translation, using the default settings of the Moses decoder (Koehn et al., 2007). The training data is tokenized using the RoBERTa tokenizer.]

Code Generation
[Figure 2: An example of code generated by PLBART that is syntactically and semantically valid but does not match the reference. Input text: "returns the count to which the specified key is mapped in this frequency counter, or 0 if the map contains no mapping for this key."]

In the reference code, loc0 is returned directly, whereas in the generated code the same loc0 is returned inside an else block. If we look closely, in the reference code, line 6 executes only if the condition in line 3 (i.e., loc0 == null) is false. In the generated code, loc0 is likewise returned only if the condition in line 3 is false, making the generated code semantically equivalent to the reference code.
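The equivalence discussed above can be illustrated with a hypothetical Python analogue (the variable name loc0 follows Figure 2; the logic is simplified):

```python
def reference_style(loc0):
    if loc0 is None:
        return 0
    return loc0        # reached only when the condition above is False

def generated_style(loc0):
    if loc0 is None:
        return 0
    else:
        return loc0    # same semantics, different surface form
```

Both functions return the same value for every input, which is why such a surface mismatch with the reference does not make the generated code wrong.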
To study whether PLBART learns code syntax and logical flow during pre-training or fine-tuning, we perform an ablation study where we use subsets of the training examples (10K, 20K, and 50K) to fine-tune PLBART on this task. As shown in prior works (Yin and Neubig, 2017), generating syntactically and logically correct code has been a big challenge in program generation. We conjecture that PLBART's large-scale denoising sequence-to-sequence pre-training helps it understand program syntax and logical flow, and therefore enables PLBART to generate syntactically and logically valid code.

Table 7 presents the evaluation results on code translation. PLBART outperforms all the baselines w.r.t. EM, BLEU, and CodeBLEU, improving over CodeBERT by 9.5% and 10.5% when translating from Java to C# and from C# to Java, respectively. Although PLBART is not pre-trained on the C# language, there is significant syntactic and semantic similarity between Java and C#; thus, PLBART understands C# language syntax and semantics. However, such similarities are non-trivial, which makes the Naive copy and PBSMT baselines perform very poorly on both translation tasks. Figure 3 shows an example where PLBART's generated C# code does not exactly match the reference, yet the two are semantically equivalent. In the reference, the else block (lines 4-9) is equivalent to the else if block (lines 4-7) in the generated code. In addition, start is generated as a function parameter and used in the function body, equivalent to start_1 in the reference code. This further corroborates PLBART's syntactic understanding and its ability to reason about the data flow in source code. We present more qualitative examples in the Appendix.

[Figure 3: Example C# code generated by PLBART that does not exactly match the reference code.]

Code Translation
In the program repair task, both the input and the output are in the same language: the input is a buggy code, and the output should be the target bug-free code. Thus, in this task, exact match is the critical metric. As shown in Table 8, PLBART generates 17.13% and 74.03% more correct bug fixes than CodeBERT on the Java small and Java medium datasets, respectively. On the other hand, PLBART performs comparably to GraphCodeBERT, which uses structure-aware pre-training to learn program syntax and semantics.

Classification
In both the clone detection and vulnerability detection tasks, PLBART outperforms CodeBERT. We present the results in Table 9. In the vulnerability detection task, code semantics is the most critical feature. Since PLBART is not pre-trained on the C/C++ language, its improved performance compared to the Transformer baseline is a testament that PLBART can identify semantics beyond the specifics of a language's syntax. Moreover, PLBART's improved performance over CodeBERT and GraphCodeBERT confirms its effectiveness in program understanding in addition to its generation ability.
We acknowledge that neither PLBART nor CodeBERT is state-of-the-art in vulnerability detection, as graph-based models perform best on this task. In this evaluation, our goal is to study how well PLBART understands program semantics in an unseen language on a different type of task (i.e., classification rather than generation).

Related Work
Pre-training for Language Understanding and Generation The Transformer (Vaswani et al., 2017), a sequence-to-sequence architecture that includes an encoder and decoder, has shown tremendous promise in natural language processing (NLP), computer vision, software engineering, and more. Devlin et al. (2019) first proposed to pre-train a large Transformer architecture, called BERT, to learn representations of natural language using large-scale unlabeled data in a self-supervised fashion. Later, BERT's task-independent pre-training approach was rigorously studied (Devlin et al., 2019; Solaiman et al., 2019; Feng et al., 2020; Li et al., 2020). While BERT-like models have shown effectiveness in learning contextualized representations, they are not very useful in generation tasks. GPT-style models (Radford et al., 2018) improve upon BERT for generative tasks with autoregressive pre-training; however, unlike BERT, they are not bidirectional. Subsequent work introduced BART, a denoising autoencoder that uses a bidirectional encoder and an autoregressive decoder. Similar to BART, PLBART uses denoising pre-training to cope with generative tasks and learns multilingual representations of programming and natural language jointly.

Conclusion
This paper presents PLBART, a sizeable pre-trained sequence-to-sequence model that can perform program and language understanding and generation tasks. PLBART achieves state-of-the-art performance on various downstream software engineering tasks, including code summarization, code generation, and code translation. Furthermore, experiments on discriminative tasks establish PLBART's effectiveness in program understanding. We also show that PLBART learns crucial program characteristics due to pre-training, such as syntax, identifier naming conventions, and data flow. In the future, we want to explore ways to fine-tune PLBART on all the downstream tasks jointly.

Broader Impact
Automation in software engineering is paramount for increasing programmers' productivity. Reducing the tedious work that is part of developers' daily routine would give them more time to solve significant problems for society's well-being. There are numerous program-and-language applications in the software development lifecycle, such as code documentation/summarization, code synthesis, and translating code across languages, that can be automated to facilitate software engineering. The availability of large-scale data (thanks to open-source repositories, forums, and millions of contributors worldwide) opens up the opportunity to solve many of these problems in a data-driven fashion. PLBART aims at program-and-language applications that demand a complete syntactic and semantic understanding of source code and associated textual data. For the tasks on which we have shown evaluation, PLBART will serve as a solid and replicable baseline to guide future research. We also believe our work could be an excellent starting point for future work aiming to solve a variety of software engineering problems.