A Transformer-based Approach for Source Code Summarization

Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model, which uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies. In this work, we show that, despite being simple, our approach outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens' positions hinders, while relative encoding significantly improves, the summarization performance. We have made our code publicly available to facilitate future research.


Introduction
Program comprehension is an indispensable ingredient of software development and maintenance (Xia et al., 2018). A natural language summary of source code facilitates program comprehension by reducing developers' efforts significantly (Sridhara et al., 2010). Source code summarization refers to the task of creating readable summaries that describe the functionality of a program.
With the advancement of deep learning and the availability of large-scale data through a vast number of open-source repositories, automatic source code summarization has drawn attention from researchers. Most of the neural approaches generate source code summaries in a sequence-to-sequence fashion. One of the initial works, Iyer et al. (2016), trained an embedding matrix to represent the individual code tokens and combined them with a Recurrent Neural Network (RNN) via an attention mechanism to generate a natural language summary. Subsequent works (Liang and Zhu, 2018; Hu et al., 2018a,b) adopted the traditional RNN-based sequence-to-sequence network (Sutskever et al., 2014) with an attention mechanism (Luong et al., 2015) on different abstractions of code.

1 https://github.com/wasiahmad/NeuralCodeSum
The RNN-based sequence models have two limitations in learning source code representations. First, they do not model the non-sequential structure of source code, as they process the code tokens sequentially. Second, source code can be very long, and thus RNN-based models may fail to capture the long-range dependencies between code tokens. In contrast to the RNN-based models, the Transformer (Vaswani et al., 2017), which leverages a self-attention mechanism, can capture such long-range dependencies. Transformers have been shown to perform well on many natural language generation tasks such as machine translation (Wang et al., 2019), text summarization (You et al., 2019), and story generation (Fan et al., 2018).
To learn the order of tokens in a sequence or to model the relationship between tokens, the Transformer must be injected with positional encodings (Vaswani et al., 2017; Shaw et al., 2018; Shiv and Quirk, 2019). In this work, we show that, by modeling the pairwise relationship between source code tokens using relative position representations (Shaw et al., 2018), we can achieve significant improvements over learning the sequence information of code tokens using absolute position representations (Vaswani et al., 2017).
We want to emphasize that our proposed approach is simple but effective, as it outperforms sophisticated state-of-the-art source code summarization techniques by a significant margin. We perform experiments on two well-studied datasets collected from GitHub, and the results endorse the effectiveness of our approach over the state-of-the-art solutions. In addition, we provide a detailed ablation study to quantify the effect of several design choices in the Transformer, delivering a strong baseline for future research.

Proposed Approach
We propose to use the Transformer (Vaswani et al., 2017) to generate a natural language summary given a piece of source code. Both the code and the summary are sequences of tokens, represented by a sequence of vectors x = (x_1, ..., x_n), where x_i ∈ R^{d_model}. In this section, we briefly describe the Transformer architecture (§ 2.1) and how to model the order of source code tokens or their pairwise relationship (§ 2.2) in the Transformer.

Architecture
The Transformer consists of stacked multi-head attention and parameterized linear transformation layers for both the encoder and decoder. At each layer, the multi-head attention employs h attention heads and performs the self-attention mechanism.
Self-Attention. We describe the self-attention mechanism following Shaw et al. (2018). In each attention head, the sequence of input vectors x = (x_1, ..., x_n), where x_i ∈ R^{d_model}, is transformed into an output sequence o = (o_1, ..., o_n), where o_i ∈ R^{d_v}:

o_i = Σ_{j=1}^{n} α_ij (x_j W^V),
e_ij = (x_i W^Q)(x_j W^K)^T / √d_k,

where α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik), and W^Q, W^K ∈ R^{d_model × d_k}, W^V ∈ R^{d_model × d_v} are the parameters that are unique per layer and attention head.
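As an illustration, the computation above can be sketched in a few lines of NumPy for a single attention head. The dimensions, random projections, and function names below are illustrative, not the paper's configuration:

```python
import numpy as np

def softmax(e):
    """Row-wise softmax, shifted for numerical stability."""
    e = e - e.max(axis=-1, keepdims=True)
    p = np.exp(e)
    return p / p.sum(axis=-1, keepdims=True)

def self_attention_head(x, W_Q, W_K, W_V):
    """One self-attention head: x is (n, d_model); returns (n, d_v)."""
    q = x @ W_Q                       # queries, shape (n, d_k)
    k = x @ W_K                       # keys, shape (n, d_k)
    v = x @ W_V                       # values, shape (n, d_v)
    d_k = q.shape[-1]
    e = q @ k.T / np.sqrt(d_k)        # scaled dot-product scores e_ij
    alpha = softmax(e)                # attention weights alpha_ij
    return alpha @ v                  # o_i = sum_j alpha_ij * (x_j W^V)

# Toy usage with random inputs and projections
rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 16, 8, 8
x = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))
out = self_attention_head(x, W_Q, W_K, W_V)  # shape (5, 8)
```

A multi-head layer runs h such heads in parallel and concatenates their outputs before a linear projection.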
Copy Attention. We incorporate the copying mechanism (See et al., 2017) in the Transformer to allow both generating words from vocabulary and copying from the input source code. We use an additional attention layer to learn the copy distribution on top of the decoder stack (Nishida et al., 2019). The copy attention enables the Transformer to copy rare tokens (e.g., function names, variable names) from source code and thus improves the summarization performance significantly ( § 3.2).
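A pointer-generator-style copy mechanism mixes a vocabulary distribution with an attention distribution over source positions. The sketch below is a minimal, hedged illustration of that mixing; the function name and the scalar p_gen gate are assumptions for exposition, not the exact architecture of Nishida et al. (2019):

```python
import numpy as np

def copy_distribution(p_vocab, copy_attn, src_token_ids, p_gen):
    """Mix generation and copy distributions (pointer-generator style).

    p_vocab: (V,) softmax over the vocabulary from the decoder.
    copy_attn: (n,) copy-attention weights over the n source positions.
    src_token_ids: (n,) vocabulary id of each source token.
    p_gen: scalar in [0, 1], probability of generating vs. copying.
    """
    p_final = p_gen * p_vocab.copy()
    # Scatter-add the copy attention mass onto the source tokens' ids,
    # so rare identifiers (function names, variable names) can be
    # produced even if unlikely under the vocabulary distribution.
    np.add.at(p_final, src_token_ids, (1.0 - p_gen) * copy_attn)
    return p_final

V = 10
p_vocab = np.full(V, 1.0 / V)          # uniform vocabulary distribution
copy_attn = np.array([0.7, 0.2, 0.1])  # attention over 3 source tokens
src_ids = np.array([4, 4, 9])          # token id 4 appears twice in source
p = copy_distribution(p_vocab, copy_attn, src_ids, p_gen=0.5)
# p sums to 1; id 4 receives extra copy mass 0.5 * (0.7 + 0.2)
```

The scatter-add handles repeated source tokens correctly: copy mass for all occurrences of the same id accumulates on one vocabulary entry.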

Position Representations
Now, we discuss how to learn the order of source code tokens or model their pairwise relationship.

Encoding absolute position. To allow the Transformer to utilize the order information of source code tokens, we train an embedding matrix W^{P_e} that learns to encode tokens' absolute positions into vectors of dimension d_model. However, we show that capturing the order of code tokens is not helpful for learning source code representations and leads to poor summarization performance (§ 3.2). It is important to note that we train another embedding matrix W^{P_d} that learns to encode the absolute positions of summary tokens. 2

Encoding pairwise relationship. The semantic representation of a code snippet does not rely on the absolute positions of its tokens. Instead, their mutual interactions influence the meaning of the source code. For instance, the semantic meaning of the expressions a+b and b+a is the same.
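A learned absolute position encoding amounts to an embedding lookup added to the token embeddings. A minimal sketch, in which the table W_Pe, its initialization, and the sizes are illustrative stand-ins rather than the trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
max_len, d_model = 200, 16

# Learned table: one d_model-dimensional vector per absolute position.
W_Pe = rng.normal(scale=0.02, size=(max_len, d_model))

def add_absolute_positions(token_embeddings):
    """Add the position vector for index i to the i-th token embedding."""
    n = token_embeddings.shape[0]
    return token_embeddings + W_Pe[:n]

x = rng.normal(size=(7, d_model))   # embeddings of a 7-token sequence
y = add_absolute_positions(x)       # shape unchanged: (7, 16)
```

Because the position vector depends only on the index i, swapping two tokens changes the representation even when the code's meaning is unchanged, which is the weakness the relative encoding below addresses.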
To encode the pairwise relationships between input elements, Shaw et al. (2018) extended the self-attention mechanism as follows:

o_i = Σ_{j=1}^{n} α_ij (x_j W^V + a^V_ij),
e_ij = (x_i W^Q)(x_j W^K + a^K_ij)^T / √d_k,

where a^V_ij and a^K_ij are relative position representations for the two positions i and j. Shaw et al. (2018) suggested clipping the relative position to a maximum absolute value of k, as they hypothesized that precise relative position information is not useful beyond a certain distance.
Hence, we learn 2k + 1 relative position representations: (w^K_{-k}, ..., w^K_{k}) and (w^V_{-k}, ..., w^V_{k}).

Table 2: Comparison of our proposed approach with the baseline methods. The results of the baseline methods are directly reported from . The "Base Model" refers to the vanilla Transformer (uses absolute position representations) and the "Full Model" uses relative position representations and includes copy attention.
In this work, we study an alternative to the relative position representations that ignores the directional information (Ahmad et al., 2019). In other words, whether the j-th token is to the left or right of the i-th token is ignored.
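The clipped relative distances, with and without direction, can be sketched as a small bucketing helper. The function below is an illustration of the indexing scheme, not the paper's implementation:

```python
import numpy as np

def relative_position_matrix(n, k, directional=True):
    """Matrix of relative-position bucket indices for a length-n sequence.

    Distances j - i are clipped to [-k, k] (Shaw et al., 2018). With
    directional=False the sign is dropped, so token j being to the left
    or right of token i maps to the same bucket.
    """
    idx = np.arange(n)
    dist = idx[None, :] - idx[None, :].T     # dist[i, j] = j - i
    dist = np.clip(dist, -k, k)              # clip to the window [-k, k]
    if not directional:
        return np.abs(dist)                  # k + 1 buckets: 0 .. k
    return dist + k                          # 2k + 1 buckets: 0 .. 2k

M = relative_position_matrix(4, k=2)
# M[0] = [2, 3, 4, 4]: distances 0, 1, 2, 3 clipped to 2, shifted by k
```

Each bucket index then selects a learned vector w^K or w^V; the direction-agnostic variant halves the number of learned vectors, which is the trade-off evaluated in § 3.2.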

Setup
Datasets and Pre-processing. We conduct our experiments on a Java dataset (Hu et al., 2018b) and a Python dataset (Wan et al., 2018). The statistics of the two datasets are shown in Table 1. In addition to the pre-processing steps followed by , we split source code tokens of the form CamelCase and snake_case into the respective sub-tokens. 3 We show that such a split of code tokens improves the summarization performance.

Metrics. We evaluate the source code summarization performance using three metrics: BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004).

Baselines. We compare our Transformer-based source code summarization approach with five baseline methods reported in  and their proposed Dual model. We refer the readers to  for the details about the hyper-parameters of all the baseline methods.

Hyper-parameters. We follow  in setting the maximum lengths and vocabulary sizes for code and summaries in both datasets. We train the Transformer models using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 10^-4. We set the mini-batch size and dropout rate to 32 and 0.2, respectively. We train the Transformer models for a maximum of 200 epochs and perform early stopping if the validation performance does not improve for 20 consecutive iterations. We use beam search during inference and set the beam size to 4. Detailed hyper-parameter settings can be found in Appendix A.
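The sub-token split can be sketched as a small helper. The function below and its regular expression are hypothetical; the paper's exact pre-processing may differ in details such as lowercasing or digit handling:

```python
import re

def split_subtokens(token):
    """Split a code token on CamelCase and snake_case boundaries,
    lowercasing the resulting sub-tokens."""
    subtokens = []
    for part in token.split('_'):       # snake_case boundaries
        # CamelCase boundaries: "parseHTTPResponse" ->
        # ["parse", "HTTP", "Response"]. A run of capitals is kept
        # together unless it is followed by a lowercase letter.
        subtokens.extend(re.findall(r'[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+', part))
    return [s.lower() for s in subtokens if s]

# "getFileName"      -> ['get', 'file', 'name']
# "snake_case_token" -> ['snake', 'case', 'token']
```

Splitting shrinks the vocabulary and exposes the natural-language words hidden inside identifiers, which is why it helps summarization.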

Results and Analysis
Overall results. The overall results of our proposed model and the baselines are presented in Table 2. The results show that the Base model outperforms the baselines (except for ROUGE-L in Java), while the Full model improves the performance further. 4 We ran the Base model on the original datasets (without splitting the CamelCase and snake_case code tokens) and observed that the performance drops by 0.60 and 0.72 BLEU and 1.66 and 2.09 ROUGE-L points for the Java and Python datasets, respectively. We provide a few qualitative examples in Appendix C showing the usefulness of the Full model over the Base model.
Unlike the baseline approaches, our proposed model employs the copy attention mechanism. As shown in Table 2, the copy attention improves the performance by 0.44 and 0.88 BLEU points for the Java and Python datasets, respectively.

Impact of position representation. We perform an ablation study to investigate the benefits of encoding the absolute position of code tokens or modeling their pairwise relationship for the source code summarization task; the results are presented in Tables 3 and 4. Table 3 demonstrates that learning the absolute position of code tokens is not effective, as it slightly hurts the performance compared to when it is excluded. This empirical finding corroborates the design choice of Iyer et al. (2016), who did not use the sequence information of the source code tokens.
On the other hand, we observe that learning the pairwise relationship between source code tokens via relative position representations helps, as Table 4 demonstrates higher performance. We vary the clipping distance k and consider ignoring the directional information while modeling the pairwise relationship. The empirical results suggest that the directional information is indeed important, while 16, 32, and 2i relative distances result in similar performance (on both experimental datasets).
Varying model size and number of layers. We perform an ablation study by varying d_model and the number of layers l; the results are presented in Table 5. 5 In our experiments, we observe that a deeper model (more layers) performs better than a wider model (larger d_model). Intuitively, the source code summarization task depends more on semantic information than on syntactic information, and thus a deeper model helps.
Use of Abstract Syntax Tree (AST). We perform additional experiments to employ the abstract syntax tree (AST) structure of source code in the Transformer. We follow Hu et al. (2018a) and use the Structure-based Traversal (SBT) technique to transform the AST structure into a linear sequence. We keep our proposed Transformer architecture intact, except that in the copy attention mechanism, we use a mask to block copying the non-terminal tokens from the input sequence. It is important to note that, with and without AST, the average length of the input code sequences is 172 and 120, respectively. Since the complexity of the Transformer is O(n^2 · d), where n is the input sequence length, the use of AST comes with an additional cost. Our experimental findings suggest that incorporating AST information in the Transformer does not improve source code summarization. We hypothesize that exploiting code structure information in summarization has limited advantage, and it diminishes as the Transformer learns the structure implicitly with relative position representations.
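As a toy illustration, the SBT linearization can be sketched on a simplified tuple-based AST. The (label, children) node representation below is an assumption for exposition; a real implementation operates on the output of a language parser:

```python
def sbt(node):
    """Structure-based traversal (after Hu et al., 2018a): linearize an
    AST into a bracketed token sequence that preserves the tree shape.
    Nodes are (label, children) tuples, a stand-in for real AST nodes."""
    label, children = node
    seq = ['(', label]            # open the subtree with its label
    for child in children:
        seq.extend(sbt(child))    # recursively linearize each child
    seq.extend([')', label])      # close the subtree, repeating the label
    return seq

# Toy AST for the expression "a + b"
tree = ('BinOp', [('Name_a', []), ('Name_b', [])])
tokens = sbt(tree)
# ['(', 'BinOp', '(', 'Name_a', ')', 'Name_a',
#  '(', 'Name_b', ')', 'Name_b', ')', 'BinOp']
```

The bracketed sequence is unambiguous (the tree can be recovered from it), but it roughly doubles the token count, which is the length overhead noted above.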
Qualitative analysis. We provide a couple of examples in Table 6 to demonstrate the usefulness of our proposed approach qualitatively (more examples are provided in Tables 9 and 10 in the Appendix). The qualitative analysis reveals that, in comparison to the vanilla Transformer model, the copy-enabled model generates shorter summaries with more accurate keywords. Besides, we observe that in a copy-enabled model, frequent tokens in the code snippet get a higher copy probability when relative position representations are used, in comparison to absolute position representations. We suspect this is due to the flexibility of learning the relation between code tokens without relying on their absolute positions.

Related Work
Most of the neural source code summarization approaches frame the problem as a sequence generation task and use recurrent encoder-decoder networks with attention mechanisms as the fundamental building blocks (Iyer et al., 2016; Liang and Zhu, 2018; Hu et al., 2018a,b). Different from these works, Allamanis et al. (2016) proposed a convolutional attention model to summarize source code into short, name-like summaries. Recent works in code summarization utilize structural information of a program in the form of an Abstract Syntax Tree (AST) that can be encoded using tree-structured encoders such as Tree-LSTM. Among other noteworthy works, API usage information (Hu et al., 2018b), reinforcement learning (Wan et al., 2018), dual learning , and retrieval-based techniques (Zhang et al., 2020) have been leveraged to further enhance code summarization models. A Transformer could be enhanced with these previously proposed techniques; however, in this work, we limit ourselves to studying different design choices for a Transformer without breaking its core architectural design philosophy.

Conclusion
This paper empirically investigates the advantage of using the Transformer model for the source code summarization task. We demonstrate that the Transformer with relative position representations and copy attention outperforms state-of-the-art approaches by a large margin. In our future work, we want to study the effective incorporation of code structure into the Transformer and apply the techniques to other software engineering sequence generation tasks (e.g., commit message generation for source code changes).

A Hyper-Parameters

While conducting our study using the Transformer on the Python dataset, we observed a significant gain over the state-of-the-art methods as reported in . However, our initial experiments on this dataset using recurrent sequence-to-sequence models also demonstrated higher performance compared to the results reported in . We suspect that the lower reported performance is due to hyper-parameters not being tuned correctly. So, for the sake of fairness and to investigate the true advantage of the Transformer, we present a comparison between the recurrent Seq2seq model and the Transformer in Table 8 using our implementation. 6