Towards Context-Aware Code Comment Generation

Code comments are vital for software maintenance and comprehension, but many software projects lack meaningful and up-to-date comments in practice. This paper presents a novel approach to automatically generate function-level code comments, targeting object-oriented programming languages. Unlike prior work that only uses information locally available within the target function, our approach leverages broader contextual information by considering all other functions of the same class. To propagate and integrate information beyond the scope of the target function, we design a novel learning framework based on the bidirectional gated recurrent unit and a graph attention network with a pointer mechanism. We apply our approach to produce code comments for Java methods and compare it against four strong baseline methods. Experimental results show that our approach outperforms most baselines by a large margin and achieves results comparable to the state-of-the-art method.


Introduction
High-quality code comments are important for software maintenance. Yet, few software projects adequately document their code (Kajko-Mattsson, 2005). One way to overcome the lack of human-written comments, and to guard against mismatched and obsolete comments, is to generate them automatically.
Classical approaches for automatic comment generation use hand-crafted templates to produce code descriptions (Sridhara et al., 2010; Cortes-Coy et al., 2014; Dawood et al., 2017), but suffer from poor scalability and high maintenance cost due to the expensive overhead of writing comment templates. More recent work takes a learning-based approach, employing neural network (NN) models developed for natural language processing tasks like machine translation to automatically generate comments (e.g., Luong et al., 2015). Compared to hand-written templates, a learning-based approach driven by empirical data is more scalable and sustainable.
The key to generating high-quality comments is to utilize as much relevant information as possible from the source code to infer the high-level algorithmic intents. Prior work achieves this by converting a representation of the program, e.g., an Abstract Syntax Tree (AST), into a linear token sequence, to which a sequential model like an LSTM can be applied to translate the tokens into natural language descriptions (Hu et al., 2018a; Alon et al., 2019; LeClair et al., 2019). Some more recent work employs the Graph Convolution Network (GCN) to operate directly on a graph representation of the AST, e.g., an adjacency matrix (LeClair et al., 2020). While promising, all existing learning-based approaches only capture information from the target function (or method) to be commented, failing to capitalize on the abundant information and algorithmic intentions available in a broader context, such as the definitions of callee functions or information (like the purposes and semantic meanings of data variables) that can only be gleaned from related methods in the same class. Because the programmer's strategic intentions are often encapsulated in a scope greater than a single local function, ignoring such contextual information misses massive opportunities.
As a motivating example, consider the Java code shown in Figure 1. Here, our task is to describe the purpose of the method append defined at lines 10 and 12. This example is from a real-life open-source project, where a developer-written comment is given. The human-written comment states that this method adds a component to a panel object and then moves to the next data column. Because neither panel nor the next data column appears in the target function, existing approaches that operate on the AST of this local method would fail to generate meaningful descriptions. Simply inlining the callee function, append(Control, int), does not provide the context of the panel object, which is also important for understanding the programmer's intentions. If we could look at a broader context outside the target function, i.e., by leveraging the constructor, DefaultFormBuilder, and the callee function, we could obtain much of the contextual information needed for generating a good-quality comment.
The above example demonstrates the importance of leveraging broader contextual information for comment generation. In object-oriented programming, object classes are the building blocks for expressing algorithmic intentions. Indeed, it is the class (not a single local method) that forms the mental boundary of functionality. Since a class encapsulates much of the calling relationships and semantic information that cannot be obtained from a local function, the global structural information in a class should not be ignored when we attempt to understand the purpose of a function.
This paper thus presents a new code comment generation approach that leverages the global structural information available in object-oriented programming languages. Doing so allows us to utilize much of the contextual information within a class to enhance function-level comment generation. As a departure from prior methods that only consider local information during encoding, our approach employs a two-way encoding mechanism that considers information both within and outside the target function. We achieve this by simultaneously modeling the token sequence of the target function and a program graph that connects all methods of the same class to the target function. We then learn the synergy between the information available within the local function and the wider class scope. A key challenge is determining the importance and relevance of information available at different code scopes. To this end, we design a novel decoder that learns which information at the local and class levels should be emphasized during comment generation. Our encoder is composed of a local encoder for extracting information from the local function and a global encoder for extracting information at the class level. The decoder then decides which code segments within the class should receive the most attention, so that a pointer mechanism can copy words directly from the source code into the generated comments.
We apply our approach to generate function-level comments for Java programs. We evaluate our approach on Java methods collected from over 1,600 open-source repositories hosted on GitHub, and compare it against four strong baseline methods of code comment generation. Experimental results show that our approach consistently delivers higher-quality comments, improving BLEU and ROUGE by at least 7.7% and 5.1% (up to 87.1% and 68.3%), respectively. This paper makes the following contributions:
• It presents the first approach to exploit relevant methods of the same class to enhance the understanding of the target function.
• It is the first to show that other functions or methods outside the call graph of the target function can also contribute positively to the generated comment.
• It proposes a novel learning framework that can leverage both local and class-level contextual information for code comment generation.

Our Approach
As depicted in Figure 2, our code comment generation framework consists of three innovative components. The local encoder, based on a bi-directional Gated Recurrent Unit (bi-GRU) (Cho et al., 2014), extracts features from the source code of the target function. The global encoder, built upon a Graph Attention (GAT) network (Velickovic et al., 2018), propagates and exchanges information among all functions within the target class. The decoder aggregates the local and global information learned by the two encoders. It employs an attention mechanism to determine which parts of the local and global contexts should be attended to, and then uses a pointer mechanism to copy words from the source code when generating comments.

Local Encoder
Our local encoder extracts features from the source code token sequence of the target function. Given the source code of a function $x = (x_1, \ldots, x_n)$ of $n$ words, we use a bi-GRU to encode it into a dense representation sequence:

$\overrightarrow{q}_j = \mathrm{GRU}(\overrightarrow{q}_{j-1}, x_j), \qquad \overleftarrow{q}_j = \mathrm{GRU}(\overleftarrow{q}_{j+1}, x_j)$

where $\overrightarrow{q}_j$ and $\overleftarrow{q}_j$ are the hidden states of $x_j$ in the forward and backward directions. We concatenate the last hidden states of both directions as the local representation of the function: $q_n = [\overrightarrow{q}_n \,\|\, \overleftarrow{q}_1]$.
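To make the encoding concrete, here is a minimal NumPy sketch of a bi-GRU encoder. This is an illustration of the mechanism only, not the paper's implementation: the dimensions are toy values and random matrices stand in for learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step: update gate z, reset gate r, candidate state h~."""
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_tilde

def bi_gru_encode(xs, fwd, bwd, hidden):
    """Encode a token-embedding sequence in both directions and
    concatenate the final hidden states as the local representation."""
    h_f = np.zeros(hidden)
    for x in xs:                 # forward pass
        h_f = gru_step(x, h_f, *fwd)
    h_b = np.zeros(hidden)
    for x in reversed(xs):       # backward pass
        h_b = gru_step(x, h_b, *bwd)
    return np.concatenate([h_f, h_b])

def make_params(rng, d, h):
    """Random stand-ins for the six learned GRU weight matrices."""
    return (rng.normal(size=(h, d)), rng.normal(size=(h, h)),
            rng.normal(size=(h, d)), rng.normal(size=(h, h)),
            rng.normal(size=(h, d)), rng.normal(size=(h, h)))

rng = np.random.default_rng(0)
d, h = 4, 3
xs = [rng.normal(size=d) for _ in range(5)]   # embeddings of 5 code tokens
q_n = bi_gru_encode(xs, make_params(rng, d, h), make_params(rng, d, h), h)
assert q_n.shape == (2 * h,)   # concatenated forward/backward states
```

The same routine is reused later to initialize the vertices of the global encoder, one call per method of the class.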

Global Encoder
Graph Construction. Unlike prior work that only focuses on extracting information from the target function, we aim to exploit the information available at the class level. To do so, we connect the target function to all other functions in the same class to form a class-level contextual graph, which we refer to as a C-Graph. We then use a GAT network to exchange information between all the nodes in the C-Graph to construct a global, class-level representation of the target function. Given a C-Graph G = (V, E), each vertex v_i ∈ V represents a function in the class, v_t represents the target function, and each edge e_{t,j} = (v_t, v_j) ∈ E indicates the connection between the target function v_t and another function v_j.
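The C-Graph construction described above can be sketched as follows; the method names are illustrative, borrowed from the Figure 1 example rather than taken from the paper's dataset.

```python
def build_c_graph(class_methods, target):
    """Build the class-level contextual graph (C-Graph): the target
    method is connected to every other method of the same class.
    `class_methods` is a list of method names; `target` must be among them."""
    vertices = list(class_methods)
    edges = [(target, m) for m in vertices if m != target]
    return vertices, edges

# Illustrative class from the motivating example
methods = ["append", "appendControlInt", "DefaultFormBuilder", "nextColumn"]
V, E = build_c_graph(methods, "append")
assert len(E) == len(V) - 1                    # a star centred on the target
assert all(src == "append" for src, _ in E)    # edges fan out from the target
```

The resulting graph is a star: every other method of the class is one hop away from the target function, so a few rounds of GAT message passing suffice to pull class-level context into the target vertex.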
Vertex Initialization. To encode each vertex of the graph into a hidden vector, we apply the local encoder to each individual function (i.e., a vertex in our C-Graph) and concatenate the last hidden states in both directions as the initial vertex representation $\{g_i^0 \mid v_i \in V\}$ for the GAT.

Graph Attention Network. We feed the initial state of each vertex into a GAT, which applies different weights to different nodes when exchanging information with neighbors. In the $(l+1)$-th layer of the GAT network, each vertex $v_i$ sends its current hidden state to all its neighbors $N(v_i)$. At the same layer, each vertex $v_i$ also receives a set of messages $\{g_j^l \mid v_j \in N(v_i)\}$, where $g_j^l$ is the hidden state of vertex $v_j$ in the $l$-th layer. Then, each vertex computes a linear combination of the neighborhood hidden states as its new hidden state:

$g_i^{l+1} = \sum_{v_j \in N(v_i)} \alpha_{ij} W_g\, g_j^l$

where $W_g$ is a learnable matrix and $\alpha_{ij}$ is the graph attention distribution between vertex $v_i$ and its neighbor $v_j$, computed as:

$\alpha_{ij} = \mathrm{softmax}_j\big(\sigma\big(a^\top [W_a g_i^l \,\|\, W_a g_j^l]\big)\big)$

where $W_a$ and $a$ are learnable parameters, $\sigma$ stands for an activation function, usually LeakyReLU, and $[\cdot \| \cdot]$ means the concatenation of two matrices. We repeat this neighborhood aggregation process $L$ times to get the final state of each vertex of the C-Graph, $\{g_i^L \mid v_i \in V\}$. We use the final state of the target function, $g_t^L$, as the global representation.
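A single GAT layer as described above can be sketched in NumPy as follows. This is an illustration of the aggregation rule, not the paper's implementation: dimensions are toy values and random matrices stand in for the learned parameters $W_g$, $W_a$, and $a$.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(neighbours, H, W_g, W_a, a):
    """One round of neighbourhood aggregation: vertex i's new state is an
    attention-weighted combination of its neighbours' transformed states."""
    n, d_out = H.shape[0], W_g.shape[0]
    H_new = np.zeros((n, d_out))
    for i, nbrs in neighbours.items():
        # attention logits: a^T [W_a g_i || W_a g_j] through LeakyReLU
        scores = np.array([
            float(leaky_relu(a @ np.concatenate([W_a @ H[i], W_a @ H[j]])))
            for j in nbrs
        ])
        alpha = softmax(scores)   # attention distribution over neighbours
        H_new[i] = sum(alpha[k] * (W_g @ H[j]) for k, j in enumerate(nbrs))
    return H_new

rng = np.random.default_rng(1)
H = rng.normal(size=(4, 5))      # initial vertex states from the local encoder
W_g = rng.normal(size=(5, 5))
W_a = rng.normal(size=(5, 5))
a = rng.normal(size=10)
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}   # C-Graph: target = vertex 0
H1 = gat_layer(star, H, W_g, W_a, a)
assert H1.shape == (4, 5)
```

Stacking this layer L times (with residual connections, as in Section "Model Structures and Hyper-parameters") yields the final vertex states; the target vertex's state is then taken as the global representation.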

Decoder
As our encoders embed both local and global information, the decoder needs to integrate the information extracted at the different levels and take it into account when generating comment tokens. We again adopt a GRU as the decoder and use the concatenation of the local representation $q_n$ and the global representation $g_t^L$ as its initial state; $h_i$ denotes the hidden state of the decoder at step $i$. When generating the $i$-th word, we leverage a graph attention mechanism to extract the most relevant methods of the class and use a pointer mechanism to copy words from the method body into the comment text.
Graph Attention. In the $i$-th decoding step, we leverage the results of the global encoder and compute a global context vector $cg_i$ reflecting which parts of the graph structure should be attended to:

$\gamma_{ij} = \frac{\exp(h_i^\top W_{ga}\, g_j^L)}{\sum_k \exp(h_i^\top W_{ga}\, g_k^L)}, \qquad cg_i = \sum_j \gamma_{ij}\, g_j^L$

where $g_j^L$ and $g_k^L$ are the representations of functions in the C-Graph and $W_{ga}$ is a trainable matrix.
Local Attention. We apply another attention mechanism to locate the most relevant words in the body of the target function when generating the $i$-th word of the comment, producing a local context vector $c_i$:

$\beta_{ij} = \frac{\exp(h_i^\top W_{la}\, q_j)}{\sum_k \exp(h_i^\top W_{la}\, q_k)}, \qquad c_i = \sum_j \beta_{ij}\, q_j$

where $q_j$ and $q_k$ are the contextual embeddings of words $x_j$ and $x_k$ in the input sequence of the target function and $W_{la}$ is a trainable parameter.
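Both the graph and local attention modules follow the same bilinear pattern, which can be sketched as follows (NumPy, toy dimensions and random weights; the score here is computed as a dot product against a learned transform of the decoder state, which is the bilinear form above up to a transposition of the weight matrix).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(h_i, keys, W):
    """Score each key against the decoder state h_i, normalise the scores,
    and return the weighted sum of keys as the context vector."""
    keys = np.stack(keys)
    scores = keys @ (W @ h_i)     # one bilinear score per key
    weights = softmax(scores)     # beta (local) or gamma (graph) distribution
    return weights @ keys, weights

rng = np.random.default_rng(2)
h_i = rng.normal(size=4)                          # decoder state at step i
q = [rng.normal(size=4) for _ in range(6)]        # contextual word embeddings
W_la = rng.normal(size=(4, 4))                    # trainable attention matrix
c_i, beta = attention_context(h_i, q, W_la)
assert abs(beta.sum() - 1.0) < 1e-9               # a valid distribution
assert c_i.shape == (4,)
```

Passing the GAT vertex states instead of the word embeddings, with a separate weight matrix, yields the global context vector in exactly the same way.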
Pointer. As the source code may contain words that can be used directly in the comment, we add a pointer mechanism (See et al., 2017) that can copy useful words from the source code. The pointer mechanism merges a copy distribution with the normal output prediction distribution. In the $i$-th decoding step:

$P_{vocab} = \mathrm{softmax}\big(W_v [h_i \,\|\, c_i \,\|\, cg_i] + b_v\big)$

$p_{gen} = \mathrm{sigmoid}\big(w_h^\top h_i + w_c^\top c_i + w_y^\top y_{i-1}\big)$

$P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{j:\, x_j = w} \beta_{ij}$

where $\beta$ is the local attention distribution and $W_v, b_v, w_h, w_c, w_y$ are all trainable parameters. $P_{vocab}$ is the normal output prediction distribution, and $p_{gen}$ serves as a soft switch between generating words from the vocabulary and copying them directly from the source code.
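The mixing of the generation and copy distributions can be sketched as follows; the vocabulary size, token ids, attention weights, and $p_{gen}$ value are all toy inputs for illustration.

```python
import numpy as np

def pointer_distribution(p_vocab, beta, src_ids, vocab_size, p_gen):
    """Mix the generation distribution with a copy distribution derived from
    the local attention weights over source tokens (See et al., 2017).
    src_ids[j] is the vocabulary id of the j-th source token."""
    copy = np.zeros(vocab_size)
    for j, w in enumerate(src_ids):
        copy[w] += beta[j]        # sum attention mass per vocabulary word
    return p_gen * p_vocab + (1 - p_gen) * copy

vocab_size = 6
p_vocab = np.full(vocab_size, 1 / vocab_size)   # uniform toy P_vocab
beta = np.array([0.5, 0.3, 0.2])                # local attention over 3 tokens
src_ids = [2, 2, 4]                             # source tokens mapped to vocab ids
p = pointer_distribution(p_vocab, beta, src_ids, vocab_size, p_gen=0.7)
assert abs(p.sum() - 1.0) < 1e-9   # still a valid probability distribution
assert p[2] > p[0]                 # copied words gain probability mass
```

Because both inputs are probability distributions and $p_{gen} \in [0, 1]$, the mixture is itself a valid distribution; words that appear in the source code receive extra mass proportional to their attention weight.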
Experimental Setup

Dataset
We apply our approach to Java programs. Our dataset is downloaded from 1,634 open-source repositories hosted on GitHub. These projects are top-ranked GitHub projects with Java as the primary programming language. Our dataset consists of 150,239 target methods within 27,323 classes, where each target method has a developer-written comment as the golden comment. We split the dataset by project, and use 90% of the projects for training, 5% for validation, and 5% for testing. This gives us 133,058 method-comment pairs for training, 6,952 pairs for validation, and 10,229 pairs for testing. The average numbers of tokens in target functions and target comments are 26.27 and 8.25, respectively. The average number of nodes in the C-Graph is 37.

Preprocessing
At the preprocessing stage, we serialize the source code of the target method into a token sequence and remove any non-alphabetical characters. We also split identifier and function names written in camelCase or underscore style into independent words. Due to the memory limitation of applying a GRU to long sequences, we truncate the source sequence of a function to 100 tokens.
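The identifier splitting and truncation steps can be sketched as follows. This is a simplified tokenizer for illustration, not the exact preprocessing pipeline used in the paper.

```python
import re

def split_identifier(name):
    """Split camelCase and underscore-style identifiers into lower-case words."""
    words = []
    for part in name.split("_"):
        # split at case boundaries: "nextDataColumn" -> next, Data, Column
        words += re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part)
    return [w.lower() for w in words if w]

def preprocess(source, max_len=100):
    """Tokenise source code, split identifiers, keep alphabetic tokens only,
    and truncate to max_len tokens."""
    raw = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    tokens = [w for ident in raw for w in split_identifier(ident)]
    tokens = [t for t in tokens if t.isalpha()]   # drop non-alphabetical pieces
    return tokens[:max_len]

assert split_identifier("nextDataColumn") == ["next", "data", "column"]
assert split_identifier("default_form_builder") == ["default", "form", "builder"]
assert preprocess("void append(Control c) { nextColumn(); }")[:2] == ["void", "append"]
```

Splitting compound identifiers this way shrinks the vocabulary and lets the pointer mechanism copy individual words like "column" into the generated comment.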

Model Structures and Hyper-parameters
The dimensions of the embedding and hidden vectors are set to 128, and word embeddings are randomly initialized. The GRU encoder has a single layer. The GAT network has four layers, since we find that using more layers does not improve the results. Each GAT layer except the last has a residual connection to avoid vanishing gradients. We set the dropout (Srivastava et al., 2014) rate to 0.1, the weight decay rate to 0.0001, and the batch size to 20. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. To reduce randomness, we run each model setting five times and report the average performance.

Baseline Models
We compare our model with three types of models, described as follows.
The first kind of models, CodeNN (Iyer et al., 2016) and Seq2Seq (Luong et al., 2015), treat source code as a flat sequence of words. CodeNN is a modified language model with an attention mechanism, and it is among the earliest NN models for code comment generation. Seq2Seq is the most widely used model for many generation tasks. For these models, we use a bidirectional GRU as the encoder and a GRU with attention as the decoder.
The second kind of models incorporate information from the AST and exploit its structural information to assist comment generation. We choose two state-of-the-art AST-based generation models: DeepCom (Hu et al., 2018a) and Code2Seq (Alon et al., 2019). DeepCom works by flattening the whole AST and then applying a Seq2Seq model with an attention mechanism to the flattened sequence. It uses a specially designed method called structure-based traversal (SBT) that guarantees the mapping between functions and flattened sequences is injective. Code2Seq exploits the AST structural information that lies in the paths between leaf nodes. It randomly samples k paths, encodes them together as the representation of the AST, and applies a simple decoder with attention to generate comments.
In addition, we propose two models that also leverage class-level information, but in a simpler way. The first, GraphAttn, utilizes the C-Graph with an attention mechanism: it applies a bi-GRU to all functions in the C-Graph to obtain the graph encoding, and during the decoding stage it attends to each function with a graph attention module. The second, GraphFlatten, flattens the C-Graph into a sequence: it concatenates all functions in the C-Graph, applies a bi-GRU, and takes the concatenation of the last hidden states as the global representation of the target function.

Main Result
We summarize the main results in Table 1. Most models outperform Seq2Seq on all metrics, and introducing either AST structural information or global information brings improvements. We conclude that structural information is essential for better comprehending source code and generating more accurate comments.
The AST-based models show quite different performance. DeepCom has a BLEU score of less than 15; this result is consistent with LeClair et al. (2019), who also performed experiments on Java projects and found that splitting functions by project and the growing length of the input sequence fed to the GRU both hurt DeepCom. In contrast, Code2Seq achieves the best results among all baseline models, meaning it effectively extracts the local structural information provided by the AST. Code2Seq generally generates comments that are shorter than ours: it achieves higher n-gram accuracy but suffers a larger brevity penalty, so our model outperforms it on ROUGE-L and BLEU. We conclude that the AST provides local information that can assist comment generation, and we expect that combining AST structure with our method could bring further improvement, as the combination would benefit from both AST structural information and the global context of related methods.
After introducing class-level information, both GraphAttn and GraphFlatten improve over Seq2Seq, indicating that related functions in the same class benefit comment generation for the target function. Even though these models use the global information in a limited way, it still proves effective.
Our model outperforms most baseline models, which indicates that global information has a clear positive effect on comment generation. Comparing our model with GraphAttn and GraphFlatten, we find that building the C-Graph is a better choice than using an attention mechanism alone or simply flattening the whole class to extract global information.

Ablation Study
We perform an ablation study to evaluate the effect of each component of our model. As shown in Table 2, all components of the model contribute to the final results.
Among all the settings, removing the local encoder causes the largest degradation, an 8-point drop in BLEU score, indicating the importance of the local encoder. Though introducing the graph structure into the model brings much improvement, the target function alone still provides indispensable information for the comment generation process.
Comparing our model with ours−pointer (our model without the pointer module), we see that adding the pointer mechanism brings about a 0.3-point improvement. However, comparing Seq2Seq with PointerNet, which has the same architecture as ours−global encoder (our model without the global encoder module), we find a 0.6-point improvement. This suggests that introducing global information into the model enhances its ability to copy words directly from the source code.
Removing either attention module also degrades performance, indicating the benefit of the attention mechanism in our decoder. In each step of word generation, it is essential to attend to different functions and words. The attention module not only helps the decoder determine the most relevant parts to focus on, but also reinforces the local and global information during the decoding process.

Q1: What kind of information is useful?
As a class grows larger, the functions in the same class become more diverse. Given a function, not all of its neighbors are guaranteed to be closely related to it when the class is large. This raises the question of what kind of information in the C-Graph is useful. To explore this question, we run a set of experiments and find that, inside a class, function calls provide the most valuable information, while other functions provide useful but noisier information.
We divide all functions connected to the target function into two categories: function calls and other functions. Experiment Call uses a graph containing only the edges between the target function and its function calls. Experiment Random uses a graph with the same structure as Call but replaces all nodes with functions randomly selected from the other functions in the class. Experiment Sample also has the same structure as Call but replaces all nodes with functions randomly selected from other classes.
As shown in Table 1, Call outperforms most baseline models by a large margin and exceeds Random by 2 BLEU points, indicating that function calls are particularly effective for the comment generation process and play the key role in forming the global information. Although Random is outperformed by Call, it still achieves some improvement, indicating that functions other than function calls also have a positive effect on comment generation, though their contribution is relatively weaker. Sample shows nearly no improvement at all, indicating that functions from other classes are of no use to the target function, as we expected.

Q2: What is the impact of C-Graph on comment generation?
We hypothesize that the global information of the C-Graph can have a positive impact on comment generation. To quantify the impact of C-Graph, we define two metrics at the word level.
The first metric, P_o, evaluates whether the C-Graph helps emphasize key information in the target function. We define the set S_o as the set of words in the target comment that appear in both the target function and its neighborhood functions; P_o is the percentage of S_o that a model correctly generates.
The second metric, P_c, evaluates whether the C-Graph provides information that does not exist in the target function. We define the set S_c as the set of words in the target comment that appear in the neighborhood functions but not in the target function; P_c is the percentage of S_c that a model correctly generates. Note that we ignore all stop words when collecting the sets S_o and S_c.
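The two metrics can be computed as follows. This is a sketch: the stop-word list is an illustrative subset (the paper does not specify its list here), and the token sets are toy values echoing the motivating example.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}   # illustrative subset

def po_pc(comment, target_tokens, neighbour_tokens, generated):
    """Word-level metrics: S_o holds comment words present in both the
    target function and its C-Graph neighbours; S_c holds comment words
    found only in the neighbours. P_o / P_c are the fractions of each
    set that the model actually generated."""
    c = {w for w in comment if w not in STOP_WORDS}
    t, n, g = set(target_tokens), set(neighbour_tokens), set(generated)
    s_o = c & t & n          # overlapping words (emphasis)
    s_c = (c & n) - t        # words only available via neighbours (new info)
    p_o = len(s_o & g) / len(s_o) if s_o else 0.0
    p_c = len(s_c & g) / len(s_c) if s_c else 0.0
    return p_o, p_c

p_o, p_c = po_pc(
    comment=["adds", "component", "to", "panel"],
    target_tokens=["component", "append"],
    neighbour_tokens=["panel", "component", "builder"],
    generated=["adds", "component", "panel"],
)
assert p_o == 1.0 and p_c == 1.0   # both sets fully generated in this toy case
```

Here "panel" exists only in the neighbourhood functions, so generating it counts toward P_c, while "component" appears in both scopes and counts toward P_o.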
Seq2Seq achieves P_o = 51.6% and P_c = 19.1%, while our model achieves P_o = 53.1% and P_c = 25.1%, which indicates that the C-Graph both provides extra information and helps emphasize important information during the comment generation process.
Q3: Does more contextual information bring more improvement?
We conduct a statistical analysis of our results and find that the more information the C-Graph brings, the more improvement we observe. We measure the amount of information that the C-Graph provides by word overlap.
Figure 3 (left) depicts the BLEU score against the percentage of functions in the C-Graph whose words overlap with the target comment. As more functions overlap with the target comment, we achieve better BLEU scores, except at the last point. Figure 3 (right) depicts the BLEU score against the word overlap rate between the target comment and the comments of neighbor functions in the C-Graph. As this overlap rate increases, the improvement also grows, presenting a horn shape. Using word overlap as a simple measure of the amount of information given by the C-Graph, we find that the more information is provided, the more improvement can be observed. Table ?? gives four example comments generated by our approach and Seq2Seq, as well as the golden comments written by developers. The comments given by our model are closer to the golden comments and more meaningful than those produced by Seq2Seq.

Case Study
In the first case, function y2 does not give much information, and we cannot understand its meaning by looking at it alone. As a result, the comment generated by Seq2Seq is meaningless and unreadable. However, when taking other functions in the same class into consideration, we find that function setBox uses y2 along with similar variables, including x1, x2, and y1, all obtained from an object named location. Based on this extra information, we can infer that y2 is a coordinate of a point, and the comment generated by our model successfully captures this correct meaning of y2.

Golden comment | inserts an item into the quadtree | retrieves the name of the first element within a dimension
Seq2Seq | insert the methods description here | returns the value of the specified string
Our model | inserts an item into the tree | returns the value of the specified name from the given dimension
In the second case, the function elementString mainly calls a function toStringHelper in its return statement. Referring to toStringHelper, we learn that it converts a map object to a string representation. With this extra information, we can understand the goal of the target function. Note that the name elementString does not reflect the function's purpose well: the comment generated by Seq2Seq is misled into a completely irrelevant meaning, while our model successfully captures the importance of the function call and manages to generate a correct description for the target function.
In the third case, the target function insert(ForceItem item) mainly calls another function, insert(ForceItem p, QuadTreeNode n...), to accomplish its goal. The target function itself is too short to provide enough information for a meaningful and useful comment; we can only tell that it inserts something, without further details. However, from its callee function we learn that it inserts an item into a quadtree. With this extra information, we can understand the goal of the target function, and the comment generated by our model successfully captures that it inserts an item into the tree structure.
In the fourth case, the comment generated by the Seq2Seq model only uses ambiguous words such as "value" and "specified string", which offer no useful information. With the related functions involved, our model manages to capture the key information, "dimension" and "name".

Related Work
As a critical task in software engineering, code comment generation has been explored with various solutions. In recent years, deep learning based methods have dominated this line of research; most follow an encoder-decoder framework and can produce readable comments. Iyer et al. (2016) proposes an LSTM-based language model with an attention mechanism to generate short descriptions for C# code and SQL queries. Hu et al. (2018b) exploits knowledge transferred from automatic API summaries to enhance the generation of code comments. However, these methods fail to utilize the structural characteristics of program code. More recent efforts explicitly adopt the AST to exploit the structure of code. Hu et al. (2018a) introduces AST sequences generated by Structure-Based Traversal (SBT) as a structured summary of the code in the generation model. Liang and Zhu (2018) applies an RNN over AST trees to extract both semantic and syntactic information and designs a special decoder, Code-GRU, for the generation process. Alon et al. (2019) decomposes AST trees into paths between leaf nodes and samples a certain number of them as the structural input of the model. LeClair et al. (2019) proposes to use both AST sequences and source code sequences as multiple inputs to the model. LeClair et al. (2020) proposes to employ a GNN over AST structures to better extract structural information. Our work, in contrast, explores a broader context, the class-level neighboring functions, to introduce rich information for comment generation.

Conclusion
We have presented a novel approach for automatic code comment generation, targeting object-oriented programming languages. Unlike prior work that only leverages information from the target function, our approach leverages related methods of the same class to exploit the information available in a broader context and improve the quality of the generated comments. Our novel learning framework extracts local information from the target function and global contextual information at the class level. Experimental results show that our model can effectively combine local and class-level information and generate more detailed, higher-quality comments than prior methods.