Embedding Dynamic Attributed Networks by Modeling the Evolution Processes

Network embedding has recently emerged as a promising technique for embedding the nodes of a network into low-dimensional vectors. While fairly successful, most existing works focus on embedding techniques for static networks. In practice, however, many networks evolve over time and are hence dynamic, e.g., social networks. To address this issue, a high-order spatio-temporal embedding model is developed to track the evolution of dynamic networks. Specifically, an activeness-aware neighborhood embedding method is first proposed to extract the high-order neighborhood information at each given timestamp. Then, an embedding prediction framework is developed to capture the temporal correlations, in which the attention mechanism is employed instead of recurrent neural networks (RNNs) for its computational efficiency and modeling flexibility. Extensive experiments are conducted on four real-world datasets from three different areas. The proposed method outperforms all the baselines by a substantial margin on the tasks of dynamic link prediction and node classification, which demonstrates its effectiveness in tracking the evolution of dynamic networks.


Introduction
Network embedding (NE) aims to represent each node by a low-dimensional vector while preserving its neighborhood information as much as possible. It has been shown that working on the low-dimensional representations is much more efficient than working directly on the original large-scale networks in various real-world applications, such as friend recommendation, product advertising, community detection (Cavallari et al., 2017), and node classification. Because of its capability in facilitating downstream applications, many methods have been developed to embed network nodes into vectors efficiently and effectively, such as DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and Node2Vec (Grover and Leskovec, 2016). Later, the attributes or texts available at nodes were further taken into account, e.g., in CANE (Tu et al., 2017) and WANE (Shen et al., 2018), to obtain more comprehensive embeddings. However, these methods mostly focus on static networks, whereas in practice networks are often dynamic. In social networks, for instance, new friend connections are established all the time, and user profiles are updated from time to time. For these dynamic networks, how to learn their embeddings and, more importantly, how to leverage the embeddings to predict their evolution trends is crucial for many applications.
Existing methods for dynamic network embedding can be roughly divided into two categories. The first category concerns how to obtain new embeddings from the stale ones efficiently when changes in the network are observed. In (Du et al., 2018), by decomposing the learning objective into different parts, it is shown that new embeddings can be obtained by considering only the newly added and most influential nodes. Differently, Hamilton et al. (2017) and Cheng et al. (2020) proposed to use a graph convolutional network (GCN) and a Gaussian process, respectively, to learn a mapping from the associated attributes to the embeddings, so that the embeddings can be updated directly with the output of the mapping. These methods avoid re-computing the embeddings from scratch at every timestamp by tracking the network changes continuously. However, since they focus only on the changes at the current timestamp rather than on their dynamics, the induced embeddings can only represent the network at the current timestamp and are poor at predicting its future evolution.
Figure 1: Illustration of the evolution process of a dynamic network, where red lines indicate the newly formed relationships.
Methods in the second category focus more on improving the prediction performance. Singer et al. (2019) proposed to learn embeddings and an alignment matrix for the network at each timestamp by solving a sequential optimization problem, and then to feed the embeddings into recurrent neural networks (RNNs) to predict the future links. However, sequential optimization is very expensive, which hinders its application to large networks. Alternatively, Zhou et al. (2018) proposed to learn embeddings with the objective of predicting the future closure process of nodes that are separated by at most two hops. Later, Goyal et al. (2020) proposed to employ auto-encoders to predict nodes' direct neighbors from the neighbors at the previous timestamps via long short-term memory (LSTM). Recently, Zuo et al. (2018) introduced the concept of neighborhood formation and used it to track the evolution of nodes with their direct neighbors from the previous timestamps. In all of these methods, only the direct (first-order) neighbors in the spatial dimension are leveraged. However, to capture the dynamics of networks, it is important to consider the high-order information of nodes in both the spatial and temporal dimensions simultaneously. As illustrated in Fig. 1, the green node is isolated from the red node at $t_1$, becomes its fourth-order neighbor at $t_2$, and further evolves into its direct neighbor at $t_n$. To capture this evolution process, the model should be aware of nodes' high-order neighborhood information spatially and should memorize, temporally, the changes that occurred many timestamps earlier.
In this paper, a dynamic attribute network embedding model (Dane) is developed to track the evolution of dynamic networks. Specifically, an activeness-aware neighborhood embedding method is proposed to extract the high-order neighborhood information at each given timestamp. The activeness-aware mechanism enables the model to place more emphasis on nodes that are active in social activities. Then, an embedding prediction framework is developed to capture the temporal correlations of dynamic networks, in which the attention mechanism is employed instead of RNNs for its computational efficiency and modeling flexibility. The methods are evaluated on the tasks of dynamic link prediction and node classification over four real-world datasets. The proposed methods outperform all compared embedding methods, both static and dynamic, on link prediction by a substantial margin, which demonstrates that they are able to capture the correlations in dynamic networks. Similar behavior is observed on the task of dynamic node classification, which further confirms the effectiveness of the proposed methods in tracking the evolution of dynamic networks.

Related Work
Representation learning for graphs has attracted considerable attention recently, since it can potentially benefit a wide range of applications. Specifically, (Perozzi et al., 2014) employs random walks to obtain sequences of nodes, and then uses the word2vec technique to represent the nodes as low-dimensional vectors. To preserve both the global and local structure information, (Tang et al., 2015) proposed to jointly optimize the first- and second-order proximity of nodes in the network. Later, (Grover and Leskovec, 2016) introduced a biased random walk procedure under the BFS and DFS search strategies to further exploit the diversity of structural patterns in networks. However, all of these methods leverage only the topology information of the network, whereas in many real-world networks nodes are associated with attributes or texts. To take the attributes into account, a mutual attention mechanism is designed in (Tu et al., 2017) to enhance the relationships between attributes of neighboring nodes. To extract more semantic features from attributed data, (Shen et al., 2018; Xu et al., 2019b) modified the mutual attention mechanism in (Tu et al., 2017) and employed a fine-grained word alignment mechanism. However, all these works are overwhelmingly performed in the context of plain or static networks.
Various techniques have been proposed to learn deep representations of dynamic patterns. (Li et al., 2017) first provides an offline embedding method and then leverages matrix perturbation theory to keep the resulting embeddings fresh in an online manner. In (Seo et al., 2018), convolutional neural networks on graphs are combined with RNNs, the former to identify spatial structures and the latter to find dynamic patterns. Similarly, (Trivedi et al., 2017) proposed a deep recurrent architecture to model the historical evolution of entity embeddings in a specific relationship space. To capture both structural properties and temporal evolutionary patterns, (Sankar et al., 2018) jointly employed self-attention layers along the structural neighborhood and the temporal dynamics. Differently, (Nguyen et al., 2018) employed a temporal version of traditional random walks to capture temporally evolving neighborhood information. Although significant efforts have been made to learn effective representations of evolutionary patterns, methods that extract high-order spatio-temporal evolutionary information have rarely been explored. In our work, an activeness-aware neighborhood embedding method is developed to capture high-order neighbor evolutionary relationships, followed by a low-complexity attention-based mechanism that efficiently captures temporal evolutionary patterns.

The Proposed Framework
A dynamic attributed network $G$ consists of a sequence of snapshots $G = \{G_1, G_2, \dots, G_n\}$, where $G_t = (V_t, E_t, A_t)$ is the attributed network at timestamp $t$; $V_t$ and $E_t$ are the sets of vertices and edges in the network $G_t$, and $A_t$ denotes the attribute matrix, with the $v$-th row representing the attributes associated with node $v$. Additionally, we use $V = V_1 \cup \dots \cup V_n$ to denote the nodes of the whole network. In this section, we first propose a method to effectively extract the high-order neighborhood information at a given timestamp, based on which a low-complexity attention-based model is then developed to capture the network's future evolution.
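The notation above can be made concrete with a small data structure. The following is a minimal sketch only: the class name `Snapshot`, its field names, and the choice to store undirected edges as `frozenset` pairs are illustrative assumptions, not part of the proposed model.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Attributed network G_t = (V_t, E_t, A_t) at one timestamp (hypothetical container)."""
    nodes: set    # V_t
    edges: set    # E_t, each undirected edge stored as frozenset({u, v})
    attrs: dict   # A_t, mapping node -> attribute vector (list of floats)

    def neighbors(self, v):
        """Return N_t(v), the set of neighbors of v in this snapshot."""
        return {u for e in self.edges if v in e for u in e if u != v}


def all_nodes(snapshots):
    """V = V_1 ∪ ... ∪ V_n over the whole sequence of snapshots."""
    out = set()
    for g in snapshots:
        out |= g.nodes
    return out


# Two toy snapshots: node "c" and the edge b-c appear at the second timestamp.
g1 = Snapshot({"a", "b"}, {frozenset({"a", "b"})}, {"a": [1.0], "b": [0.0]})
g2 = Snapshot({"a", "b", "c"},
              {frozenset({"a", "b"}), frozenset({"b", "c"})},
              {"a": [1.0], "b": [0.0], "c": [0.5]})
```

Keeping one attribute dictionary per snapshot reflects that both the topology and the attributes may change between timestamps.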

Activeness-aware Neighborhood Embedding
To extract high-order neighborhood information in dynamic attributed networks, we could simply apply the GraphSAGE algorithm, which aggregates messages from neighboring nodes iteratively (Hamilton et al., 2017), to the network at each timestamp. However, all messages in GraphSAGE are treated equally. This is not a problem if the embeddings are only used to represent the network at the current timestamp, but it becomes problematic if they are used to predict future evolution. To see this, take a social network as an example. As illustrated in Fig. 2, suppose that nodes B and C are both direct neighbors of node A, but node B is much more active than node C in organizing various social activities. Node A is then more likely to establish connections with the friends of node B than with those of node C in the future. If the goal is to predict future evolution trends, it is useful to know the activeness of different nodes, and more emphasis should be given to the messages from the nodes that are more active.
To this end, an activeness-aware neighborhood embedding method is proposed. Specifically, given the attributed network $G_t = (V_t, E_t, A_t)$ at timestamp $t$, the embedding of node $v$ is learned by iterating the layer-wise update (1), where $\ell = 0, 1, \dots, L$; $x^{\ell}_{t,u} \in \mathbb{R}^d$ represents the embedding of node $u$ at the $\ell$-th layer, with $x^{0}_{t,u}$ initialized with the $u$-th row of the attribute matrix $A_t$; the values in $p^{\ell}_{t,u}$ represent the degree of activeness of node $u$, which will be discussed in detail later; $N_t(v)$ denotes the set of neighbors of node $v$ at timestamp $t$; the function Aggregate(·) collects the messages $p^{\ell}_{t,u} \odot x^{\ell}_{t,u}$ from all neighbors $u \in N_t(v)$ to constitute a matrix; Mean(·) takes the average of a matrix along its rows; $\odot$ is element-wise multiplication; [· ; ·] denotes the concatenation of two vectors; and $W_x$ denotes the parameters to be learned. From (1), we can see that the activeness vectors $p^{\ell}_{t,u}$ play the role of gates. If node $u$ is very active, the values of the corresponding activeness vector $p^{\ell}_{t,u}$ will be large. Hence, a larger proportion of the embedding of node $u$ flows into its neighboring nodes, exerting a greater influence on them. On the contrary, if node $u$ is not active in social activities, its importance is lowered by diminishing the values in $p^{\ell}_{t,u}$.
The activeness vectors $p^{\ell}_{t,v}$ are computed from a randomly initialized time-invariant matrix $P$ through the update (2), where σ(·) denotes the sigmoid function, and $p^{0}_{t,v}$ is initialized by the $v$-th row of the matrix $P$, which, together with the model parameter $W_p$, is learned from the training data. Although $p^{\ell}_{t,v}$ is computed from the time-invariant $P$, its computation also depends on the time-evolving network topology, so the vector $p^{\ell}_{t,v}$ can still track the evolution of the network.
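One form of the update (1) that is consistent with the definitions above is sketched here. The split into two lines, the intermediate symbol $x^{\ell+1}_{t,N(v)}$, and the unspecified nonlinearity $f$ are assumptions made for illustration and may differ from the authors' exact notation.

```latex
\begin{aligned}
x^{\ell+1}_{t,N(v)} &= \mathrm{Mean}\Big(\mathrm{Aggregate}\big(p^{\ell}_{t,u} \odot x^{\ell}_{t,u},\ \forall u \in N_t(v)\big)\Big), \\
x^{\ell+1}_{t,v} &= f\Big(W_x \big[\, x^{\ell}_{t,v} \,;\, x^{\ell+1}_{t,N(v)} \big]\Big).
\end{aligned}
```

In this reading, a neighbor $u$ whose activeness vector is close to zero contributes almost nothing to the averaged message, which is the gating behavior described in the text.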
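The gating idea of this subsection can be illustrated with a short, dependency-free sketch. Several details are assumptions rather than the authors' specification: the embedding update uses a `tanh` nonlinearity, the activeness update mirrors the embedding update (mean over neighbors, concatenation, linear map) followed by the sigmoid, and the helper names `mean_rows`, `matvec`, and `layer_update` are hypothetical.

```python
import math


def mean_rows(rows, d):
    """Average a list of d-dimensional vectors; return zeros if the list is empty."""
    if not rows:
        return [0.0] * d
    return [sum(r[i] for r in rows) / len(rows) for i in range(d)]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def layer_update(X, P, nbrs, W_x, W_p):
    """One assumed layer of activeness-gated mean aggregation.

    X, P: dicts mapping node -> d-vector (embeddings and activeness at layer l).
    nbrs: dict mapping node -> list of neighbors N_t(v).
    W_x, W_p: d x 2d weight matrices (lists of rows).
    """
    d = len(next(iter(X.values())))
    X_new, P_new = {}, {}
    for v in X:
        # Messages p_u ⊙ x_u: each neighbor's embedding is gated by its activeness.
        gated = [[p * x for p, x in zip(P[u], X[u])] for u in nbrs[v]]
        x_nb = mean_rows(gated, d)
        p_nb = mean_rows([P[u] for u in nbrs[v]], d)
        # List concatenation plays the role of [· ; ·].
        X_new[v] = [math.tanh(z) for z in matvec(W_x, X[v] + x_nb)]
        P_new[v] = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_p, P[v] + p_nb)]
    return X_new, P_new


# Toy example: A has neighbors B (active) and C (inactive, zero activeness).
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
nbrs = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
X = {"A": [0.1, 0.2], "B": [1.0, -1.0], "C": [3.0, 3.0]}
P = {"A": [0.5, 0.5], "B": [0.9, 0.9], "C": [0.0, 0.0]}
X1, P1 = layer_update(X, P, nbrs, W, W)
```

Because node C has zero activeness in this toy example, changing its embedding leaves the updated embedding of node A unchanged, while node B's message passes through almost unattenuated.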

Prediction of the Next-Timestamp Embedding
Given the embeddings $x^{\ell}_{i,v}$ for $i = 1, 2, \dots, t$, this section focuses on how to predict the network status at the next timestamp $t+1$, e.g., the link connections and node categories at $t+1$. In this paper, we predict the future network status by estimating the next-timestamp embedding $\hat{x}_{t+1,v}$ from the previous ones, that is, by finding a mapping from the previous embeddings to $\hat{x}_{t+1,v}$.
The most direct way is to feed the previous embeddings of each layer into an RNN or LSTM and then combine the predictions of the different layers linearly, with model parameters $W_y$ and $b_y$, to form the final prediction. It can be seen from (1) and (2) that as the number of layers increases, broader neighborhood information is included in the embeddings, but at the same time the local information around each node is weakened. Thus, to retain both the global and local neighborhood information, the embeddings $x^{\ell}_{t,v}$ obtained from all intermediate layers $\ell = 1, 2, \dots, L$ are employed for the embedding prediction. RNNs and LSTMs are good at modeling the time dependencies of sequences, but their computations are known to be time-consuming because they are difficult to parallelize.
In fact, for many dynamic networks of interest, the changes at each timestamp are not dramatic. To better model the temporal correlation and speed up the computations, we further propose an attention-based model to predict the next-timestamp embedding. In this model, $\bar{x}^{\ell}_{t,v}$ denotes the summarized representation of the most recent $K$ embeddings until timestamp $t-1$; $K$ is the number of historical embeddings used; $\alpha_{t-k}$ is the attention coefficient assigned to the embedding at timestamp $t-k$; and a weighting term controls how much of the change $x^{\ell}_{t,v} - \bar{x}^{\ell}_{t,v}$ at the previous timestamp is used for the prediction of the next timestamp.
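The attention-based predictor described above can be sketched as follows for a single node and a single layer. This is an illustration under stated assumptions, not the authors' formulation: the attention scores are plain dot products between the current and historical embeddings, the weighting term is a scalar `lam`, the prediction starts from the current embedding, and the function name `predict_next` is hypothetical.

```python
import math


def predict_next(history, K, lam):
    """Predict the next embedding from a list of past embeddings.

    history: list of d-vectors x_1, ..., x_t (the last item is x_t).
    K: number of historical embeddings (before x_t) to summarize.
    lam: assumed scalar weight on the change x_t - x_bar.
    Returns (x_hat, alpha), where alpha are the attention coefficients.
    """
    if len(history) < 2 or K < 1:
        raise ValueError("need at least one historical embedding before x_t")
    x_t = history[-1]
    past = history[-(K + 1):-1]  # x_{t-K}, ..., x_{t-1} (fewer if history is short)
    # Assumed scoring rule: dot product with the current embedding, then softmax.
    scores = [sum(a * b for a, b in zip(x_t, x_k)) for x_k in past]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    alpha = [w / total for w in weights]
    # Summarized representation of the most recent embeddings until t-1.
    x_bar = [sum(a * x_k[i] for a, x_k in zip(alpha, past)) for i in range(len(x_t))]
    # Assumed prediction rule: extrapolate the recent change from x_t.
    x_hat = [x + lam * (x - xb) for x, xb in zip(x_t, x_bar)]
    return x_hat, alpha


# A node moving along a line: with K = 1 and lam = 1 the trend is extrapolated.
x_hat, alpha = predict_next([[0.0], [1.0], [2.0]], K=1, lam=1.0)
```

Unlike an RNN, every step here is a weighted sum over a fixed window, so predictions for all nodes can be computed in parallel.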

Training Objective
Suppose there exists a link $e_{v,u}$ such that $e_{v,u} \in E_{t+1}$ but $e_{v,u} \notin E_1 \cup E_2 \cup \dots \cup E_t$. If the proposed method is able to capture the historical dynamics and make a good prediction for timestamp $t+1$, the predicted embeddings of nodes $u$ and $v$, i.e., $\hat{x}_{t+1,v}$ and $\hat{x}_{t+1,u}$, should be close to each other in the vector space. The objective function is therefore defined in terms of $p(\hat{x}_{t,v} \mid \hat{x}_{t,u})$, the conditional probability of the embedding $\hat{x}_{t,v}$ given the embedding $\hat{x}_{t,u}$, which is defined through a softmax function. To alleviate the computational burden of repeatedly evaluating the softmax function, as done in (Mikolov et al., 2013), the negative sampling technique is employed by optimizing an alternative loss, where $R$ is the number of negative samples and D(v) ∝ d
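A negative-sampling loss in the style of (Mikolov et al., 2013) can be sketched for one positive pair as below. The exact loss used by the model is not reproduced here; the dot-product similarity, the function names, and the convention that the caller supplies the $R$ sampled negative embeddings (drawn from whatever noise distribution is chosen) are assumptions of this sketch.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_loss(x_v, x_u, negatives):
    """Loss for one newly linked pair (v, u) and R sampled negative embeddings.

    Pulls x_v and x_u together and pushes each negative embedding away from x_u.
    """
    loss = -log_sigmoid(dot(x_v, x_u))
    for x_z in negatives:
        loss -= log_sigmoid(-dot(x_z, x_u))
    return loss


# Aligned positive pair and an anti-aligned negative give a small loss.
good = neg_sampling_loss([1.0, 1.0], [1.0, 1.0], [[-1.0, -1.0]])
bad = neg_sampling_loss([-1.0, -1.0], [1.0, 1.0], [[1.0, 1.0]])
```

The cost of one evaluation grows with the number of negatives rather than with the number of nodes, which is what makes this loss cheaper than the full softmax.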

Experiments
In this section, we evaluate the performance of the proposed methods on two tasks: dynamic link prediction and node classification. For link prediction, we predict the new links that appear at timestamp $t+1$ for the first time based on the historical observations until $t$ (Goyal et al., 2018). For node classification, the categories of nodes at timestamp $t+1$ are predicted, with only the nodes that change their categories at $t+1$ considered (Zhou et al., 2018). In the experiments, the dimension of the network embeddings is set to 100 for all considered methods. The number of negative samples is set to 1 and the mini-batch size is set to 50 to speed up the training process. Adam (Kingma and Ba, 2014) is employed to train the proposed model with a learning rate of $1 \times 10^{-4}$.

Datasets, Baselines and Evaluation Metric
Datasets To evaluate the proposed methods, we collect four real-world dynamic attributed network datasets, covering a user action network, a brain activity network, and two academic citation networks. The statistics of the four datasets are summarized in Table 1.
• MOOC is a user action dataset collected by (Kumar et al., 2019), in which users and course activities are represented as nodes, and the actions of users on the courses are represented as edges. The actions have attributes and timestamps, hence the dataset can be treated as a dynamic attributed graph. In our experiment, we split the dataset into 20 timestamps.
• Brain is a brain activity dataset collected by (Xu et al., 2019a). The tidy cubes of brain tissue and the connectivity between them are represented as nodes and edges, respectively. PCA is applied to the functional magnetic resonance imaging data to generate node attributes. If two tidy cubes show a similar degree of activation, they are connected by an edge.
• DBLP is a citation network consisting of bibliography data from computer science. In our experiment, only the authors with at least three publications between 1995 and 2010 are collected. Each author is viewed as a node, and the corresponding titles and abstracts are processed into the attributes of the node. Specifically, all titles and abstracts published by an author are concatenated in reverse chronological order. We then pass the concatenated words into the pre-trained BERT BASE (Devlin et al., 2019) and use the vector of [CLS] in the last layer as the representation. Since the maximum input length of BERT BASE is 512 tokens, words that exceed this limit are removed. The ground-truth category of an author is decided by the venues where most of his/her papers are published.
• ACM is similar to the DBLP dataset. Here, only the authors who published at least three papers between 1991 and 2009 are taken into account. Similarly, BERT BASE (Devlin et al., 2019) is applied to generate the attributes of each node.
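The text preparation described for the DBLP and ACM attributes (concatenate an author's titles and abstracts in reverse chronological order, then truncate) can be sketched as follows. The record fields `year`, `title`, and `abstract`, the function name `build_author_text`, and the use of whitespace-separated words as a stand-in for BERT tokens are assumptions of this sketch; the encoding step with BERT BASE is not shown.

```python
def build_author_text(papers, max_tokens=512):
    """Concatenate an author's papers, newest first, and truncate.

    papers: list of dicts with hypothetical keys "year", "title", "abstract".
    max_tokens: length limit, counted here in whitespace-separated words
                (an approximation of the tokenizer's limit).
    """
    ordered = sorted(papers, key=lambda p: p["year"], reverse=True)
    words = []
    for p in ordered:
        words.extend((p["title"] + " " + p["abstract"]).split())
    return " ".join(words[:max_tokens])


papers = [
    {"year": 2001, "title": "old title", "abstract": "a b"},
    {"year": 2005, "title": "new title", "abstract": "c d"},
]
text = build_author_text(papers)
```

Placing the newest papers first means that, when an author's text exceeds the limit, it is the oldest publications that are dropped.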
Baselines For comparisons, several baseline methods are considered, including both static and dynamic methods.
Among the static baselines, DeepWalk, LINE, and Node2Vec use only the network structure, while CANE, WANE, and SAGE leverage both the network structure and the attributes. When the static methods are used for link prediction on dynamic networks, we apply them to the most recently observed network, and the obtained embeddings are then used to predict the links at the next timestamp.
Evaluation Metrics For the task of dynamic link prediction, the widely used area under the ROC curve (ROC-AUC) (Hanley and McNeil, 1982), area under the PR curve (PR-AUC) (Davis and Goadrich, 2006), and F1 score are utilized to evaluate the performance of the learned embeddings. For the dynamic node classification task, a logistic regression model is used to classify the embeddings into different categories. The classifier is trained with the provided labels of nodes. The weighted sum of the F1 scores of the different categories is used as the performance criterion for this task. All the experiments in this paper are repeated 10 times, and the average results are reported.
Table 3: Performance of dynamic link prediction in percentages on MOOC and Brain datasets.
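Two of the evaluation metrics named above can be computed with the short sketch below. The ROC-AUC uses its standard rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative). For the weighted F1, weighting each category by its share of the true labels is an assumption about how the weighted sum is formed; the function names are hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of scores,
    which is fine for illustration but slow for large test sets.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def weighted_f1(y_true, y_pred):
    """Sum of per-category F1 scores, each weighted by its share of y_true."""
    total = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * y_true.count(c) / len(y_true)
    return total


auc = roc_auc([0.9, 0.8], [0.1, 0.2])
f1 = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

For link prediction, the positive scores would come from held-out new links and the negative scores from sampled non-links, with each score given by the similarity of the two predicted node embeddings.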

Dynamic Link Prediction
For the performance evaluation of link prediction, as done in (Goyal et al., 2018), 20% of the new links at timestamp $t+1$ are randomly selected to fine-tune the proposed model, and the remaining 80% are held out for testing. To be fair, the selected 20% of links are also included in the training data of all the baseline methods. The ROC-AUC, PR-AUC, and F1 scores of the different models are shown in Table 2 for the DBLP and ACM datasets and in Table 3 for the MOOC and Brain datasets, with the best performance highlighted in bold. Note that Dane-RNN, Dane-LSTM, and Dane-ATT denote the proposed network embedding models that employ an RNN, an LSTM, and attention in the next-timestamp prediction, respectively. From Table 2, it can be seen that the proposed Dane models consistently outperform the baseline methods by a substantial margin on the two citation network datasets, DBLP and ACM. The results suggest that the proposed methods successfully incorporate the network evolution into the embeddings and thus significantly improve the performance on the task of dynamic link prediction. We can also see that the simple attention-based model Dane-ATT achieves comparable or even better performance than the more complicated Dane-RNN and Dane-LSTM models. This supports our hypothesis that, for dynamic networks that do not evolve too fast, the attention mechanism is sufficient to model the temporal dynamics. By examining the static methods, it can be seen that the methods using attributes (e.g., CANE and SAGE) generally perform better than those that do not (e.g., DeepWalk), indicating that it is rewarding to incorporate the attributes into the embeddings. We can also observe that the existing dynamic embedding methods DynGEM and DynAERNN achieve better performance than the static methods, although they do not exploit the attributes, demonstrating the importance of modeling the temporal correlations for predicting future evolution.
By jointly considering the attributes and dynamics, the proposed Dane models perform best. To further test the model's generalization ability on datasets from other domains, experiments on the user action network and the brain activity network are conducted. It can be observed from Table 3 that the proposed Dane models also perform best on the MOOC and Brain datasets, which supports the generalization capacity of the proposed model across domains.
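The link-prediction protocol described in this subsection (take the links that appear for the first time at the next timestamp, then split them 20%/80% into fine-tuning and test sets) can be sketched as below. The representation of edges as sorted tuples, the fixed random seed, and the function names are illustrative assumptions.

```python
import random


def new_links(edge_sets, t):
    """Links in E_{t+1} that never appeared in E_1, ..., E_t.

    edge_sets: list of sets of edges, with edge_sets[0] holding E_1.
    t: number of observed timestamps, so edge_sets[t] holds E_{t+1}.
    """
    seen = set().union(*edge_sets[:t])
    return edge_sets[t] - seen


def split_links(links, finetune_frac=0.2, seed=0):
    """Randomly split links into a fine-tuning part and a held-out test part."""
    links = sorted(links)  # fixed order so the split is reproducible
    random.Random(seed).shuffle(links)
    cut = int(round(finetune_frac * len(links)))
    return links[:cut], links[cut:]


edge_sets = [
    {(1, 2)},
    {(1, 2), (2, 3)},
    {(1, 2), (1, 3), (2, 3), (3, 4)},
]
targets = new_links(edge_sets, 2)
finetune, test = split_links([(i, i + 1) for i in range(10)])
```

Restricting the targets to first-time links prevents a model from scoring well merely by repeating edges it has already observed.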

Impacts of Different Modules
To investigate the importance of the activeness-aware module and the temporal correlation modeling module, we evaluate the performance of models that exclude one or both of them. Specifically, in addition to Dane-ATT, we consider three variants: 1) Dane-ATT (w/o P), the model without the activeness vector; 2) Dane-ATT (w/o D), the model without dynamic modeling; 3) Dane-ATT (w/o DP), the model that uses neither. The ROC-AUCs evaluated on the four datasets are reported in Fig. 3. It can be seen that removing either the activeness vector or the dynamic modeling causes an immediate performance drop. This demonstrates the importance of considering both the node activeness spatially and the time correlations temporally. The drop caused by excluding the dynamic modeling is larger, suggesting the importance of taking the historical information into account when embedding dynamic networks. Moreover, if neither the activeness vector nor the dynamic modeling is used, the worst performance is observed. It is interesting to point out that Dane-ATT (w/o DP) is actually the static GraphSAGE method, while Dane-ATT (w/o D) is the static embedding method that uses the activeness of nodes. Comparing the performance of these two variants confirms again the benefit of considering the activeness of nodes.
Impacts of the Parameters L and K The parameter L represents the number of layers used in the neighborhood embedding, while K is the number of timestamps that we look back for the next-timestamp prediction. To investigate the impacts of the two parameters, the performance of Dane-ATT with different numbers of layers L and lookback timestamps K is evaluated on the DBLP and ACM datasets. The values of ROC-AUC as functions of L and K are illustrated in Fig. 4. It can be seen that as L increases, the performance of the proposed model increases rapidly at the beginning and then converges at around L = 3. The significant improvement at the beginning suggests that incorporating information from the high-order neighborhood into the embedding is highly beneficial to the modeling of dynamic evolution. But as L continues to increase, no further improvement is obtained. This may be because a larger L also reduces the local neighborhood information retained. A similar trend can be observed in the experiments with different K. That is, the performance initially increases as K grows, but soon saturates. This reveals that, in the dynamic citation networks, it is sufficient to look back only several timestamps when predicting the future trends. This also helps explain why the attention-based model generally performs better than the RNN- or LSTM-based ones in the prediction of dynamic networks: these dynamic networks often have a short memory and therefore simple temporal correlations. Thus, for applications of this kind, the more complicated RNN or LSTM may have a detrimental impact on the prediction performance. We believe that when more complex dynamic networks are considered, a larger K should be used.

Dynamic Node Classification
In this section, the experiment on dynamic node classification is conducted on the DBLP dataset. The intuition is that if a model is able to capture the evolution of dynamic attributed networks, the evolution of nodes' categories should also be manifested in the predicted embeddings. To this end, nodes are randomly split into a training set and a testing set in equal proportions (50%-50%). Then a logistic regression classifier with L2 regularization is trained on the node embeddings and their corresponding categories. Since the categories evolve over time in the DBLP dataset, only users whose categories change at the next timestamp are used for testing. We repeated the experiment 10 times, and the average of the weighted sum of F1 scores over the different categories is reported. It can be seen from Fig. 5 that Dane-ATT performs the best among all compared methods. To evaluate the quality of the obtained embeddings, we further visualize them on a 2-D plane with t-SNE. Following (Zuo et al., 2018), a sample of 500 nodes from each category is randomly selected, and the result is shown in Fig. 6. Nodes from different categories are well separated, demonstrating that the obtained embeddings preserve the category information of the network.
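The selection of training and test nodes described above can be sketched as follows. Reading the protocol as "split all nodes 50%-50%, then keep only the test-half nodes whose category changes at the next timestamp" is an assumption, as are the function name and the fixed seed; training the logistic regression classifier is not shown.

```python
import random


def classification_split(labels_t, labels_next, seed=0):
    """Return (train_nodes, test_nodes) for dynamic node classification.

    labels_t: dict node -> category at timestamp t.
    labels_next: dict node -> category at timestamp t+1.
    Test nodes are restricted to those whose category changes at t+1.
    """
    nodes = sorted(labels_next)
    random.Random(seed).shuffle(nodes)
    half = len(nodes) // 2
    train, test_half = nodes[:half], nodes[half:]
    test = [v for v in test_half
            if v in labels_t and labels_t[v] != labels_next[v]]
    return train, test


labels_t = {i: 0 for i in range(10)}
labels_next = {i: (1 if i % 2 else 0) for i in range(10)}
train_nodes, test_nodes = classification_split(labels_t, labels_next)
```

Testing only on nodes whose category changes means that a model which simply copies the current category would get every test case wrong, so the score reflects how well the predicted embeddings anticipate the change.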

Conclusion
In this paper, a dynamic attribute network embedding framework is proposed to track the network evolution by modeling the high-order correlations in spatial and temporal dimensions jointly. To this end, an activeness-aware neighborhood embedding method is proposed to extract the high-order neighborhood information at each timestamp. Then, an embedding prediction framework is developed to capture the temporal correlations. Extensive experiments were conducted on four real-world datasets over the tasks of link prediction and node classification, confirming the ability of the model to track the evolutions of dynamic networks.