Variation Autoencoder Based Network Representation Learning for Classification

Network representation is the basis of many applications and is of extensive interest in various fields, such as information retrieval, social network analysis, and recommendation systems. Most previous methods for network representation consider only partial aspects of the problem, such as link structure alone, node information alone, or a partial integration of the two. The present study introduces a deep network representation model that seamlessly integrates the text information and structure of a network. The model captures highly non-linear relationships between nodes and complex features of a network by exploiting the variational autoencoder (VAE), which is a deep unsupervised generative algorithm. The representation learned with a paragraph vector model is merged with that learned with the VAE to obtain the network representation, which preserves both structure and text information. Comprehensive experiments conducted on benchmark datasets show that the introduced model performs better than state-of-the-art techniques.


Introduction
Information network representation is an important research issue because it is the basis of many applications, such as document classification in citation networks, functional label prediction in protein-protein interaction networks, and potential friend recommendations in social networks. Although a number of recent studies have addressed this issue (Belkin and Niyogi, 2003; Tenenbaum et al., 2001; Cao et al., 2015; Tian et al., 2014; Cao, 2016), the results are still far from satisfactory because of the intrinsic difficulty of the problem. In essence, the rich and complex information (i.e., link structure and node contents) embedded in information networks poses a significant challenge to the effective representation of networks.
Network-distributed representation learning can be viewed as the problem of using low-dimensional vectors to represent the nodes of a network. Most network representation methods are based on network structure. Traditional methods rely on matrix decomposition and use eigenvectors as the representation (Belkin and Niyogi, 2003; Roweis and Saul, 2000; Tenenbaum et al., 2001), and they have been extended to high-order information (Cao et al., 2015). However, these methods are not applicable to large-scale networks, and although many approximate approaches have been developed to solve this problem, they are not effective enough. Other methods are based on optimizing objective functions (Tang et al., 2015; Pan et al., 2016; Yang et al., 2015). Although they are suitable for large-scale network data, they adopt shallow models that are limited in performance and have difficulty capturing the highly non-linear relationships that are vital to preserving network structure. Inspired by deep learning techniques in natural language processing, Perozzi et al. (2014) and Grover and Leskovec (2016) used truncated random walks in networks to generate node sequences that serve as a sentence corpus and then applied the skip-gram model to these sequences to learn node representations. However, these methods cannot easily handle additional information during random walks in a network.
To capture highly non-linear structures in large-scale networks, Tian et al. (2014) and Cao (2016) introduced an autoencoder for model training instead of using a sampling-based method to generate linear sequences. Motivated by this model, we adopt the variational autoencoder (VAE) (Kingma and Welling, 2014), which is a deep generative model, instead of a basic autoencoder. Most previous studies utilized only one type of information in networks. Le and Mikolov (2014) focused on node content, whereas others (Grover and Leskovec, 2016; Perozzi et al., 2014) explored link structure. Although a few previous models (Pan et al., 2016; Yang et al., 2015) combined both text information and network structure, they did not preserve the complete network structure and only partially utilized node content. A straightforward method is to learn representations from text features and network structure independently and then concatenate the two separate representations.
To address the above issues, we introduce a deep generative model that learns network representations by modeling both node content information and network structure comprehensively. First, we obtain a content-based representation of each node with the paragraph vector model. Then, we feed the network adjacency matrix and this representation into a deep generative model, the building block of which is the VAE. After stacking several layers of the VAE, the result of the first layer is chosen before decoding as the final representation. Intuitively, we obtain a representation that contains both content information and structure in a d-dimensional feature space. The experimental evaluation demonstrates the superior performance of the model on the benchmark datasets.

Preliminary
Notation: Let $G = (V, E, C)$ denote a given network, where $V = \{v_i\}_{i=1 \ldots N}$ is the node set and $E = \{e_{ij}\}$ is the edge set that indicates the relations between nodes. When the network is unweighted, $e_{ij} = 1$ if a direct link exists between $v_i$ and $v_j$, and $e_{ij} = 0$ otherwise. $C = \{c_i\}$ is the set of content information. Let $A$ denote the adjacency matrix of the network, and let $x = \{e_{i,k}, \ldots, e_{n,k}\}$ be an adjacency vector. Our goal is to seek a low-dimensional vector $y_i$ for each node $v_i$ of a given network.
Autoencoder: We first provide a brief description of a basic autoencoder and the VAE. The basic autoencoder first compresses the input into a compact form and then transforms it back into an approximation of the input. The encoding part aims to find the compressed representation $z$ of given data $x$, and the decoding part mirrors the encoder and is used to reconstruct the original input $x$. The VAE (Kingma and Welling, 2014) imposes a prior distribution on the hidden-layer vector of the autoencoder and reparameterizes the network according to the parameters of the prior distribution. Through the reparameterization process, the mean and variance of the input data can be learned. We extend the VAE to generate two sets of means and variances of the input data, which can be considered to correspond to the content and the structure, respectively.
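The notation above can be made concrete with a small sketch that builds the adjacency matrix and reads off one adjacency vector. The helper names `adjacency_matrix` and `adjacency_vector`, the toy edge list, and the choice to treat links as undirected are assumptions made for illustration; they are not taken from the paper.

```python
def adjacency_matrix(num_nodes, edges):
    """Build a 0/1 adjacency matrix A from an edge list of (i, j) pairs."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # assumption: links are treated as undirected
    return A


def adjacency_vector(A, k):
    """Return column k of A: the entries e_{1,k}, ..., e_{N,k}."""
    return [row[k] for row in A]


# Toy network with four nodes connected in a chain 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)
x1 = adjacency_vector(A, 1)  # node 1 is linked to nodes 0 and 2
```

For an unweighted, undirected network the matrix is symmetric, so rows and columns of `A` give the same adjacency vectors.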

Model Description
The architecture of the proposed model is shown in Fig. 1. The whole architecture consists of two main modules, namely, the content2vec module and the union training module. For an information network, such as a paper citation network, we can obtain the node link and content information (e.g., paper abstract). We learn an effective feature representation vector that preserves both structure information and node content information and can thus be applied to many tasks (e.g., paper classification).

Content2vec Module
We employ the state-of-the-art approach called doc2vec (Le and Mikolov, 2014), which utilizes text to learn vector representations of documents, as our content2vec module. Specifically, if one node contains other information (e.g., author name), we treat it as a word and merge it into the comprehensive text information (e.g., the abstract of the paper in the citation network) as the content of the node. A representation $u_i$ that includes the node content information is obtained from this module.
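As a rough illustration of how extra node information could be merged into the text, the sketch below turns each author name into a single token and appends it to the tokenized abstract. The function name `build_node_document`, the tokenization rule, and the underscore-joined author token are all assumptions for this example; the paper does not specify these details.

```python
import re


def build_node_document(abstract, authors):
    """Tokenize the abstract and append each author name as one extra word."""
    # Assumption: lowercase alphanumeric tokens are a good enough tokenization.
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    # Assumption: an author name becomes a single token such as "jane_doe".
    author_tokens = ["_".join(name.lower().split()) for name in authors]
    return tokens + author_tokens


doc = build_node_document("Deep models for graphs.", ["Jane Doe"])
```

The resulting token list is the kind of per-node document that a paragraph vector implementation would take as input to produce $u_i$.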

Union-training Module
The union training module is the core part of our model, in which content information and structure information are integrated. The details are shown in Fig. 1. The VAE is adopted as the main block. Given a network, the adjacency matrix $A$ can be obtained. $A$ describes the relationships among the nodes and reflects the overall structure of the network. We extract each adjacency vector $a_i$ and concatenate it with the corresponding $u_i$ as the input $x_i$ of our model. Therefore, the content and structure information can be learned simultaneously.
Figure 1: Architecture of our model. $w_i$ can be seen as a word of the content information, $v_i$ is a node in the network, $u_i$ is the representation vector learned by the content2vec module, and $x_i$ is a vector of the adjacency matrix. The input of the union-training module is the combination of $x_i$ and $u_i$. The encoder and decoder are stacks of fully connected layers. $\mu_{i1}$, $\mu_{i2}$ and $\sigma_{i1}$, $\sigma_{i2}$ can be seen as the means and variances of the distributions of the content and structure data, respectively. $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are samples drawn from two Gaussian distributions.
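The input construction for this module, concatenating each adjacency vector with the corresponding content vector, can be sketched as follows. The names `build_input` and `build_inputs` and the toy values are hypothetical; only the concatenation itself comes from the description above.

```python
def build_input(adjacency_row, content_vector):
    """Concatenate one adjacency vector a_i with its content vector u_i."""
    return [float(a) for a in adjacency_row] + list(content_vector)


def build_inputs(A, U):
    """Build one input x_i per node from adjacency matrix A and content vectors U."""
    if len(A) != len(U):
        raise ValueError("A and U must describe the same number of nodes")
    return [build_input(a_i, u_i) for a_i, u_i in zip(A, U)]


# Toy example: three nodes and two-dimensional content vectors.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
U = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.0]]
X = build_inputs(A, U)
```

Each input then has length N + d', where N is the number of nodes and d' is the size of the content vector, so the encoder sees structure and content side by side.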
During the encoding phase, we adopt several fully connected layers composed of multiple nonlinear mapping functions to map the input data to a highly nonlinear latent space. Given the input $x_i$, the output $h^k$ of the $k$-th layer is obtained by applying the nonlinear activation function $\pi$ of that layer. The number of layers $K$ varies with the data. In the last layer of the encoder, we obtain four outputs: $\mu_{i1}$, $\sigma_{i1}$, $\mu_{i2}$, and $\sigma_{i2}$. They can be treated as the means and variances of the distributions of content information and structure information, respectively. Furthermore, we sample two values $\varepsilon_{i1}$ and $\varepsilon_{i2}$ from two prior distributions (e.g., Gaussian distributions). We can then obtain the reparameterized vectors $z_i^1$ and $z_i^2$. By concatenating $z_i^1$ and $z_i^2$, content and structure information are integrated, and the result $y_i$ is the representation of the network. Nonlinear operations are not performed in this phase, so the gradient descent method can be safely applied in optimization. Here, $f$ denotes the linear function used for reparameterization, and $\mathrm{Merge}$ directly concatenates the two vectors to form $y_i$.
The decoding phase mirrors the encoder; its output $\hat{x}_i$ should be close to the input $x_i$. The loss function to be minimized in this module combines two terms: $\mathrm{KL}$, the KL divergence, which is commonly used as a measure of the difference between two distributions, and $H$, a cross-entropy function that measures the difference between $x_i$ and $\hat{x}_i$. Finally, we choose the output of the layer $y_i$ as the final representation of each node.
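The reparameterization, merge, and loss terms described in this section can be illustrated with a minimal sketch. It rests on several assumptions that the text does not confirm: the reparameterization takes the standard VAE form z = mu + sigma * eps, sigma is treated as a standard deviation, the prior is a standard normal, and the cross-entropy is the binary form for inputs in [0, 1]. All function names are hypothetical.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Assumed linear form z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def merge(z1, z2):
    """Concatenate the content and structure latent vectors into y_i."""
    return list(z1) + list(z2)


def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over dimensions (assumed prior)."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def cross_entropy(x, x_hat, tiny=1e-12):
    """Binary cross-entropy between input x and reconstruction x_hat."""
    return -sum(a * math.log(b + tiny) + (1.0 - a) * math.log(1.0 - b + tiny)
                for a, b in zip(x, x_hat))


rng = random.Random(0)
mu1, sigma1 = [0.0, 0.5], [1.0, 0.5]   # content branch (toy values)
mu2, sigma2 = [1.0], [2.0]             # structure branch (toy values)
y = merge(reparameterize(mu1, sigma1, rng), reparameterize(mu2, sigma2, rng))
loss = (kl_to_standard_normal(mu1, sigma1)
        + kl_to_standard_normal(mu2, sigma2)
        + cross_entropy([1.0, 0.0], [0.9, 0.2]))
```

Because the sampled noise enters only through a sum and a product with the encoder outputs, gradients can flow through `mu` and `sigma`, which is the usual reason for reparameterizing.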

Experimental Setup
Paper citation networks are classical social information networks. To evaluate the quality of the proposed model, we conduct three important tasks on two benchmark citation network datasets, CiteSeer-M10 and DBLP. We compare our approach with the following methods:
• One-Hot uses the adjacency matrix, which carries the structure information, as a high-dimensional representation and feeds it directly into the classifier.
• DeepWalk (Perozzi et al., 2014) employs truncated random walks to learn node embeddings, treating each walk as the equivalent of a sentence for a statistical language model.
• Node2vec (Grover and Leskovec, 2016) learns the network representation by designing a biased random walk procedure which efficiently explores diverse neighborhoods.
• DW+D2V simply concatenates the representations learned by DeepWalk and Doc2vec.
• TADW (Yang et al., 2015) is text-based DeepWalk, which incorporates text information into network structure by matrix factorization.
• TriDNR (Pan et al., 2016) uses node text, label, and structure to jointly learn node representation.
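To make the random-walk baselines more concrete, the following sketch generates truncated random walks that play the role of sentences. It is a simplified illustration, not the DeepWalk reference implementation: walks are uniform and unbiased, and the names `truncated_random_walk` and `build_corpus` and the toy graph are assumptions.

```python
import random


def truncated_random_walk(neighbors, start, walk_length, rng):
    """Walk uniformly at random from start, stopping at walk_length nodes."""
    walk = [start]
    while len(walk) < walk_length:
        candidates = neighbors.get(walk[-1], [])
        if not candidates:
            break  # dead end: return the shorter walk
        walk.append(rng.choice(candidates))
    return walk


def build_corpus(neighbors, walks_per_node, walk_length, seed=0):
    """Start walks_per_node walks from every node; each walk is one sentence."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(neighbors)
        rng.shuffle(nodes)  # visit start nodes in a fresh random order each pass
        for node in nodes:
            corpus.append(truncated_random_walk(neighbors, node, walk_length, rng))
    return corpus


# Toy chain graph 0 - 1 - 2 - 3 given as an adjacency list.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
corpus = build_corpus(neighbors, walks_per_node=2, walk_length=5)
```

A skip-gram model trained on such a corpus would treat node identifiers as words; a biased walk in the style of Node2vec would replace the uniform `rng.choice` step with a weighted choice.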

Performance on Node Classification
We conduct the paper classification task on two benchmark citation networks to evaluate the performance of our method. To reduce the impact of sophisticated classifiers on the performance, we employ a linear SVM, which is a common technique used in existing work (Pan et al., 2016). The results are shown in Table 1 and Table 2, respectively. The parameters of our model are set as follows: dimension d=100 on CiteSeer-M10 and d=300 on DBLP. The dimension for the other algorithms is the same as ours, and their other parameters are set as reported in their papers, i.e., window size b=10 in DeepWalk and Node2vec, in-out parameter q=2 in Node2vec, and text weight ∂=0.8 in TADW and TriDNR. We use Macro-F1, the same metric adopted by the other algorithms, to measure the classification performance. The experiments are independently conducted 10 times for each setting, and the average values are reported. The proportion of labeled training data ranges from 10% to 70%.
Our model is evaluated by comparing it with seven approaches. One-Hot uses the original structure data, and its performance is poor because the representation is discrete and cannot capture the context relations of nodes. DeepWalk and Node2vec are structure-based methods that exhibit inferior performance, mainly because they use only shallow structure information while the network is rather sparse, so the information in the complex nonlinear structure cannot be exploited. The performance of Doc2vec is not as good as ours, which demonstrates the effectiveness of our proposed model. Although TADW and TriDNR also consider both text and structure, they are inferior to our approach, as they cannot capture the complex non-linear structure. Our method is superior because it can effectively capture the inter-relationship between node content and link structure as well as the intra-relationship among nodes and links, both of which are essential to learning the representation of networks.
Furthermore, by exploiting the VAE, our model can capture the information in highly non-linear structure rather than only the shallow structure (e.g., DeepWalk). Moreover, our approach does not require the heavy text information utilized by other state-of-the-art strategies (e.g., TriDNR). Our model exhibits consistently superior performance and is up to 16% better than the state-of-the-art methods (i.e., the Macro-F1 score of our model is 94% on the CiteSeer-M10 dataset when the proportion of labeled training data is 70%).
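For readers who want to reproduce the metric, the sketch below computes Macro-F1 and averages it over repeated runs. It is a plain illustration of the standard definition (per-class F1 averaged with equal weight), with hypothetical function names and toy labels; it does not reproduce the paper's classifier or data splits.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_over_runs(run_scores):
    """Average the score of independent runs, as done for each setting."""
    return sum(run_scores) / len(run_scores)


# Toy labels: class "a" has F1 = 2/3 and class "b" has F1 = 4/5.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, Macro-F1 penalizes poor performance on small classes more than accuracy does, which matters when class sizes differ.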

Parameter Setting
A significant hyperparameter in our model is the dimension d. We evaluate the performance of different methods with varying dimensions. The result is illustrated in Fig. 2. We obtain very good performance on the CiteSeer-M10 dataset, i.e., the Macro-F1 score is 94%, and the performance tends to be stable as d becomes larger. This validates the effectiveness of our algorithm, which we attribute to the ability of our model to capture both the complex network structure and the text information. From Fig. 2, we can see that the performance improves as d increases from 100 to 600. We think the main reason is that more information can be preserved in a higher-dimensional space.

Conclusions
In this paper, we have introduced an effective network representation model that comprehensively integrates the text information and the network structure. We used the paragraph vector model as a preliminary module and exploited the variational autoencoder as the main block of our model, which can capture the highly non-linear structure of the network. The comprehensive experimental evaluation on two benchmark datasets has demonstrated the effectiveness of the model.