Discriminative Deep Random Walk for Network Classification

Deep Random Walk (DeepWalk) can learn a latent space representation for describing the topological structure of a network. However, for relational network classification, DeepWalk can be suboptimal as it lacks a mechanism to optimize the objective of the target task. In this paper, we present Discriminative Deep Random Walk (DDRW), a novel method for relational network classification. By solving a joint optimization problem, DDRW can learn latent space representations that capture the topological structure well and are at the same time discriminative for the network classification task. Our experimental results on several real social networks demonstrate that DDRW significantly outperforms DeepWalk on multi-label network classification tasks, while retaining the topological structure in the latent space. DDRW is stable and consistently outperforms the baseline methods across various percentages of labeled data. DDRW is also an online method that is scalable and can be naturally parallelized.


Introduction
Categorization is an important task in natural language processing, especially with the growing scale of documents on the Internet. As documents are often not isolated, a large amount of linguistic material presents a network structure, such as citation, hyperlink and social networks. The large size of networks calls for scalable machine learning methods to analyze such data. Recent efforts have been made in developing statistical models for various network analysis tasks, such as network classification (Neville and Jensen, 2000), content recommendation (Fouss et al., 2007), link prediction (Adamic and Adar, 2003), and anomaly detection (Savage et al., 2014). One common challenge of statistical network models is to deal with the sparsity of networks, which may prevent a model from generalizing well.
One effective strategy to deal with network sparsity is to learn a latent space representation for the entities in a network (Hoff et al., 2002; Tang and Liu, 2011; Tang et al., 2015). Among various approaches, DeepWalk (Perozzi et al., 2014) is a recent method that embeds all the entities into a continuous vector space using deep learning methods. DeepWalk captures entity features like neighborhood similarity and represents them by Euclidean distances (see Figure 1(b)). Furthermore, since entities that have closer relationships are more likely to share the same hobbies or belong to the same groups, such an embedding by DeepWalk can be useful for network classification, where the topological information is explored to encourage a globally consistent labeling.
Although DeepWalk is effective at learning embeddings of the topological structure, it lacks a mechanism to optimize the objective of the target task when dealing with network classification, and thus often leads to suboptimal embeddings. In particular, for our focus of relational network classification, we would like the embeddings both to represent the topological structure of the network actors and to be discriminative in predicting the class labels of actors.
To address the above issues, we present Discriminative Deep Random Walk (DDRW) for relational network classification. DDRW extends DeepWalk by jointly optimizing the classification objective and the objective of embedding entities in a latent space that maintains the topological structure. Under this joint learning framework, DDRW manages to learn latent representations that are strongly associated with the class labels (see Figure 1(c)), making it easy to find a separating boundary between the classes, while the actors that are connected in the original network remain close to each other in the latent social space. This idea of combining task-specific and representation objectives has been widely explored in other areas, such as MedLDA and Supervised Dictionary Learning (Mairal et al., 2009).
Figure 1: Different experimental results of embedding a network into a two-dimensional real space (panel (c): DDRW Embedding). We use Karate Graph (Macskassy and Provost, 1977) for this example. Four different colors stand for the classes of the vertices. In (b), vertices which have stronger relations in the network are more likely to be closer in the embedding latent space. In (c), besides the above-mentioned property, DDRW makes vertices in different classes more separated.
Technically, to capture the topological structure, we follow the idea of DeepWalk by running truncated random walks on the original network to extract sequences of actors, and then building a language model (i.e., Word2Vec (Mikolov et al., 2013b)) to project the actors into a latent space. To incorporate the supervising signal in network classification, we build a classifier based on the latent space representations. By sharing the same latent social space, the two objectives are strongly coupled and the latent social space is guided by both the network topology and the class labels. DDRW optimizes the joint objective by using stochastic gradient descent, which is scalable and embarrassingly parallelizable.
We evaluate the performance on several real-world social networks, including BlogCatalog, Flickr and YouTube. Our results demonstrate that DDRW significantly boosts the classification accuracy of DeepWalk in multi-label network classification tasks, while still retaining the topological structure in the learnt latent social space. We also show that DDRW is stable and consistently outperforms the baseline methods across various percentages of labeled data. Although, for clarity, the networks we use carry only topological information, DDRW is flexible enough to consider additional attributes (if any) of vertices. For example, DDRW can be naturally extended to classify documents/webpages, which are often represented as a network (e.g., citation/hyperlink network), by adding a word2vec component that embeds the documents/webpages into the same latent space, similar to previous work on extending DeepWalk to incorporate attributes (Yang et al., 2015).

Problem Definition
We consider the network classification problem, which classifies entities from a given network into one or more categories from a set $\mathcal{Y}$. Let $G = (V, E, Y)$ denote a network, where $V$ is the set of vertices, representing the entities of the network; $E \subseteq (V \times V)$ is the set of edges, representing the relations between the entities; and $Y \subseteq \mathbb{R}^{|V| \times |\mathcal{Y}|}$ denotes the labels of entities. We also consider $Y_U$ as a set of unknown labels in the same graph $G$. The target of the classification task is to learn a model from labeled data and generate a label set $Y_P$ as the prediction of $Y_U$. The difference between $Y_P$ and $Y_U$ indicates the classification quality.
When classifying elements $X \in \mathbb{R}^n$, traditional machine learning methods learn a weight matrix $H$ to minimize the difference between $Y_P = F(X, H)$ and $Y_U$, where $F$ is any known fixed function. In the network setting, we can utilize well-developed machine learning methods if adequate information about $G$ is embedded into a corresponding form as $X$. With this motivation, relational learning (Getoor and Taskar, 2007; Neville and Jensen, 2000) methods are widely employed. In network classification, the internal structure of a network is analyzed to extract the neighboring features of the entities (Macskassy and Provost, 2007; Wang and Sukthankar, 2013). Accordingly, the core problem is how to describe irregular networks within formal feature spaces. A variety of approaches have been proposed with the purpose of finding effective statistical information through the network (Henderson et al., 2011; Tang and Liu, 2011).
DeepWalk (Perozzi et al., 2014) is an outstanding method for network embedding, which uses truncated random walks to capture the explicit structure of the network and applies language models to learn the latent relationships between the actors. When applied to the network classification task, DeepWalk first learns X which describes the topological structure of G and then learns a subsequent classifier H. One obvious shortcoming of this two-step procedure is that the embedding step is unaware of the target class label information and likely to learn embeddings that are suboptimal for classification.
We present Discriminative Deep Random Walk (DDRW) to enhance DeepWalk by learning $X \in \mathbb{R}^{|V| \times d}$ and $H \in \mathbb{R}^{d \times |\mathcal{Y}|}$ jointly. By using the topological and label information of a network simultaneously, we will show that DDRW improves the classification accuracy significantly compared with most recent related methods. Furthermore, we will also show that the embedded result $X$ produced by DDRW retains the structure of $G$ well.

Discriminative Deep Random Walk
In this section, we present the details of Discriminative Deep Random Walk (DDRW). DDRW has both embedding and classification objectives. We optimize the two objectives jointly to learn latent representations that are strongly associated with the class labels in the latent space. We use stochastic gradient descent (Mikolov et al., 1991) as our optimization method.

Embedding Objective
Let $\theta = (\theta_1, \theta_2, \ldots, \theta_{|V|})$ denote the embedded vectors in the latent space, and $\alpha$ denote the topological structure of the graph. The embedding objective can be described as an optimization problem as follows:
$$\min_{\theta} L_r(\theta, \alpha), \tag{1}$$
where $L_r$ indicates the difference between the embedded representations $\theta$ and the original topological structure $\alpha$. For this objective, we use truncated random walks to capture the topological structure of the graph and the language model Word2Vec (Mikolov et al., 2013b) to learn the latent representations. Below, we explain each in turn.

Random Walk
Random Walk has been used in different areas of network analysis to capture the topological structure of graphs (Fouss et al., 2007; Andersen et al., 2006). As the name suggests, Random Walk chooses a certain vertex in the graph for the first step and then randomly migrates along the edges. Truncated random walk defines a maximum length $s$ for all walk streams. In our implementation, we shuffle the vertex set $V$ of the graph $\tau$ times to build the sample set $W$. After each shuffle, we take the resulting permutation of vertices as the starting points of walks. Each walk stream starts at one element of the permutation in order, repeatedly moves to a randomly chosen adjacent vertex, and ends when the stream reaches $s$ vertices. This procedure yields $\tau |V|$ samples (i.e., walk streams) in total from the graph. Our sample set $W \in \mathbb{R}^{\tau |V| \times s}$ is thus obtained as the training material.
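The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `truncated_random_walks`, the adjacency-list input format, the `seed` argument, and the early stop at a vertex without neighbors are all assumptions made for the example.

```python
import random

def truncated_random_walks(adj, tau, s, seed=0):
    """Build the sample set W: tau passes over a shuffled vertex list,
    with one walk of at most s vertices per starting vertex."""
    rng = random.Random(seed)
    vertices = list(adj)
    walks = []
    for _ in range(tau):
        rng.shuffle(vertices)              # new permutation of starting points
        for start in vertices:
            walk = [start]
            while len(walk) < s:
                neighbors = adj[walk[-1]]
                if not neighbors:          # isolated vertex: stop early (assumption)
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy undirected graph given as adjacency lists (hypothetical data).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = truncated_random_walks(adj, tau=3, s=5)
```

With $\tau = 3$ and $|V| = 4$, the sketch returns 12 walk streams, each containing $s = 5$ vertices.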

Word2Vec
Existing work has shown that the vertices in truncated random walks and the words in text articles follow similar power-law frequency distributions, which makes it natural to treat a social network as a corpus (Perozzi et al., 2014). In the linguistic setting, the objective is to find an embedding of a corpus that reveals the latent relations between words: words with closer meanings are more likely to be embedded near each other. Word2Vec (Mikolov et al., 2013b) is an appropriate tool for this problem. We use the Skip-gram (Mikolov et al., 2013a) strategy in Word2Vec, which uses the central word in a sliding window of radius $R$ to predict the other words in the window and makes local optimizations. Specifically, let $\omega = rw(\alpha)$ denote the full walk streams obtained from truncated random walks in Section 3.1.1. Skip-gram then gives the objective function $L_r(\theta, \alpha)$ in Eq. (2). The standard Skip-gram method defines $p(\omega_{i,t+j} \mid \omega_{i,t})$ in Eq. (2) by the softmax in Eq. (3), where $\hat{\theta}_i$ and $\theta_i$ are the input and output representations of the $i$th vertex, respectively. One shortcoming of the standard form is that the summation in Eq. (3) is very inefficient. To reduce the time consumption, we use the Hierarchical Softmax (Mnih and Hinton, 2009; Morin and Bengio, 2005), which is included in the Word2Vec package (https://code.google.com/archive/p/word2vec/). In Hierarchical Softmax, a Huffman binary tree is employed as an alternative representation of the vocabulary. The gradient descent step is faster thanks to the Huffman tree structure, which reduces the number of output units that need to be evaluated.
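For reference, Eq. (2) and Eq. (3) can be written in the standard Skip-gram form of Mikolov et al. (2013a). The block below is a reconstruction under assumptions, not a verbatim copy of the authors' equations: the absence of a normalizing constant, the summation index $k$, and the assignment of $\hat{\theta}$ to the central vertex and $\theta$ to the context vertex follow the common convention and may differ from the original notation. Terms whose position $t + j$ falls outside the walk are simply skipped.

```latex
% Eq. (2): Skip-gram objective over all walk streams (assumed form)
L_r(\theta, \alpha) = - \sum_{i=1}^{\tau |V|} \sum_{t=1}^{s}
    \sum_{\substack{-R \le j \le R \\ j \neq 0}}
    \log p(\omega_{i,t+j} \mid \omega_{i,t})

% Eq. (3): full softmax over the vertex set (assumed form)
p(\omega_{i,t+j} \mid \omega_{i,t}) =
    \frac{\exp\left( \theta_{\omega_{i,t+j}}^{\top} \hat{\theta}_{\omega_{i,t}} \right)}
         {\sum_{k=1}^{|V|} \exp\left( \theta_{k}^{\top} \hat{\theta}_{\omega_{i,t}} \right)}
```

The denominator ranges over all $|V|$ vertices for every predicted pair, which is the cost that Hierarchical Softmax avoids.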

Classification Objective
Let $y = (y_1, y_2, \ldots, y_{|V|})$ denote the labels, and $\beta$ denote the subsequent classifier. The classification objective can be described as an optimization problem:
$$\min_{\theta, \beta} L_c(\theta, \beta, y). \tag{4}$$
In DDRW, we use existing classifiers and do not attempt to extend them.
Although multiclass SVM (Crammer and Singer, 2002) often shows good empirical performance in multi-class tasks, we choose the classifier referred to as L2-regularized and L2-loss Support Vector Classification (Fan et al., 2008) to stay consistent with the baseline methods described in Section 4.
In L2-regularized and L2-loss SVC, the loss function is
$$L_c(\theta, \beta, y) = C \sum_{i=1}^{|V|} \left( \sigma(1 - y_i \beta^{\top} \theta_i) \right)^2 + \frac{1}{2} \beta^{\top} \beta, \tag{5}$$
where $C$ is the regularization parameter, $\sigma(x) = x$ if $x > 0$ and $\sigma(x) = 0$ otherwise. Eq. (5) is for binary classification problems, and is extended to multi-class problems following the one-against-rest strategy (Fan et al., 2008).
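As an illustration of the L2-regularized, L2-loss SVC objective and its one-against-rest extension, the following NumPy sketch evaluates the loss for given embeddings and weights. The function names, the array shapes, and the choice to sum the binary losses over classes are assumptions for this example; it is not the LIBLINEAR implementation used in the experiments.

```python
import numpy as np

def l2_svc_loss(theta, beta, y, C=1.0):
    """L2-regularized, L2-loss SVC objective for one binary problem.
    theta: (n, d) embeddings, beta: (d,) weights, y: (n,) labels in {-1, +1}."""
    margins = 1.0 - y * (theta @ beta)
    hinge = np.maximum(margins, 0.0)       # sigma(x) = x if x > 0, else 0
    return C * np.sum(hinge ** 2) + 0.5 * (beta @ beta)

def one_vs_rest_loss(theta, B, Y, C=1.0):
    """Sum of binary losses, one per class (assumed aggregation).
    B: (d, k) weight matrix, Y: (n, k) indicator matrix with entries in {0, 1}."""
    signs = 2 * Y - 1                      # map {0, 1} to {-1, +1}
    return sum(l2_svc_loss(theta, B[:, c], signs[:, c], C)
               for c in range(Y.shape[1]))

# Two vertices in a 2-d latent space (hypothetical data).
theta = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
loss_separated = l2_svc_loss(theta, np.array([1.0, -1.0]), y)
loss_zero_weights = l2_svc_loss(theta, np.zeros(2), y)
```

In this toy case, the weights (1, -1) place both vertices exactly on the margin, so only the regularization term contributes, whereas zero weights incur the full squared hinge penalty for each vertex.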

Joint Learning
The main target of our method is to classify the unlabeled vertices in the given network. We achieve this target with the help of intermediate embeddings which latently represent the network structure. We simultaneously optimize the two objectives in Sections 3.1 and 3.2. Specifically, let $L(\theta, \beta, \alpha, y) = \eta L_r(\theta, \alpha) + L_c(\theta, \beta, y)$, where $\eta$ is a key parameter that balances the weights of the two objectives. We solve the joint optimization problem:
$$\min_{\theta, \beta} L(\theta, \beta, \alpha, y). \tag{6}$$
We use stochastic gradient descent (Mikolov et al., 1991) to solve the optimization problem in Eq. (6). In each gradient descent step, we have
$$\theta \leftarrow \theta - \delta \left( \eta \frac{\partial L_r}{\partial \theta} + \frac{\partial L_c}{\partial \theta} \right), \qquad \beta \leftarrow \beta - \delta \frac{\partial L_c}{\partial \beta}, \tag{7}$$
where $\delta$ is the learning rate for stochastic gradient descent. In our implementation, $\delta$ is initially set to 0.025 and linearly decreased with the steps, the same as the default setting of Word2Vec. The derivatives in Eq. (7) are estimated by local slopes. In Eq. (7), the latent representations adjust themselves according to both topological information ($\partial L_r / \partial \theta$) and label information ($\partial L_c / \partial \theta$). Intuitively, this process brings vertices in the same class closer and pushes those in different classes farther apart, which is also supported by experiments (see Figure 1). Thus, by joint learning, DDRW can learn latent space representations that capture the topological structure well and are at the same time discriminative for the network classification task.
We take each sample $W_i$ from the walk streams $W$ to estimate the local derivatives of the loss function for a descent step. Stochastic gradient descent makes DDRW an online algorithm, and thus our method is easy to parallelize. In addition, a vertex may appear many times in the $W$ produced by random walks. This repetition is superfluous for the classifier and carries a considerable risk of overfitting. Inspired by Dropout (Hinton et al., 2012), we randomly ignore the label information to keep the optimization process balanced.
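A single joint update for one vertex and one binary classifier can be sketched as follows. This is an illustrative simplification under stated assumptions: `grad_r` (the gradient of $L_r$ with respect to the vertex embedding) is taken as given from the Skip-gram part, the classifier gradients are the analytic derivatives of the squared hinge loss, and `keep_label_prob` is a hypothetical name for the probability of using the label, since the source does not specify how often labels are ignored.

```python
import numpy as np

def joint_sgd_step(theta_v, beta, grad_r, y_v, eta, delta, C=1.0,
                   keep_label_prob=0.5, rng=None):
    """One stochastic update of an embedding theta_v (d,) and weights beta (d,).
    y_v is +1 or -1 for a labeled vertex, or None for an unlabeled one."""
    if rng is None:
        rng = np.random.default_rng()
    grad_theta = eta * grad_r                      # topological part
    grad_beta = np.zeros_like(beta)
    # Randomly ignore the label, in the spirit of Dropout.
    use_label = y_v is not None and rng.random() < keep_label_prob
    if use_label:
        slack = max(0.0, 1.0 - y_v * (beta @ theta_v))
        grad_theta = grad_theta - 2.0 * C * slack * y_v * beta
        grad_beta = -2.0 * C * slack * y_v * theta_v + beta
    return theta_v - delta * grad_theta, beta - delta * grad_beta

theta_v = np.array([1.0, 0.0])
beta = np.zeros(2)
# Label always used: only the classifier weights move in this toy case.
new_theta, new_beta = joint_sgd_step(theta_v, beta, np.zeros(2), 1.0,
                                     eta=1.0, delta=0.1, keep_label_prob=1.0)
```

When the label is dropped, the step reduces to a pure embedding update driven by `eta * grad_r`, so frequently visited vertices do not repeatedly push the classifier in the same direction.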

Experimental Setup
In this section, we present an overview of the datasets and the baseline methods against which we compare in the experiments.

Datasets
We use three popular social networks, which are exactly the same as those used in some of the baseline methods. Table 1 summarizes the statistics of the data.
• BlogCatalog: a network of social relationships provided by blog authors. The labels of this graph are the topics specified by the uploading users.
• Flickr: a network of the contacts between users of the Flickr photo sharing website. The labels of this graph represent the interests of users towards certain categories of photos.
• YouTube: a network between users of the YouTube video sharing website. The labels stand for the groups of users interested in different types of videos.

Baseline Methods
We evaluate our proposed method by comparing it with several closely related methods.
• LINE (Tang et al., 2015) †: This method takes the edges of a graph as samples to train the first-order and second-order proximity separately and integrates the results as an embedding of the graph. This method can handle both unweighted and weighted graphs and is especially efficient on large networks.
• DeepWalk (Perozzi et al., 2014): This method employs language models to learn latent relations between the vertices in the graph. The basic assumption is that the closer two vertices are in the embedding space, the deeper their relationship and the higher the probability that they are in the same categories.
• SpectralClustering (Tang and Liu, 2011): This method finds out that graph cuts are useful for the classification task. This idea is implemented by finding the eigenvectors of a normalized graph Laplacian of the original graph.
• EdgeCluster (Tang and Liu, 2009b): This method uses k-means clustering algorithm to segment the edges of the graph into pieces. Then it runs iterations on the small clusters to find the internal relationships separately. The core idea is to scale time-consuming work into tractable sizes.
• Majority: This baseline method simply chooses the most frequent labels. It does not use any structural information of the graph.
† Although LINE also uses networks from Flickr and YouTube in its experiments, those networks are different from the ones in this paper.
As the datasets are not only multi-class but also multi-label, we would usually need a thresholding method to test the results. However, the literature advises against choosing a thresholding method arbitrarily, because different choices lead to considerably different performance. To avoid this, we assume that the number of labels is already known in all the test processes.

Experiments
In this section, we present the experimental results and analysis on both network classification and latent space learning. We thoroughly evaluate the performance on the three networks and analyze the sensitivity to key parameters.

Classification Task
We first present the results on multi-class classification and compare them with the baseline methods.
To have a direct and fair comparison, we use the same datasets, experimental procedures and testing points as in the reports of our relevant baselines (Perozzi et al., 2014; Tang and Liu, 2011; Tang and Liu, 2009b). The training set of a specified graph consists of the vertices, the edges and the labels of a certain percentage of labeled vertices. The testing set consists of the rest of the labels. We employ Macro-F1 and Micro-F1 (Yang, 1999) as our measurements. Micro-F1 computes the F1 score globally, while Macro-F1 calculates the F1 score for each class and then averages the scores over classes. All the reported results are averaged over 10 repeated runs.
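The evaluation protocol can be illustrated with a short NumPy sketch: labels are assigned by taking the top-scoring classes when the number of labels is known, and Micro-F1 and Macro-F1 are computed from per-class counts. The function names and the toy scores are hypothetical, and treating the number of labels as known per vertex is an assumption of this example.

```python
import numpy as np

def predict_top_k(scores, k_per_vertex):
    """Assign each vertex its k highest-scoring labels, with k known in advance."""
    pred = np.zeros_like(scores, dtype=int)
    for i, k in enumerate(k_per_vertex):
        top = np.argsort(-scores[i])[:k]
        pred[i, top] = 1
    return pred

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for indicator matrices of shape (n, k)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0).astype(float)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0).astype(float)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0).astype(float)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Classes with no true or predicted labels get an F1 of 0 (assumption).
        return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))   # counts pooled over classes
    macro = float(np.mean(f1(tp, fp, fn)))            # per-class F1, then averaged
    return micro, macro

scores = np.array([[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.1, 0.3]])
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0]])
y_pred = predict_top_k(scores, k_per_vertex=[1, 1, 2])
micro, macro = f1_scores(y_true, y_pred)
```

On this toy input the pooled counts give a Micro-F1 of 0.75, while the Macro-F1 is lower (5/9) because the third class, which is never correct, weighs as much as the others.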

BlogCatalog
BlogCatalog is the smallest dataset among the three. In BlogCatalog we vary the percentage of labeled data from 10% to 90%. Our results are presented in Table 2. We can see that DDRW performs consistently better than all the baselines on both Macro-F1 and Micro-F1 as the percentage of labeled data increases. Compared with DeepWalk, DDRW obtains a larger improvement when the percentage of labeled nodes is high. This improvement demonstrates the strength of DDRW in learning discriminative latent embeddings that are good for classification tasks.

Flickr
Flickr is a larger dataset with quite a number of classes. In this experiment we vary the percentage of labeled data from 1% to 10%. Our results are presented in Table 3. We can see that DDRW still performs significantly better than the baselines on both Macro-F1 and Micro-F1, and the results are consistent with those on BlogCatalog.

YouTube
YouTube is an even larger dataset with fewer classes than Flickr. In YouTube we vary the percentage of labeled data from 1% to 10%. Our results are presented in Table 4. In YouTube, LINE shows its strength in large sparse networks, probably because the larger scale of samples reduces the discrepancy from actual distributions. Overall, however, DDRW still performs better at most of the test points, thanks to the latent representations, which help when links are insufficient.

Parameter Sensitivity
We now present an analysis of the sensitivity with respect to several important parameters. We measure our method under changing parameters to evaluate its stability. Apart from the parameters whose effect on classification performance is monotonic, the two main parameters with non-monotonic effects are η and the dimension d of the embedding space, which we study at different percentages of labeled data. We use the BlogCatalog and Flickr networks for the experiments, and fix the parameters of random walks (τ = 30, s = 40, R = 10). We do not present the effects of changing the random walk parameters, because the results usually show monotonic relationships with them.

Effect of η
The key parameter η in our algorithm adjusts the weights of the two objectives (Section 3.3). We present the effect of changing η in Figures 3(a) and 3(b). We fix d = 128 in these experiments. Although performance drops rapidly on either side, there is still a sufficiently wide range of values over which DDRW keeps its good performance. These experiments also show that the effect of η is not very sensitive to the percentage of labeled data.

Effect of Dimensionality
We present the effect of changing the dimension d of the embedding space in Figures 3(c) and 3(d). We fix η = 1.0 in these experiments. There is a decline when the dimension is high, but this decrease is not very sharp. In addition, when the dimension is high, the percentage of labeled data has more effect on the performance.

Representation Efficiency
Finally, we examine the quality of the latent embeddings of entities discovered by DDRW. For network data, our major expectation is that the embedded social space should maintain the topological structure of the network. A visualization of the topological structure in a social space is shown in Figure 1. In addition, we examine the neighborhood structure of the vertices. Specifically, we check the top-K nearest vertices for each vertex in the embedded social space and calculate how many of the vertex pairs have edges between them in the observed network. We call this Adjacency Predict Accuracy.
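One way to compute Adjacency Predict Accuracy is sketched below. The source does not give an exact formula, so this example assumes Euclidean distance, an undirected edge list, and normalization by the total number of (vertex, neighbor) pairs; the function name is hypothetical.

```python
import numpy as np

def adjacency_predict_accuracy(X, edges, K):
    """Fraction of (vertex, top-K nearest neighbor) pairs that are edges
    in the observed undirected network. X: (n, d) embeddings."""
    n = X.shape[0]
    edge_set = {frozenset(e) for e in edges}
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)             # a vertex is not its own neighbor
    hits = 0
    for v in range(n):
        for u in np.argsort(dist[v])[:K]:
            hits += frozenset((v, int(u))) in edge_set
    return hits / (n * K)

# Four vertices on a line and a path-shaped network (hypothetical data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
edges = [(0, 1), (1, 2), (2, 3)]
acc_k1 = adjacency_predict_accuracy(X, edges, K=1)
acc_k2 = adjacency_predict_accuracy(X, edges, K=2)
```

In the toy example every nearest neighbor is an actual network neighbor, so the accuracy is 1.0 for K = 1, and it falls to 0.75 for K = 2 because the two end vertices each have only one true neighbor.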

Related Work
Relational classification (Geman and Geman, 1984; Neville and Jensen, 2000; Getoor and Taskar, 2007) is a class of methods that exploit the relational links between data items during classification. A number of researchers have studied different methods for network relational learning.
Macskassy and Provost (2003) present a simple weighted-vote relational neighborhood classifier. Xu et al. (2008) leverage the nonparametric infinite hidden relational model to analyze social networks. Neville and Jensen (2005) propose a latent group model for relational data, which discovers and exploits the hidden structures responsible for the observed autocorrelation among class labels. Tang and Liu (2009a) propose latent social dimensions, which are represented as continuous values and allow each node to be involved in different dimensions in a flexible manner. Sparsely labeled network data has been handled by adding ghost edges between neighbor vertices, and by using PageRank (Lin and Cohen, 2010). Wang and Sukthankar (2013) extend conventional relational classification to consider additional features. A complementary approach to within-network classification is based on the use of label-independent features. Henderson et al. (2011) propose a regional feature generating method and demonstrate the usage of the regional features in within-network and across-network classification. Tang and Liu (2009b) propose an edge-centric clustering scheme to extract sparse social dimensions for collective behavior prediction. Tang and Liu (2011) propose the concept of social dimensions to represent the latent affiliations of the entities. Vishwanathan et al. (2010) propose Graph Kernels to use relational data during the classification process, and Kang et al. (2012) propose a faster approximate method for Graph Kernels.

Conclusion
This paper presents Discriminative Deep Random Walk (DDRW), a novel approach for relational multi-class classification on social networks. By simultaneously optimizing embedding and classification objectives, DDRW achieves significantly better performance in network classification tasks than the baseline methods. Experiments on different real-world datasets demonstrate the stability of DDRW. Furthermore, the representations produced by DDRW are both an intermediate variable and a by-product. Like other embedding methods such as DeepWalk, DDRW can provide well-formed inputs for statistical analyses other than classification tasks. DDRW is also naturally an online algorithm and thus easy to parallelize.
Future work has two main directions. One is semi-supervised learning. The low proportion of labeled vertices is a good setting for semi-supervised learning. Although DDRW already combines supervised and unsupervised learning, better performance can be expected after introducing well-developed semi-supervised methods. The other direction is to improve the random walk step. The literature has demonstrated the good combination of random walks and language models, but this combination may be unsatisfactory for classification. A form of random walk better suited to classification would be a valuable improvement.