Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs

Spectral clustering has received a lot of attention due to its ability to separate nonconvex, non-intersecting manifolds, but its high computational complexity has significantly limited its applicability. Motivated by the document-term co-clustering framework of Dhillon (2001), we propose a landmark-based scalable spectral clustering approach in which we first use the selected landmark set and the given data to form a bipartite graph and then run a diffusion process on it to obtain a family of diffusion coordinates for clustering. We show that our proposed algorithm can be implemented via very efficient operations on the affinity matrix between the given data and the selected landmarks, and is thus capable of handling large data sets. Finally, we demonstrate the excellent performance of our method by comparing it with state-of-the-art scalable algorithms on several benchmark data sets.


Introduction
Given a data set X = {x_1, . . . , x_n} ⊂ R^d and a similarity function δ(·,·), such as the Gaussian radial basis function (RBF), spectral clustering (von Luxburg, 2007) first constructs a pairwise similarity matrix W and then uses the top eigenvectors of W (after a certain kind of normalization) to embed X into a low-dimensional space, where k-means is employed to group the data into k clusters. Though mathematically quite simple, spectral clustering can easily adapt to nonconvex geometries and accurately separate various non-intersecting shapes. As a result, it has been successfully applied to many practical tasks, e.g., image segmentation (Shi and Malik, 2000) and document clustering (Dhillon, 2001), often significantly outperforming traditional methods (such as k-means). Furthermore, spectral clustering has a very rich theory (von Luxburg, 2007), with interesting connections to kernel k-means (Dhillon et al., 2004), random walks (Meila and Shi, 2001), graph cuts (Shi and Malik, 2000) (and the underlying spectral graph theory (Chung, 1996)), and matrix perturbation analysis (Ng et al., 2001). However, spectral clustering is known to suffer from a high computational cost associated with the n × n matrix W, especially when n is large. Consequently, there has been considerable effort to develop fast, approximate algorithms that can handle large data sets (Fowlkes et al., 2004; Yan et al., 2009; Sakai and Imiya, 2009; Wang et al., 2009; Chen and Cai, 2011; Wang et al., 2011; Tasdemir, 2012; Choromanska et al., 2013; Cai and Chen, 2015; Moazzen and Tasdemir, 2016; Chen, 2018). Interestingly, a considerable fraction of them use a landmark set to help reduce the computational complexity of spectral clustering. Specifically, they first find a small set of data representatives (called landmarks), Y = {y_1, . . . , y_m} ⊂ R^d (with m ≪ n), from the given data X and then form an affinity matrix between X and Y (see Fig. 1):

A = (a_ij) ∈ R^{n×m}, (1)
a_ij = δ(x_i, y_j). (2)

Figure 1: Illustration of landmark-based spectral clustering. Left: given data (in black) and selected landmarks (in red); right: the affinity matrix A between the two sets of points, with the blue squares indicating the largest entries in each row of A. Here, we assume that both the given data and the landmarks have been sorted according to the true clusters, so as to reveal the approximately block-diagonal structure of A.

Afterwards, different scalable methods use the matrix A in different ways to cluster the given data. For example, the column-sampling spectral clustering (cSPEC) algorithm (Wang et al., 2009) regards A as a column-reduced version of W and correspondingly uses the left singular vectors of A to approximate the eigenvectors of W. However, they seem to consider only unnormalized spectral clustering, and it is unclear how their technique extends to normalized spectral clustering (Shi and Malik, 2000; Ng et al., 2001). Another example is the landmark-based spectral clustering (LSC) algorithm (Cai and Chen, 2015), which uses a row-sparsified version of the matrix A as approximate sparse representations of the input data while bypassing the expensive dictionary learning and sparse coding tasks. It then applies L_1 normalization to each row of A, followed by a square-root L_1 column normalization. This method works quite well empirically, but there is clearly a gap between its sparse coding motivation and the actual implementation. A third example is the k-means-based approximate spectral clustering (KASP) algorithm (Yan et al., 2009), which first applies the k-means algorithm to partition the given data into m small clusters and then performs spectral clustering to divide their centroids (which serve as the landmark points) into k groups. Next, they extend the clustering of the landmarks to the original data by 1-nearest-neighbor (1NN) classification.
This algorithm runs very fast, but is sensitive to the k-means clusters as it aggressively reduces the given data to a small set of centroids.
In this work we propose a novel landmark-based scalable spectral clustering approach by adapting the co-clustering framework of Dhillon (2001) for landmark-based clustering and combining it with diffusion maps (Coifman and Lafon, 2006). Specifically, with the given data X = {x_i} and a selected landmark set Y = {y_j}, we first construct a bipartite graph G_2 with X and Y being the two parts, and form edges between each x_i and its s nearest neighbors y_j in the landmark set with weights a_ij = δ(x_i, y_j). We then compute the transition probabilities for all the vertices of G_2 and use them to define a random walk on the bipartite graph, which (when iterated forward) further generates a diffusion process on G_2. We expect the resulting diffusion coordinates to capture the global geometry of the clusters at different scales so that, as a result, the connectivity of each cluster is significantly strengthened (see Fig. 2). We will show that the diffusion coordinates may be computed directly from the n × m matrix A = (a_ij). Lastly, we propose three different ways to use the diffusion coordinates for clustering the data in X (depending on the length of the random walk).

Figure 2: A bipartite graph whose two components are the given data (in black) and a landmark set learned from it (in red). Here, we form edges between each given data point and its two closest landmark points. Initially, no landmark point is connected to both of the points a and b. However, if one simulates a random walk on the bipartite graph, then through a sequence of steps between the two components, a and b will be identified as belonging to the same cluster.
The rest of the paper is organized as follows. First, in Section 2, we review some necessary background. We then present our methodology in Section 3. Experiments are conducted in Section 4 to test our proposed algorithms. Finally, we conclude the paper in Section 5.

Background
In this section, we first review the Normalized Cut (Ncut) algorithm (Shi and Malik, 2000) and its connections to random walks (Meila and Shi, 2001) and diffusion maps (Coifman and Lafon, 2006). We then review the co-clustering framework of Dhillon (2001) in the document clustering setting.

The Ncut algorithm
Given a data set X = {x_1, x_2, ..., x_n} ⊂ R^d and a notion of similarity δ, we may construct a weighted graph G by using the data points in X as vertices and assigning an edge between any two points x_i, x_j with associated weight w_ij = δ(x_i, x_j). Let W = (w_ij) ∈ R^{n×n}, which is the weight matrix of G. The degree of x_i is the total edge weight at that vertex, d_i = Σ_j w_ij, which measures the connectivity of x_i. The diagonal matrix D = diag(d_1, . . . , d_n) is called the degree matrix. The Laplacian of the graph G is defined as L = D − W, which is a positive semidefinite matrix. One way to normalize L is the following:

L_rw = D^{-1} L = I − D^{-1} W.

For any subset of vertices S ⊂ X, we define the cut between S and its complement S̄ as

Cut(S, S̄) = Σ_{x_i ∈ S, x_j ∈ S̄} w_ij.

The volume of S is defined as the total connectivity of the vertices in S:

vol(S) = Σ_{x_i ∈ S} d_i.

The Ncut algorithm (Shi and Malik, 2000) finds k clusters from the given data X by minimizing the following objective function over all possible partitions X = S_1 ∪ · · · ∪ S_k:

Ncut(S_1, . . . , S_k) = Σ_{i=1}^k Cut(S_i, S̄_i) / vol(S_i).

Such a formulation of clustering is very convenient; however, the resulting optimization problem is intractable due to its combinatorial nature. Fortunately, by using a continuous relaxation, the above problem is reduced to finding the bottom k − 1 eigenvectors of the normalized graph Laplacian L_rw (corresponding to its smallest positive eigenvalues):

V = [v_2, . . . , v_k] ∈ R^{n×(k−1)}.

One then regards the rows of V as a low-dimensional embedding of the original data and clusters them by using k-means.
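The relaxed Ncut procedure above can be sketched in a few lines. The following is a minimal NumPy/SciPy illustration (the function name and the use of the generalized eigenproblem L v = λ D v, which has the same eigenpairs as L_rw, are our choices, not from the paper):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def ncut(W, k):
    """Relaxed Ncut: cluster the rows of the bottom k-1 nontrivial
    eigenvectors of L_rw = I - D^{-1} W.

    We solve the equivalent symmetric generalized problem
    L v = lambda D v for numerical stability.
    """
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                      # unnormalized graph Laplacian
    vals, vecs = eigh(L, D)        # ascending generalized eigenvalues
    V = vecs[:, 1:k]               # skip the trivial constant eigenvector
    _, labels = kmeans2(V, k, minit='++', seed=0)
    return labels

if __name__ == "__main__":
    # Two well-separated groups on the line, Gaussian similarity.
    pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
    W = np.exp(-(pts[:, None] - pts[None, :]) ** 2)
    print(ncut(W, 2))
```

This direct dense solver is only practical for small n; avoiding the n × n matrix is exactly the motivation for the landmark-based approach developed later.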

Random walk and diffusion maps
Let P = D^{-1} W, whose row sums are all one. We can then write L_rw = I − P. Algebraically, the bottom eigenvectors of L_rw are just the top eigenvectors of P. Moreover, since P is row-stochastic, it can be used as a transition probability matrix to define a random walk on the graph G (Meila and Shi, 2001). Under such a model, clustering can be interpreted as finding a partition of the graph such that the random walk stays long inside the clusters and rarely moves between them. If the random walk is run forward for many iterations, a diffusion process is generated on the graph G. For every integer α ≥ 1, P^α is the α-step transition matrix. Different values of α integrate the local connectivity information of the graph at different scales, with larger α yielding more global descriptions. The rows of P^α define a family of discrete distributions, one at each vertex of G, and are called the α-step diffusion coordinates (Coifman and Lafon, 2006). For practical purposes, it suffices to use low-dimensional approximations of them (by exploiting the fast decay of the α-th power of the spectrum of P):

V^(α) = [λ_1^α v_1, . . . , λ_p^α v_p] ∈ R^{n×p}, (8)

where p ∈ Z_+ and λ_i, v_i are the largest eigenvalues and corresponding eigenvectors of P (excluding the eigenvalue 1 and its associated eigenvector). In this work, we use the rows of V^(α) as an embedding of the data for clustering purposes.
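Formula (8) can be sketched directly. A minimal NumPy version (our own helper, assuming a symmetric weight matrix W) diagonalizes the symmetric matrix S = D^{-1/2} W D^{-1/2}, which shares its spectrum with P, and maps its eigenvectors back to eigenvectors of P:

```python
import numpy as np

def diffusion_coordinates(W, p, alpha):
    """alpha-step diffusion coordinates V^(alpha) = [lambda_i^alpha v_i]
    for the random walk P = D^{-1} W on a graph with weight matrix W.

    Eigen-decompose S = D^{-1/2} W D^{-1/2} (same eigenvalues as P);
    if S u = lambda u, then v = D^{-1/2} u satisfies P v = lambda v.
    """
    d = W.sum(axis=1)
    d_isqrt = 1.0 / np.sqrt(d)
    S = d_isqrt[:, None] * W * d_isqrt[None, :]
    vals, vecs = np.linalg.eigh(S)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][1:p + 1]   # largest p, skipping lambda = 1
    lam = vals[idx]
    V = d_isqrt[:, None] * vecs[:, idx]     # eigenvectors of P
    return V * lam[None, :] ** alpha        # columns lambda_i^alpha v_i
```

A useful sanity check of the definition: applying P to the α-step coordinates gives the (α+1)-step coordinates, since P (λ_i^α v_i) = λ_i^{α+1} v_i.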

Dhillon's co-clustering framework
Dhillon (2001) proposed a spectral clustering based co-clustering framework in the setting of document clustering. Specifically, given a document-term frequency matrix A ∈ R^{n×m}, they first construct a bipartite graph G_2 whose two parts are the documents and terms, respectively, and then use A to define a weight matrix for G_2 as follows (in block form):

W = [0, A; A^T, 0]. (9)

The two diagonal blocks of W are zero matrices of appropriate sizes, indicating no connection among documents or among terms. Afterwards, they apply the Ncut algorithm (along with the above weight matrix W) to co-cluster documents and terms, and they derived an efficient way to implement Ncut solely based on operations on the n × m matrix A (thus effectively avoiding the larger matrix W in all calculations). We review their derivation below.

We start with some definitions. First, let D_1, D_2 be two diagonal matrices consisting (resp.) of the row and column sums of A:

D_1 = diag(A 1), (10)
D_2 = diag(A^T 1). (11)

Here, and in what follows, we abuse notation and use 1 to denote the column vector (of appropriate dimension) with all entries equal to one. Next, we define three differently normalized versions of A:

A_1 = D_1^{-1} A,  A_2 = A D_2^{-1},  Ā = D_1^{-1/2} A D_2^{-1/2}. (12)

It is easy to see that A_1, A_2 are respectively row- and column-stochastic, while Ā is closely related to both of them in the following ways:

Ā = D_1^{1/2} A_1 D_2^{-1/2}, (13)
Ā = D_1^{-1/2} A_2 D_2^{1/2}. (14)

The degree matrix of the bipartite graph is

D = [D_1, 0; 0, D_2], (15)

from which we may obtain the following transition probability matrix for the bipartite graph:

P = D^{-1} W = [0, A_1; A_2^T, 0]. (16)

The following result, first proved by Dhillon (2001), indicates the close connection between the eigenvalue decomposition of P and the SVD of Ā.

Lemma 1. λ is an eigenvalue of P with eigenvector v = (v_1; v_2), where v_1 ∈ R^n and v_2 ∈ R^m, if and only if λ is a singular value of Ā with corresponding left and right singular vectors D_1^{1/2} v_1 and D_2^{1/2} v_2.

Proof. Suppose v is an eigenvector of P corresponding to some eigenvalue λ, that is, P v = λ v, or equivalently,

A_1 v_2 = λ v_1,  A_2^T v_1 = λ v_2.

From this we obtain (after plugging in (13))

Ā (D_2^{1/2} v_2) = D_1^{1/2} A_1 v_2 = λ (D_1^{1/2} v_1),

and also (after plugging in (14))

Ā^T (D_1^{1/2} v_1) = D_2^{1/2} A_2^T v_1 = λ (D_2^{1/2} v_2).

This shows that D_1^{1/2} v_1 and D_2^{1/2} v_2 are left and right singular vectors of Ā (corresponding to the same singular value λ). It is easy to verify that the converse is also true.
Therefore, to obtain an eigenvector of P, we just need to first perform the SVD of Ā to find a pair of its left and right singular vectors u, w (sharing a singular value λ) and then apply the following formula:

v = (v_1; v_2),  where v_1 = D_1^{-1/2} u and v_2 = D_2^{-1/2} w.

Lastly, to complete the co-clustering task, one just stacks the top k − 1 eigenvectors of P as columns to form an embedding matrix V ∈ R^{(n+m)×(k−1)} and applies k-means to its rows to group the documents and terms simultaneously.
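The whole pipeline (normalize, SVD, map back, k-means) fits in a short sketch. The following NumPy version is our own illustration of Dhillon's procedure, not the original MATLAB code; it drops the trivial singular pair with σ = 1 and clusters the stacked document and term embeddings together:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def coclustering(A, k):
    """Dhillon-style co-clustering of an n x m document-term matrix A.

    Normalize A_bar = D1^{-1/2} A D2^{-1/2}, take singular vector pairs
    2..k, map them back to eigenvectors of P via D^{-1/2}, and run
    k-means on the stacked (n+m) rows.
    """
    d1 = A.sum(axis=1)               # row sums (document degrees)
    d2 = A.sum(axis=0)               # column sums (term degrees)
    A_bar = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(A_bar, full_matrices=False)
    U, Vt = U[:, 1:k], Vt[1:k, :]    # drop the trivial pair (sigma = 1)
    Z = np.vstack([U / np.sqrt(d1)[:, None],      # D1^{-1/2} u
                   Vt.T / np.sqrt(d2)[:, None]])  # D2^{-1/2} w
    _, labels = kmeans2(Z, k, minit='++', seed=0)
    return labels[:A.shape[0]], labels[A.shape[0]:]
```

On a toy block-structured frequency matrix (two document groups using two disjoint term groups), the returned document and term labels recover the blocks.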

Methodology
In the previous section we reviewed the co-clustering framework of Dhillon (2001), which employs the Ncut algorithm (Shi and Malik, 2000) to partition a bipartite graph consisting of documents and terms by working directly on the document-term matrix. In this section, we extend their work in two ways. First, we adapt their bipartite graph model for landmark-based clustering by instead using the given data and a selected landmark set as its two parts. Second, we simulate a diffusion process on the bipartite graph to gather global information about the graph; however, our focus remains on clustering the given data, for which we will introduce several ways of using such a bipartite graph model.

Derivation of diffusion coordinates on a bipartite graph
Given a data set X = {x_1, . . . , x_n} ⊂ R^d to be partitioned into k clusters, we select from X a set of landmark points Y = {y_1, . . . , y_m} ⊂ R^d (with m ≪ n) by some sampling method, such as uniform sampling or k-means clustering. Let A ∈ R^{n×m} be the affinity matrix between X and Y, computed by using a pre-specified similarity function δ as in (2). Like LSC (Cai and Chen, 2015), we preserve only the largest s (s ≪ m) entries in each row of A, but our motivation is to focus on the most similar landmark points for each given data point. We then construct a bipartite graph G_{X,Y} with X, Y as its two parts and a weight matrix W of the form in (9), but based on the affinity matrix A defined above. Again, the two empty blocks of W indicate no connection inside each component of G_{X,Y}. Also, the s-sparse rows of A imply that each point in X is only connected to its s nearest neighbors in Y on the bipartite graph. See Fig. 2 for an illustration of our bipartite graph model.
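The sparse affinity construction described above can be illustrated as follows. This is a minimal sketch under our own assumptions (Gaussian RBF as δ, a brute-force distance computation, and a hypothetical function name); a production version would use an approximate nearest-neighbor search for large n:

```python
import numpy as np
from scipy.sparse import csr_matrix

def landmark_affinity(X, Y, s, sigma):
    """Sparse n x m affinity matrix A: Gaussian RBF weights from each
    point in X to its s nearest landmarks in Y, zero elsewhere.

    Keeping only the s largest entries per row is the sparsification
    step described in the text (nearest landmarks have largest weight).
    """
    n, m = X.shape[0], Y.shape[0]
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    nn = np.argsort(d2, axis=1)[:, :s]                    # s nearest landmarks
    rows = np.repeat(np.arange(n), s)
    cols = nn.ravel()
    vals = np.exp(-d2[rows, cols] / (2 * sigma ** 2))     # RBF weights
    return csr_matrix((vals, (rows, cols)), shape=(n, m))
```

The result has exactly s stored entries per row, so all subsequent operations on A cost O(ns) rather than O(nm).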
Using the computational framework laid out in Subsec. 2.3, we may easily obtain the various quantities A_1, A_2, Ā, D, P. In particular, we may use the transition matrix P to define a random walk on the bipartite graph and run it forward repeatedly to generate a diffusion process. The α-step transition matrix P^α has the following form.
(1) If α = 2q is even, then

P^α = [(A_1 A_2^T)^q, 0; 0, (A_2^T A_1)^q].

(2) If α = 2q + 1 is odd, then

P^α = [0, (A_1 A_2^T)^q A_1; (A_2^T A_1)^q A_2^T, 0].

This result indicates that after an even number of steps, the random walk (no matter in which component of the bipartite graph it is initiated) always returns to its original component, so that the original bipartite graph becomes two disconnected subgraphs, while after an odd number of steps, the random walk always ends in the other component, and hence the graph remains bipartite. See Fig. 3 for an illustration. Also, when α = 2q, the α-step random walk on the bipartite graph G_{X,Y} is equivalent to a q-step random walk within each component of G_{X,Y} with transition matrix A_1 A_2^T or A_2^T A_1. This implies that for even integers α, one should focus on the two components X, Y of the bipartite graph separately and in principle may use the corresponding blocks of P^α as new transition matrices for clustering the two sets of data individually. In contrast, for odd integers α, one still needs to consider the two components X, Y together as a bipartite graph and simultaneously cluster the original data and the landmark set.
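The block structure of P^α is easy to verify numerically. The short NumPy check below builds a toy bipartite transition matrix and confirms both the even case (block diagonal) and the odd case (block off-diagonal) for q = 1:

```python
import numpy as np

# Numerically check the block structure of P^alpha on a toy bipartite graph.
rng = np.random.default_rng(0)
A = rng.random((5, 3)) + 0.1           # n=5 points, m=3 landmarks, positive weights
A1 = A / A.sum(axis=1, keepdims=True)  # row-stochastic A_1 = D_1^{-1} A
A2 = A / A.sum(axis=0, keepdims=True)  # column-stochastic A_2 = A D_2^{-1}
n, m = A.shape
P = np.zeros((n + m, n + m))
P[:n, n:] = A1                         # top-right block
P[n:, :n] = A2.T                       # bottom-left block
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)
# Even power (alpha = 2, q = 1): diagonal blocks A1 A2^T and A2^T A1.
assert np.allclose(P2[:n, :n], A1 @ A2.T) and np.allclose(P2[:n, n:], 0)
# Odd power (alpha = 3, q = 1): off-diagonal blocks (A1 A2^T) A1 and (A2^T A1) A2^T.
assert np.allclose(P3[:n, n:], A1 @ A2.T @ A1) and np.allclose(P3[:n, :n], 0)
print("block structure verified")
```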
The actual algorithm we will propose uses a family of diffusion coordinates (corresponding to different time steps of the diffusion process) to embed the input data and/or the landmark points into low-dimensional spaces for clustering by k-means. The next result shows that one may obtain such diffusion coordinates directly from the matrix Ā (defined in (12)).

Theorem 1. Let α, p ≥ 1 be two integers. The p-dimensional diffusion coordinates for G_{X,Y} at time step α are the rows of

V^(α) = D^{-1/2} V Λ^α ∈ R^{(n+m)×p}, (25)

where Λ = diag(λ_1, . . . , λ_p) ∈ R^{p×p} contains the largest p singular values of Ā (excluding the singular value 1), and V ∈ R^{(n+m)×p} the corresponding pairs of left and right singular vectors of Ā (one pair stacked in each column).
Proof. This is a direct consequence of Lemma 1 combined with the formula in (8).
Remark. We fix p = k − 1 (where k is the number of clusters) in the rest of the paper (so as to be consistent with the Ncut algorithm (Shi and Malik, 2000)), but other values of p could be used too (e.g. those determined based on the actual power decay of the eigenvalues of P ).
Remark. If we extend α to zero in (25), then we can obtain the embedding used by Dhillon (2001). This shows that our work extends (Dhillon, 2001).
Remark. When α is even, the top n × p and bottom m × p blocks of V^(α), denoted V^(α)_X and V^(α)_Y respectively, may be used separately as diffusion coordinates for the two sets of data X and Y.

Proposed algorithm
We present several different ways to use the α-step diffusion coordinates V^(α) ∈ R^{(n+m)×(k−1)} in (25), computed from a landmark-based bipartite graph G_{X,Y}, for clustering the input data X.
We consider the following two cases:

(1) α even: In this case, the initial bipartite graph becomes two disjoint subgraphs corresponding to X and Y, respectively. A direct way of clustering the input data X is to focus on V^(α)_X, the α-step diffusion coordinates for X, and apply k-means to group them into k clusters. Alternatively, we can focus on V^(α)_Y, the α-step diffusion coordinates for the landmark set Y, and use k-means to group the landmark points into k subsets. Afterwards, we extend the landmark clustering to the input data through s-nearest-neighbor (sNN) classification, where s is the number of closest landmark points connected to each data point.

(2) α odd: In this case, we are still left with a bipartite graph. To cluster the input data X, we propose to run k-means with the full set of diffusion coordinates V^(α) to divide X ∪ Y into k clusters, and afterwards remove the landmark points from the resulting clusters. We refer to the three above-mentioned methods for clustering the given data respectively as direct clustering, landmark clustering, and co-clustering.

We now present our scalable spectral clustering algorithm in Alg. 1.

Algorithm 1 Landmark-based Bipartite Diffusion Maps (LBDM)
Input: Data set X = {x_1, x_2, ..., x_n} ⊂ R^d, number of clusters k, similarity function δ, landmark selection method, number of landmarks m, number of nearest landmarks s, number of diffusion steps α, clustering method: direct or landmark (for even α), or co-clustering (for odd α).
Output: A partition of X into k clusters.
1: Find m landmarks Y = {y_1, . . . , y_m} ⊂ R^d using the given method.
2: Form the affinity matrix A between the input data X and their respective s nearest landmark points in Y by using the given similarity function δ.
3: Calculate the row and column sums of A and use them to normalize A to obtain Ā (as in (12)).
4: Find the largest k − 1 singular values (excluding 1) and the corresponding left and right singular vectors of Ā.
5: Compute the diffusion coordinates matrix V^(α) by (25) (with p = k − 1).
6: Use the indicated clustering method to divide the input data set X into k clusters.
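The six steps of Alg. 1 translate almost line by line into code. The following compact NumPy sketch is our own rendering, not the paper's MATLAB implementation: landmarks come from uniform sampling and δ is the Gaussian RBF (one admissible choice each), dense arrays are used for brevity, and the sNN extension of the landmark method is simplified to 1NN:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def lbdm(X, k, m=10, s=3, alpha=2, sigma=1.0, method="direct", seed=0):
    """Sketch of Alg. 1 (LBDM) with uniform landmark sampling and RBF delta."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(n, size=min(m, n), replace=False)]      # step 1
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))                       # step 2
    mask = np.zeros_like(A, dtype=bool)
    np.put_along_axis(mask, np.argsort(d2, axis=1)[:, :s], True, axis=1)
    A = np.where(mask, A, 0.0)                               # keep s-NN per row
    r = A.sum(axis=1)                                        # step 3
    c = np.maximum(A.sum(axis=0), 1e-12)                     # guard empty columns
    A_bar = A / np.sqrt(r)[:, None] / np.sqrt(c)[None, :]
    U, svals, Vt = np.linalg.svd(A_bar, full_matrices=False) # step 4
    lam = svals[1:k] ** alpha                                # drop sigma_1 = 1
    VX = U[:, 1:k] / np.sqrt(r)[:, None] * lam               # step 5, eq. (25)
    VY = Vt[1:k].T / np.sqrt(c)[:, None] * lam
    if method == "direct":                                   # step 6 (alpha even)
        _, labels = kmeans2(VX, k, minit="++", seed=0)
    elif method == "landmark":                               # cluster landmarks,
        _, lab_Y = kmeans2(VY, k, minit="++", seed=0)        # then extend by 1NN
        labels = lab_Y[np.argmin(d2, axis=1)]
    else:                                                    # co-cluster (alpha odd)
        _, lab = kmeans2(np.vstack([VX, VY]), k, minit="++", seed=0)
        labels = lab[:n]
    return labels
```

Note that all heavy computations touch only n × m (or smaller) matrices, which is the source of the method's scalability.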
Remark. We mention the work of Liu et al. (2013) in the setting of large graph data, which constructs a bipartite graph between the original graph nodes and "supernodes" generated through graph coarsening, and then uses the plain co-clustering algorithm of Dhillon (2001) to partition the graph. Though the idea is somewhat similar, our method operates directly in the Euclidean data domain and extracts diffusion coordinates from the bipartite graph for clustering.

Run time analysis
The

Experiments
In this section, we conduct extensive experiments to evaluate the practical performance of LBDM with α = 1, 2. For the odd value α = 1, for which the co-clustering method must be used, we denote the corresponding implementation by LBDM^(1). For the even value α = 2, we use both the direct clustering and landmark clustering methods, denoted LBDM^(2,X) and LBDM^(2,Y), respectively.

Experimental setup
We compare the different LBDM versions with the following algorithms: KASP (Yan et al., 2009), LSC (Cai and Chen, 2015), cSPEC (Wang et al., 2009), and the co-clustering algorithm by Dhillon (2001) (used similarly to LBDM^(1)), in the setting of Gaussian similarity. We also include the plain Ncut algorithm (Shi and Malik, 2000) in our study as a baseline. We implement all the methods in MATLAB 2016b (except LSC) and conduct all the experiments on a compute server with 48 GB of RAM and 2 CPUs with 12 total cores.
We choose six benchmark data sets (usps, pendigits, letter, protein, shuttle, mnist) from the LIBSVM website (see Table 1 for their summary statistics). They are originally partitioned into training and test parts for classification purposes, but for each data set we have merged the two parts together for our unsupervised setting. Also, we provide the true number of clusters k to all algorithms so as to focus on the clustering task. In order to have a fair comparison between the different algorithms, we use the same values for the shared parameters. In particular, we fix m = 500 (for all methods) and s = 5 (for LSC, Dhillon, and LBDM with α = 1, 2). Also, we feed all the algorithms the same landmark set found by k-means (with only 10 iterations), initialized with the centroids obtained by a preliminary k-means clustering on 10% of the data (with 100 iterations and 10 restarts). In the last step of each algorithm (where k-means is applied to cluster the data in the respective embedding space), we use 100 iterations and 10 restarts.
We evaluate the different methods in terms of clustering accuracy and CPU time (averaged over 50 replications), with the former being calculated by first finding the best match between the output cluster labels and the ground truth and then computing the fraction of correctly assigned labels.
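The best-match accuracy described above is a one-to-one assignment problem between predicted clusters and true classes, solvable with the Hungarian algorithm on the contingency table. A minimal sketch (our own helper, assuming equal numbers of clusters and classes):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, truth):
    """Best-match clustering accuracy: maximize the number of correctly
    assigned points over all one-to-one matchings of predicted clusters
    to ground-truth classes (Hungarian algorithm on the contingency table)."""
    k = int(max(labels.max(), truth.max())) + 1
    C = np.zeros((k, k), dtype=int)       # C[a, b] = |cluster a ∩ class b|
    for a, b in zip(labels, truth):
        C[a, b] += 1
    row, col = linear_sum_assignment(-C)  # negate to maximize total overlap
    return C[row, col].sum() / len(labels)
```

For example, a labeling that is a pure permutation of the ground truth scores 1.0, while one mistaken point out of four scores 0.75.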

Results
We report the experimental results in Tables 2 (accuracy) and 3 (time).
The following observations on clustering accuracy can be made: (1) LBDM^(2,Y) achieved the highest accuracy on three data sets (usps, protein, mnist), while LBDM^(2,X) achieved the highest accuracy only on letter; (2) LSC and cSPEC each obtained the best accuracy once, but each of them also performed very badly in at least one case; (3) the two co-clustering methods (Dhillon and LBDM with α = 1) behaved very similarly and performed reasonably well throughout, but never achieved the highest accuracy; (4) KASP never achieved the best accuracy either and performed very poorly in three cases (pendigits, letter, mnist). Overall, the LBDM family exhibited very stable performance (which demonstrates the power of diffusion coordinates), and they also outperformed the plain Ncut algorithm most of the time.
Regarding running time, LBDM^(2,Y) and KASP are the two fastest methods because they both cluster the landmark points first and then extend the clustering to the input data through nearest neighbor classification. KASP is even faster because it applies spectral clustering directly to the landmark points in R^d (the corresponding weight matrix is only m × m), but this comes at the expense of accuracy. The cSPEC algorithm, on the other hand, is the slowest among the scalable methods (because it does not sparsify the matrix A), but it is still much faster than plain Ncut.

Parameter sensitivity study
We study in this section the effects of the parameters of LBDM (and relevant methods): m (number of landmark points), s (number of nearest landmark points), and α (diffusion time), using four data sets from Table 1: usps, letter, protein, and mnist.
In the first experiment, we focus on the parameter m by fixing s = 5 and varying m from 100 to 1000 with a step size of 100, in order to study its influence on all the scalable methods in Table 2. For each data set and each value of m, we apply k-means to sample m landmark points from the data set and provide the same landmark set to all the methods being compared, recording their clustering accuracy and run time. This is repeated 30 times, and we report the average accuracy and time for each method in Fig. 4. In general, all methods except KASP and cSPEC improve their accuracy as more landmark points are used, with LBDM^(2,Y) achieving the highest accuracy most of the time on three data sets (usps, protein, mnist). The CPU time of each algorithm seems to depend linearly on m, with KASP always being the fastest.
In our second experiment, concerning the parameter s (which is only needed by LBDM, LSC, and Dhillon), we use the same setup as in the first experiment, except that we fix m = 500 while varying s from 2 to 10. We plot the average clustering accuracy and run time of the different methods against s in Fig. 5. We see that increasing the value of s tends to decrease the clustering accuracy of each algorithm (with LBDM^(2,Y) being the best in three cases), while increasing their run time linearly (though only very slightly for LBDM^(2,Y)).
Lastly, we study the α parameter of LBDM by varying it from 1 to 40 (with m = 500 and s = 5 fixed). Recall that for odd values of α, we must use the co-clustering method LBDM^(α), while for each even value of α, we can use either the direct clustering method LBDM^(α,X) or the landmark clustering method LBDM^(α,Y). Their average accuracy (over 30 replications) for each value of α is displayed in Fig. 6. We can see that increasing the time scale α may further improve the clustering accuracy of all three methods on some data sets, demonstrating the power of diffusion maps. (In Fig. 5, since KASP fixes s = 1 and cSPEC requires no sparsification, their accuracy rates are plotted at s = 1 and s = 0, respectively.)

LBDM with bipartite graphs between documents and terms
In this section we conduct an extra experiment to show that we can easily adapt LBDM (Alg. 1) to the original bipartite graph model of Dhillon (2001), which consists of documents and terms, by simply treating the terms as the "landmarks" and using the document-term frequency matrix A as the affinity matrix between the two components of the bipartite graph. We then carry out the remaining steps of Alg. 1, using either the direct clustering method (for even α) or the co-clustering method (for odd α), and still denote them by LBDM^(α,X) and LBDM^(α). We compare these two methods for 1 ≤ α ≤ 20 with Dhillon's co-clustering algorithm on two news data sets, TDT2 and Reuters21578. Because the cluster sizes vary widely, we focus on the top 30 categories in each data set. The clustering accuracy of the three methods on both data sets is reported in Fig. 7. It is clear that the use of diffusion coordinates on the bipartite graph (for small α) considerably improves the document clustering accuracy.

Conclusions
We presented a landmark-based scalable spectral clustering approach built on a novel combination of diffusion maps and bipartite graphs. Our experiments showed that the proposed algorithm achieves very stable and competitive accuracy while running fast. We conclude that LBDM is a very promising new alternative to current large-scale spectral clustering methods.