Exploiting Node Content for Multiview Graph Convolutional Network and Adversarial Regularization

Network representation learning (NRL) is crucial in the area of graph learning. Recently, graph autoencoders and their variants have gained much attention among various types of node embedding approaches. Most existing graph autoencoder-based methods aim to minimize the reconstruction error of the input network while not explicitly considering the semantic relatedness between nodes. In this paper, we propose a novel network embedding method which models the consistency across different views of a network. More specifically, we create a second view from the input network that captures the relatedness between nodes based on node content, and enforce the latent representations from the two views to be consistent by incorporating a multiview adversarial regularization module. Experimental studies on benchmark datasets demonstrate the effectiveness of this method and show that it compares favorably with state-of-the-art algorithms on challenging tasks such as link prediction and node clustering. We also evaluate our method on a real-world application, i.e., 30-day unplanned ICU readmission prediction, and achieve promising results compared with several baseline methods.


Introduction
Over the last few years, network representation learning, or node embedding, has gained increasing interest in the machine learning community, due to the prevalence of network-structured data. In reality, datasets from different fields often take the form of networks, such as social networks, drug-target interaction networks, mobile phone networks, citation networks, etc. It is therefore very important to find a way to represent networks well, which is challenging because there is no direct way to efficiently encode the high-dimensional data into low-dimensional feature vectors (Hamilton et al., 2017b). Moreover, network embedding techniques benefit a variety of downstream applications such as link prediction, node classification and node clustering.
In recent years, researchers have developed different kinds of network embedding approaches, many of which have shown great performance in analytical evaluation and have been quite effective in downstream applications. These studies range from traditional machine learning techniques like matrix factorization to recent deep-learning-based methods like graph autoencoders.
Traditional models, or shallow models, usually optimize the embeddings of nodes directly. For these shallow models, the mapping from networks to vectors is simply an embedding lookup, i.e., each node corresponds to a unique embedding vector (Hamilton et al., 2017b). Factorization-based approaches like GraRep (Cao et al., 2015) and HOPE (Ou et al., 2016), as well as random-walk-based approaches like DeepWalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016), all fall into this category. Shallow models generally suffer from computational inefficiency and a limited ability to represent complex networks.
More recently, deep models, or autoencoder-based approaches, have been gaining more and more attention, and have shown superior performance in many applications. Compared with shallow models, which use a simple lookup table as the encoder function, deep models usually use deep neural networks as the encoder. For example, SDNE (Wang et al., 2016) and DNGR (Cao et al., 2016) use deep neural networks as the encoder and decoder functions to generate low-dimensional representations. GAE and VGAE (Kipf and Welling, 2016b) aggregate neighborhood messages based on convolutional encoders, e.g., graph convolutional networks (GCN) (Kipf and Welling, 2016a) and their variants, to generate node embeddings. The encoders share parameters across nodes, which leads to better efficiency. Note that GCN variants like GraphSAGE (Hamilton et al., 2017a) and GAT (Veličković et al., 2017) are not discussed, as they mostly focus on message passing, which is not the main focus of this work.
Another successful variant of graph autoencoders incorporates the generative adversarial networks (GAN) for representation learning. For example, ARGA and ARVGA (Pan et al., 2018) enforce the latent node embeddings to match a prior normal distribution based on an adversarial training mechanism. The adversarial training procedure usually provides regularization and results in more robust and meaningful representations (Makhzani et al., 2015). DBGAN (Zheng et al., 2020) estimates the prior distribution of latent representations by prototype learning and aims to balance both sample-level and distribution-level consistency via a novel bidirectional adversarial learning framework.
A common theme among most of the aforementioned approaches is that they do not explicitly consider the semantic relatedness between nodes. Shallow models like DeepWalk (Perozzi et al., 2014) mostly focus on preserving the topological structure of the network while neglecting the rich information in node content. Deep models only implicitly incorporate node content by aggregating neighborhood node features using powerful encoders like graph convolutional networks.
In this paper, we propose a novel network embedding method based on a multiview graph convolutional network and adversarial regularization. The method aims to preserve the distribution consistency across two views of the network, and to shape the output representations to match an arbitrary prior distribution, by incorporating a multiview adversarial regularization module. More specifically, we regard the topological structure as the first and main view of the network, and create a second view that captures the relatedness between nodes based on node content. Different from DBGAN (Zheng et al., 2020), which tries to reconstruct the node features directly, the proposed method relaxes this requirement and focuses on preserving the semantic relatedness between nodes. A multiview reconstruction loss function is leveraged to optimize the model jointly. We evaluate the proposed method on three diverse applications. The experimental results on benchmark datasets demonstrate that the method outperforms state-of-the-art algorithms in link prediction and node clustering. We also evaluate our method on a real-world downstream application, i.e., ICU readmission prediction, where it compares favorably with several baseline methods. Our contributions can be summarized as follows:
• We propose a novel network embedding method, i.e., the Multiview Adversarially Regularized Graph Autoencoder (MRGAE). Unlike previous studies that either neglect node content or aim to reconstruct the entire node feature matrix, we focus on the semantic relatedness between nodes and aim to preserve the consistency of node representations across two specific views of the network. We incorporate a multiview adversarial regularization module to achieve this objective and to enforce the output representations to match a prior distribution.
• We conduct extensive and diverse experiments for evaluation. The experimental studies demonstrate the strong performance of our method, which sets new state-of-the-art results in link prediction and node clustering on benchmark datasets. Our method also compares favorably with baselines in the task of ICU readmission prediction.

Graph Convolutional Networks
Most recent graph neural network models use a common architecture, i.e., graph convolutional networks (GCN) (Kipf and Welling, 2016a), to encode the input networks. Essentially, a GCN transforms the original graph or network into a lower-dimensional representation matrix $Z$, given the adjacency matrix $A$ and the feature matrix $X$ as input. Each layer's transformation can be written as a non-linear convolution function:

$$H^{(l+1)} = f\big(H^{(l)}, A\big), \tag{1}$$

where $H^{(0)} = X$ is the input feature matrix, and $H^{(l)}$ refers to the output representation matrix (i.e., embeddings) $Z^{(l)}$ of the $l$-th convolutional layer. Different choices of the convolution function $f$ correspond to variants of the GCN model. The standard convolution function can be written as:

$$f\big(H^{(l)}, A\big) = \sigma\big(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\big), \tag{2}$$

where $\hat{A} = A + I$, $I$ is the identity matrix, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, and $W^{(l)}$ is the weight matrix of the $l$-th layer, which is the parameter to optimize. We use the ReLU function as the activation function $\sigma$ in this paper, and adopt a two-layer GCN as the encoder in all experiments.
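The two-layer encoder above can be sketched in a few lines of NumPy. This is a minimal dense-matrix sketch for exposition only (the helper names are illustrative; practical implementations use sparse operations):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A, H, W):
    # A_hat = A + I adds self-loops; D_hat is the diagonal degree matrix of A_hat
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    # symmetric normalization, learnable linear map, then ReLU
    return relu(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W)

def encode(A, X, W0, W1):
    # two-layer GCN encoder, as adopted throughout the paper
    return gcn_layer(A, gcn_layer(A, X, W0), W1)
```

Stacking two such layers lets each node's embedding aggregate information from its two-hop neighborhood.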

Adversarial Regularization
Adversarial regularization has proven effective in various network representation learning approaches (Makhzani et al., 2015; Pan et al., 2018; Dai et al., 2018). Generally, in the encoder-decoder framework, one can view the encoder as a generator, and incorporate a discriminator (e.g., a multi-layer perceptron) to distinguish whether a latent representation is from the encoder or from an arbitrary prior distribution. By incorporating this module, one can shape the learned representations to match an arbitrary prior distribution, e.g., a Gaussian distribution. This is similar in spirit to VGAE, which uses KL divergence instead of adversarial training to achieve the same purpose (Makhzani et al., 2015). In this work, we extend the adversarial regularization module to a multiview scenario, where we aim to enforce the learned representations from the two views to be distribution consistent and to match a prior distribution.

Overview
The overall framework of the proposed method contains three main parts, as depicted in Figure 1. First, we consider the topological structure as the first and main view of the input network, and create a second view of it. Next, we use two graph convolutional networks (GCN) as the encoders to separately encode the two views of the input network. Then, we incorporate two discriminators, one to distinguish between the representations from the main view and the prior distribution, the other to distinguish between the representations from the two views. In this paper, we use the Gaussian distribution as the prior distribution, since the Gaussian assumption has been widely adopted in various previous studies (Makhzani et al., 2015; Kipf and Welling, 2016b; Pan et al., 2018). Finally, we design a specific multiview reconstruction loss function, combine it with the two discriminators, and optimize the model jointly.

Notations
Given the undirected input network G = (V, E), we regard it as the first view G_1 and create a second view G_2 from it. Specifically, we denote the two views of the network G as G_1 = (V, E_1) and G_2 = (V, E_2), respectively. Note that we have E_1 = E. Each view G_i (i = 1, 2) has the same node set V with N nodes (N = |V|) and a different set of edges E_i. Each view has its own adjacency matrix A_i and degree matrix D_i. We further introduce an N × D feature matrix X for V, where each row is the D-dimensional input feature vector of the corresponding node. For featureless networks, we use the identity matrix as a replacement for X. The goal is to learn a unified representation matrix Z for the nodes.

Second View Construction
We aim to construct a second view G 2 = (V, E 2 ) of the network that captures the semantic relatedness between nodes. To define E 2 , we adopt a straightforward strategy to calculate cosine similarities between node content. Essentially, if the cosine similarity between two nodes is greater than a threshold α prox , then we create a link between them in the second view.
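This construction can be sketched as follows. The NumPy helper below is illustrative (the function name and default threshold are not the authors' code); it builds the adjacency matrix of the second view from a node feature matrix:

```python
import numpy as np

def build_second_view(X, alpha_prox=0.7):
    """Link node pairs whose content vectors have cosine similarity > alpha_prox."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)   # row-normalize features
    S = Xn @ Xn.T                          # pairwise cosine similarities
    A2 = (S > alpha_prox).astype(float)    # threshold to get edges E_2
    np.fill_diagonal(A2, 0.0)              # no self-loops
    return A2
```

Because cosine similarity is symmetric, the resulting second view is an undirected graph, matching the first view.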

Encoder-Decoder Framework
In this paper, we follow the generalized encoder-decoder framework (Hamilton et al., 2017b) for learning network representations. More specifically, we adopt two-layer GCNs as the encoders, each of which encodes one single view of the input multiview network. Essentially, the encoder model transforms the nodes in the network into low-dimensional feature representations (i.e., embeddings), and this encoding process can be written as:

$$Z_i = \mathrm{GCN}\big(X, A_i\big), \tag{3}$$

where $Z_i$ refers to the representation matrix learned from the $i$-th view $G_i$. Along with Equation 2, the encoding process can be further expanded as:

$$\mathrm{GCN}\big(X, A_i\big) = \sigma\Big(\hat{D}_i^{-\frac{1}{2}} \hat{A}_i \hat{D}_i^{-\frac{1}{2}}\, \sigma\big(\hat{D}_i^{-\frac{1}{2}} \hat{A}_i \hat{D}_i^{-\frac{1}{2}} X W_i^{(0)}\big)\, W_i^{(1)}\Big), \tag{4}$$

where $\hat{A}_i$ and $\hat{D}_i$ refer to the adjacency matrix (with self-loops) and degree matrix of the $i$-th view $G_i$, respectively. Similarly, $W_i^{(l)}$ represents the parameter matrix of the $l$-th layer graph convolutional network for $G_i$. Thus, in general, the encoding process of Equation 3 can be written as:

$$Z_i \sim q\big(Z_i \mid X, \hat{A}_i\big). \tag{5}$$

With regard to the decoder model, it decodes the learned low-dimensional representations into information that can be evaluated, for example, the existence of edges between nodes or label predictions on specific downstream tasks. Such evaluations are a good way to measure the quality of the learned node representations. In this paper, we use a simple yet effective pair-wise inner-product decoder to reconstruct the edges of the original network:

$$\hat{A}'_i = \sigma_s\big(Z_i Z_i^{\top}\big). \tag{6}$$

The inner-product decoder aims to reconstruct the edge set of the input network, where the reconstructed edge set should be as similar as possible to the original one. In our case, the reconstruction loss is calculated on each of the views, i.e., the decoder reconstructs each view from the representations learned from that view, respectively. The decoding process is shown as follows:

$$p\big((\hat{A}_i)_{pq} = 1 \mid z_{ip}, z_{iq}\big) = \sigma_s\big(z_{ip}^{\top} z_{iq}\big), \tag{7}$$

where $(\hat{A}_i)_{pq}$ refers to the edge between nodes $p$ and $q$, and $\sigma_s$ is the logistic sigmoid function.
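The pair-wise inner-product decoder is straightforward to sketch (illustrative NumPy only; `decode` is a hypothetical helper name):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(Z):
    """Pair-wise inner-product decoder: p(edge p,q) = sigmoid(z_p . z_q)."""
    return sigmoid(Z @ Z.T)
```

Nodes with well-aligned embeddings receive edge probabilities near 1, while orthogonal embeddings yield a probability of exactly 0.5.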

Multiview Adversarial Regularization
The intuition is that we want the latent embeddings learned from different views to be consistent, i.e., the same nodes from different views should be close in the embedding space, and the learned latent embeddings from different views should fit a similar distribution. Thus, we propose that the loss function should take the following form:

$$L = \sum_{i=1}^{2} \alpha_i\, L_{\mathrm{rec}}\big(A_i, Z_i\big) + S\big(Z_1, Z_2\big), \tag{8}$$

where the $\alpha_i$ are balancing coefficients. Intuitively, the first term corresponds to the sum of the individual reconstruction losses from each view. The second term $S$ models the consistency across different views, and specific methods differ in how this term is chosen and parameterized. We then introduce a multiview reconstruction loss (MRL) function:

$$L_{\mathrm{MRL}} = \sum_{i=1}^{2} \alpha_i\, L_{\mathrm{rec}}\big(A_i, Z_i\big) + L_{\mathrm{rec}}\big(A_1, Z_1, Z_2\big), \tag{9}$$

where the first term is again the sum of the individual reconstruction losses from each view, and the second term is the loss of reconstructing the graph structure of the main view $G_1$ with the encoded representations from both views, i.e., $Z_1$ and $Z_2$. Instead of only adding the individual reconstruction losses together, we use the encoded representations from both views to jointly reconstruct the main structure, thus achieving better consistency and robustness. More specifically, we have $p\big((\hat{A}_1)_{pq} = 1 \mid z_{1p}, z_{2q}\big) = \sigma_s\big(z_{1p}^{\top} z_{2q}\big)$.
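Under the definitions above, the MRL can be sketched as a sum of binary cross-entropy terms over reconstructed edge probabilities. This is a minimal NumPy sketch under the assumption that each reconstruction loss is a binary cross-entropy, with unit balancing coefficients; the helper names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(A, P, eps=1e-12):
    # binary cross-entropy between target adjacency A and edge probabilities P
    return -np.mean(A * np.log(P + eps) + (1 - A) * np.log(1 - P + eps))

def mrl(A1, A2, Z1, Z2):
    """Multiview reconstruction loss: per-view terms plus a cross-view term
    that reconstructs the main view from Z1 and Z2 jointly."""
    loss = bce(A1, sigmoid(Z1 @ Z1.T))   # reconstruct view 1 from Z1
    loss += bce(A2, sigmoid(Z2 @ Z2.T))  # reconstruct view 2 from Z2
    loss += bce(A1, sigmoid(Z1 @ Z2.T))  # jointly reconstruct the main view
    return loss
```

The third term is what couples the two encoders: it is small only when $Z_1$ and $Z_2$ place the same nodes close together.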
Unlike previous work, we incorporate two discriminators, namely the normal discriminator $D_n$ and the view discriminator $D_v$, to distinguish between the representations from the main view and the Gaussian prior, and between the representations from the two views, respectively, as depicted in Figure 1. We share weights between the two discriminators. The adversarial losses for the two discriminators are defined as:

$$L_{D_n} = -\mathbb{E}_{z \sim p(z)}\big[\log D_n(z)\big] - \mathbb{E}\big[\log\big(1 - D_n\big(G_1(X, A_1)\big)\big)\big], \tag{10}$$

$$L_{D_v} = -\mathbb{E}\big[\log D_v\big(G_1(X, A_1)\big)\big] - \mathbb{E}\big[\log\big(1 - D_v\big(G_2(X, A_2)\big)\big)\big], \tag{11}$$

where $G_1$ and $G_2$ refer to the two GCN encoders, respectively. Finally, we use a weighted sum of the above losses:

$$L_1 = L_{\mathrm{MRL}} + \beta\big(L_{D_n} + L_{D_v}\big) + \gamma L_{\mathrm{reg}}, \tag{12}$$

where $L_{\mathrm{reg}}$ is a regularization term with $L_{\mathrm{reg}} = \mathbb{E}_{Z_1 \sim q(X, \hat{A}_1)}\big[-\log D_n(Z_1)\big]$. We then jointly train the model by minimizing $L_1$, and finally take the encoded representations from the main view, i.e., $Z_1$, as the output representations.
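The two discriminator losses can be sketched as standard GAN objectives. The NumPy sketch below is illustrative only: it assumes a shared two-hidden-layer MLP discriminator as described above, and omits the generator updates and the exact alternating training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_disc(params, Z):
    # two hidden ReLU layers, scalar sigmoid output per row of Z
    W1, W2, W3 = params
    h = np.maximum(Z @ W1, 0.0)
    h = np.maximum(h @ W2, 0.0)
    return sigmoid(h @ W3).ravel()

def adversarial_losses(params, Z1, Z2):
    """D_n separates the Gaussian prior from Z1; D_v separates Z1 from Z2.
    The two discriminators share the same parameters, as in the paper."""
    prior = rng.standard_normal(Z1.shape)  # samples from the Gaussian prior
    eps = 1e-12
    l_n = -np.mean(np.log(mlp_disc(params, prior) + eps)) \
          - np.mean(np.log(1.0 - mlp_disc(params, Z1) + eps))
    l_v = -np.mean(np.log(mlp_disc(params, Z1) + eps)) \
          - np.mean(np.log(1.0 - mlp_disc(params, Z2) + eps))
    return l_n, l_v
```

During training, the discriminator parameters are updated to minimize these losses while the encoders are updated adversarially, pushing $Z_1$ toward the prior and $Z_2$ toward the distribution of $Z_1$.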

Experiments
In this section, we evaluate our proposed method on three tasks. First, we conduct link prediction experiments on three benchmark citation networks. We also report node clustering experiments on the same networks. Finally, we apply the proposed method to a real-world medical application, i.e., 30-day unplanned ICU readmission prediction.

Link Prediction
Link prediction is a popular task for evaluating network embedding methods. Essentially, a small portion of the edges is removed to generate the validation and test sets, and the same number of unconnected node pairs is randomly sampled as negative examples. The goal of the task is to predict whether or not an edge exists between two nodes.
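The split described above can be sketched as follows (a hypothetical NumPy helper for exposition; a full pipeline would also remove the held-out edges from the training adjacency and split the held-out pairs into validation and test portions):

```python
import numpy as np

def sample_eval_edges(A, frac=0.15, seed=0):
    """Hold out a fraction of edges as positive evaluation pairs and sample
    an equal number of unconnected node pairs as negatives."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(np.triu(A, 1) > 0)   # undirected edges, upper triangle
    n_hold = max(1, int(len(pos) * frac))
    held = pos[rng.choice(len(pos), n_hold, replace=False)]
    neg, N = [], A.shape[0]
    while len(neg) < n_hold:               # rejection-sample non-edges
        p, q = rng.integers(N), rng.integers(N)
        if p != q and A[p, q] == 0:
            neg.append((p, q))
    return held, np.array(neg)
```

The decoder's edge probabilities on the held-out positive and negative pairs are then scored with AUC and AP.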

Dataset and Second View Construction
We conduct the experiment on three popular citation networks, i.e., Cora, Citeseer and Pubmed (Sen et al., 2008). The nodes represent scientific publications from different areas, and the edges represent the citation links between them. The nodes are represented with feature vectors: 0/1-valued word vectors indicating the absence/presence of corresponding words (Cora and Citeseer) or tf-idf weighted word vectors (Pubmed). Each node has a corresponding class label.
In this experiment, we take the original edge set of the input network, i.e., the citation links, as the first and main view. We construct the second view based on textual similarities. Essentially, if the cosine similarity between two publications is greater than an empirical threshold of 0.7, then we create a link between them in the second view.

Experiment Settings
For all the experiments, we split each of the datasets into the training set (85%), the validation set (5%), and the test set (10%). To reduce the influence of randomness, we average the results over five randomly selected splits as in (Zheng et al., 2020).
We use the same set of hyperparameters for the GCN encoder as the baselines (Kipf and Welling, 2016b; Pan et al., 2018; Zheng et al., 2020). More specifically, we use a 32-dim hidden layer and 16-dim latent representations for the GCN encoder in the link prediction task. We also use two multi-layer perceptrons (MLP) as the discriminators, each of which consists of two 128-dim hidden layers. We set the balancing factors α_1, α_2, γ to 1.0, and set β to 0.8 in all experiments. The performance of our method is recorded as MRGAE in Table 1.
Note that DBGAN uses a larger embedding size in its experiments. For a fair comparison, we also set the representation size to 32-dim (Cora) and 64-dim (Citeseer and Pubmed); the corresponding results are recorded as MRGAE*.

Results
We use the same evaluation metrics as previous work, i.e., area under the Receiver Operating Characteristic curve (AUC) and average precision (AP). As shown in Table 1, the proposed method (MRGAE) achieves the best performance on all three citation networks, outperforming the state-of-the-art method, i.e., DBGAN, indicating the effectiveness of exploiting node content via multiview adversarial regularization.
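For reference, AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair. A small self-contained sketch (in practice one would use a library implementation such as scikit-learn's):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative; ties count as half a win."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

A score of 1.0 means every positive pair is ranked above every negative pair; 0.5 corresponds to random ranking.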

Node Clustering
In this experiment, we consider another unsupervised task, clustering the nodes of a network. We first compute the embeddings of Cora and Citeseer and perform K-means clustering on them, where K is set to the number of node classes in each network. Then we follow the same procedure as previous work (Shi et al., 2019; Xia et al., 2014; Pan et al., 2018) and match the predicted class labels with the ground-truth labels using the Munkres assignment algorithm (Munkres, 1957). The results are evaluated with accuracy (Acc), normalized mutual information (NMI), precision (Prec), F-score (F1) and adjusted rand index (ARI).
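The label-matching step can be sketched as follows. The helper below uses a brute-force search over label permutations, which is equivalent to the Munkres assignment for accuracy when the number of clusters is small (illustrative only; real implementations use the Hungarian algorithm):

```python
import numpy as np
from itertools import permutations

def matched_accuracy(y_true, y_pred, k):
    """Clustering accuracy after optimally mapping predicted cluster ids
    to ground-truth labels (brute force over k! permutations)."""
    best = 0.0
    for perm in permutations(range(k)):
        mapped = np.array([perm[c] for c in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best
```

This resolves the arbitrary numbering of K-means clusters before computing accuracy against the class labels.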

Baselines
In addition to the baselines used in the link prediction task, we include four more baseline algorithms designed for clustering: RTM (Chang and Blei, 2009), RMSC (Xia et al., 2014), TADW (Yang et al., 2015), and GALA (Park et al., 2019).

Results
For a fair comparison, we first set the size of the output representations to 16-dim and record the results as MRGAE. Since GALA and DBGAN only report high-dimensional performance, we also report the performance of our method with the same dimensions as DBGAN (i.e., 128-dim for Cora, 64-dim for Citeseer) and record the results as MRGAE*.
As shown in Table 2, our proposed method MRGAE outperforms the other methods on both datasets across all metrics. For the Cora dataset, MRGAE* shows superior performance to GALA and DBGAN on all metrics except NMI. For the Citeseer dataset, MRGAE* and DBGAN perform similarly well while GALA gives the best results; this is mainly because GALA uses 500-dim node representations, much larger than those of DBGAN and MRGAE*.

Table 3: Effectiveness evaluation of D_v and MRL.

Ablation Study
In this section, we validate the effectiveness of the multiview adversarial regularization module in our proposed method. We conduct ablation experiments on both the link prediction and node clustering tasks with the Cora dataset. We first remove the view discriminator D_v; without it, the proposed method loses the ability to preserve the distribution consistency across the two specific views. We then remove the multiview reconstruction loss (MRL) and replace it with a simple GAE-based reconstruction loss; without it, the method loses the rich information from the generated second view. Finally, we remove both parts. The three ablated methods are recorded as "w/o D_v", "w/o MRL" and "w/o both", respectively.
As shown in Table 3, removing either part causes a performance decrease on both the link prediction and node clustering tasks, indicating the effectiveness and necessity of D_v and MRL. The ablated method "w/o both" shows the largest performance decrease, which further validates the claim.

30-day Unplanned ICU Patient Readmission Prediction
In real-world networks, node content usually carries rich and important information for downstream applications, which highlights the practical value of the proposed method. Therefore, for a more thorough evaluation, we apply our method to a real-world application, i.e., unplanned ICU patient readmission prediction, and test whether any performance gain can be achieved compared with several baseline embedding methods.
We conduct this experiment based on Lin et al.'s work (Lin et al., 2018), which leverages embeddings of medical concepts (in the form of ICD-9 codes) and achieves state-of-the-art performance. According to their findings, incorporating embeddings of medical concepts greatly benefits prediction performance. In this experiment, we test 30-day unplanned ICU patient readmission prediction performance with different network embeddings for the ICD-9 ontology.

Dataset and Second View Construction
In this experiment, we follow the data preprocessing procedure of previous work (Harutyunyan et al., 2017; Lin et al., 2018; Lu et al., 2019), and generate a dataset of 48,410 ICU stay records from the freely available MIMIC-III database (Johnson et al., 2016). The task is to predict whether or not a patient in an ICU stay will be readmitted within 30 days after discharge.
We take the transitive closure of the ICD-9 ontology as the first and main view. We first transform the short textual descriptions of nodes into one-hot representations, and compute the cosine similarities between them. If the cosine similarity between two nodes is greater than an empirical threshold of 0.7, we create a link between them in the second view of ICD-9.

Baselines
Apart from the baselines used in the link prediction and node clustering tasks, we add one more strong baseline, i.e., Poincaré embeddings (Nickel and Kiela, 2017), which have proven particularly effective for embedding hierarchical data such as the ICD-9 ontology.

Experiment Settings
We use the same metrics as Lin et al.'s work (Lin et al., 2018). The area under the Receiver Operating Characteristic curve (AUC or A.R) is the main evaluation metric. The recall rate of positive cases (Re-1), i.e., sensitivity, is also important for screening real patients. Additional metrics are reported, but they can be unstable and are better used as supplementary evaluation. Lin et al. use embeddings of ICD-9 codes as part of their input; we replace these embeddings with those produced by the different methods.

Results
As shown in Table 4, our proposed method achieves the best AUC score of 0.7807 with the highest sensitivity score of 0.7259. It is worth mentioning that the best reported AUC of Lin et al. is 0.791, but a direct comparison would be unfair, since all the embeddings in the table are trained from the ICD-9 ontology only, while the Claims embeddings (Choi et al., 2016) they use are trained from millions of textual records.

Related Work
Recently, researchers have used specifically designed encoders that aggregate the local neighborhood information of nodes to generate low-dimensional embeddings. For example, GAE and VGAE (Kipf and Welling, 2016b) are two methods that use graph convolutional networks (GCN) (Kipf and Welling, 2016a) as the encoder. VGAE uses the Gaussian distribution as a prior and pushes the learned representations close to this prior by incorporating a KL divergence penalty. ARGA and ARVGA (Pan et al., 2018) incorporate an adversarial regularization framework for the same purpose, which is essentially similar in spirit to VGAE. Incorporating adversarial regularization terms and matching the latent representations to a prior distribution is particularly useful for generating robust and meaningful representations from real-world complex graph data, an idea first proposed in the Adversarial Autoencoder (AAE) (Makhzani et al., 2015). We extend the adversarial regularization framework to a multiview scenario where the distribution consistency across the graph space and the node content space is to be preserved. Unlike previous methods such as DBGAN (Zheng et al., 2020) and DANE (Gao and Huang, 2018), which aim to reconstruct the node content directly, we focus on the semantic relatedness among nodes.

Conclusion
In this study, we propose a novel network embedding method, i.e., MRGAE, which models the consistency of node representations across two specific views of networks. To achieve this, we incorporate a multiview adversarial regularization module and specifically design a loss function for joint optimization. We conduct extensive and diverse experiments for evaluation, and the results demonstrate the superb performance of the proposed method.