A Deep Neural Information Fusion Architecture for Textual Network Embeddings

Textual network embeddings aim to learn a low-dimensional representation for every node in the network so that both the structural and textual information from the network is well preserved in the representations. Traditionally, the structural and textual embeddings were learned by models that rarely take the mutual influences between them into account. In this paper, a deep neural architecture is proposed to effectively fuse the two kinds of information into one representation. The novelties of the proposed architecture lie in a newly defined objective function, a complementary information fusion method for structural and textual features, and a mutual gate mechanism for textual feature extraction. Experimental results show that the proposed model outperforms the compared methods on all three datasets.


Introduction
Networks provide an effective way to organize heterogeneous relevant data, which can often be leveraged to facilitate downstream applications. For example, the huge amount of textual and relationship data in social networks contains abundant information on people's preferences, and thus can be used for personalized advertising and recommendation. To this end, traditionally a matrix representing the network structure is often built first, then subsequent tasks proceed. However, matrix methods are computationally expensive and cannot be applied to large-scale networks.
Network embedding (NE) maps every node of a network into a low-dimensional vector, while seeking to retain the original network information. Subsequent tasks (e.g., similar vertex search and link prediction) can proceed by leveraging these low-dimensional features. To obtain network embeddings, (Perozzi et al., 2014) proposed to first generate sequences of nodes by randomly walking along connected nodes. Word embedding methods are then employed to produce the embeddings for nodes by exploiting the analogy between node sequences and sentences in natural languages. Second-order proximity information is further taken into account in (Tang et al., 2015). To gather the network connection information more efficiently, the random walk strategy in (Perozzi et al., 2014) is modified to favor the important nodes in (Grover and Leskovec, 2016). However, all these methods only take the network structure into account, ignoring the huge amount of textual data. In the Twitter social network, for example, tweets posted by a user contain valuable information on the user's preferences, political standpoints, and so on (Bandyopadhyay et al., 2018).
To include textual information in the embeddings, (Tu et al., 2017) proposed to first learn embeddings for the textual data and network structure respectively, and then concatenate them to obtain the embeddings of nodes. The textual and structural embeddings are learned with an objective that encourages embeddings of neighboring nodes to be as similar as possible. An attention mechanism is further employed to highlight the important textual information by taking the impacts of texts from neighboring nodes into account. Later, (Shen et al., 2018b) proposed to replace the attention mechanism in (Tu et al., 2017) with a fine-grained word alignment mechanism in order to absorb the impacts from neighboring texts more effectively. However, both methods require the textual and structural embeddings of neighboring nodes to be as close as possible even if the nodes share little common content. This could be problematic, since a social network user may be connected to users who post totally different viewpoints because of different political standpoints. If two nodes are similar, it is the node embeddings, rather than the individual textual or structural embeddings, that should be close. Forcing representations of dissimilar data to be close is prone to yield bad representations. Moreover, since the structural and textual embeddings contain some common information, if they are concatenated directly, as done in (Tu et al., 2017; Shen et al., 2018b), the information contained in the two parts becomes entangled in a complicated way, making it more difficult to learn representative network embeddings.
In this paper, we propose a novel deep neural Information Fusion Architecture for textual Network Embedding (NEIFA) to tackle the issues mentioned above. Instead of forcing the separate embeddings of structures and texts from neighboring nodes to be close, we define the learning objective based on the node embeddings directly. For the problem of information entanglement, inspired by the gating mechanism of long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), we extract the complementary information from texts and structures and then use it to constitute the node embeddings. A mutual gate is further designed to highlight the node's textual information that is consistent with its neighbors' textual contents, while diminishing the information that contradicts them. In this way, the model provides a mechanism that only allows the information that is consistent among neighboring nodes to flow into the node embeddings. The proposed network embedding method is evaluated on the tasks of link prediction and vertex classification, using three real-world datasets from different domains. It is shown that the proposed method outperforms state-of-the-art network embedding methods on the task of link prediction by a substantial margin, demonstrating that the obtained embeddings retain the information in the original networks well. Similar results are observed in the vertex classification task. These results suggest the effectiveness of the proposed neural information fusion architecture for textual network embeddings.

Related Work
Text Embedding There have been various methods to embed textual information into vector representations for NLP tasks. Classical methods for embedding textual information include one-hot vectors, term frequency-inverse document frequency (TF-IDF), etc. To address the high dimensionality and sparsity of these representations, (Mikolov et al., 2013a) proposed a neural network based skip-gram model to learn distributed word embeddings via word co-occurrences in a local window of textual content. To exploit the internal structure of text, convolutional neural networks (CNNs) (Blunsom et al., 2014; Kim, 2014) are applied to obtain latent features of local textual content. A subsequent pooling layer then generates fixed-length representations. To have the embeddings better reflect the correlations among texts, soft attention mechanisms (Bahdanau et al., 2014; Vaswani et al., 2017) are proposed to calculate the relative importance of words in a sentence by evaluating their relevance to the content of the sentences being compared. Alternatively, gating mechanisms are applied in (Dauphin et al., 2017; Zhou et al., 2017) to strengthen the relevant textual information while weakening the irrelevant information by controlling the information-flow path of a network.

Network Embedding
Network embedding methods can be categorized into two classes: (1) methods that solely utilize structure information; and (2) methods that consider both structure and textual content associated with vertices. For the first type of methods, DeepWalk (Perozzi et al., 2014) was the first to introduce neural network techniques into the network embedding field. In DeepWalk, node sequences are generated by randomly walking on the network, and dense latent representations are learned by feeding those node sequences into the skip-gram model. LINE (Tang et al., 2015) exploited the first-order and second-order proximity information of vertices in the network by optimizing the joint and conditional probabilities of edges. Further, Node2Vec (Grover and Leskovec, 2016) proposed a biased random walk that searches a network and generates node sequences based on depth-first search and breadth-first search. However, those methods only embed the structure information into vector representations, while ignoring the informative textual contents associated with vertices. To address this issue, some recent work seeks to exploit the joint impact of structure and textual contents to obtain better representations. TADW (Yang et al., 2015) proved that DeepWalk is equivalent to matrix factorization and that the textual information can be incorporated by simply adding the textual features into the matrix factorization. CENE (Sun et al., 2016) works by transforming the textual content into another kind of vertex, and the vertices are embedded into low-dimensional representations on the extended network. CANE (Tu et al., 2017) proposed to learn separate embeddings for the textual and structural information, and to obtain the network embeddings by simply concatenating them, in which a mutual attention mechanism is used to model the semantic relationship between textual contents. WANE (Shen et al., 2018b) modified the semantic extraction strategy in CANE by introducing a fine-grained word alignment technique to learn word-level semantic information more effectively. However, most recent methods force the textual and structural embeddings of two neighboring nodes to be close to each other irrespective of their underlying contents.
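The random-walk step that DeepWalk-style methods use to turn a graph into node sequences can be sketched as follows. This is a minimal illustration with uniform neighbor sampling; the function name `random_walk`, the toy adjacency list and the walk length are assumptions made for the example and are not taken from any of the cited implementations.

```python
import random


def random_walk(adjacency, start, length, rng):
    """Return a uniform random walk of at most `length` nodes from `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adjacency[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Toy undirected graph stored as an adjacency list (hypothetical data).
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)

# One walk per start node; each walk plays the role of a "sentence"
# that a skip-gram model would later consume.
walks = [random_walk(adjacency, node, 5, rng) for node in adjacency]
```

A biased walk in the spirit of Node2Vec would replace the uniform `rng.choice` with a weighted choice over the neighbors.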

The Proposed Method
A textual network is defined as $G = \{V, E, T\}$, where $V$, $E$ and $T$ denote the vertices in the graph, the edges between vertices and the textual content associated with vertices, respectively. Each edge $e_{i,j} \in E$ indicates that there is a relationship between vertices $v_i$ and $v_j$.

Training Objective
Suppose the structural and textual features of node $i$ are given and are denoted as $s_i$ and $t_i$, respectively. Existing methods are built on objectives that encourage the structural and textual features of neighboring nodes to be as similar as possible. As discussed in the previous sections, this may make the node embeddings deviate from the true information in the nodes. In this paper, we define the objective based on the node embeddings directly, that is,

$$\mathcal{L} = \sum_{e_{i,j} \in E} \log p(h_i \mid h_j), \tag{1}$$

where $h_i$ is the network embedding of node $i$, and is constructed from $s_i$ and $t_i$ by

$$h_i = F(s_i, t_i); \tag{2}$$

$F(\cdot, \cdot)$ is the fusion function that maps the structural and textual features into the network embeddings; and $p(h_i \mid h_j)$ denotes the conditional probability of network embedding $h_i$ given the network embedding $h_j$. Following LINE (Tang et al., 2015), the conditional probability in (1) is defined as:

$$p(h_i \mid h_j) = \frac{\exp(h_i^\top h_j)}{\sum_{z=1}^{|V|} \exp(h_z^\top h_j)}. \tag{3}$$

Note that the structural feature $s_i$ is randomly initialized and will be learned along with the other model parameters. The textual feature $t_i$ is obtained via a trainable feature extraction function from the given texts, i.e.,

$$t_i = T(x_i), \tag{4}$$

where $x_i$ represents the texts associated with node $i$. From the definition of the objective function (1), it can be seen that it is the network embeddings of nodes, rather than the individual structural or textual embeddings, that are encouraged to be close for neighboring nodes. Details on how to realize the fusion function $F(\cdot, \cdot)$ and the feature extraction function $T(\cdot)$ are deferred to Section 3.2 and Section 3.3, respectively. The overall framework of our proposed NEIFA is shown in Fig.1.
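As a small numerical illustration of this objective, the sketch below evaluates the log of a LINE-style softmax conditional probability and sums it over a list of edges. It assumes that the score between two nodes is the inner product of their embeddings and that the normalization runs over all nodes; the helper names and the toy embeddings are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_conditional(h, i, j):
    """log p(h_i | h_j) under a softmax over all node embeddings in `h`."""
    scores = [dot(h_z, h[j]) for h_z in h]
    m = max(scores)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[i] - log_norm


def objective(h, edges):
    """Sum of log-conditionals over the (i, j) pairs in `edges`."""
    return sum(log_conditional(h, i, j) for i, j in edges)


# Three toy node embeddings: nodes 0 and 1 point the same way, node 2 opposite.
h = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
edges = [(0, 1), (1, 0)]
value = objective(h, edges)
```

The denominator touches every node, so evaluating it for every edge scales with the size of the network.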

Fusion of Structural and Textual Features $F(s_i, t_i)$
In this section, we present how to fuse the structural and textual features to yield the embeddings for nodes. The fusion module $F(s_i, t_i)$ is illustrated in Fig.2. The simplest way to obtain network embeddings is to concatenate the two features directly. However, it is known that the structural and textual features are not fully exclusive, and often contain some common information. Thus, if the network embeddings are generated by simply concatenating the two features, different parts of the embeddings become entangled with each other in some unknown but complex way. This may make the objective function more difficult to optimize and hinder the model from learning representative embeddings for the nodes. In this paper, we instead first distill from $s_i$ the information that is complementary to the textual feature $t_i$, and then concatenate the two pieces of complementary information to constitute the embeddings of nodes.
To distill the complementary information from the structural feature $s_i$, inspired by LSTM, an input gate is designed to eliminate the information in $s_i$ that has already appeared in $t_i$. Specifically, the gate is designed as

$$g_i = 1 - \sigma\big((P s_i + b_g) \odot t_i\big), \tag{5}$$

where $\sigma(\cdot)$ is the sigmoid function; $\odot$ denotes element-wise multiplication; and $P$ and $b_g$ are used to align the structural feature $s_i$ to the space of the textual feature $t_i$. From the definition of $g_i$, it can be seen that if the values on some specific dimension of $P s_i + b_g$ and $t_i$ are both large, which indicates that the same information appears in both $s_i$ and $t_i$, the gate $g_i$ will be closed on that dimension. So, if $(P s_i + b_g) \odot t_i$ is multiplied by the gate $g_i$, only the information that is not contained in both $s_i$ and $t_i$ is allowed to pass through. Thus, $((P s_i + b_g) \odot t_i) \odot g_i$ can be understood as the information in $s_i$ that is complementary to $t_i$. In practice, we untie the values of $P$ and $b_g$, and use a new trainable matrix $Q$ and bias $b_c$ instead. The complementary information is eventually computed as

$$z_i = \big((Q s_i + b_c) \odot t_i\big) \odot g_i. \tag{6}$$

Then, we concatenate the complementary information $z_i$ with the textual feature $t_i$ to produce the final network embedding

$$h_i = [z_i; t_i]. \tag{7}$$

In this way, given the structural and textual features $s_i$ and $t_i$, we extract the complementary information from $s_i$ and generate the final network embedding $h_i$.
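The fusion step can be sketched in a few lines of plain Python. The sketch assumes the gate takes the form g = 1 - sigmoid((P s + b_g) * t), the complementary part is z = ((Q s + b_c) * t) * g with element-wise products, and the embedding is the concatenation of z and t. The function names and the identity matrices used for `P` and `Q` are illustrative assumptions, not learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, x):
    """Compute W x + b for a matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def fuse(s, t, P, b_g, Q, b_c):
    """Return the fused embedding [z; t] for structural s and textual t."""
    aligned = affine(P, b_g, s)
    # The gate closes (tends to 0) where aligned structure and text agree strongly.
    gate = [1.0 - sigmoid(a * ti) for a, ti in zip(aligned, t)]
    candidate = affine(Q, b_c, s)
    z = [c * ti * g for c, ti, g in zip(candidate, t, gate)]
    return z + t  # list concatenation


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
s = [5.0, 1.0]
t = [5.0, -1.0]
h = fuse(s, t, identity, zeros, identity, zeros)
```

In this toy case the first dimension carries the same large value in both features, so its gate is almost closed and the corresponding entry of z is close to zero, while the second dimension passes through with weight sigmoid(1).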

Textual Feature Extraction $T(x_i)$
When extracting textual features for the embeddings of nodes, the impacts from neighboring nodes should also be taken into account, i.e., the consistent information should be highlighted while the inconsistent information is dampened. To this end, we first represent words with their corresponding embeddings, and then apply a one-layer CNN followed by an average pooling operator to extract the raw features for texts (Tu et al., 2017; Shen et al., 2018a). Given the raw textual features $r_i$ and $r_j$ of two neighboring nodes $i$ and $j$, we diminish the information that is not consistent between the two raw features. Specifically, the final textual features are computed for nodes $i$ and $j$ as

$$t_i = r_i \odot \sigma(r_j), \tag{8}$$

$$t_j = r_j \odot \sigma(r_i), \tag{9}$$

where $\sigma(\cdot)$ plays the role of gating. Since the raw textual feature $r_i$ often exhibits specific meanings on different dimensions, the expressions (8) and (9) can be understood as a way to control which information is allowed to flow into the embeddings. Only the information that is consistent among neighboring nodes can appear in the textual feature $t_i$, which is then fused into the network embeddings. A variety of other nonlinear functions could play the role of gating, but in this work the simple yet effective sigmoid function is employed.
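A minimal sketch of the mutual gate follows, under the assumption that each node's raw feature is multiplied element-wise by the sigmoid of its neighbor's raw feature. The CNN and pooling that would produce the raw features are omitted, and the vectors below are made-up values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mutual_gate(r_i, r_j):
    """Gate each node's raw textual feature by its neighbor's raw feature."""
    t_i = [a * sigmoid(b) for a, b in zip(r_i, r_j)]
    t_j = [b * sigmoid(a) for a, b in zip(r_i, r_j)]
    return t_i, t_j


# Node i has the same raw value on both dimensions; the neighbor agrees on
# the first dimension (large positive) and disagrees on the second.
r_i = [2.0, 2.0]
r_j = [4.0, -4.0]
t_i, t_j = mutual_gate(r_i, r_j)
```

Here the first entry of `t_i` stays close to its raw value, while the second is strongly damped because the neighbor's feature is negative on that dimension.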

Training Details
Maximizing the objective function in (1) requires computing the expensive softmax function repeatedly, in which a summation over all nodes of the network is needed in every iteration. To address this issue, for each edge $e_{i,j} \in E$, we introduce negative sampling (Mikolov et al., 2013b) to simplify the optimization process. The conditional distribution $p(h_i \mid h_j)$ is thereby replaced with a negative-sampling approximation, in which each observed edge is contrasted against a small number of sampled nodes.
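One common way to write such a negative-sampling surrogate is sketched below. It assumes the standard form of Mikolov et al. (2013b), namely log sigmoid(h_i . h_j) plus the sum over K sampled nodes z of log sigmoid(-h_z . h_j). Drawing the negatives uniformly from the remaining nodes is an assumption made only to keep the example short; the noise distribution actually used by the model is not specified here.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def neg_sampling_objective(h, i, j, num_neg, rng):
    """Surrogate for log p(h_i | h_j) using `num_neg` sampled negatives."""
    positive = log_sigmoid(dot(h[i], h[j]))
    # Assumption: negatives are drawn uniformly from nodes other than i and j.
    candidates = [z for z in range(len(h)) if z not in (i, j)]
    negatives = [rng.choice(candidates) for _ in range(num_neg)]
    return positive + sum(log_sigmoid(-dot(h[z], h[j])) for z in negatives)


h = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
rng = random.Random(0)
value = neg_sampling_objective(h, 0, 1, 1, rng)
```

Each evaluation touches only `num_neg + 1` embeddings instead of all nodes, which is what makes the approximation cheap.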

Experiments
To evaluate the quality of the network embeddings generated by the proposed method, we apply them in two tasks: link prediction and vertex classification. Link prediction aims to predict whether there exists a link between two randomly chosen nodes based on the similarities of embeddings of the two nodes. Vertex classification, on the other hand, tries to classify the nodes into different categories based on the embeddings, provided that there exists some supervised information. Both tasks can achieve good performances only when the embeddings retain important information of the nodes, including both the structural and textual information. In the following, we will first introduce the datasets and baselines used in this paper, then describe the evaluation metric and experimental setups, and lastly report the performance of the proposed model on the tasks of link prediction and vertex classification, respectively.

Datasets and Baselines
Experiments are conducted on three real-world datasets: Zhihu (Sun et al., 2016), Cora (McCallum et al., 2000) and HepTh (Leskovec et al., 2005). The three datasets are described below, with their summary statistics given in Table 1. The preprocessing procedure of these datasets is the same as that in (Tu et al., 2017).
• Zhihu (Sun et al., 2016) is a Q&A based community social network. In our experiment, 10000 active users and the descriptions of the topics they are interested in are collected as the vertices and texts of the social network to be studied. There are 43894 edges in total, which indicate the relationships between active users.
• Cora (McCallum et al., 2000) is a citation network that consists of 2277 machine learning papers with text contents divided into 7 categories. The citation relations among the papers are reflected in the 5214 edges.
• HepTh (Leskovec et al., 2005) (High Energy Physics Theory) is a citation network from the e-print arXiv. In our experiment, 1038 papers with abstract information are collected, among which 1990 edges are observed.
To evaluate the effectiveness of our proposed model, it is compared with several strong baseline methods, which are divided into two categories as follows:
• Structure-only: DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), Node2vec (Grover and Leskovec, 2016).

Evaluation Metrics and Experimental Setups
In link prediction, the area under the curve (AUC) (Hanley and McNeil, 1982) is used as the performance criterion, which represents the probability that vertices in a random unobserved link are more similar than those in a random non-existent link. For the vertex classification task, a logistic regression model is first trained to classify the embeddings into different categories based on the provided labels of nodes. Then, the trained model is used to classify the network embeddings in the test set, and the classification accuracy is used as the performance criterion for this task.
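The probabilistic reading of AUC given above can be computed directly by comparing every held-out edge against every non-existent edge. The sketch below does this for two lists of similarity scores. Counting ties as one half is a common convention assumed here, and the scores are made-up numbers rather than model outputs.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Similarity scores of held-out edges vs. non-existent edges (hypothetical).
held_out = [0.9, 0.8]
non_existent = [0.1, 0.85]
score = auc(held_out, non_existent)  # 3 of the 4 pairs are ordered correctly
```

The double loop is quadratic in the number of pairs, so large test sets are usually handled by sampling pairs or by a rank-based formula.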
To have a fair comparison with competitive methods, the dimension of network embeddings is set to 200 for all considered methods. The number of negative samples is set to 1 and the mini-batch size is set to 64 to speed up the training process. Adam (Kingma and Ba, 2014) is employed to train our model with a learning rate of $1 \times 10^{-3}$.

Link Prediction
We randomly extract a portion of all edges to constitute the training set, and use the rest as the test set. The AUC scores of different models under training proportions ranging from 15% to 95% on the Zhihu, Cora and HepTh datasets are shown in Table 2, Table 3 and Table 4, respectively, with the best performance highlighted in bold.
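The edge split described above can be sketched as a shuffle followed by a cut at the chosen proportion. The function name, the fixed seed and the toy edge list are assumptions made for illustration; the original preprocessing code may differ, for instance in how it treats nodes that end up with no training edges.

```python
import random


def split_edges(edges, train_ratio, rng):
    """Shuffle `edges` and split them into (train, test) by `train_ratio`."""
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Twenty toy edges on a path graph (hypothetical data).
edges = [(k, k + 1) for k in range(20)]
rng = random.Random(0)
train_edges, test_edges = split_edges(edges, 0.15, rng)
```

Repeating the call with ratios from 0.15 to 0.95 gives the series of training proportions used in the link prediction experiments.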
As can be seen from Table 2, our proposed method substantially outperforms all other baselines on the Zhihu dataset, with approximately a 10 percent improvement over the current state-of-the-art WANE model. This may be partially attributed to the complexity of the Zhihu dataset, in which both the structures and texts contain important information. If the two individual features are concatenated directly, there may be a severe information overlapping problem, which hinders the models from learning good embeddings. The proposed complementary information fusion method alleviates the issue by disentangling the structural and textual features. In addition, the proposed mutual gate mechanism, which removes inconsistent textual information from a node's textual feature, also contributes to the performance gains. On the other hand, the substantial gain may also be partially attributed to the objective function that is directly defined on the network embeddings. That is because inconsistencies of the structural or textual information among neighboring nodes are more likely to occur in complex networks. For the other two datasets, as shown in Table 3 and Table 4, our proposed method outperforms the baseline methods overall. These results indicate that the network embeddings generated by the proposed model better preserve the original information in the nodes. It can also be seen that the performance gains observed on the Cora and HepTh datasets are not as substantial as that on the Zhihu dataset. The relatively small improvement may be attributed to the fact that the numbers of edges and neighbors in the Cora and HepTh datasets are much smaller than those in the Zhihu dataset. We speculate that the structures of the two datasets carry far less information than the texts, implying that the overlapping issue is not as severe as that in Zhihu. Hence, direct concatenation will not induce significant performance loss.

Ablation Study To demonstrate the effectiveness of the proposed fusion method and mutual gate mechanism, three variants of the proposed model are evaluated: (1) NEIFA(w/o FM): NEIFA without both the fusion process and the mutual gate mechanism, where the raw textual features r are directly regarded as network embeddings. (2) NEIFA(w/o F): NEIFA without the fusion process, where the textual features t are directly regarded as network embeddings. (3) NEIFA(w/o M): NEIFA without the mutual gate mechanism, where the network embeddings are obtained by fusing the structural features and raw textual features. The three variants are compared with the original NEIFA model on the three datasets above. The results are shown in Fig.3. It can be seen that for networks with very sparse structure, such as HepTh, the method that simply uses the raw textual features as network embeddings can achieve fairly good performance. On such simple datasets, the proposed model even performs worse when the proportion of training edges is small. As the datasets become larger and more complex network structure is included, the performance of using only the textual embeddings decreases rapidly. The reason may be that as the networks grow, the differences of structural or textual data among neighboring nodes become more apparent, and the advantages of the mutual gate mechanism and information fu-

Vertex Classification
To further evaluate the proposed method, a vertex classification experiment is also conducted on the Cora dataset. This experiment is based on the premise that if the original network contains different types of nodes, good embeddings should allow the nodes to be classified into their classes easily by a simple classifier. For the proposed method, the embedding of a node varies as it interacts with different nodes. To obtain a fixed embedding, we follow the procedure in (Tu et al., 2017) and compute a node's embedding by averaging the embeddings that are obtained when the node interacts with its different neighbors.
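The averaging step can be sketched as a per-dimension mean over the embeddings a node receives from its different neighbors. The function name and the toy vectors are hypothetical, and uniform weighting over neighbors is assumed.

```python
def average_embedding(context_embeddings):
    """Per-dimension mean of a node's neighbor-specific embeddings."""
    count = len(context_embeddings)
    dim = len(context_embeddings[0])
    return [sum(e[d] for e in context_embeddings) / count for d in range(dim)]


# Embeddings of one node obtained with two different neighbors (toy values).
per_neighbor = [[1.0, 2.0], [3.0, 4.0]]
fixed = average_embedding(per_neighbor)
```

The resulting vector no longer depends on any particular neighbor, so it can be passed to a standard classifier.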
To this end, we randomly split the embeddings of all nodes into a training set and a testing set of equal size. An L2-regularized logistic regression classifier (Fan et al., 2008) is then trained on the node embeddings from the training set. The classification performance is tested on the held-out testing set. The above procedure is repeated 10 times and the average accuracy is reported as the final performance. It can be seen from Fig.4 that methods considering both structural and textual information show better classification accuracies than methods leveraging only structural information, demonstrating the importance of incorporating textual information into the embeddings. Moreover, NEIFA outperforms all methods considered, which further supports the effectiveness of our proposed model.
To intuitively understand the embeddings produced by the proposed model, we employ t-SNE to map our learned embeddings to a 2-D space. The result is shown in Fig.5, where different colors indicate that the nodes belong to different categories. Note that, although the mapping in t-SNE is trained without using any category labels, the latent label information is still partially extracted. As shown in Fig.5, the points with the same color are closer to each other, while the ones with different colors are far apart.

Conclusions
In this paper, a novel deep neural architecture is proposed to effectively fuse the structural and textual information in networks. Unlike existing embedding methods, which encourage both the textual and structural embeddings of two neighboring nodes to be close to each other, we define the training objective based on the node embeddings directly.
To address the information duplication problem in the structural and textual features, a complementary information fusion method is further developed to fuse the two features. Besides, a mutual gate is designed to highlight the textual information in a node that is consistent with the textual contents of neighboring nodes, while diminishing the information that conflicts with them. Experimental results on link prediction and vertex classification demonstrate the advantages of our proposed model.