Improved Semantic-Aware Network Embedding with Fine-Grained Word Alignment

Network embedding, which learns a low-dimensional representation for each vertex in a large-scale network, has received considerable attention in recent years. In a wide range of applications, vertices in a network are typically accompanied by rich textual information such as user profiles or paper abstracts. In this paper, we propose to incorporate semantic features into network embeddings by matching important words between text sequences for all pairs of vertices. We introduce a word-by-word alignment framework that measures the compatibility of embeddings between word pairs, and then adaptively accumulates these alignment features with a simple yet effective aggregation function. In experiments, we evaluate the proposed framework on three real-world benchmarks for downstream tasks, including link prediction and multi-label vertex classification. The experimental results demonstrate that our model outperforms state-of-the-art network embedding methods by a large margin.


Introduction
Networks are ubiquitous, with prominent examples including social networks (e.g., Facebook, Twitter) and citation networks of research papers (e.g., arXiv). When analyzing data from these real-world networks, traditional methods often represent vertices (nodes) as one-hot representations (containing the connectivity information of each vertex with respect to all other vertices), which usually suffer from issues related to the inherent sparsity of large-scale networks. This results in models that are not able to fully capture the relationships between vertices of the network (Perozzi et al., 2014; Tu et al., 2016). Alternatively, network embedding (i.e., network representation learning) has been considered, representing each vertex of a network with a low-dimensional vector that preserves information on its similarity relative to other vertices.
Figure 1: Example of the text information (abstracts) associated with two papers in a citation network. Keywords for matching are highlighted. The two example abstracts, linked by a citation, read "This paper investigates random walk graphs in high dimensional space." and "We propose an algorithm for multidimensional random walk problems."
Traditional network embedding approaches focus primarily on learning representations of vertices that preserve local structure, as well as internal structural properties of the network. For instance, Isomap (Tenenbaum et al., 2000), LINE (Tang et al., 2015), and GraRep (Cao et al., 2015) were proposed to preserve first-, second-, and higher-order proximity between nodes, respectively. DeepWalk (Perozzi et al., 2014), which learns vertex representations from random-walk sequences, similarly takes into account only the structural information of the network. However, in real-world networks, vertices usually contain rich textual information (e.g., user profiles in Facebook, paper abstracts in arXiv, user-generated content on Twitter, etc.), which may be leveraged effectively for learning more informative embeddings.
To address this opportunity, Yang et al. (2015) proposed text-associated DeepWalk to incorporate textual information into the vectorial representations of vertices (embeddings). Sun et al. (2016) employed deep recurrent neural networks to integrate the information from vertex-associated text into network representations. Further, Tu et al. (2017) proposed to more effectively model the semantic relationships between vertices using a mutual attention mechanism.
Although these methods have demonstrated performance gains over structure-only network embeddings, the relationship between text sequences for a pair of vertices is accounted for solely by comparing their sentence embeddings. However, as shown in Figure 1, to assess the similarity between two research papers, a more effective strategy would compare and align (via local weighting) individual important words (keywords) within a pair of abstracts, while information from other words (e.g., stop words) that tend to be less relevant can be effectively ignored (down-weighted). This alignment mechanism is difficult to accomplish in models where text sequences are first embedded into a common space and then compared in pairs (He and Lin, 2016; Parikh et al., 2016; Wang and Jiang, 2017; Wang et al., 2017b; Shen et al., 2018a).
We propose to learn a semantic-aware Network Embedding (NE) that incorporates word-level alignment features abstracted from text sequences associated with vertex pairs. Given a pair of sentences, our model first aligns each word within one sentence with keywords from the other sentence (adaptively up-weighted via an attention mechanism), producing a set of fine-grained matching vectors. These features are then accumulated via a simple but efficient aggregation function, obtaining the final representation for the sentence. As a result, the word-by-word alignment features (as illustrated in Figure 1) are explicitly and effectively captured by our model. Further, the learned network embeddings under our framework are adaptive to the specific (local) vertices that are considered, and thus are context-aware and especially suitable for downstream tasks, such as link prediction. Moreover, since the word-by-word matching procedure introduced here is highly parallelizable and does not require any complex encoding networks, such as Long Short-Term Memory (LSTM) or Convolutional Neural Networks (CNNs), our framework requires significantly less time for training, which is attractive for large-scale network applications.
We evaluate our approach on three real-world datasets spanning distinct network-embedding-based applications: link prediction, vertex classification and visualization. We show that the proposed word-by-word alignment mechanism efficiently incorporates textual information into the network embedding, and consistently exhibits superior performance relative to several competitive baselines. Analyses considering the extracted word-by-word pairs further validate the effectiveness of the proposed framework.

Problem Definition
A network (graph) is defined as $G = \{V, E\}$, where $V$ and $E$ denote the set of $N$ vertices (nodes) and the set of edges, respectively, and the elements of $E$ are two-element subsets of $V$. Here we only consider undirected networks; however, our approach (introduced below) can be readily extended to the directed case. We also define $W \in \mathbb{R}^{N \times N}$, the symmetric matrix whose elements $w_{ij}$ denote the weights associated with the edges in $E$, and $T$, the set of text sequences assigned to the vertices. Edges and weights contain the structural information of the network, while the text can be used to characterize the semantic properties of each vertex. Given network $G$, with the network embedding we seek to encode each vertex into a low-dimensional vector $h$ (with dimension much smaller than $N$), while preserving structural and semantic features of $G$.

Framework Overview
To incorporate both structural and semantic information into the network embeddings, we specify two types of (latent) embeddings: (i) $h_s$, the structural embedding; and (ii) $h_t$, the textual embedding. Specifically, each vertex in $G$ is encoded into a low-dimensional embedding $h = [h_s; h_t]$.
To learn these embeddings, we specify an objective, (1), that leverages the information from both $W$ and $T$ and combines three terms, $L_{\text{struct}}$, $L_{\text{text}}$ and $L_{\text{joint}}$, which denote the structure, text, and joint structure-text training losses, respectively. For a vertex pair $\{v_i, v_j\}$ weighted by $w_{ij}$, $L_{\text{struct}}(v_i, v_j)$ in (1) is defined, following Tang et al. (2015), in terms of $p(h_s^i \mid h_s^j)$, the conditional probability between the structural embeddings of vertices $\{v_i, v_j\}$. To leverage the textual information in $T$, similar text-specific and joint structure-text training objectives are also defined, in terms of $p(h_t^i \mid h_t^j)$ and $p(h_t^i \mid h_s^j)$ (or $p(h_s^i \mid h_t^j)$), which denote the conditional probability for a pair of text embeddings and for a text embedding given a structure embedding (or vice versa), respectively, for vertices $\{v_i, v_j\}$. Further, $\alpha_1$, $\alpha_2$ and $\alpha_3$ are hyperparameters that balance the impact of the different training-loss components. Note that the structural embeddings, $h_s$, are treated directly as parameters, while the text embeddings, $h_t$, are learned based on the text sequences associated with vertices.
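For concreteness, one way to write the objective that is consistent with the definitions above is sketched below. The placement of $\alpha_1$, $\alpha_2$, $\alpha_3$, the equal weighting of the two joint terms, and the overall sign (chosen so that minimizing $\mathcal{L}$ maximizes the weighted log-likelihoods) are assumptions made for illustration, not something the text fixes.

```latex
\begin{align*}
\mathcal{L} &= -\sum_{\{v_i, v_j\} \in E}
  \Big[ \alpha_1 L_{\text{struct}}(v_i, v_j)
      + \alpha_2 L_{\text{text}}(v_i, v_j)
      + \alpha_3 L_{\text{joint}}(v_i, v_j) \Big], \\
L_{\text{struct}}(v_i, v_j) &= w_{ij} \log p(h_s^i \mid h_s^j), \\
L_{\text{text}}(v_i, v_j)   &= w_{ij} \log p(h_t^i \mid h_t^j), \\
L_{\text{joint}}(v_i, v_j)  &= w_{ij} \big[ \log p(h_t^i \mid h_s^j) + \log p(h_s^i \mid h_t^j) \big].
\end{align*}
```

Under this reading, each term rewards embeddings that make an observed edge likely, and the $\alpha$ weights trade off agreement within the structural space, within the textual space, and across the two.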
For all conditional probability terms, we follow Tang et al. (2015) and consider the second-order proximity between vertex pairs. Thus, for vertices $\{v_i, v_j\}$, the probability of generating $h^i$ conditioned on $h^j$ may be written as
$$p(h^i \mid h^j) = \frac{\exp\left({h^j}^{\top} h^i\right)}{\sum_{k=1}^{N} \exp\left({h^j}^{\top} h^k\right)}. \tag{6}$$
Note that (6) can be applied to both structural and text embeddings in (2) and (3). Inspired by Tu et al. (2017), we further assume that vertices in the network play different roles depending on the vertex with which they interact. Thus, for a given vertex, the text embedding, $h_t$, is adaptive (specific) to the vertex it is being conditioned on. This type of context-aware textual embedding has demonstrated superior performance relative to context-free embeddings (Tu et al., 2017). In the following two sections, we describe our strategy for encoding the text sequence associated with an edge into its adaptive textual embedding, via word-by-context and word-by-word alignments.
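The conditional probability can be illustrated with a short sketch. It assumes that (6) takes the usual softmax form over inner products used for second-order proximity in LINE (Tang et al., 2015); the function names and the toy embeddings are hypothetical and only serve the illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def conditional_prob(i, j, embeddings):
    """Softmax probability of vertex i given vertex j, normalized over all vertices.

    ASSUMPTION: scores are plain inner products, as in LINE's second-order proximity.
    """
    scores = [dot(embeddings[j], h_k) for h_k in embeddings]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[i] / sum(exps)


# Toy embeddings for three vertices (illustrative values only).
embeddings = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.5]]
probs = [conditional_prob(i, 0, embeddings) for i in range(len(embeddings))]
```

The denominator runs over every vertex in the network, which is the cost that motivates a sampling-based approximation for large graphs.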

Word-by-Context Alignment
We first introduce our base model, which reweights the importance of individual words within a text sequence in the context of the edge being considered. Consider text sequences associated with two vertices connected by an edge, denoted $t_a$ and $t_b$ and contained in $T$. Text sequences $t_a$ and $t_b$ are of lengths $M_a$ and $M_b$, respectively, and are represented by $X_a \in \mathbb{R}^{d \times M_a}$ and $X_b \in \mathbb{R}^{d \times M_b}$, respectively, where $d$ is the dimension of the word embeddings. Our goal is to encode text sequences $t_a$ and $t_b$ into counterpart-aware vectorial representations $h_a$ and $h_b$. Thus, while inferring the adaptive textual embedding for sentence $t_a$, we propose reweighting the importance of each word in $t_a$ to explicitly account for its alignment with sentence $t_b$. The weight $\alpha_i$, corresponding to the $i$-th word in $t_a$, is generated as in (7), where $W_1$ and $W_2$ are model parameters and $c_b = \ldots$
Figure 2: Schematic of the proposed fine-grained word alignment module for incorporating textual information into a network embedding. In this setup, word-by-word matching features are explicitly abstracted to infer the relationship between vertices.
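A minimal sketch of this kind of word-by-context reweighting is given below. Since the exact scoring function is not reproduced here, the sketch makes several assumptions that should be read as illustrative only: the context vector `c_b` is taken to be the mean word embedding of $t_b$, `w1` and `w2` are treated as weight vectors producing one scalar score per word, and the final embedding is the attention-weighted sum of the words of $t_a$. All names are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def word_by_context(words_a, words_b, w1, w2):
    """Reweight the words of t_a given a context vector summarizing t_b."""
    d = len(words_a[0])
    # ASSUMPTION: the context vector c_b is the mean word embedding of t_b.
    c_b = [sum(x[k] for x in words_b) / len(words_b) for k in range(d)]
    # ASSUMPTION: one scalar score per word, tanh(w1 . c_b + w2 . x_i).
    scores = [math.tanh(dot(w1, c_b) + dot(w2, x)) for x in words_a]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]  # attention weights over the words of t_a
    # Attention-weighted sum of the word embeddings of t_a.
    h_a = [sum(a * x[k] for a, x in zip(alpha, words_a)) for k in range(d)]
    return alpha, h_a


# Toy inputs: each inner list is one word embedding (d = 2).
words_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words_b = [[1.0, 0.0], [0.9, 0.1]]
alpha, h_a = word_by_context(words_a, words_b, w1=[0.5, -0.5], w2=[1.0, -1.0])
```

Note that the whole of $t_b$ enters only through a single summary vector, so individual word pairs are never compared directly.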

Fine-Grained Word-by-Word Alignment
With the alignment in the previous section, word-by-context matching features $\alpha_i$ are modeled; however, the (fine-grained) word-by-word alignment information, which is key to characterizing the relationship between two vertices (as discussed above), is not explicitly captured. So motivated, we further propose an architecture to explicitly abstract word-by-word alignment information from $t_a$ and $t_b$, to learn the relationship between the two vertices. This is inspired by the recent success of Relation Networks (RNs) for relational reasoning (Santoro et al., 2017). As illustrated in Figure 2, given two input embedding matrices $X_a$ and $X_b$, we first compute the affinity matrix $A \in \mathbb{R}^{M_b \times M_a}$, whose elements represent the affinity scores corresponding to all word pairs between sequences $t_a$ and $t_b$:
$$A = X_b^{\top} X_a. \tag{9}$$
Subsequently, we compute the context-aware matrix for sequence $t_b$ as
$$A_b = \mathrm{softmax}(A), \qquad \tilde{X}_b = X_b A_b, \tag{10}$$
where the $\mathrm{softmax}(\cdot)$ function is applied column-wise to $A$, and thus each column of $A_b$ contains the attention weights (importance scores) over the words of sequence $t_b$ (rows) for one word in sequence $t_a$ (columns). Thus, matrix $\tilde{X}_b \in \mathbb{R}^{d \times M_a}$ in (10) constitutes an attention-weighted embedding for $X_b$. Specifically, the $i$-th column of $\tilde{X}_b$, denoted as $\tilde{x}_b^{(i)}$, can be understood as a weighted average over all the words in $t_b$, where higher attention weights indicate better alignment (match) with the $i$-th word in $t_a$.
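The affinity and soft-alignment steps can be sketched in a few lines. The sketch assumes that affinity scores are inner products between word embeddings and stores each sequence as a list of word vectors (the columns of $X_a$ and $X_b$); the function name `soft_align` and the toy values are hypothetical.

```python
import math


def soft_align(words_a, words_b):
    """Return (A_b, X_tilde_b) for sequences given as lists of word vectors."""
    d = len(words_a[0])
    m_a, m_b = len(words_a), len(words_b)
    # Affinity matrix A (M_b x M_a): A[j][i] = <x_b^(j), x_a^(i)>.
    # ASSUMPTION: affinities are plain inner products.
    A = [[sum(p * q for p, q in zip(xb, xa)) for xa in words_a] for xb in words_b]
    # Column-wise softmax: column i holds attention weights over the words of t_b.
    A_b = [[0.0] * m_a for _ in range(m_b)]
    for i in range(m_a):
        column = [A[j][i] for j in range(m_b)]
        top = max(column)
        exps = [math.exp(c - top) for c in column]
        total = sum(exps)
        for j in range(m_b):
            A_b[j][i] = exps[j] / total
    # x_tilde_b^(i): attention-weighted average of the words of t_b for word i of t_a.
    X_tilde_b = [
        [sum(A_b[j][i] * words_b[j][k] for j in range(m_b)) for k in range(d)]
        for i in range(m_a)
    ]
    return A_b, X_tilde_b


# Toy inputs: two words in t_a, three words in t_b, embedding dimension 2.
words_a = [[1.0, 0.0], [0.0, 1.0]]
words_b = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
A_b, X_tilde_b = soft_align(words_a, words_b)
```

Every word of $t_a$ receives its own soft summary of $t_b$, so the output has one aligned vector per word of $t_a$ regardless of the length of $t_b$.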
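To abstract the word-by-word alignments, we compare $x_a^{(i)}$ with $\tilde{x}_b^{(i)}$, for $i = 1, 2, \ldots, M_a$, to obtain the corresponding matching vector
$$m_a^{(i)} = f_{\text{align}}\left(x_a^{(i)}, \tilde{x}_b^{(i)}\right), \tag{11}$$
where $f_{\text{align}}(\cdot)$ represents the alignment function. Inspired by the observation in Wang and Jiang (2017) that simple comparison/alignment functions based on element-wise operations exhibit excellent performance in matching text sequences, here we use a combination of element-wise subtraction and multiplication as
$$f_{\text{align}}\left(x_a^{(i)}, \tilde{x}_b^{(i)}\right) = \left[x_a^{(i)} - \tilde{x}_b^{(i)};\; x_a^{(i)} \odot \tilde{x}_b^{(i)}\right],$$
where $\odot$ denotes the element-wise (Hadamard) product, and the results of the two operations are concatenated to produce the matching vector $m_a^{(i)}$. Note that these operators may be used individually or combined, as we investigate in our experiments.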
Subsequently, matching vectors from (11) are aggregated to produce the final textual embedding $h_t^a$ for sequence $t_a$ as
$$h_t^a = f_{\text{aggregate}}\left(m_a^{(1)}, m_a^{(2)}, \ldots, m_a^{(M_a)}\right), \tag{12}$$
where $f_{\text{aggregate}}$ denotes the aggregation function, which we specify as the max-pooling operation. Notably, other commutative operators, such as summation or average pooling, can be employed instead. Although these aggregation functions are simple and invariant to the order of words in input sentences, they have been demonstrated to be highly effective in relational reasoning (Parikh et al., 2016; Santoro et al., 2017). To further explore this, in Section 5.3, we conduct an ablation study comparing different choices of alignment and aggregation functions. The representation $h_t^b$ can be obtained in a similar manner through (9), (10), (11) and (12), but replacing (9) with $A = X_a^{\top} X_b$ (its transpose). Note that this word-by-word alignment is more computationally involved than word-by-context alignment; however, it has substantially fewer parameters to learn, given that we no longer have to estimate the parameters in (7).
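The align-then-aggregate step described above can be sketched as follows, using the subtraction-plus-multiplication alignment and max-pooling over words. The aligned vectors `aligned_b` stand in for the attention-weighted columns computed in the previous step; their values, like the function names, are made up for illustration.

```python
def f_align(x, x_tilde):
    """Concatenate element-wise difference and element-wise product."""
    diff = [p - q for p, q in zip(x, x_tilde)]
    prod = [p * q for p, q in zip(x, x_tilde)]
    return diff + prod


def f_aggregate(matching_vectors):
    """Max-pool over words: keep the largest value in each feature dimension."""
    return [max(column) for column in zip(*matching_vectors)]


# Toy inputs: three words in t_a and their soft-aligned counterparts from t_b.
words_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
aligned_b = [[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]]  # illustrative values only

matching = [f_align(x, xt) for x, xt in zip(words_a, aligned_b)]
h_t_a = f_aggregate(matching)  # textual embedding of t_a, dimension 2d
```

Because max-pooling is commutative, the result does not depend on word order, and no parameters are introduced by either function.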

Training and Inference
For large-scale networks, computing and optimizing the conditional probabilities in (1) using (6) is computationally prohibitive, since it requires the summation over all vertices $V$ in $G$. To address this limitation, we leverage the negative sampling strategy introduced by Mikolov et al. (2013), i.e., we perform computations by sampling a subset of negative edges. As a result, the (log) conditional in (6) can be rewritten as:
$$\log \sigma\left({h^j}^{\top} h^i\right) + \sum_{k=1}^{K} \mathbb{E}_{h^k \sim P(v)}\left[\log \sigma\left(-{h^j}^{\top} h^k\right)\right],$$
where $\sigma(x) = 1/(1 + \exp(-x))$ is the sigmoid function. Following Mikolov et al. (2013), we set the noise distribution $P(v) \propto d_v^{3/4}$, where $d_v$ is the degree of vertex $v$. The number of negative samples $K$ is treated as a hyperparameter. We use Adam (Kingma and Ba, 2014) to update the model parameters while minimizing the objective in (1).
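A small sketch of the negative-sampling approximation follows. It assumes the standard form from Mikolov et al. (2013): a log-sigmoid term for the observed pair plus log-sigmoid terms for $K$ sampled negatives, with a noise distribution proportional to vertex degree raised to the power 3/4. The function names, the toy degrees, and the Monte Carlo treatment of the expectation are illustrative choices.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def noise_distribution(degrees, power=0.75):
    """P(v) proportional to degree(v) ** power (ASSUMPTION: power = 3/4)."""
    weights = [deg ** power for deg in degrees]
    total = sum(weights)
    return [w / total for w in weights]


def negative_sampling_objective(h_i, h_j, embeddings, noise_probs, K, rng):
    """Observed-pair term plus K sampled negative terms (to be maximized)."""
    negatives = rng.choices(range(len(embeddings)), weights=noise_probs, k=K)
    value = log_sigmoid(dot(h_j, h_i))
    for k in negatives:
        # Simplification: sampled negatives are not filtered against true neighbors.
        value += log_sigmoid(-dot(h_j, embeddings[k]))
    return value


rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [0.0, 1.0]]
noise_probs = noise_distribution([2, 2, 1, 1])
value = negative_sampling_objective(
    embeddings[0], embeddings[1], embeddings, noise_probs, K=3, rng=rng
)
```

The cost per edge is now proportional to $K$ rather than to the number of vertices, which is what makes training on large networks feasible.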

Related Work
Network embedding methods can be divided into two categories: (i) methods that solely rely on the structure, e.g., vertex information; and (ii) methods that leverage both the structure of the network and the information associated with its vertices.
For the first type of models, DeepWalk (Perozzi et al., 2014) has been proposed to learn node representations by generating node contexts via truncated random walks; it is similar to the concept of Skip-Gram (Mikolov et al., 2013), originally introduced for learning word embeddings. LINE (Tang et al., 2015) proposed a principled objective to explicitly capture first-order and second-order proximity information from the vertices of a network. Further, Grover and Leskovec (2016) introduced a biased random walk procedure to generate the neighborhood for a vertex, which infers the node representations by maximizing the likelihood of preserving the local context information of vertices. However, these algorithms generally ignore rich heterogeneous information associated with vertices. Here, we focus on incorporating textual information into network embeddings.
To learn semantic-aware network embeddings, Text-Associated DeepWalk (TADW) (Yang et al., 2015) proposed to integrate textual features into network representations with matrix factorization, by leveraging the equivalence between DeepWalk and matrix factorization. CENE (Content-Enhanced Network Embedding) (Sun et al., 2016) used bidirectional recurrent neural networks to abstract the semantic information associated with vertices, which further demonstrated the advantages of employing textual information. To capture the interaction between sentences of vertex pairs, Tu et al. (2017) further proposed Context-Aware Network Embedding (CANE), which employs a mutual attention mechanism to adaptively account for the textual information from neighboring vertices. Despite showing improvement over structure-only models, these semantic-aware methods cannot capture word-level alignment information, which is important for inferring the relationship between node pairs, as previously discussed. In this work, we introduce a Word-Alignment-based Network Embedding (WANE) framework, which aligns and aggregates word-by-word matching features in an explicit manner, to obtain more informative network representations.

Experimental Setup
Datasets We investigate the effectiveness of the proposed WANE model on two standard network-embedding-based tasks, i.e., link prediction and multi-label vertex classification. The following three real-world datasets are employed for quantitative evaluation: (i) Cora, a standard paper citation network that contains 2,277 machine learning papers (vertices) grouped into 7 categories and connected by 5,214 citations (edges); (ii) HepTh, another citation network of 1,038 papers with abstract information and 1,990 citations; (iii) Zhihu, a network of 10,000 active users from Zhihu, the largest Q&A website in China, where 43,894 edges and descriptions of the Q&A topics are available. The average lengths of the text in the three datasets are 90, 54, and 190, respectively. To make direct comparison with existing work, we employed the same preprocessing procedure as Tu et al. (2017).
Training Details For fair comparison with CANE (Tu et al., 2017), we set the dimension of network embedding for our model to 200. The number of negative samples $K$ is selected from {1, 3, 5} according to performance on the validation set. We set the batch size to 128, and the model is trained using Adam (Kingma and Ba, 2014), with a learning rate of $1 \times 10^{-3}$ for all parameters. Dropout regularization is employed on the word embedding layer, with rate selected from {0.5, 0.7, 0.9}, also on the validation set. Our code will be released to encourage future research.
Baselines To evaluate the effectiveness of our framework, we consider several strong baseline methods for comparison, which can be categorized into two types: (i) models that only exploit structural information: MMB (Airoldi et al., 2008), DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), and node2vec (Grover and Leskovec, 2016); (ii) models that take both structural and textual information into account: Naive combination (which simply concatenates the structure-based embedding with CNN-based text embeddings, as explored in Tu et al. (2017)), TADW (Yang et al., 2015), CENE (Sun et al., 2016), and CANE (Tu et al., 2017). It is worth noting that, unlike all these baselines, WANE explicitly captures word-by-word interactions between text sequence pairs.

Evaluation Metrics
We employ AUC (Hanley and McNeil, 1982) as the evaluation metric for link prediction. It measures the probability that the two vertices of an existing edge, randomly sampled from the test set, are more similar than the two vertices of a randomly sampled non-existing edge, where similarity is the inner product between the corresponding embeddings. For multi-label vertex classification, and to ensure fair comparison, we follow Yang et al. (2015) and employ a linear SVM on top of the learned network representations, and evaluate classification accuracy with different training ratios (varying from 10% to 50%). The experiments for each setting are repeated 10 times and the average test accuracy is reported.
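The AUC computation for link prediction can be illustrated with a brute-force sketch that compares every held-out edge against every sampled non-edge. Counting ties as one half is a common convention and is an assumption here, as are the function name and the toy embeddings.

```python
def link_prediction_auc(pos_pairs, neg_pairs, embeddings):
    """Fraction of (edge, non-edge) comparisons in which the edge scores higher."""

    def score(pair):
        i, j = pair
        return sum(a * b for a, b in zip(embeddings[i], embeddings[j]))

    pos_scores = [score(p) for p in pos_pairs]
    neg_scores = [score(p) for p in neg_pairs]
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ASSUMPTION: ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Toy embeddings for five vertices (illustrative values only).
embeddings = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [-1.0, 0.0],
    4: [0.5, 0.5],
}
test_edges = [(0, 1), (1, 2)]   # held-out existing edges
non_edges = [(0, 3), (1, 4)]    # sampled vertex pairs with no edge
auc = link_prediction_auc(test_edges, non_edges, embeddings)
```

In this toy case three of the four comparisons favor the true edge, so the AUC is 0.75; a value of 0.5 would correspond to random scoring.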

Experimental Results
We experiment with three variants of our WANE model: (i) WANE: where the word embeddings of each text sequence are simply averaged to obtain the sentence representations, similar to (Joulin et al., 2016; Shen et al., 2018c). (ii) WANE-wc: where the textual embeddings are inferred with word-by-context alignment. (iii) WANE-ww: where the word-by-word alignment mechanism is leveraged to capture word-by-word matching features between available sequence pairs. Table 1 presents link prediction results for all models on the Cora dataset, where different ratios of edges are used for training. It can be observed that when only a small number of edges are available, e.g., 15%, the performance of structure-only methods is much worse than that of semantic-aware models, which take textual information into consideration. The performance gap tends to be smaller when a larger proportion of edges are employed for training. This highlights the importance of incorporating associated text sequences into network embeddings, especially in the case of representing a relatively sparse network. More importantly, the proposed WANE-ww model consistently outperforms other semantic-aware NE models by a substantial margin, indicating that our model better abstracts word-by-word alignment features from the text sequences available, and thus yields more informative network representations.

Link Prediction
Further, WANE-ww also outperforms WANE and WANE-wc over a wide range of edge training proportions. This suggests that: (i) adaptively assigning different weights to each word within a text sequence (according to its paired sequence) tends to be a better strategy than treating each word equally (as in WANE); (ii) solely considering the word-by-context alignment features (as in WANE-wc) is not as effective as abstracting word-by-word matching information from text sequences. We observe the same trend and the superiority of our WANE-ww model on the other two datasets, HepTh and Zhihu, as shown in Tables 2 and 3, respectively.

Multi-label Vertex Classification
We further evaluate the effectiveness of the proposed framework on vertex classification tasks with the Cora dataset. Similar to Tu et al. (2017), we generate the global embedding for each vertex by taking the average over its context-aware embeddings with all other connected vertices. As shown in Figure 3(c), semantic-aware NE methods (including naive combination, TADW, CENE, CANE) exhibit higher test accuracies than semantic-agnostic models, demonstrating the advantages of incorporating textual information. Moreover, WANE-ww consistently outperforms other competitive semantic-aware models on a wide range of labeled proportions, suggesting that explicitly capturing word-by-word alignment features is not only useful for vertex-pair-based tasks, such as link prediction, but also results in better global embeddings, which are required for vertex classification tasks. These observations further demonstrate that WANE-ww is an effective and robust framework to extract informative network representations.
Semi-supervised classification We further consider the case where the training ratio is less than 10%, and evaluate the learned network embedding with a semi-supervised classifier. Following Yang et al. (2015), we employ a Transductive SVM (TSVM) classifier with a linear kernel (Joachims, 1998) for fairness. As illustrated in Table 4, the proposed WANE-ww model exhibits superior performance in most cases. This may be due to the fact that WANE-ww extracts information from the vertices and text sequences jointly, so that the obtained vertex embeddings are less noisy and perform more consistently with relatively small training ratios (Yang et al., 2015).

Ablation Study
Motivated by the observation in Wang and Jiang (2017) that the advantages of different functions to match two vectors vary from task to task, we further explore the choice of alignment and aggregation functions in our WANE-ww model. To match the word pairs between two sequences, we experimented with three types of operations: subtraction, multiplication, and Sub & Multi (the concatenation of both approaches). As shown in Figure 3(a) and 3(b), element-wise subtraction tends to be the most effective operation in terms of performance on both the Cora and Zhihu datasets, and performs comparably to Sub & Multi on the HepTh dataset. This finding is consistent with the results in Wang and Jiang (2017), who found that simple comparison functions based on element-wise operations work very well on matching text sequences.
In terms of the aggregation functions, we compare (one-layer) CNN, mean-pooling, and max-pooling operations to accumulate the matching vectors. As shown in Figure 3(b), max-pooling has the best empirical results on all three datasets. This may be attributed to the fact that the max-pooling operation is better at selecting important word-by-word alignment features, among all matching vectors available, to infer the relationship between vertices.

Qualitative Analysis
Embedding visualization To visualize the learned network representations, we further employ t-SNE to map the low-dimensional vectors of the vertices to a 2-D embedding space. We use the Cora dataset, since labels are associated with each of its vertices, and WANE-ww to obtain the network embeddings.
As shown in Figure 4, where each point indicates one paper (vertex) and the color of each point indicates the category it belongs to, the embeddings of vertices with the same label are indeed very close in the 2-D plot, while those with different labels are relatively farther from each other. Note that the model is not trained with any label information, indicating that WANE-ww has extracted meaningful patterns from the text and vertex information available.
Case study The proposed word-by-word alignment mechanism can be used to highlight the most informative words (and the corresponding matching features) with respect to the relationship between vertices. We visualize the norm of the matching vectors obtained in (11) in Figure 5 for the Cora dataset. It can be observed that matched keywords, e.g., 'MCMC' and 'convergence', between the text sequences are indeed assigned higher values in the matching vectors. These words would be selected preferentially by the final max-pooling aggregation operation. This indicates that WANE-ww is able to abstract important word-by-word alignment features from paired text sequences.

Conclusions
We have presented a novel framework to incorporate the semantic information from vertex-associated text sequences into network embeddings. An align-aggregate framework is introduced, which first aligns a sentence pair by capturing the word-by-word matching features, and then adaptively aggregates this word-level alignment information with an efficient max-pooling function. The semantic features abstracted are further encoded, along with the structural information, into a shared space to obtain the final network embedding. Compelling experimental results on several tasks demonstrated the advantages of our approach. In future work, we aim to leverage abundant unlabeled text data to abstract more informative sentence representations (Dai and Le, 2015; Tang and de Sa, 2018). Another interesting direction is to learn binary and compact network embeddings, which could be more efficient in terms of both computation and memory, relative to their continuous counterparts (Shen et al., 2018b).