Improving Textual Network Embedding with Global Attention via Optimal Transport

Learning highly informative network embeddings is an essential tool for network analysis. Such embeddings encode network topology, along with other useful side information, into low-dimensional node-based feature representations that can be exploited by statistical modeling. This work focuses on learning context-aware network embeddings augmented with text data. We reformulate the network embedding problem, and present two novel strategies to improve over traditional attention mechanisms: (i) a content-aware sparse attention module based on optimal transport; and (ii) a high-level attention parsing module. Our approach yields naturally sparse and self-normalized relational inference, and can capture long-term interactions between sequences, thus addressing the challenges faced by existing textual network embedding schemes. Extensive experiments demonstrate that our model consistently outperforms state-of-the-art alternatives.


Introduction
When performing network embedding, one maps network nodes into vector representations that reside in a low-dimensional latent space. Such techniques seek to encode topological information of the network into the embedding, such as affinity (Tang and Liu, 2009), local interactions (e.g., local neighborhoods) (Perozzi et al., 2014), and high-level properties such as community structure (Wang et al., 2017). Relative to classical network-representation learning schemes (Zhang et al., 2018a), network embeddings provide a more fine-grained representation that can be easily repurposed for other downstream applications (e.g., node classification, link prediction, content recommendation and anomaly detection).
For real-world networks, one naturally may have access to rich side information about each node. Of particular interest are textual networks, where the side information comes in the form of natural language sequences (Le and Lauw, 2014). For example, user profiles or their online posts on social networks (e.g., Facebook, Twitter), and documents in citation networks (e.g., Cora, arXiv). The integration of text information promises to significantly improve embeddings derived solely from the noisy, sparse edge representations (Yang et al., 2015).
Recent work has started to explore the joint embedding of network nodes and the associated text for abstracting more informative representations. Yang et al. (2015) reformulated DeepWalk embedding as a matrix factorization problem and fused text embedding into the solution, while other work augmented the network with documents as auxiliary nodes. Apart from direct embedding of the text content, one can first model the topics of the associated text (Blei et al., 2003) and then supply the predicted labels to facilitate embedding (Tu et al., 2016).
Many important downstream applications of network embeddings are context-dependent, since a static vector representation of the nodes adapts to a changing context less effectively (Tu et al., 2017). For example, the interactions between social network users are context-dependent (e.g., family, work, interests), and contextualized user profiling can promote the specificity of recommendation systems. This motivates context-aware embedding techniques, such as CANE (Tu et al., 2017), where the vector embedding dynamically depends on the context. For textual networks, the associated texts are natural candidates for context. CANE introduced a simple mutual attention weighting mechanism to derive context-aware dynamic embeddings for link prediction. Following the CANE setup, WANE further improved the contextualized embeddings by considering fine-grained text alignment.
Despite the promising results reported thus far, we identify three major limitations of existing context-aware network embedding solutions. First, mutual (or cross) attentions are computed from pairwise similarities between local text embeddings (word/phrase matching), whereas global sequence-level modeling is known to be more favorable across a wide range of NLP tasks (MacCartney and Manning, 2009; Liu et al., 2018; Malakasiotis and Androutsopoulos, 2007; Guo et al., 2018). Second, related to the above point, low-level affinity scores are directly used as mutual attention without considering any high-level parsing. Such an over-simplified operation denies desirable features, such as noise suppression and relational inference (Santoro et al., 2017), thereby compromising model performance. Third, mutual attention based on common similarity measures (e.g., cosine similarity) typically yields dense attention matrices, while psychological and computational evidence suggests a sparse attention mechanism functions more effectively (Martins and Astudillo, 2016; Niculae and Blondel, 2017). Thus such naive similarity-based approaches can be suboptimal, since they are more likely to incorporate irrelevant word/phrase matchings.
This work represents an attempt to improve context-aware textual network embedding by addressing the above issues. Our contributions include: (i) We present a principled and more general formulation of the network embedding problem, under reproducing kernel Hilbert space (RKHS) learning; this formulation clarifies aspects of the existing literature and provides a flexible framework for future extensions. (ii) A novel global sequence-level matching scheme is proposed, based on optimal transport, which matches key concepts between text sequences in a sparse attentive manner. (iii) We develop a high-level attention-parsing mechanism that operates on top of low-level attention, which is capable of capturing long-term interactions and allows relational inference for better contextualization. We term our model Global Attention Network Embedding (GANE). To validate the effectiveness of GANE, we benchmarked our models against state-of-the-art counterparts on multiple datasets. Our models consistently outperform competing methods.

Problem setup
We introduce basic notation and definitions used in this work.
Textual networks. Let G = (V, E, T) be our textual network, where V is the set of nodes, E ⊆ V × V are the edges between the nodes, and T = {S_v}_{v∈V} are the text data associated with each node. We use S_v = [ω_1, · · · , ω_{n_v}] to denote the token sequence associated with node v ∈ V, of length n_v = |S_v|, where | · | denotes the counting measure. To simplify subsequent discussion, we assume all tokens have been pre-embedded in a p-dimensional feature space, so that S_v can be regarded directly as an R^{p×n_v} matrix. We use {u, v} to index the nodes throughout the paper. We consider directed unsigned graphs, meaning that each edge (u, v) ∈ E has a nonnegative weight w_{uv} associated with it, and w_{uv} does not necessarily equal w_{vu}.
Textual network embedding. The goal of textual network embedding is to identify a d-dimensional embedding vector z_v ∈ R^d for each node v ∈ V, which encodes the network topology (E) by leveraging information from the associated text (T). In mathematical terms, we want to learn an encoding (embedding) scheme Z_G ≜ {z_v = Enc(v; G)}_{v∈V} and a probabilistic decoding model with likelihood p_θ(E; Z), where E ⊆ V × V is a random network topology for node set V, such that the likelihood of the observed topology p_θ(E | Z_G) is high. Note that for efficient coding schemes, the embedding dimension is much smaller than the network size (i.e., d ≪ |V|). In a more general setup, the decoding objective can be replaced with p_θ(A | Z), where A denotes observed attributes of interest (e.g., node label, community structure, etc.).
Context-aware embedding. One way to promote coding efficiency is to contextualize the embeddings. More specifically, the embeddings additionally depend on an exogenous context c. To distinguish it from the context-free embedding z u , we denote the context-aware embedding as z u|c , where c is the context. For textual networks, when the embedding objective is network topology reconstruction, a natural choice is to treat the text as context (Tu et al., 2017). In particular, when modeling the edge w uv , S v and S u are respectively treated as the context for context-aware embeddings z u|c and z v|c , which are then used in the prediction of edge likelihood.
Attention & text alignment. A single text sequence may contain many distinct pieces of content, and capturing all of them with a fixed-length feature vector can be challenging. A more flexible solution is to employ an attention mechanism, which only attends to content that is relevant to a specific query (Vaswani et al., 2017). Specifically, attention models leverage a gating mechanism to de-emphasize irrelevant parts of the input, pooling information only from the useful text; the result is still a fixed-length vector, but one that encodes information with respect to one specific query (Santos et al., 2016). Popular choices of attention include normalized similarities in the feature space (e.g., Softmax-normalized cosine distances). For two text sequences, one can build a mutual attention by cross-relating the content from the respective texts (Santoro et al., 2017). In text alignment, one further represents the content from one text sequence via mutual-attention-based attentive pooling on the other sequence.
Optimal transport (OT). Consider two discrete distributions µ = Σ_{i=1}^n µ_i δ_{x_i} and ν = Σ_{j=1}^m ν_j δ_{y_j}, i.e., sets of locations and their associated nonnegative mass (we assume Σ_i µ_i = Σ_j ν_j = 1). We call π ∈ R_+^{n×m} a valid transport plan if it properly redistributes mass from µ to ν, i.e., Σ_i π_{ij} = ν_j and Σ_j π_{ij} = µ_i. In other words, π breaks the mass at {x_i} into smaller parts and transports π_{ij} units from x_i to y_j. Given a cost function c(x, y) for transporting unit mass from x to y, discretized OT solves the following constrained optimization for an optimal transport plan π* (Peyré et al., 2017):

D_c(µ, ν) = min_{π ∈ Π(µ,ν)} Σ_{ij} π_{ij} c(x_i, y_j),

where Π(µ, ν) denotes the set of all viable transport plans. Note that if c(x, y) is a distance metric on X, then D_c(µ, ν) induces a distance metric on the space of probability distributions supported on X, commonly known as the Wasserstein distance (Villani, 2008). Popular choices of cost include the Euclidean cost ‖x − y‖_2^2 for general probabilistic learning (Gulrajani et al., 2017) and the cosine-similarity cost for natural language models (Chen et al., 2018). Computationally, OT plans are often approximated with Sinkhorn-type iterative schemes (Cuturi, 2013). Algorithm 1 summarizes the particular variant used in our study (Xie et al., 2018).
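As a minimal sketch of the Sinkhorn-type scheme referenced above, the entropy-regularized plan can be computed by alternating marginal scalings; the function name, regularization strength and iteration count below are illustrative, not the exact variant of Xie et al. (2018):

```python
import numpy as np

def sinkhorn_plan(C, mu, nu, eps=0.1, n_iter=100):
    """Approximate the entropy-regularized OT plan (Cuturi, 2013).

    C  : (n, m) cost matrix, e.g. pairwise cosine distances
    mu : (n,) source marginal, nonnegative, sums to 1
    nu : (m,) target marginal, nonnegative, sums to 1
    """
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iter):           # alternate scaling to match both marginals
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]   # pi = diag(u) K diag(v)
```

Each iteration enforces one marginal constraint exactly and converges to a plan satisfying both; smaller `eps` approaches the exact (sparser) OT plan at the cost of slower convergence.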

Model framework overview
To capture both the topological information (network structure E) and the semantic information (text content T) in the textual network embedding, we explicitly model two types of embeddings for each node v ∈ V: (i) the topological embedding z_v^t, and (ii) the semantic embedding z_v^s. The final embedding is constructed by concatenating the two, i.e., z_v = [z_v^t; z_v^s]. We consider the topological embedding z_v^t a static property of the node, fixed regardless of the context. The semantic embedding z_v^s, on the other hand, dynamically depends on the context, and is the focus of this study.
Motivated by the work of Tu et al. (2017), we consider the following probabilistic objective to train the network embeddings:

L(Θ) = Σ_{e∈E} ℓ(e; Θ),

where e = (u, v) represents sampled edges from the network and Θ = {Z, θ} is the collection of model parameters. The edge loss ℓ(e; Θ) is given by the cross entropy

ℓ(e; Θ) = −w_{uv} log p_Θ(u | v),

where p_Θ(u | v) denotes the conditional likelihood of observing a (weighted) link between nodes u and v, with the latter serving as the context. More specifically,

p_Θ(u | v) = exp(⟨z_u, z_v⟩) / Z_v,  Z_v = Σ_{u′∈V} exp(⟨z_{u′}, z_v⟩),

where Z_v is the normalizing constant and ⟨·, ·⟩ is an inner-product operation, to be defined momentarily. Note that here we have suppressed the dependency on Θ to simplify notation.
To capture both the topological and semantic information, along with their interactions, we propose the following decomposition for our inner-product term:

⟨z_u, z_v⟩ = Σ_{a,b∈{s,t}} ⟨z_u^a, z_v^b⟩_{ab}.

Here ⟨z_u^a, z_v^b⟩_{ab}, a, b ∈ {s, t}, denotes an inner-product evaluation between the two feature embeddings z_u^a and z_v^b, which can be defined by a semi-positive-definite kernel function, e.g., the Euclidean kernel, Gaussian RBF, or IMQ kernel. Note that for a ≠ b, z_u^a and z_v^b do not reside in the same feature space. As such, the embeddings are first mapped to a common feature space for the inner-product evaluation. In this study we use the Euclidean kernel,

⟨z_u^a, z_v^b⟩_{ab} = (A z_u^a)^⊤ (A z_v^b),

where A is a trainable linear map; throughout this paper we omit the bias terms in linear maps to avoid notational clutter.
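A concrete sketch of the decomposed inner product follows; the per-pair projection matrices `A[a, b]`, and all dimensions, are hypothetical stand-ins for the trainable maps described above:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 6, 4   # per-part embedding dim and shared-space dim (illustrative)

# Node embeddings split into topological ('t') and semantic ('s') parts.
z_u = {"t": rng.standard_normal(p), "s": rng.standard_normal(p)}
z_v = {"t": rng.standard_normal(p), "s": rng.standard_normal(p)}

# Hypothetical trainable linear maps, one per (a, b) pair, projecting both
# parts into a shared space where the Euclidean (dot-product) kernel applies.
A = {(a, b): rng.standard_normal((d, p)) for a in "ts" for b in "ts"}

def edge_score(z_u, z_v, A):
    """<z_u, z_v> = sum over a, b in {s, t} of (A_ab z_u^a) . (A_ab z_v^b)."""
    return sum(float((A[a, b] @ z_u[a]) @ (A[a, b] @ z_v[b]))
               for a in "ts" for b in "ts")
```

The four terms capture topology-topology, semantics-semantics, and the two cross interactions; the score is bilinear in each node's embedding parts.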
Note that our solution differs from existing network-embedding models in that: (i) our objective is a principled likelihood loss, while prior works heuristically combine the losses of four different models (Tu et al., 2017), which may fail to capture the non-trivial interactions between the fixed and dynamic embeddings; and (ii) we present a formal derivation of network embedding in a reproducing kernel Hilbert space.

Negative sampling. Direct optimization of (3) requires summing over all nodes in the network, which can be computationally infeasible for large-scale networks. To alleviate this issue, we consider more computationally efficient surrogate objectives. In particular, we adopt the negative sampling approach (Mikolov et al., 2013), which replaces the bottleneck Softmax with the more tractable approximation

log σ(⟨z_u, z_v⟩) + Σ_{k=1}^K E_{v_k∼p_n(v)} [log σ(−⟨z_u, z_{v_k}⟩)],

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function, K is the number of negative samples, and p_n(v) is a noise distribution over the nodes. Negative sampling can be considered a special variant of noise contrastive estimation (Gutmann and Hyvärinen, 2010), which seeks to recover the ground-truth likelihood by contrasting

Figure 1: Schematic of the proposed mutual attention mechanism. In this setup, bag-of-words feature matchings are explicitly abstracted to infer the relationship between vertices.
data samples with noise samples, thereby bypassing the need to compute the normalizing constant. As the number of noise samples K goes to infinity, this approximation becomes exact (Goldberg and Levy, 2014). Following the practice of Mikolov et al. (2013), we set our noise distribution to p_n(v) ∝ d_v^{3/4}, where d_v is the degree of node v.

Context matching. We argue that key to context-aware network embedding is the design of an effective attention mechanism, which cross-matches the relevant content between the node's associated text and the context. Over-simplified dot-product attention limits the potential of existing textual network embedding schemes. In the following sections, we present two novel, efficient attention designs that fulfill the desiderata listed in the Introduction. Our discussion follows the setup used in CANE (Tu et al., 2017) and WANE, where the text from the interacting node is used as the context. Generalization to other forms of context is straightforward.
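The per-edge negative-sampling surrogate above can be sketched as follows; the scores stand in for the inner products ⟨z_u, z_v⟩, and this is an illustration of the loss shape rather than the training code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(pos_score, neg_scores):
    """Per-edge surrogate: -log sig(<z_u, z_v>) - sum_k log sig(-<z_u, z_vk>),
    where the K negative nodes v_k are drawn from the noise distribution p_n."""
    neg_scores = np.asarray(neg_scores, dtype=float)
    return float(-np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum())
```

The loss is driven toward zero by scoring observed edges high and sampled noise edges low, avoiding the sum over all |V| nodes in the Softmax normalizer.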

Optimal-transport-based matching
We first consider reformulating content matching as an optimal transport problem, and then repurpose the transport plan as our attention score to aggregate context-dependent information. More specifically, we view a node's text and its context as two (discrete) distributions over the content space. Related content will be matched in the sense that it receives a higher weight in the optimal transport plan π*. Two properties make the optimal transport plan especially appealing for use as an attention score: (i) Sparsity: when solved exactly, π* is a sparse matrix with at most (2m − 1) non-zero elements, where m is the number of content elements (Brualdi et al. (1991), §8.1.3); (ii) Self-normalization: the row sums and column sums equal the respective marginal distributions.
Implementation-wise, we first feed the embedded text sequence S_u and context sequence S_v into our OT solver to compute the OT plan,

π* = OT(S_u, S_v).

Note that here we treat the pre-embedded sequence S_u as n_u point masses in the feature space, each with weight 1/n_u, and similarly for S_v. Next we "transport" the semantic content from context S_v according to the estimated OT plan, via the matrix multiplication

S_{u←v} = π* S_v,

where we have treated S_v as an R^{n_v×p} matrix. Intuitively, this operation aligns the context with the target text sequence by averaging the context semantic embeddings, with respect to the OT plan, for each content element in S_u. To finalize the contextualized embedding, we aggregate the information from both S_u and the aligned S_{u←v} with an operator F_agg,

z_u^s = F_agg(S_u, S_{u←v}).

In this work we adopt the following simple aggregation strategy: first concatenate S_u and the aligned S_{u←v} along the feature dimension, then max-pool along the temporal dimension to reduce the result to a 2p-dimensional vector, and finally apply a linear map to project the embedding to the desired dimensionality.
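The alignment-and-aggregation steps above can be sketched as follows; the projection `W` is a hypothetical stand-in for the trainable linear map, and the factor n_u converts the plan's rows (which sum to 1/n_u under uniform marginals) into averages:

```python
import numpy as np

def align_and_aggregate(S_u, S_v, pi, W):
    """OT-based local alignment followed by the simple aggregation F_agg.

    S_u : (n_u, p) embedded target sequence
    S_v : (n_v, p) embedded context sequence
    pi  : (n_u, n_v) transport plan with uniform marginals 1/n_u and 1/n_v
    W   : (d, 2p) projection to the semantic-embedding dimension
    """
    n_u = S_u.shape[0]
    S_u_from_v = n_u * (pi @ S_v)              # transported context, per target token
    H = np.concatenate([S_u, S_u_from_v], 1)   # concat along the feature dim
    return W @ H.max(axis=0)                   # temporal max-pool, then linear map
```

Because π* is sparse, each row of `S_u_from_v` mixes only the few context tokens the plan actually matched to that target token.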

Attention parsing
Direct application of attention scores based on a low-level similarity-based matching criterion (e.g., dot-product attention) can be problematic in a number of ways: (i) low-level attention scores can be noisy (i.e., admit spurious matchings), and (ii) similarity matching does not allow relational inference. To better understand these points, consider the following cases. For (i), if the sequence embeddings used do not explicitly address the syntactic structure of the text, a relatively dense attention-score matrix can be expected. For (ii), consider the case where the context is a query and the matching appears as a cue in the node's text data; the information needed is then actually in the vicinity of, rather than at, the exact matching location (e.g., shifted a few steps ahead). Inspired by prior work, we propose a new mechanism, called attention parsing, to address the aforementioned issues.
As the name suggests, attention parsing re-calibrates the raw low-level attention scores to better integrate the information. To this end, we conceptually treat the raw attention matrix T_raw as a two-dimensional image and apply convolutional filters to it:

H = Conv2d(T_raw; W_F),

where W_F ∈ R^{h×w×c} denotes the filter bank, with h, w and c respectively the window sizes and number of channels. One can stack more convolutional layers, split the sequence-embedding dimensions to allow multi-group (channel) low-level attention as input, or introduce more sophisticated model architectures (e.g., ResNet (He et al., 2016), Transformer (Vaswani et al., 2017)) to enhance the model. For now, we focus on the simplest model described above, for the sake of demonstration.
With H ∈ R^{n_u×n_v×c} as the high-level representation of attention, our next step is to reduce it to a weight vector used to align information from the context S_v. We apply a max-pooling operation along the context dimension, followed by a linear map, to obtain the logits h ∈ R^{n_u×1} of the weights:

h = MaxPool_v(H) B,

where B ∈ R^{c×1} is the projection matrix. The parsed attention weight w is then obtained by

w = Softmax(h),

which is used to compute the aligned context embedding

s_{u←v} = S_{u←v}^⊤ w.

Note that here we compute a globally aligned context embedding vector s_{u←v}, rather than one for each location in S_u as described in the last section (S_{u←v}). In the subsequent aggregation operation, s_{u←v} is broadcast to all locations in S_u. We call this global alignment, to distinguish it from the local alignment strategy described in the last section. Both alignment strategies have their respective merits, and in practice they can be directly combined to produce the final context-aware embedding.
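A single-channel NumPy sketch of the parsing pipeline follows (one 'same'-padded convolution, max-pool over the context dimension, Softmax, and weighted pooling of the locally aligned context); all shapes and the scalar projection `b` are illustrative simplifications of the c-channel version:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_parsing(T_raw, W_F, b, S_ctx):
    """T_raw : (n_u, n_v) raw attention matrix, treated as an image
    W_F   : (h, w) single convolutional filter (c = 1), h and w odd
    b     : scalar projection standing in for B
    S_ctx : (n_u, p) locally aligned context S_{u<-v}
    """
    n_u, n_v = T_raw.shape
    h, w = W_F.shape
    pad = np.pad(T_raw, ((h // 2,), (w // 2,)))
    H = np.array([[np.sum(pad[i:i + h, j:j + w] * W_F)   # 'same' 2-D conv
                   for j in range(n_v)] for i in range(n_u)])
    logits = H.max(axis=1) * b        # max-pool over the context dim, project
    w_att = softmax(logits)           # parsed attention weights, (n_u,)
    return w_att @ S_ctx              # globally aligned context vector, (p,)
```

The convolution lets each parsed weight see a neighborhood of raw matchings, which is what enables the "shifted cue" inference discussed above.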

Related Work
Network embedding models. Prior network embedding solutions can be broadly classified into two categories: (i) topology embedding, which only uses the link information; and (ii) fused embedding, which also exploits side information associated with the nodes. Methods from the first category focus on encoding high-order network interactions in a scalable fashion, such as LINE (Tang et al., 2015) and DeepWalk (Perozzi et al., 2014). However, models based on topological embeddings alone often ignore the rich heterogeneous information associated with the vertices. Models of the second type therefore incorporate text information to improve network embeddings; examples include TADW (Yang et al., 2015), CENE, CANE (Tu et al., 2017), WANE, and DMTE (Zhang et al., 2018b).
Optimal Transport in NLP. OT has found increasing application recently in NLP research. It has been successfully applied in many tasks, such as topic modeling (Kusner et al., 2015), text generation (Chen et al., 2018), sequence-to-sequence learning (Chen et al., 2019), and word-embedding alignment (Alvarez-Melis and Jaakkola, 2018). Our model is fundamentally different from these existing OT-based NLP models in terms of how OT is used: these models all seek to minimize the OT distance to match sequence distributions, while our model uses the OT plan as an attention mechanism to integrate context-dependent information.

Attention models. Attention was originally proposed in QA systems (Weston et al., 2015) to overcome the limitations of the sequential computation associated with recurrent models (Hochreiter et al., 2001). Recent developments, such as the Transformer model (Vaswani et al., 2017), have popularized attention as an integral part of compelling sequence models. While simple attention mechanisms can already improve model performance (Bahdanau et al., 2015), significant gains can be expected from more delicate designs (Yang et al., 2016; Li et al., 2015). Our treatment of attention is inspired by the LEAM model, which significantly improves mutual attention in a computationally efficient way.

Experiments

Datasets. We evaluate our model on three benchmark datasets: (i) Cora, a paper citation network with text information (McCallum et al., 2000), which we prune so that it only contains papers on the topic of machine learning; (ii) Hepth, a paper citation network from arXiv on high-energy physics theory, with paper abstracts as text information; (iii) Zhihu, a Q&A network dataset constructed by Tu et al. (2017), which has 10,000 active users with text descriptions and their collaboration links. Summary statistics of these three datasets are summarized in Table 1. Preprocessing protocols from prior studies are used for data preparation (Zhang et al., 2018b; Tu et al., 2017).
For quantitative evaluation, we tested our model on the following tasks: (a) Link prediction, where we deliberately mask out a portion of the edges to see whether the embedding learned from the remaining edges can accurately predict the missing edges. (b) Multi-label node classification, where we use the learned embeddings to predict the labels associated with each node; note that the label information is not used when training our embeddings. We also carried out an ablation study to identify where the gains come from. In addition to the quantitative results, we visualize the embeddings and the attention matrices to qualitatively verify our hypotheses.
Evaluation metrics. For the link-prediction task, we adopt the area under the curve (AUC) score to evaluate performance; AUC measures the probability that vertices in existing edges are more similar than those in a nonexistent edge. For each training ratio, the experiment is executed 10 times and the mean AUC scores are reported, where higher AUC indicates better performance. For multi-label classification, we evaluate performance with Macro-F1 scores. The experiment for each training ratio is also executed 10 times and the average Macro-F1 scores are reported, where a higher value indicates better performance.

Baselines. To demonstrate the effectiveness of the proposed solutions, we evaluated our model against the following strong baselines. (i) Topology-only embeddings: MMB (Airoldi et al., 2008), DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015), Node2vec (Grover and Leskovec, 2016). (ii) Joint embedding of topology & text: Naive combination, TADW (Yang et al., 2015), CENE, CANE (Tu et al., 2017), WANE, DMTE (Zhang et al., 2018b). A brief summary of these competing models is provided in the Supplementary Material (SM).
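The AUC criterion used here has a direct rank-based reading, the probability that an existing edge scores higher than a non-existent one, which can be sketched as:

```python
def auc_score(pos_scores, neg_scores):
    """AUC = P(score of an observed edge > score of a non-edge),
    counting ties as 1/2 (the Wilcoxon-Mann-Whitney statistic)."""
    wins = sum(float(p > n) + 0.5 * float(p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

This quadratic-time form is only for illustration; standard library implementations compute the same quantity from sorted ranks.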

Results
We consider two variants of our model, denoted GANE-OT and GANE-AP. GANE-OT employs the most basic OT-based attention model, specifically a global word-by-word alignment model, while GANE-AP additionally uses a one-layer convolutional neural network for attention parsing. Detailed experimental setups are described in the SM.

Link prediction. Tables 2 and 3 summarize the results from the link-prediction experiments on all three datasets, where different ratios of edges are used for training. Results from models other than GANE are collected from Tu et al. (2017) and Zhang et al. (2018b). We also repeated these experiments on our own, and the results are consistent with the ones reported. Note that Zhang et al. (2018b) did not report DMTE results on Zhihu. Both GANE variants consistently outperform competing solutions. In the low-training-sample regime our solutions lead by a large margin, and the performance gap closes as the number of training samples increases. This indicates that our OT-based mutual-attention framework can yield more informative textual representations than other methods. Note that GANE-AP delivers better results than GANE-OT, suggesting that the attention-parsing mechanism further improves on the low-level mutual attention matrix. More results on Cora and Hepth are provided in the SM.
Multi-label node classification. To further evaluate the effectiveness of our model, we consider multi-label vertex classification. Following the setup described in Tu et al. (2017), we first compute all context-aware embeddings. We then average each node's context-aware embeddings over all of its connected nodes, to obtain a global embedding for each node, i.e., z̄_u = (1/|N(u)|) Σ_{v∈N(u)} z_{u|v}, where N(u) denotes the neighbors of node u. A simple linear classifier, rather than a sophisticated deep classifier, is used to predict the label attribute of a node. We randomly sample a portion of labeled vertices with embeddings (10%, 30%, 50%, 70%) to train the classifier, using the remaining nodes to evaluate prediction accuracy. We compare our results with those from other state-of-the-art models in Table 4. The GANE models deliver better results than their counterparts, lending strong evidence that the OT attention and attention-parsing mechanisms capture more meaningful representations.
Ablation study. We further explore the effect of n-gram length in our model (i.e., the filter size of the convolutional layers used by the attention-parsing module). In Figure 2 we plot the AUC scores for link prediction on the Cora dataset against varying n-gram length. The performance peaks around length 20 and then starts to drop, indicating that a moderate attention span is preferable. Similar results are observed on the other datasets (results not shown). Experimental details on the ablation study can be found in the SM.

Qualitative Analysis
Embedding visualization. We employed t-SNE (Maaten and Hinton, 2008) to project the GANE-OT network embeddings of the Cora dataset into a two-dimensional space, with each node color-coded according to its label. As shown in Figure 3, papers clustered together belong to the same category, and the clusters are well separated from each other in the network embedding space. Note that our network embeddings are trained without any label information. Together with the label-classification results, this implies our model is capable of extracting meaningful information from both the text context and the network topology.
Attention matrix comparison. To verify that our OT-based attention mechanism indeed produces sparse attention scores, we visualized the OT attention matrices and compared them with similarity-based attention matrices (e.g., WANE). Figure 4 plots one typical example. Our OT solver returns a sparse attention matrix, while the dot-product-based WANE attention is effectively dense. This underscores the effectiveness of OT-based attention in terms of noise suppression.

Conclusion
We have proposed a novel and principled mutual-attention framework based on optimal transport (OT). Compared with existing solutions, the attention mechanism employed by our GANE model enjoys the following benefits: (i) it is naturally sparse and self-normalized, (ii) it is a global sequence-matching scheme, and (iii) it can capture long-term interactions between two sentences. These claims are supported by experimental evidence from link prediction and multi-label vertex classification. Looking forward, our attention mechanism can also be applied to tasks such as relational networks (Santoro et al., 2017), natural language inference (MacCartney and Manning, 2009), and QA systems (Zhou et al., 2015).