Learning Graph Embeddings from WordNet-based Similarity Measures

We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances.


Introduction
Developing applications that make use of large graphs, such as networks of roads, people, or word senses, often involves the design of a domain-specific similarity measure sim : V × V → R defined on a set of nodes V of a graph G = (V, E). For instance, it can represent the shortest distance from home to work, a community of interest in a social network for a user, or a semantically related sense to a given synset in WordNet. There exists a wide variety of such measures, ranging greatly in their complexity and design from simple deterministic ones, e.g. based on shortest paths in a network (Leacock and Chodorow, 1998), to more complex ones, e.g. based on random walks (Fouss et al., 2007). Naturally, the majority of such measures rely on walks along the edges E of the graph, often resulting in effective but inefficient measures that require complex and computationally expensive graph traversals. We propose a solution to this problem by decoupling the design and the implementation of graph-based measures. Namely, once a node similarity measure is defined, we propose to learn vector representations of nodes which enable an efficient computation of this measure.
We propose an approach for representing nodes in a graph with dense vectors (embeddings) which are good at approximating pairwise similarities between the nodes, e.g. calculated based on the shortest path distance. The benefit is that calculating a dot product is computationally cheaper than computing a shortest path (or other complex walks) on a large graph. Also, low-dimensional vector representations of nodes take much less space than the pairwise similarities between all the nodes in a large graph. As a case study, we use the WordNet graph (Miller, 1995) and pairwise similarities between its synsets calculated using several different methods.
First, we show the effectiveness of the proposed approach intrinsically with a standard dataset of word similarities by learning synset vectors of a WordNet graph based on two similarity measures: one based on shortest paths (LCH) (Leacock and Chodorow, 1998) and another based on lowest common subsumers in the hierarchy (JCN) (Jiang and Conrath, 1997). We show that our model is not only able to closely approximate these measures, but also improves on the results of the original measures in terms of (1) correlation with human judgments and (2) computational efficiency, with gains of up to four orders of magnitude w.r.t. the original graph-based measures. Our method compares favorably to strong baselines based on random walks over graphs. Second, we evaluate the quality of the learned embeddings extrinsically in a Word Sense Disambiguation (WSD) task (Navigli, 2009) by replacing the original LCH and JCN structural measures with their vectorized counterparts in a graph-based WSD algorithm, reaching comparable levels of WSD performance.
The contribution of our work is a new simple, effective, and efficient approach to learning graph embeddings based on a similarity measure defined on a set of nodes. The model is called path2vec as it can approximate various measures of node similarity based on network structure, e.g. the shortest path distance. In contrast to existing approaches, our model does not rely on random walks or autoencoder architectures (link prediction) and provides a way to supply additional information for the target measure, not only the graph structure itself. Our code and datasets are available online.

Related work
Representation learning on graphs has received much attention recently in various research communities. We refer the reader to (Hamilton et al., 2017) for a thorough survey of the existing methods for this task. All of them (including ours) are based on the idea of projecting graph nodes into a latent space with dimensionality much lower than the number of nodes.
The method described in this paper falls into the category of 'shallow embeddings', meaning that we do not attempt to embed communities or neighborhoods: our aim is to approximate distances or similarities between nodes. The existing approaches to solving this task mostly use either factorization of the graph adjacency matrix (Cao et al., 2015; Ou et al., 2016) or random walks over the graph, as in Deepwalk (Perozzi et al., 2014) and node2vec (Grover and Leskovec, 2016). A completely different approach is taken by (Subercaze et al., 2015), who directly embed the WordNet tree graph into Hamming hypercube binary representations. Their model is dubbed 'Fast similarity embedding' (FSE) and resembles ours in its aim: that is, to provide a much quicker way of calculating semantic similarities based on WordNet knowledge. At the same time, it should be noted that the FSE embeddings are not differentiable, which can limit their use in deep neural architectures.
The method we propose in this paper is different from the previous approaches. We learn embeddings with stochastic optimization and multiple iterations over the training data, as in random walk based algorithms. But we use as an input the pairs of graph nodes with pre-calculated similarities between them. In this way, we avoid the necessity to load into memory and factorize large matrices, and at the same time, we do not employ computationally expensive random walks.

Approximating Structural Node Similarities with Node Embeddings
Our algorithm learns low-dimensional vectors for the graph nodes (synsets in our case) such that the dot products between pairs of vectors are as close as possible to the pre-defined similarities between the corresponding nodes. Our objective is to minimize the mean squared error loss of the following form:
$$\mathcal{L} = \frac{1}{|T|} \sum_{(v_i, v_j) \in T} \left( \mathbf{v}_i \cdot \mathbf{v}_j - sim(v_i, v_j) \right)^2$$
where sim(v_i, v_j) is the value of a 'gold' similarity measure between a pair of nodes v_i and v_j, while \mathbf{v}_i and \mathbf{v}_j are the embeddings of the first and the second node of a pair in the training batch T. Positioned within the schema described in (Hamilton et al., 2017), our model is similar to an encoder-decoder architecture, where the encoder is an embedding look-up function, and the decoder is the pairwise cosine similarity between embeddings. In other words, it is a simple feed-forward neural network with negative sampling where the input consists of embedding pairs and the output is their cosine similarities.
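As an illustration, the mean squared error objective described above (dot products of embedding pairs against their 'gold' similarities) can be written in a few lines of NumPy. This is a minimal sketch, not the Keras implementation used in the experiments; the function name `path2vec_loss`, the toy embedding matrix, and the example pairs are assumptions made for illustration only.

```python
import numpy as np

def path2vec_loss(emb, pairs, sims):
    """Mean squared error between dot products of embedding pairs and gold similarities.

    emb   : (n_nodes, dim) embedding matrix, one row per node
    pairs : (batch, 2) integer array of node indices (i, j)
    sims  : (batch,) gold similarity values sim(v_i, v_j)
    """
    vi = emb[pairs[:, 0]]            # embedding look-up (the "encoder")
    vj = emb[pairs[:, 1]]
    pred = np.sum(vi * vj, axis=1)   # row-wise dot products (the "decoder")
    return float(np.mean((pred - sims) ** 2))

# Toy example: 3 nodes, 2 dimensions (all values are arbitrary).
emb = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
pairs = np.array([[0, 1], [0, 2]])
sims = np.array([0.5, 0.0])
loss = path2vec_loss(emb, pairs, sims)  # dot products match the targets exactly here
```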
Conceptually, this architecture is also very similar to the classical word2vec CBOW or Skip-gram algorithms (Mikolov et al., 2013). In them, word pairs actually seen in the training corpus are optimized so that the dot product of their vectors is close to 1, while randomly selected word pairs ('negative samples') are optimized so that the dot product of their vectors is close to 0. Our architecture does almost the same, with the exception that the target values for the dot product are not always 1 or 0, but can take arbitrary values in the [0...1] range, depending on the path-based measure on the input graph, e.g. the shortest path length in WordNet. One can imagine this as a kind of 'degenerate' word2vec, where the training corpus is already transformed into word-context pairs with target values, like 'motor.n.01 rocket.n.02 0.38987'. Note that in our case, there is no difference between 'word' and 'context' (the first and the second synsets in each pair are treated equally); thus, unlike word2vec, we use only one embedding matrix instead of two. The number of its rows equals the number of nodes in the graph (plus one technical row for unknown 'words'), and the number of columns is the desired embedding dimensionality. Because the model is highly inspired by the word2vec architecture, we dub it 'path2vec', meaning that it encodes paths (or other similarities) between graph nodes into dense vectors.
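To make the data layout concrete, the sketch below turns lines such as 'motor.n.01 rocket.n.02 0.38987' into index triples and a vocabulary whose size gives the number of rows of the single embedding matrix. The whitespace-separated format, the function name `build_vocab`, the `<UNK>` token placed in row 0, and the second example line (with an invented similarity value) are all assumptions made for illustration.

```python
def build_vocab(lines):
    """Map synset names to row indices and convert lines to (i, j, sim) triples."""
    vocab = {"<UNK>": 0}                      # technical row for unknown 'words' (assumed index 0)
    triples = []
    for line in lines:
        a, b, sim = line.split()
        for synset in (a, b):
            vocab.setdefault(synset, len(vocab))   # assign the next free row to a new synset
        triples.append((vocab[a], vocab[b], float(sim)))
    return vocab, triples

lines = [
    "motor.n.01 rocket.n.02 0.38987",   # example pair quoted in the text
    "motor.n.01 engine.n.01 0.5",       # hypothetical pair with an invented value
]
vocab, triples = build_vocab(lines)
n_rows = len(vocab)   # rows of the embedding matrix: one per synset plus the unknown row
```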
As already mentioned, we also followed (Mikolov et al., 2013) in using negative sampling. In the simplest case, this means that during training, each real synset pair (A, B) with ground-truth similarity value S is accompanied by two 'negative' synset pairs (A, X) and (B, Y) with zero similarities, where X and Y are randomly sampled synsets not equal to A or B. In this way, we make use of the synsets underrepresented in the pruned training datasets (see Section 4 for details) and optimize the model to move random synsets apart from each other. In practice, we used 3 negative samples for each synset in the pair.
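The negative sampling scheme described above can be sketched as follows. This is an illustrative reading of the description, not the original implementation: the function name `with_negative_samples`, the integer node ids, and uniform sampling over all nodes are assumptions.

```python
import random

def with_negative_samples(pair, n_nodes, n_neg=3, rng=random):
    """Return the positive triple followed by zero-similarity negative triples.

    pair    : (a, b, sim) with integer node ids a, b and gold similarity sim
    n_nodes : total number of nodes to sample from
    n_neg   : number of negatives drawn for each node of the pair
    """
    a, b, sim = pair
    triples = [(a, b, sim)]
    for anchor in (a, b):
        drawn = 0
        while drawn < n_neg:
            x = rng.randrange(n_nodes)
            if x in (a, b):              # negatives must differ from both A and B
                continue
            triples.append((anchor, x, 0.0))
            drawn += 1
    return triples

rng = random.Random(0)                   # fixed seed for reproducibility
batch = with_negative_samples((10, 20, 0.38987), n_nodes=100, n_neg=3, rng=rng)
```

With 3 negatives per synset, each positive pair contributes 7 training triples in total.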
To actually train the embeddings (which are initialized randomly), we used the Keras library (Chollet et al., 2015) with the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 0.005. The models were trained for 10 epochs with early stopping. Different combinations of embedding dimensionality (vector size) and batch size were tested.
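For readers who want to see the optimization spelled out, below is a minimal NumPy sketch of fitting a single embedding matrix to pairwise targets. It deliberately swaps the setup described above for something simpler: plain full-batch gradient descent instead of Keras with Adam, no negative sampling, and no early stopping. The function names, hyperparameter values, and toy data are assumptions chosen only to keep the example small.

```python
import numpy as np

def mse(emb, pairs, sims):
    """Mean squared error between dot products and target similarities."""
    pred = np.sum(emb[pairs[:, 0]] * emb[pairs[:, 1]], axis=1)
    return float(np.mean((pred - sims) ** 2))

def train_embeddings(pairs, sims, n_nodes, dim=8, lr=0.1, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(n_nodes, dim))   # random initialization
    i, j = pairs[:, 0], pairs[:, 1]
    for _ in range(epochs):
        vi, vj = emb[i], emb[j]
        err = np.sum(vi * vj, axis=1) - sims           # prediction error per pair
        grad = np.zeros_like(emb)
        # Gradient of the mean squared error w.r.t. each embedding row;
        # np.add.at accumulates correctly when a node occurs in several pairs.
        np.add.at(grad, i, (2.0 / len(sims)) * err[:, None] * vj)
        np.add.at(grad, j, (2.0 / len(sims)) * err[:, None] * vi)
        emb -= lr * grad
    return emb

# Toy data: three nodes with arbitrary target similarities.
pairs = np.array([[0, 1], [0, 2], [1, 2]])
sims = np.array([0.9, 0.1, 0.0])
loss_before = mse(train_embeddings(pairs, sims, n_nodes=3, epochs=0), pairs, sims)
loss_after = mse(train_embeddings(pairs, sims, n_nodes=3), pairs, sims)
```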

Datasets used in the Experiments
As mentioned before, our aim is to produce node embeddings that capture similarities between nodes in a large graph. In our case, the graph is WordNet, and the nodes are its noun synsets (82,115 in total). There exist several methods to calculate synset similarities on WordNet. We followed (Budanitsky and Hirst, 2006), who found the Leacock-Chodorow (further 'LCH') and Jiang-Conrath (further 'JCN') symmetric similarity measures to be superior to others, and chose them for our experiments. LCH similarity (Leacock and Chodorow, 1998) is based on the shortest path between two synsets in the WordNet hypernym/hyponym taxonomy, while JCN similarity (Jiang and Conrath, 1997) uses the lowest common parent of two synsets in the same taxonomy. JCN is significantly faster but additionally requires a corpus as a source of probabilistic data about the distributions of synsets ('information content'). To this end, we employed the Brown corpus (Kucera and Francis, 1982) or its SemCor subset, which is considerably smaller but manually annotated with word senses. We used the NLTK (Bird et al., 2009) implementations of all the aforementioned similarity functions.
Directly employing the WordNet graph to find semantic similarities between synsets is computationally expensive. The time complexity of calculating the shortest path between graph nodes (as in LCH) is in the best case linear in the number of nodes and edges. JCN is linear in the height of the graph tree, which is much better. However, it still cannot leverage the highly optimized routines and hardware capabilities which make vectorized representations so efficient. As an example, let us consider the popular problem of ranking the graph nodes by their similarity to one particular node of interest (finding the 'nearest neighbors').
Table 1 shows the typical speeds (in Python) for computing similarities of one node to all other WordNet noun nodes, using either standard similarity functions from NLTK, Hamming distance between 128D binary embeddings produced by the FSE algorithm (Subercaze et al., 2015), or the dot product between a 300D float vector and all rows of an 82,115 × 300 matrix. Using float vectors is 4 orders of magnitude faster than LCH, 3 orders faster than JCN, and 2 orders faster than FSE.
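The vectorized ranking mentioned above amounts to a single matrix-vector product followed by a sort. The sketch below shows this on a toy matrix; the function name `nearest_neighbors` and the example values are assumptions, and no timing claims are implied by it.

```python
import numpy as np

def nearest_neighbors(emb, query_idx, k=10):
    """Return indices of the k rows most similar (by dot product) to the query row."""
    scores = (emb @ emb[query_idx]).astype(float)   # one matrix-vector product
    scores[query_idx] = -np.inf                     # exclude the query node itself
    return np.argsort(-scores)[:k]                  # indices sorted by descending score

# Toy matrix with 4 nodes and 2 dimensions (arbitrary values).
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])
top2 = nearest_neighbors(emb, query_idx=0, k=2)
```

For a real model, `emb` would be the 82,115 × 300 matrix of synset vectors, and the same call would rank all noun synsets against the query.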
One can pre-calculate similarities for all possible synset pairs. For the 82,115 noun synsets in WordNet, this would result in 3,371,395,555 unique synset pairs. Producing similarity scores for all these 3 billion pairs takes about 30 hours on an Intel Xeon E5-2603v4@1.70GHz CPU for LCH, and about 5 hours for JCN using 10 threads. Besides, the resulting similarity lists are very large (about 45 GB gzipped each), making it difficult to use them in applications.
But they can be used once as training data to learn dense embeddings in R^d for these 82,115 synsets, such that d << 82,115 and the cosine similarities between the embeddings approximate:
• Jiang-Conrath similarities calculated over the SemCor corpus (further JCN-SemCor);
• Jiang-Conrath similarities calculated over the Brown corpus (further JCN-Brown);
• Leacock-Chodorow similarities (LCH).
In principle, one can use all unique synset pairs with their WordNet similarities as the training data. However, this seems impractical. As expected due to the small-world nature of the WordNet graph, most synsets are not similar at all: with JCN, the overwhelming majority of pairs feature similarity very close to zero; with LCH, most pairs have similarity below 1.0 (see Figure 1), and pairs with a human-perceivable similarity start to appear only near the values of 1.5 and above. Low-similarity pairs are not of much use in our training workflow, and thus we filtered them out from the datasets (although they are still implicitly used as negative samples). We used a similarity threshold of 0.1 for the JCN datasets and 1.5 for the LCH dataset. This dramatically reduced the size of the training data (to less than 1.5 million pairs for the JCN datasets and to 125 million pairs for the LCH dataset), thus making the training much faster and at the same time improving the quality of the resulting embeddings (see the description of our evaluation setup below). With this being the case, we additionally pruned these reduced datasets by keeping only the first 50 nearest neighbors of each synset (the rationale behind this is that some nodes in the WordNet graph are very central and thus have many neighbors with high similarity, but for us it is enough to get data only about the nearest ones). This again reduced the training time and improved the results, so we hypothesize that such pruning makes the models more general and more focused on the meaningful relations between synsets. The final sizes of the pruned training datasets are 694,762 pairs for JCN-SemCor, 720,440 pairs for JCN-Brown, and 4,008,446 pairs for LCH.
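The two pruning steps (thresholding, then keeping only the nearest neighbors of each synset) can be sketched as below. The details are assumptions rather than a description of the original pipeline: the threshold is treated as inclusive, a pair is kept if it falls within the top neighbors of either of its two synsets, and the function name `prune_pairs` and the toy data are invented for illustration.

```python
from collections import defaultdict

def prune_pairs(triples, threshold, max_neighbors=50):
    """Drop pairs below the threshold, then keep each node's most similar neighbors."""
    neighbors = defaultdict(list)
    for a, b, sim in triples:
        if sim >= threshold:                      # similarity filter
            neighbors[a].append((sim, b))
            neighbors[b].append((sim, a))
    kept = set()
    for node, candidates in neighbors.items():
        candidates.sort(key=lambda c: c[0], reverse=True)
        for sim, other in candidates[:max_neighbors]:
            # Store each undirected pair once, in a canonical order.
            kept.add((min(node, other), max(node, other), sim))
    return sorted(kept)

# Toy data with invented similarity values.
triples = [("a", "b", 0.9), ("a", "c", 0.5), ("a", "d", 0.05), ("b", "c", 0.2)]
pruned = prune_pairs(triples, threshold=0.1, max_neighbors=1)
```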
Note also that as can be seen from Figure 1, the LCH similarity can take values well above 1.0. Our embedding models use cosine similarity between vectors, thus after the pruning, we scaled the LCH similarities to the [0...1] range by unity-based normalization. Also, in some rare cases, NLTK produces JCN similarities of infinitely large values (probably due to the absence of particular synsets in the Brown or SemCor corpora). We clipped these similarities to the value of 1. All the datasets were shuffled prior to training.
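The two rescaling steps described above can be sketched in NumPy as follows. Unity-based normalization is taken here to mean min-max scaling, and the clipping is implemented as a cap on every value above 1, which also covers infinite values; both readings, the function names, and the toy inputs are assumptions.

```python
import numpy as np

def rescale_unity(sims):
    """Min-max scale similarities to the [0, 1] range (assumes max > min)."""
    sims = np.asarray(sims, dtype=float)
    lo, hi = sims.min(), sims.max()
    return (sims - lo) / (hi - lo)

def clip_to_one(sims, max_value=1.0):
    """Cap similarities at max_value, which also replaces infinite values."""
    return np.minimum(np.asarray(sims, dtype=float), max_value)

lch_scaled = rescale_unity([1.5, 2.0, 3.5])               # toy LCH-like values
jcn_clipped = clip_to_one([0.2, float("inf"), 0.7])       # toy JCN-like values
```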

Experimental Setting
It is in principle possible to evaluate the models by calculating the rank correlation of their cosine similarities with the corresponding similarities for all the unique pairs from the training dataset, or at least a large part of them. Subercaze et al. (2015) evaluated their approach on LCH similarities for all unique noun synset pairs from WordNet Core (about 5 million similarities in total); their model achieves a Spearman rank correlation of 0.732 on this task. However, we argue that this kind of evaluation does not measure the ability of the model to produce meaningful predictions, at least for language data. As mentioned, the overwhelming majority of these unique pairs contain synsets not related to each other at all. For most tasks, it is useless to 'know' that, e.g., 'ambulance' and 'general' are less similar than 'ambulance' and 'president'. Due to structural properties of the WordNet graph, the shortest path lengths between these node pairs are indeed different, but we hardly want to learn these idiosyncrasies. It is much more important for the model to be able to robustly tell really similar pairs from the unrelated ones.
Therefore, to evaluate how good the resulting models are at approximating WordNet similarities, we needed a more balanced and relevant test set, not biased towards 0-similarity synset pairs. To this end, we used noun pairs (666 in total) from the SimLex999 semantic similarity dataset (Hill et al., 2015). SimLex999 contains lemmas; as some lemmas may map to several WordNet synsets, for each word pair we chose the synset pair maximizing the WordNet similarity. Then, we measured the Spearman rank correlation between these 'gold' scores and the similarities produced by the path2vec models we have trained on the corresponding datasets. Further on, we call this measure 'correlation with the WordNet similarities'.
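For reference, the Spearman rank correlation used in this evaluation is the Pearson correlation of the ranks of the two score lists. The pure-Python sketch below assigns average ranks to ties; the function names and toy scores are assumptions, and in practice an existing statistics library would typically be used instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1                                  # extend over a run of ties
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy scores: model similarities versus gold similarities for three pairs.
rho = spearman([0.1, 0.5, 0.9], [1.0, 2.0, 3.0])   # identical ordering of the pairs
```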
This evaluation method directly measures how well the model fits the training objective. We would also like to check whether our models generalize to extrinsic tasks. Thus, we additionally used human-annotated semantic similarities from the same SimLex999 as a test set unavailable to the models during training. This is a fairer evaluation strategy, as it directly tests the models' correspondence to human judgments independently of WordNet. These correlations with human scores were tested in two synset selection setups, which are important to distinguish:
• WordNet-based synset selection (static synsets): this setup uses the same lemma-to-synset mappings, based on maximizing WordNet similarity for each SimLex999 word pair with the corresponding similarity function. It means that all the models are tested on exactly the same set of synset pairs (but the scores themselves are taken from SimLex999, not from WordNet).
• Model-based synset selection (dynamic synsets): in this setup, lemmas are converted to synsets dynamically as a part of the evaluation workflow. We choose the synsets that maximize word pair similarity using the vectors from the model itself, not similarity functions on WordNet. Then the resulting ranking is evaluated against the original SimLex999 ranking.
The second (dynamic) setup in principle allows the models to find better lemma-to-synset mappings than those provided by the WordNet similarity functions. This setup essentially evaluates two abilities of the model: 1) to find the best pair of synsets for a given pair of lemmas (a sort of disambiguation task), and 2) to produce the similarity score for the chosen synsets. We are not aware of any 'gold' lemma-to-synset mapping for SimLex999, thus we directly evaluate only the second part, but implicitly the first one still influences the resulting scores. Models often choose different synsets: for example, for the word pair 'atom - carbon', the synset pair maximizing the JCN-SemCor similarity calculated on the 'raw' WordNet would be 'atom.n.02 - carbon.n.01' with the similarity 0.11, while in a 300D path2vec model trained on the same gold similarities, the synset pair with the highest similarity 0.14 would be 'atom.n.01 - carbon.n.01' (which seems to be at least as good a decision as the one from the raw WordNet).
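Both synset selection setups reduce to the same operation, maximizing a similarity function over all candidate synset pairs; they differ only in which function is plugged in. The sketch below illustrates this with the 'atom - carbon' example. The values 0.11 and 0.14 come from the text above, while the two other table entries, the restricted candidate lists, and all function names are hypothetical placeholders.

```python
from itertools import product

def best_synset_pair(synsets_a, synsets_b, sim):
    """Return (similarity, synset_a, synset_b) maximizing sim over all combinations."""
    return max((sim(a, b), a, b) for a, b in product(synsets_a, synsets_b))

# Toy similarity tables; entries other than 0.11 and 0.14 are invented.
wordnet_table = {("atom.n.01", "carbon.n.01"): 0.09, ("atom.n.02", "carbon.n.01"): 0.11}
model_table = {("atom.n.01", "carbon.n.01"): 0.14, ("atom.n.02", "carbon.n.01"): 0.10}

atom_senses = ["atom.n.01", "atom.n.02"]
carbon_senses = ["carbon.n.01"]        # not the full sense inventory

# Static setup: synsets are chosen by the WordNet similarity function.
static_choice = best_synset_pair(atom_senses, carbon_senses, lambda a, b: wordnet_table[(a, b)])
# Dynamic setup: synsets are chosen by the model's own similarities.
dynamic_choice = best_synset_pair(atom_senses, carbon_senses, lambda a, b: model_table[(a, b)])
```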

Baselines
We compare our model against four baselines:
• Raw WordNet similarities by Leacock-Chodorow or Jiang-Conrath measures.
• Deepwalk (Perozzi et al., 2014).
• node2vec (Grover and Leskovec, 2016).
• Fast Similarity Embedding (FSE) (Subercaze et al., 2015) of the WordNet graph.
The first baseline (raw/original WordNet) theoretically should be the upper bound for path2vec, since we train our embeddings on exactly this data. However, as we show below, in some cases the path2vec models manage to outperform this baseline.
Deepwalk and node2vec are recent algorithms based on random walks over graphs that provide state-of-the-art performance in tasks such as link prediction (Hamilton et al., 2017). These models were trained on the same WordNet graph. We used all 82,115 noun synsets as vertices and hypernym/hyponym relations between them as edges. Since the node2vec C++ implementation accepts an edge list as input, we had to add a self-connection for each of the 7,714 nodes (synsets) which lack edges in WordNet. During training, we tried different values for such hyperparameters as the number of random walks, the length of the random walk, the window size, and the number of dimensions. FSE embeddings of the WordNet noun synsets were provided to us by the authors of the FSE algorithm, and consist of binary strings of length 128.
When the models are evaluated by their correlation with the WordNet similarities, the raw WordNet baseline obviously always achieves perfect correlation. All the rank correlation values reported in the tables are statistically significant. We report the results for the best models for each method, all of them (except FSE) using vector size 300 for comparability. This vector size is a well-known and often used 'sweet spot' for dense vector space models of semantics, so we limit ourselves to the scores of 300D models in the tables, but the plots below also show the results of models with other vector sizes.
As expected, vector dimensionality greatly influences the performance of graph embedding models. Figure 2 illustrates this for the path2vec models trained on the JCN-SemCor and LCH datasets, evaluated on SimLex999 human judgments with the 'static synsets' evaluation setup, that is, mapping SimLex999 lemmas to synsets based on raw WordNet similarities. The red horizontal line is the correlation of WordNet similarities with SimLex999 human scores.
For the path2vec models, there is a clear tendency for the performance to improve when the vector size is increased (though it never reaches the performance of raw WordNet similarities). However, for JCN-Brown and LCH this improvement slows down after vector size 300, and using 600D embeddings brings little improvement (if any). This is not the case for JCN-SemCor, which seems to make good use of the increased dimensionality (probably because of its reliance on a sense-annotated corpus). Note that Deepwalk and (especially) node2vec do not benefit much from increased vector size.
The effect of the batch size for path2vec models is more complicated, with different values taking the lead for different datasets and vector sizes. Large batches (like 100) provided performance comparable to small batches, at the same time linearly increasing the training speed. This paves the way to efficient path2vec implementations using very large batch sizes (1000 and more), especially with GPUs. However, the optimal values for this hyperparameter are clearly highly dependent on the nature of the training dataset and should be tested for each particular task separately.
Figure 3 plots the performance of the same models when using the 'dynamic synset selection' evaluation setup (that is, each model can decide for itself how to map SimLex999 lemmas to WordNet synsets). General trends are the same as in the 'static synsets' setup, with one important exception: path2vec and Deepwalk models start to consistently outperform the raw WordNet. This means these embeddings are in some sense 'regularized', leading to better 'disambiguation' of senses behind SimLex999 word pairs and eventually to a better similarity ranking. Considering path2vec, it is quite interesting that a dense vector model trained only on graph distances can (at least for some tasks) not only approximate its 'teacher' but actually be superior to it, at the same time being much smaller and faster.
In Tables 3 and 4 we select the best 300D path2vec models from the experiments described above and compare them against the best 300D baseline models (and 128D FSE embeddings) in static and dynamic evaluation setups. When WordNet-defined lemma-to-synset mappings are used (Table 3), the raw WordNet similarities are unsurprisingly the best, although FSE embeddings achieve nearly the same performance (even slightly better for the JCN-SemCor-defined mappings). Following them, the path2vec and the Deepwalk models are almost on par with each other, and both outperform node2vec.
In the dynamic synset selection setup (Table 4), all the models except node2vec are superior to raw WordNet, and FSE is again the best. Deepwalk is marginally better than path2vec, but we still believe it to be quite an achievement for our model to perform on par with the state-of-the-art graph embedding algorithms (actually better in the case of node2vec), while at the same time not using random walks on graphs and being conceptually much simpler than FSE.

Experiment 2: Extrinsic Evaluation based on Word Sense Disambiguation

Experimental Setting
As an additional extrinsic evaluation step, we turned to the word sense disambiguation (WSD) task on the Senseval-2 test set (Palmer et al., 2001), using the WSD approach from (Sinha and Mihalcea, 2007). The original algorithm uses raw WordNet similarities; we tested how using cosine similarities between the trained embeddings instead influences the WSD performance.
The employed WSD algorithm starts with building a graph where the nodes are the WordNet synsets of the words in the input sentence. The nodes are then connected by edges weighted with the similarity values between the synset pairs. The final step is selecting the most likely sense for each word based on a centrality score for each node (synset), using measures like in-degree, closeness, betweenness and PageRank. In-degree proved to be the best in our experiments as well as in (Sinha and Mihalcea, 2007), so we report the results for it only (and disambiguate only nouns). Figure 4 shows an example of a graph generated by this approach for the sentence 'The parishioners of St. Michael and All Angels stop to chat at the church door, as members here always have' (the weights are not shown for clarity).
Table 5 presents the WSD micro-F1 scores using raw WordNet similarity functions, 300D path2vec (with different batch sizes), Deepwalk and node2vec models, and the 128D FSE model. While the raw WordNet similarities still demonstrate the best performance, the path2vec models are the second best, leaving Deepwalk and node2vec significantly and FSE marginally behind. It is also interesting that in this task, path2vec models with smaller batch sizes perform somewhat better, probably because of more 'focused' weight updates during training. To sum up, path2vec models outperform other graph embedding methods on a practical word sense disambiguation task.
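The graph-based disambiguation procedure described at the start of this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact algorithm of (Sinha and Mihalcea, 2007): in-degree is taken to be the sum of the weights of the edges incident to a synset node, edges connect only synsets of different words, all words of the sentence are connected regardless of distance, and the function name, sense labels, and similarity values are invented for the example.

```python
from itertools import combinations

def disambiguate(candidates, sim):
    """Pick, for each word, the candidate synset with the highest weighted degree.

    candidates : dict mapping each word to a list of candidate synset ids
    sim        : symmetric similarity function over synset ids
    """
    score = {(w, s): 0.0 for w, senses in candidates.items() for s in senses}
    for w1, w2 in combinations(candidates, 2):       # synsets of different words only
        for s1 in candidates[w1]:
            for s2 in candidates[w2]:
                weight = sim(s1, s2)                 # edge weight between two synsets
                score[(w1, s1)] += weight
                score[(w2, s2)] += weight
    return {w: max(senses, key=lambda s: score[(w, s)]) for w, senses in candidates.items()}

# Toy sentence with hypothetical senses and invented similarity values.
table = {
    frozenset(["church.n.01", "door.n.01"]): 0.1,
    frozenset(["church.n.02", "door.n.01"]): 0.3,
    frozenset(["church.n.01", "parishioner.n.01"]): 0.2,
    frozenset(["church.n.02", "parishioner.n.01"]): 0.1,
}
candidates = {
    "church": ["church.n.01", "church.n.02"],
    "door": ["door.n.01"],
    "parishioner": ["parishioner.n.01"],
}
senses = disambiguate(candidates, lambda a, b: table.get(frozenset([a, b]), 0.0))
```

Replacing the lookup table with cosine similarities between trained synset embeddings gives the vectorized variant evaluated here.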

Conclusion
We presented path2vec, a simple, effective, and efficient model for embedding graph similarity measures into dense vector representations. It can be used to learn vector representations of nodes approximating shortest path distances or other node similarity measures of interest. We tested this approach on the task of learning embeddings for noun synsets in the WordNet graph and showed that the resulting representations perform on par with or better than the state-of-the-art graph embedding approaches based on random walks. Interestingly, in the semantic similarity task, the learned embeddings can even outperform the original WordNet similarities on which they were trained, while being much more efficient (3 or 4 orders of magnitude faster, depending on the similarity measure used). Thus, in applications one could replace path-based measures with our node embeddings, gaining a significant speedup in computing distances and nearest neighbors.
path2vec can be trained on arbitrary graph measures and, in contrast to prior art, e.g. (Subercaze et al., 2015), is not restricted to the shortest path or to tree-structured graphs. The LCH and JCN similarity measures are very different, but path2vec is able to fit each of them better than the baseline methods. Additionally, it outperforms the state-of-the-art graph embedding algorithms on the WSD task.