Embedding Words in Non-Vector Space with Unsupervised Graph Learning

It has become a de-facto standard to represent words as elements of a vector space (word2vec, GloVe). While this approach is convenient, it is unnatural for language: words form a graph with a latent hierarchical structure, and this structure has to be revealed and encoded by word embeddings. We introduce GraphGlove: unsupervised graph word representations which are learned end-to-end. In our setting, each word is a node in a weighted graph and the distance between words is the shortest path distance between the corresponding nodes. We adopt a recent method learning a representation of data in the form of a differentiable weighted graph and use it to modify the GloVe training algorithm. We show that our graph-based representations substantially outperform vector-based methods on word similarity and analogy tasks. Our analysis reveals that the structure of the learned graphs is hierarchical and similar to that of WordNet, the geometry is highly non-trivial and contains subgraphs with different local topology.


Introduction
Effective word representations are a key component of machine learning models for most natural language processing tasks. The most popular approach is to map each word to a low-dimensional vector (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017; Tifrea et al., 2019). Several algorithms can produce word embedding vectors whose distances or dot products capture semantic relationships between words; the resulting representations are useful for numerous NLP tasks such as word analogy (Mikolov et al., 2013b) and hypernymy detection (Tifrea et al., 2019), or as features for supervised learning problems.
While representing words as vectors may be convenient, it is unnatural for language: words form a graph with a hierarchical structure (Miller, 1995) that has to be revealed and encoded by word embeddings learned in an unsupervised manner. A possible step towards this is to choose a vector space better matched to the structure of the data: for example, a space with hyperbolic geometry (Dhingra et al., 2018; Tifrea et al., 2019) instead of the commonly used Euclidean one (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017) has been shown to be beneficial for several tasks. However, capturing data structure by choosing an appropriate vector space is likely to be neither optimal nor generalizable: Gu et al. (2018) argue that not only are different data better modelled by different spaces, but even for the same dataset the preferable type of space may vary across its parts. This means that the quality of vector-based embeddings is determined by how well the geometry of the embedding space matches the structure of the data. Therefore, (1) any vector-based word embeddings inherit limitations imposed by the structure of the chosen vector space; (2) the vector space geometry greatly influences the properties of the learned embeddings; (3) these properties may be those of the space geometry rather than of the language.
In this work, we propose to embed words into a graph, which is more natural for language. In our setting, each word is a node in a weighted undirected graph and the distance between words is the shortest path distance between the corresponding nodes; note that any finite metric space can be represented in such a manner. We adopt a recently introduced method which learns a representation of data as a weighted graph (Mazur et al., 2019) and use it to modify the GloVe algorithm for unsupervised word embeddings (Pennington et al., 2014). The former enables simple end-to-end training by gradient descent; the latter, learning a graph in an unsupervised manner. Keeping the training regime of GloVe fixed, we vary the choice of distance: the graph distance we introduce, as well as the distances defined by vector spaces, Euclidean (Pennington et al., 2014) and hyperbolic (Tifrea et al., 2019). This allows for a fair comparison of vector-based and graph-based approaches and an analysis of the limitations of vector spaces. In addition to improvements on a wide range of word similarity and analogy tasks, analysis of the structure of the learned graphs suggests that graph-based word representations can potentially be used as a tool for language analysis.
Our key contributions are as follows:
• we introduce GraphGlove: graph word embeddings;
• we show that GraphGlove substantially outperforms both Euclidean and Poincaré GloVe on word similarity and word analogy tasks;
• we analyze the learned graph structure and show that GraphGlove has a hierarchical, WordNet-like structure and a highly non-trivial geometry containing subgraphs with different local topology.

Graph Word Embeddings
For a vocabulary V = {v_0, v_1, ..., v_n}, we define graph word embeddings as an undirected weighted graph G(V, E, w), where:
• V is the set of vertices corresponding to the vocabulary words;
• E = {e_0, e_1, ..., e_m} is the set of edges;
• w(e_i) are non-negative edge weights.
When embedding words as vectors, the distance between words is defined as the distance between their vectors; the distance function is inherited from the chosen vector space (usually Euclidean). For graph word embeddings, the distance between words is defined as the shortest path distance between the corresponding nodes of the graph:

d_G(v_i, v_j) = min_{π ∈ Π_G(v_i, v_j)} Σ_{e ∈ π} w(e),    (1)

where Π_G(v_i, v_j) is the set of all paths from v_i to v_j over the edges of G.
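A minimal sketch of this shortest-path distance using Dijkstra's algorithm; the toy word graph below is illustrative, not taken from a trained model:

```python
import heapq

def shortest_path_distance(adj, source, target):
    """Dijkstra's algorithm on a weighted graph given as an adjacency
    dict {node: [(neighbor, weight), ...]}; returns d_G(source, target)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # target unreachable

# toy word graph: "cat" - "dog" - "wolf", plus a longer direct edge
adj = {
    "cat":  [("dog", 1.0), ("wolf", 5.0)],
    "dog":  [("cat", 1.0), ("wolf", 1.5)],
    "wolf": [("cat", 5.0), ("dog", 1.5)],
}
print(shortest_path_distance(adj, "cat", "wolf"))  # 2.5, via "dog"
```

Note that the path through "dog" (2.5) beats the direct edge (5.0), which is exactly the minimum over Π_G in equation (1).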
To learn graph word embeddings, we use a recently introduced method for learning a representation of data in a form of a weighted graph (Mazur et al., 2019) and modify the training procedure of GloVe (Pennington et al., 2014) for learning unsupervised word embeddings. We give necessary background in Section 2.1 and introduce our method, GraphGlove, in Section 2.2.
2.1 Background

2.1.1 Learning Weighted Graphs

PRODIGE (Mazur et al., 2019) is a method for learning a representation of data in the form of a weighted graph G(V, E, w). Learning the graph requires (i) inducing a set of edges E from the data and (ii) learning edge weights. To induce the set of edges, the method starts from some sufficiently large initial set and, along with edge weights, learns which edges can be removed from the graph. Formally, it learns G(V, E, w, p), where in addition to a weight w(e_i), each edge e_i has an associated Bernoulli random variable b_i ∼ Bern(p(e_i)); this variable indicates whether an edge is present in G or not. For simplicity, all random variables b_i are assumed to be independent, and the joint probability of all edges in the graph can be written as p(G) = ∏_{i=0}^{m} p(e_i). Since each edge is present in the graph only with some probability, the distance is reformulated as the expected shortest path distance:

d(v_i, v_j) = E_{G ∼ p(G)} [ d_G(v_i, v_j) ],    (2)

where d_G(v_i, v_j) is computed efficiently using Dijkstra's algorithm. The probabilities p(e_i) are used only in training; at test time, edges with probabilities less than 0.5 are removed, and the graph G(V, E, w, p) can be treated as a deterministic graph G(V, E, w).
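For intuition, on a tiny graph the expectation in (2) can be computed exactly by enumerating all edge subsets; the actual method instead uses Monte-Carlo sampling with Dijkstra's algorithm. Disconnected samples are simply skipped below, which is a simplifying assumption of this sketch:

```python
import itertools

def shortest_dist(n, edges, s, t):
    """Floyd-Warshall on a tiny graph; edges is a list of (u, v, w), undirected."""
    INF = float("inf")
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d[s][t]

def expected_distance(n, edges, probs, s, t):
    """E[d_G(s, t)] over independent Bernoulli edge presences, by exact
    enumeration of all 2^m subsets (feasible only for tiny graphs)."""
    exp = 0.0
    for mask in itertools.product([0, 1], repeat=len(edges)):
        p = 1.0
        for present, pe in zip(mask, probs):
            p *= pe if present else (1.0 - pe)
        sub = [e for e, present in zip(edges, mask) if present]
        dist = shortest_dist(n, sub, s, t)
        if dist != float("inf"):  # skip disconnected samples (simplification)
            exp += p * dist
    return exp

edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 3.0)]
# with all edge probabilities 1, this reduces to the plain shortest path
print(expected_distance(3, edges, [1.0, 1.0, 1.0], 0, 2))
```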
Training. Edge probabilities p(e_i) = p_θ(e_i) and weights w(e_i) = w_θ(e_i) are learned by minimizing the following training objective:

min_θ  L(G, θ) + λ · R(θ),   R(θ) = (1/m) Σ_{i} p_θ(e_i).    (3)

Here L(G, θ) is a task-specific loss, λ is a regularization coefficient, and R(θ) is the average probability of an edge being present. The second term is an L_0 regularizer on the number of edges, which penalizes edges for being present in the graph. Training with such regularization results in a graph where each edge becomes either redundant (with probability close to 0) or important (with probability close to 1).
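A sketch of the regularized objective (3) and the test-time pruning rule; the function names are ours, not PRODIGE's:

```python
def prodige_objective(task_loss, edge_probs, lam):
    """Objective (3): task-specific loss plus lambda times the average
    probability of an edge being present (an L0-style penalty)."""
    return task_loss + lam * sum(edge_probs) / len(edge_probs)

def prune_edges(edges, edge_probs):
    """At test time, keep only edges whose presence probability is >= 0.5."""
    return [e for e, p in zip(edges, edge_probs) if p >= 0.5]

# two edges: one near-certain, one near-redundant
print(prodige_objective(1.0, [0.9, 0.1], lam=2.0))  # 1.0 + 2.0 * 0.5 = 2.0
print(prune_edges(["cat-dog", "cat-wolf"], [0.9, 0.1]))
```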
To propagate gradients through the second term in (3), the authors use the log-derivative trick (Glynn, 1990) and Monte-Carlo estimate of the resulting gradient; when sampling, they also apply a heuristic to reduce variance. For more details on the optimization procedure, we refer the reader to the original paper (Mazur et al., 2019).
Initialization. An important detail is that training starts not from the set of all possible edges for a given set of vertices, but from a chosen subset; this subset is constructed using task-specific heuristics. The authors restrict training to a subset of edges to make it feasible for large datasets: while the number of edges in a complete graph scales quadratically with the number of vertices, the initial subset can be constructed to scale linearly with it.

GloVe
GloVe (Pennington et al., 2014) is an unsupervised method which learns word representations directly from global corpus statistics. Each word v_i in the vocabulary V is associated with two vectors w_i and w̃_i; these vectors are learned by minimizing

J = Σ_{i,j: X_{i,j} ≠ 0} f(X_{i,j}) (w_i^T w̃_j + b_i + b̃_j − log X_{i,j})².    (4)

Here X_{i,j} is the co-occurrence count of words v_i and v_j; b_i and b̃_j are trainable word biases, and f(X_{i,j}) is a weighting function: f(X_{i,j}) = min(1, (X_{i,j}/x_max)^α) with x_max = 100 and α = 3/4. The original GloVe learns embeddings in the Euclidean space; Poincaré GloVe (Tifrea et al., 2019) adapts this training procedure to hyperbolic vector spaces. This is done by replacing w_i^T w̃_j in formula (4) with a function of the hyperbolic distance between w_i and w̃_j (see Table 1).
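The weighting function and loss (4) transcribe directly to code; dense nested lists stand in for the sparse co-occurrence matrix in this sketch:

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(X_ij) = min(1, (X_ij / x_max) ** alpha)"""
    return min(1.0, (x / x_max) ** alpha)

def glove_loss(X, w, w_tilde, b, b_tilde):
    """J = sum over non-zero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2"""
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x <= 0:
                continue  # GloVe sums only over observed co-occurrences
            dot = sum(a * c for a, c in zip(w[i], w_tilde[j]))
            total += glove_weight(x) * (dot + b[i] + b_tilde[j] - math.log(x)) ** 2
    return total

# a single pair whose dot product exactly matches log X gives zero loss
print(glove_loss([[math.e]], [[1.0]], [[1.0]], [0.0], [0.0]))  # 0.0
```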

Our Approach: GraphGlove
We learn graph word embeddings within the general framework described in Section 2.1.1. Therefore, it is sufficient to (i) define a task-specific loss L(G, θ) in formula (3), and (ii) specify the initial subset of edges.

Loss function
We adopt the GloVe training procedure and learn edge weights and probabilities directly from the co-occurrence matrix X. We define L(G, θ) by modifying formula (4) for weighted graphs:
1. replace w_i^T w̃_j in the loss term with either the graph distance or the graph dot product as shown in Table 1 (see details below);
2. since we learn one representation for each word, in contrast to the two representations learned by GloVe, we set b̃_j = b_j.
Distance. We want the negative distance between nodes in a graph to reflect the similarity between the corresponding words; therefore, it is natural to replace w_i^T w̃_j with the negative graph distance. The resulting loss L(G, θ) is:

L(G, θ) = Σ_{i,j: X_{i,j} ≠ 0} f(X_{i,j}) (−d(v_i, v_j) + b_i + b_j − log X_{i,j})².    (5)

Dot product. A more faithful approach is to replace the dot product w_i^T w̃_j with a "dot product" on a graph. To define the dot product of nodes in a graph, we first express the dot product of vectors in terms of distances and norms. Let w_i, w_j be vectors in a Euclidean vector space; then

⟨w_i, w_j⟩ = 1/2 (‖w_i‖² + ‖w_j‖² − ‖w_i − w_j‖²) = 1/2 (d(w_i, 0)² + d(w_j, 0)² − d(w_i, w_j)²).

Now it is straightforward to define the dot product of nodes in our weighted graph:

⟨v_i, v_j⟩_G = 1/2 (d(v_i, v_0)² + d(v_j, v_0)² − d(v_i, v_j)²),    (8)

where d(v_i, v_j) is the shortest path distance and v_0 is a special "zero" node. Note that the dot product (8) contains distances to a zero element; thus, in addition to the word nodes, we add an extra "zero" node to the graph. This is not necessary for the distance loss (5), but we add this node anyway to have a unified setting; a model can learn to use this node to build paths between other nodes.
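The graph dot product (8) is just the Euclidean polarization identity applied to graph distances; a quick sanity check against ordinary vectors:

```python
def graph_dot(d_i0, d_j0, d_ij):
    """<v_i, v_j>_G = 0.5 * (d(v_i, v_0)^2 + d(v_j, v_0)^2 - d(v_i, v_j)^2):
    equation (8), with v_0 the special "zero" node."""
    return 0.5 * (d_i0 ** 2 + d_j0 ** 2 - d_ij ** 2)

# sanity check in Euclidean space: w_i = (3, 0), w_j = (0, 4)
# ||w_i|| = 3, ||w_j|| = 4, ||w_i - w_j|| = 5, and indeed w_i . w_j = 0
print(graph_dot(3.0, 4.0, 5.0))  # 0.0

# w_i = (1, 2), w_j = (3, 4): norms sqrt(5) and 5, distance sqrt(8), dot = 11
print(graph_dot(5 ** 0.5, 5.0, 8 ** 0.5))
```

In the graph setting the three arguments are shortest path distances, each obtainable from a run of Dijkstra's algorithm.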
All loss functions are summarized in Table 1.

Initialization
We initialize the set of edges by connecting each word with its K nearest neighbors and M randomly sampled words. The nearest neighbors are computed as the closest words in the Euclidean GloVe embedding space;3 random words are sampled uniformly from the vocabulary. We initialize biases b_i from the normal distribution N(0, 0.01), edge weights with the cosine similarity between the corresponding GloVe vectors, and edge probabilities with 0.9.

Baselines. We compare against Euclidean GloVe (Pennington et al., 2014) and Poincaré GloVe (Tifrea et al., 2019); for both, we use the original implementation4 with recommended hyperparameters. We chose these models to enable a comparison of our graph-based method and two different vector-based approaches within the same training scheme.
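A brute-force sketch of this edge initialization (a real pipeline would use approximate nearest-neighbor search at vocabulary scale; the cosine-similarity weight initialization is omitted here):

```python
import random

def initial_edges(vectors, K, M, seed=0):
    """Connect each word to its K nearest neighbors (Euclidean) and to M
    random words; returns a set of undirected edges (i, j) with i < j."""
    rng = random.Random(seed)
    n = len(vectors)
    edges = set()

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for i in range(n):
        # brute-force k-NN over the embedding vectors
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist2(vectors[i], vectors[j]))
        for j in order[:K]:
            edges.add((min(i, j), max(i, j)))
        # plus M uniformly sampled random words
        for j in rng.sample([j for j in range(n) if j != i], M):
            edges.add((min(i, j), max(i, j)))
    return edges

# three 1-d "embeddings": words 0 and 1 are close, word 2 is far away
print(initial_edges([[0.0], [0.1], [5.0]], K=1, M=0))
```

The random edges (M > 0) give training shortcuts between otherwise distant regions of the initial graph.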

Corpora and Preprocessing
We train all embeddings on the Wikipedia 2017 corpus. To improve the reproducibility of our results, we (1) use a standard publicly available Wikipedia snapshot from gensim-data5, and (2) process the data with the standard Gensim Wikipedia tokenizer6. We also release the preprocessing scripts and the resulting corpora as part of the supplementary code.

3 In preliminary experiments, we also used as nearest neighbors the words with the largest pointwise mutual information (PMI) with the current one. However, such models achieve better loss but worse quality on downstream tasks, e.g. word similarity.
4 Euclidean GloVe: https://nlp.stanford.edu/projects/glove/; Poincaré GloVe: https://github.com/alex-tifrea/poincare_glove.
5 https://github.com/RaRe-Technologies/gensim-data, dataset wiki-english-20171001.
6 gensim.corpora.wikicorpus.tokenize, commit de0dcc3.

Setup
We compare embeddings with the same vocabulary and number of parameters per token. For vector-based embeddings, the number of parameters equals the vector dimensionality. For GraphGlove, we compute the number of parameters per token as proposed by Mazur et al. (2019): (|V| + 2·|E|)/|V|. To obtain the desired number of parameters in GraphGlove, we initialize it with several times more parameters and train with the L_0 regularizer until enough edges are dropped (see Section 2.2). We consider two vocabulary sizes: 50k and 200k. For the 50k vocabulary, the models are trained with either 20 or 100 parameters per token; for the 200k vocabulary, with 20 parameters per token. For initialization of GraphGlove with 20 parameters per token, we set K = 64, M = 10; for the model with 100 parameters per token, K = 480, M = 32.
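The parameter accounting can be read as one parameter per node plus two per edge (the exact interpretation of the per-edge pair, e.g. weight plus endpoint index, is our reading, not stated in the formula itself):

```python
def params_per_token(num_vertices, num_edges):
    """(|V| + 2*|E|) / |V|, the per-token parameter count of Mazur et al. (2019)."""
    return (num_vertices + 2 * num_edges) / num_vertices

# a 50k-word graph with 475k surviving edges costs 20 parameters per token
print(params_per_token(50_000, 475_000))  # 20.0
```

Inverting the formula shows that a budget of p parameters per token allows roughly |V|·(p − 1)/2 edges after pruning.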
In preliminary experiments, we discovered that increasing both K and M leads to better final representations at a cost of slower convergence; decreasing the initial graph size results in lower quality and faster training. However, starting with no random edges (i.e. M = 0) also slows convergence down.

Training
Similarly to vectorial embeddings, GraphGlove learns to minimize the objective (either distance- or dot-product-based) by minibatch gradient descent. However, doing so efficiently requires a special graph-aware batching strategy: a batch has to contain only a small number of rows with potentially thousands of columns per row. This strategy takes advantage of Dijkstra's algorithm, a single run of which finds the shortest paths from a single source to multiple targets. Formally, one training step is as follows:
1. choose b = 64 unique "anchor" words;
2. sample up to n = 10^4 words that co-occur with each of the b "anchors";
3. multiply the objective by importance sampling weights to compensate for the non-uniform sampling strategy.7
This way, a single training iteration with batch size b · n requires only O(b) runs of Dijkstra's algorithm.
7 Let X be the co-occurrence matrix. Then for a pair of words (v_i, v_j), the importance sampling weight is p_{i,j}/q_{i,j}, where p_{i,j} is the probability of choosing the pair (v_i, v_j) in the original GloVe, and q_{i,j} = (1/|V|) · 1/|{k: X_{i,k} ≠ 0}| is the probability of choosing this pair under our sampling strategy.

After computing the gradients for a minibatch, we update GraphGlove parameters using Adam (Kingma and Ba, 2014) with learning rate α = 0.01 and standard hyperparameters (β_1 = 0.9, β_2 = 0.999).
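A sketch of the importance weight under the assumption (ours) that the original GloVe picks uniformly among non-zero co-occurrence pairs, while the batching scheme first picks an anchor uniformly and then one of its co-occurring words:

```python
def importance_weight(X, i, j):
    """p_ij / q_ij for pair (v_i, v_j); X is a dense co-occurrence matrix.
    p_ij: uniform over all non-zero entries of X (our assumption for GloVe).
    q_ij: (1/|V|) * 1/|{k: X[i][k] != 0}| -- anchor first, co-occurring word second."""
    n = len(X)
    nonzero_total = sum(1 for row in X for x in row if x != 0)
    p = 1.0 / nonzero_total
    q = (1.0 / n) * (1.0 / sum(1 for x in X[i] if x != 0))
    return p / q

X = [[1, 2],
     [0, 3]]  # 3 non-zero entries; row 0 has 2 of them
print(importance_weight(X, 0, 1))  # (1/3) / (1/2 * 1/2) = 4/3
```

Rows with many co-occurring words get weights above 1, compensating for each of their pairs being under-sampled relative to GloVe.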
It took us less than 3.5 hours on a 32-core CPU to train GraphGlove on 50k tokens until convergence. This is approximately 3 times longer than Euclidean GloVe in the same setting.

Experiments
In the main text, we report results for 50k vocabulary with 20 parameters per token. Results for other settings, as well as the standard deviations, can be found in the supplementary material.

Word Similarity
To measure the similarity of a pair of words, we use the cosine distance for Euclidean GloVe, the hyperbolic distance for Poincaré GloVe, and the shortest path distance for GraphGlove. In the main experiments, we exclude pairs with out-of-vocabulary (OOV) words. In the supplementary material, we also provide results with inferred distances for OOV words.
We evaluate word similarity on standard benchmarks: WS353, SCWS, RareWord, SimLex and SimVerb. These benchmarks measure the Spearman rank correlation between human-annotated similarities of word pairs and model predictions.8 Table 2 shows that GraphGlove outperforms vector-based embeddings by a large margin.

8 We use the standard evaluation code from https://github.com/kudkudak/word-embeddings-benchmarks.

Word Analogy
Analogy prediction is a standard method for evaluation of word embeddings. This task typically contains tuples of 4 words: (a, a * , b, b * ) such that a is to a * as b is to b * . The model is tasked to predict b * given the other three words: for example, "a = Athens is to a * = Greece as b = Berlin is to b * = (Germany)". Models are compared based on accuracy of their predictions across all tuples in the benchmark.
The standard benchmarks are the Google analogy (Mikolov et al., 2013a) and MSR (Mikolov et al., 2013c) test sets. The MSR test set contains only morphological categories; the Google test set contains 9 morphological and 5 semantic categories, with 20-70 unique word pairs per category combined in all possible ways to yield 8,869 semantic and 10,675 syntactic questions. Unfortunately, these test sets are not balanced in terms of linguistic relations, which may lead to overestimating analogical reasoning abilities as a whole (Gladkova et al., 2016). The Bigger Analogy Test Set (BATS) (Gladkova et al., 2016) contains 40 linguistic relations, each represented with 50 unique word pairs, making up 99,200 questions in total. In contrast to the standard benchmarks, BATS is balanced across four groups: inflectional and derivational morphology, and lexicographic and encyclopedic semantics.
Evaluation. Euclidean GloVe solves analogies by maximizing the 3COSADD score:

b* = argmax_{c ∈ V} [cos(c, b) + cos(c, a*) − cos(c, a)].

We adapt this for GraphGlove by substituting cos(x, y) with a graph-based similarity function. As a simple heuristic, we define the similarity between two words as the correlation of the vectors of their distances to all words in the vocabulary:

sim_G(x, y) = corr(d_G(x), d_G(y)),   d_G(x) = (d(x, v_0), d(x, v_1), ..., d(x, v_n)).

This function behaves similarly to the cosine similarity: its values lie in [−1, 1], with unrelated words having similarity close to 0 and semantically close words having similarity close to 1. Another appealing property of sim_G(x, y) is efficient computation: we can obtain the full distance vector d_G(x) with a single pass of Dijkstra's algorithm. We use sim_G(x, y) to solve the analogy task in GraphGlove:

b* = argmax_{c ∈ V} [sim_G(c, b) + sim_G(c, a*) − sim_G(c, a)].

For details on how Poincaré GloVe solves the analogy problem, we refer the reader to the original paper (Tifrea et al., 2019).
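A sketch of the graph similarity and the adapted 3COSADD rule; Pearson correlation stands in for "correlation" here (an assumption), and the toy distance vectors are purely illustrative:

```python
def pearson(u, v):
    """Plain Pearson correlation of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)

def solve_analogy(dist_vectors, a, a_star, b, vocab):
    """3COSADD with cos replaced by sim_G: argmax over candidates of
    sim(c, b) + sim(c, a_star) - sim(c, a), where sim correlates the
    words' distance-to-every-word vectors."""
    def sim(x, y):
        return pearson(dist_vectors[x], dist_vectors[y])
    candidates = [w for w in vocab if w not in (a, a_star, b)]
    return max(candidates, key=lambda c: sim(c, b) + sim(c, a_star) - sim(c, a))

# toy distance vectors over the vocabulary ['a', 'a*', 'b', 'b*', 'x']
dv = {
    "a":  [0, 1, 4, 5, 9],
    "a*": [1, 0, 5, 4, 9],
    "b":  [4, 5, 0, 1, 9],
    "b*": [5, 4, 1, 0, 9],
    "x":  [9, 9, 9, 9, 0],
}
print(solve_analogy(dv, "a", "a*", "b", list(dv)))  # 'b*'
```

In the real model each dist_vectors[w] is the output of one Dijkstra pass from w.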
Results. GraphGlove shows substantial improvements over vector-based baselines (Tables 3 and 4). Note that for Poincaré GloVe, the best-performing loss functions for the two tasks differ (cosh² d for similarity and d² for analogy), and there is no setting in which Poincaré GloVe outperforms Euclidean GloVe on both tasks. While the best-performing loss functions for GraphGlove also vary across tasks, GraphGlove with the dot product loss outperforms all vector-based embeddings on 10 out of 13 benchmarks (both analogy and similarity). This shows that, once the limitations imposed by the geometry of a vector space are removed, embeddings can better reflect the structure of the data. We further confirm this by analyzing the properties of the learned graphs in Section 5.

Learned Graph Structure
In this section, we analyze the graph structure learned by our method and reveal its differences from the structure of vector-based embeddings. We compare the graph G_G learned by GraphGlove (d) with graphs G_E and G_P induced from Euclidean and Poincaré (cosh² d) embeddings, respectively.10 For vector embeddings, we consider two methods of graph construction:
1. THR: connect two nodes if they are closer than some threshold τ;
2. KNN: connect each node to its K nearest neighbors and combine multiple edges.
The values τ and K are chosen to have similar edge density for all graphs.11 We find that, in contrast to the graphs induced from vector embeddings:
• in GraphGlove, frequent and generic words are highly interconnected;
• GraphGlove has a hierarchical, WordNet-like structure;
• GraphGlove has a non-trivial geometry containing subgraphs with different local topology.

Important words
Here we identify which words correspond to "central" (or important) nodes in different graphs; we consider several notions of node centrality frequently used in graph theory. Note that in this section, by word importance we mean graph-based properties of nodes (e.g., the number of neighbors), not semantic importance (e.g., high importance for content words and low for function words).

10 We take the same models as in Section 4.
11 Namely, K = 13 and τ = 0.112 for Euclidean GloVe, K = 13 and τ = 0.444 for Poincaré GloVe.

Degree centrality. The simplest measure of node importance is its degree. For the top 200 nodes with the highest degree, we show the distribution of parts of speech and the average frequency percentile (higher means more frequent words). Figure 1 shows that for all vector-based graphs, the top contains a significant fraction of proper nouns and nouns. For G_G, the distribution of parts of speech is more uniform and the words are more frequent. We provide the top words for this and all subsequent importance measures in the supplementary material.
Eigenvector centrality. A more robust measure of node importance is the eigenvector centrality (Bonacich, 1987). This centrality takes into account not only the degree of a node but also the importance of its neighbors: a high eigenvector score means that a node is connected to many nodes that themselves have high scores. Figure 2 shows that for G_G the top changes in a principled way: the average frequency increases, proper nouns almost vanish, and many adverbs, prepositions, linking and introductory words appear (e.g., 'well', 'but', 'in', 'that').12 For G_G, the top consists of frequent generic words; this agrees with the intuitive understanding of importance. In contrast to G_G, top words for G_E and G_P have lower frequencies and fewer adverbs and prepositions. This may be because it is hard for vector-based embeddings to place generic words from different areas close to each other, while GraphGlove can learn arbitrary connections.

12 See the words in the supplementary material.

k-core. To further support this claim, we look at the main k-core of the graphs. Formally, a k-core is a maximal subgraph whose nodes all have degree k or more; the main core is the non-empty core with the largest k. Table 5 shows the sizes of the main cores and the corresponding values of k. Note that the maximum k is much smaller for G_G; a possible explanation is that the cores in G_E and G_P are formed by nodes in highly dense regions of space, while in G_G the most important nodes from different parts of the graph can be interlinked.
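Both graph-theoretic notions used above can be sketched in pure Python on an adjacency-set representation (power iteration runs on I + A rather than A, a standard tweak to avoid oscillation on bipartite graphs; the leading eigenvector is the same):

```python
def eigenvector_centrality(adj, iters=100):
    """Power iteration: a node's score is (proportional to) the sum of its
    neighbors' scores; adj maps each node to its set of neighbors."""
    x = {v: 1.0 for v in adj}
    for _ in range(iters):
        nx = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}  # (I + A) x
        norm = sum(s * s for s in nx.values()) ** 0.5
        x = {v: s / norm for v, s in nx.items()}
    return x

def k_core(adj, k):
    """Maximal subgraph in which every node has degree >= k; the main core
    is the non-empty k-core with the largest such k."""
    nodes = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(nodes):
            if sum(1 for u in adj[v] if u in nodes) < k:
                nodes.discard(v)  # strip nodes below the degree threshold
                changed = True
    return nodes

star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
c = eigenvector_centrality(star)
print(c[0] > c[1])  # the hub of a star is the most central node

tri = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(k_core(tri, 2))  # the triangle survives, the pendant node does not
```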

The Structure is Hierarchical
In this section, we show that the structure of our graph reflects the hierarchical nature of words. We do so by comparing the structure learned by GraphGlove to the noun hierarchy from WordNet. To extract a hierarchy from G_G, we (1) take all (lemmatized) nouns in our dataset which are also present in WordNet (22.5K words), (2) take the root noun 'entity' (the root of the WordNet tree), and (3) construct the hierarchy: the k-th level is formed by all nodes at edge distance k from the root. We consider two ways of measuring the agreement between the hierarchies: word correlation and level correlation. Word correlation is the Spearman rank correlation between the vectors of levels of all nouns in the two hierarchies. Level correlation is the Spearman rank correlation between the vectors l and l^avg, where l_i is a level in the WordNet tree and l_i^avg is the average level in our hierarchy of the words at level l_i in WordNet.
We performed these measurements for all graphs (see Table 6).13 We see that, according to both correlations, G_G is in the best agreement with the WordNet hierarchy.
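The level extraction (BFS from the root) and the rank correlation can be sketched as follows:

```python
from collections import deque

def bfs_levels(adj, root):
    """Level of each node = unweighted edge distance from the root."""
    level = {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                q.append(v)
    return level

def spearman(u, v):
    """Spearman rank correlation = Pearson correlation of ranks
    (average ranks for ties)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
                j += 1
            avg = (i + j) / 2.0 + 1.0  # average rank over the tied run
            for t in range(i, j + 1):
                r[order[t]] = avg
            i = j + 1
        return r
    ru, rv = ranks(u), ranks(v)
    n = len(u)
    mu, mv = sum(ru) / n, sum(rv) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    su = sum((a - mu) ** 2 for a in ru) ** 0.5
    sv = sum((b - mv) ** 2 for b in rv) ** 0.5
    return cov / (su * sv)

# a path graph rooted at one end: levels are just distances from the root
print(bfs_levels({0: [1], 1: [0, 2], 2: [1]}, 0))  # {0: 0, 1: 1, 2: 2}
```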

The Geometry is Non-trivial
In contrast to vector embeddings, graph-based representations are not constrained by a vector space geometry and can potentially imitate arbitrarily complex spaces. Here we confirm that the geometry learned by GraphGlove is indeed non-trivial. We cluster G_G using the Chinese Whispers algorithm for graph node clustering (Biemann, 2006) and measure the Gromov δ-hyperbolicity of each cluster. Gromov hyperbolicity measures how close a given metric is to a tree metric (see, e.g., Tifrea et al. (2019) for a formal definition) and has previously been used to show the tree-like structure of the word log-co-occurrence graph (Tifrea et al., 2019). A low average δ indicates a tree-like structure, with δ being exactly zero for trees; δ is usually normalized by the average shortest path length to get a value invariant to metric scaling. Figure 4 shows the distribution of the average δ-hyperbolicity for clusters of size at least 10. Firstly, we see that for many clusters the normalized average δ-hyperbolicity is close to zero, which agrees with the intuition that some words form a hierarchy. Secondly, δ-hyperbolicity varies significantly over the clusters, and some clusters have relatively large values, meaning that these clusters are not tree-like. Figure 3 shows examples of clusters with different values of δ-hyperbolicity: both tree-like (Figure 3a) and more complicated (Figure 3b-c).

13 The low performance of threshold-based graphs can be explained by the fact that they are highly disconnected (we assume that all nodes which are not connected to the root form the last level).
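The four-point definition of Gromov δ-hyperbolicity is easy to compute exactly on small clusters (the normalization by average shortest path length mentioned above is omitted in this sketch):

```python
import itertools

def delta_hyperbolicity(d):
    """Four-point Gromov delta over a distance matrix d (2-d list): for every
    quadruple, sort the three pairwise distance sums; that quadruple's delta
    is (largest - second largest) / 2; return the maximum over quadruples."""
    n = len(d)
    best = 0.0
    for x, y, z, w in itertools.combinations(range(n), 4):
        s = sorted([d[x][y] + d[z][w], d[x][z] + d[y][w], d[x][w] + d[y][z]])
        best = max(best, (s[2] - s[1]) / 2.0)
    return best

# a path (a tree) has delta = 0; a 4-cycle does not
path_d = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
cycle_d = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]
print(delta_hyperbolicity(path_d), delta_hyperbolicity(cycle_d))  # 0.0 1.0
```

The contrast between the two toy metrics mirrors the paper's finding: tree-like clusters sit near δ = 0, while clusters with cycles push δ up.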

Related Work
Word embedding methods typically represent words as vectors in a low-dimensional space; usually, the vector space is Euclidean (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017), but recently other spaces, e.g. hyperbolic, have been explored (Leimeister and Wilson, 2018; Dhingra et al., 2018; Tifrea et al., 2019). However, vectorial embeddings can have undesired properties: e.g., in dot product spaces certain words cannot be assigned high probability regardless of their context (Demeter et al., 2020). A conceptually different approach is to model words as probability density functions (Vilnis and McCallum, 2015; Athiwaratkun and Wilson, 2017; Bražinskas et al., 2018; Muzellec and Cuturi, 2018; Athiwaratkun and Wilson, 2018). We propose a new setting: embedding words as nodes in a weighted graph.
To learn a weighted graph, we use the method by Mazur et al. (2019). Prior approaches to learning graphs from data are either highly problem-specific and not scalable (Escolano and Hancock, 2011; Karasuyama and Mamitsuka, 2017; Kang et al., 2019) or solve the less general but important case of learning directed acyclic graphs (Zheng et al., 2018; Yu et al., 2019). The opposite of learning a graph from data is the task of embedding the nodes of a given graph so as to reflect graph distances and/or other properties; see Hamilton et al. (2017) for a thorough survey.
Analysis of word embeddings and the structure of the learned feature space often reveals interesting language properties and is an important research direction (Köhn, 2015;Bolukbasi et al., 2016;Mimno and Thompson, 2017;Nakashole and Flauger, 2018;Naik et al., 2019;Ethayarajh et al., 2019). We show that graph-based embeddings can be a powerful tool for language analysis.

Conclusions
We introduce GraphGlove: graph word embeddings, where each word is a node in a weighted graph and the distance between words is the shortest path distance between the corresponding nodes. The graph is learned end-to-end in an unsupervised manner. We show that GraphGlove substantially outperforms both Euclidean and Poincaré GloVe on word similarity and word analogy tasks. Our analysis reveals that the structure of the learned graphs is hierarchical and similar to that of WordNet; the geometry is highly non-trivial and contains subgraphs with different local topology.
Possible directions for future work include using GraphGlove for unsupervised hypernymy detection, analyzing undesirable word associations, comparing learned graph topologies for different languages, and downstream applications such as sequence classification. Also, given the recent success of models such as ELMo and BERT, it would be interesting to explore extensions of GraphGlove to the class of contextualized embeddings.

A Appendix: Additional benchmarks

A.1 Variance study

As our method relies on random initialization of the graph in PRODIGE, a natural question is whether a different choice of drawn edges significantly affects the quality of the representations at the end of training. Figure 5 demonstrates that, over 5 runs of the training procedure with the distance-based loss with different random seeds, the final metric values have a standard deviation of less than 1 point on 10 of 13 tasks, and a standard deviation of at most 1.34 percent on the RareWord dataset. Thus, we conclude that GraphGlove results are relatively stable with respect to the selection of random edges before training.
Some word pairs in each similarity benchmark are out of vocabulary (OOV). In the main evaluation, we drop such pairs from the benchmark. However, there is also a different way to handle such words.

A popular workaround is to compute the distance between w_i and an OOV word as the average distance from w_i to all other words. In the rare case when both words are OOV, we consider them infinitely distant from each other. We report similarity benchmarks including OOV tokens in Tables 9, 10 and 11.