Semantic Word Clusters Using Signed Spectral Clustering

Vector space representations of words capture many aspects of word similarity, but such methods tend to produce vector spaces in which antonyms (as well as synonyms) are close to each other. In our setting, words are points in a vector space, with synonyms linked by positive weights and antonyms linked by negative weights. We present a new signed spectral normalized graph cut algorithm, signed clustering, that overlays existing thesauri upon distributionally derived vector representations of words, so that antonym relationships between word pairs are represented by negative weights. Our signed clustering algorithm produces clusters of words that simultaneously capture distributional and synonym relations. By using randomized spectral decomposition (Halko et al., 2011) and sparse matrices, our method is both fast and scalable. We validate our clusters using datasets containing human judgments of word pair similarities and show the benefit of using our word clusters for sentiment prediction.


Introduction
In distributional vector representations, opposite relations are not fully captured. Take, for example, words such as "great" and "awful" that can appear with similar frequency in the same sentence structure: "John had a great meeting" and "John had an awful day." Word embeddings, which are successful in a wide array of NLP tasks (Turney et al., 2010; Dhillon et al., 2015), fail to capture this antonymy because they follow the distributional hypothesis that similar words are used in similar contexts (Harris, 1954), thus assigning small cosine or Euclidean distances between the vector representations of "great" and "awful".
While vector space models (Turney et al., 2010) such as word2vec (Mikolov et al., 2013), Global vectors (GloVe) (Pennington et al., 2014), or Eigenwords (Dhillon et al., 2015) capture relatedness, they do not adequately encode synonymy and semantic similarity (Mohammad et al., 2013; Scheible et al., 2013). Our goal is to create clusters of synonyms or semantically equivalent words, yielding linguistically motivated unified constructs. Signed graphs, which allow negative edge weights, were first introduced by Cartwright and Harary (1956). However, signed graph clustering for multiclass normalized cuts (K clusters) has been largely unexplored until recently. We present a novel theory and method that extends the multiclass normalized cuts (K-cluster) of Yu and Shi (2003) to signed graphs (Gallier, 2016) and the work of Kunegis et al. (2010) to K-clustering. This extension allows the incorporation of knowledge base information as positively and negatively weighted links (see Figure 1). Negative edges serve as repellent or opposite relationships between nodes.
Our signed spectral normalized graph cut algorithm (henceforth, signed clustering) builds negative edge relations into graph embeddings using similarity structure in vector spaces. It takes as input an initial set of vectors and edge relations, and hence is easy to combine with any word embedding method. This paper formally improves on the discrete optimization problem of Yu and Shi (2003).
Signed clustering gives better clusters than spectral clustering (Shi and Malik, 2000) of word embeddings, and it has better coverage and is more robust than thesaurus lookup. This is because thesauri erroneously give equal weight to rare senses of a word; for example, "rich" as a rarely used synonym of "absurd". Also, the overlap between thesauri is small, due to their manual creation: Lin (1998) found 17.8397% overlap between synonym sets from Roget's Thesaurus and WordNet 1.5, and we find similarly small overlap between all three thesauri we tested.
We evaluate our clusters using SimLex-999 (Hill et al., 2014) and SimVerb-3500 (Gerz et al., 2016) as ground truth. Finally, we test our method on the sentiment analysis task. Overall, signed spectral clustering can augment methods that use signed information, and has broad application across many fields.
Our main contributions are: the novel extension of signed clustering to the multiclass (K-cluster), and the application of this method to create semantic word clusters that are agnostic to vector space representations and thesauri.

Related Work
Semantic word clusters and distributional thesauri have been well studied in the NLP literature (Lin, 1998; Curran, 2004). Recently there has been a line of research on incorporating synonyms and antonyms into word embeddings. Our approach is very much in line with Vlachos et al. (2009); however, they explicitly made verb clusters using Dirichlet Process Mixture Models and must-link / cannot-link clustering. Furthermore, they note that cannot-link clustering does not improve performance, whereas antonyms as negative links are key to our signed clustering.
Most recent models either attempt to build richer contexts, in order to find semantic similarity, or overlay thesaurus information in a supervised or semi-supervised manner. One line of active research post-processes word vector embeddings by transforming the space using a single- or multi-relational objective (Yih et al., 2012; Chang et al., 2013; Tang et al., 2014; Zhang et al., 2014; Faruqui et al., 2015; Mrkšić et al., 2016).
Alternatively, there are methods to modify the objective function for generating the word embeddings (Ono et al., 2015;Pham et al., 2015;Schwartz et al., 2015).
Our approach differs from the aforementioned methods in that we create word clusters using antonym relationships as negative links. Unlike the previous approaches using semi-supervised methods, we incorporate the thesauri as a knowledge base. Similar to the word vector retrofitting and counter-fitting methods of Faruqui et al. (2015) and Mrkšić et al. (2016), our signed clustering method uses existing vector representations to create word clusters.
To our knowledge, this work is the first theoretical foundation of multiclass signed normalized cuts. Zass and Shashua (2005) approached multiclass clustering differently, relaxing the orthogonality assumption and focusing instead on the non-negativity constraint, which leads to a doubly stochastic optimization problem in which negative edges are handled by a constrained hyperparameter. Hou (2005) used positive degrees of nodes in the degree matrix of a signed graph with weights (-1, 0, 1), which was advanced by Kolluri et al. (2004) and Kunegis et al. (2010) using absolute values of weights in the degree matrix. Interestingly, Chiang et al. (2014) presented a theoretical foundation for edge sign prediction and a recursive clustering approach. Mercado et al. (2016) found that using the geometric mean of the graph Laplacian improves performance. Wang et al. (2016) used semi-supervised polarity induction (Rao and Ravichandran, 2009) to create clusters of words with similar valence and arousal. Must-link and cannot-link soft spectral clustering (Rangapuram and Hein, 2012) shares similarities with our method, particularly in the limit where there are no must-link edges present. Both must-link / cannot-link clustering and polarity induction differ from our method in their optimization; ours is significantly faster due to the use of randomized SVD (Halko et al., 2011) and can thus be applied to large-scale NLP problems.
We developed a novel theory and algorithm that extends the clustering of Shi and Malik (2000) and Yu and Shi (2003) to the multiclass signed graph case.

Signed Normalized Cut
Weighted graphs whose symmetric weight matrix may contain both negative and positive entries are called signed graphs.
Such graphs (with weights $(-1, 0, +1)$) were introduced as early as 1953 by Harary (1953) to model social relations involving disliking, indifference, and liking. The problem of clustering the nodes of a signed graph arises naturally as a generalization of the clustering problem for weighted graphs. Figure 1 shows a signed graph of word similarities with a thesaurus overlay.

[Figure 1: Signed graph of words using a distance metric from the word embedding. The red dashed edges represent the antonym relation, while solid edges represent synonym relations.]

Gallier (2016) extends normalized cuts to signed graphs in order to incorporate antonym information into word clusters.
A weighted graph is a pair $G = (V, W)$, where $V = \{v_1, \ldots, v_m\}$ is a set of nodes or vertices, and $W$ is a symmetric matrix called the weight matrix, such that $w_{ij} \geq 0$ for all $i, j \in \{1, \ldots, m\}$, and $w_{ii} = 0$ for $i = 1, \ldots, m$. A signed graph relaxes the nonnegativity condition: given a signed graph $G = (V, W)$, where $W$ is an $m \times m$ symmetric matrix with zero diagonal entries and the other entries $w_{ij} \in \mathbb{R}$ arbitrary, the underlying graph of $G$ is the graph with node set $V$ and set of (undirected) edges $\{\{v_i, v_j\} \mid w_{ij} \neq 0\}$. For any node $v_i \in V$, the signed degree of $v_i$ is defined as
$$\bar d_i = \bar d(v_i) = \sum_{j=1}^{m} |w_{ij}|,$$
and the signed degree matrix as $\bar D = \mathrm{diag}(\bar d_1, \ldots, \bar d_m)$. For any subset $A$ of the set of nodes $V$, let
$$\mathrm{vol}(A) = \sum_{v_i \in A} \bar d_i.$$
For any two subsets $A$ and $B$ of $V$, with $A^c$ the complement of $A$, define $\mathrm{links}^+(A, B)$, $\mathrm{links}^-(A, B)$, and $\mathrm{cut}(A, A^c)$ by
$$\mathrm{links}^+(A, B) = \sum_{\substack{v_i \in A,\, v_j \in B \\ w_{ij} > 0}} w_{ij}, \qquad \mathrm{links}^-(A, B) = \sum_{\substack{v_i \in A,\, v_j \in B \\ w_{ij} < 0}} -w_{ij}, \qquad \mathrm{cut}(A, A^c) = \sum_{\substack{v_i \in A,\, v_j \in A^c \\ w_{ij} \neq 0}} |w_{ij}|.$$
Then, the signed Laplacian $\bar L$ is defined by
$$\bar L = \bar D - W,$$
and its normalized version $\bar L_{\mathrm{sym}}$ by
$$\bar L_{\mathrm{sym}} = \bar D^{-1/2}\, \bar L\, \bar D^{-1/2}.$$
Kunegis et al. (2010) showed that $\bar L$ is positive semidefinite. For a graph without isolated vertices, $\bar L_{\mathrm{sym}} = I - \bar D^{-1/2} W \bar D^{-1/2}$.
Given a partition of $V$ into $K$ clusters $(A_1, \ldots, A_K)$, we represent the $j$th block of this partition by a vector $X^j$ such that
$$X^j_i = \begin{cases} a_j & \text{if } v_i \in A_j \\ 0 & \text{if } v_i \notin A_j, \end{cases}$$
for some $a_j \neq 0$. The signed normalized cut $\mathrm{sNcut}(A_1, \ldots, A_K)$ of the partition is defined as
$$\mathrm{sNcut}(A_1, \ldots, A_K) = \sum_{j=1}^{K} \frac{\mathrm{cut}(A_j, A_j^c)}{\mathrm{vol}(A_j)} + 2 \sum_{j=1}^{K} \frac{\mathrm{links}^-(A_j, A_j)}{\mathrm{vol}(A_j)}.$$
It should be noted that this formulation differs significantly from Kunegis et al. (2010), and even more so from must-link / cannot-link clustering.
Observe that minimizing $\mathrm{sNcut}(A_1, \ldots, A_K)$ minimizes the number of positive and negative edges between clusters, and also the number of negative edges within clusters. Removing the term $\mathrm{links}^-(A_j, A_j)$ reduces sNcut to standard normalized cuts.
A linear algebraic formulation is
$$\mathrm{sNcut}(A_1, \ldots, A_K) = \sum_{j=1}^{K} \frac{(X^j)^\top \bar L\, X^j}{(X^j)^\top \bar D\, X^j},$$
where $X$ is the $m \times K$ matrix whose $j$th column is $X^j$.
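To make the preceding definitions concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) of the signed degrees, the signed Laplacian, and its normalized form:

```python
import numpy as np

def signed_laplacian(W):
    """W: symmetric (m x m) weight matrix with zero diagonal; entries may
    be negative (antonyms) or positive (synonyms / distributional similarity).
    Returns the signed Laplacian and its normalized version."""
    d = np.abs(W).sum(axis=1)          # signed degrees: d_i = sum_j |w_ij|
    L = np.diag(d) - W                 # positive semidefinite (Kunegis et al., 2010)
    d_is = 1.0 / np.sqrt(d)            # assumes no isolated vertices (d_i > 0)
    L_sym = L * np.outer(d_is, d_is)   # D^{-1/2} L D^{-1/2}
    return L, L_sym
```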

Optimization Problem
We now formulate K-way clustering of a signed graph using signed normalized cuts.
If we let
$$\mathcal{X} = \Big\{ [X^1 \, \cdots \, X^K] \;\Big|\; X^j = a_j(x^j_1, \ldots, x^j_m),\ x^j_i \in \{0, 1\},\ a_j \in \mathbb{R},\ a_j \neq 0 \Big\},$$
the resulting optimization problem is
$$\text{minimize} \quad \sum_{j=1}^{K} \frac{(X^j)^\top \bar L\, X^j}{(X^j)^\top \bar D\, X^j} \qquad \text{subject to} \quad (X^i)^\top \bar D\, X^j = 0 \ \ (i \neq j), \quad X \in \mathcal{X}.$$
The problem can be reformulated as an equivalent optimization problem:
$$\text{minimize} \quad \mathrm{tr}(X^\top \bar L\, X) \qquad \text{subject to} \quad X^\top \bar D\, X = I, \quad X \in \mathcal{X}.$$
We then form a relaxation of the above problem by dropping the condition that $X \in \mathcal{X}$, giving the relaxed problem:
$$\text{minimize} \quad \mathrm{tr}(Y^\top \bar L_{\mathrm{sym}}\, Y) \qquad \text{subject to} \quad Y^\top Y = I,$$
where $Y = \bar D^{1/2} X$. The minimum of the relaxed problem is achieved by the $K$ unit eigenvectors associated with the smallest eigenvalues of $\bar L_{\mathrm{sym}}$.

Finding an Approximate Discrete Solution
Given a solution $Z$ of the relaxed problem, we look for pairs $(X, Q)$, with $X \in \mathcal{X}$ and where $Q$ is a $K \times K$ matrix with nonzero, pairwise orthogonal columns, with $\|X\|_F = \|Z\|_F$, that minimize
$$\varphi(X, Q) = \|X - ZQ\|_F.$$
Here, $\|A\|_F$ is the Frobenius norm of $A$. This nonlinear optimization problem involves two unknown matrices $X$ and $Q$. To solve it, we proceed by alternating between minimizing $\varphi(X, Q)$ with respect to $X$ holding $Q$ fixed (step 5 in Algorithm 1), and minimizing $\varphi(X, Q)$ with respect to $Q$ holding $X$ fixed (steps 6 and 7 in Algorithm 1).
This second stage, in which $X$ is held fixed, has been studied, but it is still a hard problem for which no closed-form solution is known. Hence we divide the problem into steps 6 and 7, for which solutions are known. Since $Q$ is of the form $Q = R\Lambda$, where $R \in O(K)$ and $\Lambda$ is a diagonal invertible matrix, we minimize $\|X - ZR\Lambda\|_F$. The matrix $R\Lambda$ is not a minimizer of $\|X - ZR\Lambda\|_F$ in general, but it is an improvement on $R$ alone, and both stages can be solved quite easily. In step 6 the problem reduces to minimizing $-2\,\mathrm{tr}(Q^\top Z^\top X)$; that is, maximizing $\mathrm{tr}(Q^\top Z^\top X)$.
Algorithm 1 Signed Clustering
1: Input: the weight matrix $W$ (without isolated nodes), the number of clusters $K$, and the termination threshold $\epsilon$.
2: Using the signed degree matrix $\bar D$ and the signed Laplacian $\bar L$, compute the signed normalized Laplacian $\bar L_{\mathrm{sym}}$.
3: Initialize $\Lambda = I$ and $Z = \bar D^{-1/2} U$, where $U$ is the matrix of eigenvectors corresponding to the $K$ smallest eigenvalues of $\bar L_{\mathrm{sym}}$.
4: while $\|X - ZR\Lambda\|_F > \epsilon$ do
5:   Minimize $\|X - ZR\Lambda\|_F$ with respect to $X$, holding $Q = R\Lambda$ fixed.
6:   Fix $X$, $Z$, and $\Lambda$; find $R \in O(K)$ that minimizes $\|X - ZR\Lambda\|_F$.
7:   Fix $X$, $Z$, and $R$; find a diagonal invertible matrix $\Lambda$ that minimizes $\|X - ZR\Lambda\|_F$.
8: end while
9: Obtain the discrete solution $X^*$: for each row $i$, set the largest entry $x_{ij}$ to 1 and all other entries of row $i$ to 0.
10: Output: $X^*$.
Steps 3 through 10 may be replaced by standard K-means clustering. It should also be noted that by removing the requirement that $X^j \neq 0$, the algorithm can find $k \leq K$ clusters.
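Below is a minimal NumPy sketch of Algorithm 1, written as we read it; it is not the authors' released implementation. It uses a dense eigendecomposition where the paper uses randomized decomposition (Halko et al., 2011) with sparse matrices, and the step 6 and step 7 updates are the standard Procrustes and per-column least-squares solutions.

```python
import numpy as np

def signed_clustering(W, K, tol=1e-6, max_iter=200):
    """Sketch of Algorithm 1: signed normalized cuts with alternating
    refinement toward a discrete solution."""
    m = W.shape[0]
    d = np.abs(W).sum(axis=1)                     # signed degrees (no isolated nodes)
    d_is = 1.0 / np.sqrt(d)
    L_sym = np.eye(m) - W * np.outer(d_is, d_is)  # signed normalized Laplacian
    # Relaxed solution: K eigenvectors of L_sym with smallest eigenvalues,
    # mapped back through D^{-1/2} (step 3).
    _, U = np.linalg.eigh(L_sym)                  # eigh sorts eigenvalues ascending
    Z = d_is[:, None] * U[:, :K]
    R, Lam = np.eye(K), np.eye(K)
    prev_err = np.inf
    for _ in range(max_iter):
        # Step 5: best discrete X given Q = R Lam; one nonzero entry per row
        # (simplified here to 0/1 entries).
        M = Z @ R @ Lam
        X = np.zeros((m, K))
        X[np.arange(m), M.argmax(axis=1)] = 1.0
        # Step 6: best R in O(K) given X and Lam (Procrustes-style SVD step,
        # maximizing tr(R^T Z^T X Lam)).
        u, _, vt = np.linalg.svd(Z.T @ X @ Lam)
        R = u @ vt
        # Step 7: best diagonal Lam given X and R (per-column least squares).
        ZR = Z @ R
        Lam = np.diag((ZR * X).sum(axis=0) / (ZR * ZR).sum(axis=0))
        err = np.linalg.norm(X - ZR @ Lam)
        if abs(prev_err - err) < tol:             # step 4 termination test
            break
        prev_err = err
    return X.argmax(axis=1)                       # cluster label per node
```

Calling `signed_clustering(W_hat, K=750)` returns a cluster label for each word; as noted above, the refinement loop can be swapped for K-means on the rows of $Z$.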

Similarity Calculation
The main input to the signed spectral clustering algorithm is the similarity matrix $W$, which overlays both the distributional properties and the thesaurus information. Following Belkin and Niyogi (2003), we chose the heat kernel based on the Euclidean distance between word vector representations as our similarity metric, such that
$$W_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / \sigma} & \text{if } \|x_i - x_j\| < \epsilon \\ 0 & \text{otherwise,} \end{cases}$$
where $\sigma$ and $\epsilon$ are hyperparameters found using grid search (see supplemental material for more detail).
We represented the thesaurus as two matrices, where $T_{syn}$ is the synonym graph and $T_{ant}$ is the antonym graph. The signed graph can then be written in matrix form as $\hat W = \gamma W + \beta_{ant}\, T_{ant} \circ W + \beta_{syn}\, T_{syn} \circ W$, where $\circ$ denotes the Hadamard product (element-wise multiplication). The parameters $\gamma$, $\beta_{syn}$, and $\beta_{ant}$ are tuned to the target dataset using cross validation. The reader should note that $\sigma$ and $\epsilon$ are not found using a target dataset, but instead using cross validation and grid search to minimize the number of negative edges within clusters and the number of disconnected components in the clusters.
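The following sketch assembles $\hat W$ as just described; the heat-kernel and overlay terms follow the equations above, while the function name and the dense SciPy distance computation are our own. We assume $\beta_{ant}$ is chosen negative (or that $T_{ant}$ carries $-1$ entries) so that antonym edges end up with negative weight:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def build_signed_similarity(X, T_syn, T_ant, sigma, eps,
                            gamma, beta_syn, beta_ant):
    """X: (m x d) word embeddings; T_syn, T_ant: (m x m) 0/1 thesaurus graphs.
    Returns the signed weight matrix W_hat fed to signed clustering."""
    dist = squareform(pdist(X, metric='euclidean'))
    W = np.exp(-dist**2 / sigma)        # heat kernel (Belkin and Niyogi, 2003)
    W[dist >= eps] = 0.0                # sparsify: keep only nearby word pairs
    np.fill_diagonal(W, 0.0)
    # Thesaurus overlay via Hadamard products; beta_ant < 0 makes
    # antonym edges repellent (negative) links.
    return gamma * W + beta_ant * (T_ant * W) + beta_syn * (T_syn * W)
```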

Evaluation Metrics
We evaluated the clusters using both intrinsic and extrinsic methods. For intrinsic evaluation, we used thesaurus information for two novel metrics: 1) the number of negative edges (NNE) within the clusters, which in our semantic clusters is the number of antonyms in the same cluster, and 2) the number of disconnected components (NDC) in the synonym graph, i.e., the number of groups of words within a cluster that are not connected by a synonym relation in the thesaurus. The NDC thus has the disadvantage that it is a function of the thesaurus coverage. Our third intrinsic measure uses a gold standard designed to measure how well we capture word similarity: semantically similar words should be in the same cluster, and semantically dissimilar words should not. For extrinsic evaluation, as described below, we measure how much our clusters help to identify text polarity. We also compare multiple word embeddings and thesauri to demonstrate the stability of our method.
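As a concrete reading of the two intrinsic metrics, here is a short helper sketch (our code, not the paper's; `labels` assigns each word index a cluster, and the thesaurus graphs are 0/1 adjacency matrices):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def count_nne(labels, T_ant):
    """NNE: number of antonym edges whose endpoints share a cluster."""
    ant = sp.triu(sp.csr_matrix(T_ant), k=1).tocoo()   # count each pair once
    return int(np.sum(labels[ant.row] == labels[ant.col]))

def count_ndc(labels, T_syn):
    """NDC: synonym-graph connected components summed over clusters
    (clusters of size one each contribute one component)."""
    T = sp.csr_matrix(T_syn)
    total = 0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        n_comp, _ = connected_components(T[idx][:, idx], directed=False)
        total += n_comp
    return total
```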

Experiments with Synthetic Data
In order to evaluate our signed graph clustering method, we first focused on intrinsic measures of cluster quality on synthetic data. To do so, we created random signed graphs with the same proportion of positive and negative edges as in our real dataset; a sketch of the generator appears below.

[Figure 2: The relation between disconnected components (NDC) and negative edges (NNE) using simulated signed graphs with 100 vertices.]

Figure 2 demonstrates that the number of negative edges within a cluster is minimized using our clustering algorithm on simulated data. As the number of clusters becomes large, the number of disconnected components, which includes clusters of size one, consistently increases. Determining the optimal cluster size and similarity parameters requires a trade-off between NDC and NNE. For example, in figure 2 the optimal cluster size is 20. As the number of clusters increases, NNE goes to zero, but the number of disconnected components approaches the number of vertices; in the extreme case, every cluster contains a single vertex. K-means, also shown in figure 2, does not optimize NNE.
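A minimal way to generate such random signed graphs (our sketch; `p_edge` and `p_neg` would be matched to the edge and antonym proportions observed in the real data):

```python
import numpy as np

def random_signed_graph(m=100, p_edge=0.1, p_neg=0.05, seed=0):
    """Random symmetric signed graph: each node pair gets an edge with
    probability p_edge; an edge is made negative with probability p_neg."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(m, k=1)
    n_pairs = len(iu[0])
    has_edge = rng.random(n_pairs) < p_edge
    sign = np.where(rng.random(n_pairs) < p_neg, -1.0, 1.0)
    weight = rng.random(n_pairs)               # positive magnitudes
    W = np.zeros((m, m))
    W[iu] = has_edge * sign * weight
    return W + W.T                             # symmetric, zero diagonal
```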

Experimental Setup

Word Embeddings
We used four different word embedding methods for evaluation: Skip-gram vectors (word2vec) (Mikolov et al., 2013), Global vectors (GloVe) (Pennington et al., 2014), Eigenwords (Dhillon et al., 2015), and Global Context (GloCon) (Huang et al., 2012); however, we only report results for word2vec, the most popular of these embeddings (see the supplemental material for the others). We used 300-dimensional word2vec embeddings trained on several billion words of English: Gigaword and the English discussion forum data gathered as part of BOLT. Tokenization was performed using CMU's Twokenize.

Thesauri
Several thesauri were used in order to test robustness, including Roget's Thesaurus (Roget, 1852), the Microsoft Word English (MS Word) thesaurus from Samsonovic et al. (2010), and WordNet 3.0 (Miller, 1995).
We chose a subset of 5108 words for the training dataset, which had high overlap between the various sources. Changes to the training dataset had minimal effects on the optimal parameters. Within the training dataset, each of the thesauri had roughly 3700 antonym pairs; combined, they had 6680. However, the number of distinct connected components varied, with Roget's Thesaurus having the fewest (629), followed by the MS Word thesaurus (1162) and WordNet (2449). These ratios were consistent across the full dataset.

Gold Standard: SimLex-999 and SimVerb-3500

Following the analysis of Vlachos et al. (2009), we thresholded the semantic similarity datasets to find word pairs which should or should not belong to the same cluster. As ground truth, we extracted 120 semantically similar word pairs from SimLex-999 with a similarity score greater than 8 out of 10. SimLex-999 is a gold standard resource for semantic similarity, not relatedness, based on ratings by human annotators. Our 120-pair subset of SimLex-999 spans multiple parts of speech, including noun-noun, verb-verb, and adjective-adjective pairs. Within SimVerb-3500, we used a subset of 318 semantically similar verb pairs.
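A sketch of this pair-based evaluation (our helper code; we assume the released tab-separated SimLex-999 file with columns `word1`, `word2`, and `SimLex999`):

```python
import csv

def load_similar_pairs(path, threshold=8.0):
    """Word pairs rated above the threshold in SimLex-999."""
    with open(path) as f:
        return [(r['word1'], r['word2'])
                for r in csv.DictReader(f, delimiter='\t')
                if float(r['SimLex999']) > threshold]

def same_cluster_accuracy(pairs, word2cluster):
    """A highly similar pair counts as correct if both words fall in the
    same cluster; pairs with out-of-vocabulary words are skipped."""
    covered = [(a, b) for a, b in pairs
               if a in word2cluster and b in word2cluster]
    return sum(word2cluster[a] == word2cluster[b]
               for a, b in covered) / len(covered)
```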
The community is attempting to define better gold standards; however, these are currently the best datasets that we are aware of. We tried to use WordNet, Roget's, and the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013) as gold standards, but manual inspection as well as empirical results showed that none of the automatically generated datasets was a sufficient gold standard. Possibly the symmetric patterns of Schwartz et al. (2015) would have been sufficient; we did not have time to validate this.

Stanford Sentiment Treebank
We also evaluated our clusters by using them as features for predicting sentiment, using the Stanford Sentiment Treebank (Socher et al., 2013) with coarse-grained labels on phrases and sentences from movie review excerpts. This dataset is widely used for the evaluation of sentiment analysis. We used the standard partition of the treebank into training (6920), development (872), and test (1821) sets. Table 1 shows the four words most associated with "accept" under different methods.

Cluster Evaluation
We now turn to quantitative measures of word similarity and synonym cluster quality.

Comparison with K-means and Normalized Cuts
In order to assess the model, we tested (1) K-means, (2) normalized cuts without a thesaurus, and (3) signed normalized cuts. As a baseline, we created clusters using K-means on the original word2vec vector representations with the number of clusters K set to 750. Table 2 shows the relative ratios of the different clustering methods with respect to antonym pair inclusion and the number of disconnected components within the clusters. For both baseline methods, over twenty percent of the clusters contain antonym pairs even though the median cluster size is six. Signed clustering radically reduced the number of antonyms within clusters compared to the other methods.

Table 1: The four most-associated words with "accept" using different methods.

Ref word | Roget            | WordNet | MS Word | W2V       | SC W2V
accept   | adopt            | agree   | take    | accepts   | grant
         | accept your fate | get     | swallow | reject    | permit
         | be fooled by     | fancy   | consent | agree     | let
         | acquiesce        | hold    | assume  | accepting | okay

[Table 2: Clustering evaluation of K-means, normalized cuts, and signed normalized cuts with 750 clusters. Ratio of clusters containing one or more antonym pairs, and ratio of clusters with disconnected components.]

Empirical Results
Tables 3 and 5 present our main results. When using our signed clustering method with similar words, as labeled by SimLex-999 and SimVerb-3500, our clustering accuracy increased by 5% on both datasets. Furthermore, by combining thesaurus lookup with our clustering, we achieved almost perfect accuracy (96%). Table 5 shows performance on the sentiment analysis task. Our method outperforms all methods of similar complexity; however, we did not reach state-of-the-art results when compared to much more complex models that also use richer data.

Evaluation Using Word Similarity Datasets
In a perfect setting, all word pairs rated highly similar by human annotators would be in the same cluster, and all word pairs rated dissimilar would be in different clusters. Since our clustering algorithm produces sets of words, we used this evaluation instead of the more commonly reported correlations.
In table 3 we show the results of the evaluation with SimLex-999. Combining thesaurus lookup and the word2vec+CombThes clusters, labeled as Lookup + SC(W2V), yielded an accuracy of 0.96 (5 errors). Note that clustering word2vec with standard normalized cuts does not improve accuracy. The MSW thesaurus has much lower coverage, but 100% accuracy. Table 4 clearly shows that the overall performance of all methods is lower for verb similarity. However, the improvement from using both signed clustering and thesaurus lookup is also larger.

Sentiment Analysis
We trained an $l_2$-norm regularized logistic regression (Friedman et al., 2001), simultaneously fitting $\gamma$, $\beta_{syn}$, and $\beta_{ant}$, using our word clusters to predict the coarse-grained sentiment at the sentence level. The $\gamma$ and $\beta$ parameters were found using a portion of the data, iteratively alternating between fitting the logistic regression and the parameters, holding each fixed in turn. However, the hyperparameters $\sigma$ and $\epsilon$ and the number of clusters $K$ were optimized by minimizing error using grid search.

[Table 4: Clustering evaluation using SimVerb-3500 with 317 word pairs having similarity scores over 8. SC stands for our signed clustering and NC is standard normalized cuts. SC(W2V) are the word clusters from signed clustering using word2vec and the combined thesauri.]

We compared our model against existing models: Naive Bayes with bag of words (NB) (Socher et al., 2013), sentence word embedding averages (VecAvg), retrofitted sentence word embeddings (RVecAvg) (Faruqui et al., 2015) that incorporate thesaurus information, simple recurrent neural networks (RNN), and two baselines of normalized cuts and signed normalized cuts using only thesaurus information.
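A sketch of the resulting pipeline (our reading of the setup, using scikit-learn; the bag-of-clusters featurization is an assumption about the exact feature encoding):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bag_of_clusters(tokens, word2cluster, K):
    """Sentence feature vector: counts of tokens falling in each cluster."""
    x = np.zeros(K)
    for tok in tokens:
        c = word2cluster.get(tok.lower())
        if c is not None:
            x[c] += 1.0
    return x

def train_sentiment(sentences, labels, word2cluster, K):
    """l2-regularized logistic regression over cluster-count features."""
    X = np.array([bag_of_clusters(s, word2cluster, K) for s in sentences])
    clf = LogisticRegression(penalty='l2', C=1.0, max_iter=1000)
    return clf.fit(X, labels)
```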
While the state-of-the-art Convolutional Neural Network (CNN) (Kim, 2014) is at 0.881, our model performs quite well with much less information and complexity. Table 5 shows that signed clustering outperforms the baselines of Naive Bayes, normalized cuts, and signed cuts using only thesaurus information. Furthermore, we outperform comparable models, including retrofitting, which has thesaurus information, and the recurrent neural network, which has access to domain-specific context information.
Signed clustering using only thesaurus information (SC(Thes)) performed significantly worse than all other methods, largely due to low coverage; rare tokens such as "WOW" and "???" are not covered. As expected, because normalized cut clusters include antonyms, that method performs worse than the others. Nonetheless, the improvement from 0.79 to 0.836 is substantial.

Conclusion
We developed a novel theory for signed normalized cuts and an algorithm for finding their discrete solution. We showed that we can find superior semantically similar clusters which do not require new word embeddings but simply overlay thesaurus information on preexisting ones. The clusters are general and can be used with many out-of-the-box word embeddings. By accounting for antonym relationships, our algorithm greatly outperforms simple normalized cuts. Finally, we evaluated our clustering method on the sentiment analysis task of the Socher et al. (2013) sentiment treebank and showed that it improves performance over comparable models. Our automatically generated clusters give better coverage than manually constructed thesauri. Our signed spectral clustering method allows us to incorporate the knowledge contained in these thesauri without modifying the word embeddings themselves. We further showed that use of the thesauri can be tuned to the task at hand.
Our signed spectral clustering method could be applied to a broad range of NLP tasks, such as prediction of social group clustering, identification of personal versus non-personal verbs, and analyses of clusters which capture positive, negative, and objective emotional content. It could also be used to explore multi-view relationships, such as aligning synonym clusters across multiple languages. Another possibility is to use thesauri and word vector representations together with word sense disambiguation to generate semantically similar clusters for multiple senses of words. Furthermore, signed spectral clustering has broader applications such as cellular biology, social networking, and electricity networks. Finally, we plan to extend the hard signed clustering presented here to probabilistic soft clustering.