Specializing Word Vectors by Spectral Decomposition on Heterogeneously Twisted Graphs

Traditional word vectors, such as word2vec and GloVe, have a well-known tendency to conflate semantic similarity with other semantic relations. A retrofitting procedure may be needed to solve this issue. In this work, we propose a new retrofitting method called Heterogeneously Retrofitted Spectral Word Embedding (HRSWE). It heterogeneously twists the similarity matrix of word pairs with lexical constraints. A new set of word vectors is then generated by a spectral decomposition of the twisted similarity matrix, which has a linear-algebraic analytic form. Our method is competitive with state-of-the-art retrofitting methods such as AR (CITATION). In addition, since our embedding has a clear linear-algebraic relationship with the similarity matrix, we carefully study the contribution of each component of our model. Last but not least, our method is very efficient to execute.


Introduction
Word embedding is one of the core research areas in natural language processing. Its usefulness has been demonstrated in a wide variety of NLP tasks, e.g. dependency parsing, sentiment analysis, and machine reading comprehension. Modern embedding methods are usually based on the distributional hypothesis, namely, that words occurring in similar contexts tend to have similar meanings. Although distributional word vectors perform well on many tasks, they have a well-known tendency to conflate semantic similarity with semantic relatedness (Hill et al., 2015). Therefore, the similarity between word vectors does not reflect the precise semantic relation between word pairs, but only a semantic association (Yih et al., 2012). For instance, if two words are antonyms, their word vectors can be very close geometrically, which makes it very hard to distinguish one word from the other.
One way to solve this problem is to inject lexical constraints, such as antonym relationships, into the word vectors, with the aim of moving antonyms far apart from each other. This process is often referred to as semantic specialization. There are two kinds of semantic specialization methods: (1) joint specialization methods, in which word vectors are trained from scratch by incorporating the lexical knowledge into the learning objective of the distributional model. For example, Pham et al. (2015) inject a synonym/antonym margin loss into the skip-gram objective to enforce that antonym pairs have low similarity while synonym pairs have high similarity; (2) retrofitting methods (also referred to as post-processing methods), in which pre-trained word vectors are fine-tuned by injecting the lexical information into the vector space. The ATTRACT-REPEL (AR) algorithm pushes apart or pulls together a pair of words by a margin relative to its corresponding negative samples. Another retrofitting approach is to inject lexical constraints into the word similarity matrix and obtain the tuned word vectors via a matrix decomposition. The method proposed by Sedoc et al. (2017) is in this line of research. Generally speaking, retrofitting methods perform better (Mrkšic et al., 2016), while joint specialization methods can specialize all words.
In this paper, we want to address the following question: Can we design a retrofitting method that is interpretable and competitive with the state-of-the-art methods? To this end, we propose a novel model called HRSWE (Heterogeneously Retrofitted Spectral Word Embedding). The basic idea is that if the difference between the similarity of an antonym pair and the minimum similarity of all pairs is large (small), the weight of the lexical constraint on this antonym pair will be high (low). A similar idea is applied to synonym pairs. Moreover, words i and k will tend to be synonyms if they have a common synonym (or a common antonym) j. On the other hand, i and k will tend to be antonyms if (i, j) are synonyms and (j, k) are antonyms, or the other way around. This phenomenon is called contagion, and it is modeled via the matrix multiplication of thesauri matrices. After the similarity matrix is constructed, we apply a spectral decomposition to it to obtain the specialized embedding. We also care about whether the performance of a specialization method still holds when the thesauri used by the method are perturbed, a property we call robustness. In this paper, we explore a few perturbation methods. Overall, HRSWE is slightly better than AR in terms of robustness under these perturbations. The contributions of our method are as follows.
• Foremost, our method is more interpretable than AR in three respects. First, our embedding has a clear algebraic relationship with the original word embedding, while ATTRACT-REPEL does not have this property. In particular, our embedding can be written in analytic form in terms of the injected similarity matrix. Second, the importance of the synonym and antonym information injected into the similarity matrix is quantified by experiments. Finally, the significance of the contagion information is demonstrated by experiments as well.
• In terms of performance, our method not only performs much better than Word2Vec but is also competitive with the state-of-the-art method ATTRACT-REPEL on three tasks. Furthermore, our method is slightly more robust than ATTRACT-REPEL. In terms of running time, our method is faster than ATTRACT-REPEL by at least one order of magnitude, which makes it appealing in practice.

The Methodology
Let $V = \{v_1, v_2, \ldots, v_n\}$ be the vocabulary set, $S = \{(v_i, v_j) \mid v_i \text{ is a synonym of } v_j\}$ be the synonym set, and $A = \{(v_i, v_j) \mid v_i \text{ is an antonym of } v_j\}$ be the antonym set, where $n$ is the number of words. The original word vector set is $\{x_1, \ldots, x_n\}$, where $x_i \in \mathbb{R}^d$ for all $i$. The word vector matrix $X = [x_1 | \cdots | x_n] \in \mathbb{R}^{d \times n}$ is obtained by stacking the $d$-dimensional original word vectors horizontally as columns. Then, the similarity matrix is defined as $W = X^T X$.
Next, the synonym and antonym thesauri information $S_0$ and $A_0$ are introduced, where [...]. After that, we consider the thesauri contagion information, which is defined in Figure 1. Two relations of the same type that share a common word produce a synonym relation; otherwise, an antonym relation is induced. Let the original thesauri be [...], where $a$ and $b$ are two positive hyperparameters. Given the definition of the contagion, the similarity between words $j$ and $k$ can be modeled as [...], considering all words. This is exactly a matrix multiplication. Then, we extract the synonym and antonym contagion information $S_1$ and $A_1$ as follows: [...]. Finally, we combine $S_0$, $A_0$, $S_1$, and $A_1$ with $W$ to obtain the thesauri-injected similarity matrix $\hat{W}$: [...], where $\odot$ is the Hadamard product (elementwise multiplication). $W_{max}$ and $W_{min}$ are the maximum and minimum entries of $W$; they are used as the similarity baselines of synonym pairs and antonym pairs, respectively, to guide the thesauri injection. $\beta_0$, $\beta_1$, and $\beta_2$ are hyperparameters. Note that the weight of the lexical constraint injected into a word pair depends on the similarity computed from its original word vectors. For instance, the weight injected for a synonym pair with lower original similarity (as given by $W$) is larger than that for another synonym pair with higher original similarity. Thus, the weights are heterogeneous.
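The exact forms of the thesauri matrices, the contagion terms, and $\hat{W}$ are not spelled out in the prose above, so the following NumPy sketch is only one plausible reading of the description and not necessarily the formulation used in the experiments. It assumes 0/1 indicator matrices for the thesauri, a signed combination `T = a * S0 - b * A0`, contagion taken from the positive and negative parts of `T @ T`, and a final combination in which synonym entries are pulled toward `W.max()` and antonym entries are pushed toward `W.min()`. The function name `twist_similarity` and its argument layout are hypothetical.

```python
import numpy as np

def twist_similarity(X, syn_pairs, ant_pairs,
                     beta0=1.0, beta1=1.0, beta2=1.0, a=1.0, b=1.0):
    """X: (d, n) matrix with word vectors as columns; pairs are index tuples (i, j)."""
    n = X.shape[1]
    W = X.T @ X                          # similarity matrix W = X^T X

    # Assumed encoding: symmetric 0/1 indicator matrices for the thesauri.
    S0 = np.zeros((n, n))
    A0 = np.zeros((n, n))
    for i, j in syn_pairs:
        S0[i, j] = S0[j, i] = 1.0
    for i, j in ant_pairs:
        A0[i, j] = A0[j, i] = 1.0

    # Assumed signed thesaurus: synonyms positive, antonyms negative.
    T = a * S0 - b * A0
    # Contagion through a shared word j: sum_j T[i, j] * T[j, k] = (T @ T)[i, k].
    C = T @ T
    np.fill_diagonal(C, 0.0)             # drop each word's relation to itself
    S1 = np.maximum(C, 0.0)              # two relations of the same type -> synonym
    A1 = np.maximum(-C, 0.0)             # one synonym and one antonym -> antonym

    # Heterogeneous weights: low-similarity synonyms and high-similarity
    # antonyms receive the largest corrections.
    syn_term = (W.max() - W) * (S0 + S1)
    ant_term = (W - W.min()) * (A0 + A1)
    return beta0 * W + beta1 * syn_term - beta2 * ant_term
```

Under these assumptions, with all hyperparameters set to 1 a direct synonym pair is moved exactly to `W.max()` and a direct antonym pair exactly to `W.min()`, which matches the role of the two baselines described above.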
Recall that our goal is to construct $d$-dimensional specialized word vectors $\hat{V} \in \mathbb{R}^{d \times n}$. Given $\hat{W}$, this can be achieved by solving
$$\min_{\hat{V} \in \mathbb{R}^{d \times n}} \| \hat{W} - \hat{V}^T \hat{V} \|_F^2 .$$
The problem is equivalent to finding a matrix $\hat{W}_{SD}$ such that
$$\hat{W}_{SD} = \arg\min_{M} \| \hat{W} - M \|_F^2 \quad \text{s.t.} \quad M \succeq 0, \ \operatorname{rank}(M) \le d ,$$
where the notation $\hat{W}_{SD} \succeq 0$ means that $\hat{W}_{SD}$ is symmetric and positive semidefinite (SPSD).
Since $\hat{W}$ is a symmetric matrix, it has a truncated spectral decomposition
$$\hat{W}_d = Q_d \Lambda_d Q_d^T ,$$
where $\Lambda_d$ is a diagonal matrix containing the $d$ largest eigenvalues of $\hat{W}$ (counted with multiplicity) and the columns of $Q_d$ are the $d$ corresponding eigenvectors. According to Dax (2014), this nearest low-rank SPSD matrix problem has an analytic optimal solution
$$\hat{W}_{SD} = Q_d \Lambda_d^{+} Q_d^T ,$$
where $\Lambda_d^{+}$ is obtained by replacing the negative values of $\Lambda_d$ with 0. Thus, $\hat{V}$ is
$$\hat{V} = (\Lambda_d^{+})^{1/2} Q_d^T .$$
The time complexity of this method is $O(n^2 d)$, and the method is summarized in Algorithm 1. For a large $n$, one can use matrix sparsification methods, Nyström methods, and GPUs to accelerate the eigendecomposition.
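The decomposition step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a dense symmetric input and using the full eigendecomposition `np.linalg.eigh`, which costs $O(n^3)$ rather than the $O(n^2 d)$ quoted above; a truncated solver such as `scipy.sparse.linalg.eigsh` could be substituted for large vocabularies. The function name `spectral_embedding` is hypothetical.

```python
import numpy as np

def spectral_embedding(W_hat, d):
    """Return V_hat of shape (d, n) such that V_hat.T @ V_hat is the
    nearest PSD matrix of rank at most d to the symmetric matrix W_hat."""
    W_sym = (W_hat + W_hat.T) / 2.0            # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(W_sym)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]        # indices of the d largest eigenvalues
    lam_plus = np.clip(eigvals[top], 0.0, None)  # negative eigenvalues set to 0
    Q_d = eigvecs[:, top]                      # (n, d) matrix of eigenvectors
    return np.sqrt(lam_plus)[:, None] * Q_d.T  # sqrt(Lambda_d^+) @ Q_d^T

# Usage: a matrix with one negative eigenvalue, embedded in d = 2 dimensions.
V_hat = spectral_embedding(np.diag([2.0, 1.0, -3.0]), 2)
```

Column `i` of the returned matrix is the specialized vector of word `i`. Eigenvectors are only determined up to sign, so individual coordinates may flip between runs or platforms, while the inner products `V_hat.T @ V_hat` are unaffected.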

Experiments
In this section, we evaluate the methods on four tasks: word similarity, synonym/antonym classification, lexical simplification, and a robustness test.
Basic Setup. To evaluate the effectiveness of our method, we use Word2Vec (Mikolov et al., 2013) as the original embedding. Specifically, we choose the 300-dimensional skip-gram vectors and denote this original embedding as SGNS-GN. The synonym and antonym relationships in Vulić (2018) are adopted as the lexical constraints. We call this set Ω. Throughout the paper, the main baseline of our method is AR. Although the method proposed by Sedoc et al. (2017) is sensitive to the thesauri according to our private communication, we also try their homogeneous graph construction method and apply a spectral decomposition to this graph. This benchmark is called RSWE. All our experiments are carried out on a server with an NVIDIA RTX 2080 Ti GPU and an Intel i9-7940x CPU with 32 GB of RAM.

Word Similarity
The word similarity task is a standard evaluation task for word embeddings. There are several datasets containing pairs of words whose semantic similarities are labeled by humans. On the other hand, the similarity of a pair of words can be predicted from the word embeddings, e.g. by the cosine similarity between the word vectors of the pair. By computing Spearman's ρ rank correlation between the human scores and our predictions, one can measure how well a word embedding models semantic similarity. We evaluate our methods on two recent datasets: SimLex-999 (Hill et al., 2015) and SimVerb-3500 (Gerz et al., 2016). In the following, we describe how we validate and test the models. For HRSWE, the vocabulary set $V$ is composed of all the words in SimLex-999 and SimVerb-3500. Meanwhile, the lexical constraint sets $S$ and $A$ are constructed in the following way: if $v_i, v_j \in V$, $v_i \neq v_j$, and the pair $(v_i, v_j)$ belongs to our constraint set Ω, then $(v_i, v_j)$ is added to $S$ or $A$ correspondingly. These models do not need to be trained; the only thing to do is to choose the hyperparameters $\beta_0$, $\beta_1$, $\beta_2$, $a$, and $b$. To quantify how the different kinds of information affect the quality of the specialized embedding, we also evaluate two degenerate cases of HRSWE (called HRSWE-1 and HRSWE-2; the complete model is HRSWE-3). Neither of them uses the contagion information, and HRSWE-1 uses only one hyperparameter, with $\beta_0 = 1$ and $\beta_1 = \beta_2$. RSWE-2 and RSWE-3 are similar to HRSWE-2 and HRSWE-3 except that $W_{max} - W$ and $W - W_{min}$ are replaced with $W$. The state-of-the-art retrofitting model ATTRACT-REPEL is trained on the same vocabulary $V$ and lexical constraint sets $S$ and $A$ as our method. This is quite different from its original implementation in Vulić (2018), where its lexical constraint set is Ω. This makes our implementation more like a customization method, in which only the words and constraints related to the task are used.
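The evaluation protocol described above (cosine similarity between word vectors, scored against human ratings with Spearman's ρ) can be sketched in plain Python. This is an illustrative sketch rather than the evaluation script used for the reported numbers; the function names and the `(word1, word2, human_score)` tuple layout are assumptions, and ties are handled by average ranks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate_similarity(vectors, scored_pairs):
    """vectors: dict word -> list of floats; scored_pairs: (w1, w2, human_score)."""
    predicted = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in scored_pairs]
    gold = [score for _, _, score in scored_pairs]
    return spearman_rho(predicted, gold)
```

Because only the ranking of the pairs matters, any monotone rescaling of the cosine scores leaves ρ unchanged, which is why the measure is insensitive to the overall scale of the specialized vectors.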
In terms of hyperparameter selection, all HRSWE hyperparameters take values in $[0, 1]$, and the same ranges are applied to RSWE. The hyperparameter ranges of AR are $[0, 1]$, $[0, 1]$, $\{64, 128, 256\}$, $\{1, 2, \ldots, 20\}$, and $[10^{-9}, 10^{0}]$ for the synonym margin, antonym margin, mini-batch size, number of epochs, and regularization strength, respectively. Bayesian optimization with Gaussian processes is employed to search the hyperparameters of all methods. All models are tuned on the validation set of SimVerb-3500. The test results are summarized in Table 1. First, we focus on the quality of the HRSWE variants. HRSWE-3 performs comparably to AR: on SimVerb it outperforms AR by about 1 point, while on SimLex it is worse than AR by 1-2 points. Furthermore, our ablation experiments show why our method can compete with the state-of-the-art method. Intuitively, the number of hyperparameters is important. For instance, HRSWE-2 has three hyperparameters while HRSWE-1 has only one. Given the additional flexibility to tune the weights of the original, synonym, and antonym information, HRSWE-2 yields a performance boost over HRSWE-1. The corresponding hyperparameter values are $\beta_0 = 0.70$, $\beta_1 = 0.29$, and $\beta_2 = 1.0$. This shows that the performance boost comes from decreasing the contribution of the original information and increasing the contribution of the antonym information. Most importantly, the contagion information is a necessity under our setting. With its aid, the performance of HRSWE-3 exceeds that of HRSWE-2 by about 2 points, which puts our method on a par with AR. The performance of RSWE is much poorer than that of HRSWE, which demonstrates the advantage of the heterogeneous twisting. Next, we turn to the efficiency of the methods.
Given $V$, $X$, $A$, $S$, the search ranges of the hyperparameters, and the number of search rounds, we record the average computation time needed to generate the specialized embeddings, which we call the specializing time. Our method is much more efficient than AR. The specializing time of HRSWE is about 0.8 s, which is nearly 50 times shorter than that of AR. Note that all our runs of ATTRACT-REPEL use the GPU.

Synonym/Antonym Classification
The synonym/antonym classification task is a binary classification task to decide whether pairs of words are synonyms or antonyms. Given the word embeddings of a pair of words and a threshold γ, if the cosine similarity of the pair is higher than the given threshold, this pair is regarded as synonyms; otherwise, they are antonyms. We evaluate our methods on a recent dataset proposed by Nguyen et al. (2017b). It has a validation set and a test set. Both sets consist of noun pairs, verb pairs, and adjective pairs.
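The threshold rule above can be sketched as follows. The text does not state how γ is selected or which class the reported F1 treats as positive, so the choices here (γ picked from a candidate list by maximizing F1 on validation pairs, with synonyms as the positive class) are assumptions made for illustration, as are the function names.

```python
def classify_pairs(similarities, gamma):
    """Label a pair 'syn' if its cosine similarity exceeds gamma, else 'ant'."""
    return ["syn" if s > gamma else "ant" for s in similarities]

def f1_score(gold, pred, positive="syn"):
    """F1 of the positive class; returns 0.0 when there are no true positives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(similarities, gold, candidates):
    """Assumed selection rule: the candidate gamma with the highest validation F1."""
    return max(candidates,
               key=lambda g: f1_score(gold, classify_pairs(similarities, g)))

# Usage on toy validation data (values are made up for illustration).
val_sims = [0.9, 0.7, 0.2, -0.1]
val_gold = ["syn", "syn", "ant", "ant"]
gamma = best_threshold(val_sims, val_gold, [-0.5, 0.5, 0.8])
```

The selected `gamma` would then be held fixed when labeling the test pairs, so that no test information leaks into the threshold.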
For HRSWE, the vocabulary set $V$ consists of all words in the validation set and test set of Nguyen et al. (2017b). The lexical constraints are constructed in the same way as in the word similarity task. For ATTRACT-REPEL, the vocabulary and the lexical constraints are the same as for our method. The hyperparameter ranges of HRSWE are $\beta_0 \in [0, 1]$, $\beta_1 \in [10^{-1}, 10^{1}]$, $\beta_2 \in [10^{-1}, 10^{1}]$, $a \in [0, 1]$, and $b \in [0, 1]$. The same ranges are applied to RSWE. The hyperparameter ranges of AR are the same as those in the word similarity task. The test results are listed in Table 2. From the specialization quality perspective, HRSWE-3 is competitive with AR: the F1 difference between the two models is within 1 point. We also analyze the components of HRSWE to explain why it specializes embeddings so well. As in the word similarity task, more hyperparameters enhance performance. For instance, HRSWE-2 improves the performance on Verb and Noun by 3-4 points compared with HRSWE-1. The corresponding values are $\beta_0 = 0.0$, $\beta_1 = 1.87$, and $\beta_2 = 7.25$. Surprisingly, $\beta_0$ is 0.0 in this test. One possible reason is that 85% of the task tuples are covered by thesauri tuples. Meanwhile, the contagion information is also crucial to this task. Without it, HRSWE-3 could not surpass HRSWE-2 by 2 points. In addition, the F1 score of HRSWE is higher than that of RSWE by about 3 points, which accentuates the need for the heterogeneous twisting. From the efficiency perspective, the specializing time of HRSWE is around 7-8 times shorter than that of AR.

Lexical Simplification
We now evaluate HRSWE on a downstream task called lexical simplification. The goal of this task is to replace complex words, which are used less frequently and known to fewer speakers, with simpler and more frequently used synonyms. For instance, given the sentence "the notorious pirate won the match", one may expect the word "notorious" to be replaced by a simpler word such as "infamous". We choose the dataset crowdsourced by Horn et al. (2014) as the task data. It contains 500 sentences, each of which has one target word. Each target word has a candidate set. Simplification models are expected to replace the target words with words or phrases that are in the candidate sets. The 500 sentences are split equally into validation and test sets. The LIGHT-LS model (Glavaš and Štajner, 2015) is adopted as the simplification model.
For HRSWE, the vocabulary $V$ is prepared as follows. First, we exclude the phrases in all candidate sets. Since LIGHT-LS retrieves simpler words from the embedding space and many phrases are not in that space, the phrases in the candidate sets would not be retrieved and are therefore removed from our vocabulary. Second, we lemmatize all the target words and the remaining words in the candidate sets, since lemmatized words are found in the constraints more easily. Finally, the lemmatized words and the words in the sentences are added to $V$. The lexical constraints are constructed in the same way as in the word similarity task. For ATTRACT-REPEL, the vocabulary and the lexical constraints are the same as for our method. The hyperparameter ranges of HRSWE are the same as in the synonym/antonym classification task. The hyperparameter ranges of ATTRACT-REPEL are the same as in the word similarity task except that the range of the mini-batch size is $\{32, 64, 128, 256, 512, 1024\}$. The test results are listed in Table 3. From the specialization quality perspective, HRSWE-2 surpasses AR by about 1 point. This might be because fewer hyperparameters can be tuned better given the same number of hyperparameter search rounds. From the efficiency perspective, HRSWE is about 25 times faster than AR. To summarize, HRSWE has a competitive specialization quality compared with AR while running dramatically faster.

The Robustness Test
Figure 2: Results (Spearman's ρ) of HRSWE and AR on two word similarity datasets with respect to three types of perturbations. The first and second rows represent results on the SimLex and SimVerb-test datasets. The first, second, and third columns represent results with the Syn-Adv, Ant-Adv, and Syn-Ant-Adv perturbations, respectively. The red lines are the results of HRSWE. The x-axis is r, the proportion of the subset in the intersection.
Can specialization methods still produce high-quality embeddings when the thesauri are perturbed? In this section, we evaluate the robustness of our method and AR on the word similarity task and the synonym/antonym classification task given three types of perturbed thesauri. The perturbation methods are described as follows.
First, we extract the word pairs from the validation and test sets of a particular task and take the union of both sets. Second, the union set is intersected with the synonym part of the thesauri. Finally, a random subset of the intersection is moved into the antonym part of the thesauri. The proportion of the subset in the intersection is denoted as r. This perturbation method is called Syn-Adv. Analogously, one can move part of the antonym intersection into the synonym thesauri, which is denoted as Ant-Adv. By combining Syn-Adv and Ant-Adv, we obtain the Syn-Ant-Adv perturbation.
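The three perturbations described above can be sketched as a single function. This is a minimal illustration under stated assumptions: word pairs are treated as unordered and represented as `frozenset`s, the subset size is `round(r * size_of_intersection)`, and the function name, `mode` strings, and `seed` argument are hypothetical rather than taken from the experimental code.

```python
import random

def perturb_thesauri(synonyms, antonyms, task_pairs, r, mode="syn-adv", seed=0):
    """Move a fraction r of the task-relevant pairs to the opposite thesaurus.

    synonyms, antonyms, task_pairs: iterables of frozenset({w1, w2});
    task_pairs is the union of the validation and test pairs of the task.
    """
    rng = random.Random(seed)
    syn, ant, task = set(synonyms), set(antonyms), set(task_pairs)

    def pick(source):
        # Sort the intersection so that sampling is reproducible for a given seed.
        overlap = sorted(source & task, key=sorted)
        return set(rng.sample(overlap, round(r * len(overlap))))

    flip_syn = pick(syn) if mode in ("syn-adv", "syn-ant-adv") else set()
    flip_ant = pick(ant) if mode in ("ant-adv", "syn-ant-adv") else set()
    new_syn = (syn - flip_syn) | flip_ant   # flipped antonyms become synonyms
    new_ant = (ant - flip_ant) | flip_syn   # flipped synonyms become antonyms
    return new_syn, new_ant
```

Restricting the flipped pairs to the intersection with the task data makes the perturbation adversarial: every corrupted constraint directly contradicts a pair on which the embedding is later evaluated.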
We first evaluate the robustness of HRSWE and AR on the word similarity tasks. The initial $V$, $A$, and $S$ are constructed in the same way as in the previous word similarity task. The hyperparameter ranges of HRSWE and AR are almost the same as in the previous word similarity task, except that the range of the mini-batch size is $\{32, 64, 128, 256, 512, 1024\}$. The proportion r varies from 0.2 to 0.5. The test results are presented in Figure 2. Interestingly, HRSWE achieves better performance in most cases (5 out of 6) over the 3 types of perturbation. Under the Syn-Adv and Syn-Ant-Adv perturbations, HRSWE outperforms AR by 0.6-12 points on SimLex and SimVerb-test. As the proportion increases, the performance gap between the two methods increases. Nonetheless, AR wins on SimLex perturbed via Ant-Adv by around 2 points on average. This phenomenon might be related to the fact that the antonym intersection covers only about 8% of the SimLex word pairs, which is too low.
We then evaluate the robustness of HRSWE and AR on the synonym/antonym classification task. The hyperparameter ranges of AR are the same as those in the word similarity perturbation tasks, while the ranges of HRSWE are almost the same as those in the previous synonym/antonym classification task, except that $\beta_1 \in [10^{-1}, 10^{1}]$ and $\beta_2 \in [10^{-1}, 10^{1}]$ are replaced with $\beta_1 \in [0, 1]$ and $\beta_2 \in [0, 1]$ in the Syn-Adv perturbation. The test results are shown in Figure 3. Overall, HRSWE is slightly worse than AR. On average, AR surpasses HRSWE by about 2.9 points on all three datasets perturbed by Syn-Adv, while it falls behind our method by about 1.6 points on all three datasets perturbed by Ant-Adv. For the Syn-Ant-Adv perturbation, the two methods are almost on a par on Adjective pairs, while AR is slightly better than HRSWE on Noun and Verb pairs.
Summary and Discussion. In all, HRSWE beats AR on 9 robustness test cases out of 15, winning 92 points in total while losing 58 points. Across all values of r and all test cases, HRSWE outperforms AR by up to 12 points, which occurs when r is 0.5 with the Syn-Adv perturbation on the SimVerb-test dataset. On the other hand, AR exceeds HRSWE by up to 5 points, which occurs with the Syn-Adv perturbation on the Noun dataset when r is 0.5 and is the best case for AR. From these perspectives, we argue that AR is slightly less robust than our method.
To reduce the impact of the perturbation, the thesauri contagion information should be fully exploited. The usage of the contagion information is explicit in our method, while it is implicit in AR. Suppose we have three words i, j, and k, where (i, j) and (j, k) are two pairs of synonyms. AR forces (i, j) to be closer than (j, k) while also forcing (j, k) to be closer than (i, j). Thus, the two words i and k will probably have a high similarity in the final embedding as well. This explains why AR is resistant to the perturbations.

Related Works
Apart from intrinsic tasks, many extrinsic downstream applications suffer without semantic specialization. For sentiment analysis, if "good" and "bad" are similar to each other, it is hard to distinguish the sentiment polarity of a sentence. For spoken language understanding, it would be annoying if a user wants a cheap restaurant while the virtual assistant recommends an expensive one.
Before AR, several important post-processing methods were proposed. The first is Retrofitting (Faruqui et al., 2014), in which the word "retrofitting" was first used. After that, PARAGRAM (Wieting et al., 2015) extended Retrofitting with a more sophisticated "ATTRACT" term. Note that neither Retrofitting nor PARAGRAM considers antonymy, whereas Counter-Fitting (Mrkšic et al., 2016) models both synonym and antonym relations. The differences among these methods are reviewed in Glavaš et al. (2019).
To make retrofitting models able to specialize the entire vocabulary, explicit retrofitting of the word embeddings has been proposed. The basic idea is to learn a global retrofitting function using linguistic constraints as training examples. After that, a post-specialization model was proposed, which trains a neural network to mimic the specialization of AR. This method yields considerable gains on a variety of tasks.
The semantic similarity considered so far is a symmetric relation. Some salient asymmetric relations, such as hypernymy and meronymy, also need to be modeled in the word embeddings. HyperVec (Nguyen et al., 2017a) is a joint model that augments the skip-gram objective with hypernym constraints, and it can also tell which word of a pair is the hypernym. A retrofitting model has also been proposed that extends AR to lexical entailment by adding an attract objective based on hypernym constraints and an asymmetric norm-based objective.

Conclusion
In this paper, we propose a new retrofitting method called Heterogeneously Retrofitted Spectral Word Embedding. This method shows performance comparable to the state-of-the-art retrofitting method while being considerably more efficient. In addition, our method is slightly more robust than AR under several perturbations. One major advantage of our method is its interpretability: our specialized embedding has a clear linear-algebraic relationship with the original embedding. Moreover, the impact of the hyperparameters and the contagion information on HRSWE has been carefully analyzed, showing step by step how each component enhances the performance of the embedding. In the future, we would like to extend our method to other types of word relations such as hyponymy and meronymy.