Manifold Learning-based Word Representation Refinement Incorporating Global and Local Information

Recent studies show that word embedding models often underestimate similarities between similar words and overestimate similarities between distant words. As a result, word similarity scores obtained from embedding models are inconsistent with human judgment. Manifold learning-based methods are widely utilized to refine word representations by re-embedding word vectors from the original embedding space into a new refined semantic space. These methods mainly focus on preserving local geometry information by performing a weighted locally linear combination between words and their neighbors twice. However, the reconstruction weights are easily influenced by the selection of neighboring words, and the whole combination process is time-consuming. In this paper, we propose two novel word representation refinement methods leveraging isometric feature mapping and local tangent space, respectively. Unlike previous methods, our first method corrects pre-trained word embeddings by preserving the global geometry information of all words instead of the local geometry information between words and their neighbors. Our second method refines word representations by aligning the original and refined embedding spaces based on local tangent space instead of performing the weighted locally linear combination twice. Experimental results obtained on standard semantic relatedness and semantic similarity tasks show that our methods outperform various state-of-the-art baselines for word representation refinement.


Introduction
Semantic word representations are normally dense, distributed and fixed-length word vectors generated by different word embedding models. They can be used to discover semantic information among words and measure the semantic relatedness of words. Not surprisingly, word vectors and word embedding models have been attracting a lot of attention in the research community. These word embeddings have proved quite useful in a number of Information Retrieval (IR) tasks, such as machine translation (Mujjiga et al., 2019), text classification (Stein et al., 2019), question answering (Esposito et al., 2020) and ad-hoc retrieval (Bagheri et al., 2018; Roy et al., 2018).
The performance of the aforementioned tasks critically depends on the quality of the word embeddings generated by different models. There exist a large number of word embedding models, such as BERT (Devlin et al., 2019), C&W (Collobert et al., 2011), Continuous Bag-of-Words (CBOW) (Mikolov et al., 2013a), Skip-Gram (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and many others (Qiu et al., 2014; Niu et al., 2017). BERT and its successor models can effectively generate high-quality contextual word embeddings. However, their computational cost is very high, and we reserve the study of contextual word embedding refinement for future work. By contrast, static word embedding models are generally simple and effective. These models assume that the data distribution of words has a linear structure. However, there are cases where the data distribution of words has a strongly non-linear structure (Chu et al., 2019), causing the aforementioned models to mis-estimate similarities between words. They may underestimate similarities between similar words and overestimate similarities between distant words, making the similarities obtained from word embedding models inconsistent with human judgment.
Some efforts have been made to address this inconsistency issue. For example, the Locally Linear Embedding (LLE) method (Hasan and Curry, 2017) and the Modified Locally Linear Embedding (MLLE) method (Chu et al., 2019) were proposed to refine pre-trained word vectors based on a weighted locally linear combination between words and their neighbors. The idea of these two similar methods is to utilize geometry information and keep the reconstruction weights between words and their local neighbors unchanged in both the original and the refined new embedding spaces. However, these methods have certain shortcomings. The reconstruction weights are constructed from the linear combination of words and their neighbors, and they are easily influenced by the selection of neighboring words. Furthermore, the weighted linear combination has to be performed twice, separately in the original and new embedding spaces, and the whole process is time-consuming.
In this paper, we propose two novel word representation refinement methods that overcome the shortcomings of previous methods. Our first Word Representation Refinement method utilizes Isometric Feature Mapping to refine word vectors based on the global geodesic distances between all words in the original embedding space (denoted as WRR-IFM). This method mainly focuses on global geometry information (geodesic distance) between all words. Through isometric feature mapping (Tenenbaum et al., 2000), the geodesic distances between word points in the original embedding space are preserved in the refined new embedding space. The WRR-IFM method first computes the geodesic distances between all words by finding the shortest paths between them, then uses isometric feature mapping to re-embed word vectors from the original embedding space into a refined new embedding space. Meanwhile, we also introduce another novel Word Representation Refinement method that re-embeds word vectors based on Local Tangent Space (denoted as WRR-LTS). This method treats a locally linear plane, constructed by applying Principal Component Analysis (PCA) to a word's neighbors, as an approximation of the tangent space at that word. The tangent space of word points on the manifold structure represents local geometry information (Zhang and Zha, 2002; Zhang and Zha, 2003). The WRR-LTS method then re-embeds word vectors by aligning the original and refined new embedding spaces based on the tangent space of each word. We conduct comprehensive experiments on seven different datasets with standard semantic relatedness and semantic similarity tasks to verify our proposed methods. The experimental results show that our WRR-IFM method can significantly refine the pre-trained word vectors and that our WRR-LTS method achieves better performance than state-of-the-art baseline methods for word representation refinement. In summary, our contributions are as follows:
a) We introduce a word representation refinement method leveraging isometric feature mapping to correct word vectors based on the global geodesic distances between all words. This method mainly focuses on global geometry information (geodesic distance) between all words instead of the local geometry information (the weighted locally linear combination between words) used in previous studies.
b) We introduce another word representation refinement method based on local tangent space. This method performs word representation refinement by aligning the original and refined new embedding spaces based on a different kind of local geometry information, i.e. the local tangent space of words.
c) We demonstrate that manifold learning algorithms that preserve local geometry information are more beneficial for refining word representations than manifold learning algorithms that preserve global geometry information.

Related Work
Word representation is an essential component for semantic relatedness measurement in many IR tasks.
In the past few years, different methods have been proposed to generate and refine word vectors.
Early ideas about using vectors to represent words were derived from the vector space model (Salton et al., 1975), which utilized TF-IDF to construct a word-document co-occurrence matrix to represent words and documents as vectors. Subsequently, several methods were proposed to produce word embeddings by globally utilizing word-context co-occurrence counts based on word-context matrices in a corpus (Deerwester et al., 1990; Dhillon et al., 2011; Lebret and Collobert, 2013). These methods all focus on word co-occurrence probabilities or word counts. Such count-based methods do not consider the semantic relationships between words and their context words.
Apart from these count-based methods, there are prediction-based methods. These methods are derived from the distributed word representation hypothesis proposed by Hinton (1986). Distributed word representations represent words as dense, low-dimensional word vectors. There are many well-known distributed word embedding models, such as C&W (Collobert et al., 2011), Continuous Bag-of-Words (CBOW) (Mikolov et al., 2013a), Skip-Gram (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and many others (Qiu et al., 2014; Niu et al., 2017). These methods leverage word context to generate word embeddings. Apart from the aforementioned static word embedding models, contextual embedding models have become popular in recent years, such as BERT (Devlin et al., 2019), ELMo (Peters et al., 2018) and many others (Lan et al., 2020; Liu et al., 2020). These models demonstrate better performance on word embedding generation. In general, however, contextual word embedding models have a huge number of parameters, and training such models is very time-consuming. By contrast, static embedding models are far simpler and equally effective.
To improve the quality of word embeddings, many word representation refinement methods have been proposed. Mu et al. (2018) post-processed pre-trained word vectors by removing the common mean vectors. Utsumi (2018) refined word vectors by using Layer-wise Relevance Propagation. Yu et al. (2017) utilized the ranking list of a sentiment lexicon to guide word representation refinement. Methods utilizing manifold learning-based algorithms are particularly effective. Hasan and Curry (2017) proposed a method using the Locally Linear Embedding (LLE) algorithm to re-embed pre-trained GloVe word vectors into a new embedding space; they used the weighted locally linear relationships between words and their neighbors in the original space. Chu et al. (2019) used a Modified Locally Linear Embedding (MLLE) algorithm to refine pre-trained word vectors with the help of the geometric information of words and their neighboring words. Although these methods achieve good performance, they still have some limitations. The performance of the above two manifold learning-based methods critically depends on local geometry information and the weighted locally linear combination between words and their (multiple) neighbors. The reconstruction weights are easily influenced by the selection of neighboring words. Also, the weighted locally linear combination has to be performed twice, in both the original and new embedding spaces, making the whole process quite time-consuming.
Because of these limitations, our WRR-IFM method tries to refine word representations by using global geometry information of all words instead of local geometry information. Our WRR-LTS method corrects word representations by using local geometry information (local tangent space) to align two embedding spaces rather than performing the weighted locally linear combination between words and neighboring words twice.

Overall Framework
Our proposed word representation refinement methods are based on a common framework. The idea is to utilize manifold learning algorithms to re-embed word vectors from the original embedding space into a refined new embedding space. A sketch of the framework is shown in Figure 1. In the first step, we select a sample subset of word vectors from the original embedding space through a sample window; word vectors are ordered by their corresponding word frequencies in a corpus. In this work, to demonstrate the effectiveness of our proposed methods, we test original word embeddings from GloVe, Word2Vec, and FastText. Note that, as in previous studies (Hasan and Curry, 2017; Chu et al., 2019), we use samples of word vectors rather than all vectors in the original embedding space to reduce the high computational cost. In the second step, the manifold learning algorithm is fitted on these samples and used to transform word vectors from the original embedding space to a refined new embedding space, with the dimension of the word vectors retained. In the third step, we pick the word vectors of the word pairs in specific evaluation tasks from the original embedding space. Finally, we re-embed these word vectors with the fitted manifold learning algorithm to form new vectors in the new embedding space.
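To make the four steps concrete, the sketch below is a minimal illustration of the framework, written by us rather than taken from the paper. It assumes the pre-trained vectors are available as a NumPy matrix whose rows are ordered by word frequency and uses a scikit-learn manifold learner (Isomap here, purely as a stand-in) that exposes fit and transform; all function and variable names are our own.

```python
import numpy as np
from sklearn.manifold import Isomap  # stand-in; any learner with fit/transform works

def refine_embeddings(emb_matrix, test_vectors, sample_size=1000, n_neighbors=15):
    """Re-embed word vectors with a fitted manifold learning algorithm.

    emb_matrix   -- (V, d) pre-trained vectors, rows ordered by word frequency
    test_vectors -- (m, d) vectors of the word pairs used in the evaluation tasks
    """
    # Step 1: take a sample window of the most frequent words as training data.
    samples = emb_matrix[:sample_size]

    # Step 2: fit the manifold learner, keeping the embedding dimension unchanged.
    d = emb_matrix.shape[1]
    learner = Isomap(n_neighbors=n_neighbors, n_components=d)
    learner.fit(samples)

    # Steps 3-4: pick the test word vectors and re-embed them into the refined space.
    return learner.transform(test_vectors)
```

In our reading, the only component that changes between WRR-IFM, WRR-LTS and the LLE/MLLE baselines is the learner object; the surrounding pipeline stays the same.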

Word Representation Refinement based on Isometric Feature Mapping
The LLE (Hasan and Curry, 2017) and MLLE (Chu et al., 2019) methods show promising results on word representation refinement. These two methods concentrate on uncovering the local geometry information of the manifold structure. We instead attempt to exploit global geometry information rather than local geometry information of the manifold structure. Hence, we propose a novel Word Representation Refinement method that utilizes Isometric Feature Mapping (WRR-IFM) to refine word vectors. The method is based on the global geodesic distances between all words in the original embedding space. The basic assumption is that the global geodesic distances between words are equal in the original and refined new embedding spaces. In this method, we first compute the global geodesic distances between all words by finding the shortest paths between them. Then we re-embed the word vectors by applying the classical Multidimensional Scaling (MDS) technique to decompose the distance matrix constructed from the geodesic distances.

Algorithm 1. Word Representation Refinement
Input: original embedding space, test words {w_1, w_2, ⋯, w_m}
Output: refined vector set of the test words
1: select a word vector sample set X from the original embedding space via a sample window
2: if WRR-IFM is used then
3:     fit according to Eq. (2), (3) to obtain the refined new word embedding space Y
4: else if WRR-LTS is used then
5:     fit according to Eq. (6), (10), (11) to obtain the refined new word embedding space Y
6: end if
7: for all w ∈ {w_1, w_2, ⋯, w_m} do
8:     obtain the corresponding word vector of w from the original embedding space
9:     re-embed the word vector of w based on Y to obtain its refined vector
10: end for
11: return the refined vector set of the test words

We fit the Isometric Feature Mapping (IFM) algorithm on the selected samples. Firstly, we select word vector samples from the original embedding space by using a sample window. The set of selected training samples is defined as a word vector set X = [x_1, x_2, ⋯, x_n], where n is the number of words. Note that x_i ∈ R^d, where d represents the dimension of the word vectors. Then we fit the IFM algorithm on X. For each word vector point x_i ∈ X, we find its k nearest neighbors (including itself). Based on these neighbors, an undirected neighborhood graph G is constructed, where nodes represent word vectors (points) and edges represent links between two points. The edge weight between two neighboring points x_i and x_j in graph G is their Euclidean distance d_X(i, j). If points x_i and x_j are linked, we initially set d_G(i, j) = d_X(i, j); otherwise, d_G(i, j) is set to ∞. Graph G is updated by using the shortest path algorithm (Dijkstra's algorithm), so that the shortest path from point x_i to x_j can be regarded as the geodesic distance between these two points:

d_G(i, j) = min{ d_G(i, j), d_G(i, k) + d_G(k, j) }    (1)

The shortest path distances between all pairs of word vectors in graph G form a matrix D_G, where [D_G]_{ij} = d_G(i, j). We use the classical MDS technique on D_G to re-embed word vectors into a new refined embedding space that preserves the intrinsic geometry of the manifold structure. The re-embedded word vectors y_i ∈ Y for the points x_i in the refined space are chosen to minimize the cost function

E = || τ(D_G) − τ(D_Y) ||_{L^2}    (2)

where E is the reconstruction error, D_Y is the matrix of Euclidean distances {d_Y(i, j) = ||y_i − y_j||} in the new refined space, and ||A||_{L^2} is the L^2 matrix norm sqrt(Σ_{i,j} A_{ij}^2) with A = τ(D_G) − τ(D_Y). The operator τ converts distances to inner products, which uniquely characterize the geometry of the data in a form that supports efficient optimization. The global minimum of Eq. (2) is achieved by setting the vectors of the refined word embedding set Y (which is also regarded as the refined new embedding space) according to the top d eigenvectors of the matrix τ(D_G); the number of retained eigenvectors equals d because the embedding dimension is kept identical in the original and new embedding spaces. To obtain these eigenvectors, we compute

τ(D_G) = −(1/2) H S H    (4)

where S is the matrix of squared distances, S_{ij} = [D_G]_{ij}^2, and H is the centering matrix H = I − (1/n) e e^T (e is the vector of all ones). Then we compute the eigenvalue decomposition

τ(D_G) = V Λ V^T,  Λ = diag(λ_1, λ_2, …, λ_n)    (5)

where λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_n ≥ 0. Finally, we choose the top d nonzero eigenvalues and the corresponding eigenvectors as the refined word embedding coordinates. The refined word embedding set Y is thus obtained by Eq. (3), computed from Eq. (4) and Eq. (5):

Y = Λ_d^{1/2} V_d^T    (3)

where Λ_d and V_d contain the top d eigenvalues and their eigenvectors. According to Eq. (2) and Eq. (3), we train the IFM algorithm on the selected word vector training samples to obtain a refined new embedding space Y. Then we pick the test word vectors from the original embedding space and re-embed them, leveraging the new embedding space Y, to obtain the refined vector set of the test words. The overall procedure of our proposed Word Representation Refinement methods, for both WRR-IFM and WRR-LTS, is described in Algorithm 1.
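To make the fitting step concrete, here is a small sketch of Eq. (1)–(5) for the training samples, written by us as an illustration rather than taken from the authors' code. It assumes X is the (n, d) sample matrix and uses SciPy's shortest-path routine for the geodesic distances; the helper names are our own.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

def fit_wrr_ifm(X, n_neighbors=10):
    """Refined coordinates Y (n, d) for the training samples X (n, d), following Eq. (1)-(5)."""
    n, d = X.shape

    # Neighborhood graph G: edge weights are Euclidean distances to the k nearest neighbors.
    G = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance')

    # Geodesic distance matrix D_G via shortest paths (Dijkstra), Eq. (1).
    # (A disconnected graph would leave infinite entries and needs a larger neighborhood.)
    D = shortest_path(G, method='D', directed=False)

    # Classical MDS, Eq. (4): tau(D_G) = -H S H / 2, with S the squared distances.
    S = D ** 2
    H = np.eye(n) - np.ones((n, n)) / n
    tau = -0.5 * H @ S @ H

    # Eigendecomposition tau = V Lambda V^T, Eq. (5); keep the top-d components, Eq. (3).
    eigvals, eigvecs = np.linalg.eigh(tau)
    top = np.argsort(eigvals)[::-1][:d]
    lam, V = np.clip(eigvals[top], 0.0, None), eigvecs[:, top]

    # Refined coordinates: column p of Y is sqrt(lambda_p) times eigenvector v_p.
    return V * np.sqrt(lam)
```

An out-of-sample extension (as provided, for example, by scikit-learn's Isomap.transform) would still be needed to re-embed the test word vectors in steps 7–10 of Algorithm 1; the sketch only refines the training samples.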

Word Representation Refinement based on Local Tangent Space
Similar to the LLE (Hasan and Curry, 2017) and MLLE (Chu et al., 2019) methods, our WRR-LTS method also preserves local geometry information of words and their neighbors when refining word vectors. However, the local geometry information used in our method differs from that of the LLE and MLLE methods. Furthermore, to overcome the limitations of these two methods, our method utilizes the local tangent space of word points instead of performing the weighted locally linear combination twice for word representation refinement. In this method, we first construct a locally linear plane by applying PCA to each word and its neighbors. This plane is regarded as an approximation of the tangent space at that word. Because there exists a linear mapping of each word from both the original and new embedding spaces to its local tangent space, our method re-embeds word representations by aligning these linear mappings based on the local tangent space.
The procedure of the method described in this section is similar to that of the WRR-IFM method. We first select word vector samples from the original embedding space via a sample window; this selected word vector sample set is defined as X = [x_1, x_2, ⋯, x_n]. Note that X ∈ R^{d×n}, where d and n represent the dimension of the word vector samples and the number of word vectors, respectively. Then we train a Local Tangent Space (LTS) algorithm on them. To be specific, for each word vector x_i ∈ X, we find its k nearest neighbors (including itself), and the adjacent neighborhood set is denoted as X_i = [x_{i1}, x_{i2}, ⋯, x_{ik}]. To preserve the local structure of the neighborhood set of each word vector x_i, we apply PCA to X_i to approximate the local tangent space of the word corresponding to x_i. The objective function is

min_{x̄_i, Q_i, Θ_i} Σ_{j=1}^{k} || x_{ij} − (x̄_i + Q_i θ_{ij}) ||^2 = min_{Q_i, Θ_i} || X_i (I − (1/k) e e^T) − Q_i Θ_i ||^2    (6)

where I is an identity matrix, e represents the vector of all 1's, Q_i is an orthonormal basis matrix of the tangent space, and Θ_i = [θ_{i1}, θ_{i2}, ⋯, θ_{ik}] represents the local linear approximation of X_i. The optimal x̄_i in the above formula is the mean of the neighborhood set X_i, i.e. the mean value of all word vectors x_{ij} (j = 1, 2, ⋯, k) in X_i. The optimal Q_i is given by the orthonormal basis composed of the d left singular vectors of X_i(I − (1/k) e e^T) corresponding to its d largest singular values (the number of retained singular values is set equal to d for the reason mentioned in Section 3.2). The tangent coordinates can then be defined as

Θ_i = Q_i^T X_i (I − (1/k) e e^T)    (7)

After we extract the local tangent coordinates by an optimal linear fitting to the neighboring samples, we need to obtain the global coordinates in the new embedding space. The purpose of global coordinate construction is to find a group of global coordinates Y_i = [y_{i1}, y_{i2}, ⋯, y_{ik}] in the new embedding space. We assume that there is an alignment matrix which re-embeds the tangent coordinates to the new space coordinates; then we have

Y_i (I − (1/k) e e^T) = L_i Θ_i + E_i    (8)

where L_i is the alignment matrix which maps Θ_i to Y_i and E_i is the local reconstruction error matrix. To preserve as much of the local geometry information in the new embedding space as possible, we seek Y_i and L_i that minimize the reconstruction error

min_{Y_i, L_i} Σ_i || E_i ||^2 = Σ_i || Y_i (I − (1/k) e e^T) − L_i Θ_i ||^2    (9)

Obviously, the optimal alignment matrix has the form L_i = Y_i (I − (1/k) e e^T) Θ_i^+, and the minimal local reconstruction error is E_i = Y_i (I − (1/k) e e^T)(I − Θ_i^+ Θ_i), where Θ_i^+ is the Moore-Penrose generalized inverse of Θ_i. Let the refined word vector set be Y = [y_1, y_2, ⋯, y_n] in the new embedding space (Y is also called the refined new embedding space), and let S_i be the 0-1 selection matrix such that Y S_i = Y_i. We find the optimal Y by minimizing the overall reconstruction error Σ_i ||E_i||^2, and the objective function in Formula (9) can be rewritten as

min_Y tr(Y B Y^T),  with  B = Σ_i S_i W_i W_i^T S_i^T  and  W_i = (I − (1/k) e e^T)(I − Θ_i^+ Θ_i)    (10)

To uniquely determine Y, the constraint Y Y^T = I_d is imposed. The vector of all ones is an eigenvector of B corresponding to a zero eigenvalue, so the refined word embedding set Y is given by the eigenvectors of the matrix B corresponding to its 2nd to (d+1)-th smallest eigenvalues; the eigenvector matrix picked from B is [v_2, ⋯, v_{d+1}], where v_j is an eigenvector of B. The d-dimensional refined new embedding set is therefore

Y = [v_2, v_3, ⋯, v_{d+1}]^T    (11)

We use the word vector samples from the original embedding space to train the LTS algorithm according to Eq. (6), Eq. (10) and Eq. (11) to obtain a refined new embedding space Y. Then we take the word vectors of the test words from the original embedding space and obtain the refined vector set of these words based on the new embedding space Y.
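As with WRR-IFM, a compact sketch of the fitting step may help. The code below is our own illustration of Eq. (6)–(11), not the authors' implementation; it assumes the sample matrix X has shape (n, d) with k > d neighbors, and it uses the equivalent projection form of the local term W_i W_i^T.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_wrr_lts(X, n_neighbors=350):
    """Refined coordinates Y (n, d) aligned from local tangent spaces, following Eq. (6)-(11)."""
    n, d = X.shape
    k = n_neighbors                            # assumed to satisfy d < k <= n
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)   # row i: the k neighbors of x_i

    B = np.zeros((n, n))                       # alignment matrix B = sum_i S_i W_i W_i^T S_i^T, Eq. (10)
    centering = np.eye(k) - np.ones((k, k)) / k
    for i in range(n):
        Xi = X[idx[i]]                         # neighborhood X_i, shape (k, d)
        Xi_c = centering @ Xi                  # subtract the neighborhood mean (optimal x_bar), Eq. (6)

        # The top-d left singular vectors of the centered neighborhood span the local tangent
        # space; projecting onto them gives the tangent coordinates Theta_i of Eq. (7).
        U, _, _ = np.linalg.svd(Xi_c, full_matrices=False)
        Ud = U[:, :d]

        # Local term W_i W_i^T: for centered data it reduces to the projection
        # (I - ee^T/k) - U_d U_d^T, the minimal reconstruction error of Eq. (8)-(9).
        Wi = centering - Ud @ Ud.T
        B[np.ix_(idx[i], idx[i])] += Wi

    # Eq. (10)-(11): eigenvectors of B for the 2nd to (d+1)-th smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(B)
    return eigvecs[:, 1:d + 1]
```

As in Algorithm 1, a separate out-of-sample step is still needed to re-embed the test word vectors; the sketch only produces refined coordinates for the training samples.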

Pre-trained Word Vectors
We use three types of pre-trained word vectors in our word representation refinement experiments: GloVe (Pennington et al., 2014), FastText (Mikolov et al., 2018) and Word2Vec (Mikolov et al., 2013a). The GloVe word vectors are learned from different sources: 400,000 GloVe vectors are trained on the Wikipedia 2014 + Gigaword 5 corpora (6 billion tokens, a 400,000-word vocabulary; vectors with 50, 100, 200, and 300 dimensions), and another 1.9 million GloVe vectors are trained on the Common Crawl corpus (42 billion tokens, a 1.9 million-word vocabulary; 300-dimensional vectors). One million FastText vectors are trained on the Wikipedia 2017 corpus (16 billion tokens; 300-dimensional vectors for 1 million words). The Word2Vec vectors, with 300 dimensions, are trained on part of the Google News dataset and cover 3 million words and phrases.
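For reference, GloVe and FastText distribute their vectors as plain text, one word per line followed by its components, so a minimal loader under that assumption might look as follows (the file name in the usage comment is a placeholder; FastText .vec files additionally start with a header line giving the vocabulary size and dimension, which should be skipped). The Word2Vec Google News vectors are distributed in a binary format and are more conveniently read with an existing library.

```python
import numpy as np

def load_text_vectors(path, vocab_limit=None, skip_header=False):
    """Read text-format word vectors ('word v1 ... vd' per line) into a dict of NumPy arrays."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        if skip_header:
            next(f)                      # FastText .vec files start with '<count> <dim>'
        for i, line in enumerate(f):
            if vocab_limit is not None and i >= vocab_limit:
                break                    # the files are typically frequency-ordered, so this acts as a sample window
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Hypothetical usage with the 300-dimensional Wikipedia + Gigaword GloVe vectors:
# glove = load_text_vectors('glove.6B.300d.txt', vocab_limit=50000)
```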

Baselines and Evaluation Metrics
We report our experimental results in comparison with other state-of-the-art word representation refinement methods. The description of baseline methods is presented as follows.
GloVe. GloVe (Pennington et al., 2014) vectors are trained on global word co-occurrence statistics; the model considers both global and local features of words.
Word2Vec. Word vectors trained by the Word2Vec model (Mikolov et al., 2013a) focus only on local features of words. The model uses sliding context windows to select neighboring words, either predicting the target word from its neighbors (CBOW) or predicting the neighbors from the current word (Skip-Gram).
FastText. This method (Mikolov et al., 2018) considers subwords and uses them to deal with out-of-vocabulary words when producing word vectors.
LLE. This method, proposed by Hasan and Curry (2017), preserves local linear features between words and their neighbors by using the LLE manifold learning algorithm.
MLLE. Similar to LLE described above, Chu et al. (2019) used the MLLE manifold learning algorithm to refine pre-trained word vectors.
WRR-IFM. The first method proposed in this paper. It uses an Isometric Feature Mapping algorithm that focuses on preserving geodesic distances between words to re-embed word vectors from the original embedding space into a refined new embedding space.
WRR-LTS. The second method proposed in this paper. It uses the Local Tangent Space algorithm to re-embed word vectors by aligning the original and refined new embedding spaces based on the tangent space of each word.
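The LLE and MLLE baselines correspond to standard estimators in scikit-learn, so they slot into the same pipeline as our methods. A minimal configuration sketch follows; the neighbor counts are chosen by us purely for illustration.

```python
from sklearn.manifold import LocallyLinearEmbedding

# LLE baseline (Hasan and Curry, 2017): standard weighted locally linear reconstruction.
lle = LocallyLinearEmbedding(n_neighbors=350, n_components=300, method='standard')

# MLLE baseline (Chu et al., 2019): modified LLE with multiple reconstruction weights.
# Note that scikit-learn requires n_neighbors >= n_components for method='modified'.
mlle = LocallyLinearEmbedding(n_neighbors=350, n_components=300, method='modified')

# Both estimators expose fit and transform, so, like WRR-IFM and WRR-LTS, they can be
# fitted on the sampled word vectors and then used to re-embed the test word vectors.
```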

Performance on Two Evaluation Tasks
We conduct experiments on seven datasets of semantic similarity and semantic relatedness tasks to verify the performance of our proposed methods. Table 1 shows the comparison results of our proposed methods (WRR-IFM and WRR-LTS) and all baseline methods (LLE and MLLE) on three different sets of pre-trained word vectors.
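Throughout this section, each method is scored by the Spearman correlation between the cosine similarities of the (refined) word-pair vectors and the human similarity or relatedness ratings. A minimal sketch of this scoring, assuming each dataset is a list of (word1, word2, rating) triples and vectors is a dict from word to NumPy array (the names are ours):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(pairs, vectors):
    """Spearman correlation between cosine similarities and human ratings for one dataset."""
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 not in vectors or w2 not in vectors:
            continue                      # skip out-of-vocabulary pairs
        v1, v2 = vectors[w1], vectors[w2]
        cosine = float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        model_scores.append(cosine)
        human_scores.append(rating)
    return spearmanr(model_scores, human_scores).correlation
```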
The results show that, when all methods are trained on GloVe vectors, the WRR-LTS method achieves the best scores on five out of seven datasets. When they are trained on Word2Vec vectors, the WRR-LTS method again achieves the best scores on five out of seven datasets. When they are trained on FastText vectors, the WRR-LTS method achieves the best scores on six out of seven datasets. The clear advantage of the WRR-LTS method demonstrates that our local tangent space-based method captures more accurate local geometry information than the baseline LLE and MLLE methods. In other words, the local tangent space used in our proposed method is more beneficial for representing local geometry information than the weighted locally linear combination between words and their neighbors used in the LLE and MLLE methods.
However, our WRR-IFM method works less well. In most runs, it only brings improvements over the original word embeddings (i.e. GloVe, Word2Vec and FastText) and fails to outperform the other baseline methods. This suggests that local geometry information may be more important than global geometry information for refining word representations, and that global geometry information may introduce noise into the refining process.
We now examine the differences between the two evaluation tasks. All manifold learning-based methods, including our proposed ones, demonstrate similar performance on both tasks. These results are in line with previous findings, suggesting that both tasks are suitable for evaluating word representation refinement.

Comparison of Refining Different Word Vectors
Table 1: Spearman correlations between scores predicted by our model and scores obtained from human judgment on seven datasets. Bold values with * indicate that our proposed approach achieves the best performance among all baseline methods. Bold values with † indicate that our proposed method achieves better results than the original pre-trained models. Note that all baseline results for the GloVe pre-trained word vectors are taken from Chu et al. (2019).

In the work of Hasan and Curry (2017) and Chu et al. (2019), manifold learning-based methods were only compared against the GloVe vectors. In this paper, we compare their performance on three representative and popular sets of pre-trained word vectors. From Table 1, refining the GloVe vectors, with improvements from 1.45% to 6.88%, outperforms refining the Word2Vec vectors, with improvements from 1.32% to 6.94%. The larger improvement on the count-based model (GloVe) than on the prediction-based models needs further investigation, and we leave it as future work.

Performance on Refining GloVe Word Vectors
To compare the performance of our proposed methods and all baseline methods on GloVe word vectors with different embedding dimensions, we randomly choose two datasets, WS353 and RG, to report results. The results are shown in Table 2. Compared with the baseline methods (including LLE and MLLE), our WRR-LTS method achieves the best performance in 5 out of 10 experimental runs and our WRR-IFM method obtains the highest scores in 4 out of 10 experimental runs. Our methods also show significant improvements over the original GloVe word vectors in most runs. Performance improves as the dimension and training data size increase; therefore, in Section 4.3.1, all GloVe vectors are trained with the most data we could obtain.

Table 2: Spearman correlations between scores predicted by our model and scores obtained from human judgment on two evaluation datasets. Bold values with * indicate that our proposed approach achieves the best performance among all baseline methods. Bold values with † indicate that our proposed method achieves better results than the original GloVe pre-trained model. Note that baseline results are taken from Chu et al. (2019).

Impact of Parameters
Finally, we describe the impact of the parameters. We set the number of eigenvectors to be equal to the dimension of the pre-trained word vectors. The size of the training sample window is in the range [300, 1000], and the number of neighbors is chosen from the range [300, 1500]. Generally, the lower the number of local neighbors, the faster the fitted manifold learning algorithm runs.
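As a hypothetical illustration of how these settings interact, a sweep might be organised as below; only the ranges come from the text above, while the specific grid values and names are our own.

```python
import itertools

embedding_dim = 300                          # number of kept eigenvectors = pre-trained vector dimension
sample_window_sizes = [300, 500, 1000]       # training samples taken from the most frequent words
neighbor_counts = [300, 600, 1000, 1500]     # fewer neighbors -> faster fitting

for window, k in itertools.product(sample_window_sizes, neighbor_counts):
    if k > window:                           # cannot use more neighbors than training samples
        continue
    # fit WRR-IFM or WRR-LTS with (window, k, embedding_dim) and score it
    # with the Spearman-correlation evaluation sketched earlier
    pass
```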

Conclusion and Future Work
In this paper, we study the word representation refinement problem by utilizing manifold learning algorithms. We propose two novel methods (WRR-IFM and WRR-LTS) for this purpose. Our WRR-IFM method utilizes isometric feature mapping to refine word vectors based on the global geodesic distances between all words in the original embedding space. Our WRR-LTS method corrects word representations by aligning the original and refined new embedding spaces based on the tangent space of words. The WRR-IFM method focuses on preserving global geometry information (global geodesic distances) between all words, while the WRR-LTS method considers local geometry information (the local tangent space of words) between words and their neighbors. We conduct several experiments on semantic relatedness and semantic similarity tasks. The results obtained in these two evaluation tasks suggest that our proposed methods consistently perform well for refining word representations. In the future, we intend to extend our experiments to refining aligned bilingual and multilingual word vectors. We also intend to investigate whether our proposed methods have a significant impact on refining contextual word embeddings.