Effective Dimensionality Reduction for Word Embeddings

Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sentences, paragraphs and documents. Recently, there has been an emphasis on improving the pretrained word vectors through post-processing algorithms. One improvement area is reducing the dimensionality of word embeddings. Reducing the size of word embeddings can improve their utility in memory constrained devices, benefiting several real world applications. In this work, we present a novel technique that efficiently combines PCA based dimensionality reduction with a recently proposed post-processing algorithm (Mu and Viswanath, 2018), to construct effective word embeddings of lower dimensions. Empirical evaluations on several benchmarks show that our algorithm efficiently reduces the embedding size while achieving similar or (more often) better performance than original embeddings. We have released the source code along with this paper.


Introduction
Word embeddings are distributed, dense, real-valued representations of words as low-dimensional vectors that geometrically capture the semantic "meaning" of a word, along with several linguistic regularities such as analogy relationships. Such embeddings (e.g. Glove (Pennington et al., 2014), word2vec Skip-Gram (Mikolov et al., 2013)) are learned from unlabeled text corpora and have found great use in several natural language processing and information retrieval tasks (Mu et al., 2017). Given their widespread utility, there has recently been an emphasis on applying post-processing algorithms to pretrained word vectors to further improve their quality. For example, Mrkšić et al. (2016) try to inject antonymy and synonymy constraints into vector representations, while Faruqui et al. (2014) refine word vectors by using relational information from semantic lexicons such as WordNet. Bolukbasi et al. (2016) propose algorithms to remove the biases (e.g. gender biases) present in word embeddings, and Nguyen et al. (2016) try to "denoise" word embeddings by strengthening salient information and weakening noise. In particular, the post-processing algorithm in (Mu et al., 2017) considerably improves the embeddings' performance by projecting the embeddings away from the most dominant directions.
Another issue related to word embeddings is their size (Ling et al., 2016). For example, loading an embedding matrix of 2.5M tokens takes up to 6 GB of memory (for 300-dimensional vectors, on a 64-bit system). Such large memory requirements impose significant constraints on the practical use of word embeddings, especially on mobile devices where the available memory is often highly restricted. Ling et al. (2016) try to ameliorate this situation by using limited-precision representations during word embedding use and training, while Andrews (2016) tries to compress word embeddings using different compression algorithms. Our approach differs from both these works: we directly reduce the dimensionality of word embeddings instead of using limited-precision representations or compressing individual vector values. The next section explains our algorithm's design and presents its evaluation results.
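The 6 GB figure quoted above can be checked with simple arithmetic. The sketch below assumes a dense matrix stored at 8 bytes per value (a 64-bit float) and 1 GB = 10^9 bytes; both are assumptions chosen because they reproduce the stated number, and the function name is hypothetical.

```python
def embedding_memory_gb(num_tokens: int, dim: int, bytes_per_value: int = 8) -> float:
    """Footprint of a dense embedding matrix in gigabytes (1 GB = 10**9 bytes).

    bytes_per_value=8 is an assumption (64-bit floats), not a detail from the paper.
    """
    return num_tokens * dim * bytes_per_value / 1e9


# 2.5M tokens at 300 dimensions, then the same vocabulary at half the dimensionality.
full_size = embedding_memory_gb(2_500_000, 300)
half_size = embedding_memory_gb(2_500_000, 150)
```

Under these assumptions the footprint scales linearly with the dimensionality, so halving the number of dimensions halves the memory needed for the matrix.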

The Algorithm
In this section, we first explain the post-processing algorithm from (Mu et al., 2017) in subsection 2.1. Our algorithm, along with its motivations, is explained in subsection 2.2. The experimental results are presented in subsection 2.3.

arXiv:1708.03629v1 [cs.CL] 11 Aug 2017

2.1 Post-Processing Word Embeddings
Mu et al. (2017) present a simple post-processing algorithm that renders off-the-shelf word embeddings even stronger, as measured on a number of lexical-level and sentence-level tasks. The algorithm (Algorithm 1) is based on the geometric observations that word embeddings (across all representations such as Glove, word2vec etc.) have a large mean vector and that most of their energy (after subtracting the mean vector) is contained in a very low-dimensional subspace. Since all embeddings share a common mean vector and have the same dominating directions, both of which strongly influence the representations in the same way, eliminating them renders the embeddings stronger.
Algorithm 1 The Post-Processing Algorithm, PPA(X, D)
Input: Word Embedding Matrix X, Threshold Parameter D.
1: Subtract the mean: X = X − mean(X)
2: Compute the PCA components: u_i = PCA(X), where i = 1, 2, ..., d
3: Eliminate the top D components: for all v in X, v = v − Σ_{i=1}^{D} (u_i^T v) u_i
end

Figure 1(a) demonstrates the impact of the post-processing algorithm (PPA, with D=7) as observed on Glove embeddings (300 dimensions). It compares the fraction of variance explained by the top 20 principal components of the original and post-processed word vectors respectively (the total sum of explained variances over the 300 principal components is equal to 1.0). In the post-processed word embeddings, none of the top principal components are disproportionately dominant in terms of explaining the data, which implies that the post-processed word vectors are not as influenced by the common dominant directions as the original embeddings. This makes the individual word vectors more "discriminative", thereby improving their quality, as validated on several benchmarks in (Mu et al., 2017).
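The post-processing step and the variance comparison described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: the function names `ppa` and `top_variance_fractions` are hypothetical, the data is a synthetic matrix with three artificially dominant directions rather than real Glove vectors, and the exact `svd_solver="full"` setting is a choice made here for reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA


def ppa(X: np.ndarray, D: int = 7) -> np.ndarray:
    """Remove the common mean vector and the top-D principal directions."""
    X = X - X.mean(axis=0)                       # subtract the mean vector
    U = PCA(n_components=D, svd_solver="full").fit(X).components_  # shape (D, d)
    return X - (X @ U.T) @ U                     # project away from the top-D directions


def top_variance_fractions(X: np.ndarray, k: int = 20) -> np.ndarray:
    """Fraction of total variance explained by each of the top-k principal components."""
    return PCA(n_components=k, svd_solver="full").fit(X).explained_variance_ratio_


# Synthetic stand-in for an embedding matrix (rows are words); NOT real embeddings.
rng = np.random.default_rng(0)
scales = np.ones(50)
scales[:3] = [20.0, 15.0, 10.0]                  # three artificially dominant directions
X = rng.normal(size=(1000, 50)) * scales + 2.0   # the +2.0 gives a large common mean

before = top_variance_fractions(X, k=20)
after = top_variance_fractions(ppa(X, D=3), k=20)
```

Plotting `before` against `after` would give a comparison in the spirit of Figure 1(a): in this synthetic setup the leading fraction shrinks once the dominant directions are removed.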

Dimensionality Reduction
In this section, we explain and present our algorithm, which effectively incorporates the post-processing algorithm into the dimensionality reduction procedure. Our algorithm is based on three considerations. First, since the post-processing algorithm demonstrably leads to better word embeddings, it is appropriate to apply the dimensionality reduction algorithm (PCA (Shlens, 2014)) to the "purified" word embeddings when constructing a lower-dimensional representation.
For the second point, consider Figure 1(b). It compares the variance explained by the top 20 principal components for the embeddings constructed by first post-processing the Glove-300D embeddings according to Algorithm 1 (PPA) and then transforming the vectors to 150 dimensions using PCA (labelled as P+PCA), against a further post-processed version of the same embeddings (the total sum of explained variances over the 150 principal components is equal to 1.0). We observe that even though PCA has been applied to the post-processed embeddings (which had their dominant directions eliminated), the variance in the resulting embeddings is still explained disproportionately by a few of the top principal components. The re-emergence of this geometric behaviour implies that further post-processing the lower-dimensional embeddings by projecting the word vectors away from the dominant directions will make the embeddings better. Finally, it is also evident that the extent to which the top principal components explain the data is not as great as in the case of the original 300-dimensional embeddings (Figure 1(a)). Hence, multiple levels of post-processing at different levels of dimensionality will yield diminishing returns. These considerations form the intuition behind our algorithm (Algorithm 2) for constructing lower-dimensional word embeddings, where we apply the post-processing algorithm twice, on either side of a PCA-based dimensionality reduction of the word vectors.
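The procedure motivated above (post-process, reduce with PCA, post-process again) can be sketched as follows. This is an illustrative reading of the description, assuming scikit-learn's PCA; the names `ppa` and `reduce_embeddings` and the `D < N` guard are choices made here, not details taken from the released code.

```python
import numpy as np
from sklearn.decomposition import PCA


def ppa(X: np.ndarray, D: int) -> np.ndarray:
    """Subtract the mean, then project away from the top-D principal directions."""
    X = X - X.mean(axis=0)
    U = PCA(n_components=D, svd_solver="full").fit(X).components_
    return X - (X @ U.T) @ U


def reduce_embeddings(X: np.ndarray, N: int, D: int = 7) -> np.ndarray:
    """Post-process, reduce to N dimensions with PCA, then post-process again."""
    if not D < N <= X.shape[1]:
        raise ValueError("expected D < N <= original dimensionality")
    X = ppa(X, D)                                                # first post-processing
    X = PCA(n_components=N, svd_solver="full").fit_transform(X)  # reduce to N dimensions
    return ppa(X, D)                                             # second post-processing
```

For a 300-dimensional embedding matrix `X` with one row per word, `reduce_embeddings(X, N=150, D=7)` would return a matrix with the same number of rows and 150 columns.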

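Algorithm 2 The Dimensionality Reduction Algorithm
Input: Word Embedding Matrix X, New Dimension N, Threshold Parameter D.
Output: Word Embedding Matrix X of Reduced Dimension N.
1: Apply the post-processing algorithm: X = PPA(X, D)
2: Transform X using PCA, keeping N dimensions: X = PCA(X)
3: Apply the post-processing algorithm: X = PPA(X, D)
end

2.3 Evaluation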

Word Embeddings
The pre-trained word embeddings (for English only) used for evaluating our algorithms are: Glove embeddings of dimensions 300, 200 and 100, trained on Wikipedia 2014 and Gigaword (available at https://nlp.stanford.edu/projects/glove/).

Datasets
We use the word similarity benchmarks summarized in (Faruqui and Dyer, 2014) for evaluating the word vectors. The datasets contain word pairs that have been assigned similarity ratings by humans. While evaluating word vectors, the similarity between two words is calculated as the cosine similarity of their vector representations. Then, Spearman's rank correlation coefficient (ρ) (Myers et al., 2010) between the ranks produced by the word vectors and the human rankings is calculated. The reported metric in the experiments is ρ × 100. Hence, a higher value of the evaluation metric indicates better word similarity performance.
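The metric described above can be sketched as below, using SciPy's `spearmanr`. The function names, the dictionary-based vector lookup, and the choice to skip pairs containing out-of-vocabulary words are assumptions made for this illustration; the toy vectors and ratings are invented and do not come from any benchmark.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def word_similarity_score(vectors: dict, pairs: list) -> float:
    """Return Spearman's rho x 100 between model similarities and human ratings.

    pairs holds (word1, word2, human_rating) tuples. Pairs with a word missing
    from `vectors` are skipped (an assumption, not a detail from the paper).
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine_similarity(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    rho, _ = spearmanr(model_scores, human_scores)
    return 100.0 * rho


# Toy data, invented for illustration only.
toy_vectors = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([1.0, 0.1]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([1.0, 1.0]),
}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0)]
toy_score = word_similarity_score(toy_vectors, toy_pairs)
```

Because only ranks matter, the scale of the human ratings does not affect the score.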

Compared Baselines
To evaluate the performance of our algorithm, we establish baselines comprising alternative schemes for combining the post-processing algorithm with PCA-based dimensionality reduction. The baseline algorithms are:
1. PCA: Transform the word vectors using PCA.
2. P+PCA: Transform the word vectors using PCA after applying the post-processing algorithm.
3. PCA+P: Apply the post-processing algorithm after transforming the word vectors using PCA.
These baselines can also be regarded as ablations of our algorithm and can shed light on whether our intuitions in developing the algorithm were correct. In the comparisons ahead, our algorithm is represented as Algo-N (where N is the new dimensionality of the word embeddings). All evaluations use the PCA implementation available in (Pedregosa et al., 2011). Further, in our implementation, subtracting the common mean vector from the word embeddings is done as a pre-processing step before applying PCA on the embedding matrix.
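Viewed as ablations, the baselines differ only in where the post-processing step is applied relative to PCA. The sketch below lays the variants side by side under stated assumptions: the helper names are hypothetical, and PCA+P is read from its name as PCA followed by post-processing, mirroring P+PCA.

```python
import numpy as np
from sklearn.decomposition import PCA


def ppa(X, D):
    """Subtract the mean, then project away from the top-D principal directions."""
    X = X - X.mean(axis=0)
    U = PCA(n_components=D, svd_solver="full").fit(X).components_
    return X - (X @ U.T) @ U


def pca_reduce(X, N):
    """Reduce to N dimensions; scikit-learn's PCA centers the data internally."""
    return PCA(n_components=N, svd_solver="full").fit_transform(X)


# Each variant maps (X, N, D) to an embedding matrix with N columns.
VARIANTS = {
    "PCA": lambda X, N, D: pca_reduce(X, N),
    "P+PCA": lambda X, N, D: pca_reduce(ppa(X, D), N),
    "PCA+P": lambda X, N, D: ppa(pca_reduce(X, N), D),     # assumed reading of the name
    "Algo-N": lambda X, N, D: ppa(pca_reduce(ppa(X, D), N), D),
}
```

Scoring the output of each variant with the same similarity metric is what isolates the contribution of each post-processing step.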

Experimental Results
First, we evaluate our algorithm against the three baselines mentioned above; then, we evaluate it across word embeddings of different dimensions and types. In all the experiments, the threshold parameter D in the PPA algorithm was set to 7, and the new dimensionality after applying the dimensionality reduction algorithms, N, was set to d/2 (where d is the original embedding dimensionality).
Against Different Baselines: Table 1 summarizes the results of different baselines on the 12 datasets. As expected from the discussions in subsection 2.2, our algorithm achieves the best results on 6 out of 12 datasets when compared across all the columns (the best scores are highlighted in bold). In particular, the 150-dimensional word embeddings constructed using our algorithm perform better than the 300-dimensional embeddings on 7 out of 12 datasets (with an average improvement of 2.74%), do significantly better than the PCA and PCA+P baselines, and beat the P+PCA baseline in 8 out of the 12 tasks.
Across Different Embeddings: Table 2 summarizes the results of applying our algorithm to 300-dimensional fastText embeddings, 100-dimensional Glove embeddings and 200-dimensional Glove embeddings (the better scores are highlighted in bold). In the case of fastText embeddings, the 150-dimensional word vectors constructed using our algorithm achieve better performance on 4 out of 12 datasets when compared to the 300-dimensional embeddings. Overall, the 150-dimensional word vectors have a cumulative score of 765.5 against 771.93 for the 300-dimensional vectors. Hence, their overall performance is quite similar to that of the 300-dimensional embeddings (with an average performance decline of 0.53%). In the case of Glove embeddings of 100 and 200 dimensions, our algorithm leads to significant gains (with average performance improvements of 2.6% and 3% respectively), and the lower-dimensional embeddings achieve better performance on 8 and 10 datasets respectively.

Conclusions
To conclude, we restate that our algorithm is effective in constructing lower-dimensional word embeddings while maintaining similar or (more often) better performance. We hope that it will improve the utility of word embeddings on memory-restricted devices. In the future, we would like to explore combining compression with dimensionality reduction to further reduce the size of word embeddings.

Figure 1: Comparison of (a) the original and post-processed Glove embeddings (300-dimensional) in terms of the fraction of variance explained by the top 20 principal components; (b) the P+PCA baseline (150 dimensions) and the further post-processed Glove embeddings in terms of the fraction of variance explained by the top 20 principal components.