Single Training Dimension Selection for Word Embedding with PCA

In this paper, we present a fast and reliable method based on PCA to select the number of dimensions for word embeddings. First, we train one embedding with a generous upper bound (e.g. 1,000) of dimensions. Then we transform the embeddings using PCA and incrementally remove the lesser dimensions one at a time while recording the embeddings’ performance on language tasks. Lastly, we select the number of dimensions, balancing model size and accuracy. Experiments using various datasets and language tasks demonstrate that we are able to train about 10 times fewer sets of embeddings while retaining optimal performance. Researchers interested in training the best-performing embeddings for downstream tasks, such as sentiment analysis, question answering and hypernym extraction, as well as those interested in embedding compression should find the method helpful.

The impact of vector dimension on embeddings' performance is well known (Lai et al., 2016). With too few dimensions, the model will underfit; with too many dimensions, the model will overfit. Both undercut the embeddings' performance (Yin and Shen, 2018). It is also known that the size of the embeddings grows linearly with the vector dimension (Ling et al., 2016; Raunak, 2017). What is less known is how to identify the optimal vector dimension for a given dataset. The method we propose here helps fill this gap. We offer a fast and reliable PCA-based method that (1) only needs to train the embeddings once and (2) is able to select a vector dimension with competitive performance. First, we train one embedding with a generous upper bound (e.g. 1,000) of dimensions. Then we transform the embeddings using PCA and incrementally remove the lesser dimensions one at a time while recording the embeddings' performance on language tasks. Lastly, we calculate the best dimensionality and return the corresponding embeddings.
Experiments using various datasets and language tasks reveal three key observations:
• The optimal dimensionality calculated on the basis of PCA agrees with that found by grid search.
• The resulting embedding is competitive against the one selected by grid search.
• The method is robust against the selection of the upper bound.
Researchers interested in downstream tasks, such as sentiment analysis (Lin et al., 2015), question answering (Devlin et al., 2018) and hypernym extraction (Chen et al., 2018), as well as those interested in embedding compression should find the method helpful.

Related Work
Our work draws inspiration from Yin and Shen (2018). The authors build on the Latent Semantic Analysis (LSA) approach and slide k from a lower bound (e.g. 10) to a generous upper bound (e.g. 1,000) in $E = U_{1:k} D^{\alpha}_{1:k,1:k}$, where U and D come from the singular-value decomposition of the signal matrix and α is a hyperparameter to be tuned. For each k, the authors generate one corresponding embedding and compare it with the simulated oracle embedding. The k that yields the smallest loss is selected. In a similar vein, our work bypasses the problem of training multiple embeddings, often necessitated by grid search, by sliding over all the k values of PCA. Compared with Yin and Shen (2018), our method is easier to implement, as we do not rely on, e.g., Monte Carlo simulations of the oracle embeddings.
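To make the sliding-k construction concrete, the following is a minimal sketch of building $E = U_{1:k} D^{\alpha}_{1:k,1:k}$ for several values of k from a single SVD. The random `signal` matrix, the helper name `embedding_for_k` and the default α = 0.5 are illustrative assumptions, not values taken from Yin and Shen (2018).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for a real signal matrix (e.g. a co-occurrence or PMI matrix).
signal = rng.standard_normal((50, 40))

# Thin SVD, computed once; singular values d are sorted in decreasing order.
U, d, _ = np.linalg.svd(signal, full_matrices=False)

def embedding_for_k(k, alpha=0.5):
    """Return U[:, :k] @ diag(d[:k] ** alpha); alpha=0.5 is an arbitrary default."""
    # Broadcasting scales column j of U by d[j] ** alpha.
    return U[:, :k] * d[:k] ** alpha

# Sliding k only changes how many leading columns are kept.
embeddings = {k: embedding_for_k(k) for k in (5, 10, 20)}
```

Because the columns are ordered by singular value, the embedding for a smaller k is simply a prefix of the embedding for a larger k, which is what makes sliding over k cheap.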
At a deeper level, our work is also connected to Yin and Shen (2018) in terms of the trade-off between bias and variance. Yin and Shen (2018) propose pairwise inner product (PIP) loss to measure the quality of an embedding. They decompose the PIP loss into a bias term and a variance term, where reducing the dimension increases the bias term but reduces the variance. They show that the bias-variance trade-off reflects the signal-to-noise ratio in dimension selection. While there is no exact one-to-one mapping from their theorem to our work, we do have an analogous discussion in Section 3. The PCA step in our algorithm enables us to identify and drop dimensions that (1) contribute less to the explained variance in the embedding and yet (2) contribute equally to cosine similarity. In essence, our PCA step removes dimensions with low signal-to-noise ratios.
Our work also draws strength from the literature on post-processing embeddings. Mu and Viswanath (2018) demonstrate that removing the top dominating directions in the PCA-transformed embeddings helps improve the embeddings' performance in word analogy and similarity tasks. Building on that, Raunak (2017) shows that by performing another iteration of PCA and dropping the bottom directions, one can further improve a model's performance as well as reduce its size. Both works focus on improving pre-trained embeddings' performance in terms of accuracy and size. By contrast, our algorithm selects the optimal dimensionality before the actual training.
In addition, our work is related to a few recent studies on model compression (Faruqui et al., 2015; Ling et al., 2016; Shu and Nakayama, 2018). In particular, Ling et al. (2016) seek to drop the least significant digits to reduce the embeddings' size. By comparison, our method removes the least significant dimensions in terms of explained variance (Bishop, 2006). It should be noted that the two methods complement each other, as one focuses on dimension selection whereas the other focuses on limited-precision representation.

Algorithm
In this section, we formally describe how to select a competitive dimensionality by training one embedding. We state the proposed algorithm in Algorithm 1.
First, we note that the PCA transformation (when retaining all the N dimensions) does not affect embeddings' performance on word similarity tasks. Any potential performance gain should come from dropping the lesser dimensions. By "lesser dimensions," we mean the dimensions that contribute little to the explanation of variance (Bishop, 2006).
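The invariance claim can be checked numerically. The sketch below assumes the PCA transform is applied as a pure rotation onto the principal directions, without subtracting the mean when projecting; with mean subtraction, cosine similarities can shift. The toy matrix `E` and the helper `cosine_matrix` are illustrative assumptions.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

rng = np.random.default_rng(0)
E = rng.standard_normal((200, 30))        # toy embedding: 200 words, 30 dimensions

# Principal directions: right singular vectors of the mean-centered matrix.
_, _, Vt = np.linalg.svd(E - E.mean(axis=0), full_matrices=False)
U = Vt.T                                  # 30 x 30 orthogonal matrix

# Assumption: rotate the raw vectors, keeping all 30 dimensions.
E_rot = E @ U

# An orthogonal change of basis preserves inner products and norms,
# so the two cosine-similarity matrices agree up to floating-point error.
max_diff = np.abs(cosine_matrix(E) - cosine_matrix(E_rot)).max()
```

Since the rotation alone changes nothing, any difference in word similarity scores has to come from the dimensions that are subsequently dropped.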
Algorithm 1 Select the Optimal Dimensionality using PCA
  Set dimension upper bound N; select language task from {word analogy, similarity}
  Train embedding E with N dimensions
  Transform E using PCA: $(u_1, u_2, \ldots, u_N; \tilde{E}) \leftarrow \mathrm{PCA}(E)$, where $u_1, u_2, \ldots, u_N$ are the new basis vectors and $\tilde{E}$ represents the transformed coefficients
  for i = N to 2 do
    $E = E - \tilde{E}_{:,i} \cdot u_i$, where $\tilde{E}_{:,i}$ represents the ith column of $\tilde{E}$ and each scalar in $\tilde{E}_{:,i}$ scales vector $u_i$
    Evaluate E on the selected language task, record (i, metric)
  end for
  Return the selected dimension: [argmax
In Figure 2, we first transform an embedding using PCA, so that each new dimension represents a principal component. We show that the explained variance has dropped drastically by the 100th dimension and stays relatively stable from the 100th to the 1,000th dimension. In terms of magnitude, the first dimension explains 69.2 times more variance than the last dimension. While different dimensions contribute differently to the explained variance, they nonetheless contribute equally to the calculation of the inner product (Figure 2, right). Therefore, the lesser dimensions, with less variance but equal weighting, effectively decrease the discriminative power of the model. Removing these lesser dimensions enables us to focus on the more discriminative dimensions. To identify the optimal dimensionality, beyond which all dimensions are considered lesser, we turn to experiments in Section 4.
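The loop in Algorithm 1 can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: PCA is computed via an SVD of the mean-centered matrix, the `evaluate` callback stands in for a real word similarity or analogy benchmark, and the `score` argument (defaulting to the raw metric) is a hypothetical placeholder for the selection rule. The toy data and the names `pca`, `select_dimension` and `cosine` are likewise invented for the example.

```python
import numpy as np

def pca(E):
    """Return (U, coeffs): principal directions as columns of U and the
    coefficients of the mean-centered rows of E in that basis.
    Assumes E has at least as many rows (words) as columns (dimensions)."""
    centered = E - E.mean(axis=0)             # centering is an assumption of this sketch
    # Right singular vectors are ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    U = Vt.T
    return U, centered @ U

def select_dimension(E, evaluate, score=lambda i, metric: metric):
    """Strip principal components one at a time; return (best_i, records)."""
    U, coeffs = pca(E)
    E = E.copy()                               # leave the caller's array untouched
    records = []
    for i in range(E.shape[1], 1, -1):         # i = N, N-1, ..., 2 (1-indexed)
        # Subtract component i: each coefficient scales basis vector u_i.
        E -= np.outer(coeffs[:, i - 1], U[:, i - 1])
        records.append((i, evaluate(E)))       # i - 1 components remain here
    best_i, _ = max(records, key=lambda r: score(*r))
    return best_i, records

# Toy data: a rank-5 "clean" embedding plus isotropic noise in 20 dimensions.
rng = np.random.default_rng(0)
clean = rng.standard_normal((300, 5)) @ rng.standard_normal((5, 20))
noisy = clean + rng.standard_normal((300, 20))
pairs = rng.integers(0, 300, size=(500, 2))    # random word-pair indices

def cosine(X, pairs):
    a, b = X[pairs[:, 0]], X[pairs[:, 1]]
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

gold = cosine(clean, pairs)                    # stand-in for human similarity ratings

def evaluate(X):
    # Correlation between model similarities and the stand-in gold scores.
    return np.corrcoef(cosine(X, pairs), gold)[0, 1]

best_i, records = select_dimension(noisy, evaluate)
```

Only one embedding is needed for the whole sweep; each step of the loop costs one rank-one update plus one evaluation. A size penalty can be supplied through `score` when smaller embeddings should be preferred.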

Experiments
Given the popularity of the word2vec model (Mikolov et al., 2013b), we use Skip-gram as the embedding algorithm. Following Yin and Shen (2018) and Grave et al. (2017), we use the widely used benchmark datasets, Text8 (Mahoney, 2011) and WikiText-103 (Merity et al., 2017), as the training datasets. For ground truth, we train 20 embeddings, with dimensions ranging from 50 to 1,000 at an interval of 50, as well as two embeddings with dimensions of 5 and 25. (Here we have made the implicit assumption that 1,000 is an upper bound for the embedding space's dimensionality.) For each embedding, we train for 200 epochs (Pennington et al., 2014; Shazeer et al., 2016) and keep only the checkpoint that performs best on the word analogy task (Mikolov et al., 2013a). Our experiments focus on (1) comparing our method with the grid-search-based ground truth and (2) examining consistency between different upper bounds.
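For reference, the grid-search baseline described above corresponds to the following set of dimensions; the variable name `grid` is only illustrative.

```python
# Two small dimensions plus 50, 100, ..., 1,000 at an interval of 50.
grid = [5, 25] + list(range(50, 1001, 50))

# Each entry requires training one embedding from scratch.
num_trainings = len(grid)   # 22
```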

Performance Compared with Grid Search
In this subsection, we compare the optimal dimensionality that our method calculates with the ground truth (Figure 3). We perform the comparison across three testing datasets: WordSim 353 (Finkelstein et al., 2002), RW Stanford (Luong et al., 2013) and MTurk 771 (Halawi et al., 2012). Figure 3 demonstrates that our PCA-based method (with one training only) is able to uncover the optimal dimensionality. Table 1 reports the distance in selected dimensionalities.
We also observe that the optimal embedding that results from our method (with retraining) is competitive against the optimal embedding found using grid search. In Figure 3, we mark out the respective optimal performances of the two approaches in similarity tasks. In Table 2, we report the optimal performance achieved by grid search and our method as well as their relative performance. Even though we have only trained one embedding (and one retraining), our method, on average, is able to achieve between 96.9% (MTurk 771) and 100.2% (WordSim 353) of the optimal performance obtained by grid-searching through 22 sets of embeddings.
Figure 3: Across different benchmarks and different training datasets, the optimal dimensionality that our method (blue curve) identifies closely matches grid search (red curve). The top row is based on Text8. The bottom row is based on WikiText-103. The vertical dotted lines represent the optimal dimensionality for the respective curves. The horizontal dotted line represents the performance of grid search and the blue star marks the performance of our method. The score function is f(i, metric) = metric - 50×i. All the curves are averaged over 5 random runs.

Consistency across Upper Bounds
One hyperparameter involved in our method is the upper bound. Intuitively, we expect that the upper bound should be higher for larger datasets. In this subsection, we demonstrate that our method is robust against different upper bounds.
In Figure 4, we vary the upper bound from 500 to 1,000 in increments of 100. We observe that the dimensionality our method selects is consistent across different upper bounds. Given this consistency, the optimal dimensionality can be uncovered with any upper bound, as long as the chosen upper bound is larger than the optimal dimensionality.

Efficiency Compared with Grid Search
In this subsection, we report the running time of our algorithm and compare it with that of grid search. We have recorded the running time of our experiments and that of the PCA transformations. We average them over 5 random runs (Figure 5).
For Text8, grid search takes 22,801 minutes cumulatively. Training a 1,000-dimension embedding takes 1,724 minutes. The PCA step takes 22 minutes (note that for each embedding we only need one PCA operation). This represents a 13.1x speedup for our method. For WikiText-103, grid search takes 132,652 minutes. Training a 1,000-dimension embedding takes 10,448 minutes. The PCA step takes 34 minutes. This represents a 12.7x speedup.
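The reported speedups follow directly from the timings quoted above, taking the cost of our method as one 1,000-dimension training run plus one PCA step. The short check below uses only those numbers; the dictionary layout and variable names are illustrative.

```python
# (grid search, single 1,000-dimension training, PCA step), all in minutes.
timings = {
    "Text8": (22_801, 1_724, 22),
    "WikiText-103": (132_652, 10_448, 34),
}

# Speedup = cumulative grid-search time / (one training run + one PCA step).
speedups = {
    name: round(grid / (train + pca), 1)
    for name, (grid, train, pca) in timings.items()
}
# speedups == {"Text8": 13.1, "WikiText-103": 12.7}
```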
We note that the comparison results are dependent on grid granularity. A coarser grid search could save researchers more time, at the cost of performance loss.

Conclusion
In this paper, we provided a fast and reliable method based on PCA to select the number of dimensions for training word embeddings. First, we train one embedding with a generous upper bound (e.g. 1,000) of dimensions. Then we transform the embedding using PCA and incrementally remove the lesser dimensions while recording the embeddings' performance on language tasks. Experiments demonstrate that (1) our method is able to identify the optimal dimensionality, (2) the resulting embedding has competitive performance against grid search, and (3) our method is robust against the selection of the upper bound.