When Hyperparameters Help: Beneficial Parameter Combinations in Distributional Semantic Models

Distributional semantic models can predict many linguistic phenomena, including word similarity, lexical ambiguity, and semantic priming, and can even pass TOEFL synonymy and analogy tests (Landauer and Dumais, 1997; Griffiths et al., 2007; Turney and Pantel, 2010). But what does it take to create a competitive distributional model? Levy et al. (2015) argue that the key to success lies in hyperparameter tuning rather than in the model’s architecture. More hyperparameters trivially lead to potential performance gains, but what do they actually do to improve the models? Are individual hyperparameters’ contributions independent of each other? Or are only specific parameter combinations beneficial? To answer these questions, we perform a quantitative and qualitative evaluation of major hyperparameters as identified in previous research.


Introduction
In a rigorous evaluation, Baroni et al. (2014) showed that neural word embeddings such as skip-gram have an edge over traditional count-based models. However, as argued by Levy and Goldberg (2014), the difference is not as big as it appears, since skip-gram implicitly factorizes a word-context matrix whose cells are the pointwise mutual information (PMI) of word-context pairs, shifted by a global constant. Levy et al. (2015) further suggest that the performance advantage of neural-network-based models is largely due to hyperparameter optimization, and that optimizing count-based models can yield similar performance gains. In this paper we take this claim as our starting point. We experiment with the three hyperparameters that have the greatest effect on model performance according to Levy et al. (2015): subsampling, shifted PMI, and context distribution smoothing (CDS). To get a more detailed picture, we use a greater range of hyperparameter values than previous work, compare all combinations of hyperparameter values, and perform a qualitative analysis of their effect.

Context Distribution Smoothing
Mikolov et al. (2013b) smoothed the original context distribution by raising unigram frequencies to the power of α. Levy and Goldberg (2015) used this technique in conjunction with PMI.
After CDS, either PPMI or Shifted PPMI may be applied. We implemented CDS by raising every count to the power of α, exploring several values for α between .25 and .95, as well as 1 (no smoothing).
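As an illustration of how CDS enters the PMI computation, the following Python sketch follows the formulation of Levy et al. (2015), in which only the context counts are raised to the power of α before being normalised into a context distribution. The function name `cds_ppmi` and the dictionary-based input format are assumptions made for this example; it is a minimal sketch and not a reproduction of the implementation used in our experiments.

```python
import math
from collections import Counter


def cds_ppmi(pair_counts, alpha=0.75):
    """PPMI with context distribution smoothing (illustrative sketch).

    pair_counts maps (word, context) -> co-occurrence count.
    Context counts are raised to the power alpha before normalisation;
    alpha = 1 gives plain PPMI.
    """
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n

    total = sum(pair_counts.values())
    smoothed_total = sum(n ** alpha for n in context_counts.values())

    scores = {}
    for (word, context), n in pair_counts.items():
        p_wc = n / total
        p_w = word_counts[word] / total
        # Smoothed context probability: #(c)^alpha / sum_c' #(c')^alpha
        p_c = context_counts[context] ** alpha / smoothed_total
        scores[(word, context)] = max(math.log(p_wc / (p_w * p_c)), 0.0)
    return scores
```

With α < 1 the smoothed probability of a rare context is larger than its raw relative frequency, so the PMI of pairs involving that context decreases, while α = 1 leaves the scores unchanged.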

Shifted PPMI
Levy and Goldberg introduced Shifted Positive Pointwise Mutual Information (SPPMI) as an association measure more effective than PPMI. For every word w and every context c, the SPPMI of the pair (w, c) is the larger of 0 and the pair's PMI value minus the log of a constant k:
$$\mathrm{SPPMI}_k(w, c) = \max\left(\mathrm{PMI}(w, c) - \log k,\ 0\right)$$
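The shift can be sketched in a few lines of Python. The function names `sppmi` and `shift_scores` and the dictionary of precomputed PMI values are hypothetical and serve only to illustrate the definition above.

```python
import math


def sppmi(pmi, k):
    """Shifted positive PMI: subtract log k, then clip at zero."""
    return max(pmi - math.log(k), 0.0)


def shift_scores(pmi_scores, k):
    """Apply the shift to a {(word, context): PMI} mapping.

    Cells that fall to zero are dropped, which keeps the
    representation sparse.
    """
    shifted = {pair: sppmi(value, k) for pair, value in pmi_scores.items()}
    return {pair: value for pair, value in shifted.items() if value > 0.0}
```

Setting k = 1 recovers plain PPMI, since log 1 = 0. Larger values of k remove progressively more low-PMI cells.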

Subsampling
Subsampling was used by Mikolov et al. (2013a) as a means to remove frequent words, which provide less information than rare words. Each word in the corpus with a frequency above a threshold t can be ignored with a probability p, computed for each word from its frequency f:
$$p = 1 - \sqrt{\frac{t}{f}}$$
Following Mikolov et al., we used $t = 10^{-5}$. In word2vec, subsampling is applied before the corpus is processed. Levy and Goldberg explored the possibility of applying subsampling afterwards, which does not affect the context window's size, but found no significant difference between the two methods. In our experiments, we applied subsampling before processing.
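A minimal sketch of subsampling applied before window extraction is given below. It assumes the discard probability of Mikolov et al. (2013a), p = 1 − sqrt(t / f), where f is the relative frequency of a word. The function names, the list-of-tokens input, and the fixed random seed are assumptions made for this illustration.

```python
import math
import random
from collections import Counter


def discard_probability(freq, t=1e-5):
    """Probability of dropping a token whose relative frequency is freq."""
    if freq <= t:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(t / freq)


def subsample(tokens, t=1e-5, seed=0):
    """Randomly drop frequent tokens before context windows are extracted."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    return [
        tok for tok in tokens
        if rng.random() >= discard_probability(counts[tok] / total, t)
    ]
```

Because tokens are removed before co-occurrences are counted, the remaining words move closer together, which effectively widens the context window. This is the variant we applied in our experiments.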

Corpus
For maximum consistency with previous research, we used the cooccurrence counts of the best count-based configuration in Baroni et al. (2014), extracted from the concatenation of the web-crawled ukWaC corpus (Baroni et al., 2009), Wikipedia, and the BNC, for a total of 2.8 billion tokens, using a 2-word window and the 300K most frequent tokens as contexts. This corpus will be referred to as WUB. For comparison with a smaller corpus, similar to the one in Levy and Goldberg's setup, we also extracted cooccurrence data from Wikipedia alone, leaving the rest of the configuration identical. This corpus will be referred to as Wiki.

Evaluation Materials
Three data sets were used to evaluate the models. The MEN data set contains 3000 word pairs rated by human similarity judgements. Bruni et al. (2014) report an accuracy of 78% on this data set using an approach that combines visual and textual features. The WordSim data set is a collection of word pairs associated with human judgements of similarity or relatedness. The similarity set contains 203 items (WS sim) and the relatedness set contains 252 items (WS rel). Agirre et al. (2009) achieved an accuracy of 77% on this data set using a context window approach. The TOEFL data set includes 80 multiple-choice synonym questions (Landauer and Dumais, 1997). For this data set, corpus-based approaches have reached an accuracy of 92.50% (Rapp, 2003).

Context Distribution Smoothing
Our results show that smoothing is largely ineffective when used in conjunction with PPMI. It also becomes apparent that .95 is a better value for α than .75.

Shifted PPMI
When using SPPMI, Levy and Goldberg (2014) tested three values for k: 1, 5 and 15. On the MEN data set, they report that the best k value was 5 (.721), while on the WordSim data set the best k value was 15 (.687). In our experiments, where (in contrast to Levy and Goldberg) all other hyperparameters are set to 'vanilla' values, the best k value was 3 for all data sets.

Smoothing and Shifting Combined
The results in Table 3 show that Context Distribution Smoothing is effective when used in conjunction with Shifted PPMI. With CDS, 5 turns out to be a better value than 3 for k. These results are also consistent with the previous experiment: a smoothing of .95 is in most cases better than .75.

Subsampling
Under the best shifting and smoothing configuration, subsampling can improve the model's performance score by up to 9.2% (see Table 4). But in the absence of shifting and smoothing, subsampling does not produce a consistent change in performance, which ranges from −6.7% to +7%. The nature of the task is also important here: on WS rel, subsampling improves the model's performance by 9.2%. We assume that diversifying contextual cues is more beneficial in a relatedness task than in others, especially on a smaller corpus.

Qualitative Analysis
CDS and SPPMI increase model performance because they reduce statistical noise, which is illustrated in Table 5. It shows the top ten neighbours of the word doughnut in the vanilla PPMI configuration vs. SPPMI with CDS, in which there are more semantically related neighbours (in bold).
To visualize which dimensions of the vectors are discarded when shifting and smoothing, we randomly selected a thousand word vectors and compared, for each vector, the number of dimensions with a positive value in the vanilla configuration vs. the log(5)cds(.95) configuration (SPPMI with k = 5 and CDS with α = .95). For instance, the word segmentation has 1105 positive dimensions in the vanilla configuration, but only 577 in the latter.
For visual clarity, only vectors with 500 or fewer contexts are shown in Figure 1.
This figure indicates that the effect of shifting and smoothing is largely independent of the number of contexts of a vector: a word with a high number of positive contexts in the vanilla configuration may very well end up with zero positive contexts under SPPMI with CDS. In other words, the number of positive contexts under the vanilla configuration appears unrelated to the probability of retaining at least one positive context.
We further analysed a sample of 1504 vectors that lose all positive dimensions under SPPMI with CDS. We annotated a portion of those vectors, and found that the vast majority were numerical expressions, such as dates, prices or measurements, e.g. 1745, which may appear in many different contexts, but is unlikely to have a high number of occurrences with any of them. This explains why its number of positive contexts drops to zero when SPPMI and CDS are applied.
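The comparison of positive dimensions described in this section can be sketched as follows. The sparse dictionary representation of the vectors and the function names are assumptions made for this illustration, and any values passed in are hypothetical rather than taken from our models.

```python
def positive_dims(vector):
    """Number of dimensions (contexts) with a strictly positive value."""
    return sum(1 for value in vector.values() if value > 0)


def compare_positive_dims(vanilla, tuned):
    """Map each word to (positive dims in vanilla, positive dims in tuned).

    Both arguments map word -> {context: association score}.
    A word missing from `tuned` is treated as having no positive dimensions.
    """
    return {
        word: (positive_dims(vec), positive_dims(tuned.get(word, {})))
        for word, vec in vanilla.items()
    }


def emptied_words(comparison):
    """Words that had positive dimensions before but none afterwards."""
    return sorted(
        word for word, (before, after) in comparison.items()
        if before > 0 and after == 0
    )
```

Plotting the two counts against each other for a random sample of words gives the kind of view summarised in Figure 1, and `emptied_words` isolates the vectors that lose all positive contexts.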

Count vs Predict and Corpus Size
We conducted the same experiments on two corpora: the WUB corpus (Wikipedia+ukWaC+BNC) used by Baroni et al., and the smaller Wiki corpus, similar to the one that Levy et al. employed. With these two corpora, we found the same general pattern of results, with two exceptions: the WordSim relatedness task benefits greatly from a larger corpus, and MEN favors steeper smoothing (.75) on the smaller corpus. This suggests that the smoothing hyperparameter should be adjusted to the corpus size and the task at hand. For comparison, we give the results for a word2vec model trained on the two corpora using the best configuration reported by Baroni et al. (2014): CBOW, 10 negative samples, CDS, window 5, and 400 dimensions. We find that the PPMI-based count model performs better when using the Wikipedia corpus alone, but when using the larger corpus the predict model still outperforms all count models.

Conclusion
Our investigation showed that the interaction of different hyperparameters matters more than the implementation of any single one. Smoothing only shows its potential when used in combination with shifting, with log(5) as a shifting hyperparameter. Qualitatively speaking, the hyperparameters help largely by reducing statistical noise in cooccurrence data. SPPMI works by removing low PMI values, which are likely to be noisy. CDS effectively lowers PMI values for rare contexts, which tend to be noisier, allowing a higher threshold for SPPMI (log 5 vs. log 3) to be effective. Subsampling gives a greater weight to underexploited data from rare words at the expense of frequent ones, but it amplifies the noise as well as the signal, and should be combined with the other noise-reducing hyperparameters to be useful.
In terms of corpus size, we have seen that similar performance can be achieved with a smaller corpus if the right hyperparameters are used. One exception is the WordSim relatedness task, in which models require more data to achieve the same level of performance, and benefit from subsampling much more than in the similarity task.
While the best predictive model from Baroni et al. trained on the WUB corpus still outperforms our best count model on the same corpus, hyperparameter tuning does significantly improve the performance of count models and should be used when a corpus is too small to build a predictive model.