Robust Gram Embeddings

Word embedding models learn vectorial word representations that can be used in a variety of NLP applications. When training data is scarce, these models risk losing their generalization abilities due to the complexity of the models and overfitting to finite data. We propose a regularized embedding formulation, called Robust Gram (RG), which penalizes overfitting by suppressing the disparity between target and context embeddings. Our experimental analysis shows that the RG model trained on small datasets generalizes better than alternatives, is more robust to variations in the training set, and correlates well with human similarity judgments on a set of word similarity tasks.


Introduction
Word embeddings represent each word as a unique vector in a linear vector space, encoding particular semantic and syntactic structure of natural language (Arora et al., 2016). In various linguistic tasks, these sequence prediction models have shown superior results over traditional count-based models (Baroni et al., 2014). Tasks such as sentiment analysis (Maas et al., 2011) and sarcasm detection (Ghosh et al., 2015) benefit from these features.
These word embeddings optimize features and predictors simultaneously, which can be interpreted as a factorization of the word co-occurrence matrix $C$. In most realistic scenarios these models have to be learned from a small training set. Furthermore, word distributions are often skewed, and optimizing the reconstruction of the estimated matrix $\hat{C}$ puts too much emphasis on the high-frequency pairs (Levy and Goldberg, 2014). On the other hand, with an unlucky and scarce data sample, the estimated $\hat{C}$ rapidly deviates from the underlying true co-occurrence, in particular for low-frequency pairs (Lemaire and Denhire, 2008). Finally, noise (caused by stemming, removal of high-frequency pairs, typographical errors, etc.) can increase the estimation error heavily (Arora et al., 2015).
It is challenging to derive a computationally tractable algorithm that solves all these problems. Spectral factorization approaches usually employ Laplace smoothing or a type of SVD weighting to alleviate the effect of the noise (Turney and Pantel, 2010). Alternatively, iteratively optimized embeddings such as the Skip Gram (SG) model (Mikolov et al., 2013b) rely on various mechanisms, such as undersampling highly frequent hub words a priori and discarding rare words from the training set.
Here we propose a fast, effective and generalizable embedding approach, called Robust Gram, that penalizes complexity arising from the factorized embedding spaces. This design alleviates the need to tune the aforementioned pseudo-priors and preprocessing procedures. Experimental results show that our regularized model 1) generalizes better given a small set of samples, where other methods generalize insufficiently, 2) is more robust to arbitrary perturbations in the sample set and to alterations in the preprocessing specifications, and 3) achieves much better performance on the word similarity task, especially when the similarity pairs contain unique and rarely observed words in the vocabulary.

Robust Gram Embeddings
Let $|y| = V$ be the vocabulary size and $N$ the total number of training samples. Let $x$ and $y$ denote $V \times 1$ discrete word indicators for the context and the target, corresponding to the context and word indicators $c$, $w$ in the word embedding literature. Define $\Psi \in \mathbb{R}^{d \times V}$ and $\Phi \in \mathbb{R}^{d \times V}$ as the word and context embedding matrices. The projection onto the matrix column space, $\Phi x$, gives the embedding of $x$ in $\mathbb{R}^d$. We use $\Phi x$ and $\Phi_x$ interchangeably. Using a very general formulation for the regularized optimization of an (embedding) model, the following objective is minimized:

$$J = \sum_{i=1}^{N} L(\Psi, \Phi, x_i, y_i) + g(\Psi, \Phi) \qquad (1)$$

where $L(\Psi, \Phi, x_i, y_i)$ is the loss incurred by embedding example target $y_i$ using context $x_i$ and embedding parameters $\Psi$, $\Phi$, and where $g(\Psi, \Phi)$ is a regularization of the embedding parameters. Embedding methods differ in the form of the specified loss function and regularization. For instance, the Skip Gram likelihood aims to maximize the following conditional:

$$p(y \mid x, \Psi, \Phi) = \frac{\exp\big((\Psi y)^T (\Phi x)\big)}{\sum_{y'} \exp\big((\Psi y')^T (\Phi x)\big)} \qquad (2)$$

This can be interpreted as a generalization of Multinomial Logistic Regression (MLR). Rewriting $(\Psi y)^T (\Phi x) = y^T \Psi^T \Phi x = y^T W x = W_{yx}$ shows that the combination of $\Phi$ and $\Psi$ becomes the weight matrix in the MLR. In the regression, the input $x$ is transformed to directly predict $y$. The Skip Gram model, however, transforms both the context $x$ and the target $y$, and can therefore be seen as a generalization of the MLR. It is also possible to penalize the quadratic loss between embeddings (Globerson et al., 2007):

$$p(y \mid x, \Psi, \Phi) \propto \exp\big(-\|\Psi y - \Phi x\|^2\big) \qquad (3)$$

Since these formulations predefine a particular embedding dimensionality $d$, they impose a low-rank constraint on the factorization $W = \Psi^T \Phi$. This means that $g(\Psi, \Phi)$ contains $\lambda \, \mathrm{rank}(\Phi^T \Psi)$ with a sufficiently large $\lambda$. Optimization with an explicit rank constraint is NP-hard. Instead, approximate rank constraints are utilized with a trace norm (Fazel et al., 2001) or max norm (Srebro and Shraibman, 2005). However, adding such constraints usually requires semidefinite programs, which quickly become computationally prohibitive even with a moderate vocabulary size.
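As a minimal sketch of the bilinear score and the softmax conditional described above, the following pure-Python snippet uses a toy vocabulary of V = 3 words and d = 2 dimensions. The matrix values and the helper names (`embed`, `score`, `conditional`) are illustrative assumptions and not part of any original implementation.

```python
import math

# Toy embedding matrices of shape d x V (d = 2, V = 3); values are arbitrary.
PSI = [[0.2, -0.1, 0.4],
       [0.0, 0.3, -0.2]]   # target (word) embeddings
PHI = [[0.1, 0.5, -0.3],
       [0.2, -0.4, 0.1]]   # context embeddings


def embed(matrix, index):
    """Multiply a d x V matrix by a one-hot indicator: select column `index`."""
    return [row[index] for row in matrix]


def score(y, x):
    """Bilinear score (Psi y)^T (Phi x) for target index y and context index x."""
    return sum(a * b for a, b in zip(embed(PSI, y), embed(PHI, x)))


def conditional(x):
    """Softmax of the bilinear score over all targets y, i.e. p(y | x)."""
    scores = [score(y, x) for y in range(len(PSI[0]))]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# W = Psi^T Phi is V x V; its (y, x) entry equals the bilinear score.
V = len(PSI[0])
W = [[sum(PSI[k][y] * PHI[k][x] for k in range(len(PSI))) for x in range(V)]
     for y in range(V)]

probs = conditional(x=1)
```

The explicit `W` is built only to illustrate the MLR interpretation; a practical implementation would never materialize the full V x V matrix.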
Do these formulations penalize complexity? Embeddings under quadratic loss are already regularized and avoid trivial solutions thanks to the second term. They also incorporate a slightly weighted, data-dependent $\ell_2$ norm. Nevertheless, choosing a log-sigmoid loss for Equation 1 brings us to the Skip Gram model, and in that case no $\ell_p$ regularization is stated. Without such regularization, unbounded optimization of $2Vd$ parameters has the potential to converge to solutions that do not generalize well.
To avoid this overfitting, in our formulation we choose $g_1$ as follows:

$$g_1(\Psi, \Phi) = \sum_{v=1}^{V} \|\Psi_v\|_2^2 + \|\Phi_v\|_2^2 \qquad (4)$$

where $\Psi_v$ is the embedding vector of word $v$. Moreover, an appropriate regularization can also penalize the deviance between the low-rank matrices $\Psi$ and $\Phi$. Although there are words in the language that may have different context and target representations, such as "the", it is natural to expect that a large proportion of words share a representation across their context and target mappings. To this end, we introduce the following regularization:

$$g_2(\Psi, \Phi) = \|\Psi - \Phi\|_F^2 \qquad (5)$$

where $\|\cdot\|_F$ is the Frobenius matrix norm. This assumption reduces learning complexity significantly while still retaining a good representation. Optimization under this constraint also becomes much easier for large vocabularies because the degrees of freedom are limited.
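To see why this term ties the two embedding spaces together, it helps to look at its gradient. The short derivation below assumes the regularizer is the squared Frobenius norm of the difference between the two matrices; the name $g_2$ follows the usage later in the text.

```latex
\begin{aligned}
g_2(\Psi, \Phi) &= \|\Psi - \Phi\|_F^2
  = \sum_{k=1}^{d} \sum_{v=1}^{V} \left(\Psi_{kv} - \Phi_{kv}\right)^2, \\
\frac{\partial g_2}{\partial \Psi} &= 2\,(\Psi - \Phi), \qquad
\frac{\partial g_2}{\partial \Phi} = -2\,(\Psi - \Phi).
\end{aligned}
```

A gradient step on this term alone therefore moves $\Psi$ toward $\Phi$ and $\Phi$ toward $\Psi$ by the same amount, so the target and context vectors of a word stay apart only where the data loss pushes them apart.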
The Robust Gram objective therefore becomes:

$$J = \sum_{i=1}^{N} L(\Psi, \Phi, x_i, y_i) + \lambda_1 g_1(\Psi, \Phi) + \lambda_2 g_2(\Psi, \Phi) \qquad (6)$$

where $L(\Psi, \Phi, x_i, y_i) = -\log p(y_i \mid x_i, \Psi, \Phi)$ is the cross-entropy loss and $p$ is the log-linear prediction model. Since we aim to preserve or restore low masses in $\hat{C}$, norms such as $\ell_2$ allow each element to still possess a small probability mass and encourage smoothness in the factorized $\Psi^T \Phi$ matrix. As $L$ is chosen to be the cross-entropy, Robust Gram can be interpreted as a more principled and robust counterpart of the Skip Gram objective.
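The following sketch evaluates a regularized objective of this kind on toy data. It assumes that the first penalty is the sum of squared $\ell_2$ norms of all word and context vectors and that the second is the squared Frobenius norm of the difference between the two matrices; these forms, the function names, and the toy values are assumptions made for illustration.

```python
import math


def column(matrix, v):
    """Return column v of a d x V matrix stored as a list of rows."""
    return [row[v] for row in matrix]


def log_prob(psi, phi, y, x):
    """log p(y | x) under the log-linear model with scores (Psi y)^T (Phi x)."""
    ctx = column(phi, x)
    scores = [sum(a * b for a, b in zip(column(psi, t), ctx))
              for t in range(len(psi[0]))]
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[y] - log_z


def g1(psi, phi):
    """Assumed form: sum of squared l2 norms of all word and context vectors."""
    return (sum(e * e for row in psi for e in row)
            + sum(e * e for row in phi for e in row))


def g2(psi, phi):
    """Squared Frobenius norm of the difference Psi - Phi."""
    return sum((a - b) ** 2 for ra, rb in zip(psi, phi) for a, b in zip(ra, rb))


def robust_gram_objective(psi, phi, pairs, lam1, lam2):
    """Cross-entropy over (context, target) index pairs plus both penalties."""
    data_loss = -sum(log_prob(psi, phi, y, x) for x, y in pairs)
    return data_loss + lam1 * g1(psi, phi) + lam2 * g2(psi, phi)


psi = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]   # d x V target embeddings
phi = [[0.1, 0.5, -0.3], [0.2, -0.4, 0.1]]   # d x V context embeddings
pairs = [(0, 1), (1, 2), (2, 0)]             # (context index x, target index y)

unregularized = robust_gram_objective(psi, phi, pairs, 0.0, 0.0)
regularized = robust_gram_objective(psi, phi, pairs, 0.3, 0.3)
```

Setting both weights to zero recovers the plain cross-entropy objective, which corresponds to the unregularized baseline.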
One may ask what particular factorization Equation 6 induces. The objective searches for matrices $\Psi$, $\Phi$ that have similar eigenvectors in the vector space. A spectral PCA embedding obtains an asymmetric decomposition $W = U \Sigma V^T$ with $\Psi = U$ and $\Phi = \Sigma V$, although there is no convincing reason for the embedding matrices to be orthonormal. In the Skip Gram model, this decomposition is more symmetric, since neither $\Psi$ nor $\Phi$ is orthonormal and the diagonal weights are distributed across the factorized embeddings. A symmetric factorization would be:

$$\Psi = U \Sigma^{0.5}, \qquad \Phi = \Sigma^{0.5} V \qquad (7)$$

as in (Levy and Goldberg, 2014). The objective in Eq. 6 converges to a more symmetric decomposition since $\|\Psi - \Phi\|$ is penalized. Still, some eigenvectors across the context and target maps are allowed to differ if they pay the cost. In this sense our work is related to power SVD approaches (Caron, 2000), in which one searches for an exponent $a$ that minimizes $\|W - U \Sigma^{a} \Sigma^{1-a} V^T\|$. In our formulation, if we enforce a solution by applying a strong constraint on $\|\Psi - \Phi\|_F^2$, then our objective gradually converges to a symmetric powered decomposition such that $U \approx V$.

Experiments
The experiments are performed on a subset of the Wikipedia corpus containing approximately 15M words. For a systematic comparison, we use the same symmetric window size of 10 adopted in (Pennington et al., 2014). The stochastic gradient learning rate is set to 0.05. Embedding dimensionality is set to 100 for model selection and sensitivity analysis. Unless otherwise stated, we discard the 20 most frequent hub words to yield a final vocabulary of 26k words. To understand the relative merit of our approach, the Skip Gram model is picked as the baseline. To retain the learning speed and avoid the intractability of maximum likelihood learning, we learn our embeddings with Noise Contrastive Estimation using a negative sample (Gutmann and Hyvärinen, 2012).
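A single training update in this spirit might look like the sketch below. It is not the authors' implementation: it uses the negative-sampling simplification with one noise word as a stand-in for full Noise Contrastive Estimation, stores one vector per word (the transpose of the d x V layout), assumes squared-norm and squared-difference penalties, and applies the penalty gradients only to the vectors touched by the current pair. All names and toy values are hypothetical, and the noise word is assumed to differ from the target.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def local_objective(psi, phi, x, y, neg, lam1, lam2):
    """Negative-sampling loss for one pair plus both penalties over all vectors."""
    loss = (-math.log(sigmoid(dot(psi[y], phi[x])))
            - math.log(sigmoid(-dot(psi[neg], phi[x]))))
    g1 = sum(dot(v, v) for v in psi) + sum(dot(v, v) for v in phi)
    g2 = sum((a - b) ** 2 for pv, fv in zip(psi, phi) for a, b in zip(pv, fv))
    return loss + lam1 * g1 + lam2 * g2


def sgd_step(psi, phi, x, y, neg, lr=0.05, lam1=0.3, lam2=0.3):
    """Update psi[y], psi[neg] and phi[x] in place with one gradient step."""
    ctx, pos, ngv = list(phi[x]), list(psi[y]), list(psi[neg])
    err_pos = sigmoid(dot(pos, ctx)) - 1.0   # d loss / d score of the true pair
    err_neg = sigmoid(dot(ngv, ctx))         # d loss / d score of the noise pair

    def reg_grad(vec, partner):
        # Gradient of lam1 * ||vec||^2 + lam2 * ||vec - partner||^2 w.r.t. vec,
        # where partner is the same word's vector in the other embedding matrix.
        return [2 * lam1 * a + 2 * lam2 * (a - b) for a, b in zip(vec, partner)]

    # All gradients are computed from the pre-update values before any write.
    grad_pos = [err_pos * c + r for c, r in zip(ctx, reg_grad(pos, phi[y]))]
    grad_neg = [err_neg * c + r for c, r in zip(ctx, reg_grad(ngv, phi[neg]))]
    grad_ctx = [err_pos * p + err_neg * n + r
                for p, n, r in zip(pos, ngv, reg_grad(ctx, psi[x]))]

    psi[y] = [a - lr * g for a, g in zip(pos, grad_pos)]
    psi[neg] = [a - lr * g for a, g in zip(ngv, grad_neg)]
    phi[x] = [a - lr * g for a, g in zip(ctx, grad_ctx)]


psi = [[0.2, 0.0], [-0.1, 0.3], [0.4, -0.2]]   # target vectors, one per word
phi = [[0.1, 0.2], [0.5, -0.4], [-0.3, 0.1]]   # context vectors, one per word
before = local_objective(psi, phi, x=0, y=1, neg=2, lam1=0.3, lam2=0.3)
sgd_step(psi, phi, x=0, y=1, neg=2)
after = local_objective(psi, phi, x=0, y=1, neg=2, lam1=0.3, lam2=0.3)
```

Restricting the penalty gradients to the touched vectors keeps each update sparse, which is one plausible way to preserve the learning speed mentioned above.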

Model Selection
For model selection, we report the log likelihood (LL) of different model instances. However, exact computation of the LL is computationally difficult, since a full pass over the validation set is time-consuming with millions of samples. Hence, we compute a stochastic likelihood with a few approximation steps. We first subsample a million samples rather than using the full evaluation set, and then sample a few words to predict in the window context, similar to the approach followed in (Levy and Goldberg, 2014). Lastly, we approximate the normalization factor with one negative sample for each prediction score (Mnih and Kavukcuoglu, 2013; Gutmann and Hyvärinen, 2012). Such an approximation works well and gives smooth error curves. The reported likelihoods are averaged over 5-fold cross-validation sets.
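The approximation steps above can be sketched roughly as follows. This is an illustrative reading rather than the evaluation code used in the experiments: it assumes that (context, target) index pairs have already been extracted from the windows, that the noise word is drawn uniformly from the vocabulary, and that embeddings are stored as one vector per word. The function and variable names are hypothetical.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def approx_log_likelihood(pairs, psi, phi, n_eval, rng):
    """Average approximate log likelihood on a random subsample of pairs.

    Each pair contributes the log-sigmoid score of the true target plus that
    of one randomly drawn noise word, instead of the full normalization term.
    """
    subsample = rng.sample(pairs, min(n_eval, len(pairs)))
    total = 0.0
    for x, y in subsample:
        neg = rng.randrange(len(psi))      # uniform noise word (an assumption)
        total += log_sigmoid(dot(psi[y], phi[x]))
        total += log_sigmoid(-dot(psi[neg], phi[x]))
    return total / len(subsample)


psi = [[0.2, 0.0], [-0.1, 0.3], [0.4, -0.2]]   # target vectors, one per word
phi = [[0.1, 0.2], [0.5, -0.4], [-0.3, 0.1]]   # context vectors, one per word
pairs = [(0, 1), (1, 2), (2, 0), (0, 2)]       # (context index, target index)
ll = approx_log_likelihood(pairs, psi, phi, n_eval=3, rng=random.Random(0))
```

Because every term is the logarithm of a probability-like quantity, the estimate is always negative, and values closer to zero indicate a better fit.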
Results. Figure 1 shows the likelihood LL obtained by varying $\{\lambda_1, \lambda_2\}$. The plot shows that there exists a unique minimum and that both constraints contribute to a better likelihood compared to their unregularized counterparts (for which $\lambda_1 = \lambda_2 = 0$). In particular, the regularization imposed on the difference between context and target embeddings, $g_2$, contributes more than the regularization on the embeddings $\Psi$ and $\Phi$ separately. This is to be expected, as $g_2$ also incorporates a degree of norm bound on the vectors. The region where both constraints are employed gives the best results. Observe that we can enhance the effect of $g_2$ in a stable manner simply by adding a small amount of the bounded-norm constraint $g_1$. Relying on $g_2$ alone is risky because it is much more sensitive to the selection of $\lambda_2$. These results suggest that the convex combination of the stable $g_1$ with the potent regularizer $g_2$ yields comparably better regularization.

Sensitivity Analysis
In order to test the sensitivity of our model and the baseline Skip Gram to variations in the training set, we perform two sensitivity analyses. First, we simulate a missing-data effect by randomly dropping $\gamma \in [0, 20]$ percent of the training set. Under such a setting, robust models are expected to be affected less by the inherent variation. In addition, we inspect the effect of varying the minimum cut-off parameter to measure the sensitivity. In this experiment, from a classification perspective, each instance is a sub-task with a different number of classes (words) to predict. Instances with a small cut-off introduce classification tasks with very few training samples. This cut-off choice varies across studies (Pennington et al., 2014; Mikolov et al., 2013b), and it is usually chosen based on heuristic and storage considerations.
Results. Figure 2 illustrates the likelihood of the Robust Gram and Skip Gram models when varying the dropout ratio on the training set. As the training set shrinks, both models obtain a lower LL. Nevertheless, the likelihood of Skip Gram decays relatively faster. When a 20% drop is applied, the LL drops to 74% in the SG model. The RG model, on the other hand, not only starts with a much higher LL but also drops less, to 75.5%, suggesting that the RG objective is more resistant to random variations in the training data. Figure 3 shows the results of varying the rare-words cut-off threshold. We observe that the likelihood obtained by Skip Gram is consistently lower than that of Robust Gram. The graph shows that discarding these rare words helps the SG objective slightly. For Robust Gram, however, removing the rare words means a significant decrease in useful information, and its performance starts to degrade towards that of SG. RG avoids the overfitting that occurs in SG, but still extracts useful information to improve generalization.
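The two perturbations used in this analysis can be sketched as below. The helper names, the fixed random seed, and the toy token list are assumptions for illustration; the actual preprocessing pipeline is not specified in the text.

```python
import random
from collections import Counter


def drop_samples(samples, gamma, rng):
    """Randomly discard gamma percent of the training samples."""
    keep = round(len(samples) * (1.0 - gamma / 100.0))
    return rng.sample(samples, keep)


def apply_min_count(tokens, min_count):
    """Keep only tokens whose corpus frequency is at least min_count."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]


rng = random.Random(0)
samples = list(range(100))                      # stand-in for training samples
kept = drop_samples(samples, gamma=20, rng=rng)

tokens = "the cat sat on the mat the cat".split()
filtered = apply_min_count(tokens, min_count=2)
```

Raising `min_count` shrinks the vocabulary and thus the number of classes to predict, which is the sense in which each cut-off value defines a different sub-task.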

Word Similarity Performance
Schnabel et al. (2015) demonstrate that intrinsic tasks are a better proxy for measuring generic quality than extrinsic evaluations. Motivated by this observation, we follow the experimental setup of (Schnabel et al., 2015; Agirre et al., 2009) and compare the word correlation estimates of each model to human-estimated similarities using Spearman's correlation coefficient. The evaluation is performed on several publicly available word similarity datasets of different sizes. For datasets in which multiple subjects annotate the word similarity, we compute the average similarity score over all subjects.
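The evaluation protocol can be illustrated with a small self-contained sketch. It assumes that model similarity is measured with cosine similarity between embedding vectors, which the text does not state explicitly, and it uses made-up vectors and ratings rather than any of the benchmark datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    var_a = sum((p - ma) ** 2 for p in ra)
    var_b = sum((q - mb) ** 2 for q in rb)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy data: word pairs with averaged human ratings.
embeddings = {"car": [1.0, 0.1], "auto": [0.9, 0.2],
              "tree": [0.1, 1.0], "leaf": [0.3, 0.9]}
pairs = [("car", "auto", 9.0), ("tree", "leaf", 7.5), ("car", "tree", 1.5)]
model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [h for _, _, h in pairs]
rho = spearman(model_scores, human_scores)
```

Since Spearman's coefficient depends only on the ranking of the pairs, a model is rewarded for ordering word pairs like the human annotators do, regardless of the absolute scale of its similarity scores.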
We compare our approach to a set of techniques ranging from spectral to window-based approaches. A fully spectral approach, HPCA (Lebret and Lebret, 2013), extracts word embeddings by running a Hellinger PCA on the co-occurrence matrix. For this method, the context vocabulary upper and lower bound parameters are set to $\{1, 10^{-5}\}$, as recommended by its author. The GloVe approach (Pennington et al., 2014) formulates a weighted least squares problem to combine the global co-occurrence statistics with the efficiency of window-based approaches. Its objective can be interpreted as an alternative to the cross-entropy loss of Robust Gram. The $x_{\max}$ and $\alpha$ values of the GloVe objective are set to their defaults, 100 and 3/4. Finally, we also compare to shallow representation learning networks such as Skip Gram and Continuous Bag of Words (CBoW) (Mikolov et al., 2013a), which are competitive state-of-the-art window-based baselines.
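The role of the two GloVe parameters can be illustrated with the weighting function from the GloVe paper (Pennington et al., 2014), which scales the squared error of each word pair by a function of its co-occurrence count. The sketch below reproduces that function with the default values quoted above; the function name is hypothetical.

```python
def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weight for a word pair with the given co-occurrence count.

    Pairs below x_max are down-weighted polynomially; pairs at or above
    x_max all receive the same maximum weight of 1.
    """
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0


weights = [glove_weight(c) for c in (0, 50, 100, 500)]
```

Capping the weight at `x_max` limits the influence of very frequent pairs, which addresses the same skew in the co-occurrence statistics that motivates the regularization in Robust Gram.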
We set an equal window size for all these models and iterate three epochs over the training set. To yield more generality, all results are obtained with 300-dimensional embeddings, and subsampling parameters are set to 0. For the Robust Gram approach, we set $\lambda_1 = \lambda_2 = 0.3$. To obtain the similarity results, we use the final context embeddings $\Phi$.
Results. Table 1 reports the results. The first observation is that, in this setting, obtaining word similarity using the HPCA and GloVe methods is suboptimal. We conjecture that this scarce data regime does not favor spectral methods such as HPCA. Its poor performance can be attributed to its purely geometric reconstruction formulation, which runs into difficulties due to the amount of inherent noise. Compared to these, CBoW's performance is moderate, except on the RW dataset, where it performs the worst. Secondly, the performance of SG is relatively better than that of these approaches. Surprisingly, under this small data setting, RG outperforms all of its competitors on all datasets except RG65, a tiny dataset of 63 words containing very common words. It is plausible that RG sacrifices a bit in order to generalize to a large variety of words. Note that it wins by an especially large margin on the MEN and Rare Words (RW) datasets, which have the largest number of similarity query samples. As the number of query samples increases, the similarity modeling accuracy of the RG embeddings becomes clearly perceptible. The promising result that Robust Gram achieves on the RW dataset also sheds light on why CBoW performed worst on RW: CBoW overfits rapidly, confirming recent studies on the stability of CBoW (Luo et al., 2014). Finally, these word similarity results suggest that RG embeddings can yield much more generality under data scarcity.

Conclusion
This paper presents a regularized word embedding approach, called Robust Gram. In this approach, the model complexity is penalized by suppressing deviations between the embedding spaces of the target and context words. Various experimental results show that RG maintains robust behaviour under small sample sizes and sample perturbations, and that it reaches a higher word similarity performance than its competitors. The gain from Robust Gram increases notably as diverse test sets are used to measure the word similarity performance. In future work, building on the promising results of Robust Gram, we intend to explore the model's behaviour in various settings. In particular, we plan to model various corpora, e.g. predictive modeling of sequentially arriving network packets. Another future direction might be encoding available domain knowledge through additional regularization terms; for instance, knowledge of synonyms can be used to reduce the degrees of freedom of the optimization. We also plan to enhance the underlying optimization by designing Elastic constraints (Zou and Hastie, 2005) specialized for word embeddings.