Evaluating Natural Alpha Embeddings on Intrinsic and Extrinsic Tasks

Skip-Gram is a simple but effective model to learn a word embedding by estimating a conditional probability distribution for each word of the dictionary. In the context of Information Geometry, these distributions form a Riemannian statistical manifold, where word embeddings are interpreted as vectors in the tangent bundle of the manifold. In this paper we show how the choice of the geometry on the manifold impacts the performances on both intrinsic and extrinsic tasks, as a function of a deformation parameter alpha.


Introduction
Word embeddings are compact representations for the words of a dictionary. Rumelhart et al. (1986) first introduced the idea of using the internal representation of a neural network to construct a word embedding. Bengio et al. (2003) employed a neural network to predict the probability of the next word given the previous ones. Mikolov et al. (2010) proposed the use of a recurrent language model based on RNNs to learn the vector representations. More recently, this approach has been exploited further, with great success, by means of bidirectional LSTMs (Peters et al., 2018) and transformers (Radford et al., 2018; Devlin et al., 2018; Yang et al., 2019). In this paper we focus on Skip-Gram (SG), a well-known model for the conditional probability of the context of a given central word, which has been shown to efficiently capture syntactic and semantic information. SG is at the basis of many popular word embedding algorithms, such as Word2Vec (Mikolov et al., 2013a,b), the continuous bag of words (Mikolov et al., 2013a,b), and models based on weighted matrix factorization of the global co-occurrences such as GloVe (Pennington et al., 2014), cf. Levy and Goldberg (2014). These methods are deeply related: Levy and Goldberg showed that Word2Vec SG with negative sampling is effectively performing a matrix factorization of the Shifted Positive PMI (Levy and Goldberg, 2014).
It has been noted (Mikolov et al., 2013c) that, once the embedding space has been learned, syntactic and semantic analogies between words translate into linear relations between the respective word vectors. Numerous works have investigated the reason for this correspondence between linear properties and word relations. Pennington et al. gave a very intuitive explanation in their paper on GloVe (Pennington et al., 2014). More recently, Arora et al. (2016) studied this property by introducing a hidden Markov model, under some regularity assumptions on the distribution of the word embedding vectors, cf. (Mu et al., 2017). Word embeddings are also often used as input for another computational model, to solve more complex inference tasks. The quality of a word embedding, which ideally should encode syntactic and semantic information, is not easy to determine, and different evaluation approaches have been proposed in the literature. The evaluation can be in terms of performance on intrinsic tasks such as word similarity (Bullinaria and Levy, 2007, 2012; Pennington et al., 2014; Levy et al., 2015) or word analogies (Mikolov et al., 2013c,a); however, several authors (Tsvetkov et al., 2015; Schnabel et al., 2015) have shown a low degree of correlation between the quality of an embedding on word similarities and analogies on one side, and on downstream (extrinsic) tasks, for instance classification or prediction, to which the embedding is given as input, on the other.
Several works have highlighted the effectiveness of post-processing techniques (Bullinaria and Levy, 2007, 2012), such as PCA (Raunak, 2017; Mu et al., 2017), based on the observation that certain dominant components carry neither semantic nor syntactic information and thus act like noise for certain tasks of interest. A different approach, which also acts on the learned vectors after training, has recently been proposed by Volpi and Malagò (2019). The authors present a geometrical framework in which word embeddings are represented as vectors in the tangent space of a probability simplex. A family of word embeddings called natural alpha embeddings is introduced, where α is a deformation parameter for the geometry of the probability simplex, known in Information Geometry in the context of α-connections (Amari and Nagaoka, 2000; Amari, 2016). Notably, alpha word embeddings include the classical word embeddings as a special case. In this paper we provide an experimental evaluation of natural alpha embeddings over different tasks, both intrinsic and extrinsic, including word similarities and analogies as well as downstream tasks such as document classification and sentiment analysis, in order to study the impact of the geometry on performances.

Conditional Models and the Embeddings Structure
The Skip-Gram conditional model (Mikolov et al., 2013b; Pennington et al., 2014) allows the unsupervised training of a set of word embeddings, by predicting the conditional probability of any word χ appearing in the context of a central word w, p_w(χ) = exp(u_w^T v_χ) / Z_w, with Z_w = Σ_{χ∈D} exp(u_w^T v_χ) the partition function. The conditional model represents an exponential family in the simplex, parameterized by two matrices U and V of size n×d, where n is the cardinality of the dictionary D and d is the size of the embeddings. We will refer to the rows of a matrix V as v_χ or V_χ, and to its columns as V^k. It is common practice in the word embedding literature to consider u_w, or alternatively u_w + v_w, as the embedding vector for w (Bullinaria and Levy, 2012; Mikolov et al., 2013a,b; Pennington et al., 2014; Raunak, 2017). In the remaining part of this section we briefly review the natural alpha embeddings and the limit embeddings, based on the Information Geometry framework. We refer the reader to Volpi and Malagò (2019) for more details and mathematical derivations.
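As a concrete illustration, the Skip-Gram conditional probability is a softmax over inner products. A minimal NumPy sketch with toy matrices (all names and sizes are illustrative):

```python
import numpy as np

def skipgram_conditional(U, V, w):
    """p_w(chi) = exp(u_w^T v_chi) / Z_w over the dictionary D."""
    scores = V @ U[w]          # u_w^T v_chi for every word chi in D
    scores -= scores.max()     # shift for numerical stability (cancels in Z_w)
    p = np.exp(scores)
    return p / p.sum()         # normalize by the partition function Z_w

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))    # toy dictionary: n = 5 words, d = 3 dimensions
V = rng.normal(size=(5, 3))
p = skipgram_conditional(U, V, w=2)
```

Each row `U[w]` thus picks out one member of the exponential family, i.e. one point on the statistical manifold.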

Alpha Embeddings
After training, the matrices U and V are fixed. For each w, the conditional model p w (χ) is an exponential family E in the n − 1 dimensional simplex, where n is the size of the dictionary. This models the probability of a word χ in the context, when w is the central word. The sufficient statistics of this model are determined by the columns of V , while each row u w of U can be seen as an assignment for the natural parameters, i.e., each row identifies a probability distribution.
In the language of Information Geometry, a statistical model can be endowed with the structure of a Riemannian manifold, given by the Fisher information matrix, together with a family of α-connections (Amari, 1985; Amari and Nagaoka, 2000; Amari, 2016). The alpha embeddings are defined up to the choice of a reference distribution p_0. The natural alpha embedding of a given word w is defined as the projection of the logarithmic map Log^α_{p_0} p_w onto the tangent space T_{p_0}E of the submodel. The main intuition is that the word embedding of w corresponds to the vector in the tangent space which allows one to reach the distribution of the context of w from p_0. By deforming the simplex continuously with a family of isometries depending on a parameter alpha, and by considering a family of α-logarithmic maps, depending on the choice of the α-connection, a family of natural alpha embeddings W^α_{p_0}(w) can be defined as a function of the deformation parameter α (Eq. (2)), where ΔV(p_0) is the matrix of sufficient statistics centered in p_0. The Fisher metric is simply computed as the metric of an exponential family (Amari and Nagaoka, 2000) and does not depend on alpha, since the family of alpha divergences induces the same Fisher information metric for any value of alpha.
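The centered sufficient statistics and the Fisher metric in p_0 can be computed explicitly: for an exponential family, the Fisher matrix is the covariance of the sufficient statistics under p_0, i.e. I(p_0) = ΔV(p_0)^T diag(p_0) ΔV(p_0). A minimal sketch (function names are illustrative):

```python
import numpy as np

def centered_stats(V, p0):
    """Delta V(p0): sufficient statistics centered in the reference distribution p0."""
    return V - p0 @ V                  # subtract the mean row E_{p0}[V]

def fisher_metric(V, p0):
    """I(p0) = Delta V^T diag(p0) Delta V: covariance of the statistics under p0."""
    dV = centered_stats(V, p0)
    return dV.T @ (p0[:, None] * dV)

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 3))            # toy sufficient statistics, n = 6, d = 3
p0 = np.full(6, 1 / 6)                 # uniform reference distribution
I0 = fisher_metric(V, p0)
```

By construction I0 is symmetric and positive semi-definite, and, as noted above, it is the same for every value of alpha.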
The notion of alpha embeddings can be used both for downstream tasks and to evaluate similarities and analogies in the tangent space of the manifold (Volpi and Malagò, 2019). Given two words a and b, a measure of similarity between their alpha embeddings is defined in Eq. (5), while analogies of the form a : b = c : d can be solved by minimizing an analogy measure κ, Eq. (6). It is possible to show that, for α = 1 and choosing p_0 equal to the uniform distribution, the embeddings of Eq. (2) reduce to the standard vectors u_w. Furthermore, by substituting the Fisher information matrix I(p_0) with the identity, Eqs. (5) and (6) reduce to the standard formulas used in the literature for similarities and analogies.
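Since Eqs. (5) and (6) are not reproduced here, the following sketch only illustrates the general pattern: a similarity computed as a normalized inner product with respect to a metric G, which reduces to the usual cosine similarity when G is the identity (all names are illustrative):

```python
import numpy as np

def metric_similarity(x, y, G):
    """Normalized inner product <x, y>_G / (||x||_G * ||y||_G)."""
    return (x @ G @ y) / np.sqrt((x @ G @ x) * (y @ G @ y))

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])
cos = metric_similarity(x, y, np.eye(2))   # G = I: plain cosine similarity
```

Replacing `np.eye(2)` with the Fisher matrix I(p_0) gives the metric-aware variant used by the 'F' methods in the experiments.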
The embedding vectors u + v have been shown to provide better results (Pennington et al., 2014) than u alone. In the context of natural alpha embeddings, the vectors u + v can be interpreted as a recentering of the natural parameters u of the exponential family. This corresponds to a reweighting of the probabilities in Eq. (1) based on a change of reference measure proportional to exp(v_w^T v_χ), i.e., by giving more weight to those words χ in the context whose outer vectors are aligned to the outer vector of the central word w.
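This recentering can be verified numerically: the softmax with natural parameters u_w + v_w coincides with the original conditional probability reweighted by exp(v_w^T v_χ). A toy check with random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, w = 6, 4, 0
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

p_shift = softmax(V @ (U[w] + V[w]))   # natural parameters recentered to u_w + v_w

p = softmax(V @ U[w])                  # original p_w ...
q = p * np.exp(V @ V[w])               # ... reweighted by exp(v_w^T v_chi)
q /= q.sum()                           # renormalize after the change of measure
```

The two distributions `p_shift` and `q` agree, since exp(v_χ^T(u_w + v_w)) = exp(v_χ^T u_w) exp(v_χ^T v_w) up to normalization.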

Limit Embeddings
The behavior of the alpha embeddings for α progressively approaching minus infinity turns out to be particularly interesting. In this case, l^α_{p_0 w}(χ) becomes progressively more peaked on a single word χ*_w and its norm grows, see Eq. (3). By normalizing these alpha embeddings to preserve the direction of the tangent vector, a simple formula can be obtained, Eq. (9), depending only on the χ*_w row of the matrix of sufficient statistics ΔV(p_0), leading to simple geometrical methods in the limit. Let us notice that the same row ΔV_a can be associated to multiple words, thus limit embeddings also naturally induce a clustering in the embedding space.
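A sketch of the normalized limit embedding, under the assumption (labeled as such, since Eq. (9) is not reproduced here) that the limit α → -∞ concentrates on the word χ*_w maximizing the ratio p_w(χ)/p_0(χ):

```python
import numpy as np

def limit_embedding(U, V, p0, w):
    """Normalized limit embedding of word w (illustrative sketch).

    Assumption: for alpha -> -inf the direction depends only on the row of
    Delta V(p0) indexed by chi*_w = argmax_chi p_w(chi) / p0(chi).
    """
    dV = V - p0 @ V                      # centered sufficient statistics Delta V(p0)
    s = V @ U[w]
    pw = np.exp(s - s.max())
    pw /= pw.sum()                       # conditional distribution p_w
    chi_star = int(np.argmax(pw / p0))   # most over-represented context word
    e = dV[chi_star]
    return e / np.linalg.norm(e)         # keep only the direction

rng = np.random.default_rng(2)
U = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 4))
p0 = np.full(8, 1 / 8)
e = limit_embedding(U, V, p0, w=3)
```

Since distinct words w can share the same argmax row, several words map to the same unit vector, which is exactly the clustering behavior noted above.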

Experiments
We considered two corpora, enwiki and geb (see Appendix A for details). The alpha embeddings in Eq. (2) will be denoted with 'E' in figures and tables, while the limit embeddings in Eq. (9) will be denoted with 'LE'. Embeddings have been normalized either with the Fisher information matrix (F) or with the identity (I). Similarly, after normalization, the scalar products can be computed with the respective metric (on the tasks that require scalar product computation). In this study, normalization and scalar product always use the same metric. For the reference distribution needed for the computation of the alpha embeddings, we have chosen the uniform distribution (0), the unigram distribution of the model (u), obtained by marginalization of the joint distribution learned by the model, or the unigram distribution estimated from the corpus data (ud). Embeddings are denoted by 'U' if, in the computation of Eqs. (2) and (9), the formula used for p_w is Eq. (1), while they are denoted by 'U+V' if Eq. (7) is used instead.
We evaluated the alpha embeddings on intrinsic (similarities, analogies, concept categorization) and extrinsic (document classification, sentiment analysis) tasks.

Intrinsic Tasks
In Fig. 1 we report results for similarities and analogies with embedding size 300. For similarities we use ws353 (Finkelstein et al., 2001) and other standard benchmarks; as baselines we consider Baroni et al. (2014) and LGD, the best methods in cross-validation with fixed window sizes of 10 and 5 (for varying hyperparameters) reported by Levy et al. (2015). For analogies we use the Google analogy dataset (Mikolov et al., 2013a). The limit embeddings (colored dotted lines) achieve good performances on both tasks, above the competitor methods from the literature: U and U+V centered and normalized by column, as described in Pennington et al. (2014). A comparison with baseline methods from the literature on word similarity is presented in Table 1; we compare with the limit embeddings, since they usually seem to be the best performing on the similarity task (see Fig. 1), and with other comparable baselines from the literature with similar window size. In Table 2 we report the best performances of the alpha embeddings on the analogy task, where alpha is selected with cross-validation (Table 3). For enwiki syn, the limit embedding has been found to work better instead. The errors reported are obtained by averaging the test performances of the top three alphas selected based on best validation performance. The errors obtained are relatively small, which indicates that tuning alpha is easy also on tasks with a small amount of data for cross-validation. The best tuned alpha on the geb dataset completely outperforms the baselines. The last intrinsic tasks considered are cluster purity for the concept categorization datasets AP (Almuhareb, 2006) and BLESS (Baroni and Lenci, 2011). The purity curves (Fig. 2) are noisier, since the datasets available for this task are quite limited in size. Almost all the curves exhibit a peak, which is relatively more pronounced for smaller embedding sizes, while the limit behaviour for very negative alphas performs better for larger embedding sizes.
This points to the fact that the natural clustering performed by the limit embeddings of Eq. (9) is better behaved when the dimension of the embedding grows: increasing the embedding size increases the number of sufficient statistics, thus allowing more flexibility for the limit clustering during training.

Figure 2: Cluster purity on concept categorization task.
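Cluster purity, the measure reported in Fig. 2, assigns each cluster its majority label and counts the fraction of correctly assigned points; a standard implementation:

```python
import numpy as np

def purity(cluster_ids, labels):
    """Fraction of points whose label matches the majority label of their cluster."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    correct = 0
    for c in np.unique(cluster_ids):
        member_labels = labels[cluster_ids == c]
        correct += np.bincount(member_labels).max()   # size of the majority label
    return correct / labels.size

perfect = purity([0, 0, 1, 1], [0, 0, 1, 1])   # every cluster is pure
mixed = purity([0, 0, 0, 0], [0, 0, 1, 1])     # one cluster, two labels
```

A purity of 1.0 means every cluster contains a single gold-standard category.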

Extrinsic Tasks
As extrinsic tasks we choose 20 Newsgroups multi-class classification (Lang, 1995) and IMDB Reviews sentiment analysis (Maas et al., 2011). Embeddings are normalized before training, either with I or F. We use a linear architecture (BatchNorm + Dense) for both tasks, while for sentiment analysis we also use a recurrent architecture (bidirectional LSTM with 32 channels, GlobalMaxPool1D, Dense 20 + Dropout 0.05, Dense). In Tables 4 and 5 we report the best methods chosen with respect to the validation set, and the best limit embedding performances, for embedding size 300. A more complete set of experiments can be found in the Appendix. Limit embeddings have been generalized by considering, instead of only the max row χ* (see Sec. 2.2), the top k rows of ΔV. Limit embeddings are evaluated with respect to the top 1, 3, and 5 rows, denoted -t1/3/5. Furthermore, we denote by -w a weighted average (with weights p_w(χ)/p_0(χ)) over the top rows of ΔV. The improvements reported in the tables are small but consistent, above 0.5% accuracy on both Newsgroups and IMDB Reviews; furthermore, the improvement persists also with the increased complexity of the network architecture (bidirectional LSTM). We also report curves for the test values with early stopping based on validation, for embedding sizes 50 and 300. The improvements from tuning alpha are higher for size 50, exhibiting a more evident peak. For size 300, improvements are smaller but consistent. In particular, a peak performance for alpha can always be easily identified for a chosen reference distribution and normalization.
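The top-k generalization described above can be sketched as follows, under the same assumption as before (that the limit selects context words by the ratio p_w(χ)/p_0(χ)); the '-w' variant averages the top rows of ΔV(p_0) with those ratios as weights:

```python
import numpy as np

def topk_limit_embedding(U, V, p0, w, k=3, weighted=True):
    """Top-k limit embedding (-t1/3/5; -w for the weighted variant). Sketch."""
    dV = V - p0 @ V                    # centered sufficient statistics Delta V(p0)
    s = V @ U[w]
    pw = np.exp(s - s.max())
    pw /= pw.sum()                     # conditional distribution p_w
    ratio = pw / p0                    # weights p_w(chi) / p0(chi)
    top = np.argsort(ratio)[-k:]       # indices of the k largest ratios
    wts = ratio[top] if weighted else np.ones(k)
    e = wts @ dV[top] / wts.sum()      # (weighted) average of the top-k rows
    return e / np.linalg.norm(e)

rng = np.random.default_rng(3)
U = rng.normal(size=(10, 5))
V = rng.normal(size=(10, 5))
p0 = np.full(10, 0.1)
e1 = topk_limit_embedding(U, V, p0, w=0, k=1)
e3 = topk_limit_embedding(U, V, p0, w=0, k=3)
```

With k = 1 this reduces to the plain limit embedding of Sec. 2.2; larger k smooths the induced clustering.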

Conclusions
For word similarities and analogies, alpha embeddings provide significant improvements over baseline methods (corresponding to α = 1). For the other tasks the improvements are smaller but consistent, depending on the value of α, the chosen reference distribution (0, u, ud), and the chosen normalization method (I, F). The improvements persist also when increasing the complexity of the networks used (linear vs BiLSTM). This motivates further studies on more complex architectures, for example models employing transformers, with the aim of closing the experimental gap with the state of the art.
The best value of alpha depends both on the task and on the dataset. Alpha embeddings thus provide an extra handle on the optimization problem, allowing one to choose the deformation parameter based on the data. Alpha values lower than 1, and in particular negative ones, seem to be preferred across most tasks. Limit embeddings provide a simple method which does not require validation over alpha, but can still offer an improvement on several tasks of interest. Furthermore, limit embeddings can be interpreted as a natural clustering of the embedding space learned by the SG model itself during training. Performances of the limit embeddings grow with increasing dimension, pointing to the possibility of a consistent improvement in higher embedding dimensions without tuning alpha.

A Additional Details
We have performed experiments using two corpora: the English Wikipedia dump of October 2017 (enwiki), and an augmented version of it including Gutenberg (Gutenberg) and BookCorpus (BookCorpus; Kobayashi), which we call geb (gutenberg, enwiki, bookcorpus). We used the wikiextractor Python script (Attardi) to parse the Wikipedia dump XML file. A minimal preprocessing has been applied: lowercasing all letters, removing stop-words, and removing punctuation. We use a cut-off minimum frequency (m0) of 1000 during GloVe training (Pennington et al., 2014). We obtained a dictionary of about 67k words for both enwiki and geb. The window size was set to 10 as in (Pennington et al., 2014), with a weighting rate decaying as 1/d from the center for the calculation of co-occurrences. We trained the models for a maximum of 1000 epochs. Embedding sizes used are 50 and 300. We also report limit embedding performances.

Table 11: Analogy tasks for the different methods on enwiki and geb. The best alpha is selected with a 3-fold cross-validation (α between -10 and 10). The methods reported implement either Euclidean normalization (I) or normalization with the Fisher (F) at different points on the manifold (0, u). Scalar products (-p) are always calculated with respect to the identity (I) in this table.