Word Representations Concentrate and This is Good News!

This article establishes that, unlike the legacy tf*idf representation, recent natural language representations (word embedding vectors) tend to exhibit a so-called concentration of measure phenomenon, in the sense that, as the representation size p and database size n are both large, their behavior is similar to that of large dimensional Gaussian random vectors. This phenomenon may have important consequences as machine learning algorithms for natural language data could be amenable to improvement, thereby providing new theoretical insights into the field of natural language processing.


Introduction
One of the reasons for the success of deep learning (DNN) representations (such as image features or word embeddings) lies in their high-performing and stable behavior when used, for instance, as inputs to classification or regression algorithms. We hypothesize that the underlying explanation is that, from the point of view of these learning algorithms, these (usually large dimensional) efficient representations exhibit a behavior "akin" (although formally different) to large dimensional random Gaussian vectors. In a sense, one may think of these "compressed raw data" representations as being all the better as they have a large "entropy", i.e., as they are composed of independent and isotropic components (otherwise, according to information theory, one may compress them even more). However, large dimensional "Gaussian-like" representation vectors $x \in \mathbb{R}^p$ display quite counter-intuitive behavior when compared to small dimensional data, thereby disrupting our standard approach to machine learning. In particular, they naturally suffer from various sources of the curse of dimensionality: for instance, from the law of large numbers, $\frac{1}{p}\|x\|^2 = \frac{1}{p}\sum_{i=1}^p x_i^2$ tends to converge as $p \to \infty$, so that the representations, rather than occupying all of $\mathbb{R}^p$, concentrate near the surface of a sphere. Worse, the normalized distance $\frac{1}{p}\|x_1 - x_2\|^2 = \frac{1}{p}\|x_1\|^2 + \frac{1}{p}\|x_2\|^2 - \frac{2}{p} x_1^T x_2 \simeq \frac{1}{p}\|x_1\|^2 + \frac{1}{p}\|x_2\|^2$ (due to $\frac{1}{p} x_1^T x_2 \to 0$ for vectors of independent entries) loses the information of correlation between $x_1$ and $x_2$.
Fortunately, these curses of dimensionality can be turned into blessings. Again by (advanced versions of) the law of large numbers, the behavior of machine learning algorithms running on Gaussian-like data becomes amenable to theoretical analysis, in particular using recent advances in the fields of large dimensional statistics and random matrix theory. These analyses in turn allow for the performance prediction, improvement, and optimization of machine learning methods on real data. Consequently, proving that data representations behave like Gaussian vectors makes it possible to theoretically control the learning algorithms designed to handle these data.
In a recent line of works, Couillet and co-authors suggest and theoretically support that DNN representations are indeed not Gaussian per se, but closely resemble concentrated random vectors. By definition, a concentrated random vector $x \in \mathbb{R}^p$ is a vector which satisfies a concentration of measure phenomenon in the sense of Ledoux (2001): in essence, concentration means that x does not converge (quite the opposite) but any scalar Lipschitz observation $g(x) \in \mathbb{R}$ of x concentrates around its statistical mean when the size p of x increases; Figure 1 schematically illustrates the concentration of measure phenomenon. In particular, a key property for the present article is that the distance between any two concentrated random vectors $x_1$ and $x_2$ with "nice properties" converges to a constant value, which only depends on the data statistics, and is in particular independent of their random realization. This fundamental phenomenon, not true for small dimensional data, is at the core of our present study.
Figure 1: Concentration of two Lipschitz functionals ($g_1(x) = x^T 1_p / \sqrt{p}$ and $g_2(x) = \|x\|_\infty$). While x "spreads out" in its ambient space, $g_1(x)$ and $g_2(x)$ converge.
In detail, Seddik et al. (2020) show that natural images and their modern representations (such as VGG or ResNet embeddings) can be appropriately modelled by concentrated random vectors: they precisely prove that the extremely realistic images produced by modern generative adversarial networks (GANs) are by definition concentrated random vectors. Besides, in Louart and Couillet (2018), the authors establish a universality result which proves that the performance of many machine learning algorithms, from support vector machines to (kernel) spectral clustering, applied to concentrated random data is asymptotically the same as if the data had been Gaussian random vectors with the same first and second order statistics. These findings have important consequences for modern machine learning: in particular, they ensure that even involved algorithms applied to real data are analytically tractable, and that their performances can be anticipated and improved offline (without the need for cross-validation).
As (possibly large dimensional) vector representations of words and documents have become a basic building block of many natural language processing methods (Turney and Pantel, 2010), in particular since the success of word embeddings such as word2vec (Mikolov et al., 2013) and Glove (Pennington et al., 2014), two natural questions arise: (i) do word (and document) representations exhibit concentration of measure phenomena? and (ii) do some of the aforementioned findings on real images extend to words and textual documents?
The present article empirically investigates these questions and claims to reach a positive answer. Specifically, the main contributions of the article are as follows:
1. We empirically establish that recent word embedding representations can suffer a distance concentration phenomenon, typical of concentrated random vectors but usually considered as a manifestation of the curse of dimensionality;
2. We empirically confirm that these word embeddings, unlike tf*idf vectors, exhibit a universality phenomenon in the following sense: letting $x_1, \ldots, x_n \in \mathbb{R}^p$ be n word or document representations of dimension p, the kernel matrix $K = \{f(\frac{1}{p}\|x_i - x_j\|^2)\}_{i,j=1}^n$ for some smooth function f has the same behavior (entry-wise and spectral) as the matrix built in the same way out of Gaussian random vectors having the same statistical mean and covariance as the original data;
3. As a concrete application, the classification performances achieved by a kernel (least-squares) support vector machine applied to classes of documents of popular databases are shown to be theoretically predictable and to match the theory established on mere Gaussian random vectors, thereby confirming the universality property of word embedding representations and the possibility to use a simple Gaussian vector theory to predict the performance of machine learning algorithms for natural language processing.
Related works. Several works similarly tried to reinterpret word embeddings, either in terms of matrix factorization (Levy and Goldberg, 2014b) or latent models (Arora et al., 2016), and to account for the associations and analogies typical of the linear behavior of these embeddings (Levy and Goldberg, 2014a; Bolukbasi et al., 2016; Gittens et al., 2017; Ethayarajh et al., 2019a,b; Allen and Hospedales, 2019). In a different line of research, many attempts were made to understand the syntactic and semantic generalization capabilities of different deep learning models based on word embeddings, as in Dessì and Baroni (2019); Hewitt and Manning (2019); Lakretz et al. (2019); Chi et al. (2020), to list a few. Our approach is however different in that it tries to statistically model word embeddings so as to grasp the behavior of related machine learning algorithms. To the best of the authors' knowledge, this is the first time this approach is being investigated.
Preliminaries and first observations

Asymptotics of learning
From a crude viewpoint, machine learning algorithms may be seen as functionals $F_\theta : \mathbb{R}^{p \times n} \times \mathbb{R}^p \to \mathbb{R}$, $(X, x) \mapsto F_\theta(X, x)$ which, for an input training data matrix $X = [x_1, \ldots, x_n]$ and a test datum x, return a soft scalar score or a hard decision. Here $\theta$ accounts for the possible hyperparameter vector used to fine-tune the algorithm. Assuming the training dataset $X \in \mathbb{R}^{p \times n}$ to be a random matrix with some prescribed distribution (and similarly for x), evaluating the performances of $F_\theta$ boils down to establishing the statistics of the random variable $F_\theta(X, x)$. This has long been a cumbersome, if not impossible, task which has mainly been studied so far in the asymptotic regime $n \to \infty$ with p fixed. Yet, these results have long remained of little use, not very expressive, and of limited interest when n is not much larger than p, this being in particular due to the non-linear (and often even implicit) nature of $F_\theta$. Random matrix theory and statistical physics have recently changed this paradigm and managed to break the non-linearity barrier by showing that, as $n, p \to \infty$ simultaneously (thereby mimicking the modern large and numerous data setting), the performances of many non-trivial learning algorithms become tractable since they converge to some deterministic limits (Couillet et al., 2016). These latest results are based on sufficiently "stable" random models for X (and x): statistical physics uses isotropy and symmetries, which however often reduce to standard Gaussian data assumptions, while random matrix theory is richer and has lately exploited the Lipschitz stability offered by concentrated random vector models (Louart and Couillet, 2018). By definition, a random vector z in a vector space S is concentrated if, for every 1-Lipschitz functional $g : S \to \mathbb{R}$ and all $\varepsilon > 0$,
$$P(|g(z) - m_g| > \varepsilon) \le C e^{-c \varepsilon^2}$$
for some constants $C, c > 0$ and $m_g$ a median of $g(z)$.
That is, z itself may not converge in any usual sense (in general it does not: for instance $z \sim \mathcal{N}(0, I_p)$ is concentrated but does not converge) but its Lipschitz functionals, also called observations of z, do converge (e.g., $\frac{1}{\sqrt{p}}\|z\| \to 1$ almost surely). Recall Figure 1 for a visual intuition. Concentrated random vector modelling is particularly convenient as it ensures that, if X is, say, a concentrated random matrix, then for any Lipschitz function G (that outputs either small or large dimensional data), $G(X)$ is still concentrated and, in particular, functionals $G : \mathbb{R}^{p \times n} \to \mathbb{R}$ are such that $G(X)$ almost surely converges.
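As an illustration of the statement above, the following minimal sketch draws standard Gaussian vectors of growing dimension and tracks the observation $\|z\|/\sqrt{p}$. The function name, the number of draws, and the chosen dimensions are illustrative assumptions, not part of the original study; the only point is that the spread of the observation shrinks as p grows while z itself keeps "spreading out".

```python
import math
import random
import statistics


def normalized_norms(p, n_draws=200, seed=0):
    """Draw n_draws vectors z ~ N(0, I_p) and return the values ||z|| / sqrt(p)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_draws):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(p))
        values.append(math.sqrt(squared_norm / p))
    return values


# Illustrative dimensions (assumption): small, moderate and large p.
for p in (4, 40, 400):
    obs = normalized_norms(p)
    mean, std = statistics.mean(obs), statistics.pstdev(obs)
    print(f"p={p:4d}  mean={mean:.3f}  std={std:.3f}")
```

The standard deviation reported for the largest p should be markedly smaller than for the smallest one, which is the concentration of this particular observation around 1.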
It is proved in Louart and Couillet (2018) that, for a rich family of functionals $F_\theta$, if X and x are concentrated, not only does $F_\theta(X, x)$ converge, but it converges to the same limit as $F_\theta(X', x')$ for $X'$ and $x'$ a random Gaussian matrix and vector having the same statistics (mean and covariance) as X and x, respectively. This is a classical but fundamental result in random matrix theory, referred to as universality.
Remark 1 (When are n, p large enough?). Although random matrix theory predicts the asymptotic convergence of algorithms as $n, p \to \infty$, these results are only useful if, in practice, n and p need not be extremely large. As a matter of fact, and quite surprisingly, the large dimensional effects arise very rapidly so that, in practice, n, p of the order of hundreds (sometimes even tens) are enough for an asymptotic behavior to emerge. This is explained by the numerous ($O(np)$) degrees of freedom inherent to the data, which in particular induce rates of convergence, e.g., in central limit theorems, at speed $1/\sqrt{np}$ instead of $1/\sqrt{n}$ when $n \to \infty$ alone. Word embedding vectors, of size $p \sim 100$ or more, naturally enter this regime.

How to testify of a concentration of measure phenomenon?
With this introductory overview in mind, the question naturally arises of the relevance of concentrated random vector modelling for practical data. As pointed out in the introduction, the synthetic images produced by GANs (Goodfellow et al., 2014) are by definition concentrated random vectors: this is because they are bounded Lipschitz functions (the Lipschitz operator being the pre-trained neural network) of a Gaussian random vector which is itself concentrated. Genuine images being so well approximated by GAN synthetic images, this strongly suggests that real images can be modelled as concentrated random vectors, which is confirmed by simulation results in Seddik et al. (2020).
But words and documents are so far not reliably produced by GANs and it is unclear whether they might embrace the concentration of measure phenomenon. The objective of the article is to empirically assess whether the most salient phenomena occurring in concentrated random vectors, namely the convergence of distances between distinct vectors and the (Gaussian-like) universality behavior, are observed on word and document representations.

Concentration of distances, and kernel spectrum

Concentration of distance
A first phenomenon arising in concentrated random vectors, which disrupts standard machine learning intuition, is the convergence of distances phenomenon. Specifically, if $x_1, \ldots, x_n \in \mathbb{R}^p$ are i.i.d. concentrated random vectors with $C \equiv \mathrm{Cov}(x_i)$ of bounded spectral norm, then, as $p, n \to \infty$ in such a way that n grows no more than polynomially with p (which is the case, for example, if p/n is constant),
$$\max_{1 \le i \ne j \le n} \left| \frac{1}{p}\|x_i - x_j\|^2 - \tau_p \right| \to 0 \qquad (1)$$
almost surely, where $\tau_p \equiv \frac{2}{p} \mathrm{tr}\, C$. That is, the distances between any pair of data all converge to the same limit.
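Besides, and most importantly, if the $x_i$'s are drawn from a mixture of k distribution classes with means $\mu_1, \ldots, \mu_k$ and covariances $C_1, \ldots, C_k$ such that $\|\mu_a - \mu_b\| = O_p(1)$ and $\mathrm{tr}(C_a - C_b) = O_p(\sqrt{p})$, then (1) remains valid. This means that the classes cannot asymptotically be distinguished by the data distances. Here $\tau_p$ can be taken to be any $\frac{2}{p} \mathrm{tr}\, C_a$, for $a \in \{1, \ldots, k\}$. The setting $\|\mu_a - \mu_b\| = O_p(1)$ and $\mathrm{tr}(C_a - C_b) = O_p(\sqrt{p})$ is referred to as the non-trivial classification regime.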
Remark 2 (On "non-trivial" classification). The above two assumptions $\|\mu_a - \mu_b\| = O_p(1)$ and $\mathrm{tr}(C_a - C_b) = O_p(\sqrt{p})$ are quite natural to model a non-trivial, that is neither too easy nor too hard, classification scenario. In other words, if either $\|\mu_a - \mu_b\|$ or $\frac{1}{\sqrt{p}} \mathrm{tr}(C_a - C_b)$ were to increase with p, then a simple Bayesian analysis demonstrates that a trivial algorithm can achieve asymptotically perfect classification as p increases; conversely, if both $\|\mu_a - \mu_b\|$ and $\frac{1}{\sqrt{p}} \mathrm{tr}(C_a - C_b)$ were to vanish as p increases, it is theoretically impossible to retrieve the classes with any algorithm. In practice, of course, p remains fixed so that the conditions are mostly quantitative: in fact, "good" vector representations will tend to have rather large values of $\|\mu_a - \mu_b\|$ and sometimes fall in a rather trivial regime (the classification task is then easy in general and most standard algorithms perform well), while other representations may be less discriminative, in which case classification is non-trivial and a well-tailored classification algorithm must be devised.
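The distance concentration described above can be checked numerically on synthetic data. The sketch below assumes a two-class Gaussian mixture with identity covariance and means $\pm\mu$ (so that $\tau_p = 2$ and the mean gap stays $O(1)$); the function names, sample sizes, and mean gap are illustrative choices and not taken from the original experiments.

```python
import numpy as np


def normalized_sq_distances(X):
    """Off-diagonal values of (1/p) * ||x_i - x_j||^2 for the columns x_i of X."""
    p, n = X.shape
    gram = X.T @ X / p
    sq_norms = np.diag(gram)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    return D[~np.eye(n, dtype=bool)]


def two_class_mixture(p, n, mean_gap=2.0, seed=0):
    """Columns drawn from N(+mu, I_p) or N(-mu, I_p), with ||2 mu|| = mean_gap."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = mean_gap / 2.0
    labels = np.repeat([1.0, -1.0], n // 2)
    X = rng.standard_normal((p, n)) + mu[:, None] * labels[None, :]
    return X, labels


# tau_p = (2/p) tr C = 2 here, since C = I_p in this synthetic setting.
for p in (4, 400):
    X, _ = two_class_mixture(p, n=200)
    d = normalized_sq_distances(X)
    print(f"p={p:3d}  mean={d.mean():.2f}  std={d.std():.2f}")
```

For the larger dimension the distances should gather tightly around 2 for intra-class and inter-class pairs alike, whereas for the small dimension they remain widely spread.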
Our first result consists in empirically confirming that the concentration of distances phenomenon of Equation (1) is observed on modern document representations, as depicted in Figure 2. The $x_i$'s correspond to n = 1,100 balanced documents from two classes ("Christian" versus "Forsale") of the 20 Newsgroups database, obtained by selecting in each class the top 3,500 words according to their tf*idf scores (the idf being computed within the documents of the class), and encoded through (ii) tf*idf based weighted averages of the Glove embeddings of the words in the document, (iii) tf*idf based weighted averages of the word2vec embeddings of the words in the document, or merely through their (iv) tf*idf vectors. The labels (ii) to (iv) follow the panels of Figure 2, panel (i) being a synthetic two-class Gaussian mixture.
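To make the document encoding concrete, here is a minimal sketch of a tf*idf-weighted average of word embeddings. The exact tf*idf variant and normalization used in the experiments are not specified in this section, so the choice idf(w) = log(N / df(w)), the function names, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_table(documents):
    """Inverse document frequency idf(w) = log(N / df(w)) over tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def document_vector(doc, idf, embeddings):
    """tf*idf-weighted average of the embeddings of the words of `doc`."""
    counts = Counter(word for word in doc if word in embeddings)
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    total_weight = 0.0
    for word, tf in counts.items():
        weight = tf * idf.get(word, 0.0)
        total_weight += weight
        for k, value in enumerate(embeddings[word]):
            vector[k] += weight * value
    if total_weight == 0.0:
        return vector  # no informative word: return the zero vector
    return [v / total_weight for v in vector]


# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"church": [1.0, 0.0], "price": [0.0, 1.0], "the": [0.5, 0.5]}
toy_documents = [["the", "church", "church"], ["the", "price"]]
toy_idf = idf_table(toy_documents)
print(document_vector(toy_documents[0], toy_idf, toy_embeddings))
```

In this toy example the word "the" appears in every document and therefore receives a zero idf weight, so each document vector reduces to the embedding of its only discriminative word.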
For comparison purposes, all datasets have been centered.
A first observation is that the distances between two-class distributions of both Glove and word2vec representations seemingly "concentrate around $\sqrt{2}$" instead of displaying a bi-modal distribution. Besides, and possibly more importantly, the distribution closely matches the distribution of distances obtained for mere large dimensional Gaussian random vectors. This "resemblance to large (rather than small) Gaussian vector behavior" provides a first hint of a behavior typical of concentrated random vectors. This conclusion does however not hold for tf*idf representations, the distance histogram of which is far from being symmetrically centered around $\sqrt{2}$, which is naturally explained by the sparse nature of the tf*idf vectors. Together, these results are a first indicator of a peculiar concentration behavior of "modern" vector representations for documents, as opposed to tf*idf vectors.
The above results are further supported by Figure 3, which is based on all the classes of the 20 Newsgroups dataset and is detailed in the next section.
From a practical standpoint though, the unimodal histograms of Figure 2 strongly suggest that "individual distance-based" document classification methods are likely to fail. The next section investigates this aspect by showing that more elaborate methods which treat data distances collectively rather than individually, such as spectral-based techniques, are better suited to document vector classification than individual distance-based techniques.

Kernel spectral behavior
A broad range of machine learning algorithms $F_\theta(X, x)$ are of the form $G_\theta(K, x)$ where $K \in \mathbb{R}^{n \times n}$ is a kernel matrix of the input data X (kernel spectral clustering, kernel SVM, graph kernel semi-supervised learning, etc.). Typically, following our distance-based development, $K = \{f(\frac{1}{p}\|x_i - x_j\|^2)\}_{i,j=1}^n$ for some smooth function f. Studying the statistical behavior of such algorithms, even under a mere Gaussian mixture model setting, has long remained an open problem, due to the non-linearity of f and to the intricate dependence between the entries of K. As a positive aftermath of the (a priori deleterious) concentration of distance phenomenon though, El Karoui et al. (2010) and Couillet et al. (2016) prove that, when $p, n \to \infty$, the matrix K is asymptotically well approximated by a form
$$K \simeq W + P \qquad (2)$$
where W is a non-informative full-rank noise matrix and P is a low-rank [7] informative matrix which carries in its few eigenvectors the information about (a) the k data classes only through the first and second order statistics $\{\mu_a\}_{a=1}^k$ and $\{C_a\}_{a=1}^k$ of the classes, and (b) the kernel function f only through its local behavior around the joint distance concentration point $\tau_p$. For instance, the popular radial-basis (RBF) kernel $f(t) = \exp(-\frac{t}{2\sigma^2})$ behaves theoretically the same as any other function (for instance a mere polynomial of order 2) having the same first two derivatives as f at $\tau_p$. This finding opens the perspective of improving kernel-based algorithms through a careful choice of the behavior of f around $\tau_p$.
Figure 2: Distribution of (centered and normalized) input data distances $\{\frac{1}{\sqrt{p}}\|x_i - x_j\|\}_{1 \le i \ne j \le n}$ for (i) a two-class Gaussian mixture with means $\pm\mu$ of size (i-a) p = 4 or (i-b) p = 400, and two-class documents (20 Newsgroups, "Christian" vs. "Forsale") with (ii) Glove (p = 300), (iii) word2vec (p = 300), or (iv) tf*idf (p = 3,500) representations. In blue are displayed the intra- and inter-class distance distributions and in red the collective distance distribution, as if all data were Gaussian and all distances were independent (which they are not) [6]. The dashed red line marks the $\sqrt{2}$ position (where distances theoretically concentrate).
[6] Exact calculus reveals that, for $x_i \sim \mathcal{N}(\mu_a, C_a)$ and $x_j \sim \mathcal{N}(\mu_b, C_b)$ under the aforementioned non-trivial regime, the distances fluctuate around $\tau_p = \frac{1}{p} \mathrm{tr}(C_a + C_b)$ here, the quantities appearing in the variance $\sigma^2_{a,b}$ being consistently estimated from $\frac{1}{p} \mathrm{tr}(\hat{C}_1 \hat{C}_2) = \frac{1}{p} \mathrm{tr}(C_1 C_2) + o_p(1)$ and $\frac{1}{p} \mathrm{tr}(\hat{C}^2) = \frac{1}{p} \mathrm{tr}(C^2) + \frac{1}{np} (\mathrm{tr}\, \hat{C})^2 + o_p(1)$, with n the number of independent samples used to evaluate the sample covariance matrix $\hat{C}$ of C.
One of the main consequences of the approximation (2) is the theoretical ability to anticipate the spectral behavior, and in particular to describe the statistics of the dominant eigenvectors [8] of K, thereby allowing for a theoretical prediction of the performances of spectral learning (e.g., spectral clustering, manifold learning, etc.). These results are again universal in that they only depend on the statistical means and covariances of the data classes; see Couillet et al. (2016) for details.
We wish here to demonstrate that kernel matrices built on natural language data similarly conform to the behavior of large dimensional Gaussian vectors. To this end, we use both the two-class data benchmark introduced in the previous section and the complete set of classes from 20 Newsgroups. We design a matrix K for the popular RBF kernel $f(t) = \exp(-t/2)$ (that is, with bandwidth $\sigma^2 = 1$) and extract its second dominant eigenvector $v_2$ [9]. This is depicted in Figure 3 and Figure 4, which are conveniently compared to Figure 2. It is first observed that, according to Figure 2 and as subsequently supported by Figure 3, the entries of K, i.e., $\exp(-\cdot/2)$ applied to the distances $\frac{1}{p}\|x_i - x_j\|^2$, are not discriminating (the distance distribution being unimodal in Figure 2 and the contrast between inner and outer class similarity being weak for Glove and word2vec in Figure 3). The entries of $v_2$ are instead strongly informative and the eigenvector distribution is bi-modal: this is in essence explained by a "redundancy" effect in the numerous data belonging to the same class which "gather energy" into an isolated eigenvalue with eigenvector $v_2$. This cumulative effect is not exploited by algorithms which treat data distances one-by-one (such as a kNN kernel with few neighbors) rather than collectively.
Figure 3: Display of the Gaussian kernel matrices for tf*idf, word2vec and Glove embeddings over the whole 20 Newsgroups database. Very low contrast is observed between inner and outer similarities, especially for Glove and word2vec, as a consequence of the distance concentration effect.
[7] Of rank usually equal to or bounded by the number of classes in the dataset.
[8] Those associated with the largest (or smallest) isolated eigenvalues of K.
[9] Which is known to be the best discriminating eigenvector in a two-class setting.
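The contrast between uninformative kernel entries and an informative eigenvector can be reproduced on synthetic data. The sketch below builds the RBF kernel matrix with $f(t) = \exp(-t/2)$ on a two-class Gaussian mixture and classifies by the sign of the second dominant eigenvector. The mixture parameters (identity covariance, means $\pm\mu$ with $\|\mu\| = 2$, p = n = 400) and all function names are assumptions chosen for illustration; they are not the settings of the reported experiments.

```python
import numpy as np


def rbf_kernel_matrix(X, sigma2=1.0):
    """K_ij = f(||x_i - x_j||^2 / p) with f(t) = exp(-t / (2 sigma2)), columns x_i of X."""
    p, _ = X.shape
    gram = X.T @ X / p
    sq_norms = np.diag(gram)
    D = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram, 0.0)
    return np.exp(-D / (2.0 * sigma2))


def second_dominant_eigenvector(K):
    """Eigenvector of the symmetric matrix K for its second largest eigenvalue."""
    _, eigenvectors = np.linalg.eigh(K)  # eigenvalues come in ascending order
    return eigenvectors[:, -2]


def spectral_accuracy(p=400, n=400, mean_norm=2.0, seed=0):
    """Fraction of points whose class is recovered by the sign of v_2."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(p)
    mu[0] = mean_norm
    labels = np.repeat([1.0, -1.0], n // 2)
    X = rng.standard_normal((p, n)) + mu[:, None] * labels[None, :]
    v2 = second_dominant_eigenvector(rbf_kernel_matrix(X))
    agreement = np.mean(np.sign(v2) == labels)
    return max(agreement, 1.0 - agreement)  # eigenvectors are defined up to sign


print(f"class recovery from sign(v_2): {spectral_accuracy():.2f}")
```

In this synthetic setting, individual pairwise distances barely differ between intra-class and inter-class pairs, yet the sign pattern of $v_2$ should recover most class labels, which is the collective effect discussed above.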
A second observation, more to the point for our present demonstration, is that the histogram of the entries of v 2 for genuine natural language data is a close match to the histogram of the synthetic Gaussian vector counterparts: this is a second manifestation of the universality of concentration of measure.
Remark 3 ("Behaving like" is not "being" a Gaussian). We wish to insist that this universality observation does not suggest that word and document vectors look like Gaussian vectors (this would be a mistake); it merely states that the observed functional of the learning data X (here the entries of an eigenvector of K) has the same asymptotic behavior as with Gaussian vector inputs.
These empirical results are strong indicators that natural language data representations may behave similarly to concentrated random vectors and may adequately be modelled as such. This implies that the curse of dimensionality, appearing here in the distance concentration phenomenon, is at play: as a main consequence, we expect many standard algorithms based on individual data distance evaluations to dramatically fail, whereas more elaborate techniques using spectral properties remain competitive and, in addition, are now amenable to theoretical analysis. The next section investigates this claim in the specific case of SVMs.
Figure 4: Histograms of the entries of the second dominant eigenvector $v_2$ of K. (Left) Real data (same setting as in Figure 2); (Right) Gaussian vectors with the same (empirically estimated) first and second order statistics as their left counterpart. In red are displayed the theoretical distributions under Gaussian data input (according to Couillet et al. (2016)).

Application to supervised learning
The concentration of measure phenomenon in real data (Equation (1)) has a fundamental advantage: the performance of many learning algorithms becomes predictable and, consequently, amenable to improvement. The results of the previous section therefore strongly suggest that, for the first time to the authors' knowledge, one can predict to some extent (so long as the non-trivial conditions are met for the processed data) the performance of a host of machine learning algorithms for natural language processing.
Specifically, we consider here the standard least-squares kernel support vector machine (LSSVM) classifier (used, e.g., in Mitra et al. (2007) for text classification with some refinement), with kernel $K = \{f(\frac{1}{p}\|x_i - x_j\|^2)\}_{i,j=1}^n$ for some function f to be specified. The LSSVM classifier allocates the class of a new datum x based on its position with respect to a hyperplane in kernel space designed from the training set X. Although not directly a spectral method (as is the unsupervised spectral clustering algorithm (Von Luxburg, 2007)), the LSSVM classifier inherently exploits the eigenspectrum of the kernel matrix K, and its performance is proved in Liao and Couillet (2019) to be asymptotically predictable (for large enough p, n) and in closed form (which is thus simpler than the margin-based SVM, whose asymptotic performances do not admit a closed form).
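Precisely, the class $\mathcal{C}_1$ or $\mathcal{C}_2$ allocated to x is the result of the binary test on the sign of the decision function $g(x) = \alpha^T k(x) + b$, where $\alpha = S^{-1}(y - b 1_n)$, $b = \frac{1_n^T S^{-1} y}{1_n^T S^{-1} 1_n}$ and $S = K + \frac{n}{\gamma} I_n$, for $y \in \{\pm 1\}^n$ the vector of training data labels, $k(x) = \{f(\frac{1}{p}\|x_i - x\|^2)\}_{1 \le i \le n}$, and regularization $\gamma > 0$.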
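The binary test above can be sketched in a few lines. The code below assumes the standard least-squares SVM solution, in which $b = 1_n^T S^{-1} y / 1_n^T S^{-1} 1_n$ and the dual vector is $S^{-1}(y - b 1_n)$ with $S = K + \frac{n}{\gamma} I_n$; the function names, the sign convention (positive score for label +1), and the default $\gamma$ are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def kernel_values(A, B, f):
    """Matrix of f(||a_i - b_j||^2 / p) for the columns a_i of A and b_j of B."""
    p = A.shape[0]
    sq = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return f(np.maximum(sq, 0.0) / p)


def lssvm_fit(X, y, f, gamma=1.0):
    """Return (alpha, b) with S = K + (n/gamma) I, b = 1'S^-1 y / 1'S^-1 1, alpha = S^-1 (y - b 1)."""
    n = X.shape[1]
    S = kernel_values(X, X, f) + (n / gamma) * np.eye(n)
    S_inv_y = np.linalg.solve(S, y)
    S_inv_1 = np.linalg.solve(S, np.ones(n))
    b = S_inv_y.sum() / S_inv_1.sum()
    alpha = S_inv_y - b * S_inv_1
    return alpha, b


def lssvm_decision(X, alpha, b, f, X_test):
    """Scores alpha' k(x) + b for each column x of X_test; the class is read off the sign."""
    return kernel_values(X, X_test, f).T @ alpha + b
```

A typical call would pass `f = lambda t: np.exp(-t / 2.0)` for the RBF kernel with unit bandwidth, fit on a training matrix whose columns are the document vectors, and threshold the returned scores at zero.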
In Liao and Couillet (2019), the authors precisely show that, for a two-class mixture of concentrated random vectors with means $\mu_1, \mu_2$ and covariances $C_1, C_2$, as $n, p \to \infty$ in the non-trivial regime described above, for x genuinely in class $\mathcal{C}_i$, the decision statistic of the binary test asymptotically behaves as a Gaussian random variable of mean $m_i$ and variance $\sigma_i^2$, where $m_1$, $m_2$, $\sigma_1^2$ and $\sigma_2^2$ only depend on (a) the ratio $f'(\tau_p)/f''(\tau_p)$ and (b) scalar functionals of the statistical means and covariances (specifically, only $\|\mu_1 - \mu_2\|^2$, $\mathrm{tr}(C_1 - C_2)/\sqrt{p}$ and $\mathrm{tr}((C_1 - C_2)^2)/p$); see Liao and Couillet (2019) for details. For instance, $f(t) = \exp(-t/2\sigma^2)$ is the standard radial-basis function (RBF) kernel with bandwidth $\sigma^2$, the asymptotic performances of which only depend on $f'(\tau_p)/f''(\tau_p) = -2\sigma^2$. Of utmost relevance here is that the asymptotic performances are identical for concentrated random vectors and for Gaussian random vectors having the same first and second order statistics. Figure 5 reports the performances of LSSVM as a function of the ratio $f'(\hat{\tau}_p)/f''(\hat{\tau}_p)$, where $\hat{\tau}_p \equiv \frac{1}{n(n-1)} \sum_{1 \le i \ne j \le n} \frac{1}{p}\|x_i - x_j\|^2$ is a consistent (and fast converging) estimate for $\tau_p$, here for two kernels: (a) the second order polynomial kernel such that $f(\hat{\tau}_p) = 4$, $f''(\hat{\tau}_p) = 1$ and $f'(\hat{\tau}_p)$ varying from −2 to 1, and (b) the RBF kernel with bandwidth $\sigma^2$ such that $-2\sigma^2$ varies from −2 to 0 (of course $-2\sigma^2$ cannot be positive).
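The value $-2\sigma^2$ quoted for the RBF kernel follows from a short computation, spelled out here as a reading aid. It assumes, consistently with the stated value, that the ratio in question is that of the first to the second derivative of f at $\tau_p$; the polynomial parametrization in the closing comment uses our own illustrative coefficients $a_0, a_1, a_2$.

```latex
\begin{align*}
f(t) &= \exp\!\left(-\frac{t}{2\sigma^2}\right), \\
f'(t) &= -\frac{1}{2\sigma^2}\, f(t), \qquad
f''(t) = \frac{1}{4\sigma^4}\, f(t), \\
\frac{f'(\tau_p)}{f''(\tau_p)} &= -\frac{1}{2\sigma^2} \cdot 4\sigma^4 = -2\sigma^2 .
\end{align*}
% A second-order polynomial kernel can be written around \tau_p as
%   f(t) = a_0 + a_1 (t - \tau_p) + \tfrac{a_2}{2} (t - \tau_p)^2,
% so that f(\tau_p) = a_0, f'(\tau_p) = a_1 and f''(\tau_p) = a_2 can be set freely.
```

This makes explicit why the RBF kernel can only reach non-positive values of the ratio, whereas a polynomial kernel can be tuned to either sign.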
The benchmark datasets are the Yahoo Answers classes "cult" versus "education", the feature vectors of which are either (ii) tf*idf based weighted averages of Glove embeddings (p = 300), (iii) tf*idf based weighted averages of word2vec embeddings (p = 300), or (iv) tf*idf representations (with dictionary size p = 3,000). A comparison to (i) Gaussian input data vectors is also provided for reference (p = 300). In each experiment, the number of training data is n = 500 or n = 2,000. Figure 5 first shows a trend for the performances to converge, as the results for both n = 500 and n = 2,000 are similar: as such, these performances are not random and are thus possibly amenable to theoretical analysis.
In more detail, Figure 5 demonstrates that, for the tf*idf representation, the theoretical equivalent for concentrated vectors (red) and the empirical performance (blue) are quite different, clearly confirming that tf*idf representations are not appropriately modelled by concentrated vectors. This is again no surprise as these vectors are intrinsically sparse, which concentrated vectors cannot be.
The case of word2vec and Glove is more interesting as Figure 5 reports an extremely accurate fit between theory and practice for $f'(\tau_p)/f''(\tau_p)$ below −1 and above 0.5. More crucially, in these regions, the performances for both the RBF kernel and the polynomial kernel with $f'(\tau_p)/f''(\tau_p) = -2\sigma^2$ perfectly coincide, so that real data performance corroborates the theory. Only the region [−1, 0.5] shows a severe discrepancy. This is explained by two factors: (a) for any kernel (here for the polynomial kernel), Liao and Couillet (2019) show that the region where $f'(\tau_p) \simeq 0$ is particularly unstable to "strongly mean-discriminative data", i.e., data mixtures strongly identifiable from their statistical means; this is what is observed here with a vanishing performance (dropping to 50%) when $f'(\tau_p) = 0$, inducing instability; at this point of our analysis though, we cannot explain the performance increase near $0^-$ predicted by the theory while the empirical performance monotonically drops; (b) for the RBF kernel, in the vicinity of $\sigma^2 \simeq 0$, the entries of K degenerate; K becomes sparse, which goes against concentration; this is already observed for Gaussian inputs (top display); this gap can only be covered with larger p, n values.

Concluding Remarks
The results of this article may scratch the surface of a new mathematical theory for harnessing modern natural language processing representations: recent word and document features (word2vec and Glove) were shown here to exhibit some key characteristics of concentrated random vectors, which tf*idf maps do not. This, as a consequence of recent works on the analysis of machine learning algorithms for concentrated random vectors, opens the path to theoretical analyses, improved understanding, and fine-tuned and new algorithms for natural language data processing. Yet, the preliminary conclusions of the present article are less compelling than similar conclusions drawn for image representations (e.g., in Liao and Couillet (2019) and Seddik et al. (2020), where the performance predictions on real images are extremely accurate for wide ranges of hyperparameters). This may be interpreted in two ways: either the document representations (Glove and word2vec) need to be perfected to be as discriminative and "maximum entropic" as VGG or ResNet are for images, or the concentration power of document embeddings is intrinsically weaker than that of image embeddings. If the latter hypothesis is correct, further mathematical efforts are needed to improve our understanding of these "weakly concentrated" data models. We plan to investigate these points in the future.
Figure 5: Empirical results (blue) versus asymptotic theory (red), for n = 500 (thick) or n = 2,000 (light), with either RBF (plain) or second-order polynomial (dashed) kernel. Good fit for word2vec and Glove embeddings away from the instability region of $f'(\tau_p)/f''(\tau_p)$, suggesting concentration (Gaussian-like) behavior; tf*idf data do not concentrate.