Word Mover’s Embedding: From Word2Vec to Document Embedding

While the celebrated Word2Vec technique yields semantically rich representations for individual words, there has been relatively less success in extending it to generate unsupervised sentence or document embeddings. Recent work has demonstrated that a distance measure between documents called Word Mover’s Distance (WMD), which aligns semantically similar words, yields unprecedented KNN classification accuracy. However, WMD is expensive to compute, and it is hard to extend its use beyond a KNN classifier. In this paper, we propose the Word Mover’s Embedding (WME), a novel approach to building an unsupervised document (sentence) embedding from pre-trained word embeddings. In our experiments on 9 benchmark text classification datasets and 22 textual similarity tasks, the proposed technique consistently matches or outperforms state-of-the-art techniques, with significantly higher accuracy on problems of short length.


Introduction
Text representation plays an important role in many NLP-based tasks such as document classification and clustering (Zhang et al., 2018; Gui et al., 2016, 2014), sense disambiguation (Gong et al., 2017, 2018a), machine translation (Mikolov et al., 2013b), document matching (Pham et al., 2015), and sequential alignment (Peng et al., 2016, 2015). Since there are no explicit features in text, much work has aimed to develop effective text representations. Among them, the simplest bag of words (BOW) approach (Salton and Buckley, 1988) and its term frequency variants (e.g. TF-IDF) (Robertson and Walker, 1994) are most widely used due to simplicity, efficiency and often surprisingly high accuracy (Wang and Manning, 2012). However, simply treating words and phrases as discrete symbols fails to take into account word order and the semantics of the words, and suffers from frequent near-orthogonality due to its high dimensional sparse representation. To overcome these limitations, Latent Semantic Indexing (Deerwester et al., 1990) and Latent Dirichlet Allocation (Blei et al., 2003) were developed to extract more meaningful representations through singular value decomposition (Wu and Stathopoulos, 2015) and learning a probabilistic BOW representation.
A recent empirically successful body of research makes use of distributional or contextual information together with simple neural-network models to obtain vector-space representations of words and phrases (Bengio et al., 2003; Mikolov et al., 2013a,c; Pennington et al., 2014). A number of researchers have proposed extensions of these towards learning semantic vector-space representations of sentences or documents. A simple but often effective approach is to use a weighted average over some or all of the embeddings of words in the document. While this is simple, important information could easily be lost in such a document representation, in part since it does not consider word order. A more sophisticated approach (Le and Mikolov, 2014; Chen, 2017) has focused on jointly learning embeddings for both words and paragraphs using models similar to Word2Vec. However, these only use word order within a small context window; moreover, the quality of word embeddings learned in such a model may be limited by the size of the training corpus, which cannot scale to the large sizes used in the simpler word embedding models, and which may consequently weaken the quality of the document embeddings.
Recently, Kusner et al. (2015) presented a novel document distance metric, Word Mover's Distance (WMD), that measures the dissimilarity between two text documents in the Word2Vec embedding space. Despite its state-of-the-art KNN-based classification accuracy over other methods, combining KNN and WMD incurs very high computational cost. More importantly, WMD is simply a distance that can only be combined with KNN or K-means, whereas many machine learning algorithms require a fixed-length feature representation as input.
A recent work on building kernels from distance measures, D2KE (distances to kernels and embeddings) (Wu et al., 2018a), proposes a general methodology for deriving a positive-definite kernel from a given distance function. It enjoys better theoretical guarantees than other distance-based methods, such as k-nearest neighbor and the distance substitution kernel (Haasdonk and Bahlmann, 2004), and has also been demonstrated to have strong empirical performance in the time-series domain (Wu et al., 2018b).
In this paper, we build on this recent innovation, D2KE (Wu et al., 2018a), and present the Word Mover's Embedding (WME), an unsupervised generic framework that learns continuous vector representations for text of variable lengths such as a sentence, paragraph, or document. In particular, we propose a new approach to first construct a positive-definite Word Mover's Kernel via an infinite-dimensional feature map given by the Word Mover's Distance (WMD) to random documents from a given distribution. Due to its use of the WMD, the feature map takes into account alignments of individual words between the documents in the semantic space given by the pre-trained word embeddings. Based on this kernel, we can then derive a document embedding via a Random Features approximation of the kernel, whose inner products approximate exact kernel computations. Our technique extends the theory of Random Features to show convergence of the inner product between WMEs to a positive-definite kernel that can be interpreted as a soft version of (inverse) WMD.
The proposed embedding is more efficient and flexible than WMD in many situations. As an example, WME with a simple linear classifier reduces the computational cost of WMD-based KNN from cubic to linear in document length and from quadratic to linear in number of samples, while simultaneously improving accuracy. WME is extremely easy to implement, fully parallelizable, and highly extensible, since its two building blocks, Word2Vec and WMD, can be replaced by other techniques such as GloVe (Pennington et al., 2014; Wieting et al., 2015b) or S-WMD (Huang et al., 2016). We evaluate WME on 9 real-world text classification tasks and 22 textual similarity tasks, and demonstrate that it consistently matches or outperforms other state-of-the-art techniques. Moreover, WME often achieves orders of magnitude speed-up compared to KNN-WMD while obtaining the same testing accuracy. Our code and data are available at https://github.com/IBM/WordMoversEmbeddings.

Word2Vec and Word Mover's Distance
We briefly introduce Word2Vec and WMD, which are the key building blocks of our proposed method.
We first fix the notation used throughout the paper. Given a total number of documents N with a vocabulary $\mathcal{W}$ of size $|\mathcal{W}| = n$, the Word2Vec embedding gives us a d-dimensional vector space $\mathcal{V} \subseteq \mathbb{R}^d$ such that any word in the vocabulary set $w \in \mathcal{W}$ is associated with a semantically rich vector representation $v_w \in \mathbb{R}^d$. In this work, we consider each document as a collection of word vectors $x := (v_j)_{j=1}^{L}$ and denote by $\mathcal{X} := \bigcup_{L=1}^{L_{\max}} \mathcal{V}^L$ the space of documents.
Word2Vec. In the celebrated Word2Vec approach (Mikolov et al., 2013a,c), two shallow yet effective models are used to learn vector-space representations of words (and phrases), by mapping those that co-occur frequently, and consequently with plausibly similar meaning, to nearby vectors in the embedding vector space. Due to the model's simplicity and scalability, high-quality word embeddings can be generated to capture a large number of precise syntactic and semantic word relationships by training over hundreds of billions of words and millions of named entities. The advantage of document representations built on top of word-level embeddings is that one can make full use of high-quality pre-trained word embeddings. Throughout this paper we use Word2Vec as our first building block, but other (unsupervised or supervised) word embeddings (Pennington et al., 2014; Wieting et al., 2015b) could also be utilized.
Word Mover's Distance. Word Mover's Distance was introduced by Kusner et al. (2015) as a special case of the Earth Mover's Distance (Rubner et al., 2000), which can be computed as a solution of the well-known transportation problem (Hitchcock, 1941; Altschuler et al., 2017). WMD is a distance between two text documents $x, y \in \mathcal{X}$ that takes into account the alignments between words. Let $|x|, |y|$ be the number of distinct words in x and y. Let $f_x \in \mathbb{R}^{|x|}$, $f_y \in \mathbb{R}^{|y|}$ denote the normalized frequency vectors of each word in the documents x and y respectively (so that $f_x^T \mathbf{1} = f_y^T \mathbf{1} = 1$). Then the WMD distance between documents x and y is defined as:
$$\mathrm{WMD}(x, y) := \min_{F \in \mathbb{R}_+^{|x| \times |y|}} \langle C, F \rangle \quad \text{s.t.} \quad F\mathbf{1} = f_x, \;\; F^T\mathbf{1} = f_y, \tag{1}$$
where F is the transportation flow matrix with $F_{ij}$ denoting the amount of flow traveling from the i-th word $x_i$ in x to the j-th word $y_j$ in y, and C is the transportation cost with $C_{ij} := \mathrm{dist}(v_{x_i}, v_{y_j})$ being the distance between two words measured in the Word2Vec embedding space. A popular choice is the Euclidean distance $\mathrm{dist}(v_{x_i}, v_{y_j}) = \|v_{x_i} - v_{y_j}\|_2$. When $\mathrm{dist}(v_{x_i}, v_{y_j})$ is a metric, the WMD distance in Eq. (1) also qualifies as a metric, and in particular, satisfies the triangle inequality (Rubner et al., 2000). Building on top of Word2Vec, WMD is a particularly useful and accurate measure of the distance between documents with semantically close but syntactically different words, as illustrated in Figure 1(a).
The WMD distance when coupled with KNN has been observed to have strong performance in classification tasks (Kusner et al., 2015). However, WMD is expensive to compute, with computational complexity of $O(L^3 \log L)$, especially for long documents where L is large. Additionally, since WMD is just a document distance, rather than a document representation, using it within KNN incurs even higher computational costs of $O(N^2 L^3 \log L)$.
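The transportation problem behind WMD can be written down directly as a linear program. The following is a minimal sketch, assuming NumPy and SciPy are available and using a generic LP solver (`scipy.optimize.linprog`) rather than the specialized C/Mex solver the authors mention; the function name `wmd` and the toy vectors are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog


def wmd(X, Y, fx, fy):
    """Word Mover's Distance between two documents (illustrative sketch).

    X: (m, d) array of word vectors for the distinct words of document x.
    Y: (n, d) array of word vectors for the distinct words of document y.
    fx, fy: normalized word-frequency vectors (each sums to 1).
    """
    m, n = len(X), len(Y)
    # Ground cost C[i, j] = Euclidean distance between word i of x and word j of y.
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

    # The flow F is flattened row-major, so F[i, j] sits at index i * n + j.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row sums:    F 1   = fx
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # column sums: F^T 1 = fy
    b_eq = np.concatenate([fx, fy])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun


# Toy 2-d "word vectors" (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 1.0]])
fx = np.array([0.5, 0.5])
fy = np.array([1.0])
dist_xy = wmd(X, Y, fx, fy)   # all mass must flow to the single word of y
dist_xx = wmd(X, X, fx, fx)   # a document is at distance 0 from itself
```

A dense LP like this is only practical for short documents; it is meant to make the constraints of Eq. (1) concrete, not to match the runtime figures discussed in the text.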
3 Document Embedding via Word Mover's Kernel
In this section, we extend the framework in (Wu et al., 2018a) to derive a positive-definite kernel from an alignment-aware document distance metric, which then gives us unsupervised semantic embeddings of texts of variable length as a by-product through the theory of Random Features approximation (Rahimi and Recht, 2007).

Word Mover's Kernel
We start by defining the Word Mover's Kernel:
$$k(x, y) := \int p(\omega)\, \phi_\omega(x)\, \phi_\omega(y)\, d\omega, \quad \text{where } \phi_\omega(x) := \exp(-\gamma\, \mathrm{WMD}(x, \omega)), \tag{2}$$
where ω can be interpreted as a random document $\{v_j\}_{j=1}^{D}$ that contains a collection of random word vectors in $\mathcal{V}$, and p(ω) is a distribution over the space of all possible random documents $\Omega := \bigcup_{D=1}^{D_{\max}} \mathcal{V}^D$. $\phi_\omega(x)$ is a possibly infinite-dimensional feature map derived from the WMD between x and all possible documents ω ∈ Ω.
An insightful interpretation of this kernel (2) is
$$k(x, y) = \exp\left(-\gamma\, \mathrm{softmin}_{p(\omega)}\{f(\omega)\}\right),$$
where
$$\mathrm{softmin}_{p(\omega)}\{f(\omega)\} := -\frac{1}{\gamma} \log \int p(\omega)\, e^{-\gamma f(\omega)}\, d\omega,$$
with $f(\omega) = \mathrm{WMD}(x, \omega) + \mathrm{WMD}(\omega, y)$, is a version of the soft minimum function parameterized by p(ω) and γ. Comparing this with the usual definition of soft minimum, $\mathrm{softmin}_i f_i := -\mathrm{softmax}_i(-f_i) = -\log \sum_i e^{-f_i}$, it can be seen that the soft-min variant in the above equations uses a weighting of the objects ω via the probability density p(ω), and moreover has the additional parameter γ to control the degree of smoothness. When γ is large and f(ω) is Lipschitz-continuous, the value of the soft-min variant is mostly determined by the minimum of f(ω).
Note that since WMD is a metric, by the triangle inequality we have
$$\min_{\omega \in \Omega} \left( \mathrm{WMD}(x, \omega) + \mathrm{WMD}(\omega, y) \right) \ge \mathrm{WMD}(x, y),$$
and the equality holds if we allow the length of random document $D_{\max}$ to be not smaller than L. Therefore, the kernel (2) serves as a good approximation to the WMD between any pair of documents x, y, as illustrated in Figure 1(b), while it is positive-definite by definition.
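One consequence of the triangle inequality can be spelled out as a short worked bound. This is a sketch that assumes the kernel is the expectation of $\phi_\omega(x)\phi_\omega(y)$ under p(ω), with $\phi_\omega(x) = \exp(-\gamma\,\mathrm{WMD}(x,\omega))$ as defined above; it shows that the kernel value never exceeds $\exp(-\gamma\,\mathrm{WMD}(x,y))$.

```latex
\begin{align*}
k(x,y) &= \int p(\omega)\, e^{-\gamma\,\mathrm{WMD}(x,\omega)}\, e^{-\gamma\,\mathrm{WMD}(\omega,y)}\, d\omega
        = \int p(\omega)\, e^{-\gamma f(\omega)}\, d\omega, \\
f(\omega) &= \mathrm{WMD}(x,\omega) + \mathrm{WMD}(\omega,y) \;\ge\; \mathrm{WMD}(x,y)
        \quad \text{for every } \omega \in \Omega, \\
\Rightarrow\quad k(x,y) &\le e^{-\gamma\,\mathrm{WMD}(x,y)} \int p(\omega)\, d\omega
        \;=\; e^{-\gamma\,\mathrm{WMD}(x,y)} .
\end{align*}
```

How close the kernel gets to this upper bound depends on how much probability mass p(ω) places near documents where f(ω) attains its minimum, and on γ.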

Word Mover's Embedding
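Given the Word Mover's Kernel in Eq. (2), we can then use the Monte-Carlo approximation:
$$k(x, y) \approx \langle Z(x), Z(y) \rangle = \frac{1}{R} \sum_{i=1}^{R} \phi_{\omega_i}(x)\, \phi_{\omega_i}(y), \tag{3}$$
where $\{\omega_i\}_{i=1}^{R}$ are i.i.d. random documents drawn from p(ω), and $Z(x) := \left( \frac{1}{\sqrt{R}} \phi_{\omega_i}(x) \right)_{i=1}^{R}$ gives a vector representation of document x. We call this random approximation Word Mover's Embedding. Later, we show that this Random Features approximation in Eq. (3) converges to the exact kernel (2) uniformly over all pairs of documents (x, y).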

Distribution p(ω).
A key ingredient in the Word Mover's Kernel and Embedding is the distribution p(ω) over random documents. Note that ω ∈ X consists of sets of words, each of which lies in the Word2Vec embedding space; the characteristics of this space need to be captured by p(ω) in order to generate (sets of) "meaningful" random words. Several studies have found that the word vectors v are roughly uniformly dispersed in the word embedding space (Arora et al., 2016, 2017). This is also consistent with our empirical finding that the uniform distribution centered at the mean of all word vectors in the documents is generally applicable for various text corpora. Thus, if d is the dimensionality of the pre-trained word embedding space, we can draw a random word $u \in \mathbb{R}^d$ as $u_j \sim \mathrm{Uniform}[v_{\min}, v_{\max}]$, for $j = 1, \ldots, d$, where $v_{\min}$ and $v_{\max}$ are some constants.
Given a distribution over random words, the remaining ingredient is the length D of random documents. It is desirable to set this to a small number, in part because this length is indicative of the number of hidden global topics, and we expect the number of such global topics to be small. In particular, these global topics will allow short random documents to align with the documents to obtain "topic-based" discriminatory features. Since there is no prior information for global topics, we choose to uniformly sample the length of random documents as $D \sim \mathrm{Uniform}[1, D_{\max}]$, for some constant $D_{\max}$. Stitching together the distributions over words and over the number of words, we then get a distribution over random documents. We note that our WME embedding potentially allows other random distributions and other types of word embeddings, making it a flexible and powerful feature learning framework that can utilize state-of-the-art techniques.
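The two sampling steps described above (a uniform document length, then uniform word coordinates) can be sketched in a few lines. This is an illustrative sketch using only the Python standard library; the function name `sample_random_document` and the values of `v_min`, `v_max`, `d`, and `d_max` are hypothetical placeholders, not settings from the paper.

```python
import random


def sample_random_document(d, v_min, v_max, d_max, rng):
    """Draw one random document: a list of D random d-dimensional word vectors.

    D is drawn uniformly from {1, ..., d_max}; every coordinate of every
    random word is drawn independently from Uniform[v_min, v_max].
    """
    length = rng.randint(1, d_max)  # randint is inclusive on both ends
    return [[rng.uniform(v_min, v_max) for _ in range(d)] for _ in range(length)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
omegas = [
    sample_random_document(d=3, v_min=-0.5, v_max=0.5, d_max=6, rng=rng)
    for _ in range(8)
]
```

In practice `v_min` and `v_max` would be computed from the pre-trained word vectors of the corpus, as the text describes.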
Algorithm 1 Word Mover's Embedding: An Unsupervised Feature Representation for Documents
Input: Texts $\{x_i\}_{i=1}^{N}$, $D_{\max}$, R.
Output: Matrix $Z_{N \times R}$, with rows corresponding to text embeddings.
1: Compute $v_{\max}$ and $v_{\min}$ as the maximum and minimum values, over all coordinates of the word vectors v of $\{x_i\}_{i=1}^{N}$, from any pre-trained word embeddings (e.g. Word2Vec, GloVe or PSL999).
2: for j = 1, ..., R do
3: Draw $D_j \sim \mathrm{Uniform}[1, D_{\max}]$.
4: Generate a random document $\omega_j$ consisting of $D_j$ random words drawn as $\omega_{j\ell} \sim \mathrm{Uniform}[v_{\min}, v_{\max}]^d$, $\ell = 1, \ldots, D_j$.
5: Compute $f_{x_i}$ and $f_{\omega_j}$ using a popular weighting scheme (e.g. NBOW or TF-IDF).
6: Compute the WME feature vector $Z_j = \phi_{\omega_j}(\{x_i\}_{i=1}^{N})$ using WMD in Eq. (2).
7: end for
8: Return $Z(\{x_i\}_{i=1}^{N}) = \frac{1}{\sqrt{R}} [Z_1 \; Z_2 \; \ldots \; Z_R]$.
Algorithm 1 summarizes the overall procedure to generate feature vectors for text of any length such as sentences, paragraphs, and documents. KNN-WMD, which uses the WMD distance together with KNN-based classification, requires $O(N^2)$ evaluations of the WMD distance, which in turn has $O(L^3 \log L)$ complexity, assuming that documents have lengths bounded by L, leading to an overall complexity of $O(N^2 L^3 \log L)$. In contrast, our WME approximation only requires super-linear complexity of $O(NRL \log L)$ when D is constant. This is because in our case each evaluation of WMD only requires $O(D^2 L \log L)$ (Bourgeois and Lassalle, 1971), due to the short length D of our random documents. This dramatic reduction in computation significantly accelerates training and testing when combined with empirical risk minimization classifiers such as SVMs. A simple yet useful trick is to pre-compute the word distances to avoid redundant computations, since a pair of words may appear multiple times in different pairs of documents. Note that the computation of the ground distance between each pair of word vectors in documents has $O(L^2 d)$ complexity, which could be close to one WMD evaluation if the document length L is short and the word vector dimension d is large. This simple scheme leads to additional improvement in the runtime performance of our WME method, as we show in our experiments.
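The overall procedure of Algorithm 1 can be sketched end to end under one simplifying assumption: here $D_{\max} = 1$, so every random document is a single random word and WMD has a closed form (all of a document's mass flows to that one word). The names `wmd_to_single_word` and `wme_embed`, the toy 2-d vectors, and `gamma=1.0` are illustrative choices, not the authors' implementation; a general WMD solver could be passed as `wmd_fn` instead.

```python
import math
import random


def wmd_to_single_word(doc, weights, u):
    """Exact WMD when the random document holds one word u: all mass flows to u."""
    return sum(w * math.dist(v, u) for v, w in zip(doc, weights))


def wme_embed(docs, weights, omegas, gamma, wmd_fn):
    """Return the N x R matrix with entries exp(-gamma * WMD(x_i, omega_j)) / sqrt(R)."""
    R = len(omegas)
    scale = 1.0 / math.sqrt(R)
    return [
        [scale * math.exp(-gamma * wmd_fn(x, f, omega)) for omega in omegas]
        for x, f in zip(docs, weights)
    ]


# Toy 2-d "word vectors" (made up); each document uses uniform NBOW-style weights.
docs = [
    [[0.1, 0.2], [0.3, -0.1]],
    [[0.1, 0.25], [0.28, -0.1], [0.0, 0.0]],
    [[-0.4, 0.4]],
]
weights = [[1.0 / len(x)] * len(x) for x in docs]

# Step 1: v_min and v_max over all coordinates of all word vectors.
coords = [c for x in docs for v in x for c in v]
v_min, v_max = min(coords), max(coords)

# Steps 2-4 with D_max = 1: each random document is one random word.
rng = random.Random(0)
R = 256
omegas = [[rng.uniform(v_min, v_max) for _ in range(2)] for _ in range(R)]

# Steps 5-8: feature matrix Z; inner products of rows approximate the kernel.
Z = wme_embed(docs, weights, omegas, gamma=1.0, wmd_fn=wmd_to_single_word)
kernel_01 = sum(a * b for a, b in zip(Z[0], Z[1]))
kernel_02 = sum(a * b for a, b in zip(Z[0], Z[2]))
```

The rows of `Z` are fixed-length vectors, so they can be fed to any linear classifier; the R columns are independent of one another and could be computed in parallel.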

Convergence of WME
In this section, we study the convergence of our embedding (3) to the exact kernel (2) under the framework of Random Features (RF) approximation (Rahimi and Recht, 2007). Note that the standard RF convergence theory applies only to shift-invariant kernels operating on two vectors, while our kernel (2) operates on two documents $x, y \in \mathcal{X}$ that are sets of word vectors. In (Wu et al., 2018a), a general RF convergence theory is provided for any distance-based kernel as long as a finite covering number is given w.r.t. the given distance. In the following lemma, we provide the covering number for all documents of bounded length under the Word Mover's Distance. Without loss of generality, we will assume that the word embeddings {v} are normalized s.t. $\|v\| \le 1$.
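Lemma 1. There exists an ε-covering $\mathcal{E}$ of $\mathcal{X}$ under the WMD metric with Euclidean ground distance, so that
$$\forall x \in \mathcal{X}, \; \exists x_i \in \mathcal{E}: \; \mathrm{WMD}(x, x_i) \le \epsilon,$$
that has size bounded as $|\mathcal{E}| \le (2/\epsilon)^{dL}$, where L is a bound on the length of document $x \in \mathcal{X}$.
Equipped with Lemma 1, we can derive the following convergence result as a simple corollary of the theoretical results in (Wu et al., 2018a). We defer the proof to Appendix A.
Theorem 1. Let $\Delta_R(x, y)$ be the difference between the exact kernel (2) and the random approximation (3) with R samples. We have uniform convergence
$$P\left\{ \max_{x, y \in \mathcal{X}} |\Delta_R(x, y)| > 2t \right\} \le 2 \left( \frac{12\gamma}{t} \right)^{2dL} e^{-Rt^2/2},$$
where d is the dimension of the word embedding and L is a bound on the document length. In other words, to guarantee $|\Delta_R(x, y)| \le \epsilon$ with probability at least 1 − δ, it suffices to have
$$R = \Omega\left( \frac{dL}{\epsilon^2} \log\frac{\gamma}{\epsilon} + \frac{1}{\epsilon^2} \log\frac{1}{\delta} \right).$$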
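To get a feel for how the number of Random Features R relates to the approximation error, the sketch below evaluates a simpler, pointwise guarantee. It is not the uniform bound of Theorem 1: it only uses the standard Hoeffding inequality for a single fixed pair (x, y), which applies because each term $\phi_\omega(x)\phi_\omega(y)$ lies in (0, 1]. The function name `pointwise_features_needed` and the example values of `t` and `delta` are illustrative.

```python
import math


def pointwise_features_needed(t, delta):
    """Smallest R with 2 * exp(-2 * R * t**2) <= delta.

    Hoeffding's inequality for an average of R i.i.d. terms in [0, 1]:
    P(|average - expectation| >= t) <= 2 * exp(-2 * R * t**2).
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * t * t))


# Error tolerance t = 0.05 with failure probability delta = 0.01, one fixed pair.
R_needed = pointwise_features_needed(t=0.05, delta=0.01)
```

A uniform guarantee over all document pairs needs more features than this pointwise count, with the extra cost governed by the covering number of the document space.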

Experiments
We conduct an extensive set of experiments to demonstrate the effectiveness and efficiency of the proposed method. We first compare its performance against 7 unsupervised document embedding approaches over a wide range of text classification tasks, including sentiment analysis, news categorization, Amazon reviews, and recipe identification. We use 9 different document corpora, with 8 of these drawn from (Kusner et al., 2015; Huang et al., 2016); Table 1 provides statistics of the different datasets. We further compare our method against 10 unsupervised, semi-supervised, and supervised document embedding approaches on the 22 datasets from SemEval semantic textual similarity tasks. Our code is implemented in Matlab, and we use C Mex for the computationally intensive components of WMD (Rubner et al., 2000).
Effects of R. We investigate how the performance changes when varying the number of Random Features R from 4 to 4096 with fixed D. Fig. 2 shows that both training and testing accuracies generally converge very fast when increasing R from a small number (R = 4) to a relatively large number (R = 1024), and then gradually reach the optimal performance. This confirms our analysis in Theorem 1 that the proposed WME can guarantee fast convergence to the exact kernel.
Effects of D. We further evaluate the training and testing accuracies when varying the length of random document D with fixed R. As shown in Fig. 3, near-peak performance can usually be achieved when D is small (typically D ≤ 6). This behavior illustrates two important aspects: (1) using very few random words (e.g. D = 1) is not enough to generate useful Random Features when R becomes large; (2) using too many random words (e.g. D ≥ 10) tends to generate similar and redundant Random Features when increasing R.
Conceptually, the number of random words in a random document can be thought of as the number of the global topics in documents, which is generally small.This is an important desired feature that confers both a performance boost as well as computational efficiency to the WME method.

Comparison with KNN-WMD
Baselines. We now compare two WMD-based methods in terms of testing accuracy and total training and testing runtime. We consider two variants of WME with different sizes of R. WME(LR) stands for WME with large rank that achieves the best accuracy (using R up to 4096) with more computational time, while WME(SR) stands for WME with small rank that obtains comparable accuracy in less time. We also consider two variants of both methods where +P denotes that we precompute the ground distance between each pair of words to avoid redundant computations.
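The idea behind the +P variants, computing each ground distance once and reusing it, can be illustrated with a small cache. This is a sketch only: the toy vocabulary, its 2-d vectors, and the helper names `ground_distance` and `cost_matrix` are hypothetical, and the authors' Matlab/C Mex implementation may organize the precomputation differently (for WME, the second argument would be a random word rather than a vocabulary word).

```python
import math
from functools import lru_cache

# Hypothetical toy vocabulary: word -> embedding vector.
embeddings = {
    "obama": (0.9, 0.1),
    "president": (0.8, 0.2),
    "speaks": (0.1, 0.9),
    "greets": (0.2, 0.8),
}

calls = 0  # counts how many distances are actually computed


@lru_cache(maxsize=None)
def ground_distance(w1, w2):
    """Euclidean distance between two words' vectors, computed once per ordered pair."""
    global calls
    calls += 1
    return math.dist(embeddings[w1], embeddings[w2])


def cost_matrix(doc_x, doc_y):
    """Cost matrix C for one WMD evaluation, served from the cache where possible."""
    return [[ground_distance(a, b) for b in doc_y] for a in doc_x]


C1 = cost_matrix(["obama", "speaks"], ["president", "greets"])
C2 = cost_matrix(["obama", "greets"], ["president", "greets"])  # reuses two cached pairs
```

After both calls, only six distinct word pairs have been evaluated instead of eight, and the saving grows as the same words recur across many document pairs.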
Setup. Following (Kusner et al., 2015; Huang et al., 2016), for datasets that do not have a predefined train/test split, we report the average and standard deviation of the testing accuracy and the average run-time of the methods over five 70/30 train/test splits. For WMD, we provide the results (with respect to accuracy) from (Kusner et al., 2015); we also reran the experiments of KNN-WMD and found them to be consistent with the reported results. For all methods, we perform 10-fold cross validation to search for the best parameters on the training documents. We employ a linear SVM implemented using LIBLINEAR (Fan et al., 2008) on WME since it can isolate the effectiveness of the feature representation from the power of nonlinear learning solvers. For additional results on all KNN-based methods, please refer to Appendix B.3.
Results. Table 2 shows the advantages of WME compared to KNN-WMD in terms of both accuracy and runtime. First, WME(SR) can consistently achieve better or similar accuracy compared to KNN-WMD while requiring order-of-magnitude less computational time on all datasets. Second, both methods can benefit from precomputation of the ground distance between a pair of words, but WME gains much more from prefetch (typically 3-5x speedup). This is because the typical length D of random documents is very short, so computing the ground distance between word vectors may be even more expensive than the corresponding WMD distance. Finally, WME(LR) can achieve much higher accuracy compared to KNN-WMD while still often requiring less computational time, especially on large datasets like 20NEWS and relatively long documents like OHSUMED.

Comparisons with Word2Vec & Doc2Vec
Baselines. We compare against 6 document representation methods: 1) Smooth Inverse Frequency (SIF) (Arora et al., 2017): a recently proposed simple but tough-to-beat baseline for sentence embeddings, combining a new weighting scheme of word embeddings with dominant component removal; 2) Word2Vec+nbow: a weighted average of word vectors using NBOW weights; 3) Word2Vec+tf-idf: a weighted average of word vectors using TF-IDF weights; 4) PV-DBOW (Le and Mikolov, 2014): distributed bag of words model of Paragraph Vectors; 5) PV-DM (Le and Mikolov, 2014): distributed memory model of Paragraph Vectors; 6) Doc2VecC (Chen, 2017): a recently proposed document embedding via corruptions, achieving state-of-the-art performance in text classification.
Setup. Word2Vec+nbow, Word2Vec+tf-idf and WME use pre-trained Word2Vec embeddings while SIF uses its default pre-trained GloVe embeddings. Following (Chen, 2017), to enhance the performance of PV-DBOW, PV-DM, and Doc2VecC, these methods are trained transductively on both train and test, which is indeed beneficial for generating a better document representation (see Appendix B.4). In contrast, the hyperparameters of WME are obtained through 10-fold cross validation on the training set only. For a fair comparison, we run a linear SVM using LIBLINEAR on all methods.
Results. Table 3 shows that WME consistently outperforms or matches existing state-of-the-art document representation methods in terms of testing accuracy on all datasets except one (OHSUMED). The first highlight is that a simple average of word embeddings often achieves better performance than SIF(GloVe), indicating that removing the first principal component could hurt the expressive power of the resulting representation for some classification tasks. Surprisingly, these two methods often achieve similar or better performance than PV-DBOW and PV-DM, which may be because of the high-quality pre-trained word embeddings. On the other hand, Doc2VecC achieves much better testing accuracy than these previous methods on two datasets (20NEWS and RECIPE_L). This is mainly because it benefits significantly from transductive training (see Appendix B.4). Finally, the better performance of WME over these strong baselines stems from the fact that WME is empowered by two important building blocks, WMD and Word2Vec, to yield a more informative representation of the documents by considering both the word alignments and the semantics of words. We refer the readers to additional results on the Imdb dataset in Appendix B.4, which also demonstrate the clear advantage of WME even compared to the supervised RNN method as well as the aforementioned baselines.
Setup. There are 22 textual similarity datasets in total, from the STS tasks (2012-2015) (Agirre et al., 2012, 2013, 2014, 2015), the SemEval 2014 Semantic Relatedness task (Marelli et al., 2014), and the SemEval 2015 Twitter task (Xu et al., 2015). The goal of these tasks is to predict the similarity between two input sentences. Each year STS usually has 4 to 6 different tasks and we only report the averaged Pearson's scores for clarity. Detailed results on each dataset are listed in Appendix B.5.
Results. Table 4 shows that WME consistently matches or outperforms other unsupervised and supervised methods except the SIF method. Indeed, compared with ST and nbow, WME improves Pearson's scores substantially, by 10% to 33%, as a result of the consideration of word alignments and the use of the TF-IDF weighting scheme. tf-idf also improves over these two methods but is slightly worse than our method, indicating the importance of taking into account the alignments between the words. The SIF method is a strong baseline for textual similarity tasks, but WME can still beat it on STS'12 and achieve close performance in other cases. Interestingly, WME is on a par with three supervised methods, RNN, LSTM(no), and LSTM(o.g.), in most cases. Finally, WME can benefit significantly from supervised word embeddings, similar to SIF, with both showing strong performance on PSL.
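The Pearson's score used for these tasks can be sketched as follows. The scoring function between two sentence embeddings is not specified in the text above; cosine similarity is assumed here purely for illustration, and the embedding pairs and gold scores are made-up toy values, not results from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy sentence-pair embeddings and gold similarity scores (illustrative only).
pairs = [
    ([0.9, 0.1, 0.3], [0.8, 0.2, 0.3]),
    ([0.9, 0.1, 0.3], [0.1, 0.9, 0.2]),
    ([0.5, 0.5, 0.1], [0.4, 0.6, 0.2]),
    ([0.2, 0.1, 0.9], [0.9, 0.2, 0.1]),
]
gold = [4.8, 1.0, 4.1, 0.7]

predicted = [cosine(u, v) for u, v in pairs]
score = pearson(predicted, gold)
```

Averaging such per-dataset scores within each year would give the kind of summary numbers the text refers to.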

Related Work
Two broad classes of unsupervised and supervised methods have been proposed to generate sentence and document representations. The former primarily generate general-purpose and domain-independent embeddings of word sequences (Socher et al., 2011; Kiros et al., 2015; Arora et al., 2017); many unsupervised training research efforts have focused on either training an auto-encoder to learn the latent structure of a sentence (Socher et al., 2013), a paragraph, or a document (Li et al., 2015); or generalizing Word2Vec models to predict words in a paragraph (Le and Mikolov, 2014; Chen, 2017) or in neighboring sentences (Kiros et al., 2015). However, some important information could be lost in the resulting document representation without considering the word order. Our proposed WME overcomes this difficulty by considering the alignments between each pair of words.
The other line of work has focused on developing compositional supervised models to create a vector representation of sentences (Kim et al., 2016; Gong et al., 2018b). Most of this work proposed composition using recursive neural networks based on parse structure (Socher et al., 2012, 2013), deep averaging networks over bag-of-words models (Iyyer et al., 2015; Wieting et al., 2015a), convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014; Xu et al., 2018), and recurrent neural networks using long short-term memory (Tai et al., 2015; Liu et al., 2015). However, these methods are less well suited for domain adaptation settings.

Conclusion
In this paper, we have proposed an alignment-aware text kernel using WMD for texts of variable lengths, which takes into account both word alignments and pre-trained high-quality word embeddings in learning an effective semantics-preserving feature representation. The proposed WME is simple, efficient, flexible, and unsupervised. Extensive experiments show that WME consistently matches or outperforms state-of-the-art models on various text classification and textual similarity tasks. WME embeddings can be easily used for a wide range of downstream supervised and unsupervised tasks.
for γ chosen to be ≤ 1. This gives us Combining (4) and (5), we have P max Choosing ε = t/(6γ) yields the result.

B Appendix B: Additional Experimental Results and Details
B.1 Experimental settings and parameters for WME
Setup. We choose 9 different document corpora, 8 of which overlap with the datasets in (Kusner et al., 2015; Huang et al., 2016). A complete data summary is in Table 1. These datasets come from various applications, including news categorization, sentiment analysis, and product identification, and have various numbers of classes, varying numbers of documents, and a wide range of document lengths. Our code is implemented in Matlab and we use the C Mex function for computationally expensive components of Word Mover's Distance (Rubner et al., 2000) and the freely available Word2Vec word embedding, which has pre-trained embeddings for 3 million words/phrases (from Google News) (Mikolov et al., 2013a). All computations were carried out on a DELL dual socket system with Intel Xeon processors at 2.93GHz for a total of 16 cores and 250 GB of memory, running the SUSE Linux operating system. To accelerate the computation of WMD-based methods, we use multithreading with a total of 12 threads for WME and KNN-WMD in all experiments. For all experiments, we generate random documents from a uniform distribution with mean centered in the Word2Vec embedding space, since we observe the best performance with this setting. We perform 10-fold cross-validation to search for the best parameters for γ and $D_{\max}$ as well as the parameter C for LIBLINEAR on the training set for each dataset. We simply fix $D_{\min} = 1$, and vary $D_{\max}$ in the range of 3 to 21, γ in the range of [1e-2 3e-2 0.10 0.14 0.19 0.28 0.39 0.56 0.79 1.0 1.12 1.58 2.23 3.16 4.46 6.30 8.91 10], and C in the range of [1e-5 1e-4 1e-3 1e-2 1e-1 1 1e1 1e2 3e2 5e2 8e2 1e3 3e3 5e3 8e3 1e4 3e4 5e4 8e4 1e5 3e5 5e5 8e5 1e6 1e7 1e8] respectively in all experiments. We collect all document corpora from these public websites: BBCSPORT, TWITTER, RECIPE
[3,12] generally yields a near-peak performance except BBCSPORT.

B.3 More results on Comparisons against distance-based methods
Setup. We preprocess all datasets by removing all words in the SMART stop word list (Buckley et al., 1995). For 20NEWS, we remove the words appearing less than 5 times. For LDA, we use the Matlab Topic Modeling Toolbox (Griffiths and Steyvers, 2007) and use sample code that first runs 100 burn-in iterations and then runs the chain for an additional 1000 iterations. For mSDA, we use the high-dimensional function mSDAhd, where the parameter dd is set to 0.2 times the BOW dimension. For all datasets, 10-fold cross validation on the training set is performed to get the optimal K for the KNN classifier, where K is searched in the range of [1, 21].
Results. Table 5 clearly demonstrates the superior performance of our method WME compared to other KNN-based methods in terms of testing accuracy. Indeed, BOW and TF-IDF perform poorly compared to other methods, which may be the result of frequent near-orthogonality of their high-dimensional sparse feature representation in the KNN classifier. KNN-WMD achieves noticeably better testing accuracy than LSI, LDA and mSDA since WMD takes into account the word alignments and leverages the power of Word2Vec. Remarkably, our proposed method WME achieves much higher accuracy compared to other methods including KNN-WMD on all datasets except one (CLASSIC). The substantially improved accuracy of WME suggests that a truly p.d. kernel implicitly admits an expressive feature representation of documents learned from the Word2Vec embedding space, in which the alignments between words are considered by using WMD.

B.4 More results on comparisons against Word2Vec and Doc2Vec-based document representations
Setup and results. For PV-DBOW, PV-DM, and Doc2VecC, we set the word and document vector dimension d = 300 to match the pre-trained word embeddings we used for WME and other Word2Vec-based methods in order to make a fair comparison. For other parameters, we use the recommended parameters in the papers, but we search for the best parameter C in LIBLINEAR for these methods. Additionally, we also train Doc2VecC with different corruption rates in the range of [0.1 0.3 0.5 0.7 0.9]. Following (Chen, 2017), these methods are trained transductively on both the training and test sets.

Figure 1: An illustration of the WMD and WME.All non-stop words are marked as bold face.WMD measures the distance between two documents.WME approximates a kernel derived from WMD with a set of random documents.

4.1 Effects of R and D on WME
Setup. We first perform experiments to investigate the behavior of the WME method by varying the number of Random Features R and the length D of random documents. The hyper-parameter γ is set via cross validation on the training set over the range [0.01, 10]. We simply fix $D_{\min} = 1$, and vary $D_{\max}$ over the range [3, 21]. Due to limited space, we only show selected subsets of our results, with the rest listed in Appendix B.2.

Figure 2: Train (Blue) and Test (Red) accuracy when varying R with fixed D.
: distributed bag of words model of Paragraph Vectors; 5) PV-DM (Le and Mikolov, 2014): distributed memory model of Paragraph Vectors; 6) Doc2VecC (Chen, 2017): a recently proposed document embedding via corruptions, achieving state-of-the-art performance in text classification.

Figure 4: Train (Blue) and test (Red) accuracy when varying R with fixed D.

Table 1: Properties of the datasets

Table 2: Test accuracy, and total training and testing time (in seconds) of WME against KNN-WMD. Speedups are computed between the best numbers of KNN-WMD+P and those of WME(SR)+P when achieving similar testing accuracy. Bold face highlights the best number for each dataset.

Table 3: Testing accuracy of WME against Word2Vec and Doc2Vec-based methods.

Table 5: Testing accuracy comparing WME against KNN-based methods

Table 6: Testing accuracy of WME against Word2Vec and Doc2Vec-based methods.

Table 7: Testing accuracy of WME against other document representations on Imdb dataset (50K). Results are collected from (Chen, 2017) and (Arora et al., 2017).