Multivariate Gaussian Document Representation from Word Embeddings for Text Categorization

Recently, there has been a lot of activity in learning distributed representations of words in vector spaces. Although there are models capable of learning high-quality distributed representations of words, how to generate vector representations of the same quality for phrases or documents still remains a challenge. In this paper, we propose to model each document as a multivariate Gaussian distribution based on the distributed representations of its words. We then measure the similarity between two documents based on the similarity of their distributions. Experiments on eight standard text categorization datasets demonstrate the effectiveness of the proposed approach in comparison with state-of-the-art methods.


Introduction
During the past decade, there has been a significant increase in the availability of textual information, mainly due to the exploding popularity of the World Wide Web. This rapid growth in textual information has established the need for effective text-mining approaches.
Traditionally, documents are represented as bag-of-words (BOW) vectors. The BOW representation is very simple and has proven effective in easy and moderate tasks. However, for more demanding tasks, such as short text modeling, its performance drops significantly.
To overcome the weaknesses of BOW, researchers have proposed methods that learn a latent low-dimensional representation of documents. Latent Semantic Analysis (Deerwester et al., 1990) and Latent Dirichlet Allocation (Blei et al., 2003) are the most widely used methods for this task. However, these methods do not systematically yield improved performance compared to the BOW representation.
Recently, there has been a growing interest in methods for learning distributed representations of words (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013; Mnih and Kavukcuoglu, 2013; Pennington et al., 2014; Lebret and Collobert, 2014). In the embedding space, semantically similar words are likely to be close to each other. Moreover, simple linear operations on word vectors can produce meaningful results. For example, the closest vector to "Vietnam" + "capital" is found to be "Hanoi" (Mikolov et al., 2013).
Several recent works make use of distributed representations of phrases to tackle various NLP problems (Bahdanau et al., 2015). There is therefore a clear need for methods that generate meaningful phrase or document representations based on the representations of their words. The most straightforward approach generates phrase or document representations by simply summing the vector representations of the words appearing in the phrase or document.
In this paper, we propose to model documents as multivariate Gaussian distributions. The mean of each distribution is the average of the vector representations of its words, and its covariance matrix captures how the embedding dimensions vary jointly around that mean. Empirical evaluation demonstrates the advantage of the proposed representation over the standard BOW representation and other baseline approaches across a range of datasets.
The rest of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 provides a description of the proposed approach. Section 4 evaluates the proposed representation. Finally, Section 5 concludes.

Related Work
Mitchell and Lapata (2008) proposed a general framework for generating representations of phrases or sentences. They computed vector representations of short phrases as a mixture of the original word vectors, using several different element-wise vector operations. Later, their work was extended to take into account syntactic structure and grammars (Erk and Padó, 2008; Baroni and Zamparelli, 2010; Coecke et al., 2010). Other work proposed to learn representations for documents by averaging their word representations, with a model that learns word representations suitable for summation. Le and Mikolov (2014) presented an algorithm to learn vector representations for paragraphs by inserting an additional memory vector in the input layer. Song and Roth (2015) presented three mechanisms for generating dense representations of short documents by combining Wikipedia-based explicit semantic analysis representations with distributed word representations.
Neural networks with convolutional and pooling layers have also been widely used for generating representations of phrases or documents. These networks allow the model to learn which sequences of words are good indicators of each topic, and then combine them to produce vector representations for documents. These architectures have proven effective in many NLP tasks, such as document classification (Johnson and Zhang, 2015), short-text categorization (Wang et al., 2015), sentiment classification (Kalchbrenner et al., 2014; Kim, 2014) and paraphrase detection (Yin and Schütze, 2015).

Gaussian Document Representation from Word Embeddings
Let D = {d_1, d_2, ..., d_m} be a set of m documents. The documents are pre-processed (tokenization, punctuation and special character removal) and the vocabulary of the corpus V is extracted. To obtain a distributed representation for each word w ∈ V, we employed the word2vec model (Mikolov et al., 2013). Specifically, for our experiments, we used a publicly available model M consisting of 300-dimensional vectors trained on a Google News dataset of about 100 billion words. Words contained in the vocabulary (w ∈ V) but not contained in the model (w ∉ M) were initialized to random vectors.
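The lookup with random fallback described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_embedding_table`, the toy 3-dimensional vectors and the uniform range [-0.25, 0.25] for out-of-model words are all assumptions, since the text only states that such words receive "random vectors".

```python
import random


def build_embedding_table(vocabulary, pretrained, dim, seed=0, scale=0.25):
    """Map every vocabulary word to a vector of length `dim`.

    `pretrained` is any dict-like mapping word -> list of floats.
    Words missing from it get a random vector; the uniform range
    [-scale, scale] is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    table = {}
    for word in sorted(vocabulary):  # sorted so the result is reproducible
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(-scale, scale) for _ in range(dim)]
    return table


# Toy stand-in for the 300-dimensional pre-trained model (hypothetical values).
toy_model = {"stock": [0.1, 0.3, -0.2], "market": [0.2, 0.1, -0.1]}
toy_vocab = {"stock", "market", "zzyzx"}
table = build_embedding_table(toy_vocab, toy_model, dim=3)
```

Fixing the seed keeps the random vectors identical across runs, which matters when the same out-of-model word appears in both training and test documents.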
To generate a representation for each document, we assume that its words were generated by a multivariate Gaussian distribution. Specifically, we regard the embeddings of all words w present in a document as i.i.d. samples drawn from a multivariate Gaussian distribution:

$$\mathbf{w} \sim \mathcal{N}(\mu, \Sigma) \tag{1}$$

where $\mathbf{w}$ is the distributed representation of a word w, µ is the mean vector of the distribution and Σ its covariance matrix. We set µ and Σ to their Maximum Likelihood estimates, given by the sample mean and the empirical covariance matrix respectively. More specifically, the sample mean of a document corresponds to the centroid of its words, i.e. we add the vectors of the words present in the text and normalize the sum by the total number of words. For an input sequence of words d, its mean vector µ is given by:

$$\mu = \frac{1}{|d|} \sum_{w \in d} \mathbf{w} \tag{2}$$

where |d| is the cardinality of d, i.e. its number of words. The empirical covariance matrix is then defined as:

$$\Sigma = \frac{1}{|d|} \sum_{w \in d} (\mathbf{w} - \mu)(\mathbf{w} - \mu)^{\top} \tag{3}$$

Hence, each document is represented as a multivariate Gaussian distribution and the problem transforms from classifying textual documents to classifying distributions.
To measure the similarity between pairs of documents, we compare their Gaussian representations. There are several well-known definitions of similarity or distance between distributions. Some examples include the Kullback-Leibler divergence, the Fisher kernel, the χ² distance and the Bhattacharyya kernel. However, most of these measures are very time consuming. In our setting where µ and Σ are very high-dimensional (if n is the dimensionality of the distributed representations, then µ ∈ R^n and Σ ∈ R^{n×n}), the complexity of these measures is prohibitive, even for small document collections.
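As an illustration of the sample mean and empirical covariance described above, the following sketch computes both for a toy document in plain Python. The function name `gaussian_representation` and the 2-dimensional vectors are assumptions made for the example; a practical implementation on 300-dimensional embeddings would more likely use a numerical library.

```python
def gaussian_representation(word_vectors):
    """Return the Maximum Likelihood (mean, covariance) of a document.

    `word_vectors` holds one embedding per word occurrence, all of the
    same length. The covariance is normalized by the number of words
    (the ML estimate), not by the number of words minus one.
    """
    count = len(word_vectors)
    if count == 0:
        raise ValueError("document has no word vectors")
    dim = len(word_vectors[0])
    # Centroid of the word embeddings.
    mean = [sum(vec[i] for vec in word_vectors) / count for i in range(dim)]
    # Deviations of each word from the centroid.
    centered = [[vec[i] - mean[i] for i in range(dim)] for vec in word_vectors]
    # Average outer product of the deviations.
    covariance = [
        [sum(row[i] * row[j] for row in centered) / count for j in range(dim)]
        for i in range(dim)
    ]
    return mean, covariance


# Toy document of three words with hypothetical 2-dimensional embeddings.
doc = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
mean, covariance = gaussian_representation(doc)
```

For this toy document the mean is (2, 2) and the covariance has entries 2/3, 2/3, 2/3 and 8/3. Note that a document with a single word yields an all-zero covariance matrix.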
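We proceed by defining a more efficient function for measuring the similarity between two distributions. More specifically, the similarity between two documents d_1 and d_2 is set equal to the convex combination of the similarities of their mean vectors µ_1 and µ_2 and their covariance matrices Σ_1 and Σ_2. The similarity between the mean vectors µ_1 and µ_2 is calculated using cosine similarity:

$$\mathrm{sim}(\mu_1, \mu_2) = \frac{\mu_1 \cdot \mu_2}{\|\mu_1\| \, \|\mu_2\|} \tag{4}$$

where ‖·‖ is the Euclidean norm for vectors. The similarity between the covariance matrices Σ_1 and Σ_2 can be computed using the following formula:

$$\mathrm{sim}(\Sigma_1, \Sigma_2) = \frac{\sum \Sigma_1 \circ \Sigma_2}{\|\Sigma_1\|_F \, \|\Sigma_2\|_F} \tag{5}$$

where (· ∘ ·) is the Hadamard or element-wise product between matrices (we sum over all its elements) and ‖·‖_F is the Frobenius norm for matrices. Hence, the similarity between two documents is equal to:

$$\mathrm{sim}(d_1, d_2) = \alpha \, \mathrm{sim}(\mu_1, \mu_2) + (1 - \alpha) \, \mathrm{sim}(\Sigma_1, \Sigma_2) \tag{6}$$

where α ∈ [0, 1]. The above similarity measure is also a valid kernel function: both terms are cosine similarities (the second one over the vectorized covariance matrices), i.e. normalized linear kernels, and a combination of kernels with non-negative weights is itself a kernel.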
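The similarity described above (cosine similarity between mean vectors, a normalized sum of element-wise products between covariance matrices, and their convex combination weighted by α) can be sketched in a few lines. The function names are assumptions for this example, and returning 0.0 when a norm is zero (for instance, the all-zero covariance of a one-word document) is a choice made here, not something the text specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: treat similarity with a zero vector as 0.0.
    return dot / norm if norm > 0 else 0.0


def covariance_similarity(sigma1, sigma2):
    """Sum of the element-wise product divided by the Frobenius norms.

    This equals the cosine similarity of the flattened matrices.
    """
    flat1 = [x for row in sigma1 for x in row]
    flat2 = [x for row in sigma2 for x in row]
    return cosine(flat1, flat2)


def document_similarity(mu1, sigma1, mu2, sigma2, alpha=0.5):
    """Convex combination of mean similarity and covariance similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * cosine(mu1, mu2)
            + (1.0 - alpha) * covariance_similarity(sigma1, sigma2))


# Two toy documents: orthogonal means, identical covariance matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
score = document_similarity([1.0, 0.0], identity, [0.0, 1.0], identity, alpha=0.5)
```

In this toy case the mean term contributes 0 and the covariance term contributes 1, so the score is 0.5. Setting `alpha=1.0` reduces the function to plain cosine similarity between centroids.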

Experiments
We evaluate the proposed approach as well as the baselines in the context of text categorization on eight standard datasets.

Baselines
We next present the baselines against which we compared our approach:
1) BOW (binary): Documents are represented as bag-of-words vectors. If a word is present in the document its entry in the vector is 1, otherwise 0. To perform text categorization, we employed a linear SVM classifier.
2) NBSVM: It combines a Naive Bayes classifier with an SVM and achieves remarkable results on several tasks (Wang and Manning, 2012). We used a combination of both unigrams and bigrams as features.
3) Centroid: Documents are projected in the word embedding space as the centroids of their words. This representation corresponds to the mean vector µ of the Gaussian representation presented in Section 3. Similarity between documents is computed using cosine similarity (Equation 4).
4) WMD: Distances between documents are computed using the Word Mover's Distance (Kusner et al., 2015). To compute the distances, we used pre-trained vectors from word2vec. A k-nn algorithm is then employed to classify the documents based on the distances between them. As in (Kusner et al., 2015), we used values of k ranging from 1 to 19.
5) CNN: A convolutional neural network architecture that has recently shown state-of-the-art results on sentence classification (Kim, 2014). We used a model with pre-trained vectors from word2vec where all word vectors are kept static during training. As regards the hyperparameters, we used the same settings as in (Kim, 2014): rectified linear units, filter windows of 3, 4, 5 with 100 feature maps each, dropout rate of 0.5, ℓ2 constraint of 3, mini-batch size of 50, and 25 epochs.
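For the binary BOW baseline, the representation can be illustrated with a short sketch. The helper name `binary_bow` and the toy documents are assumptions for the example; words outside the vocabulary are simply ignored here, which the text does not address.

```python
def binary_bow(tokens, vocabulary_index):
    """Return a 0/1 vector with a 1 for every vocabulary word in `tokens`."""
    vector = [0] * len(vocabulary_index)
    for token in tokens:
        position = vocabulary_index.get(token)
        if position is not None:  # assumption: skip out-of-vocabulary words
            vector[position] = 1
    return vector


# Toy corpus; the vocabulary is the sorted set of all tokens.
docs = [["good", "movie", "good"], ["bad", "movie"]]
vocabulary_index = {
    word: i for i, word in enumerate(sorted({t for d in docs for t in d}))
}
vectors = [binary_bow(d, vocabulary_index) for d in docs]
```

With the vocabulary ordered as (bad, good, movie), the two documents map to [0, 1, 1] and [1, 0, 1]. The repeated word "good" still yields a single 1, which is what distinguishes the binary variant from a count-based one.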

Datasets
In our experiments, we used several standard datasets:
(1) Reuters: contains stories collected from the Reuters news agency.
(2) Amazon: product reviews acquired from Amazon over four different sub-collections (Blitzer et al., 2007).
(3) TREC: a set of questions classified into 6 different types (Li and Roth, 2002).
(4) Snippets: consists of snippets that were collected from the results of Web search transactions (Phan et al., 2008).
(5) BBCSport: consists of sports news articles from the BBC Sport website (Greene and Cunningham, 2006).
(6) Polarity: consists of positive and negative snippets acquired from Rotten Tomatoes (Pang and Lee, 2005).
(7) Subjectivity: contains subjective sentences gathered from Rotten Tomatoes and objective sentences gathered from the Internet Movie Database (Pang and Lee, 2004).
(8) Twitter: contains a set of tweets, each labeled with its sentiment (Sanders, 2011).
Table 1 shows statistics of the 8 datasets.

Table 2: Performance (accuracy and macro-average F1-score) in text categorization on the 8 datasets.

Text Categorization
To perform text categorization, we employed an SVM classifier (Boser et al., 1992). Since the proposed similarity function (Equation 6) is a kernel, we directly built the kernel matrices. We tuned parameter α of the proposed approach using cross-validation on the training set of TREC and used the same value on all datasets (α = 0.5).
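Building a kernel matrix directly from a similarity function can be sketched as below. The function `gram_matrix` and the toy data are assumptions for illustration, and a plain dot product stands in for the proposed document similarity to keep the example short. The text does not say which SVM implementation was used; as one option, scikit-learn's `SVC(kernel="precomputed")` accepts a matrix of this form.

```python
def gram_matrix(items, similarity):
    """Return the symmetric matrix of pairwise similarities between items."""
    n = len(items)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            value = similarity(items[i], items[j])
            gram[i][j] = value
            gram[j][i] = value  # a kernel is symmetric, so compute each pair once
    return gram


def dot(u, v):
    """Plain dot product, used here as a stand-in kernel."""
    return sum(a * b for a, b in zip(u, v))


toy_items = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
gram = gram_matrix(toy_items, dot)
```

For the toy items the result is [[1, 0, 1], [0, 4, 2], [1, 2, 2]]. At prediction time a second, rectangular matrix of similarities between test and training documents is needed; it is omitted here.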
To assess the effectiveness of the different approaches, we employed two well-known evaluation metrics: accuracy and macro-average F1-score. Table 2 shows the performance of the considered approaches on the eight text categorization datasets. On all datasets except three (Snippets, Polarity, Subjectivity), the proposed approach outperforms the other methods. Furthermore, on two of the remaining three datasets (Snippets, Subjectivity), it achieves performance comparable to the best-performing methods. WMD is the worst-performing method on most datasets. This may be due to the k-nn algorithm that is employed to classify the documents. NBSVM achieves impressive results on all datasets, considering that it does not utilize word embeddings. It is also important to note that the approaches that use word embeddings (Centroid, WMD, CNN, Gaussian) achieve a large increase in performance on the Snippets dataset. One possible explanation is that these snippets belong to domains that are highly related to those of the articles on which the word2vec model was trained. Overall, our results demonstrate the effectiveness of the proposed method and the benefit of using word embeddings for measuring the similarity between pairs of documents.
As regards the proposed method, we also computed the sensitivity of the classification to the value of parameter α. Specifically, Figure 1 shows how the classification accuracy changes with respect to parameter α on the TREC dataset. The highest accuracy is achieved for values of α close to 0.5. Furthermore, when dropping the second term of Equation 6 (α = 1), the method is equivalent to the Centroid baseline and the performance drops significantly.
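The two evaluation metrics mentioned above, accuracy and macro-average F1-score, can be computed as in the following sketch. The function names and toy labels are assumptions for illustration. Taking the label set as the union of true and predicted labels, and scoring a class with no true or predicted instances as 0, are conventions chosen here rather than details given in the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy labels for five documents (hypothetical).
y_true = ["pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Because macro-averaging weights every class equally, a rare class that is classified poorly lowers the score as much as a frequent one, which is why it is reported alongside accuracy on imbalanced datasets.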

Conclusion
We proposed an approach that models each document as a Gaussian distribution based on the embeddings of its words. We then defined a function that measures the similarity between two documents based on the similarity of their distributions. Empirical evaluation demonstrated the effectiveness of the approach across a range of datasets. We attribute this performance gain to the high quality of the embeddings and to the ability of the proposed approach to utilize them effectively.