SCDV : Sparse Composite Document Vectors using soft clustering over distributional representations

We present a feature vector formation technique for documents - Sparse Composite Document Vector (SCDV) - which overcomes several shortcomings of the current distributional paragraph vector representations that are widely used for text representation. In SCDV, word embeddings are clustered to capture multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-label classification tasks, we outperform the previous state-of-the-art method, NTSG. We also show that SCDV embeddings perform well on heterogeneous tasks like topic coherence, context-sensitive learning and information retrieval. Moreover, we achieve a significant reduction in training and prediction times compared to other representation methods. SCDV achieves the best of both worlds: better performance with lower time and space complexity.


Introduction
Distributed word embeddings represent words as dense, low-dimensional, real-valued vectors that can capture their semantic and syntactic properties. These embeddings are used abundantly by machine learning algorithms in tasks such as text classification and clustering. Traditional bag-of-words models that represent words as indices into a vocabulary don't account for word ordering and long-distance semantic relations. Representations based on neural network language models (Mikolov et al., 2013b) can overcome these flaws and further reduce the dimensionality of the vectors. The success of the method was recently explained mathematically using the random walk on discourses model (Arora et al., 2016a). However, there is a need to extend word embeddings to entire paragraphs and documents for tasks such as document and short-text classification.
Representing entire documents in a dense, low-dimensional space is a challenge. A simple weighted average of the word embeddings in a large chunk of text ignores word ordering, while a parse tree based combination of embeddings (Socher et al., 2013) can only extend to sentences. (Le and Mikolov, 2014) trains word and paragraph vectors to predict context but shares word embeddings across paragraphs. However, words can have different semantic meanings in different contexts. Hence, vectors of two documents that contain the same word in two distinct senses need to account for this distinction for an accurate semantic representation of the documents. (Ling et al., 2015), (Liu et al., 2015a) map word embeddings to a latent topic space to capture different senses in which words occur. However, they represent complex documents in the same space as words, reducing their expressive power. These methods are also computationally intensive.
In this work, we propose the Sparse Composite Document Vector (SCDV) representation learning technique to address these challenges and create efficient, accurate and robust semantic representations of large texts for document classification tasks. SCDV combines syntax and semantics learnt by word embedding models together with a latent topic model that can handle different senses of words, thus enhancing the expressive power of document vectors. The topic space is learnt efficiently using a soft clustering technique over embeddings and the final document vectors are made sparse for reduced time and space complexity in tasks that consume these vectors.
The rest of the paper is organized as follows. Section 2 discusses related work in document representations. Section 3 introduces and explains SCDV in detail. This is followed by extensive experiments and analysis in Sections 4 and 5, respectively.

Related Work
(Le and Mikolov, 2014) proposed two models for distributional representation of a document, namely, Distributed Memory Model Paragraph Vectors (PV-DM) and Distributed BoWs paragraph vectors (PV-DBoW). In PV-DM, the model is trained to predict the next context word using word and paragraph vectors. In PV-DBoW, the paragraph vector is directly learned to predict randomly sampled context words. In both models, word vectors are shared across paragraphs. While word vectors capture semantics across different paragraphs of the text, document vectors are learned over context words generated from the same paragraph and potentially capture only local semantics (Singh and Mukerjee, 2015). Moreover, a paragraph vector is embedded in the same space as word vectors though it can contain multiple topics and words with multiple senses. As a result, doc2vec (Le and Mikolov, 2014) doesn't perform well on Information Retrieval as described in (Ai et al., 2016a) and (Roy et al., 2016). Consequently, we expect a paragraph vector to be embedded in a higher dimensional space.
A paragraph vector also assumes all words contribute equally, both quantitatively (weight) and qualitatively (meaning). It ignores the importance and distinctiveness of a word across all documents (Singh and Mukerjee, 2015). Singh and Mukerjee (2015) proposed idf-weighted averaging of word vectors to form document vectors, which tries to address this problem. However, it assumes that all words within a document belong to the same semantic topic. Intuitively, a paragraph often has words originating from several semantically different topics. In fact, Latent Dirichlet Allocation (Blei et al., 2003) models a document as a distribution of multiple topics.
These shortcomings are addressed in three novel composite document representations called Topical word embedding (TWE-1, TWE-2 and TWE-3) by (Liu et al., 2015a). TWE-1 learns word and topic embeddings by considering each topic as a pseudo word and builds the topical word embedding for each word-topic assignment. Here, the interaction between a word and the topic to which it is assigned is not considered. TWE-2 learns a topical word embedding for each word-topic assignment directly, by considering each word-topic pair as a pseudo word. Here, the interaction between a word and its assigned topic is considered but the vocabulary of pseudo-words blows up. For each word and each topic, TWE-3 builds distinct embeddings for the topic and word and concatenates them for each word-topic assignment. Here, the word embeddings are influenced by the corresponding topic embeddings, making words in the same topic less discriminative. (Liu et al., 2015a) proposed an architecture called Neural tensor skip-gram model (NTSG-1, NTSG-2, NTSG-3, NTSG-4), that learns multi-prototype word embeddings and uses a tensor layer to model the interaction of words and topics to capture different senses. NTSG outperforms other embedding methods like TWE-1 on the 20 newsgroup data-set by modeling context-sensitive embeddings in addition to topical-word embeddings. LTSG (Law et al., 2017) builds on NTSG by jointly learning the latent topic space and context-sensitive word embeddings. All three, TWE, NTSG and LTSG, use LDA and suffer from computational issues like large training time, prediction time and storage space. They also embed document vectors in the same space as terms. Other works that harness topic modeling like WTM (Fu et al., 2016), w2v-LDA (Nguyen et al., 2015), TV+MeanWV, LTSG (Law et al., 2017), Gaussian-LDA (Das et al., 2015), Topic2Vec (Niu et al., 2015), (Moody, 2016) and MvTM also suffer from similar issues.
(Gupta et al., 2016) proposed a method to form a composite document vector using word embeddings and tf-idf values, called the Bag of Words Vector (BoWV). In BoWV, each document is represented by a vector of dimension $D = K \cdot d + K$, where K is the number of clusters and d is the dimension of the word embeddings. The core idea behind BoWV is that semantically different words belong to different topics and their word vectors should not be averaged. Further, BoWV computes the inverse cluster frequency (icf) of each cluster by averaging the idf values of its member terms to capture the importance of words in the corpus. However, BoWV does hard clustering using the K-means algorithm, assigning each word to only one cluster or semantic topic, although a word can belong to multiple topics. For example, the word apple belongs to the topic food as a fruit, and to the topic Information Technology as an IT company. Moreover, BoWV is a non-sparse, high dimensional continuous vector and suffers from computational problems like large training time, prediction time and storage requirements.

Sparse Composite Document Vectors
In this section, we present the proposed Sparse Composite Document Vector (SCDV) representation as a novel document vector learning algorithm. The feature formation algorithm can be divided into three steps.

Word Vector Clustering
We begin by learning d-dimensional word vector representations for every word in the vocabulary V using the skip-gram algorithm with negative sampling (SGNS) (Mikolov et al., 2013a). We then cluster these word embeddings using the Gaussian Mixture Models (GMM) (Reynolds, 2015) soft clustering technique. The number of clusters, K, to be formed is a parameter of the SCDV model. By inducing soft clusters, we ensure that each word belongs to every cluster with some probability $P(c_k \mid w_i)$.
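The soft assignment can be illustrated with a small sketch. It assumes the GMM has already been fitted and that all components share one spherical variance (the setting reported in the experiments); the function name `soft_assignments` and the toy parameters are hypothetical and not taken from the paper.

```python
import math

def soft_assignments(word_vec, means, weights, variance):
    """Return P(c_k | w_i) for one word vector under a spherical GMM.

    means: list of K cluster means; weights: K mixing weights;
    variance: one spherical variance shared by all components (assumption).
    """
    # Log of (weight * Gaussian density). The Gaussian normalising constant
    # is identical for every cluster here, so it cancels and is omitted.
    log_scores = []
    for mean, weight in zip(means, weights):
        sq_dist = sum((x - m) ** 2 for x, m in zip(word_vec, mean))
        log_scores.append(math.log(weight) - sq_dist / (2.0 * variance))
    # Normalise with the log-sum-exp trick for numerical stability.
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: 2-dimensional "embeddings" and K = 2 clusters.
means = [[0.0, 0.0], [4.0, 4.0]]
weights = [0.5, 0.5]
probs = soft_assignments([1.0, 1.0], means, weights, variance=1.0)
```

Every cluster receives a non-zero probability, so a word that lies between two cluster means contributes to both topics rather than to a single one.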

Document Topic-vector Formation
For each word $w_i$, we create K different word-cluster vectors of d dimensions ($\vec{wcv}_{ik}$) by weighting the word's embedding with its probability distribution in the k-th cluster, $P(c_k \mid w_i)$. We then concatenate all K word-cluster vectors ($\vec{wcv}_{ik}$) into a $K \times d$ dimensional embedding and weight it with the inverse document frequency of $w_i$ to form a word-topics vector ($\vec{wtv}_i$). Finally, for all words appearing in document $D_n$, we sum their word-topics vectors $\vec{wtv}_i$ to obtain the document vector $\vec{dv}_{D_n}$.
Algorithm 1 (excerpt)
3 Cluster word vectors wv using GMM clustering into K clusters;
4 Obtain soft assignment $P(c_k \mid w_i)$ for word $w_i$ and cluster $c_k$;
/* Loop 5-10 can be pre-computed */
5 for each word $w_i$ in vocabulary V do
6     for each cluster $c_k$ do
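The formation of word-topics vectors and document vectors described above can be sketched as follows. This is a minimal illustration under stated assumptions: the word vectors, cluster probabilities and idf values are made-up toy data, the function names are hypothetical, and skipping out-of-vocabulary tokens is an assumption rather than something the text specifies.

```python
def word_topic_vector(word_vec, cluster_probs, idf):
    """Concatenate K probability-weighted copies of word_vec, scaled by idf."""
    wtv = []
    for p in cluster_probs:          # one word-cluster vector per cluster
        wtv.extend(idf * p * x for x in word_vec)
    return wtv                       # length K * d

def document_vector(tokens, word_vecs, cluster_probs, idf):
    """Sum the word-topics vectors of all in-vocabulary tokens of a document."""
    d = len(next(iter(word_vecs.values())))
    k = len(next(iter(cluster_probs.values())))
    dv = [0.0] * (k * d)
    for tok in tokens:
        if tok not in word_vecs:
            continue                 # assumption: ignore out-of-vocabulary tokens
        wtv = word_topic_vector(word_vecs[tok], cluster_probs[tok], idf[tok])
        dv = [a + b for a, b in zip(dv, wtv)]
    return dv

# Toy data with d = 2 and K = 2 (illustrative values only).
word_vecs = {"apple": [1.0, 0.0], "pie": [0.0, 1.0]}
cluster_probs = {"apple": [0.5, 0.5], "pie": [0.9, 0.1]}
idf = {"apple": 2.0, "pie": 1.0}
dv = document_vector(["apple", "pie", "unknown"], word_vecs, cluster_probs, idf)
```

Because the word-topics vectors depend only on the vocabulary, they can be computed once and reused for every document, which matches the note in Algorithm 1 that the per-word loop can be pre-computed.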

Sparse Document Vectors
After normalizing the vector, we observed that most values in $\vec{dv}_{D_n}$ are very close to zero. Figure 3 verifies this observation. We utilize this fact to make the document vector $\vec{dv}_{D_n}$ sparse by zeroing attribute values whose absolute value falls below a threshold (specified as a parameter), which results in the Sparse Composite Document Vector $\vec{SCDV}_{D_n}$.
In particular, the thresholding rule is expressed in terms of p, the percentage sparsity threshold parameter; $a_i$, the value of the i-th attribute of the non-sparse composite document vector; and n, the index of the n-th document in the training set. Flowcharts depicting the formation of the word-topics vector and Sparse Composite Document Vectors are shown in Figure 1 and Figure 2, respectively. Algorithm 1 describes SCDV in detail.
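One way to realize such a percentage threshold is sketched below. The exact rule is not recoverable from the text, so the scale used here is an assumption: the threshold is taken as p percent of the mean of the absolute average per-document minimum and maximum attribute values. The function name `sparsify` and the toy vectors are hypothetical.

```python
def sparsify(doc_vecs, p):
    """Zero out entries whose magnitude is below p percent of a corpus-level scale."""
    n = len(doc_vecs)
    avg_min = sum(min(v) for v in doc_vecs) / n   # average per-document minimum
    avg_max = sum(max(v) for v in doc_vecs) / n   # average per-document maximum
    # Assumed scale: mean of the absolute average minimum and maximum.
    scale = (abs(avg_min) + abs(avg_max)) / 2.0
    threshold = (p / 100.0) * scale
    return [[a if abs(a) >= threshold else 0.0 for a in v] for v in doc_vecs]

# Toy document vectors (illustrative values only).
docs = [[0.9, -0.01, 0.02, -0.8], [0.5, 0.03, -0.6, 0.001]]
sparse_docs = sparsify(docs, p=10)
```

With these toy values the threshold is 0.07, so the small entries are zeroed while the large positive and negative entries are kept unchanged.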

Experiments
We perform multiple experiments to show the effectiveness of SCDV representations for multi-class and multi-label text classification. For all experiments and baselines, we use a machine with an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 40 working cores and 128GB RAM, running Linux Ubuntu 14.4. However, we utilize multiple cores only during Word2Vec training and when we run the one-vs-rest classifier for Reuters.
We use the best parameter settings as reported in all our baselines to generate their results. We use

Text Classification
We run multi-class experiments on the 20NewsGroup dataset and multi-label classification experiments on the Reuters-21578 dataset. We preprocess the Reuters-21578 dataset with an existing script. We use a linear SVM for multi-class classification and logistic regression in a one-vs-rest setting for multi-label classification, for both the baselines and SCDV.
For SCDV, we set the dimension of the word embeddings to 200 and the number of mixture components in GMM to 60. All mixture components share the same spherical covariance matrix. We learn word vector embeddings using Skip-Gram with Negative Sampling (SGNS), with a window size of 10, 10 negative samples and a minimum word frequency of 20. We use 5-fold cross-validation on F1 score to tune the parameter C of the SVM and the sparsity threshold for SCDV.

Multi-class classification
We evaluate classifier performance using standard metrics like accuracy and macro-averaged precision, recall and F-measure. Table 1 shows a comparison with the current state-of-the-art (NTSG) document representations on the 20Newsgroup dataset. We observe that SCDV outperforms all other current models by fair margins. We also present the class-wise precision and recall for 20Newsgroup on an almost balanced dataset with SVM over the Bag of Words model and the SCDV embeddings in Table 2, and observe that SCDV improves consistently over all classes.
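For reference, macro-averaging can be sketched as below. This is an illustration, not the evaluation code used for the tables: it averages per-class F1 scores, which is one common convention, and the function name `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over the classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    # Each class counts equally, regardless of how many examples it has.
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k

macro_p, macro_r, macro_f = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Macro-averaging weights every class equally, which is why it is informative alongside the class-wise results on an almost balanced dataset.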

Multi-label classification
We evaluate multi-label classification performance using Precision@K, nDCG@k (Bhatia et al., 2015), Coverage error, Label ranking average precision score (LRAPS) and F1-score. All measures are extensively used for the multi-label classification task. Of these, F1-score is a particularly appropriate metric for multi-label classification as it considers label biases when train-test splits are random. Table 3 shows evaluation results for multi-label text classification on the Reuters-21578 dataset.
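As an illustration of the ranking-based measures, Precision@K for a single document can be sketched as follows. The function name `precision_at_k`, the label names and the scores are hypothetical example values, not results from the experiments.

```python
def precision_at_k(true_labels, scores, k):
    """Fraction of the k highest-scored labels that are in true_labels."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for label in ranked if label in true_labels) / k

# Classifier scores for one document (illustrative values only).
scores = {"earn": 0.9, "acq": 0.6, "grain": 0.3, "crude": 0.1}
p1 = precision_at_k({"earn", "grain"}, scores, k=1)
p3 = precision_at_k({"earn", "grain"}, scores, k=3)
```

The corpus-level figure is typically the mean of this value over all test documents.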

Effect of Hyper-Parameters
SCDV has three parameters: the number of clusters, the word vector dimension and the sparsity threshold parameter. We vary one parameter while keeping the other two constant. Performance on varying all three parameters is shown in Figure 4. We observe that performance improves as we increase the number of clusters and saturates at 60. Performance improves until a word vector dimension of 300, after which it saturates. Similarly, performance improves as we increase p up to 4, after which it declines. At 4% thresholding, we reduce the storage space by 80% compared to the dense vectors. We observe that SCDV is robust to variations in training Word2Vec

Topic Coherence
We evaluate the topics generated by GMM clustering on 20NewsGroup for quantitative and qualitative analysis. Instead of using perplexity (Chang et al., 2011), which doesn't correlate with semantic coherence and human judgment of individual topics, we used the popular topic coherence measure (Mimno et al., 2011), (Arora et al., 2013), (Chen and Liu, 2014). A higher topic coherence score indicates a more coherent topic. We used Bayes rule to compute $P(w_k \mid c_i)$ for a given topic $c_i$ and a given word $w_k$, and compute the score of the top 10 words for each topic.
Here, $\#(w_k)$ denotes the number of times word $w_k$ appears in the corpus and V represents the vocabulary size.
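The Bayes-rule step can be written out as follows. This is a sketch under two assumptions that the text does not state explicitly: the word prior $P(w_k)$ is the relative corpus frequency of $w_k$, and the topic prior $P(c_i)$ is obtained by marginalizing over the vocabulary.

```latex
P(w_k \mid c_i) = \frac{P(c_i \mid w_k)\, P(w_k)}{P(c_i)}, \qquad
P(w_k) = \frac{\#(w_k)}{\sum_{j=1}^{V} \#(w_j)}, \qquad
P(c_i) = \sum_{k=1}^{V} P(c_i \mid w_k)\, P(w_k)
```

The term $P(c_i \mid w_k)$ is the soft assignment produced by the GMM, so no additional model needs to be trained to rank words within a topic.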
We calculated the topic coherence score for all topics for SCDV, LDA and LTSG (Law et al., 2017). Averaged over all 80 topics, GMM clustering scores -85.23, compared to -108.72 for LDA and -92.23 for LTSG. Thus, SCDV creates more coherent topics than both LDA and LTSG. Table 4 shows the top 10 words of 3 topics from GMM clustering, the LDA model and the LTSG model on 20NewsGroup; SCDV shows higher topic coherence. Words are ranked based on their probability distribution in each topic. Our results also support the qualitative results of (Randhawa et al., 2016) and (Sridhar, 2015), where k-means and GMM, respectively, were used over word vectors to find topics.
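A coherence score of this kind can be sketched as below, assuming the document co-occurrence formulation of (Mimno et al., 2011): for a ranked word list, sum log((D(w_m, w_l) + 1) / D(w_l)) over all pairs with l ranked before m, where D counts the documents containing the given words. Whether the reported scores use exactly this variant is an assumption, and the function name and toy corpus are hypothetical.

```python
import math

def topic_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ranked word pairs l < m."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # Assumes every top word occurs in at least one document.
            co_occurrence = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrence + 1) / doc_freq(top_words[l]))
    return score

# Toy corpus of tokenized documents (illustrative only).
documents = [["gpu", "driver", "card"], ["gpu", "card"], ["driver", "windows"]]
coherence = topic_coherence(["gpu", "card", "driver"], documents)
```

Scores are usually negative on real corpora because most word pairs co-occur in far fewer documents than either word alone; values closer to zero indicate more coherent topics.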

Context-Sensitive Learning
In order to demonstrate the effects of soft clustering (GMM) during SCDV formation, we select some words ($w_j$) with multiple senses from 20Newsgroup and their soft cluster assignments to find the dominant clusters. We also select top scoring words ($w_k$) from each cluster ($c_i$) to represent the meaning of that cluster. Table 5 shows polysemic words and their dominant clusters with assignment probabilities. This indicates that using soft clustering to learn word vectors helps combine multiple senses into a single embedding vector. (Arora et al., 2016b) also reported similar results for polysemous words.

Information Retrieval
(Ai et al., 2016b) used (Mikolov et al., 2013b)'s paragraph vectors to enhance the basic language model based retrieval model. The language model (LM) probabilities are estimated from the corpus and smoothed using a Dirichlet prior (Zhai and Lafferty, 2004). In (Ai et al., 2016b), this language model is then interpolated with the paragraph vector (PV) language model as follows:
$$P(w \mid d) = (1 - \lambda)\, P_{LM}(w \mid d) + \lambda\, P_{PV}(w \mid d)$$
The score for document $d$ and query string $Q$, $\mathrm{score}(q, d)$, is then defined in terms of $P(w \mid d)$ and $P(w)$, where $P(w)$ is obtained from the unigram query model, and $\mathrm{score}(q, d)$ is used to rank documents. (Ai et al., 2016b) do not directly make use of paragraph vectors for the retrieval task, but improve the document language model. To make direct use of paragraph vectors and keep computations tractable, we interpolate the language model query-document score $\mathrm{score}(q, d)$ with the similarity score between the normalized query and document vectors to generate $\mathrm{score}_{PV}(q, d)$, which is then used to rank documents:
$$\mathrm{score}_{PV}(q, d) = (1 - \lambda)\, \mathrm{score}(q, d) + \lambda\, \vec{q} \cdot \vec{d}$$
Directly evaluating the document similarity score with the query paragraph vector, rather than collecting similarity scores for individual words in the query, helps avoid confusion amongst distinct query topics and makes the interpolation operation faster. In Table 6, we observe consistent improvement in MAP for all datasets. We marginally improve the MAP reported by (Ai et al., 2016b) on the Robust04 task.
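The interpolation of the language model score with the vector similarity can be sketched as follows. The function names, the toy vectors and the values of the language model score and of lambda are hypothetical; how the language model score itself is computed is outside this sketch.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def interpolated_score(lm_score, query_vec, doc_vec, lam):
    """(1 - lam) * lm_score + lam * dot product of the normalized vectors."""
    q, d = normalize(query_vec), normalize(doc_vec)
    similarity = sum(a * b for a, b in zip(q, d))
    return (1.0 - lam) * lm_score + lam * similarity

# Toy query and document vectors (illustrative values only).
score_pv = interpolated_score(lm_score=0.2, query_vec=[1.0, 0.0],
                              doc_vec=[3.0, 4.0], lam=0.5)
```

Only one dot product per query-document pair is needed, in contrast to accumulating a similarity term for every query word.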
In addition, we also report the improvements in MAP score when Model based relevance feedback (Zhai and Lafferty, 2001) is applied over the initially retrieved results from both models. Again, we notice a consistent improvement in MAP.

Analysis and Discussion
SCDV overcomes several challenges encountered while training document vectors, which we had mentioned above.
1. Clustering word embeddings to discover topics improves performance of classification as Figure 4 (left) indicates, while also generating coherent clusters of words (Table 4). Figure 5 shows that clustering gives more discriminative representations of documents than paragraph vectors do since it uses $K \times d$ dimensions while paragraph vectors embed documents and words in the same space. This enables SCDV to represent complex documents. Fuzzy clustering allows words to belong to multiple topics, thereby recognizing polysemic words, as Table 5 indicates. Thus it mimics the word-context interaction in NTSG and LTSG.
2. Semantically different words are assigned to different topics. Moreover, a single document can contain words from multiple different topics. Instead of a weighted averaging of word embeddings to form document vectors, as in most previous work, concatenating word embeddings for each topic (cluster) avoids merging of semantically different topics.
3. It is well-known that in higher dimensions, structural regularizers such as sparsity help overcome the curse of dimensionality (Wainwright, 2014). Figure 3 demonstrates this, since the majority of the features are close to zero. Sparsity also enables linear SVM to scale to large dimensions. On 20NewsGroups, the BoWV model takes up 1.1 GB while SCDV takes up only 236 MB (80% decrease).
Since GMM assigns a non-zero probability to every topic in the word embedding, noise can accumulate when document vectors are created and tip the scales in favor of an unrelated topic. Sparsity helps to reduce this by zeroing out very small values of probability.
4. SCDV uses a Gaussian Mixture Model (GMM), while TWE, NTSG and LTSG use LDA for finding semantic topics. The time complexity of GMM is $O(VNT^2)$ while that of LDA is $O(V^2NT)$. Here, V is the vocabulary size, N the number of documents and T the number of topics. Since the number of topics T is smaller than the vocabulary size V, GMM is faster. Empirically, compared to TWE, SCDV reduces document vector formation, training and prediction time significantly. Table 7 shows training and prediction times for BoWV, SCDV and TWE models.
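Taking the two stated complexities at face value and ignoring constant factors, the comparison in point 4 reduces to a single ratio, written out here as a short worked step:

```latex
\frac{V^{2} N T}{V N T^{2}} = \frac{V}{T} > 1 \quad \text{whenever } T < V
```

Under this reading, the advantage of GMM over LDA grows with the ratio of vocabulary size to number of topics.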

Conclusion
In this paper, we propose a document feature formation technique for topic-based document representation. SCDV outperforms state-of-the-art models in multi-class and multi-label classification tasks. SCDV introduces sparsity in document vectors to handle high dimensionality. Table 7 indicates that SCDV shows considerable improvements in feature formation, training and prediction times for the 20NewsGroups dataset. We show that fuzzy GMM clustering on word vectors leads to more coherent topics than LDA and can also be used to detect polysemic words. SCDV embeddings also provide a robust estimation of the query and document language models, thus improving the MAP of language model based retrieval systems. In conclusion, SCDV is simple, efficient and creates a more accurate semantic representation of documents.
Chengxiang Zhai and John Lafferty. 2001. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the tenth international conference on Information and knowledge management, pages 403-410. ACM.