A Document Descriptor using Covariance of Word Vectors

In this paper, we propose a novel document descriptor based on the covariance matrix of the word vectors of a document. Our descriptor has a fixed length, which makes it easy to use in many supervised and unsupervised applications. We tested the descriptor on a variety of supervised and unsupervised tasks. Our evaluation shows that our document covariance descriptor achieves competitive performance against state-of-the-art methods across these tasks.


Introduction
Retrieving documents that are similar to a query using vectors has a long history. Earlier methods modeled documents and queries using vector space models via the bag-of-words (BOW) representation (Salton and Buckley, 1988). Other representations include latent semantic indexing (LSI) (Deerwester et al., 1990), which can be used to define dense vector representations for documents and/or queries. The past few years have witnessed great interest in distributed representations for words, sentences, paragraphs and documents, achieved by leveraging deep learning methods that learn word vector representations. The introduction of neural language models (Bengio et al., 2003) made it possible to learn word vector representations (word embeddings for short). The seminal work of Mikolov et al. introduced an efficient way to compute dense vectorized representations of words (Mikolov et al., 2013a,b). A more recent step has been to move beyond distributed representations of words towards distributed representations of sentences, paragraphs and documents. Most of the presented works study the interrelationship between words in a text snippet (Hill et al., 2016; Kiros et al., 2015; Le and Mikolov, 2014) in an unsupervised fashion. Other methods build a task-specific representation (Kim, 2014; Collobert et al., 2011).
In this paper we propose to use the covariance matrix of the word vectors in a document to define a novel document descriptor, which we call the DoCoV descriptor. It provides a fixed-length representation of the paragraph that captures the interrelationship between the dimensions of the word embedding via the covariance matrix elements. This distinguishes our work from that of (Le and Mikolov, 2014; Hill et al., 2016; Kiros et al., 2015), which studies the interrelationship of the words in the text snippet.

Toy Example
We show a toy example to highlight the differences between the DoCoV vector, the Mean vector and the paragraph vector (Le and Mikolov, 2014). First, we used the Gensim library to generate word vectors and paragraph vectors from a dummy training corpus. Next, we formed two hypothetical documents; the first contains words about "pets" and the second contains words about "travel". The top part of figure 1 shows the first two dimensions of a word embedding for each document separately. On the bottom left, we show the embeddings of the two documents' words in the same space, together with the Mean vectors and the paragraph vectors. In the word embedding space the covariance matrices are represented via confidence ellipses. On the bottom right we show the corresponding covariance matrices as points in a new space after the vectorization step.

Motivation and Contributions
Below we describe our motivation for the proposed representation: (1) Some neural-based paragraph representations, such as paragraph vectors (Le and Mikolov, 2014) and FastSent (Hill et al., 2016), use a shared space between the words and the paragraphs. This is counter-intuitive, as a paragraph is a different entity from its words. Figure 1 illustrates this point: there is no clear interpretation of why the paragraph vectors (Le and Mikolov, 2014) are positioned in the space where they are.
(2) The covariance matrix is a second-order summary statistic of multivariate data, which distinguishes it from the mean vector. In figure 1 we visualize the covariance matrix using a confidence ellipse representation. We see that the covariance encodes the shape of the density composed of the words of interest. In this example the Mean vectors of two dissimilar documents are placed close together by the word embedding; the covariance matrices, on the other hand, capture the distinctness of the two documents.
(3) The covariance matrix has seen great success as a spatial descriptor of multivariate data in domains such as computer vision (Tuzel et al., 2006; Hussein et al., 2013; Sharaf et al., 2015) and brain signal analysis (Barachant et al., 2013). Given this broad success, we believe the representation can be useful for text-related tasks as well.
(4) The computation of the covariance descriptor is known to be fast and highly parallelizable. Moreover, no inference step is involved in computing the covariance matrix given its observations. This is an advantage over existing methods for generating paragraph vectors, such as (Le and Mikolov, 2014; Hill et al., 2016).
Our contribution in this work is two-fold: (1) We propose the Document-Covariance descriptor (DoCoV), which represents every document as the covariance of the word embeddings of its words. To the best of our knowledge, we are the first to explicitly compute covariance descriptors on word embeddings such as word2vec (Mikolov et al., 2013b) or similar word vectors.
(2) We empirically show the effectiveness of our novel descriptor in comparison to state-of-the-art methods on various unsupervised and supervised classification tasks. Our results show that the descriptor attains comparable accuracy to state-of-the-art methods on a diverse set of tasks.

Related Work
Word embeddings lie at the core of recent state-of-the-art methods for many tasks, such as semantic textual similarity and sentiment analysis. Among the approaches for learning word embeddings are (Pennington et al., 2014; Levy and Goldberg, 2014; Mikolov et al., 2013b). These alternatives share the same objective of finding a fixed-length vectorized representation of words that captures the semantic and syntactic regularities between them.
These efforts paved the way for many researchers to judge document similarity based on word embeddings. Some efforts aimed at finding a global representation of a text snippet using a paragraph-level representation, such as paragraph vectors (Le and Mikolov, 2014). More recently, other neural-based sentence- and paragraph-level representations have appeared that provide a fixed-length representation, like Skip-Thought vectors (Kiros et al., 2015) and FastSent (Hill et al., 2016). Other efforts focused on defining a Word Mover's Distance (WMD) based on word-level representations (Kusner et al., 2015).
Prior to this work, we made earlier attempts at using covariance features in community question answering (Malhas et al., 2016b,a; Torki et al., 2017), combining them with lexical and semantic features. Closest to our work is (Nikolentzos et al., 2017), which builds an implicit representation of documents using a multidimensional Gaussian distribution and then computes a similarity kernel to be used in a document classification task. Our work is distinguished from (Nikolentzos et al., 2017) in that we compute an explicit descriptor for any document. Moreover, we use linear models, which scale much better than the non-linear kernels introduced in (Nikolentzos et al., 2017).

Document Covariance Descriptor
We present our DoCoV descriptor. First, we define a document observation matrix. Second, we show how to extract the DoCoV descriptor.

Document Observation Matrix
Given a d-dimensional word embedding model and an n-term document, we define a document observation matrix O ∈ R^(n×d). In the matrix O, each row represents a term in the document and the columns hold the d-dimensional word embedding representation of that term.
Assume that we have observed n terms of a d-dimensional random variable, giving a data matrix O ∈ R^(n×d) whose rows are the observations. The sample mean vector x̄ ∈ R^d of the n observations collects the means x̄_j of the d variables:

x̄_j = (1/n) Σ_{i=1}^{n} O_{ij},  j = 1, …, d.

From here on, when we mention the Mean vector we mean the sample mean vector x̄.
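As a minimal NumPy sketch of the two definitions above (the word vectors here are made-up placeholders, not output of a real embedding model):

```python
import numpy as np

# Hypothetical 4-term document in a 3-dimensional embedding space;
# each row of O is the word vector of one term (placeholder values).
O = np.array([[0.2, 0.1, 0.5],
              [0.4, 0.3, 0.1],
              [0.0, 0.2, 0.4],
              [0.6, 0.0, 0.2]])

# The sample mean vector is the column-wise average of the n observations.
x_bar = O.mean(axis=0)   # column means: 0.3, 0.15, 0.3
```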

Document-Covariance Descriptor (DoCoV)
Given an observation matrix O for a document, we compute the covariance matrix entries for every pair of embedding dimensions (X, Y):

Cov(X, Y) = (1/n) Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ),

where x_i and y_i are the i-th observations of dimensions X and Y, and x̄ and ȳ are their sample means.
The matrix C ∈ R^(d×d) is a symmetric matrix whose (j, k) entry is the covariance between the j-th and k-th embedding dimensions. We compute a vectorized representation of C by stacking the upper triangular part of the matrix, which produces a vector v ∈ R^(d(d+1)/2). With the off-diagonal entries scaled by √2, the Euclidean distance between vectorized matrices is equivalent to the Frobenius norm of the difference between the original covariance matrices.
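The extraction step can be sketched in NumPy as follows. The function name `docov` is our illustrative choice, and we apply the standard √2 scaling to the off-diagonal entries of the stacked upper triangle, which makes the Euclidean distance between descriptors coincide with the Frobenius distance between the underlying covariance matrices (off-diagonal entries of a symmetric matrix appear twice in the Frobenius norm but only once in the stacking):

```python
import numpy as np

def docov(O):
    """DoCoV: stacked upper triangle of the covariance matrix of an
    n x d observation matrix O (one word vector per row)."""
    C = np.cov(O, rowvar=False, bias=True)   # d x d covariance matrix
    iu = np.triu_indices(C.shape[0])         # upper-triangle indices
    v = C[iu].copy()
    # Scale off-diagonal entries by sqrt(2) so that the Euclidean
    # distance between descriptors equals the Frobenius distance
    # between the underlying covariance matrices.
    v[iu[0] != iu[1]] *= np.sqrt(2.0)
    return v                                 # length d*(d+1)/2

rng = np.random.default_rng(0)
O1 = rng.normal(size=(5, 3))   # 5-term document, 3-dim embeddings
O2 = rng.normal(size=(7, 3))   # 7-term document, same embedding space

v1, v2 = docov(O1), docov(O2)
C1 = np.cov(O1, rowvar=False, bias=True)
C2 = np.cov(O2, rowvar=False, bias=True)

# Euclidean distance of descriptors == Frobenius distance of matrices
assert np.isclose(np.linalg.norm(v1 - v2),
                  np.linalg.norm(C1 - C2, ord='fro'))
```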

Experimental Evaluation
We present an extensive comparative evaluation of unsupervised paragraph representation approaches. First, we evaluate on the unsupervised semantic textual similarity task. Next, we show a comparative evaluation on text classification benchmarks.

Effect of Word Embedding Source and Dimensionality on Classification Results
We evaluate classification performance on the IMDB movie reviews dataset using error rate as the evaluation measure. We report our results using a linear SVM classifier with the default parameters of the scikit-learn library. The IMDB movie review dataset was first proposed by Maas et al. (Maas et al., 2011) as a benchmark for sentiment analysis. The dataset consists of 100K IMDB movie reviews, each containing several sentences. The 100K reviews are divided into three subsets: 25K labelled training instances, 25K labelled test instances and 50K unlabelled training instances. Each review has one label representing its sentiment: Positive or Negative. The labels are balanced in both the training and the test set.
The objective is to show that the DoCoV descriptor can be used with different alternatives for word representation. The experiment also shows that pre-trained models give the best results, namely the word2vec model built on Google News. This alleviates the need to compute a problem-specific word embedding, which matters because in some cases there is no available data to construct one. To illustrate this, we tried different alternatives for word representation.
(1) We computed our own skip-gram models using the Gensim library. We used the training and unlabelled subsets of the IMDB dataset to obtain different embeddings by setting the number of dimensions to 100, 200 and 300.
(2) We used pre-trained GloVe models trained on Wikipedia 2014 and Gigaword 5, testing the available dimensionalities of 100, 200 and 300. We also used the 300-dimensional GloVe model trained on Common Crawl with 42 billion tokens; we call this last model Lrg.
(3) We used the pre-trained word2vec model trained on Google News, which we call Gnews. This model provides a 300-dimensional vector for each word. Table 1 shows the results when using DoCoV computed at different dimensions of word embedding for classification. The table also compares classification performance when using DoCoV against using the Mean of word embeddings as a baseline, and shows the effect of fusing DoCoV with other feature sets. We mainly experiment with the following sets: DoCoV, Mean, and bag-of-words (BOW).
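The fusion-and-classification setup can be sketched as below. The feature matrices here are random placeholders standing in for pre-computed Mean and DoCoV features of each review; the classifier is scikit-learn's LinearSVC with default parameters, matching the experimental setup described above.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_docs, d = 200, 10
k = d * (d + 1) // 2              # DoCoV length for d-dim embeddings

# Placeholder feature sets; in practice these would be computed from a
# word-embedding model (Mean of word vectors, DoCoV descriptor).
mean_feats  = rng.normal(size=(n_docs, d))
docov_feats = rng.normal(size=(n_docs, k))
labels      = rng.integers(0, 2, size=n_docs)   # Positive / Negative

# Feature fusion is plain concatenation, e.g. Mean + DoCoV.
X = np.hstack([mean_feats, docov_feats])

clf = LinearSVC().fit(X, labels)  # default scikit-learn parameters
train_acc = clf.score(X, labels)
```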

Observations
From the results we can observe the following: (1) The DoCoV descriptor consistently outperforms the Mean vector across different dimensionalities of the word embedding, regardless of the embedding source.
(2) The best performing feature concatenation is DoCoV+BOW. This indicates that the concatenation indeed benefits from both representations.
(3) In general, the best results are achieved using the available 300-dimensional Gnews word embedding. In the subsequent experiments we use that embedding, so that we do not need to build a different word embedding for every task at hand.

Unsupervised Semantic Textual Similarity
We conduct a comparative evaluation against state-of-the-art approaches for unsupervised paragraph representation, following the setup used in (Hill et al., 2016).

Datasets and Baselines
We contrast our results against the methods reported in (Hill et al., 2016). The competing methods are paragraph vectors (Le and Mikolov, 2014), skip-thought vectors (Kiros et al., 2015), FastSent (Hill et al., 2016) and Sequential (Denoising) Autoencoders (SDAE) (Hill et al., 2016). We also implement the Mean vector baseline, as well as a variant that uses the sum of the similarities generated by the DoCoV and Mean vectors. All of our results are reported using the freely available Gnews word2vec with dim = 300. We use the same evaluation measures as (Hill et al., 2016): Pearson and Spearman correlation with the manual relatedness judgements.
Two semantic sentence relatedness datasets are used in the comparative evaluation: the SICK dataset (Marelli et al., 2014), which consists of 10,000 pairs of sentences with relatedness judgements, and the STS 2014 dataset (Agirre et al., 2014), which consists of 3,750 pairs and ratings from six linguistic domains.
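The combined DoCoV+Mean similarity score for a sentence pair can be sketched as follows; the descriptors here are random placeholders standing in for real Mean and DoCoV vectors computed from the Gnews embeddings, and cosine similarity is our assumed pairwise measure.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder descriptors for one sentence pair.
rng = np.random.default_rng(1)
mean_a, mean_b   = rng.normal(size=5),  rng.normal(size=5)
docov_a, docov_b = rng.normal(size=15), rng.normal(size=15)

# Combined relatedness score: sum of the two cosine similarities,
# as in the DoCoV+Mean variant.
score = cosine(docov_a, docov_b) + cosine(mean_a, mean_b)
```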

Results and Discussion
Table 2 shows the correlation values between the similarities computed via DoCoV and the human judgements, contrasted with the performance of other representations.
We observe that the DoCoV representation outperforms the other representations on this task. Models such as skip-thought vectors (Kiros et al., 2015) and SDAE (Hill et al., 2016) require building an encoder-decoder model, which takes considerable time to learn. Other models, like paragraph vectors (Le and Mikolov, 2014) and FastSent vectors (Hill et al., 2016), require a gradient-descent inference step to compute the paragraph/sentence vectors. DoCoV only requires a pre-trained word embedding model; it needs no additional training of encoder-decoder models and no inference steps via gradient descent.

Text Classification Benchmarks
The datasets used in this experiment form a text-classification benchmark for sentence and paragraph classification. Toward the end of this section we can clearly identify the value of the DoCoV descriptor as a generic descriptor for text classification tasks.

Datasets and Baselines
We contrast our results against the same unsupervised paragraph representation methods. In addition to the results of DoCoV alone, we examine the concatenation of BOW with tf-idf weighting and of Mean vectors with our DoCoV descriptor. We use a linear SVM for all the tasks. All of our results are reported using the freely available Gnews word2vec with dim = 300. We use classification accuracy as the evaluation measure for this experiment, following (Hill et al., 2016). The subsets used in the comparative benchmark evaluation are: Movie Reviews MR (Pang and Lee, 2005), Subjectivity Subj (Pang and Lee, 2004), Customer Reviews CR (Hu and Liu, 2004) and TREC Question TREC (Li and Roth, 2002). Table 3 shows the results of our variants against state-of-the-art algorithms that use unsupervised paragraph representations.

Results and Discussion
We observe that DoCoV is consistently better than the Mean vector and BOW with tf-idf weights. DoCoV also improves consistently when concatenated with baselines such as the Mean vector and BOW vectors, which means each feature captures different discriminating information and justifies the choice of concatenating DoCoV with other features. We further observe that DoCoV is consistently better than paragraph vectors (Le and Mikolov, 2014), FastSent and SDAE (Hill et al., 2016). Overall, DoCoV outperforms the other methods on the text classification benchmark.

Conclusion
We presented a novel descriptor that can represent text at any level: sentences, paragraphs or documents. Our representation is generic, which makes it useful for different supervised and unsupervised tasks, and its fixed length makes it easy to use with different learning algorithms. Moreover, our descriptor requires minimal training: no encoder-decoder model and no gradient-descent iterations are needed to compute it.
Empirically, we showed the effectiveness of the descriptor on different tasks, with competitive or better performance than other state-of-the-art methods in both supervised and unsupervised settings.