Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology-Based Representations

We investigate the pertinence of methods from algebraic topology for text data analysis. These methods enable the development of mathematically-principled isometric-invariant mappings from a set of vectors to a document embedding, which is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate the utility of these topology-based document representations in traditional NLP tasks, specifically document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, performance is worse than simple techniques like tf-idf, indicating that the geometry of the document does not provide enough variability for classification on the basis of topic or sentiment in the chosen datasets.


Introduction
Given an embedding model mapping words to n-dimensional vectors, every document can be represented as a finite subset of $\mathbb{R}^n$. Comparing documents then amounts to comparing such subsets. While previous work shows that the Earth Mover's Distance (Kusner et al., 2015) or the distance between weighted averages of word vectors (Arora et al., 2017) provides information that is useful for classification tasks, we wish to go a step further and investigate whether useful information can also be found in the 'shape' of a document in word embedding space.
Persistent homology is a tool from algebraic topology used to compute topological signatures (called persistence diagrams) on compact metric spaces. These have the property of being stable with respect to the Gromov-Hausdorff distance (Gromov et al., 1981). In other words, compact metric spaces that are close, up to an isometry, will have similar embeddings. In this work, we examine the utility of such embeddings in text classification tasks. To the best of our knowledge, no previous work has used topological representations for traditional NLP tasks, nor has any comparison been made with state-of-the-art approaches.
* The indicated authors contributed equally to this work.
We begin by considering a document as the set of its word vectors, generated with a pretrained word embedding model. These form the metric space on which we build persistence diagrams, using Euclidean distance as the distance measure. The diagrams are a representation of the document's geometry in the metric space. We then perform clustering on the Twenty Newsgroups dataset with the features extracted from the persistence diagram. We also evaluate the method on sentiment classification tasks, using the Cornell Sentence Polarity (CSP) (Pang and Lee, 2005) and IMDb movie review datasets (Maas et al., 2011).
As suggested by Zhu (2013), we posit that the information about the intrinsic geometry of documents, found in the persistence diagrams, might yield information that our classifier can leverage, either on its own or in combination with other representations. The primary objective of our work is to empirically evaluate these representations in the case of sentiment and topic classification, and assess their usefulness for real-world tasks.

Word embeddings
As a first step, we compute word vectors for each document in our corpus using a word2vec (Mikolov et al., 2013) model trained on the Google News dataset. In addition to being a widely used word embedding technique, word2vec is known to exhibit interesting linear properties with respect to analogies (Mikolov et al., 2013), which hints at rich semantic structure.

Gromov-Hausdorff Distance
Given a dictionary of word vectors of dimension n, we can represent any document as a finite subset of $\mathbb{R}^n$. The Hausdorff distance gives us a way to evaluate the distance between two such sets. More precisely, the Hausdorff distance $d_H$ between two finite subsets A, B of $\mathbb{R}^n$ is defined as:
$$d_H(A, B) = \max\left( \max_{a \in A} \min_{b \in B} \|a - b\|,\ \max_{b \in B} \min_{a \in A} \|a - b\| \right)$$
However, this distance is sensitive to translations and other isometric transformations. Hence, a more natural metric is the Gromov-Hausdorff distance (Gromov et al., 1981), simply defined as
$$d_{GH}(A, B) = \inf_{f \in E_n} d_H(A, f(B))$$
where $E_n$ is the set of all isometries of $\mathbb{R}^n$.
Figure 1 provides an example of practical Gromov-Hausdorff (GH) distance computation between two sets of three points each. Both sets are embedded in $\mathbb{R}^2$ (middle panel) using isometries, i.e., the distance between points in each set is preserved. The Hausdorff distance between the two embedded sets corresponds to the length of the black segment. The GH distance is the minimum Hausdorff distance under all possible isometric embeddings.
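To make the definition concrete, the following is a minimal sketch of the Hausdorff distance between two finite point sets under the Euclidean norm. The function names and the toy point sets are illustrative assumptions, not part of the experimental pipeline. It does not compute the GH distance, which would additionally require minimizing over all isometries.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest neighbour in B."""
    return max(min(euclidean(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

# Two toy "documents" of three 2-D points each (hypothetical data).
# B is A translated by (3, 4), so the two sets are isometric.
A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
B = [(3.0, 4.0), (4.0, 4.0), (3.0, 5.0)]

print(hausdorff(A, A))  # identical sets
print(hausdorff(A, B))  # translated copy: nonzero Hausdorff distance
```

In this toy case the Hausdorff distance between A and B is nonzero even though B is only a translated copy of A, which is exactly the sensitivity to isometries that motivates the GH distance.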
We want to compare documents based on their intrinsic geometric properties. Intuitively, the GH distance measures how far two sets are from being isometric. This allows us to define the geometry of a document more precisely:
Definition 1 (Document Geometry) We say that two documents A, B have the same geometry if $d_{GH}(A, B) = 0$, i.e., if they are the same up to an isometry.
Mathematically speaking, this amounts to defining the geometry of a document as its equivalence class under the equivalence relation induced by the GH distance on the set of all documents.
Figure 1: Gromov-Hausdorff distance between two sets (red, green). The black bar represents the actual distance (given that the isometric embedding is optimal).
Comparison to the Earth Mover's Distance: Kusner et al. (2015) proposed a new method for computing a distance between documents based on an instance of the Earth Mover's Distance (Rubner et al., 1998) called the Word Mover's Distance (WMD). While WMD quantifies the total cost of matching all words of one document to another, the GH distance is the cost, up to an isometry, of the worst-case matching.

Persistence diagrams
Efficiently computing the GH distance is still an open problem despite considerable recent work in this area (Mémoli and Sapiro, 2005; Bronstein et al., 2006; Mémoli, 2007; Agarwal et al., 2015).
Fortunately, Carrière et al. (2015) provide us with a way to derive a signature which is stable with respect to the GH distance. More specifically, given a finite point cloud $A \subset \mathbb{R}^n$, the persistence diagram of the Vietoris-Rips filtration on A, Dg(A), can be computed. This approach is inspired by persistent homology, a subfield of algebraic topology.
The rigorous definition of these notions is not the crux of this paper and we will only present them informally. The curious reader is invited to refer to Zhu (2013) for a short introduction. More details are in Delfinado and Edelsbrunner (1995); Edelsbrunner et al. (2002); Robins (1999).
A persistence diagram is a scatter plot of 2-D points representing the appearance and disappearance of geometric features under varying resolutions. This can be imagined as replacing each point by a sphere of increasing radius and recording the radii at which features such as connected components or loops appear and disappear.
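As a rough illustration of the "growing spheres" picture, the sketch below computes only the 0-dimensional part of a persistence diagram (connected components) for a finite point cloud. It relies on the fact that components of a Vietoris-Rips filtration merge along the edges of a minimum spanning tree. The function name is hypothetical, the scale is taken to be the edge length (some conventions use half of it), and higher-dimensional features such as loops would require a dedicated library.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """Return (birth, death) pairs for connected components of a Rips filtration.

    Every point is born at scale 0. A component dies at the length of the
    edge that merges it into another component. One component never dies,
    so its death is reported as infinity.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (Kruskal's algorithm).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    diagram = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j          # merge the two components
            diagram.append((0.0, length))    # one component dies here
    diagram.append((0.0, math.inf))          # the surviving component
    return diagram

# Hypothetical toy point cloud: two nearby points and one far away.
cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(h0_persistence(cloud))
```

For a cloud with very few points, the diagram contains only a handful of pairs, which gives some intuition for why short documents yield sparse signatures.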
We use the procedure described in Carrière et al. (2015) to derive a vector signature from each persistence diagram. For two point clouds A and B, the resulting signatures $V_A$ and $V_B$ are stable with respect to the GH distance. The size of the vectors depends on the underlying sets A and B. However, as is argued in Carrière et al. (2015), we can truncate the vectors to a dimension fixed across our dataset while preserving the stability property (albeit losing some of the representative ability of the signatures).
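The following sketch shows one way to turn a persistence diagram into a fixed-size vector. It follows our reading of the construction in Carrière et al. (2015): for every pair of diagram points, take the minimum of their $\ell_\infty$ distance and their distances to the diagonal, sort these values in decreasing order, then truncate or zero-pad to a fixed dimension. The function name, the handling of points with infinite death (assumed to be removed beforehand), and the example values are assumptions made for illustration.

```python
from itertools import combinations

def diagram_signature(diagram, dim):
    """Map a list of finite (birth, death) pairs to a vector of length `dim`."""

    def to_diagonal(p):
        # l-infinity distance from a diagram point to the diagonal birth == death.
        return (p[1] - p[0]) / 2.0

    def linf(p, q):
        # l-infinity distance between two diagram points.
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    values = [
        min(linf(p, q), to_diagonal(p), to_diagonal(q))
        for p, q in combinations(diagram, 2)
    ]
    values.sort(reverse=True)                     # largest values first
    values = values[:dim]                         # truncate to the fixed size
    return values + [0.0] * (dim - len(values))   # zero-pad short vectors

# Hypothetical diagram with three finite points.
toy_diagram = [(0.0, 1.0), (0.0, 4.0), (0.0, 2.0)]
print(diagram_signature(toy_diagram, 5))
print(diagram_signature(toy_diagram, 2))
```

Truncation keeps the largest entries, which correspond to the most persistent features, so the fixed dimension trades representational detail for a uniform input size.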

Experiments
The pipeline for our experiments is shown in Figure 2. In order to build a persistence diagram, we convert each document to the set of its word vectors. We then use Dionysus (Morozov, 2008-2016), a C++ library for computing persistence diagrams, and form the signatures described in Section 2.3. We will subsequently refer to these diagrams as Persistent Homology (PH) embeddings. Once we have the embeddings for each document, they can be used as input to standard clustering or classification algorithms. As a baseline document representation, we use the average of the word vectors for that document (subsequently called AW2V embeddings).
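The AW2V baseline can be sketched as follows. The toy embedding table, the function name, and the choice to skip out-of-vocabulary tokens (returning a zero vector when no token is known) are assumptions made for illustration; the text above only specifies that the word vectors of a document are averaged.

```python
def average_word_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens found in the embedding table.

    tokens:     list of word strings for one document
    embeddings: dict mapping a word to a list of `dim` floats
    dim:        embedding dimensionality
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: a document with no known words maps to the zero vector.
        return [0.0] * dim
    # zip(*vectors) iterates over coordinates, so each `column` holds one
    # coordinate of every word vector in the document.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 2-dimensional embedding table.
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0]}
print(average_word_vector(["good", "movie", "unknownword"], toy_embeddings, 2))
```

A combined representation such as AW2V+PH can then be obtained by concatenating this vector with the fixed-size PH signature of the same document.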
For clustering, we experiment with K-means and Gaussian Mixture Models (GMM) on a subset of the Twenty Newsgroups dataset. The subset was selected so that most documents come from related topics, which makes clustering nontrivial, and so that the documents are of reasonable length for computing the representation.
For classification, we perform both sentence-level and document-level binary sentiment classification using logistic regression on the CSP and IMDb corpora respectively.

Hyper-parameters
Our method depends on very few hyperparameters. Our main choices are listed below.
Choice of distance: We experimented with both Euclidean distance and cosine similarity (angular distance). After preliminary experiments, we determined that both performed equally well and hence, we only report results with the Euclidean distance.
Persistence diagram computation: The hyper-parameters of the diagram computation are monotonic and mostly control the degree of approximation. We set them to the highest values that allowed our experiment to run in reasonable time.
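For reference, the two distance choices can be sketched as below. The exact form of the angular distance (here the angle between two vectors divided by pi, so that values lie in [0, 1]) is an assumption, since the text only names it; the function names are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def angular_distance(u, v):
    """Angle between two nonzero vectors, normalised to the range [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine) / math.pi

u, v, w = [1.0, 0.0], [0.0, 1.0], [2.0, 0.0]
print(euclidean_distance(u, w), angular_distance(u, w))  # same direction
print(euclidean_distance(u, v), angular_distance(u, v))  # orthogonal vectors
```

Unlike the Euclidean distance, the angular distance ignores vector length, so two words pointing in the same direction are at distance zero regardless of their norms.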

Document Clustering
We perform clustering experiments with the baseline document features (AW2V), tf-idf and our PH signatures. Figure 3 shows the B-Cubed precision, recall and F1-Score of each method (metrics as defined in Amigó et al. (2009)). To further assess the utility of PH embeddings, we concatenate them with AW2V to obtain a third representation, AW2V+PH.
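The B-Cubed metrics can be sketched as follows, based on the standard item-wise definition: the precision of an item is the fraction of items in its cluster that share its gold label, and its recall is the fraction of items with its gold label that share its cluster. Averaging over items and taking the harmonic mean gives the F1-Score. The function name and toy data are illustrative, and this quadratic-time version is meant for clarity rather than efficiency.

```python
def bcubed(clusters, labels):
    """Return (precision, recall, f1) for a clustering against gold labels.

    clusters: predicted cluster id for each item
    labels:   gold class label for each item
    """
    n = len(clusters)
    precision_sum = 0.0
    recall_sum = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_label = [j for j in range(n) if labels[j] == labels[i]]
        # Items that share both the cluster and the gold label of item i
        # (item i itself is always counted).
        correct = sum(1 for j in same_cluster if labels[j] == labels[i])
        precision_sum += correct / len(same_cluster)
        recall_sum += correct / len(same_label)
    precision = precision_sum / n
    recall = recall_sum / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: four documents, two predicted clusters, two topics.
print(bcubed([0, 0, 1, 1], ["a", "a", "a", "b"]))
```

Putting every document in a single cluster gives perfect recall but low precision, which is worth keeping in mind when a representation scores well on recall alone.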
With GMM and AW2V+PH, the F1-Score of clustering is 0.499. In terms of F1 and precision, we see that tf-idf representations perform better than PH, for reasons that we will discuss in later sections. In terms of recall, PH as well as AW2V perform fairly well. Importantly, we see that all the metrics for PH are significantly above the random baseline, indicating that some valuable information is contained in them.

Sentence-Level Sentiment Analysis
We evaluate our method on the CSP dataset. The results are presented in Table 1. For comparison, we provide results for one of the state-of-the-art models, a CNN-based sentence classifier (Kim, 2014). We observe that, by themselves, PH embeddings are not useful for predicting the sentiment of each sentence. AW2V gives reasonable performance on this task, but combining the two representations does not affect the accuracy at all.
Table 2: Performance on the IMDb dataset

Document-Level Sentiment Analysis
We perform document-level binary sentiment classification on the IMDb Movie Reviews Dataset (Maas et al., 2011). We use sentence vectors in this experiment, each of which is the average of the word vectors in that sentence. The results are presented in Table 2. We compare our results with the paragraph-vector approach (Le and Mikolov, 2014). We observe that PH embeddings perform poorly on this dataset. As with the CSP dataset, AW2V embeddings give acceptable results. The combined representation performs slightly better, but not by a significant margin.

Discussion and Analysis
As seen in Figure 3, the PH representation does not outperform tf-idf or AW2V, and in fact often does not perform much better than chance.
One possible reason is linked to the nature of our datasets: the computation of the persistence diagram is very sensitive to the size of the documents. The geometry of small documents, where the number of words is negligible with respect to the dimensionality of the word vectors, is not very rich. The resulting topological signatures are very sparse, which is a problem for CSP as well as documents in IMDb and Twenty Newsgroups that contain only one line. On the opposite side of the spectrum, persistence diagrams are intractable to compute without down-sampling for very long documents (which in turn negatively impacts the representation of smaller documents).
We performed an additional experiment on a subset of the IMDb corpus that only contained documents of reasonable length, but obtained similar results. This indicates that the poor performance of PH representations, even when combined with other features (AW2V), cannot be explained only by limitations of the data.
These observations lead to the conclusion that, for these datasets, the intrinsic geometry of documents in the word2vec semantic space does not help text classification tasks.

Related Work
Learning distributed representations of sentences or documents for downstream classification and information retrieval tasks has received recent attention owing to their utility in several applications, be it representations trained at the sentence/paragraph level (Le and Mikolov, 2014; Kiros et al., 2015) or purely word-vector-based methods (Arora et al., 2017). Document classification and clustering (Willett, 1988; Hotho et al., 2005; Steinbach et al., 2000; Huang, 2008; Xu and Gong, 2004; Kuang et al., 2015; Miller et al., 2016) and sentiment classification (Nakagawa et al., 2010; Kim, 2014; Wang and Manning, 2012) are relatively well studied.
Topological data analysis has been used for various tasks such as 3D shape classification (Chazal et al., 2009) or protein structure analysis (Xia and Wei, 2014). However, such techniques have not been used in NLP, primarily because the theory is inaccessible and suitable applications are scarce. Zhu (2013) offers an introduction to using persistent homology in NLP by creating representations of nursery rhymes and novels, and highlights structural differences between child and adolescent writing. However, these techniques have not been applied to core NLP tasks.

Conclusion
Based on our experiments, using persistence diagrams for text representation does not seem to positively contribute to document clustering and sentiment classification tasks. There are certainly merits to the method, specifically its strong mathematical foundation and its domain-independent, unsupervised nature. Theoretically, algebraic topology has the ability to capture structural context, and this could potentially benefit syntax-based NLP tasks such as parsing. We plan to investigate this connection in the future.