Representing Sentences as Low-Rank Subspaces

Sentences are important semantic units of natural language. A generic, distributional representation of sentences that can capture their latent semantics is beneficial to multiple downstream applications. We observe a simple geometry of sentences: the word representations of a given sentence (on average 10.23 words in all SemEval datasets, with a standard deviation of 4.84) roughly lie in a low-rank subspace (roughly, rank 4). Motivated by this observation, we represent a sentence by the low-rank subspace spanned by its word vectors. Such an unsupervised representation is empirically validated via semantic textual similarity tasks on 19 different datasets, where it outperforms sophisticated neural network models, including skip-thought vectors, by 15% on average.


Introduction
Real-valued word representations have brought a fresh approach to classical problems in NLP, recognized for their ability to capture linguistic regularities: similar words tend to have similar representations; similar word pairs tend to have similar difference vectors (Bengio et al., 2003; Mnih and Hinton, 2007; Mikolov et al., 2010; Collobert et al., 2011; Huang et al., 2012; Dhillon et al., 2012; Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014; Arora et al., 2015; Stratos et al., 2015). Going beyond words, sentences capture much of the semantic information. Given the success of lexical representations, a natural question of great topical interest is how to extend the power of distributional representations to sentences.
There are currently two approaches to representing sentences. A sentence contains rich syntactic information and can be modeled through sophisticated neural networks (e.g., convolutional neural networks (Kim, 2014; Kalchbrenner et al., 2014), recurrent neural networks (Sutskever et al., 2014; Le and Mikolov, 2014; Hill et al., 2016), and recursive neural networks (Socher et al., 2013)). Another simple and common approach ignores the latent structure of sentences: a prototypical approach is to represent a sentence by summing or averaging the vectors of the words in that sentence (Wieting et al., 2015; Adi et al., 2016; Kenter et al., 2016). Recently, Wieting et al. (2015) and Adi et al. (2016) revealed that even though the latter approach ignores all syntactic information, it is simple, straightforward, and remarkably robust at capturing sentential semantics. Such an approach successfully outperforms the neural network based approaches on textual similarity tasks in both supervised and unsupervised settings.
We follow the latter approach but depart from representing sentences in a vector space as in these prior works; we present a novel Grassmannian property of sentences. The geometry is motivated by (Gong et al., 2017; Mu et al., 2016), where an interesting phenomenon is observed: the local context of a given word/phrase can be well represented by a low-rank subspace. We propose to generalize this observation to sentences: not only do the word vectors in a snippet of a sentence (i.e., a context for a given word, defined as several words surrounding it) lie in a low-rank subspace, but the entire sentence (on average 10.23 words in all SemEval datasets, with standard deviation 4.84) follows this geometric property as well:

Geometry of Sentences: The word representations of all words in a target sentence roughly lie in a low-rank subspace (rank 3-5).
The observation indicates that the subspace contains most of the information about the sentence, and it therefore motivates a sentence representation method: sentences should be represented in the space of subspaces (i.e., on the Grassmannian manifold) instead of in a vector space. Formally:

Sentence Representation: A sentence can be represented by the low-rank subspace spanned by its word representations.
Analogous to word representations of similar words being similar vectors, the principle of sentence representations is: similar sentences should have similar subspaces. Two questions arise: (a) how to define the similarity between sentences and (b) how to define the similarity between subspaces.
The first question has already been addressed by the popular semantic textual similarity (STS) tasks. Unlike textual entailment (which aims at inferring a directional relation between two sentences) and paraphrasing (which is a binary classification problem), STS provides a unified framework for measuring the degree of semantic equivalence (Agirre et al., 2012, 2013, 2014, 2015) in a continuous fashion. Motivated by the cosine similarity between vectors being a good index for word similarity, we generalize this metric to subspaces: the similarity between subspaces defined in this paper is the ℓ2-norm of the singular values between two subspaces; note that the singular values are in fact the cosines of the principal angles. The key justification for our approach comes from empirical results that outperform the state of the art in some cases and are comparable in others. In summary, representing sentences by subspaces outperforms representing sentences by averaged word vectors (by 14% on average) and sophisticated neural networks (by 15%) on 19 different STS datasets, ranging over different domains (News, WordNet definitions, and Twitter).

Geometry of Sentences
Assembling successful distributional word representations (for example, GloVe (Pennington et al., 2014)) into sentence representations is an active research topic. Different from previous studies (for example, doc2vec (Le and Mikolov, 2014), skip-thought vectors (Kiros et al., 2015), and Siamese CBOW (Kenter et al., 2016)), our main contribution is to represent sentences using non-vector space representations: a sentence can be well represented by the subspace spanned by its word vectors; such a method naturally builds on any word representation method. Due to the widespread use of word2vec and GloVe, we use their publicly available word representations, word2vec (Mikolov et al., 2013) trained on Google News and GloVe (Pennington et al., 2014) trained on Common Crawl, to test our observations.
Observation Let v(w) ∈ R^d be the d-dimensional word representation for a given word w ∈ V, and let s = (w_1, ..., w_n) be a given sentence. Consider the following sentence, where n = 32: They would not tell me if there was any pension left here, and would only tell me if there was (and how much there was) if they saw I was entitled to it.
After stacking the (non-function) word vectors v(w) to form a d × n matrix, we observe that most of the energy (80% for GloVe and 72% for word2vec) of this matrix is contained in a rank-N subspace, where N is much smaller than n (for comparison, we choose N = 4, so N/n ≈ 13%). Figure 1 provides a visual representation of this geometric phenomenon: we project the d-dimensional word representations into 3-dimensional vectors, use these 3-dimensional word vectors to obtain the subspace for this sentence (setting N = 2 here for visualization), and plot the subspaces as 2-dimensional planes.

The example above generalizes to a vast majority of the sentences: the word representations of a given sentence roughly reside in a low-rank subspace, which can be extracted by principal component analysis (PCA).
Verification We empirically validate this geometric phenomenon by collecting 53,396 sentences from the SemEval STS shared tasks (Agirre et al., 2012, 2013, 2014, 2015) and plotting the fraction of energy captured by the top-N components of PCA in Figure 2 for N = 3, 4, 5. On average, 70% of the energy is captured by a rank-3 subspace, 80% by a rank-4 subspace, and 90% by a rank-5 subspace. For comparison, the fraction of energy of random sentences (generated i.i.d. from the unigram distribution) is also plotted in Figure 2.

Representation The observation above motivates our sentence representation algorithm: since the words in a sentence concentrate on a few directions, the subspace spanned by these directions could in principle be a proper representation for this sentence. The directions and subspace in turn can be extracted via PCA, as in Algorithm 1.
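As a minimal sketch of the energy computation above (assuming, since the text does not specify, that energy is measured by the squared singular values of the uncentered stacked matrix; function and variable names are our own):

```python
import numpy as np

def energy_fraction(word_vectors, N=4):
    """Fraction of the total energy of a sentence's stacked word
    vectors (a d x n matrix, one column per word) captured by its
    top-N principal components."""
    X = np.asarray(word_vectors, dtype=float)
    # Total energy is the squared Frobenius norm, i.e., the sum of all
    # squared singular values; the top-N components capture the sum of
    # the N largest squared singular values.
    s = np.linalg.svd(X, compute_uv=False)
    return np.sum(s[:N] ** 2) / np.sum(s ** 2)

# Sanity check: synthetic "word vectors" that lie (up to small noise)
# in a rank-4 subspace of R^300 should have top-4 energy near 1.
rng = np.random.default_rng(0)
basis = rng.standard_normal((300, 4))
coeffs = rng.standard_normal((4, 10))          # a 10-word "sentence"
X = basis @ coeffs + 0.01 * rng.standard_normal((300, 10))
frac = energy_fraction(X, N=4)
```

On real sentences, the corresponding fractions are the roughly 70%/80%/90% figures reported above for ranks 3/4/5.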

Similarity Metric
The principle of sentence representations is that similar sentences should have similar representations. In this case, we expect the similarity between subspaces to be a good index for the semantic similarity of sentences. In this paper, we define the similarity between subspaces as follows: let u_1(s), ..., u_N(s) be the N orthonormal basis vectors for a sentence s. After stacking the N vectors in a d × N matrix

Algorithm 1: The algorithm for sentence representations.
Input: a sentence s, word embeddings v(·), and a PCA rank N.
1 Compute the first N principal components of the samples v(w), w ∈ s:
    u_1, ..., u_N ← PCA({v(w), w ∈ s}).
2 Form the subspace S ← {Σ_{n=1}^{N} α_n u_n : α_n ∈ R}.
Output: N orthonormal basis vectors u_1, ..., u_N and the subspace S.
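A minimal numpy sketch of Algorithm 1 (names are our own; PCA is realized here via an SVD of the stacked word-vector matrix, without mean-centering, which we assume matches the paper's usage):

```python
import numpy as np

def sentence_subspace(word_vectors, N=4):
    """Algorithm 1 sketch: represent a sentence by the rank-N subspace
    spanned by its word vectors.

    word_vectors: d x n matrix with one column per (non-function) word.
    Returns a d x N matrix U whose orthonormal columns u_1, ..., u_N
    span the subspace S.
    """
    X = np.asarray(word_vectors, dtype=float)
    # The first N principal directions are the top-N left singular
    # vectors of the stacked matrix.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :N]

# Example: a 12-word "sentence" with 300-dimensional embeddings.
rng = np.random.default_rng(1)
X = rng.standard_normal((300, 12))
U = sentence_subspace(X, N=4)
```

The returned matrix U plays the role of U(s) = (u_1(s), ..., u_N(s)) in the similarity metric.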
U(s) = (u_1(s), ..., u_N(s)), we define the corresponding cosine similarity as

CosSim(s_1, s_2) = ( (1/N) Σ_{t=1}^{N} σ_t^2 )^{1/2},

where σ_1, ..., σ_N are the singular values of U(s_1)^T U(s_2). Note that σ_t = cos(θ_t), where θ_t is the t-th "principal angle" between the two subspaces. Such a metric is naturally related to the cosine similarity between vectors, which has been empirically validated to be a good measure of word similarity.
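This metric, the ℓ2-norm of the singular values of U(s_1)^T U(s_2) normalized by √N so that identical subspaces score 1, can be sketched as follows (names are our own):

```python
import numpy as np

def subspace_cossim(U1, U2):
    """Cosine similarity between two rank-N subspaces, each given by a
    d x N matrix with orthonormal columns."""
    N = U1.shape[1]
    # The singular values sigma_t of U1^T U2 are the cosines of the
    # principal angles between the two subspaces.
    sigma = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.sqrt(np.sum(sigma ** 2) / N))

# Identical subspaces score 1; orthogonal subspaces score 0.
I = np.eye(6)
same = subspace_cossim(I[:, :2], I[:, :2])
orth = subspace_cossim(I[:, :2], I[:, 2:4])
```

Because the score depends only on the principal angles, it is invariant to the choice of orthonormal basis within each subspace.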

Experiments
In this section we evaluate our sentence representations empirically on the STS tasks. The objective of this task is to test the degree to which the algorithm can capture the semantic equivalence between two sentences. For example, the similarity between "a kid jumping a ledge with a bike" and "a black and white cat playing with a blanket" is 0 and the similarity between "a cat standing on tree branches" and "a black and white cat is high up on tree branches" is 3.6. The algorithm is then evaluated in terms of Pearson's correlation between the predicted score and the human judgment.
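As an illustration of the evaluation protocol (the numbers below are made up, not drawn from any STS dataset), Pearson's correlation between predicted scores and gold judgments can be computed as:

```python
import numpy as np

# Hypothetical predicted similarities (on a 0-1 scale) and gold human
# judgments (on the 0-5 STS scale) for five sentence pairs.
predicted = np.array([0.10, 0.85, 0.40, 0.95, 0.30])
gold = np.array([0.0, 3.6, 1.5, 4.2, 1.0])

# Pearson's correlation is invariant to affine rescaling, so the
# predictions need not share the judgments' 0-5 scale.
r = np.corrcoef(predicted, gold)[0, 1]
```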

Baselines and Preliminaries
Our main comparisons are with algorithms that perform unsupervised sentence representation: the average of word representations (avg., of GloVe and skip-gram, where we use the average of the word vectors), doc2vec (D2V) (Le and Mikolov, 2014), skip-thought vectors (ST) (Kiros et al., 2015), and Siamese CBOW (SC) (Kenter et al., 2016). To enable a fair comparison, we use the Toronto Book Corpus to train the word embeddings. In our experiments, we adopt the same setting as in (Kenter et al., 2016), where we use skip-gram (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) to train 300-dimensional word vectors for the words that occur 5 times or more in the training corpus. The rank of the subspaces is set to 4 for both word2vec and GloVe.

Results
The detailed results are reported in Table 1, from which we can observe two phenomena: (a) representing sentences by their averaged word vectors provides a strong baseline, and the performance is remarkably stable across different datasets; (b) our subspace-based method outperforms the average-based method by 14% on average and the neural network based approaches by 15%. This suggests that representing sentences by subspaces maintains more information than simply taking the average, and is more robust than highly-tuned sophisticated models. When we average over the words, the average vector is biased by the many irrelevant words (for example, function words) in a given sentence. Therefore, in a longer sentence, the effect of the useful word vectors becomes smaller, and the average vector is less reliable at capturing the semantics. The subspace representation, on the other hand, is immune to this phenomenon: the word vectors capturing the semantics of the sentence tend to concentrate on a few directions, which dominate the subspace representation.

Conclusion
This paper presents a novel unsupervised sentence representation leveraging the Grassmannian geometry of word representations. While the current approach relies on pre-trained word representations, jointly learning word and sentence representations, possibly in conjunction with supervised datasets such as the Paraphrase Database (PPDB) (Ganitkevitch et al., 2013), is left to future research. Also of interest is the exploration of neural network architectures that operate on subspaces (as opposed to vectors), allowing for downstream evaluations of our novel representation.