EigenSent: Spectral sentence embeddings using higher-order Dynamic Mode Decomposition

Distributed representations of words, or word embeddings, have motivated methods for calculating semantic representations of word sequences such as phrases, sentences and paragraphs. Most of the existing methods to do so either use algorithms to learn such representations, or improve on calculating weighted averages of the word vectors. In this work, we experiment with spectral methods of signal representation and summarization as mechanisms for constructing such word-sequence embeddings in an unsupervised fashion. In particular, we explore an algorithm rooted in fluid dynamics, known as higher-order Dynamic Mode Decomposition, which is designed to capture the eigenfrequencies, and hence the fundamental transition dynamics, of periodic and quasi-periodic systems. It is empirically observed that this approach, which we call EigenSent, can summarize transitions in a sequence of words and generate an embedding that can represent the sequence itself well. To the best of the authors' knowledge, this is the first application of a spectral decomposition and signal summarization technique on text to create sentence embeddings. We test the efficacy of this algorithm in creating sentence embeddings on three public datasets, where it performs appreciably well. Moreover, it is shown that, due to the positive combination of their complementary properties, concatenating the embeddings generated by EigenSent with simple word vector averaging achieves state-of-the-art results.


Relevant concepts
Word embeddings are dense vectors that capture the semantic and contextual information of a word, and are ubiquitous in natural language processing tasks across many domains (Camacho-Collados and Pilehvar, 2018). Several different algorithms and models for constructing these embeddings have been proposed and evaluated in literature (Perone et al., 2018).
A natural next step is to extend the notion of word embeddings to the level of a sentence (or paragraph, or document). Such representations are known as sentence embeddings, often interchangeably used with the terms paragraph embeddings or document embeddings, and should, ideally, capture the meaning of a sentence (Le and Mikolov, 2014).
More recently, the concept of universal sentence embeddings has gained traction, as they leverage models trained on large text corpora in a way which is task-agnostic. These pre-trained models can then be used in a wide array of downstream tasks, often performing better in those tasks when little training data is available (Subramanian et al., 2018).

A brief review of literature
Word embedding methods vary from complex neural language models (Bengio et al., 2003) and semi-supervised approaches (Turian et al., 2010), to simpler and faster methods such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) that can be trained on much larger volumes of data, and, more recently, deep contextual models such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019).
In order to learn high-quality word embeddings, a method must capture the contextually-relevant semantic meaning of a word. This is often done by training a language model on a dataset; an example is the method known as Embeddings from Language Models, or ELMo (Peters et al., 2018), which uses representations from the internal layers of a bi-directional LSTM that is trained with a language model objective. Very recently, Devlin et al. (2019) introduced another generalizable language model, named BERT, or Bidirectional Encoder Representations from Transformers, consisting of layers of transformers (Vaswani et al., 2017) with bi-directional self-attention, which delivered state-of-the-art results in numerous benchmarks.
Similar to word embeddings, there has been a substantial amount of research in recent years on constructing sentence embeddings. Several self-supervised approaches have been proposed, such as the extension of the Word2Vec model to include sentences and learn their representations (Le and Mikolov, 2014), and encoder-decoder approaches that try to reconstruct the surrounding sentences of an encoded passage (Kiros et al., 2015). Recently, bi-directional LSTM models were trained in a strongly supervised fashion on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) by Conneau et al. (2017). This method, known as InferSent, has produced state-of-the-art results, as a universal sentence encoder, on various other downstream tasks.
Aside from these state-of-the-art approaches which require some sort of model training, a common approach to embed sentences is to simply compute the dimension-wise arithmetic mean of the embeddings of the words in a particular sentence. Further improvements were made by Arora et al. (2017), using weighted averages of the word embeddings modified via singular-value decomposition, and, recently, by power-mean embeddings (Rücklé et al., 2018). These methods have narrowed the performance deficit to other complex sentence embedding methods such as InferSent.
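As an illustration of the power-mean idea, the following numpy sketch concatenates a few power means of toy word vectors. The chosen powers and the toy data are illustrative assumptions, not the exact configuration of Rücklé et al. (2018):

```python
import numpy as np

def power_mean_embedding(word_vectors, powers=(1, float("inf"), float("-inf"))):
    """Concatenate power means of the word vectors in a sentence.

    p = 1 is the arithmetic mean, p = +inf the element-wise maximum and
    p = -inf the element-wise minimum; the power set here is an
    illustrative choice."""
    W = np.asarray(word_vectors, dtype=float)   # shape (N, m)
    parts = []
    for p in powers:
        if p == float("inf"):
            parts.append(W.max(axis=0))
        elif p == float("-inf"):
            parts.append(W.min(axis=0))
        else:
            parts.append(np.mean(W ** p, axis=0) ** (1.0 / p))
    return np.concatenate(parts)                # shape (m * len(powers),)

# toy "sentence" of three 4-dimensional word vectors
sent = [[1.0, 2.0, 3.0, 4.0],
        [2.0, 2.0, 1.0, 0.5],
        [3.0, 2.0, 2.0, 1.5]]
emb = power_mean_embedding(sent)
```

The resulting embedding is the mean, maximum and minimum vectors laid end to end, so its dimensionality grows linearly with the number of powers used.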

Hypothesis and contribution of this work
As a complement to most of the aforementioned work, in this paper, we aim at utilizing spectral methods in order to construct sentence embeddings. Spectral analysis is widely used in signal processing to decompose a signal into its component frequencies, thereby revealing the important dynamics that make up the signal and summarizing the transitions in it. The hypothesis of this paper is: if we could use similar techniques on sentences, which are also composed of meaningful transitions (between words), as we do with signals, then it should be possible to capture the important transitional dynamics that make up the respective sentences. The first step towards our goal is to represent a sentence as a signal, which has some meaningful transitional properties that we can capture. In order to do this, we rely on the word embeddings.
A key observation which motivates the use of word embeddings to represent a sentence as a signal is the fascinating property of word vectors to approximately obey the laws of algebra, as they seem to capture word relationships and analogies. The original paper by Mikolov et al. (2013) presented an example wherein vector("King") - vector("Man") + vector("Woman") results in a vector that is most similar to the representation for the word Queen. Following this observation, we posit that using spectral techniques, it should be possible to capture the dynamic properties of a sentence by treating it as a multi-dimensional signal over time, where the vector representation of each word in the sentence is a single point in the signal.
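The analogy arithmetic can be demonstrated with a contrived toy vocabulary; the vectors below are hand-made so that the relation holds exactly, whereas real learned embeddings satisfy it only approximately:

```python
import numpy as np

# Contrived 3-d toy vectors (axes roughly: gender, royalty, person),
# purely to illustrate the analogy arithmetic; these are NOT learned
# embeddings.
vec = {
    "king":  np.array([ 1.0, 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0, 1.0]),
    "man":   np.array([ 1.0, 0.0, 1.0]),
    "woman": np.array([-1.0, 0.0, 1.0]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = vec["king"] - vec["man"] + vec["woman"]
# nearest word (excluding the query terms) should be "queen"
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cos(query, vec[w]))
```

With real pre-trained vectors the same nearest-neighbour search over the full vocabulary recovers the analogy only approximately, which is exactly the "approximately obey the laws of algebra" property exploited here.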
The major innovation introduced in this paper is the use of the higher-order Dynamic Mode Decomposition (HODMD) (Le Clainche and Vega, 2017) algorithm to exploit the temporal dynamics in a sequence of word vectors, in order to construct sentence embeddings. HODMD is an efficient extension of the basic Dynamic Mode Decomposition (DMD) algorithm, which has been widely used in fluid dynamics in order to capture the fundamental frequencies of complex fluid flows (Schmid, 2010). We compare the generated sentence embeddings using the said method against state-of-the-art methods such as BERT (Devlin et al., 2019), ELMo (Peters et al., 2018) and p-means (Rücklé et al., 2018), followed by comparisons with Discrete Cosine Transform (DCT) (Ahmed et al., 1974), which is a spectral method that has been used widely for data compression, and Principal Components Analysis (PCA) (Hotelling, 1933), by extensive experiments on three public datasets. We also show that by concatenating the embeddings generated by HODMD, which captures the dynamics of a sequence, with a method such as word vector averaging, which grasps the notion of scale, we can further improve the resultant embeddings for downstream tasks.

Paper structure
Having introduced the key concepts and motivation of this work in Section 1, we proceed by describing the higher-order Dynamic Mode Decomposition algorithm and the other relevant benchmark methods in Section 2. Section 3 outlines the datasets and the software implementations used in this paper, followed by the experimental procedures stated in Section 4.1 and the results analyzed in Section 4.2. Finally, in Section 5 we summarize the important conclusions of this paper.

Methodology
This section introduces the relevant concepts pertaining to higher-order Dynamic Mode Decomposition, the algorithm used in this work to construct sentence embeddings, as well as the competing methods against which it is tested.

Higher-order Dynamic Mode Decomposition and motivation for use
Let S_1^N = [w_1, ..., w_N] be a sentence composed of words w_1, ..., w_N, where w_i is the i-th word in the sentence. A word, w_i, can be replaced by a pre-trained word embedding, such as the one provided by Mikolov et al., or one learned on the experiment dataset itself, using any of the methods mentioned in Section 1.2. Then the sentence can be written as a multidimensional signal as follows:

S_1^N = [w_1, w_2, ..., w_N] ∈ R^(m×N)   (1)

where m is the dimension of the word vectors and N is the number of words in a sentence.
In order to apply standard DMD (Schmid, 2010) to such a signal, the first-order Koopman assumption is employed:

w_k = A w_(k-1),   k = 2, ..., N   (2)

where A is an m × m square matrix. Thus, the assumption is that each sentence has words in it which lie in a constant subspace generated by A (Brunton et al., 2016), or that the words in a sentence transition from one to another smoothly, transformed by the constant operator A. This assumption can be seen as an extension of the observation that word vectors seem to approximately obey the laws of simple algebra. The operator A then captures the overall transition dynamics of the sentence, and summarizing A leads to the construction of the desired sentence embedding.
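Under this assumption, the operator A can be estimated from consecutive snapshot pairs by least squares, A ≈ S_2^N (S_1^(N-1))^+. A minimal numpy sketch, where the toy "sentence" and its dimensions are illustrative assumptions generated to satisfy the first-order assumption exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sentence: N = 6 word vectors of dimension m = 4, built so that
# w_k = A_true @ w_{k-1} holds exactly, i.e. the first-order Koopman
# assumption is satisfied by construction.
m, N = 4, 6
A_true = rng.standard_normal((m, m)) * 0.5
words = [rng.standard_normal(m)]
for _ in range(N - 1):
    words.append(A_true @ words[-1])
S = np.column_stack(words)                 # the sentence as an m x N signal

# Least-squares estimate of the transition operator via the pseudoinverse:
#   A_est = S[:, 2:N] @ pinv(S[:, 1:N-1])
A_est = S[:, 1:] @ np.linalg.pinv(S[:, :-1])
```

Because the system S[:, 1:] = A S[:, :-1] is consistent by construction, the pseudoinverse solution reproduces every transition exactly; for real sentences the fit is only approximate, which is what the dynamic modes summarize.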
The first-order Koopman assumption, although a good starting point, constrains a snapshot of a system, i.e., a word in a sentence in our case, to transition solely from the previous one. To further relax this constraint in an attempt to make our assumption more realistic, we look towards the work of Le Clainche and Vega (2017), who propose a higher-order Koopman assumption:

w_(k+d-1) = A_1 w_(k-1) + A_2 w_k + ... + A_d w_(k+d-2),   k = 2, ..., N-d+1   (3)

where d can be understood as the order parameter.
This may also be written in a form similar to equation 2:

w̃_k = Ã w̃_(k-1),   k = 2, ..., N-d+1   (4)

where w̃_k = [w_k; w_(k+1); ...; w_(k+d-1)] stacks d consecutive word vectors and Ã is the (m·d) × (m·d) block-companion matrix

Ã = [ 0    I    0   ...  0
      0    0    I   ...  0
      ...
      A_1  A_2  A_3 ...  A_d ]   (5)

with I being an m × m identity matrix. Furthermore, the modified sentence matrix can be written as:

S̃_1^(N-d+1) = [w̃_1, w̃_2, ..., w̃_(N-d+1)]   (6)

With this relaxation, a particular word in a sentence is not only related to the preceding word, but to a number of preceding words in a window of size d, which is tunable; d = 1 falls back to the first-order case. This more realistic relaxation of the original DMD algorithm is what motivates us to use HODMD to capture the transition dynamics in a sentence when constructing sentence embeddings.
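The construction of the modified sentence matrix can be sketched in a few lines of numpy; the helper name and the toy matrix below are our own illustrative choices:

```python
import numpy as np

def augmented_snapshots(S, d):
    """Stack d consecutive word vectors into each column, turning the
    m x N sentence matrix S into the (m*d) x (N-d+1) modified sentence
    matrix used by the higher-order Koopman assumption."""
    m, N = S.shape
    cols = [S[:, k:k + d].reshape(-1, order="F") for k in range(N - d + 1)]
    return np.column_stack(cols)

S = np.arange(12.0).reshape(3, 4)    # toy signal: m = 3, N = 4
S_tilde = augmented_snapshots(S, d=2)
# S_tilde has shape (6, 3): columns [w1;w2], [w2;w3], [w3;w4]
```

Each column of the augmented matrix is a sliding window of d words, so consecutive columns overlap in d-1 words, which is exactly what lets the companion operator relate a word to a whole window of predecessors.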

Generating sentence embeddings
The starting point is the following, which is a matrix form of equation 4:

S̃_2^(N-d+1) = Ã S̃_1^(N-d)   (7)

Performing SVD on S̃_1^(N-d) gives us:

S̃_1^(N-d) = U Σ T^T   (8)

where Σ is the diagonal matrix containing the SVD singular values, sorted in decreasing order, while the columns in U and T are the spatial and temporal SVD modes. Using equations 7 and 8, we can derive:

Ã = S̃_2^(N-d+1) T Σ^(-1) U^T   (9)

as T^T = T^(-1) and U^T = U^(-1). Now that we have characterized the higher-order Koopman operator, Ã, using equation 9, the dynamic modes and mode amplitudes can simply be calculated by obtaining its eigenvalues and eigenvectors using any eigendecomposition technique. Since the dynamic modes (or eigenvectors) corresponding to the largest dynamic mode amplitudes (or eigenvalues) capture the largest-scale dynamics present in the sequence of words, the top-K modes, as sorted by the mode amplitudes, are concatenated, to be used as the sentence embedding for the corresponding sentence.

Algorithm 1: HODMD algorithm for constructing a sentence embedding, EigenSent
Data: sequence of word vectors in a sentence, S_1^N = [w_1, w_2, ..., w_N]; order parameter d; number of dynamic modes to choose, n
Result: sentence embedding
1. Construct the modified sentence matrix S̃_1^(N-d+1) (equation 6).
2. Compute the SVD S̃_1^(N-d) = U Σ T^T (equation 8).
3. Obtain the higher-order Koopman operator Ã = S̃_2^(N-d+1) T Σ^(-1) U^T (equation 9).
4. Eigendecompose Ã into EigVec and EigVal, where EigVec contains the eigenvectors, one per column, sorted by the magnitude of the eigenvalues in EigVal.
5. Sentence embedding: the concatenation of the first n columns of EigVec.
The overall process is depicted in Algorithm 1. For a chosen order d, the size of the sentence embedding is m × d per retained mode.
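A compact numpy sketch of Algorithm 1 follows. This is our own re-implementation for illustration, not the authors' PyDMD-based code, and the toy sentence signal is random:

```python
import numpy as np

def eigensent(S, d=2, n=1):
    """Sketch of Algorithm 1: EigenSent embedding of a sentence signal.

    S : m x N matrix of word vectors (one column per word)
    d : order parameter (window of preceding words)
    n : number of dynamic modes (eigenvectors) to retain
    Returns a real vector of size m * d * n.
    """
    m, N = S.shape
    # 1. modified (augmented) sentence matrix, equation 6
    St = np.column_stack([S[:, k:k + d].reshape(-1, order="F")
                          for k in range(N - d + 1)])
    X, Y = St[:, :-1], St[:, 1:]
    # 2. SVD of the "past" snapshots, equation 8
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # 3. higher-order Koopman operator, equation 9
    A = Y @ Vt.T @ np.diag(1.0 / s) @ U.T
    # 4. eigendecomposition; sort modes by eigenvalue magnitude
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-np.abs(eigvals))
    top = eigvecs[:, order[:n]]
    # 5. concatenate the top-n dynamic modes (real part as the embedding)
    return np.real(top).reshape(-1, order="F")

rng = np.random.default_rng(1)
S = rng.standard_normal((5, 8))      # toy sentence: 8 words, 5-d vectors
emb = eigensent(S, d=2, n=1)         # embedding of size m * d = 10
```

For well-conditioned inputs this mirrors the derivation above; a production implementation would additionally truncate near-zero singular values before inverting Σ.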

State-of-the-art
We compare our method, as explained in Algorithm 1, to three recent state-of-the-art methods.
The first one, p-means (Rücklé et al., 2018), is a method that concatenates different types of means, known as power means (Hardy et al., 1952), of the word embeddings in a sentence. The hypothesis of Rücklé et al. (2018) is that the average of word vectors is only one of many summary statistics available, and that several others might add useful information when constructing sentence embeddings.
The second method, ELMo (Peters et al., 2018), trains a bi-directional LSTM, using word-level and sub-word-level features, with a language model objective on a large dataset, and then uses the representations of words from its internal layers to provide rich and contextual word embeddings. A pre-trained model, trained on the One Billion Words benchmark (Chelba et al., 2013), was used for our experiments, and an averaging bag-of-words scheme was employed to produce the sentence embeddings based on the word representation features from all three layers of the ELMo model.
The final approach, BERT (Devlin et al., 2019), aims to produce a general-purpose language model by training a deep network of bidirectional transformers with self-attention, using a masked language-model objective. By taking bi-directionality into account, it improves on previous efforts, such as the Generative Pre-trained Transformer (Radford et al., 2018), which were unidirectional. For our experiments, a pre-trained model, trained on the concatenation of the BooksCorpus (Zhu et al., 2015) and English Wikipedia, was used. Sentence embeddings were constructed by averaging the token representations from the second-to-last hidden layer of the model, as this approach produced good results in the original work.

Discrete Cosine Transform and PCA
Aside from comparing EigenSent to the state-of-the-art, it is also prudent to compare the proposed method to other approaches rooted in the frequency domain. Two very popular methods for summarizing or compressing information are the Discrete Cosine Transform (DCT) (Ahmed et al., 1974) and Principal Components Analysis (PCA) (Hotelling, 1933).
DCT is closely related to the Fourier transform, which aims to decompose a signal into the frequencies that make up the signal; in the case of DCT, the basis vectors are cosine functions of increasing frequencies. The goal of DCT in the multidimensional case is to determine a set of coefficient vectors (or components) which linearly combine the cosine bases to retrieve the original signal. The DCT components are arranged in order of importance in recreating the original signal, and the top-K components are concatenated to form the embedding of the sentence on which DCT is performed. The comparison to PCA is only natural, as DMD works analogously to it. However, DMD contains information about the transition dynamics of a sequence, whereas PCA lacks this property (Schmid, 2010). This is because DMD is based on the eigendecomposition of the Koopman operator derived from the multidimensional signal, which captures the transition dynamics in that signal, whereas PCA is based on the covariance matrix produced from the signal, which does not. In a similar fashion, the top-K principal components are concatenated to form the sentence embedding.
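For concreteness, minimal numpy sketches of the DCT- and PCA-based embedding constructions follow. These are our own re-implementations for illustration (the paper uses SciPy's FFT-based DCT and scikit-learn's PCA), and the toy signal is random:

```python
import numpy as np

def dct_embedding(S, K):
    """Type-II DCT along the word axis of the m x N sentence signal S;
    the first K coefficient vectors are concatenated as the embedding.
    (A direct numpy re-implementation of the unnormalized DCT-II that
    scipy computes via FFT.)"""
    m, N = S.shape
    n = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * np.arange(N)[:, None] / N)
    coeffs = S @ basis.T             # m x N DCT coefficients per dimension
    return coeffs[:, :K].reshape(-1, order="F")

def pca_embedding(S, K):
    """Top-K principal directions of the word vectors as the embedding."""
    Sc = S - S.mean(axis=1, keepdims=True)   # center the word vectors
    U, s, Vt = np.linalg.svd(Sc, full_matrices=False)
    return U[:, :K].reshape(-1, order="F")

rng = np.random.default_rng(2)
S = rng.standard_normal((4, 6))      # toy sentence: 6 words, 4-d vectors
e_dct = dct_embedding(S, K=2)
e_pca = pca_embedding(S, K=2)
```

Note that neither construction looks at the order-dependent transition structure the way the Koopman operator does: the DCT bases are fixed in advance, and the PCA directions come from the covariance of the word vectors alone.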
We choose the aforementioned methods to benchmark EigenSent against because together they provide significant coverage of the logical competing ideas. The p-means method is based on algebraic manipulation of the sequence of word embeddings in order to create a sentence embedding, and does not require training, much like other methods such as that of Arora et al. (2017). ELMo is another state-of-the-art method, representative of methods which leverage language models to capture contextual information when forming embeddings. As for DCT and PCA, they are well-studied methods used for the spectral representation of a signal. DCT is a non-adaptive method, in the sense that it fixes the basis vectors to be cosine functions, whereas PCA is adaptive and learns a set of orthogonal bases.

Datasets
In order to compare the embeddings generated by our method to the benchmark methods, described in Section 2.2, we use three public datasets, pertaining to text and sentiment classification, of varying degrees of complexity.
The 20 newsgroups (20-NG) and the Reuters-8 (R-8) datasets are popular text classification benchmarks which have been widely used in literature; the latter comprises documents that appeared in the Reuters newswire in 1987. The former has 20 different conceptual classes for the textual content to be classified into, while the latter has 8.
The Stanford Sentiment Treebank (SST-5) (Socher et al., 2013) is a dataset for sentiment categorization in which a corpus of movie review excerpts from rottentomatoes.com is categorized into one of 5 classes, representing sentiments varying from very positive to very negative.
More metadata about these datasets is provided in Table 1.

Software and resources
Resources: In order to construct sentence embeddings, all the competing methods except ELMo require a set of word vectors. In this work, we use the pre-trained set of word embeddings provided by Mikolov et al. For experimenting with ELMo and evaluating it on the chosen datasets, we use a pre-trained model, trained on the One Billion Words benchmark (Chelba et al., 2013). In the case of BERT, we use the BERT-Large, Uncased model, which is a 24-layer deep transformer network that was trained on Wikipedia and the BooksCorpus (Zhu et al., 2015) for 1 million update steps.
Implementations: For DCT, we use an implementation provided in the SciPy Python library (Jones et al., 2001-), which uses the Fast Fourier Transform (FFT) to obtain the cosine transform components, while PCA is made available in the scikit-learn package (Pedregosa et al., 2011; https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). We use the p-means implementation provided by Rücklé et al. (2018) themselves (https://github.com/UKPLab/arxiv2018-xling-sentence-embeddings/blob/master/model/sentence_embeddings.py) and leverage Tensorflow graphs (Abadi et al., 2016) in order to use ELMo. For BERT, we leverage a fast in-memory message-queue based implementation, bert-as-service (https://bert-as-service.readthedocs.io). Finally, for an implementation of higher-order Dynamic Mode Decomposition (HODMD), we look towards the Python implementation by Demo et al. (2018) (https://mathlab.github.io/PyDMD/hodmd.html).

The per-method configurations explored in our experiments are as follows:

BERT: pre-trained model fetched as mentioned in Section 3.2; sentence embeddings were constructed based on word representation features from the second-to-last layer of the BERT-Large, Uncased model.

ELMo: model downloaded as mentioned in Section 3.2; sentence embeddings were constructed based on word representation features from all three layers of the ELMo model.

DCT: components were varied from 1 to 6, after which performance plateaued or diminished.

PCA: components were varied from 1 to 3, after which performance plateaued or diminished.

EigenSent: there are two parameters to tune for the HODMD algorithm: the window, d, and the number of dynamic modes to retain, n, as described in Algorithm 1. d is varied between 1, 2, 3, [1-2], [1-3] and [1-6], while n is chosen as 1 or 2.

Linear SVM: the L2 regularization parameter is varied between 0.001, 0.1, 1, 10 and 100.
In order to foster reproducibility and openness, all of the experimental code is released at https://github.com/DeepK/hoDMD-experiments, and results can easily be reproduced by re-running the provided code.

Experiments
In this work, we perform extensive experiments to compare the performance of our EigenSent method to the other competing methods. Our experimental protocol is as follows:
1. Choose an algorithm (EigenSent, p-means, BERT, ELMo, DCT or PCA) and a set of hyperparameter values pertaining to it (e.g., the number of components to keep in PCA, or the powers in p-means) to evaluate.
2. Choose a dataset (20-NG, R-8 or SST-5). For every word in every sentence in the train and test splits of the dataset, retrieve the corresponding word vector using the pre-trained model stated in Section 3.2.
3. Apply the algorithm-hyperparameter combination to the sequences of word vectors constructed in the previous step to create sentence embeddings.
4. Train a simple linear-kernel support vector machine (Cortes and Vapnik, 1995) on the sentence embeddings corresponding to the train split of the dataset, and evaluate the trained model on the test split by calculating Precision, Recall and their harmonic mean, the F1-score.
Table 2 holds more metadata details about the experiments performed, for the purposes of reproducibility.
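Steps 2 to 4 of this protocol can be sketched as follows, with synthetic vectors standing in for the real sentence embeddings; the data generator and its parameters are illustrative assumptions, while the C grid matches the one used in our experiments:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

rng = np.random.default_rng(3)

# Stand-in for step 3: two classes of synthetic "sentence embeddings".
def make_split(n_per_class, dim=16, shift=1.5):
    X0 = rng.standard_normal((n_per_class, dim))
    X1 = rng.standard_normal((n_per_class, dim)) + shift
    return np.vstack([X0, X1]), np.array([0] * n_per_class + [1] * n_per_class)

X_train, y_train = make_split(100)
X_test, y_test = make_split(40)

# Step 4: linear-kernel SVM, sweeping the L2 regularization parameter C
# over the grid 0.001, 0.1, 1, 10, 100.
best_f1, best_C = -1.0, None
for C in (0.001, 0.1, 1, 10, 100):
    clf = LinearSVC(C=C).fit(X_train, y_train)
    f1 = f1_score(y_test, clf.predict(X_test), average="macro")
    if f1 > best_f1:
        best_f1, best_C = f1, C
```

In the actual experiments the synthetic generator is replaced by the embedding method under evaluation, and Precision and Recall are reported alongside the F1-score.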

Results
The results of the experiments, corresponding to the configurations listed in Table 2, are shown in Table 3 and analyzed next.

Dataset analysis
Amongst the datasets, the Reuters-8 dataset is relatively easier to tackle, as shown by the consistently high F1-scores across all the methods and configurations. The 20 newsgroups dataset is slightly more complex, owing to the much larger number of classes that it contains.
Finally, the Stanford Sentiment Treebank dataset is observed to be very nuanced, and it is much harder to achieve high scores on it. As an example of the degree of complexity of the SST-5 dataset, consider the following training sentence, which is labeled as very positive: "The entire movie establishes a wonderfully creepy mood". While the word wonderfully is usually used with a positive intent, creepy is most often negative. It is their combination, i.e., wonderfully creepy, which makes the description an example of a very positive sentiment.

Analysis of the competing methods
The results obtained for the benchmark methods follow intuition. P-means, BERT and ELMo outperform the PCA- and DCT-based embedding creation techniques; the two latter methods do achieve respectable results, given that they are not attuned to creating embeddings, but simply decompose a sequence of word vectors into components, which we use to construct embeddings. It can be seen that the DCT-based embedding creation technique needs more components than PCA to achieve reasonable performance, because PCA learns the basis vectors in a data-driven way while DCT assumes cosine functions as bases. However, since it does not need to learn the bases, and therefore makes fewer errors than PCA, DCT is more performant than PCA when more components are utilized. Among ELMo, BERT and p-means, ELMo performs better on the SST-5 dataset because it takes context into account in a much more sophisticated way (owing to the bi-directional LSTM-based language model) than p-means. The performance of BERT is in between the two. Neither BERT nor ELMo has been fine-tuned in any way, so that they are fairly comparable to the other methods discussed in this work, none of which are task-specific.

Analysis of EigenSent
Next, we thoroughly analyze our proposed method from various perspectives.
Choice of higher-order DMD vs standard DMD: Observing the results of the EigenSent method in Table 3, it is clear that exploiting the higher-order assumption (see equation 3) is beneficial, since the results are unanimously better for higher values of the order parameter, d.
Effect of adding more dynamic modes: Recall that in Algorithm 1, the number of dynamic modes to retain, n, is a parameter. This determines the number of eigenvectors that are kept after performing eigendecomposition on the Koopman operator (see Section 2.1.2). It can be observed that retaining the fundamental eigenvector, or the largest mode, is enough to secure good performance when constructing sentence embeddings with EigenSent, with small improvements made by choosing the first two, at the cost of embedding dimensionality.
Performance with respect to PCA and DCT: EigenSent, using HODMD, is consistently superior, as compared to the other spectral techniques tested in this work. This is because it captures information about the sequential behaviour of the word vectors which form a sentence, while the other methods do not.
Performance with respect to state-of-the-art: HODMD is designed to capture the dynamics in a multidimensional sequence but it does not directly capture the scale, which methods like p-means (or simply averaging the word vectors) do. This is reflected in the performance of EigenSent on the datasets tested with, as its performance is somewhat between ELMo (or BERT) and p-means. For the SST-5 dataset, which exhibits more complex behaviour and interplay amongst words, it performs better than p-means (and much better than simple averaging), because of its ability to capture the dynamics, which is probably the more important attribute in this case.
Concatenating with word vector average: The intuitions of dynamics and scale, corroborated by the performance observed in Table 3, as explained above, led us to combine the embeddings generated by EigenSent with those obtained by simply averaging the word vectors, so as to capture both of these properties of a sentence. A summary of results is provided in Table 4, where the best results for each method are given, along with an additional result where the most performant EigenSent-based embeddings have been concatenated with the average word vector embedding for each sentence. It can be readily seen that this concatenation significantly improves performance, as the resultant embeddings can now capture both the scale and the dynamics of a sentence.

Table 5: Examples of best-matching sentence pairs based on the cosine similarity between the embeddings obtained using EigenSent:
- "The storylines are woven together skilfully, the magnificent swooping aerial shots are breathtaking, and the overall experience is awesome." / "The camera soars above the globe in dazzling panoramic shots that make the most of the large-screen format, before swooping down on a string of exotic locales, scooping the whole world up in a joyous communal festival of rhythm." (0.796)
- "What's most memorable about Circuit is that it's shot on digital video, whose tiny camera enables Shafer to navigate spaces both large ...and small... with considerable aplomb." / "The large-format film is well suited to capture these musicians in full regalia and the incredible IMAX sound system lets you feel the beat down to your toes." (0.771)
- "George Lucas returns as a visionary with a tale full of nuance and character dimension." / "The script by David Koepp is perfectly serviceable and because he gives the story some soul ...he elevates the experience to a more mythic level." (0.758)
Examples of similar sentences with EigenSent: Apart from the extensive quantitative evaluation of the proposed method, we provide motivating examples of similar sentences from the Stanford Sentiment Treebank dataset, as deemed by our method, in Table 5. It can be noted that none of the sentence-pairs share common words, apart from stop-words, and the similarity is semantic. The first example shows sentences which are similar because they both praise the camerawork in a movie, while in the second example, the commonality is about the video format. In the last example, the sentences point to movies having interesting characters, soul and depth. All of these examples suggest that EigenSent can capture the very nuanced qualities of a sentence.
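The dynamics-plus-scale concatenation discussed above can be sketched in a few lines; here a first-order leading dynamic mode serves as a simplified stand-in for the full EigenSent embedding, and the sentence signal is a random toy matrix:

```python
import numpy as np

def dynamics_embedding(S):
    """Leading dynamic mode of the sentence signal: a compact stand-in
    for the EigenSent embedding of Algorithm 1, with d = 1, n = 1."""
    A = S[:, 1:] @ np.linalg.pinv(S[:, :-1])   # first-order Koopman operator
    eigvals, eigvecs = np.linalg.eig(A)
    lead = eigvecs[:, np.argmax(np.abs(eigvals))]
    return np.real(lead)

rng = np.random.default_rng(4)
S = rng.standard_normal((5, 7))          # toy sentence: 7 words, 5-d vectors

# dynamics (EigenSent-style mode) + scale (plain word vector average)
combined = np.concatenate([dynamics_embedding(S), S.mean(axis=1)])
```

The combined vector is simply the two embeddings laid end to end, so its dimensionality is the sum of the two parts; a downstream linear classifier can then weight the dynamics and scale components independently.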

Conclusion
In this paper, we have proposed a novel method to construct sentence embeddings, by exploiting the dynamic properties of the sequence of word vectors that a sentence is made up of. We do this using a spectral decomposition method rooted in fluid dynamics, known as higher-order Dynamic Mode Decomposition, which is known to capture the fundamental transition dynamics of a multidimensional signal. Thorough empirical validation of the proposed method, which we call EigenSent, against known state-of-the-art methods shows the promise of this technique in capturing the dynamics of a word vector sequence to distill sentence embeddings, which may be concatenated with word vector average embeddings to achieve state-of-the-art performance.
The main contributions of the paper are:
1. We use signal summarization as an approach for creating sentence embeddings, a first to the best of our knowledge, using an algorithm from fluid dynamics called higher-order Dynamic Mode Decomposition (HODMD).
2. The rationale and intuition behind using the said method to capture the dynamic properties of a sentence are motivated, and the mathematical preliminaries of HODMD in the context of constructing sentence embeddings are clearly delineated.
3. A detailed experimental validation of EigenSent is performed on three public datasets, of varying degrees of complexity and purpose, and against algorithms which are both state-of-the-art and diverse, to formulate general conclusions about EigenSent.
4. We postulate, and later observe, that our method can successfully capture the dynamics present in a sentence. In cases where dynamics alone does not capture the essence of a sentence, our embeddings may be concatenated with those obtained via word vector averaging to obtain state-of-the-art results.