Unsupervised Random Walk Sentence Embeddings: A Strong but Simple Baseline

Using a random walk model of text generation, Arora et al. (2017) proposed a strong baseline for computing sentence embeddings: take a weighted average of word embeddings and modify with SVD. This simple method even outperforms far more complex approaches such as LSTMs on textual similarity tasks. In this paper, we first show that word vector length has a confounding effect on the probability of a sentence being generated in Arora et al.’s model. We propose a random walk model that is robust to this confound, where the probability of word generation is inversely related to the angular distance between the word and sentence embeddings. Our approach beats Arora et al.’s by up to 44.4% on textual similarity tasks and is competitive with state-of-the-art methods. Unlike Arora et al.’s method, ours requires no hyperparameter tuning, which means it can be used when there is no labelled data.


Introduction
Distributed representations of words, better known as word embeddings, have become fixtures of current methods in natural language processing. Word embeddings can be generated in a number of ways (Bengio et al., 2003; Collobert and Weston, 2008; Pennington et al., 2014; Mikolov et al., 2013) by capturing the semantics of a word using the contexts it appears in. Recent work has tried to extend that intuition to sequences of words, using methods ranging from a weighted average of word embeddings to convolutional, recursive, and recurrent neural networks (Kiros et al., 2015; Luong et al., 2013; Tai et al., 2015). Still, Wieting et al. (2016b) found that these sophisticated architectures are often outperformed, particularly in transfer learning settings, by sentence embeddings generated as a simple average of tuned word embeddings. Arora et al. (2017) provided a more powerful approach: compute the sentence embeddings as weighted averages of word embeddings, then subtract from each one its vector projection on their first principal component. The weighting scheme, smoothed inverse frequency (SIF), is derived from a random walk model in which the words in a sentence s are produced by the random walk of a latent discourse vector c_s. A word unrelated to c_s can be produced by chance or if it is part of frequent discourse, such as stopwords. This approach even outperforms more complex models such as LSTMs on textual similarity tasks. Arora et al. argued that the simplicity and effectiveness of their method make it a tough-to-beat baseline for sentence embeddings. Though they call their approach unsupervised, others have noted that it is actually 'weakly supervised', since it requires hyperparameter tuning (Cer et al., 2017).
In this paper, we first propose a class of worst-case scenarios for Arora et al.'s (2017) random walk model. Specifically, given some sentence g that is dominated by words with zero similarity, and some sentence h that is dominated by identical words, we show that their approach can return two discourse vectors c_g and c_h such that p(g|c_g) ≈ p(h|c_h), provided that the word vectors for g are sufficiently longer than those for h. In other words, word vector length has a confounding effect on the probability of a sentence being generated, and this effect can be strong enough to yield completely unintuitive results. This problem is not limited to these scenarios, though they are the most illustrative of it; because of the underlying log-linear word production model, Arora et al.'s model is fundamentally sensitive to word vector length.
Our contributions in this paper are three-fold. First, we propose a random walk model that is robust to distortion by vector length, where the probability of a word vector being generated by a discourse vector is inversely related to the angular distance between them. Second, we derive a weighting scheme from this model and compute a MAP estimate for the sentence embedding as follows: normalize the word vectors, take a weighted average of them, and then subtract from each weighted average vector the projection on its first m principal components. We call the weighting scheme derived from our random walk model unsupervised smoothed inverse frequency (uSIF). It is similar to SIF (Arora et al., 2017) in practice, but requires no hyperparameter tuning at all; it is completely unsupervised, allowing it to be used when there is no labelled data. Lastly, we show that our approach outperforms Arora et al.'s by up to 44.4% on textual similarity tasks and is even competitive with state-of-the-art methods. Given the simplicity, effectiveness, and unsupervised nature of our method, we suggest it be used as a baseline for computing sentence embeddings.

Related Work
Word Embeddings Word embeddings are distributed representations of words, typically in a low-dimensional continuous space. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically (e.g., v_Berlin − v_Germany + v_France ≈ v_Paris) (Mikolov et al., 2013). Word embeddings are generally obtained in two ways: (a) from internal representations of words in shallow neural networks (Bengio et al., 2003; Mikolov et al., 2013; Collobert and Weston, 2008); (b) from low-rank approximations of co-occurrence matrices (Pennington et al., 2014).
Word Sequence Embeddings Embeddings for sequences of words (e.g., sentences) are created by composing word embeddings. This can be done simply, by taking coordinate-wise products (Mitchell and Lapata, 2008) or an unweighted average (Mikolov et al., 2013) of the word vectors. More sophisticated architectures can also be used: for instance, recursive neural networks (Socher et al., 2011), LSTMs (Tai et al., 2015), and convolutional neural networks (Kalchbrenner et al., 2014) can be defined and trained on parse and dependency trees.
Other approaches are based on the presence of a latent vector for the entire sequence. Paragraph vectors are latent representations that influence the distribution of words. Skip-thought vectors (Kiros et al., 2015) are hidden representations of a neural network that encodes a sentence by trying to reconstruct its surrounding sentences. Other work leverages transfer learning by using the hidden representation of a sentence in an LSTM trained for another task, such as textual entailment. The inspiration for Arora et al. (2017) is Wieting et al. (2016b), who use word averaging after updating word embeddings by tuning them on paraphrase pairs. Later work by Wieting and Gimpel (2017a) tried trigram averaging and LSTM averaging in addition to word averaging. In that approach, vectors were tuned on the ParaNMT-50M dataset, created by using neural machine translation to translate 51M Czech-English sentence pairs into English-English paraphrase pairs. This yielded state-of-the-art results on textual similarity tasks, beating the previous baseline by a wide margin.

The Log-Linear Random Walk Model
In Arora et al.'s original model (2016), words are generated dynamically by the random walk of a time-variant discourse vector c_t ∈ ℝ^d, representing "what is being talked about". Words are represented as vectors v_w ∈ ℝ^d. The probability of a word w being generated at time t is given by a log-linear production model (Mnih and Hinton, 2007):

  p(w emitted at time t | c_t) ∝ exp(⟨c_t, v_w⟩)   (1)

Assuming that the discourse vector c_t does not change much over the course of the sentence, Arora et al. replace the sequence of discourse vectors {c_t} across all time steps with a single discourse vector c_s. The MAP estimate of c_s is then the unweighted average of word vectors (ignoring any scalar multiplication). Arora et al.'s improved random walk model (2017) allows words to also be generated: (a) by chance, with probability α · p(w), where α is some scalar and p(w) is the word frequency; (b) if the word is correlated with the common discourse vector, which represents frequent discourse such as stopwords. We use c_0 to denote the common discourse vector, to be consistent with the literature. Among other things, these changes help explain words that appear frequently despite being poorly correlated with the discourse vectors, words like the, for example. The probability of a word w being generated by a discourse vector c_s is then given as:

  p(w | c_s) = α p(w) + (1 − α) exp(⟨c̃_s, v_w⟩) / Z_{c̃_s},  where c̃_s = β c_0 + (1 − β) c_s

where α, β are scalar hyperparameters, V is the vocabulary, c̃_s is a linear combination of the discourse and common discourse vectors parameterized by β, and Z_{c̃_s} = Σ_{w∈V} exp(⟨c̃_s, v_w⟩) is the partition function. The sentence embedding for a sentence is defined as the MAP estimate of the discourse vector c_s that generated the sentence. To compute this tractably, Arora et al. (2017) assume that the word vectors v_w are roughly uniformly dispersed in the latent space. This implies that the partition function Z_{c̃_s} is roughly the same for all c̃_s, allowing it to be replaced with a constant Z.
Assuming a uniform prior over c̃_s, the maximum likelihood estimator for c̃_s on the unit sphere (ignoring normalization) is then approximately proportional to a weighted average of the word vectors:

  c̃_s ∝ Σ_{w∈s} (a / (p(w) + a)) · v_w,  where a = (1 − α) / (αZ)

Since Z cannot be evaluated and α is not known, a is a hyperparameter that needs tuning. This weighting scheme is called smoothed inverse frequency (SIF) and places a lower weight on more frequent words. The first principal component of all {c̃_s} in the corpus is used as the estimate for the common discourse vector c_0. The final discourse vector c_s is then produced by subtracting from the weighted average its projection on the common component (common component removal):

  c_s = c̃_s − (c_0 c_0ᵀ) c̃_s

Arora et al. call their approach unsupervised, but others (Cer et al., 2017) have correctly noted that it is weakly supervised, since the hyperparameter a needs to be tuned on a validation set.
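For concreteness, the SIF pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under the stated approximations, not Arora et al.'s reference implementation; the function name and data structures are our own:

```python
import numpy as np

def sif_embeddings(sentences, word_vec, word_prob, a=1e-3):
    """SIF sketch: weight each word vector by a / (a + p(w)), average per
    sentence, then subtract the projection on the first principal component."""
    X = np.array([
        np.mean([(a / (a + word_prob[w])) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    c0 = Vt[0]                        # estimate of the common discourse vector
    return X - np.outer(X @ c0, c0)   # common component removal
```

Note that the principal component is estimated from the whole batch of sentence embeddings, so embeddings depend on the corpus they are computed with.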

The Confounding Effect of Vector Length
We now propose worst-case scenarios where word vector length clearly distorts p(s|c_s) due to the underlying log-linear word production model. Note that we discuss these scenarios because they are illustrative, not because they circumscribe the universe of all scenarios in which word vector length has a confounding effect. Consider a sentence g comprising two rare words x and y, where x and y have zero similarity. Also consider some sentence h, in which the only word z appears twice. g might not occur naturally, but its weighted average c̃_g would be similar to that of some longer sentence where x, y are the only non-stopwords (i.e., the only words with non-negligible weight). For simplicity, further assume that common component removal has a negligible effect:

  c_g = c̃_g,  c_h = c̃_h   (2)

Words x, y, z are so infrequent that the probability of them being produced by chance or by the common discourse vector is negligible; the likelihood of them being produced is therefore proportional to the inner product of the discourse and word vectors. Given that the words x, y ∈ g have zero similarity, and given that the only word z ∈ h is identical to its discourse vector, we would expect:

  p(g|c_g) < p(h|c_h)   (3)

However, (3) does not always hold. Suppose that the word embeddings lie in ℝ². Then any scalar k can be used to create a valid set of assignments for word embeddings v_x, v_y, v_z that satisfy (2):

  v_x = ⟨k, k⟩,  v_y = ⟨k, −k⟩,  v_z = ⟨k, 0⟩   (4)

Assuming the words x, y, z have roughly the same frequency, they should have the same SIF weight. Then the weighted averages, and by extension the discourse vectors (2), are the same:

  c_g = c_h = ⟨k, 0⟩

Thus it is possible for g to be generated by discourse vector c_g with roughly the same probability as h by c_h, contradicting (3). How is this possible, given that the words in g have zero similarity with each other while those in h are identical to each other? The answer can be found in the word vector lengths.
Because ‖v_x‖₂ = √2 ‖v_z‖₂, and p(w|c_s) depends on the inner product of the word and discourse vectors (1), words with longer word vectors are more likely to be produced. In fact, if v_x and v_y were multiplied by some scalar greater than 1, then p(h|c_h) would be less than p(g|c_g).
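This worst case is easy to check numerically. The sketch below (with k = 1, and toy assignments chosen so that x and y are orthogonal while both sentences share the same weighted average) plugs the vectors into the unnormalized log-linear likelihood (1):

```python
import numpy as np

k = 1.0
v_x, v_y = k * np.array([1.0, 1.0]), k * np.array([1.0, -1.0])  # zero similarity
v_z = k * np.array([1.0, 0.0])                                  # the repeated word

c_g = (v_x + v_y) / 2.0   # equal weights, so c_g is the plain average
c_h = v_z                 # the average of two copies of v_z

# log-linear production model: p(w|c) is proportional to exp(<v_w, c>)
p_g = np.exp(v_x @ c_g) * np.exp(v_y @ c_g)
p_h = np.exp(v_z @ c_h) * np.exp(v_z @ c_h)
assert np.isclose(p_g, p_h)   # equally likely, despite x and y being dissimilar

# doubling the lengths of v_x and v_y makes the dissimilar sentence MORE likely
c_g2 = (2.0 * v_x + 2.0 * v_y) / 2.0
p_g2 = np.exp(2.0 * v_x @ c_g2) * np.exp(2.0 * v_y @ c_g2)
assert p_g2 > p_h
```

The two assertions confirm both halves of the argument: equality under the assignments above, and a reversal of the expected ordering once the vectors for g are scaled up.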
Generalizing Worst-Case Scenarios By manipulating the word vector length, we can also define a more general class of assignments that can contradict (3), parameterized by a scalar b that controls the word vector length and a parameter s ∈ [0, 1] that controls the similarity between v_x and v_y (5). For simplicity, we assume that the words x, y, z across the two sentences are so infrequent that the probability of them being generated by chance is zero; the conditional probability of each sentence being generated then depends only on the inner products between its word vectors and its discourse vector. In this general formulation, not all scenarios are worst-case. The class describes a spectrum of scenarios ranging from acceptable (e.g., v_x = v_y = v_z when b = 2, s = 0.5) to completely counter-intuitive (see (4)). Though these assignments only apply to word vectors in ℝ², they can easily be extended to higher-dimensional spaces. The confound of vector length persists for longer, naturally occurring sentences. Ultimately, the underlying log-linear word production model (1) means that words with longer word vectors are more likely to be generated. Because this confound is due to model design, rather than the MLE, removing it requires redesigning the model. The exact degree of the confound varies across sentences, but in theory, it is unbounded.

An Angular Distance-Based Random Walk Model
To address the confounding effect of word vector length, we propose a random walk model where the probability of observing a word w at time t is inversely related to the angular distance between the time-variant discourse vector c_t ∈ ℝ^d and the word vector v_w ∈ ℝ^d:

  p(w emitted at time t | c_t) ∝ 1 − (1/π) arccos(cos(v_w, c_t))

where arccos(cos(v_w, c_t)) is the angular distance.
For the intuition behind the use of this distance metric, note that the angular distance between two vectors is equal to the geodesic distance between them on the unit sphere. Thus the angular distance can also be interpreted as the length of the shortest path between the L2-normalized word vector and the L2-normalized discourse vector on the unit sphere. Since the angular distance lies in [0, π], we divide it by π to bound it in [0, 1]. Our choice of angular distance, as opposed to, say, the exponentiated cosine similarity, is critical to avoiding hyperparameter tuning.
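As a quick sanity check, the scaled angular distance can be computed directly, and, unlike the inner product in the log-linear model, it is unaffected by vector length. A short sketch (the function name is ours):

```python
import numpy as np

def angular_distance(u, v):
    """Angular distance between u and v, divided by pi so it lies in [0, 1].
    Equal to the geodesic distance between the L2-normalized vectors."""
    cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos_uv, -1.0, 1.0)) / np.pi

u, v = np.array([1.0, 1.0]), np.array([1.0, 0.0])
assert np.isclose(angular_distance(u, v), 0.25)   # 45 degrees, scaled by 1/pi
assert np.isclose(angular_distance(10.0 * u, v), angular_distance(u, v))
```

The second assertion is the key property: rescaling a word vector leaves its distance to the discourse vector, and hence its production probability, unchanged.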
Assuming that the discourse vector c_t does not change much over the course of the sentence, the sequence of discourse vectors {c_t} across all time steps can be replaced with a single discourse vector c_s for the sentence s. To model sentences more realistically, we allow words to be generated in two additional ways, as proposed in Arora et al. (2017): (a) by chance, with probability α · p(w), where α is some scalar and p(w) is the word frequency; (b) if the word is correlated with one of m common discourse vectors {c_0^1, ..., c_0^m}, which represent various types of frequent discourse, such as stopwords. The probability of a word w being generated by discourse vector c_s is then:

  p(w | c_s) = α p(w) + (1 − α) · (1 − (1/π) arccos(cos(v_w, c̃_s))) / Z_{c̃_s}

where α, β, {λ_i} are scalar hyperparameters, V is the vocabulary, c̃_s is a linear combination of the discourse and common discourse vectors parameterized by β and {λ_i}, and Z_{c̃_s} is the partition function. Instead of searching for the optimal hyperparameter values over some large space, as Arora et al. (2017) did, we make some simple assumptions to compute them directly.
We define the sentence embedding for some sentence s to be the MAP estimate of the discourse vector c_s that generates s. Assuming a uniform prior over possible c_s, the MAP estimate is also the MLE for c_s. The log-likelihood of a sentence s is log p(s|c_s) = Σ_{w∈s} log p(w|c_s). To maximize it, we approximate each log p(w|c_s) using a first-degree Taylor polynomial with a constant term C and a linear term involving a ≜ (1 − α)/(αZ_{c̃_s}) and a vector v′_w orthogonal to v_w with length ‖v_w‖⁻¹. The MLE for c̃_s on the unit sphere (ignoring normalization) is then approximately proportional to:

  c̃_s ∝ Σ_{w∈s} (a / (½a + p(w))) · v_w

The MLE of c̃_s is approximately a weighted average of word vectors, where more frequent words are down-weighted. In fact, it very closely resembles the SIF weighting scheme (Arora et al., 2017)! However, there are two key differences. For one, as we show later in this subsection, we have derived this weighting scheme from a model that is robust to the confounding effect of word vector length. Secondly, in SIF, a is a hyperparameter that needs to be tuned on a validation set. We now show that in our approach, we can calculate a directly as a function of the vocabulary V and the number of words in the sentence, |s|.
Normalization Before weighting the word vectors, we normalize them along each dimension: we construct a matrix [v_{w_1} ... v_{w_{|s|}}] and take the L2 norm of each row, which corresponds to a single dimension in ℝ^d. We then divide every vector in the sentence element-wise by this d-dimensional vector of norms. This helps reduce the difference in variance across the dimensions.
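A minimal sketch of this normalization step, under our reading that each dimension is scaled by the reciprocal of its L2 norm across the sentence (the function name and toy matrix are ours):

```python
import numpy as np

def normalize_dimensions(V):
    """V: a (|s|, d) matrix whose rows are the word vectors of one sentence.
    Take the L2 norm of each dimension (column) across the sentence and
    scale every vector by the reciprocal, evening out per-dimension variance."""
    return V / np.linalg.norm(V, axis=0)

V = np.array([[3.0, 0.1], [4.0, 0.2]])   # second dimension has far less spread
Vn = normalize_dimensions(V)
assert np.allclose(np.linalg.norm(Vn, axis=0), 1.0)  # unit norm per dimension
```

After this step, no single dimension dominates the weighted average purely because its raw values happen to be large.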

Partition Function
To calculate Z_{c̃_s}, we borrow the key assumption from Arora et al. (2017) that the word vectors v_w are roughly uniformly dispersed in the latent space. Then the expected geodesic distance between a latent discourse vector and a word vector on the unit sphere is π/2, so

  Z_{c̃_s} ≈ Σ_{w∈V} (1 − (1/π)·(π/2)) = 0.5 |V|

Odds of Random Production α is the probability that a word w will be produced by chance instead of by the discourse or common discourse vectors. To estimate α, we first consider the probability that a random word w will be produced by a discourse vector c_s at least once over n steps of a random walk:

  1 − (1 − 1/|V|)ⁿ

The number of steps taken during the random walk is itself a random variable, so we let n = E_{s∈S} |s|. We assume that if a word's frequency is greater than this threshold, then the word is always produced by chance; if it is less, then the word is always produced by the discourse or common discourse vectors. α is the proportion of the vocabulary with p(w) above this threshold:

  α = |{w ∈ V : p(w) > 1 − (1 − 1/|V|)ⁿ}| / |V|

Since we can directly calculate Z_{c̃_s} and α, we can also directly calculate a = (1 − α)/(αZ_{c̃_s}).

Common Discourse Vectors We estimate the m common discourse vectors as the first m singular vectors from the singular value decomposition of the weighted average vectors. {λ_i} are the weights on the common discourse vectors. In reality, these are unique to the word for which p(w|c_s) is being evaluated. However, we let λ_i be:

  λ_i = σ_i² / Σ_{j=1}^m σ_j²

where σ_i is the singular value for c_0^i. λ_i can be interpreted as the proportion of the variance explained by {c_0^1, ..., c_0^m} that is explained by c_0^i alone. If removing the common discourse vectors is a form of denoising (Arora et al., 2017), increasing m should, in theory, improve results. Because the variance explained by a singular vector falls with every additional vector that is included, the choice of m is thus a trade-off between variance explained and computational cost. When m = 1, this is equivalent to the removal in Arora et al. (2017).
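The closed-form estimates for Z_{c̃_s}, α, and a can be combined into a short routine. This is a sketch under the stated assumptions; the function name and toy vocabulary are ours, and the threshold is our reading of the at-least-once-in-n-draws probability for a uniformly random word:

```python
def usif_a(word_prob, avg_sent_len):
    """Directly compute a = (1 - alpha) / (alpha * Z) for uSIF.
    word_prob maps each vocabulary word to its unigram probability p(w)."""
    V = len(word_prob)
    Z = 0.5 * V                              # since E[1 - dist/pi] = 1/2
    n = avg_sent_len
    threshold = 1.0 - (1.0 - 1.0 / V) ** n   # P(uniform word occurs >=1 time in n)
    alpha = sum(p > threshold for p in word_prob.values()) / V
    return (1.0 - alpha) / (alpha * Z)       # note: requires alpha > 0

# toy vocabulary: one frequent word, three rare ones; average sentence length 2
probs = {"the": 0.7, "cat": 0.2, "sat": 0.05, "mat": 0.05}
a = usif_a(probs, avg_sent_len=2)
# threshold = 1 - (3/4)^2 = 0.4375, so alpha = 1/4, Z = 2, a = 0.75 / 0.5 = 1.5
assert abs(a - 1.5) < 1e-12
```

No quantity here depends on labelled data: the unigram probabilities and the average sentence length are all that is needed.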
We fix m at 5, since we find empirically that singular vectors beyond that do not explain much more variance. To get c_s, we subtract from c̃_s the weighted projection on each singular vector:

  c_s = c̃_s − Σ_{i=1}^m λ_i (c_0^i (c_0^i)ᵀ) c̃_s

We call this piecewise common component removal. Because our weighting scheme requires no hyperparameter tuning, it is completely unsupervised. For this reason, we call it unsupervised smoothed inverse frequency (uSIF). The full algorithm is given in Algorithm 1. Note that while it is certainly possible to tune the hyperparameters in our model to achieve optimal results, it is not necessary to do so, which allows our method to be used when there is no labelled data. By contrast, in Arora et al.'s model (2017), hyperparameter tuning is a necessity.
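Putting the pieces together, a minimal sketch of the uSIF pipeline might look as follows. This is our own illustrative code, not the released implementation; the per-dimension normalization step is omitted for brevity:

```python
import numpy as np

def usif_embeddings(sentences, word_vec, word_prob, a, m=5):
    """uSIF sketch: weight each word vector by a / (a/2 + p(w)), average per
    sentence, then subtract the weighted projection on each of the first m
    singular vectors (piecewise common component removal)."""
    X = np.array([
        np.mean([(a / (0.5 * a + word_prob[w])) * word_vec[w] for w in s], axis=0)
        for s in sentences
    ])
    _, svals, Vt = np.linalg.svd(X, full_matrices=False)
    m = min(m, len(svals))
    lam = svals[:m] ** 2 / np.sum(svals[:m] ** 2)   # lambda_i: share of variance
    for i in range(m):
        c0i = Vt[i]
        X = X - lam[i] * np.outer(X @ c0i, c0i)     # weighted projection removal
    return X
```

Unlike plain common component removal, each singular vector is only partially removed, in proportion to the variance it explains among the first m.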

Confound of Vector Length
To understand why this model is not prone to the confound of word vector length, we reconsider the class of assignments for v_x, v_y, v_z in (5) and the resulting values for c̃_g and c̃_h. Recall that in our example, sentence g comprises the words x, y and sentence h comprises two instances of the word z. Under our new weighting scheme, C in (5) is replaced with C′ = αp(x) + ½a. Note that we use p(x) in C′ because of the simplifying assumption that p(x) = p(y) = p(z). Assuming again that p(x) ≈ 0 and that piecewise common component removal has a negligible effect, we can see how p(g|c_g) and p(h|c_h) change in our random walk model:

  p(g|c_g) ∝ (1 − (1/π) arccos(cos(v_x, c_g))) · (1 − (1/π) arccos(cos(v_y, c_g)))
  p(h|c_h) ∝ (1 − (1/π) arccos(cos(v_z, c_h)))² = 1

Because p(g|c_g) is ultimately based on the cosine similarities between the discourse vector and the word vectors, it is a function of the parameter s ∈ [0, 1] that controls the degree of similarity between v_x and v_y. For example, for the worst-case assignments (4), p(g|c_g) ∝ 9/16. Conversely, when v_x = v_y = v_z, we get p(g|c_g) = p(h|c_h) ∝ 1. Recall that in Arora et al.'s model (2017), b ≥ 2 was sufficient to ensure the counter-intuitive result of p(g|c_g) ≥ p(h|c_h) (6), where b was a scalar that controlled the word vector length. In contrast, in our random walk model, the effect of b, and thus the confound of vector length, is entirely absent; only the similarity between the word vectors is influential.

Textual Similarity Tasks
We test our approach on the SemEval semantic textual similarity (STS) tasks (2012-2015) (Agirre et al., 2012, 2013, 2014, 2015), the SemEval 2014 Relatedness task (SICK'14) (Marelli et al., 2014), and the STS Benchmark dataset (Cer et al., 2017). In these tasks, the goal is to determine the semantic similarity between a given pair of sentences; the evaluation criterion is the Pearson correlation coefficient between the predicted and actual similarity scores. To predict the similarity score, we simply encode each sentence and take the cosine similarity of their vectors. The individual scores for STS tasks are in Table 4 in the Appendix and the average scores are in Table 1. The STS Benchmark scores are in Table 2. We compare our results with those from several methods, which are categorized by Cer et al. (2017) as 'unsupervised', 'weakly supervised', or 'supervised'.
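The scoring procedure is simple to reproduce: predict each pair's similarity as the cosine of the two sentence embeddings, then report Pearson's r against the gold scores (a small sketch; the function name and toy pairs are ours):

```python
import numpy as np

def sts_pearson(embedding_pairs, gold_scores):
    """Predict similarity as the cosine between sentence embeddings, then
    report Pearson's r against the gold similarity scores."""
    pred = [e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))
            for e1, e2 in embedding_pairs]
    return np.corrcoef(pred, gold_scores)[0, 1]

pairs = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),   # identical sentences
         (np.array([1.0, 0.0]), np.array([0.0, 1.0]))]   # unrelated sentences
r = sts_pearson(pairs, gold_scores=[5.0, 0.0])
assert np.isclose(r, 1.0)
```

Because the criterion is a correlation, only the ordering and relative spacing of the cosine scores matter, not their absolute scale.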

Experimental Settings
For a fair comparison with Arora et al. (2017), we use the same unigram probability distribution as them, based on the enwiki dataset (Wikipedia, 3B words). Our preprocessing of the sentences is limited to tokenization. We try our method with three types of word vectors: GloVe vectors (Pennington et al., 2014); PARAGRAM-SL999 (PSL) vectors (Wieting et al., 2015), tuned on the SimLex-999 dataset; and ParaNMT vectors (Wieting and Gimpel, 2017a), tuned on 51M English-English sentence pairs translated from English-Czech sentence pairs. The value of n in (11) is E_{s∈S} |s| ≈ 11 and was estimated using sentences from all corpora. The value of a in (9) is then 1.2 × 10⁻³. Our results are denoted as X+UP, where X ∈ {'GloVe', 'PSL', 'ParaNMT'}, U denotes uSIF weighting, and P denotes piecewise common component removal.

Results
Our model outperforms Arora et al.'s by up to 44.4% on individual tasks (see GloVe+UP vs. GloVe+WR for the STS'12 MSRpar task in Table 4) and by up to 15.5% on yearly averages (see GloVe+UP vs. GloVe+WR for STS'12 in Table 1).
Our approach proves most useful in cases where Arora et al. (2017) underperform others, such as STS'12, where our models, GloVe+UP and PSL+UP, outperform their equivalents in Arora et al. (2017) by roughly 6% on average, though the improvement is highly variable. This may be because the hyperparameter values we derived are closer to the optima for some corpora than for others, or because our other improvements, normalization and piecewise common component removal, are more effective for certain datasets. Our best model, ParaNMT+UP, is also competitive with the state-of-the-art model, ParaNMT Trigram-Word, an average of trigram and word embeddings tuned on the ParaNMT-50M dataset. ParaNMT+UP outperforms ParaNMT Trigram-Word on STS'12, STS'13, and STS'14; it is narrowly outperformed on STS'15 and the STS Benchmark. ParaNMT Trigram-Word's inclusion of trigram embeddings gives it an edge over our model on out-of-vocabulary words (Wieting and Gimpel, 2017a). It should be noted that ParaNMT+UP outperforms both ParaNMT Word Avg. and ParaNMT BiLSTM Avg., implying that our model composes words better than both simple averaging and BiLSTMs. Similarly, our model PSL+UP outperforms PP-XXL (Wieting et al., 2016b), despite the latter using the same word vectors with a learned projection on top.
Ablation Study On average, our weighting scheme alone is responsible for a roughly 4.4% improvement over Arora et al. The piecewise common component removal alone is responsible for a roughly 5.1% improvement, and the normalization alone is responsible for a roughly 6.7% improvement. This suggests that the benefits of our individual contributions have much overlap. The choice of tuned word vectors (e.g., ParaNMT over GloVe) can also improve results by up to 11.2%.

Table 2: Results (Pearson's r × 100) on the STS Benchmark dataset. The highest score is in bold. The scores of our approaches are underlined.

Supervised Tasks
We also test our approach on three supervised tasks: the SICK similarity task (SICK-R), the SICK entailment task (SICK-E), and the Stanford Sentiment Treebank (SST) binary classification task. To a large extent, performance on these tasks depends on the architecture that is trained with the sentence embeddings. We take the embeddings that perform best on the textual similarity tasks, ParaNMT+UP, and follow the setup in Wieting et al. (2016b). As seen in Table 3, both SIF weighting with common component removal (Arora et al., 2017) and uSIF weighting with piecewise common component removal (ours) perform slightly better than simple word averaging, but not as well as more sophisticated models. Past work has found that tuning the word embeddings in addition to the parameters of the model yields much better performance (Wieting et al., 2016b), as does increasing the size of the hidden layer in the classifier (Arora et al., 2017). The results here, however, suggest that regardless of such changes, our approach would not be any more effective than Arora et al.'s on these tasks. Still, our approach retains the advantage of being a completely unsupervised method that can be used when there is no labelled data.

                                     SST    SICK-R  SICK-E
(Arora et al., 2017)                 80.5   83.9    80.9
ParaNMT+UP† (ours)                   80.7   83.8    81.1
Other Approaches
BiLSTM-Max (on AllNLI)               84.6   88.4    86.3
skip-thought (Kiros et al., 2015)    82.0   85.8    82.3
BYTE mLSTM (Radford et al., 2017)    91.8   79.2    -

Table 3: Results on the SST, SICK-R, and SICK-E tasks. The best score for each task is bolded. † indicates our implementation.

Future Work
There are several possibilities for future work. For one, the values we derived for Z_{c̃_s}, α, a, and {λ_i} are not necessarily optimal. While they are based on reasonable assumptions, there are likely sentence-specific and task-specific values that yield better results. Hyperparameter search is one way of finding these values, but that would require supervision. It may be possible, however, to theoretically derive more optimal values.

Conclusion
We first showed that word vector length has a confounding effect on the log-linear random walk model of text generation (Arora et al., 2017), the basis of a strong baseline method for sentence embeddings. We then proposed an angular distance-based random walk model in which the probability of a sentence being generated is robust to distortion from word vector length. From this model, we derived a simple approach for creating sentence embeddings: normalize the word vectors, compute a weighted average, and then modify it using SVD. Unlike in Arora et al., our approach does not require hyperparameter tuning; it is completely unsupervised and can therefore be used when there is no labelled data. Our approach outperforms Arora et al.'s by up to 44.4% on textual similarity tasks and is even competitive with state-of-the-art methods. Because our simple approach is tough to beat, robust, and unsupervised, it is an ideal baseline for computing sentence embeddings.