Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity

We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised state of the art (SoTA) on the STS Benchmark and on the STS12-STS16 datasets, with gains of between 3.7% and 6.4% Pearson's r over single-source systems.


Introduction
Word meta-embeddings have been shown to outperform single-source word embeddings on word-level semantic benchmarks (Yin and Schütze, 2016; Bollegala and Bao, 2018). Presumably, this is because they combine the complementary strengths of their components.
There has been recent interest in pre-trained "universal" sentence encoders, i.e., functions that encode diverse semantic features of sentences into fixed-size vectors (Conneau et al., 2017). Since these sentence encoders differ in terms of their architecture and training data, we hypothesize that their strengths are also complementary and that they can benefit from meta-embeddings.
To test this hypothesis, we adapt different meta-embedding methods from the word embedding literature. These include dimensionality reduction (Yin and Schütze, 2016), cross-view autoencoders (Bollegala and Bao, 2018) and Generalized Canonical Correlation Analysis (GCCA) (Rastogi et al., 2015). The latter method was also used by Poerner and Schütze (2019) for domain-specific Duplicate Question Detection.
Our sentence encoder ensemble includes three models from the recent literature: Sentence-BERT (Reimers and Gurevych, 2019), the Universal Sentence Encoder (Cer et al., 2018) and averaged ParaNMT vectors (Wieting and Gimpel, 2018). Our meta-embeddings outperform every one of their constituent single-source embeddings on STS12-16 (Agirre et al., 2016) and on the STS Benchmark (Cer et al., 2017). Crucially, since our meta-embeddings are agnostic to the size and specifics of their ensemble, future improvements may be possible by adding new encoders.

Related work
Word embeddings are functions that map word types to vectors. They are typically trained on unlabeled corpora and capture word semantics (e.g., Mikolov et al. (2013); Pennington et al. (2014)).
Word meta-embeddings combine ensembles of word embeddings by various operations: Yin and Schütze (2016) use concatenation, SVD and linear projection, while Coates and Bollegala (2018) show that averaging word embeddings has properties similar to concatenation. Rastogi et al. (2015) apply generalized canonical correlation analysis (GCCA) to an ensemble of word vectors.
Sentence embeddings are methods that produce one vector per sentence. They can be grouped into two categories: (a) Word embedding average sentence encoders take a (potentially weighted) average of pre-trained word embeddings. Despite their inability to understand word order, they are surprisingly effective on sentence similarity tasks (Arora et al., 2017; Wieting and Gimpel, 2018; Ethayarajh, 2018).
(b) Complex contextualized sentence encoders are based on architectures such as Long Short-Term Memory networks (LSTM) (Hochreiter and Schmidhuber, 1997) or Transformers (Vaswani et al., 2017). Contextualized encoders can be pre-trained as unsupervised language models (Peters et al., 2018; Devlin et al., 2019), but they are usually improved on supervised transfer tasks such as Natural Language Inference (Cer et al., 2018).
Sentence meta-embeddings are less frequently explored than their word-level counterparts. Kiela et al. (2018) create meta-embeddings by training an LSTM sentence encoder on top of a set of dynamically combined word embeddings. Since this approach requires labeled data, it is not applicable to unsupervised STS.
Tang and de Sa (2019) train a Recurrent Neural Network (RNN) and a word embedding average encoder jointly on a large corpus to predict similar representations for neighboring sentences. Their approach trains both encoders from scratch, i.e., it cannot be used to combine existing encoders. Poerner and Schütze (2019) propose a GCCA-based sentence meta-embedding that combines domain-specific and generic sentence encoders for unsupervised Duplicate Question Detection. In this paper, we extend their approach by exploring a wider range of meta-embedding methods and an ensemble that is more suited to STS.
Semantic Textual Similarity (STS) is the task of rating the similarity of two natural language sentences. Related applications are semantic search, duplicate detection and sentence clustering.
Supervised SoTA systems for STS typically apply cross-sentence attention (Devlin et al., 2019; Wang et al., 2019). This means that they are unable to cache embeddings, making them impractical for many downstream applications. Supervised "siamese" models (Reimers and Gurevych, 2019) are not competitive with cross-sentence attention, but they can cache embeddings. Our meta-embeddings are also cacheable (and hence efficient), but we do not need supervision.

Sentence Meta-Embeddings
Below, we assume access to an ensemble of pre-trained sentence encoders, denoted $F_1 \ldots F_J$. Every $F_j$ maps from the (infinite) set of possible sentences $\mathcal{S}$ to a fixed-size $d_j$-dimensional vector.
Word meta-embeddings are usually learned from a finite vocabulary of word types (Yin and Schütze, 2016). Sentence embeddings lack such a "vocabulary", as they can encode any member of $\mathcal{S}$. Therefore, we train on a sample $S \subset \mathcal{S}$.
SVD. This method can easily be extended to sentence meta-embeddings: let $X$ be the matrix of embeddings of the training sample $S$, and let $USV^T \approx X$ be its $d$-truncated SVD. The meta-embedding of a new sentence $s'$ is obtained by projecting its embedding with $V$.
GCCA. Given random vectors $x_1, x_2$, Canonical Correlation Analysis (CCA) finds linear projections such that $\theta_1^T x_1$ and $\theta_2^T x_2$ are maximally correlated. Generalized CCA (GCCA) extends CCA to three or more random vectors. Bach and Jordan (2002) show that a variant of GCCA reduces to a generalized eigenvalue problem on block matrices whose blocks $\Sigma$ are the (cross-)covariance matrices of $F_1 \ldots F_J$ estimated on $S$. For stability, we add a regularization term scaled by $\tau$, where $\tau$ is a hyperparameter. Given the top-$d$ eigenvectors $\Theta_j \in \mathbb{R}^{d_j \times d}$, the meta-embedding of sentence $s'$ is computed by projecting the output of every $F_j$ with $\Theta_j^T$.
Autoencoders. Autoencoder meta-embeddings are trained by gradient descent to minimize some cross-embedding reconstruction loss. For example, Bollegala and Bao (2018) train feed-forward networks (FFN) to encode two sets of word embeddings into a shared space, and then reconstruct them such that the mean squared error with the original embeddings is minimized. We extend their approach to sentence encoders as follows: every sentence encoder $F_j$ has a trainable encoder $E_j: \mathbb{R}^{d_j} \rightarrow \mathbb{R}^{d}$ and a trainable decoder $D_j: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_j}$, where $d$ is a hyperparameter. The meta-embedding of a new sentence $s'$ is computed from the encoded vectors $E_j(F_j(s'))$. Our training objective is to reconstruct every embedding $x_{j'}$ from every $E_j(x_j)$. This results in $J^2$ loss terms:
$$L(x_1 \ldots x_J) = \sum_{j=1}^{J} \sum_{j'=1}^{J} l\big(x_{j'}, D_{j'}(E_j(x_j))\big)$$
Neill and Bollegala (2018) propose different reconstruction loss functions for $l$: Mean Squared Error (MSE), Mean Absolute Error (MAE), KL-Divergence (KLD) or squared cosine distance $(1-\mathrm{COS})^2$.
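The two closed-form methods (truncated SVD and GCCA) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions that the text here does not confirm: embeddings are mean-centered on the training sample, the SVD input is the plain concatenation of the encoder outputs, the stabilizer is $\tau$ times the identity, and the GCCA meta-embedding is the sum of the projected views. All function names are hypothetical.

```python
import numpy as np

def fit_svd(X_conc, d):
    """X_conc: (n, sum_j d_j) concatenated embeddings of the training sample (assumed)."""
    mean = X_conc.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_conc - mean, full_matrices=False)
    return mean, Vt[:d]                      # top-d right singular vectors as rows

def svd_embed(x_conc, mean, Vd):
    return Vd @ (x_conc - mean)              # d-dimensional meta-embedding

def fit_gcca(views, d, tau=0.1):
    """views: list of (n, d_j) arrays, one per encoder, rows aligned by sentence."""
    means = [X.mean(axis=0) for X in views]
    dims = [X.shape[1] for X in views]
    Z = np.hstack([X - m for X, m in zip(views, means)])
    cov = Z.T @ Z / (Z.shape[0] - 1)         # all (cross-)covariance blocks
    B = np.zeros_like(cov)                   # block-diagonal part: covariances
    start = 0
    for dj in dims:
        sl = slice(start, start + dj)
        B[sl, sl] = cov[sl, sl]
        start += dj
    A = cov - B                              # off-diagonal part: cross-covariances
    B += tau * np.eye(B.shape[0])            # assumed form of the stabilizer
    # Solve A theta = rho B theta by whitening with the Cholesky factor of B.
    L_inv = np.linalg.inv(np.linalg.cholesky(B))
    M = L_inv @ A @ L_inv.T
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
    top = np.argsort(eigvals)[::-1][:d]
    theta = L_inv.T @ eigvecs[:, top]        # shape (sum_j d_j, d)
    thetas = np.split(theta, np.cumsum(dims)[:-1], axis=0)
    return means, thetas

def gcca_embed(xs, means, thetas):
    # Assumption: the projected, centered views are summed.
    return sum(T.T @ (x - m) for x, m, T in zip(xs, means, thetas))
```

Both fits run once on cached encoder outputs for the training sample; embedding a new sentence afterwards only needs one matrix-vector product per encoder.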

Experiments
Data. We train on all sentences of length < 60 from the first file (news.en-00001-of-00100) of the tokenized, lowercased Billion Word Corpus (Chelba et al., 2014) (approximately 302K sentences). We evaluate on STS12-16 (Agirre et al., 2016) and the unsupervised STS Benchmark test set (Cer et al., 2017). These datasets consist of triples $(s_1, s_2, y)$, where $s_1, s_2$ are sentences and $y$ is their ground truth semantic similarity. The task is to predict similarity scores $\hat{y}$ that correlate well with $y$. We predict $\hat{y} = \cos(F(s_1), F(s_2))$.
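The prediction and scoring step can be sketched in plain Python as follows. The helper names and the `embed` callable (standing in for any sentence encoder or meta-embedding $F$) are hypothetical, and the rank function assigns average ranks to ties, which is one common convention and not necessarily the one used by the official evaluation scripts.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(xs, ys):
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(ranks(xs), ranks(ys))

def evaluate(embed, triples):
    """triples: iterable of (s1, s2, y); embed: sentence -> vector (hypothetical)."""
    preds = [cosine(embed(s1), embed(s2)) for s1, s2, _ in triples]
    gold = [y for _, _, y in triples]
    return pearson(preds, gold), spearman(preds, gold)
```

Because `embed` is called once per sentence and the score is a plain cosine, embeddings can be precomputed and reused across pairs.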
Metrics. Previous work on STS differs with respect to (a) the correlation metric and (b) how to aggregate the sub-test sets of STS12-16. To maximize comparability, we report both Pearson's r and Spearman's ρ. On STS12-16, we aggregate by a non-weighted average, which diverges from the original shared tasks (Agirre et al., 2016) but ensures comparability with more recent baselines (Wieting and Gimpel, 2018; Ethayarajh, 2018). Results for individual STS12-16 sub-test sets can be found in the Appendix.
Ensemble. Our ensemble contains three sentence encoders: Universal Sentence Encoder (USE) (Cer et al., 2018), Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) and ParaNMT (Wieting and Gimpel, 2018). SBERT is a pre-trained BERT Transformer (Devlin et al., 2019) fine-tuned on Natural Language Inference. USE is a Transformer trained on skip-thought, conversation response prediction and Natural Language Inference. ParaNMT averages word and 3-gram vectors trained on synthetic similar sentence pairs generated by Machine Translation.
Hyperparameters. We set $d = 1024$ in all experiments, which corresponds to the embedding size of SBERT. The value of $\tau$ (GCCA), as well as the autoencoder depth and loss function, are tuned on the development set (Table 1). We train the autoencoder for a fixed number of 500 epochs with a batch size of 10,000 and the Adam optimizer (Kingma and Ba, 2014).
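For orientation, the cross-view reconstruction objective that the autoencoder minimizes (every embedding reconstructed from every encoded view, giving $J^2$ terms) can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: encoders and decoders are single linear layers (the paper tunes the depth), $l$ is MSE, the meta-embedding is taken to be the sum of the encoded views, and the training loop with Adam is omitted. All names are hypothetical.

```python
import numpy as np

def init_params(dims, d, rng):
    """One linear encoder E_j (d_j -> d) and decoder D_j (d -> d_j) per encoder."""
    enc = [rng.normal(scale=dj ** -0.5, size=(dj, d)) for dj in dims]
    dec = [rng.normal(scale=d ** -0.5, size=(d, dj)) for dj in dims]
    return enc, dec

def cross_view_loss(batch, enc, dec):
    """batch: list of (n, d_j) arrays. Returns the sum of the J*J MSE terms."""
    total = 0.0
    for X_j, E_j in zip(batch, enc):
        code = X_j @ E_j                          # E_j(x_j), shape (n, d)
        for X_k, D_k in zip(batch, dec):
            recon = code @ D_k                    # D_k(E_j(x_j)), shape (n, d_k)
            total += np.mean((recon - X_k) ** 2)  # l = MSE (one of several options)
    return total

def meta_embed(xs, enc):
    # Assumption: encoded views are summed into one d-dimensional vector.
    return sum(x @ E for x, E in zip(xs, enc))
```

In an actual training setup the loss would be minimized with an automatic-differentiation framework; the sketch only shows which terms enter the sum.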
Efficiency. All meta-embeddings are efficient to train, either because they have closed-form solutions (GCCA and SVD) or because they are lightweight FFNs (autoencoder). The underlying sentence encoders are more complex and slower, but since we do not train them, we can cache and reuse their outputs. At inference time, our meta-embeddings are cacheable and therefore scalable. For instance, to calculate the pairwise similarities of $N$ sentences, systems that use cross-sentence attention, such as the current STS SoTA by Wang et al. (2019), have to compute $N^2$ sentence pair embeddings, while we, like Reimers and Gurevych (2019), compute $N$ sentence embeddings and $N^2$ pairwise cosines.
Baselines. Our main baselines are our single-source embeddings. Wieting and Kiela (2019) warn that high-dimensional sentence embeddings can have an advantage over low-dimensional ones. To exclude this possibility, we also up-project smaller embeddings by a random $d \times d_j$ matrix sampled from $[-d_j^{-0.5}, d_j^{-0.5}]$. Since the up-projected embeddings perform slightly worse than their originals (see Table 2, rows 4-5), we are confident that performance gains by our meta-embeddings are due to content rather than size.
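A small plain-Python sketch of the random up-projection baseline and of the cached pairwise comparison follows. It assumes that the matrix entries are drawn uniformly from the stated interval (the text gives only the interval), and the function names are hypothetical.

```python
import math
import random

def random_up_projection(d, d_j, seed=0):
    """d x d_j matrix with entries drawn uniformly from [-d_j**-0.5, d_j**-0.5]."""
    rng = random.Random(seed)
    bound = d_j ** -0.5
    return [[rng.uniform(-bound, bound) for _ in range(d_j)] for _ in range(d)]

def project(matrix, x):
    """Map a d_j-dimensional embedding x to d dimensions."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]

def pairwise_cosines(embeddings):
    """N cached embeddings in, N x N cosine matrix out; no encoder calls needed."""
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(v * v for v in e))
        unit.append([v / norm for v in e])
    n = len(unit)
    return [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
            for i in range(n)]
```

The expensive encoders run once per sentence; all $N^2$ comparisons afterwards are dot products of normalized vectors.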
Discussion. Table 2 shows that even the worst of our meta-embeddings consistently outperforms its single-source components. This underlines the overall usefulness of ensembling sentence encoders, irrespective of the meta-embedding method. GCCA is the most successful meta-embedding method, beating the other methods on five out of six STS datasets. By contrast, SVD and the autoencoder fail to improve over naive concatenation or averaging. Note that concatenation is not directly comparable to the other meta-embeddings due to dimensionality.
To the best of our knowledge, our GCCA meta-embeddings set a new unsupervised SoTA on the STS Benchmark and on STS12-16. We close the gap between unsupervised systems and the supervised siamese SoTA of Reimers and Gurevych (2019), from 7% down to 2% Spearman's ρ on the STS Benchmark.
Ablation. Table 3 shows that all single-source embeddings contribute positively to the GCCA meta-embedding, which supports their hypothesized complementarity. The result suggests that further improvements may be possible by extending the ensemble.

Conclusion
Inspired by the success of word meta-embeddings, we have shown how to apply different meta-embedding techniques to ensembles of sentence encoders.
We have shown that all meta-embeddings consistently outperform their individual single-source components on the STS Benchmark and the STS12-16 datasets, with a new unsupervised SoTA set by our GCCA meta-embeddings. Because sentence meta-embeddings are agnostic to the size and specifics of their ensemble, we hope that this performance can improve further when we add new sentence encoders.