Learning Compressed Sentence Representations for On-Device Text Processing

Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms, such as mobile devices. In this paper, we propose four different strategies to transform continuous and generic sentence embeddings into a binarized form, while preserving their rich semantic information. The introduced methods are evaluated across a wide range of downstream tasks, where the binarized sentence embeddings are demonstrated to degrade performance by only about 2% relative to their continuous counterparts, while reducing the storage requirement by over 98%. Moreover, with the learned binary representations, the semantic relatedness of two sentences can be evaluated by simply calculating their Hamming distance, which is more computationally efficient than the inner product operation between continuous embeddings. Detailed analyses and a case study further validate the effectiveness of the proposed methods.


Introduction
Learning general-purpose sentence representations from large training corpora has received widespread attention in recent years. The learned sentence embeddings can encapsulate rich prior knowledge of natural language, which has been demonstrated to facilitate a variety of downstream tasks (without fine-tuning the encoder weights). Generic sentence embeddings can be trained either in an unsupervised manner (Kiros et al., 2015; Hill et al., 2016; Jernite et al., 2017; Gan et al., 2017; Logeswaran and Lee, 2018; Pagliardini et al., 2018), or with supervised tasks such as paraphrase identification (Wieting et al., 2016), natural language inference (Conneau et al., 2017), discourse relation classification (Nie et al., 2017), machine translation (Wieting and Gimpel, 2018), etc.
Significant effort has been devoted to designing better training objectives for learning sentence embeddings. However, prior methods typically assume that the general-purpose sentence representations are continuous and real-valued. This assumption is sub-optimal from the following perspectives: i) the sentence embeddings require a large storage or memory footprint; ii) it is computationally expensive to retrieve semantically similar sentences, since every sentence representation in the database needs to be compared, and the inner product operation is computationally involved. These two disadvantages hinder the applicability of generic sentence representations to mobile devices, where only a relatively tiny memory footprint and low computational capacity are typically available (Ravi and Kozareva, 2018).
In this paper, we aim to mitigate the above issues by binarizing the continuous sentence embeddings. Consequently, the embeddings require a much smaller memory footprint, and similar sentences can be obtained by simply selecting those with the closest binary codes in the Hamming space (Kiros and Chan, 2018). One simple idea is to naively binarize the continuous vectors by setting a hard threshold. However, we find that this strategy leads to a significant performance drop in the empirical results. Besides, the dimension of the binary sentence embeddings cannot be flexibly chosen with this strategy, further limiting the practical use of the direct binarization method.
In this regard, we propose three alternative strategies to parametrize the transformation from pre-trained generic continuous embeddings to their binary forms. Our exploration spans from simple operations, such as a random projection, to deep neural network models, such as a regularized autoencoder. In particular, we introduce a semantic-preserving objective, which augments the standard autoencoder architecture to encourage abstracting informative binary codes. InferSent (Conneau et al., 2017) is employed as the testbed sentence embedding in our experiments, but the binarization schemes proposed here can easily be extended to other pre-trained general-purpose sentence embeddings. We evaluate the quality of the learned general-purpose binary representations using the SentEval toolkit (Conneau and Kiela, 2018). It is observed that the inferred binary codes successfully maintain the semantic features contained in the continuous embeddings, and lead to only around a 2% performance drop on a set of downstream NLP tasks, while requiring merely 1.5% of the memory footprint of their continuous counterparts.
Moreover, on several sentence matching benchmarks, we demonstrate that the relatedness between a sentence pair can be evaluated by simply calculating the Hamming distance between their binary codes, which performs on par with or even better than measuring the cosine similarity between continuous embeddings (see Table 1). Note that computing the Hamming distance is much more computationally efficient than the inner product operation in a continuous space. We further perform a K-nearest-neighbor sentence retrieval experiment on the SNLI dataset (Bowman et al., 2015), and show that semantically similar sentences can indeed be efficiently retrieved with off-the-shelf binary sentence representations. Summarizing, our contributions in this paper are as follows: i) to the best of our knowledge, we conduct the first systematic exploration of learning general-purpose binarized (memory-efficient) sentence representations, and four different strategies are proposed; ii) an autoencoder architecture with a carefully designed semantic-preserving loss exhibits strong empirical results on a set of downstream NLP tasks; iii) more importantly, we demonstrate, on several sentence-matching datasets, that simply evaluating the Hamming distance over binary representations performs on par with or even better than calculating the cosine similarity between their continuous counterparts (which is less computationally efficient).

Related Work
Sentence representations pre-trained from a large amount of data have been shown to be effective when transferred to a wide range of downstream tasks. Prior work along this line can be roughly divided into two categories: i) pre-trained models that require fine-tuning on the specific transfer task (Dai and Le, 2015; Ruder and Howard, 2018; Radford et al., 2018; Devlin et al., 2018; Cer et al., 2018); ii) methods that extract general-purpose sentence embeddings, which can be effectively applied to downstream NLP tasks without fine-tuning the encoder parameters (Kiros et al., 2015; Hill et al., 2016; Jernite et al., 2017; Gan et al., 2017; Adi et al., 2017; Logeswaran and Lee, 2018; Pagliardini et al., 2018; Tang and de Sa, 2018). Our proposed methods belong to the second category and provide a generic and easy-to-use encoder to extract highly informative sentence representations. However, our work is unique in that the embeddings inferred from our models are binarized and compact, and thus possess the advantages of a small memory footprint and much faster sentence retrieval.
Learning memory-efficient embeddings with deep neural networks has attracted substantial attention recently. One general strategy towards this goal is to extract discrete or binary data representations (Jang et al., 2016; Shu and Nakayama, 2017; Dai et al., 2017; Chen et al., 2018; Shen et al., 2018; Tissier et al., 2019). Binarized embeddings are especially attractive because they are more memory-efficient (relative to discrete embeddings), and they also enjoy the advantage of fast retrieval based upon Hamming distance calculations. Previous work along this line in NLP has mainly focused on learning compact representations at the word level (Shu and Nakayama, 2017; Chen et al., 2018; Tissier et al., 2019), while much less effort has been devoted to extracting binarized embeddings at the sentence level. Our work aims to bridge this gap, and serves as an initial attempt to facilitate the deployment of state-of-the-art sentence embeddings in on-device mobile applications.
Our work is also related to prior research on semantic hashing, which aims to learn binary text embeddings specifically for the information retrieval task (Salakhutdinov and Hinton, 2009; Zhang et al., 2010; Wang et al., 2014; Xu et al., 2015; Shen et al., 2018). However, these methods are typically trained and evaluated on documents that belong to a specific domain, and thus cannot serve as generic binary sentence representations applicable to a wide variety of NLP tasks. In contrast, our model is trained on large corpora and seeks to provide general-purpose binary representations that can be leveraged in various application scenarios.

Proposed Approach
We aim to produce compact and binarized representations from continuous sentence embeddings, while preserving the associated semantic information. Let x and f denote, respectively, an input sentence and the function defined by a pre-trained general-purpose sentence encoder. Thus, f(x) represents the continuous embedding extracted by the encoder. The goal of our model is to learn a universal transformation g that can convert f(x) into a highly informative binary sentence representation, i.e., g(f(x)), which can be used as a generic feature for a collection of downstream tasks. We explore four strategies to parametrize the transformation g.

Hard Threshold
We use h and b to denote the continuous and binary sentence embeddings, respectively, and L denotes the dimension of h. The first method binarizes the continuous representation by simply converting each dimension to either 0 or 1 based on a hard threshold. This strategy requires no training and directly operates on pre-trained continuous embeddings. With s as the hard threshold, we have, for i = 1, 2, …, L:

b^(i) = 1 if h^(i) > s, and b^(i) = 0 otherwise.   (1)

One potential issue with this direct binarization is that much of the information contained in the continuous representations may be lost, since no training objective encourages the preservation of semantic information in the produced binary codes (Shen et al., 2018). Another disadvantage is that the length of the resulting binary code must equal that of the original continuous representation, and cannot be flexibly chosen. In practice, however, we may want to learn shorter binary embeddings to further reduce the memory footprint or computation.
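As an illustration, the hard-threshold rule described above can be sketched in a few lines of NumPy (the threshold value and the toy embedding are illustrative):

```python
import numpy as np

def hard_threshold_binarize(h: np.ndarray, s: float = 0.0) -> np.ndarray:
    # Each dimension becomes 1 if it exceeds the threshold s, and 0 otherwise.
    return (h > s).astype(np.uint8)

h = np.array([0.3, -1.2, 0.0, 2.1])    # stand-in for a continuous embedding
b = hard_threshold_binarize(h, s=0.0)  # -> array([1, 0, 0, 1], dtype=uint8)
```

Note that the output necessarily has the same length as the input, which is exactly the inflexibility discussed above.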

Random Projection
To tackle the limitations of the above direct binarization method, we consider an alternative strategy that also requires no training: simply applying a random projection over the pre-trained continuous representations. Wieting and Kiela (2018) showed that random sentence encoders can effectively construct universal sentence embeddings from word vectors, while possessing the flexibility of adaptively altering the embedding dimension.
Here, we are interested in exploring whether a random projection would also work well while transforming continuous sentence representations into their binary counterparts.
We randomly initialize a matrix W ∈ R^{D×L}, where D denotes the dimension of the resulting binary representations. Inspired by the standard initialization heuristic employed in (Glorot and Bengio, 2010; Wieting and Kiela, 2018), the entries of the matrix are sampled uniformly. For i = 1, 2, …, D and j = 1, 2, …, L, we have:

W_{ij} ∼ Uniform(−1/√L, 1/√L).   (2)

After converting the continuous sentence embeddings to the desired dimension D with this randomly initialized matrix, we again apply the operation in (1) to binarize them into the discrete/compact form. The dimension D can be set arbitrarily with this approach, which is easily applicable to any pre-trained sentence embeddings (since no training is needed). This strategy is related to Locality-Sensitive Hashing (LSH) for inferring binary embeddings (Van Durme and Lall, 2010).
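A minimal NumPy sketch of this strategy (the dimensions are illustrative; the uniform range follows the initialization heuristic described above):

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4096, 1024                        # input / output dimensions (illustrative)

# Entries sampled uniformly from [-1/sqrt(L), 1/sqrt(L)].
W = rng.uniform(-1.0 / np.sqrt(L), 1.0 / np.sqrt(L), size=(D, L))

h = rng.standard_normal(L)               # stand-in for a pre-trained embedding
b = (W @ h > 0).astype(np.uint8)         # project, then hard-threshold at 0
```

Since no parameters are learned, the same random W can be fixed once and reused for every sentence.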

Principal Component Analysis
We also consider an alternative strategy to adaptively choose the dimension of the resulting binary representations. Specifically, Principal Component Analysis (PCA) is utilized to reduce the dimensionality of the pre-trained continuous embeddings.
Given a set of sentences {x_i}_{i=1}^N and their corresponding continuous embeddings {h_i}_{i=1}^N ⊂ R^L, we learn a projection matrix that reduces the embedding dimension while keeping the embeddings as distinct as possible. After centralizing the embeddings as h̄_i = h_i − (1/N) Σ_{j=1}^N h_j, the matrix H = (h̄_1, h̄_2, …, h̄_N) has the singular value decomposition (SVD):

H = U Λ V^T,   (3)

where Λ is an L × N matrix with the descending singular values of H on its diagonal, and U and V are orthogonal matrices. The correlation matrix can then be written as:

H H^T = U Λ Λ^T U^T = U diag(λ_1, λ_2, …, λ_L) U^T,

where λ_i denotes the i-th largest eigenvalue of H H^T. We select the first D rows of U^T (i.e., the top-D left singular vectors) as our projection matrix W = U_{1:D}; the correlation matrix of W H is then W H H^T W^T = diag(λ_1, λ_2, …, λ_D), which indicates that the embeddings are projected onto the D independent and most distinctive axes.
After projecting the continuous embeddings into this representative lower-dimensional space, we apply the hard threshold function at position 0 to obtain the binary representations (since the embeddings are zero-centered).

Autoencoder Architecture
The methods proposed above suffer from the common issue that the model objective does not explicitly encourage the learned binary codes to retain the semantic information of the original continuous embeddings, and a separate binarization step is employed after training. To address this shortcoming, we further consider an autoencoder architecture that leverages a reconstruction loss to endow the learned binary representations with more information. Specifically, an encoder network is utilized to transform the continuous embedding into a binary latent vector, which is then reconstructed back with a decoder network.
For the encoder network, we use a matrix transformation, followed by a binarization step, to extract useful features (similar to the random projection setup). Thus, for i = 1, 2, …, D, we have:

b^(i) = 1 if σ(W h + k)^(i) > s^(i), and b^(i) = 0 otherwise,   (4)

where σ(·) is the sigmoid function, k is the bias term, and k^(i) corresponds to the i-th element of k. s^(i) denotes the threshold determining whether the i-th bit is 0 or 1. During training, we may apply either deterministic or stochastic binarization to the latent variable. In the deterministic case, s^(i) = 0.5 for all dimensions; in the stochastic case, s^(i) is sampled uniformly: s^(i) ∼ Uniform(0, 1). We conduct an empirical comparison between these two binarization strategies in Section 4.
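The two binarization variants of the encoder can be sketched in NumPy as follows (W, k, and the input are stand-ins; gradient handling during training is omitted here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_binary(h, W, k, stochastic=False, rng=None):
    z = sigmoid(W @ h + k)                       # squash pre-activations to (0, 1)
    if stochastic:
        s = rng.uniform(0.0, 1.0, size=z.shape)  # s^(i) ~ Uniform(0, 1)
    else:
        s = 0.5                                  # deterministic threshold
    return (z > s).astype(np.uint8)

rng = np.random.default_rng(2)
W, k = rng.standard_normal((8, 16)), np.zeros(8)
h = rng.standard_normal(16)
b_det = encode_binary(h, W, k)                            # deterministic variant
b_sto = encode_binary(h, W, k, stochastic=True, rng=rng)  # stochastic variant
```

In the stochastic case, each bit is effectively a Bernoulli draw with probability σ(Wh + k)^(i).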
Prior work has shown that linear decoders are favorable for learning binary codes under the encoder-decoder framework (Carreira-Perpinán and Raziperchikolaei, 2015; Dai et al., 2017; Shen et al., 2018). Inspired by these results, we employ a linear transformation to reconstruct the original continuous embeddings from the binary codes:

ĥ = W′ b + k′,   (5)

where W′ and k′ are the learned weight matrix and bias term, respectively. The mean squared error between h and ĥ is employed as the reconstruction loss:

L_rec = ‖h − ĥ‖₂².   (6)

This objective encourages the binary vector b to encode more of the information in h (leading to a smaller reconstruction error). The straight-through (ST) estimator (Hinton, 2012) is utilized to estimate the gradients through the binary variables. The autoencoder model is optimized by minimizing the reconstruction loss over all sentences. After training, the encoder network serves as the transformation that converts pre-trained continuous embeddings into binary form.
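A sketch of the linear decoder and the reconstruction loss in NumPy (forward computation only; in actual training, the gradient of the binarization step would be approximated with the straight-through estimator):

```python
import numpy as np

def decode(b, W_dec, k_dec):
    # Linear reconstruction of the continuous embedding from the binary code.
    return W_dec @ b + k_dec

def reconstruction_loss(h, h_hat):
    # Mean squared error between the original and reconstructed embeddings.
    return float(np.mean((h - h_hat) ** 2))

rng = np.random.default_rng(3)
b = rng.integers(0, 2, size=8).astype(np.float64)   # a binary code
W_dec, k_dec = rng.standard_normal((16, 8)), np.zeros(16)
h = rng.standard_normal(16)                          # original embedding
loss = reconstruction_loss(h, decode(b, W_dec, k_dec))
```

The straight-through trick simply copies the gradient across the thresholding operation, treating it as the identity in the backward pass.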

Semantic-preserving Regularizer
Although the reconstruction objective helps endow the binary variables with richer semantics, there is no loss that explicitly encourages the binary vectors to preserve the similarity information contained in the original continuous embeddings. Consequently, the model may achieve a small reconstruction error yet yield sub-optimal binary representations (Tissier et al., 2019). To improve the semantic-preserving property of the inferred binary embeddings, we introduce an additional objective term. Consider a triplet of sentences (x_α, x_β, x_γ) whose continuous embeddings are (h_α, h_β, h_γ), respectively. Suppose that the cosine similarity between h_α and h_β is larger than that between h_β and h_γ; then it is desirable that the Hamming distance between b_α and b_β be smaller than that between b_β and b_γ (notably, both a large cosine similarity and a small Hamming distance indicate that two sentences are semantically similar).
Let d_c(·, ·) and d_h(·, ·) denote the cosine similarity and Hamming distance (in the continuous and binary embedding spaces), respectively. Define l_{α,β,γ} as an indicator such that l_{α,β,γ} = 1 if d_c(h_α, h_β) ≥ d_c(h_β, h_γ), and l_{α,β,γ} = −1 otherwise. The semantic-preserving regularizer is then defined as:

L_sp = Σ_{(α,β,γ)} max(0, l_{α,β,γ} · (d_h(b_α, b_β) − d_h(b_β, b_γ))).   (7)

By penalizing L_sp, the learned transformation g is explicitly encouraged to retain the semantic similarity information of the original continuous embeddings. Thus, the entire objective function to be optimized is:

L = L_rec + λ_sp L_sp,   (8)

where λ_sp controls the relative weight between the reconstruction loss (L_rec) and the semantic-preserving loss (L_sp).
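For a single triplet, this hinge-style penalty can be sketched as follows (a minimal NumPy version; the helper names are ours):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hamming(p, q):
    return int(np.sum(p != q))

def sp_loss(h_a, h_b, h_g, b_a, b_b, b_g):
    # l = 1 when (alpha, beta) is the more similar pair in the continuous space,
    # in which case its Hamming distance should be the smaller one.
    l = 1.0 if cosine(h_a, h_b) >= cosine(h_b, h_g) else -1.0
    return max(0.0, l * (hamming(b_a, b_b) - hamming(b_b, b_g)))
```

The loss is zero whenever the Hamming ordering of the binary codes agrees with the cosine ordering of the continuous embeddings, and grows with the size of the violation otherwise.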

Discussion
Another possible strategy is to directly train general-purpose binary embeddings from scratch, i.e., jointly optimizing the continuous-embedding training objective and the continuous-to-binary parametrization. However, our initial attempts demonstrated that this strategy leads to inferior empirical results. This observation is consistent with the results reported in (Kiros and Chan, 2018), where a binarization layer is directly appended to the InferSent architecture (Conneau et al., 2017) during training, which gives rise to a much larger drop in embedding quality (we conduct empirical comparisons with (Kiros and Chan, 2018) in Table 1).
Therefore, here we focus on learning universal binary embeddings based on pre-trained continuous sentence representations.
4 Experimental Setup

Pre-trained Continuous Embeddings
Our proposed model aims to produce highly informative binary sentence embeddings based upon pre-trained continuous representations. In this paper, we utilize InferSent (Conneau et al., 2017) as the continuous embeddings (given its effectiveness and widespread use). Note that all four proposed strategies can easily be extended to other pre-trained general-purpose sentence embeddings as well. Specifically, a bidirectional LSTM architecture with a max-pooling operation over the hidden units is employed as the sentence encoder, and the model parameters are optimized on natural language inference tasks, i.e., the Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) and Multi-Genre Natural Language Inference (MultiNLI) (Williams et al., 2017) datasets.

Training Details
Our model is trained using Adam (Kingma and Ba, 2014), with a learning rate of 1 × 10^{-5} for all parameters. The number of bits (i.e., the dimension) of the binary representation is set to 512, 1024, 2048 or 4096; the best choice for each model is made on the validation set, and the corresponding test results are presented in Table 1. The batch size is set to 64 for all model variants. The hyperparameter λ_sp is selected from {0.2, 0.5, 0.8, 1} on the validation set, and 0.8 is found to deliver the best empirical results. Training with the autoencoder setup takes only about 1 hour to converge, and the method is thus readily applicable to even larger datasets.

Evaluation
To facilitate comparisons with other baseline methods, we use the SentEval toolkit (Conneau and Kiela, 2018) to evaluate the learned binary (compact) sentence embeddings. Concretely, the learned representations are tested on a series of downstream tasks to assess their transferability (with the encoder weights fixed), which can be categorized as follows:
• Sentence classification, including sentiment analysis (MR, SST), product reviews (CR), subjectivity classification (SUBJ), opinion polarity detection (MPQA) and question type classification (TREC). A linear classifier is trained with the generic sentence embeddings as the input features. The default SentEval settings are used for all the datasets.
• Sentence matching, which comprises semantic relatedness (SICK-R, STS14, STSB) and paraphrase detection (MRPC). In particular, each pair of sentences in the STS14 dataset is associated with a similarity score from 0 to 5 (as the corresponding label). The Hamming distance between the binary representations is directly leveraged as the prediction score (without any classifier parameters).
For the sentence matching benchmarks, to allow fair comparison with the continuous embeddings, we do not use the classifier architecture in SentEval. Instead, we obtain the predicted relatedness by directly computing the cosine similarity between the continuous embeddings. Consequently, there are no classifier parameters for either the binary or the continuous representations. The same evaluation metrics as in SentEval (Conneau and Kiela, 2018) are utilized for all the tasks. For MRPC, predictions are made by simply judging whether a sentence pair's score is larger or smaller than the averaged Hamming distance (or cosine similarity).
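The two parameter-free scoring rules can be sketched as follows (the function names are ours; for MRPC-style decisions, a pair is predicted positive when its score exceeds the corpus average):

```python
import numpy as np

def cosine_score(h1, h2):
    # Relatedness for continuous embeddings.
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

def hamming_score(b1, b2):
    # Negated Hamming distance for binary codes, so higher means more related.
    return -int(np.sum(b1 != b2))

def predict(scores):
    # MRPC-style decision rule: positive if above the average score.
    mean = np.mean(scores)
    return [s > mean for s in scores]
```

Either scoring function can be plugged into predict(), so the binary and continuous representations are evaluated under an identical, classifier-free protocol.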

Baselines
We consider several strong baselines against which to compare the proposed methods, including both continuous (dense) and binary (compact) representations. For continuous generic sentence embeddings, we make comparisons with fastText-BoV (Joulin et al., 2016), Skip-Thought Vectors (Kiros et al., 2015) and InferSent (Conneau et al., 2017). As to binary embeddings, we consider the binarized version of InferLite (Kiros and Chan, 2018), which, to the best of our knowledge, is the only reported general-purpose binary representation baseline.

Experimental Results
We experimented with five model variants to learn general-purpose binary embeddings: HT-binary (hard threshold, selected from {0, 0.01, 0.1} on the validation set), Rand-binary (random projection), PCA-binary (dimensionality reduction with principal component analysis), AE-binary (autoencoder with the reconstruction objective) and AE-binary-SP (autoencoder with both the reconstruction objective and the semantic-preserving loss). Our code will be released to encourage future research.

Task transfer evaluation
We evaluate the binary sentence representations produced by the different methods on a set of transfer tasks. The results are shown in Table 1.
The proposed autoencoder architecture generally demonstrates the best results. Especially when combined with the semantic-preserving loss defined in (7), AE-binary-SP exhibits higher performance than a standard autoencoder. It is worth noting that the Rand-binary and PCA-binary model variants also show competitive performance despite their simplicity. These strategies are quite promising, since they require no training beyond the pre-trained continuous sentence representations.
Another important result is that AE-binary-SP achieves competitive results relative to InferSent, incurring only about a 2% loss on most datasets and even performing on par with InferSent on several datasets, such as MPQA and STS14. On the sentence matching tasks, the yielded binary codes are evaluated merely via Hamming distance features (as mentioned above). To allow fair comparison, we compare the predicted scores with the cosine similarity scores based upon the continuous representations (there are no additional classifier parameters in either case). The binary codes yield promising empirical results relative to their continuous counterparts, and even slightly outperform InferSent on the STS14 dataset.
We also found that our AE-binary-SP model variant consistently demonstrates superior results to the InferLite baselines, which optimize the NLI objective directly over the binary representations. This may be attributed to the difficulty of backpropagating gradients through discrete/binary variables, and would be an interesting direction for future research.

Nearest Neighbor Retrieval
Case Study One major advantage of binary sentence representations is that the similarity of two sentences can be evaluated by merely calculating the Hamming distance between their binary codes. To gain more intuition regarding the semantic information encoded in the binary embeddings, we convert all the sentences in the SNLI dataset into continuous and binary vectors (with InferSent-G and AE-binary-SP, respectively). The top-3 closest sentences are retrieved based upon the corresponding metrics, and the resulting samples are shown in Table 2. It can be observed that the sentences selected based upon the Hamming distance indeed convey very similar semantic meanings. In some cases, the results with binary codes are even more reasonable than those with the continuous embeddings. For example, for the first query, all three sentences in the left column relate to "watching a movie", while one of the sentences in the right column is about "sleeping".
Retrieval Speed Bitwise comparison is much faster than the element-wise multiplication between real-valued vectors (Tissier et al., 2019). To verify the speed improvement, we sample 10,000 sentence pairs from SNLI and extract their continuous and binary embeddings (both of dimension 4096). We record the time needed to compute the cosine similarity and the Hamming distance between the corresponding representations. With our Python implementation, these take 3.67 µs and 288 ns respectively, indicating that calculating the Hamming distance is over 12 times faster. Our implementation is not optimized, and the running time of computing the Hamming distance can be further improved (to be proportional to the number of differing bits, rather than the input length).
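One common way to exploit this speed advantage in practice is to pack the binary code into machine words and use XOR plus popcount; the following NumPy sketch illustrates the idea (this packing step is our own illustration, not the paper's implementation):

```python
import numpy as np

def pack(bits: np.ndarray) -> np.ndarray:
    # 4096 bits -> 512 bytes; distance computations then touch 8 bits at a time.
    return np.packbits(bits.astype(np.uint8))

def hamming_packed(p: np.ndarray, q: np.ndarray) -> int:
    # XOR marks the differing bits; counting the set bits gives the Hamming distance.
    return int(np.unpackbits(np.bitwise_xor(p, q)).sum())

rng = np.random.default_rng(4)
b1 = rng.integers(0, 2, 4096)
b2 = rng.integers(0, 2, 4096)
assert hamming_packed(pack(b1), pack(b2)) == int(np.sum(b1 != b2))
```

With hardware popcount instructions, the XOR-and-count loop over packed words is what makes Hamming retrieval scale to massive corpora.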

The effect of semantic-preserving loss
To investigate the importance of incorporating the locality-sensitive regularizer, we select different values of λ_sp (ranging from 0.0 to 1.0) and explore how the transfer results change accordingly. λ_sp controls the relative weight of the semantic-preserving loss term. As shown in Table 3, augmenting the training objective with the semantic-preserving loss consistently improves the quality of the learned binary embeddings, with the best test accuracy on the MR dataset obtained at λ_sp = 0.8.
Table 2: Nearest neighbor retrieval results on the SNLI dataset. Given a query sentence, the left column shows the top-3 retrieved samples based upon the Hamming distance between the binary representations of all sentences, while the right column exhibits the samples retrieved according to the cosine similarity of their continuous embeddings.
Table 3: Ablation study for the AE-binary-SP model with different choices of λ_sp (evaluated with test accuracy on the MR dataset).

Sampling strategy
As discussed in Section 3.4, the binary latent vector b can be obtained with either a deterministic or a stochastically sampled threshold. We compare these two sampling strategies on several downstream tasks. As illustrated in Figure 2, setting a fixed threshold demonstrates better empirical performance on all the datasets. Therefore, a deterministic threshold is employed for all the autoencoder model variants in our experiments.
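The two strategies can be sketched as follows, assuming (as is common for such autoencoders, though not spelled out here) that the latent activations lie in (0, 1), e.g. after a sigmoid: the deterministic variant thresholds at a fixed value, while the stochastic variant treats each activation as a Bernoulli probability.

```python
import random

def binarize_deterministic(z, threshold=0.5):
    # Fixed threshold: a bit is 1 whenever its activation exceeds the threshold
    return [1 if v > threshold else 0 for v in z]

def binarize_stochastic(z, rng=random):
    # Stochastic threshold: sample each bit as a Bernoulli draw with
    # probability equal to the activation value
    return [1 if rng.random() < v else 0 for v in z]

z = [0.9, 0.2, 0.6, 0.4]
print(binarize_deterministic(z))  # -> [1, 0, 1, 0]
```

The stochastic variant injects noise during training, while the deterministic variant yields reproducible codes at inference time.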

The effect of embedding dimension
Except for the hard-threshold method, the other three proposed strategies all possess the flexibility of adaptively choosing the dimension of the learned binary representations. To explore the sensitivity of the extracted binary embeddings to their dimension, we run four model variants (Rand-binary, PCA-binary, AE-binary, AE-binary-SP) with different numbers of bits (i.e., 512, 1024, 2048, 4096); the corresponding results on the MR dataset are shown in Figure 3. For the AE-binary and AE-binary-SP models, longer binary codes consistently deliver better results. For the Rand-binary and PCA-binary variants, by contrast, the quality of the inferred representations is much less sensitive to the embedding dimension. Notably, these two strategies exhibit competitive performance even with only 512 bits. Therefore, when a smaller memory footprint or little training is preferred, Rand-binary and PCA-binary could be more judicious choices.
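A Rand-binary-style sketch illustrates why no training is needed: a Gaussian projection matrix is sampled once, frozen, and followed by sign binarization. The function names and dimensions here are illustrative, not the paper's implementation.

```python
import random

def random_projection_matrix(out_dim, in_dim, seed=0):
    # Fixed Gaussian projection, sampled once and then frozen (no training)
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def rand_binary(x, W):
    # Project the continuous embedding, then binarize each coordinate by sign
    return [1 if sum(w * v for w, v in zip(row, x)) > 0 else 0 for row in W]

# Toy sizes for illustration; the paper compresses e.g. 4096-d vectors to 512 bits
W = random_projection_matrix(out_dim=64, in_dim=256)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(256)]
code = rand_binary(x, W)
print(len(code))  # -> 64
```

Since W is fixed by the seed, the same sentence embedding always maps to the same binary code, and the output dimension is a free parameter chosen at projection time.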

Conclusion
This paper presents a first step towards learning binary, general-purpose sentence representations that allow for efficient storage and fast retrieval over massive corpora. To this end, we explore four distinct strategies to convert pre-trained continuous sentence embeddings into a binarized form. Notably, a regularized autoencoder augmented with a semantic-preserving loss exhibits the best empirical results, degrading performance by only around 2% while reducing the memory footprint by over 98%. Besides, two other model variants, based on a random projection or a PCA transformation, require no training and demonstrate competitive embedding quality even with relatively small dimensions. Experiments on nearest-neighbor sentence retrieval further validate the effectiveness of the proposed framework.

Figure 1: Proposed model architectures: (a) direct binarization with a hard threshold s; (b) reducing the dimensionality with either a random projection or PCA, followed by a binarization step; (c) an encoding-decoding framework with an additional semantic-preserving loss.

Figure 2: The comparison between deterministic and stochastic sampling for the autoencoder strategy.

Figure 3: The test accuracy of different models on the MR dataset across 512, 1024, 2048, and 4096 bits for the learned binary representations.

Table 1: Performance on the test set for 10 downstream tasks. The STS14, STSB and MRPC tasks are evaluated with Pearson and Spearman correlations, and SICK-R is measured with Pearson correlation. All other datasets are evaluated with test accuracy. InferSent-G uses GloVe (G) word embeddings, while InferSent-FF employs FastText (F) embeddings with Fixed (F) padding. The empirical results of InferLite with different lengths of binary embeddings, i.e., 256, 1024 and 4096, are considered.