Word Embedding Binarization with Semantic Information Preservation

With the growing presence of machine learning in daily life, Natural Language Processing (NLP) has emerged as a heavily researched area. Finding applications in tasks ranging from simple Q/A chatbots to fully fledged conversational AI, NLP models are vital. Word and sentence embeddings are among the most common starting points of any NLP task. A word embedding represents a given word in a predefined vector space while maintaining vector relations with similar or dissimilar entities. Several pretrained embeddings, such as Word2Vec, GloVe and fastText, have been developed. These embeddings, generated over millions of words, are however very large in size. Storing embeddings at floating-point precision also makes downstream evaluation slow. In this paper we present a novel method to convert continuous embeddings into binary representations, reducing the overall size of the embedding while keeping the semantic and relational knowledge intact. This makes it feasible to port such large embeddings onto devices where space is limited. We also present different approaches suitable for different downstream tasks, depending on how much contextual and semantic information the task requires. Experiments show comparable results on downstream tasks with a 7- to 15-fold reduction in file size and about a 5% change in evaluation metrics.


Introduction
Natural Language Processing (NLP) is quickly becoming one of the most important branches of machine learning, with companies pouring millions into perfecting their NLP engines. NLP is the cornerstone on the road to developing a perfect conversational AI, but it also finds its way into more mundane yet important tasks of modern life, be it a simple chatbot, a Q/A site, or a document classifier. For most of these tasks, an embedding model or a pretrained embedding is the perfect starting point. A word embedding is a collection of words represented as vectors in a predefined space. Within this space the vectors follow all typical vector laws: similar words are close together, dissimilar words are far apart, and vector addition and subtraction are possible. Word embeddings are very important for downstream tasks like document classification, query response, etc. Though word embeddings can be generated for a specific task on a specific dataset, there are embeddings trained on very large corpora, making them useful across a variety of tasks and hence generic. Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) are some of the best generic embeddings available, trained on vocabularies of millions of words.
Though training on a large dataset makes an embedding useful, it also makes it very large in size. A 300-dimensional Word2Vec model trained with a 3-million-word vocabulary is around 3.4 GB. This large size makes the embedding hard to port, especially in low-storage scenarios. As a result, many NLP operations are typically performed on a server, which raises issues like privacy concerns, as user data is continuously sent to the server and processed there. The first step towards solving this issue is to bring the size of the embedding down far enough to carry it on devices such as mobile phones, where memory must be used judiciously. This would enable on-device training without dependency on the server side. In this paper we propose two methods to binarize a pretrained word embedding while keeping the semantic information intact. Our proposed methods are based on an autoencoder architecture that encodes continuous real vectors into corresponding binary vectors. The embeddings generated by the proposed methods perform well on different types of tasks, with one approach performing well on tasks that need contextual and semantic data, and the other performing well on tasks where such data is not crucial (the approaches are discussed in Section 3). The embeddings produced by both proposed variations are a fraction of the original size, with reductions of 7 to 15 times. We also demonstrate their performance on downstream tasks such as similarity, classification and analogy.

Related Work
Word embeddings are the starting point for many NLP applications, but these embeddings are usually very large, and subsequent computations on them take a lot of time. There is a need to reduce their memory footprint and make the computations faster. The idea is to binarize the embeddings, since a binary feature takes much less memory than a float-valued feature. Yi et al. (2015) proposed a nonlinear method that embeds high-dimensional data into a Hamming cube while maintaining the structure of the original space. As this targets dimensions far greater than those we deal with, the study did not prove useful for the scope of our research. Faruqui et al. (2015) proposed to binarize real-valued embeddings by first increasing the dimensionality of the original embedding to create a sparse matrix and then applying a binarization step. Though this retains a certain amount of semantic information, the produced vectors are not small. FastText.zip (Joulin et al., 2016) binarizes embeddings by clustering and concatenating the binary representations of the k closest centroids for each word; the resulting binary vectors cannot be used for generic tasks, as this works only for document classification. Even though binary representations can make computations faster, some NLP applications work only on real-valued embeddings (Ma and Hovy, 2016). In such applications, reconstructing real embeddings from binary representations becomes vital.
Tissier et al. (2019) showed that word embeddings can be binarized using an autoencoder architecture, but their method was not able to capture much of the semantic information. In this paper we manage to beat the benchmarks of Tissier et al. (2019) on most downstream tasks, as shown in Section 6. We achieve this by drawing on Shen et al. (2019), who binarize sentence embeddings and use them for downstream tasks. Taking these two studies as inspiration, we develop our binarized word embeddings, which perform very well on the benchmarks, as shown in Section 6.

Methodology
We propose a method that converts continuous word embeddings into a binary form. Let x be a word, A_x the continuous embedding of that word, B_x the binary embedding of x, and f a function that converts the continuous embedding A_x into the binarized embedding B_x. We aim to learn a global function f that generates binary embeddings of the highest quality. Quality here refers to how informative the binary embeddings are compared to the original ones, how much of the original semantic information they capture, and how much size reduction is obtained relative to the original embedding. Usability of these binary embeddings in downstream tasks is an important aspect, as our main aim is to substitute the original embedding with the binarized one and gain an advantage by reducing memory use while still having comparable performance.
The function f can be implemented in many ways to obtain binarized representations of continuous embeddings, but not all of them will show the quality required for substituting the actual embedding. Among these methods are hard thresholding, random projections and encoder-decoder models. Hard thresholding converts each feature of the embedding to a binary value: a feature takes the value 1 if it is above a specific threshold and 0 otherwise. The threshold can be selected in many ways; it can be zero, a random value, or the mean value of the features. The problem with this method is that a huge chunk of the information is lost outright, and there is a good chance the resulting embeddings will not capture the semantic content of the original. The random projection method randomly projects the already-learnt continuous embeddings onto a different vector space and then applies hard thresholding to the projections. The advantage of this method is that the dimensionality of the word embedding can be altered to the required size. The binary representations generated by hard thresholding and random projection both suffer from poor retention of semantic information and require an additional binarization step after the embedding has been trained. An autoencoder model can overcome both limitations: it retains the semantic information of the continuous embedding without requiring a separate binarization step after training.
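The two baseline binarization methods described above can be sketched as follows; the dimensions, seed and function names are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_threshold(emb, t=0.0):
    """Map each feature to 1 if it is >= t, else 0."""
    return (emb >= t).astype(np.uint8)

def random_projection_binarize(emb, out_dim, t=0.0):
    """Project onto out_dim random directions, then hard-threshold."""
    proj = rng.standard_normal((emb.shape[1], out_dim))
    return (emb @ proj >= t).astype(np.uint8)

emb = rng.standard_normal((5, 300))            # five toy word vectors
b_thresh = hard_threshold(emb)                 # 300-bit codes
b_proj = random_projection_binarize(emb, 640)  # 640-bit codes
```

Note that random projection lets us choose the output dimensionality freely, while plain thresholding keeps the input dimensionality.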

Auto Encoder Architecture
The above-mentioned limitations, the loss of semantic information and the need for an extra binarization step after training, can be overcome with an autoencoder architecture, which uses a reconstruction loss to ensure that as much information as possible is carried from the continuous representations into the binarized ones. We propose an encoder-decoder architecture: the encoder converts the continuous embedding into its binary representation, while the decoder converts the binary representation back into a continuous embedding.
The encoder network comprises a matrix operation followed by a binarization step. Let D be the dimension of the binary embedding, G the dimension of the original embedding, a a continuous embedding vector, and b its binarized counterpart. For i = 1, ..., D:

b(i) = 1 if (W a + j)(i) >= t, else 0

Here W is the D x G weight matrix, j(i) is the i-th element of the bias term j, and t is the threshold that determines whether the i-th feature of the vector becomes 1 or 0. During training we may use either a static or a dynamic value of t: in the static case t is zero, and in the dynamic case t is the mean value of the embedding. All values greater than or equal to t are mapped to 1, while all values below t are mapped to 0. A comparison of these two binarization strategies is discussed in the sections below. The details of the autoencoder architecture are shown in Figure 1.
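The encoder step with the two thresholding strategies might look like this minimal sketch (the weight initialization and variable names are our assumptions, not the paper's):

```python
import numpy as np

def encode(a, W, j, dynamic=False):
    """Encoder step: b(i) = 1 if (W a + j)(i) >= t, else 0.
    Static threshold: t = 0 (Variation 2 style).
    Dynamic threshold: t = mean of the pre-activation (Variation 3 style)."""
    z = W @ a + j
    t = z.mean() if dynamic else 0.0
    return (z >= t).astype(np.uint8)

rng = np.random.default_rng(1)
G, D = 300, 640                         # original and binary dimensions
W = rng.standard_normal((D, G)) * 0.01  # encoder weight matrix
j = np.zeros(D)                         # encoder bias
a = rng.standard_normal(G)              # one continuous word vector
b_static = encode(a, W, j)
b_dynamic = encode(a, W, j, dynamic=True)
```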
Studies of the encoder-decoder framework (Carreira-Perpinán and Raziperchikolaei, 2015; Dai and Le, 2015; Shen et al., 2018) have shown that linear decoders are the most favourable for learning binary representations. Inspired by this work, we employ a linear transformation to recreate the original embedding from the generated binary representation, â = V b + c, where V and c are the decoder's weight matrix and bias. This ensures that the learnt binary representation b carries more information from the continuous embedding a. The autoencoder model is optimized by minimizing the reconstruction loss L_rec = ||a - â||². Along with the reconstruction loss we use two different kinds of regularizers, the semantic preservation regularizer and the expansive regularizer, to ensure that the binarized embeddings carry as much information as possible. After training, we use the learnt parameters to create the binary embeddings from the original continuous embeddings.
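A sketch of the linear decoder and reconstruction loss; the decoder matrix V and bias c are our notation, since the paper does not name them:

```python
import numpy as np

def decode(b, V, c):
    """Linear decoder: map a binary code back to a continuous vector."""
    return V @ b + c

def reconstruction_loss(a, a_hat):
    """Squared error between the original and reconstructed vectors."""
    return float(np.sum((a - a_hat) ** 2))

rng = np.random.default_rng(2)
G, D = 300, 640
V = rng.standard_normal((G, D)) * 0.01  # decoder weights
c = np.zeros(G)                         # decoder bias
b = rng.integers(0, 2, size=D)          # a binary code
a_hat = decode(b, V, c)                 # reconstructed vector, shape (G,)
```

Since the thresholding step is non-differentiable, training the full autoencoder also needs a gradient workaround (not specified here), which the sketch omits.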

Semantic Preservation Regularizer
Even though the reconstruction loss tries to ensure that the binary representations are rich in semantics, it does not guarantee that they store the same information as the original continuous embeddings. The model may reach a low reconstruction error yet still not yield optimal results on downstream tasks. To retain as much semantic information as we can, we introduce an additional component.
Suppose we have a group of words (x_α, x_β, x_γ, x_δ), with continuous embeddings (a_α, a_β, a_γ, a_δ) and binary representations (b_α, b_β, b_γ, b_δ). If the cosine similarity between a_α and a_β is large, then the Hamming distance between b_α and b_β should be small: a large cosine similarity means that the words x_α and x_β are similar, and we try to capture this in the binary representations. The Hamming distance compares binary strings of equal length and is defined as the number of bit positions in which the strings differ. If the cosine similarity between a_α and a_β is larger than the cosine similarity between a_γ and a_δ, then the Hamming distance between b_α and b_β should preferably be smaller than the Hamming distance between b_γ and b_δ.
Let d_c denote cosine similarity in the continuous embedding space and d_h denote Hamming distance in the binary space. We introduce an indicator I_{α,β,γ,δ} that is 1 if d_c(a_α, a_β) >= d_c(a_γ, a_δ) and -1 otherwise. The regularizer is then defined as

L_sp = I_{α,β,γ,δ} ( d_h(b_α, b_β) - d_h(b_γ, b_δ) )

so that minimizing it encourages the learnt function to retain the semantic similarity ordering of the original embeddings. While training our autoencoder, to compute the semantic preservation regularizer we divide the training data into four randomly allocated sets of equal size, use the corresponding elements of these four groups as x_α, x_β, x_γ and x_δ respectively, and compute the regularizer value as defined above. To capture maximum semantic information, a fresh random permutation is used to form the four groups at each epoch.
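A minimal sketch of computing this regularizer over a batch, using the four-way random split described above (the function names and batch shapes are ours):

```python
import numpy as np

def cosine(u, v):
    """Row-wise cosine similarity between two batches of vectors."""
    num = np.sum(u * v, axis=-1)
    return num / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))

def hamming(p, q):
    """Row-wise Hamming distance between two batches of binary codes."""
    return np.sum(p != q, axis=-1)

def semantic_preservation(a, b):
    """Split a batch into four random groups (alpha, beta, gamma, delta)
    and penalize Hamming orderings that disagree with cosine orderings."""
    n = len(a) // 4 * 4
    groups = np.random.permutation(len(a))[:n].reshape(4, -1)
    aa, ab, ag, ad = (a[g] for g in groups)
    ba, bb, bg, bd = (b[g] for g in groups)
    sign = np.where(cosine(aa, ab) >= cosine(ag, ad), 1.0, -1.0)
    return float(np.mean(sign * (hamming(ba, bb) - hamming(bg, bd))))

rng = np.random.default_rng(3)
a = rng.standard_normal((16, 8))       # continuous batch
b = rng.integers(0, 2, size=(16, 8))   # matching binary batch
reg = semantic_preservation(a, b)
```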

Expansive Regularizer
We observed subpar performance from the learnt vectors when optimizing the reconstruction loss alone, because the learnt embeddings failed to preserve the semantic information of the original ones. To reinforce the amount of semantic information preserved, we add the expansive regularizer. This regularisation term in the objective function is defined as

L_exp = (1/2) ||W W^T - I||_F^2

where W is the learnt weight matrix and I is an identity matrix. This term minimizes the correlation between the features of the embedding vector. Lower correlation between features means that the information stored in different features differs from each other, and that more semantic similarity information can be carried across into the features of the resulting binary embedding. Since the model aims to build binary representations that are small in size, it is imperative to carry non-duplicate information; this regularizer keeps the correlation between the features of an embedding as low as possible.
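A sketch of the expansive regularizer, assuming the Frobenius-norm form above (the exact normalization constant is our reading, as the paper's equation is not shown):

```python
import numpy as np

def expansive_regularizer(W):
    """(1/2) * ||W W^T - I||_F^2: penalizes correlated rows of the
    encoder weight matrix W."""
    D = W.shape[0]
    M = W @ W.T - np.eye(D)
    return 0.5 * float(np.sum(M ** 2))

# An orthonormal W incurs zero penalty; duplicated rows are penalized.
W_orth = np.eye(4)
print(expansive_regularizer(W_orth))  # 0.0
```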
The combination of the above regularizers gives the final objective function the ability to retain the information stored in the continuous real-valued embeddings, ensuring that vectors which are close in the real space remain close in the binary space. The balance between the reconstruction loss, the semantic preservation regularizer and the expansive regularizer is controlled by the λ_sp and λ_exp parameters. The global objective function to minimize becomes

L = L_rec + λ_sp L_sp + λ_exp L_exp


Dataset

Binary Embedding
We use the Word2Vec embeddings, which are widely used in the majority of NLP tasks. The embeddings we used have 300 dimensions and a vocabulary of 3,000,000 words, and were created by training on various Google News articles. We also used the GloVe embeddings to generate binary representations; these are trained on aggregated global word-word co-occurrence statistics from a corpus. The GloVe embeddings we used were trained on 6 billion tokens with a 400,000-word vocabulary, and each word vector has 300 dimensions. The experiments currently focus on the English language and related embeddings.

Benchmark
For the benchmarks, we use categorisation, similarity and analogy based tasks. For the categorisation benchmark we use the AP, BLESS, Battig, ESSLI 2b, ESSLI 2c and ESSLI 2a datasets. For the similarity benchmarks we use the MEN, WS353, SimLex999, RW, RG65 and MTurk datasets. For the analogy based tasks we use the Google, MSR and SemEval datasets. More information on these datasets is given in Section 5 under the respective task details.

Experiments
Several tasks have been run to measure the performance of the original continuous embedding and the binarized embedding. These tasks test how much semantic information has been retained in the binary embedding.

Binary Embedding
Three different variations of binary embeddings (for Word2Vec and GloVe respectively) were created, and each is compared with the others in the following sections. The first variation is taken directly from (Tissier et al., 2019) and uses the reconstruction loss and the expansive regularizer (Variation 1). The second is the static autoencoder architecture with a threshold value t of 0 (Variation 2). The third is the dynamic autoencoder architecture with the mean value of the embedding as the threshold t (Variation 3). Variation 2 and Variation 3 use the reconstruction loss, the semantic preservation regularizer and the expansive regularizer, as per the global objective function in Section 3. Each of these embeddings has an input vector size of 300 dimensions, and the binary embeddings were scaled up to 640 dimensions. We compare our proposed Variation 2 and Variation 3 with the original embeddings and Variation 1, using the word-embedding-benchmarks toolkit to compare the results of the different embeddings, as explained in the next section.

Categorization
This evaluation uses clustering techniques such as agglomerative clustering and k-means, and the embeddings are evaluated by the purity value these methods return; a noun categorizer was used to measure the purity of the embeddings. The categorization tasks use the Almuhareb and Abdulrahman categorization dataset (Almuhareb and Poesio, 2005), along with the BLESS, Battig and ESSLI datasets listed in the Benchmark section.
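Cluster purity, as used in this evaluation, can be computed as follows; this is the standard definition, and the benchmark library's exact implementation may differ:

```python
from collections import Counter

def purity(labels_true, labels_pred):
    """For each predicted cluster, count its most frequent gold label;
    purity is the sum of those counts divided by the number of points."""
    total = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        total += Counter(members).most_common(1)[0][1]
    return total / len(labels_true)

# Two gold classes, one cluster absorbs a stray point:
print(purity([0, 0, 1, 1], [0, 0, 0, 1]))  # 0.75
```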

Word Similarity
This evaluation computes the Spearman correlation between the model's cosine similarity and human-rated similarity of word pairs. We evaluated the embeddings using the Spearman correlation value on multiple datasets, namely: the MEN dataset (a dataset for testing similarity and relatedness, whose scores were rescaled to a standard scale), the WS353 dataset (Finkelstein et al., 2001) (testing attributional and relatedness similarity), the Rubenstein and Goodenough dataset (RG65) (Rubenstein and Goodenough, 1965) (testing attributional and relatedness similarity), the Rare Words dataset (testing attributional similarity), and SimLex999 (testing attributional similarity).
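The evaluation can be sketched as below; for binary embeddings, cosine would typically be replaced by a Hamming-based score, and the toy vectors and names here are illustrative:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks
    (no tie correction, which full implementations handle)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def evaluate_similarity(emb, pairs, human_scores):
    """Correlate model cosine similarities with human ratings."""
    model = [cosine(emb[w1], emb[w2]) for w1, w2 in pairs]
    return spearman(model, human_scores)

rho = evaluate_similarity(
    {'cat': np.array([1.0, 0.0]), 'dog': np.array([0.9, 0.1]),
     'car': np.array([0.0, 1.0])},
    [('cat', 'dog'), ('cat', 'car')], [9.0, 1.0])
```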

Word Analogy
This evaluation builds a simple analogy solver using the several embeddings that were created. The analogy questions span several different categories, and a comparison is made using the accuracies across all categories, where accuracy is the fraction of questions answered correctly. The embeddings were standardized and normalized before being passed to the analogy solver. The Google WordRep dataset (Mikolov et al., 2013) (testing both semantic and syntactic analogies) and the MSR dataset (Gao et al., 2014) (testing performance on syntactic analogies) were used to evaluate the embeddings. The embeddings were also run on SemEval 2012 (Agirre et al., 2012), with results evaluated by the Spearman correlation score.
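A minimal analogy solver of the kind described, assuming the standard 3CosAdd (b - a + c) query; the toy vectors are contrived for illustration:

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """3CosAdd: answer 'a is to b as c is to ?' with the word whose
    normalized vector is most similar to b - a + c, excluding a, b, c."""
    words = list(emb)
    M = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    for idx in np.argsort(-(M @ q)):
        if words[idx] not in (a, b, c):
            return words[idx]

toy = {'man':   np.array([1.0, 0.0, 0.0]),
       'king':  np.array([1.0, 0.0, 1.0]),
       'woman': np.array([0.0, 1.0, 0.0]),
       'queen': np.array([0.0, 1.0, 1.0])}
print(solve_analogy(toy, 'man', 'king', 'woman'))  # queen
```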

Results
We have evaluated our binary embeddings on the above-mentioned downstream tasks. The evaluation uses 17 standard benchmarks to compare the proposed variations (Variation 2, Variation 3), Variation 1 of the binary embeddings, and the original continuous embeddings.
We have compared the reconstruction accuracies for the several variations of binary embeddings. Reconstruction accuracy denotes how much of the original embedding can be reconstructed from the generated binary embedding. As seen in Table 1, Variation 1 of Word2Vec and GloVe has been the most successful of the three variations at reproducing the original embedding. This is because in Variation 1 unique, non-collinear information is stored across the features of the embedding, which helps reconstruct the original continuous embedding.

In Table 2, we compare the different variations of binary embeddings on word similarity tasks. Compared to the original Word2Vec and GloVe, the proposed Variation 2 performs comparably, given the reduction in embedding size. Compared to the original binary embedding (Variation 1), our proposed Variation 2 wins on 6 out of 8 tasks for Word2Vec and on 7 out of 8 tasks for GloVe, with significant gains on WS353, WS353R and WS353S. However, binarization of Word2Vec using Variation 3 performs poorly on word similarity tasks compared to both Variation 1 and Variation 2, whereas Variation 3 on GloVe performs on par with the other variations.

In Table 4, when compared to the original Word2Vec embedding, our proposed binary Variation 3 obtains similar results on 5 of the performed tasks. The proposed Variation 3 also outperforms Variation 1 and the proposed Variation 2 on 3 of the 6 tasks for the Word2Vec embedding. When compared to the original GloVe embedding, our proposed binary Variation 3 obtains similar results on 4 of the performed tasks, outperforms Variation 1 and the proposed Variation 2 on 3 of the 6 tasks, and is on par on 1 task.
It is also evident from Tables 2, 3 and 4 that our proposed Variation 2 and Variation 3 do better on different tasks. Variation 3 performs better on tasks where contextual and semantic information is crucial, such as text categorisation, whereas a static threshold (as in Variation 2) focuses more on the meaning of the word than on contextual information. This is borne out by the results: the dynamic thresholding method (proposed Variation 3) performs better on the categorization tasks, while the static thresholding method (proposed Variation 2) performs better on the analogy and word similarity tasks.
As seen in Table 5, more semantic information can be transferred to the binary embeddings when λ_sp is high. The same can be seen from the benchmark tests: for all tasks where semantic information is key, such as categorization, higher values of λ_sp work better.
With higher values of λ_sp and lower values of λ_exp we get better benchmark results for categorization, whereas with lower values of λ_sp and higher values of λ_exp the embeddings capture word-level meaning better, leading to better results on tasks such as word similarity. A smaller difference between λ_sp and λ_exp also seems to work well for word similarity based tasks. These results were obtained using binary embeddings created with Variation 2 from the GloVe embeddings.

As seen in Table 6, the file size reduction varies from 7 to 15 times compared to the original file size. Our binarized embeddings have 640 dimensions and are stored in a text (.txt) file in hexadecimal format, which further reduces the size. When the original embeddings are stored in a binary (.bin) file, float values are used, so on average 4 bytes are needed for each dimension of every vector; this is why we see only a 7-fold reduction for Word2Vec, since we store our embeddings in text (.txt) format. In the case of GloVe we see a 15-fold reduction, because its values are stored directly in textual format, where a float printed to 6 decimal places takes roughly 8 bytes per dimension; hence the 15-fold reduction compared to the original embedding. Both Word2Vec and GloVe have 300 dimensions, and the binarized embeddings have 640.
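The hexadecimal packing described above can be sketched as follows; numpy's packbits is one way to do it, and the paper's exact file layout is not specified:

```python
import numpy as np

def to_hex_line(bits):
    """Pack a 640-bit binary vector into 80 bytes, then render it as a
    160-character hexadecimal string for one line of a .txt file."""
    return np.packbits(bits.astype(np.uint8)).tobytes().hex()

bits = np.zeros(640, dtype=np.uint8)
bits[::2] = 1                 # alternating toy code
line = to_hex_line(bits)
print(len(line))              # 160
```

160 characters per word, versus roughly 300 floats at ~8 bytes each in a GloVe-style text line, accounts for the observed 15-fold reduction.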

Conclusion and Future works
This paper presents a novel approach to generate binary embeddings from any continuous embedding with a significant reduction in size. The proposed approaches also preserve semantic information in the binary vectors. Our binary embeddings perform similarly to the original continuous embeddings and, on downstream tasks such as word similarity and classification, perform better in most cases than previously proposed binarization approaches. A 640-dimensional binary vector is up to 15 times smaller than a 300-dimensional real-valued vector. However, there are some inconsistencies in performance on some downstream tasks; our future work includes investigating and addressing them. We will also experiment further with different embeddings and embedding dimensions and observe the effect on downstream performance.