Unsupervised Few-Bits Semantic Hashing with Implicit Topics Modeling

Semantic hashing is a powerful paradigm for representing texts as compact binary hash codes. The explosion of short text data has spurred the demand for few-bits hashing. However, the performance of existing semantic hashing methods cannot be guaranteed when they are applied to few-bits hashing because of severe information loss. In this paper, we present a simple but effective unsupervised neural generative semantic hashing method with a focus on few-bits hashing. Our model is built upon the variational autoencoder and represents each hash bit as a Bernoulli variable, which allows the model to be trained end-to-end. To address the issue of information loss, we introduce a set of auxiliary implicit topic vectors. With the aid of these topic vectors, the generated hash codes are not only low-dimensional representations of the original texts but also capture their implicit topics. We conduct comprehensive experiments on four datasets. The results demonstrate that our approach achieves significant improvements over state-of-the-art semantic hashing methods in few-bits hashing.


Introduction
Semantic hashing (Salakhutdinov and Hinton, 2009) is an attractive strategy for fast similarity search, which aims to find the most relevant texts for a given query (Wang et al., 2017). The basic idea of semantic hashing is to embed the semantics of texts into a low-dimensional binary vector space while preserving text similarity. The embedded representations are called hash codes, based on which text similarity can be computed efficiently as the Hamming distance using the XOR operation (Zhang et al., 2010).
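To make the XOR-based similarity computation concrete, the following is a minimal Python sketch; the function name and bit-packing are illustrative, not part of the paper:

```python
# Minimal sketch: Hamming distance between two hash codes packed into
# Python integers. int.bit_count() requires Python >= 3.10; on older
# versions, use bin(x).count("1") instead.

def hamming_distance(code_a: int, code_b: int) -> int:
    """Count the bits on which two packed hash codes differ."""
    return (code_a ^ code_b).bit_count()

# Example with two 8-bit codes that differ in 2 positions.
assert hamming_distance(0b10110010, 0b10011010) == 2
```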
While considerable research efforts have been devoted to semantic hashing (Wang et al., 2013; Xu et al., 2015; Chaidaroon and Fang, 2017; Shen et al., 2018; Dong et al., 2019; Hansen et al., 2019; Dadaneh et al., 2020), none of them has paid attention to few-bits hashing, and their performance cannot be guaranteed when directly applied to few-bits hashing due to severe information loss. However, compactness is a crucial factor in learning to hash, so it is important to keep hash codes as short as possible. For a text collection with $c$ topics, the ideal length of hash codes is just $\log_2(c)$ (Liu et al., 2019). In addition, with the explosive growth of social media and e-commerce, more and more short text data (e.g., tweets and online reviews) are generated every day on the Web. It would be a huge waste to represent them as long hash codes. Therefore, it is necessary to ensure the performance of few-bits hashing, which remains a relatively under-studied problem.
In this paper, we propose a simple but effective unsupervised neural generative semantic hashing method WISH (feW-bIts Semantic Hashing), which focuses on few-bits hashing. The architecture of WISH is shown in Figure 1. Built upon the Variational AutoEncoder (VAE) (Kingma and Welling, 2013), WISH learns hash codes directly via the inference network. However, when these binary codes are used as the inputs of the generative network, the model may encounter severe information loss (i.e., the information transmitted from the inference network to the generative network may be rather limited, especially in the few-bits case), so the generative network has little chance to effectively reconstruct the input texts. To address this issue, we introduce a set of auxiliary continuous implicit topic vectors and assume each text is generated from one or more of these topic vectors. Specifically, the inference network decides which topic vectors are selected, and the generative network reconstructs the input texts based on the selected topic vectors. The output of the inference network should therefore be binary; to this end, we model it as either deterministic or stochastic multivariate Bernoulli variables. The inference network and the generative network are optimized jointly by maximizing the variational lower bound of the text log-likelihood, and the straight-through estimator (Bengio et al., 2013) is utilized to estimate the gradients with respect to the binary codes. In summary, the main contributions include:
• We propose a simple but effective neural generative text hashing method (WISH) to tackle the few-bits semantic hashing problem.
• We leverage auxiliary implicit topic vectors to address the issue of information loss. None of the existing methods has used this technique.
• We conduct extensive experiments on four public datasets; the results show that WISH achieves significant improvements over state-of-the-art semantic hashing methods.

Related Work
Up to now, many hashing methods have been proposed (Wang et al., 2017; Luo et al., 2020), which can be roughly categorized into unsupervised methods and supervised methods. In this paper, we focus on unsupervised methods since it is laborious to obtain labels for large-scale text collections. Unsupervised methods attempt to exploit data properties such as manifold structures and distributions to learn hash functions. For example, graph hashing (Liu et al., 2011) learns the hash function by utilizing the underlying manifold structure. Self-Taught Hashing (STH) (Zhang et al., 2010) decomposes the learning procedure into two steps: first generating hash codes via unsupervised learning and then learning hash functions by treating the previously generated hash codes as pseudo labels.
Owing to the success of deep learning, many deep learning-based hashing methods have been proposed in recent years (Wang et al., 2017; Xu et al., 2015; Dong et al., 2019; Xuan et al., 2019). For text hashing, Chaidaroon and Fang (2017) were the first to propose a deep generative model, called Variational Deep Semantic Hashing (VDSH). Chaidaroon et al. (2018) further proposed an improved version of VDSH, which employs unsupervised ranking methods such as BM25 (Robertson and Zaragoza, 2009) to extract weak signals from training data. In consideration of the pervasiveness of text relationships, Node2hash (Chaidaroon et al., 2019) considers both text contents and connection information. However, these methods are not end-to-end trainable, because they generate the final hash codes by using the median method (Weiss et al., 2009) for binarization. Shen et al. (2018) proposed NASH, an end-to-end trainable generative semantic hashing method that learns hash codes directly. BMSH (Dong et al., 2019) enhances NASH by imposing mixture priors. Hansen et al. (2019) proposed Ranking based Semantic Hashing (RBSH), another extension of NASH that incorporates text similarity into the hash code generation.
Although the above methods have demonstrated promising results in semantic hashing, they pay no attention to few-bits hashing. Due to severe information loss, their performance cannot be guaranteed if they are applied to few-bits hashing directly. Our model focuses on few-bits hashing and introduces a set of auxiliary implicit topic vectors to mitigate information loss. The learned hash codes are able to capture the implicit topics of texts.

Problem Definition
We denote each text $d$ as a bag-of-words vector such that $d \in \mathbb{R}^{|\mathcal{V}|}$, where $\mathcal{V}$ is the vocabulary set. Let $w_i, v_i \in \{0, 1\}^{|\mathcal{V}|}$ be the one-hot vector representations of the $i$-th word in $d$ and in $\mathcal{V}$, respectively. The task of few-bits semantic hashing is to generate a short binary hash code $z \in \{0, 1\}^{l}$ for each text $d$, while preserving text similarity as much as possible, where $l$ denotes the number of hash bits.

Model Formulation
As illustrated in Figure 1, our model is built upon the VAE architecture. Its basic idea is to learn the hash code $z$ for each text $d$ via the inference network; $z$ is then decoded by the generative network to reconstruct $d$. However, when $l$ is small, the generative network has little chance to reconstruct $d$ well based solely on $z$. To solve this problem, we introduce a set of auxiliary implicit topic vectors. Let $T = [t_1, t_2, \cdots, t_l] \in \mathbb{R}^{d \times l}$ be the matrix form of these topic vectors, where the $i$-th topic vector is denoted as $t_i \in \mathbb{R}^{d}$ and $d$ represents the topic vector size. There are $l$ topic vectors in total, equal to the number of hash bits. Each text $d$ is then generated based on these topic vectors rather than the binary vector $z$ alone. Specifically, the generative process is as follows. For each text $d$:
• sample $z_i \sim \text{Bernoulli}(\gamma_i)$ for $i = 1, \ldots, l$, and let $T_z$ denote the topic vectors selected by the bits with $z_i = 1$;
• for each word in $d$, sample $w_i \sim P(w_i | f(g(T_z)))$.
Here, $\gamma_i \in [0, 1]$ is the $i$-th entry of $\gamma$, which stands for the probability of sampling $z_i$ as 1.
The function $g$ integrates all the selected topic vectors to obtain a new representation; $g$ has many choices, such as summing, averaging, or more complex methods (see the sketch below). The function $f$ maps the new representation to a latent vector used for modeling word probabilities.
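As a minimal illustration of the two simple choices for $g$ (summing and averaging), consider the following PyTorch sketch; the function names and the batched layout are ours, not the paper's:

```python
import torch

# T: (topic_dim, n_bits) topic matrix; z: (batch, n_bits) binary codes.
def g_sum(T: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Sum of the topic vectors whose bits are 1, i.e. T @ z per example.
    return z @ T.t()

def g_mean(T: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Average of the selected topic vectors; clamp avoids division by
    # zero when a code happens to be all zeros.
    return (z @ T.t()) / z.sum(dim=-1, keepdim=True).clamp(min=1.0)
```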
We utilize the softmax function to compute the conditional probability over $w_i$:

$$P(w_i \mid f(g(T_z))) = \frac{\exp\left(w_i^\top f(g(T_z))\right)}{\sum_{j=1}^{|\mathcal{V}|} \exp\left(v_j^\top f(g(T_z))\right)}. \quad (1)$$

Assuming the words in $d$ are generated independently, the text likelihood conditioned on $T_z$ is

$$P(d \mid T_z) = \prod_{i=1}^{N} P(w_i \mid f(g(T_z))), \quad (2)$$

where $N$ denotes the number of words in $d$. The objective is to maximize the text log-likelihood:

$$\log P(d) = \log \sum_{z} P(d \mid T_z) P(T_z \mid z) P(z) = \log \sum_{z} P(d \mid T_z) P(z). \quad (3)$$

Note that Eq. (3) holds because $P(T_z \mid z) = 1$ for all $z$. However, this objective is intractable. By introducing $Q(z|d)$ as an approximation of the true posterior distribution $P(z|d)$, similar to VAE, we derive the tractable variational lower bound of the text log-likelihood:

$$\log P(d) \geq \mathbb{E}_{Q(z|d)}\left[\log P(d \mid T_z)\right] - \text{KL}\left(Q(z|d) \,\|\, P(z)\right), \quad (4)$$

where $\text{KL}(\cdot \| \cdot)$ calculates the Kullback-Leibler divergence and $P(z)$ is the prior distribution of $z$.
In our approach, the implicit topic vectors play a crucial part in mitigating information loss in few-bits hashing. They are learned automatically according to the data distribution instead of being set up manually. To take full advantage of the implicit topic vectors, it is also useful to make them independent of each other. For this purpose, we add an orthogonality constraint on $T$. The final objective is then derived as

$$\mathcal{L} = \mathbb{E}_{Q(z|d)}\left[\log P(d \mid T_z)\right] - \text{KL}\left(Q(z|d) \,\|\, P(z)\right) - \lambda \left\| T^\top T - I \right\|_F^2, \quad (5)$$

where $I$ represents the identity matrix and $\| \cdot \|_F$ denotes the Frobenius norm. $\lambda$ is a parameter used to adjust the contribution of the orthogonality constraint.
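The following PyTorch sketch spells out Eq. (5) as a loss to be minimized (i.e., the negative of the lower bound plus the orthogonality penalty). All names are illustrative, and the decoder outputs are assumed to have already been mapped to vocabulary logits:

```python
import torch
import torch.nn.functional as F

def wish_loss(logits, bow, z_bar, T, lam=1.0, kl_weight=1.0, eps=1e-8):
    # Reconstruction: negative expected log-likelihood of the bag-of-words
    # under the softmax word distribution (words treated as independent).
    log_probs = F.log_softmax(logits, dim=-1)            # (batch, |V|)
    rec = -(bow * log_probs).sum(dim=-1).mean()

    # KL between Bernoulli(z_bar) and the Bernoulli(0.5) prior, per Eq. (4):
    # p*log(2p) + (1-p)*log(2(1-p)), summed over bits.
    kl = (z_bar * torch.log(2 * z_bar + eps)
          + (1 - z_bar) * torch.log(2 * (1 - z_bar) + eps)).sum(dim=-1).mean()

    # Orthogonality penalty ||T^T T - I||_F^2 from Eq. (5); T is (d x l).
    n_bits = T.shape[1]
    ortho = ((T.t() @ T - torch.eye(n_bits, device=T.device)) ** 2).sum()

    # Minimizing this sum corresponds to maximizing the objective in Eq. (5).
    return rec + kl_weight * kl + lam * ortho
```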

Model Implementation
Our model is implemented under the VAE framework, comprised of an inference network and a generative network.

The Inference Network
The inference network calculates $Q(z|d)$ to obtain the binary vector $z$ for each text $d$. Since the prior on $z$ is a multivariate Bernoulli distribution, we restrict $Q(z|d)$ to take the form $Q(z|d) = \text{Bernoulli}(\bar{z})$, where $\bar{z} = \sigma(r(d))$. Here $\sigma(\cdot)$ is the sigmoid function, which outputs the sampling probabilities of $z$, and $r$ is a nonlinear function specified as a multilayer perceptron. Based on $Q(z|d)$, the binary vector $z$ can be sampled in a deterministic or stochastic way. In the deterministic case, we have $z_i = \lceil \bar{z}_i - 0.5 \rceil$, where $z_i$ and $\bar{z}_i$ denote the $i$-th entries of $z$ and $\bar{z}$, respectively. In the stochastic case, we have $z_i = \lceil \bar{z}_i - \mu_i \rceil$, where $\mu_i \sim \text{Uniform}(0, 1)$.
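A minimal PyTorch sketch of the two sampling rules, assuming `z_bar` holds the sigmoid outputs $\sigma(r(d))$:

```python
import torch

def sample_deterministic(z_bar: torch.Tensor) -> torch.Tensor:
    # z_i = 1 iff the sampling probability exceeds 0.5.
    return (z_bar > 0.5).float()

def sample_stochastic(z_bar: torch.Tensor) -> torch.Tensor:
    # z_i = 1 iff z_bar_i exceeds a fresh draw mu_i ~ Uniform(0, 1),
    # so each bit is an independent Bernoulli(z_bar_i) sample.
    mu = torch.rand_like(z_bar)
    return (z_bar > mu).float()
```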

The Generative Network
The generative network takes the selected topic vectors $T_z$ as input and outputs the word probability distribution $P(\mathcal{V} | f(g(T_z)))$. We set $f$ to be a linear function, i.e., $f(g(T_z)) = E g(T_z) + b$, where $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, $b \in \mathbb{R}^{|\mathcal{V}|}$, and $d$ is the size of $g(T_z)$.
Then, according to Eq. (1), we have

$$P(w_i \mid f(g(T_z))) = \frac{\exp\left(w_i^\top E g(T_z) + b_i\right)}{\sum_{j=1}^{|\mathcal{V}|} \exp\left(v_j^\top E g(T_z) + b_j\right)},$$

where $b_i$ is the $i$-th entry of $b$. We do not resort to a more complex function $f$, so as to avoid the "posterior collapse" phenomenon (Lucas et al., 2019).
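Putting the pieces together, here is a minimal PyTorch sketch of the generative network with $g$ chosen as summation; the class and attribute names are ours:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Linear decoder f(g(T_z)) = E g(T_z) + b, with g as summation."""
    def __init__(self, vocab_size: int, topic_dim: int, n_bits: int):
        super().__init__()
        self.T = nn.Parameter(torch.randn(topic_dim, n_bits))  # topic matrix
        self.out = nn.Linear(topic_dim, vocab_size)            # E and b

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # g(T_z): sum of the topic vectors whose bits are 1, i.e. T @ z.
        g = z @ self.T.t()              # (batch, topic_dim)
        return self.out(g)              # logits over the vocabulary
```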

Optimization
The inference network and the generative network can be trained jointly via backpropagation to optimize the objective in Eq. (5). However, the gradients with respect to the binary vector $z$ are zero almost everywhere, so the inference network cannot be trained directly. To address this issue, we utilize the straight-through estimator (Bengio et al., 2013), which approximates the gradient of the binarization function as 1. Thus, gradients can be backpropagated from the generative network to the inference network.
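In code, the straight-through estimator amounts to using the hard code in the forward pass while routing gradients through the soft probabilities, e.g. (a sketch, not the authors' implementation):

```python
import torch

def straight_through(z_bar: torch.Tensor) -> torch.Tensor:
    # Stochastic binarization in the forward pass.
    z_hard = (z_bar > torch.rand_like(z_bar)).float()
    # detach() blocks the (zero almost everywhere) gradient of the hard
    # threshold, so gradients flow to z_bar as if binarization were identity.
    return z_bar + (z_hard - z_bar).detach()
```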
In this work, the prior on $z$ is set to be the standard Bernoulli distribution, that is, all entries of $\gamma$ are fixed at 0.5. Therefore, the Kullback-Leibler divergence term in Eq. (4) can be computed in closed form as

$$\text{KL}\left(Q(z|d) \,\|\, P(z)\right) = \sum_{i=1}^{l} \left[ \bar{z}_i \log \frac{\bar{z}_i}{0.5} + (1 - \bar{z}_i) \log \frac{1 - \bar{z}_i}{0.5} \right].$$

Hash Code Generation
Once the model has been trained, we can generate hash codes for both training and query texts via the inference network. Since the vector $z$ output by the inference network is already binary, we use it directly as the hash code, which also shows that our model is end-to-end trainable. Note that when generating hash codes, the binary vector $z$ is sampled only in the deterministic way.

Discussion
As shown in the hash code generation process, each hash bit corresponds to an implicit topic: when a hash bit is 1, the corresponding implicit topic vector is selected to generate the text. In this sense, the learned hash codes not only reduce the dimension of the original texts but also capture their implicit topics, which lends them more interpretability. The hash codes learned by existing hashing methods lack this property. In fact, our approach WISH can be regarded as a Latent Topic Modeling (LTM) model. Here, we provide an intuitive way to show how WISH can be treated as an LTM model and how it differs from existing LTM models.

We choose the two most popular LTM models, Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003), for comparison and illustrate their graphical representations in Figure 2, where $M$, $N$ and $l$ denote the number of texts, the number of words and the number of latent topics, respectively. As shown in Figure 2 (a), PLSA first chooses a topic $c$ based on the text topic distribution $\theta$ and then generates word $w$ according to the $c$-th topic vector $t_c$. LDA is a slightly modified version of PLSA: both the topic distribution $\theta$ and the topic vector $t$ are assumed to follow Dirichlet distributions characterized by $\alpha$ and $\beta$. The graphical representation of WISH is similar to those of PLSA and LDA, as shown in Figure 2 (c): WISH first samples a binary vector $z$ from the multivariate Bernoulli distribution characterized by $\gamma$, and then a subset of topic vectors $T_z$ is selected to generate word $w$. In view of this word generation process, WISH can be regarded as an LTM model. However, WISH is a discrete deep model, as $z$ is forced to be binary, and the prior on $z$ is a Bernoulli distribution rather than a Dirichlet distribution. Similar to PLSA, WISH does not place any prior on the topic vector $t$, but it generates words based on a subset of topic vectors simultaneously instead of only one topic vector. Besides, PLSA is a transductive method that cannot handle query texts, so it cannot be used for text hashing.

Datasets
We use four public benchmark datasets for evaluation. 1) Reuters is a collection of 10,788 news documents with 90 different classes. Following (Chaidaroon and Fang, 2017), only the 20 most frequent classes are taken into consideration. 2) TMC contains air traffic reports provided by NASA and comprises 28,596 reports divided into 22 categories. 3) 20Newsgroups is a dataset of 18,846 newsgroup posts, partitioned into 20 groups. 4) Agnews is a collection of 127,600 news articles in 4 categories; for each article, we use both the title and the description. All the adopted datasets are short text data. The average text lengths of Reuters, TMC, 20Newsgroups and Agnews are 51, 63, 102 and 21 words, respectively.
For each dataset, we filter the vocabulary by removing words with more than 90% document frequency and words occurring fewer than 3 times. We also remove stopwords using the sklearn stopword list; no stemming is performed. We split each dataset into three parts: 80% for training, 10% for validation and 10% for testing. During evaluation, documents are retrieved from the training set only. We choose TF-IDF (Manning et al., 2008) features as the original document representation.
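A minimal preprocessing sketch with scikit-learn is given below. Note that `min_df`/`max_df` in `TfidfVectorizer` are document-frequency thresholds, so this is only an approximation of the stated frequency-based filtering:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# stop_words="english" applies sklearn's stopword list; max_df=0.9 drops
# words appearing in more than 90% of documents; min_df=3 drops rare words.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.9, min_df=3)
# X_train = vectorizer.fit_transform(train_texts)  # (n_docs, |V|) TF-IDF
# X_test = vectorizer.transform(test_texts)        # reuse fitted vocabulary
```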

Evaluation Metric
To evaluate the effectiveness of the generated hash codes in similarity search, we treat each document in the testing set as a query. For each query, we retrieve relevant documents from the training set based on the Hamming distance between their hash codes. To facilitate comparison with prior semantic hashing methods (Chaidaroon and Fang, 2017; Shen et al., 2018; Hansen et al., 2019; Chaidaroon et al., 2018), we take precision as the evaluation metric. More specifically, for each query, we retrieve the 100 nearest documents and measure performance as the precision among them (Prec@100), i.e., the ratio of the number of retrieved relevant documents to the total number of retrieved documents (fixed at 100). The overall performance is the average Prec@100 score over all queries. To determine whether a retrieved document is relevant to a given query, following prior works (Chaidaroon and Fang, 2017; Shen et al., 2018; Hansen et al., 2019; Wang et al., 2013; Chaidaroon et al., 2018), we consider documents sharing at least one class label as relevant pairs.
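For clarity, here is a minimal sketch of Prec@100, assuming integer-packed hash codes and label sets; all names are illustrative:

```python
def prec_at_100(query_code, query_labels, train_codes, train_labels):
    """Prec@100 for one query: hash codes are packed integers,
    labels are sets of class ids."""
    # Rank training documents by Hamming distance to the query.
    ranked = sorted(range(len(train_codes)),
                    key=lambda i: bin(query_code ^ train_codes[i]).count("1"))
    top = ranked[:100]
    # A retrieved document is relevant if it shares at least one label.
    relevant = sum(1 for i in top if train_labels[i] & query_labels)
    return relevant / 100
```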

Training Details
On all datasets, we implement the inference network with 2 hidden layers (both with 1000 units) using the ReLU activation function, followed by a hidden layer with a sigmoid activation function to obtain the sampling probabilities of the hash code $z$. We also apply dropout (Srivastava et al., 2014) with a keep probability of 0.8 to the output of the second layer to alleviate overfitting. The generative network consists of a single layer with a softmax activation function, as described in Section 3.3. We adopt the stochastic method to sample the binary vector $z$ during training so as to encourage exploration. For simplicity, we choose the summing function as $g$ to integrate the selected topic vectors $T_z$ before feeding them to the generative network.
Our model is trained using the Adam optimizer (Kingma and Ba, 2014) with the learning rate fixed at 0.001 for all parameters. By default, we set the orthogonality constraint coefficient $\lambda$ to 1. The topic vector size $d$ is fixed at 50 for Reuters and 100 for the other three datasets. Following (Chaidaroon and Fang, 2017), we add a weight parameter on the Kullback-Leibler divergence term; this weight is initialized to 0 and then increased by $5 \times 10^{-6}$ in each iteration. We implement our approach in PyTorch and conduct all experiments on a server with 2 AMD Ryzen Threadripper 2950X 16-Core Processors and 2 Nvidia Titan RTX GPUs. Our implementation is available at https://github.com/smartyfh/WISH.
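The KL-weight schedule can be written as below; the cap at 1.0 is our assumption, as the paper only states the increment:

```python
def kl_weight(step: int, increment: float = 5e-6, cap: float = 1.0) -> float:
    # Linear KL annealing: weight starts at 0 and grows by `increment`
    # per training iteration. The cap is an assumption, not from the paper.
    return min(step * increment, cap)
```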

Baseline Comparison
To evaluate the performance of our approach WISH in few-bits hashing, we set the length of hash codes (i.e., $l$) to 4, 6, 8, 10 and 12. For a fair comparison, we run each method 10 times and report the average results. The detailed results are presented in Table 1, where the best results are shown in bold. First, we observe that WISH consistently outperforms all baselines on the four datasets across different numbers of hash bits. For example, on the 20Newsgroups dataset, WISH achieves approximately 10% improvement over the best-performing baseline when the number of hash bits is set to 4 and 6. These results indicate that, with the aid of the auxiliary implicit topic vectors, WISH makes the most effective use of the limited information transmitted from the inference network. Second, we observe that all methods achieve better performance as the hash code length increases. This is expected because longer hash codes can preserve more information. Overall, compared to the baseline methods, our approach WISH is more suitable for few-bits hashing.

Comparison with LDA
We have discussed the relationship between WISH and the two LTM models PLSA and LDA in Section 3.5. PLSA is a transductive method and thus cannot be used for hashing, whereas LDA is an inductive method and can be utilized for hashing directly. Here we compare WISH with LDA by setting the number of hash bits to 8. The results are illustrated in Figure 3, where 20NG stands for 20Newsgroups.
We can see that WISH performs much better than LDA. Compared to LDA, WISH has two advantages: 1) WISH is a discrete model and learns hash codes directly, while LDA requires a binarization step to generate hash codes, which usually leads to suboptimal results; 2) WISH is a deep neural generative model, which inherits the good properties of both deep learning and probabilistic generative models, whereas LDA is a shallow model.

Effects of Sampling Strategies
As described in Section 3.3, there are two strategies for sampling the binary vector $z$: stochastic and deterministic. Here we compare the two sampling strategies and observe their effects on the performance of WISH. We fix the number of hash bits at 8 and report the results in Figure 4. As can be observed, on all datasets, the stochastic sampling method outperforms the deterministic one. The results indicate that endowing the sampling process of the binary vector $z$ with more stochasticity helps make the learned binary representations of the input texts more meaningful and more discriminative.

Effects of Topic Vector Size
In this section, we investigate the effects of the topic vector size $d$. The results are reported in Figure 5, from which we observe that our approach is relatively stable with respect to $d$. Although $d$ is expected to be much larger than the size of the binary vector $z$ (in order to address the issue of information loss), there is no need to set it very large. A small $d$ reduces the number of parameters in our model, which reduces the training time as well as the chance of overfitting.

Effects of Parameter λ
As shown in Eq. (5), our approach involves an orthogonality constraint on the topic vectors $T$, with $\lambda$ being the weighting parameter. This constraint is important because it helps reduce information redundancy among the topic vectors and thus makes the learned binary representations more discriminative. It is therefore useful to study the effects of $\lambda$. To this end, we tune $\lambda$ in the range $\{0.01, 0.1, 1, 10, 100\}$ and report the results on Reuters and 20Newsgroups in Figure 6, where the number of hash bits is set to 4 and 8. Note that the horizontal axis of Figure 6 is plotted in log scale. From Figure 6, we observe that our approach is robust to $\lambda$: even though $\lambda$ varies over such a large range, the performance remains stable.

Effects of Implicit Topic Vector Integration Strategies
Recall that in our approach there is a function g, which integrates all the selected implicit topic vectors before feeding them to the generative network.
In previous experiments, we set $g$ to be the summing function (i.e., adding all the selected topic vectors). Since $g$ has many other choices, such as averaging (i.e., taking the mean of all the selected topic vectors) and more complex methods, here we compare the summing strategy with the averaging strategy. We conduct experiments on Reuters and 20Newsgroups, varying the number of hash bits over $\{4, 6, 8, 10, 12\}$. The results are illustrated in Figure 7. As can be seen, the summing strategy consistently outperforms the averaging strategy. The results indicate that a proper $g$ is important to ensure the performance of our approach; with a more advanced $g$, our approach has the potential to achieve even better performance. We leave the exploration of more advanced choices of $g$ to future work.

Time Comparison
In this part, we first compare the training time of different methods on the largest dataset, Agnews. For our approach WISH and all the baselines except STH (i.e., VDSH, NASH, RBSH, Node2hash and NbrReg), we run each method for 100 iterations. The results are reported in Figure 8 (a), from which we can observe that STH runs much faster than the other methods. This is because STH is a shallow model, whereas all the other methods are deep models. We can also observe that RBSH and NbrReg take much longer to train. The training time of our approach WISH is comparable with those of VDSH, NASH and Node2hash. We further compare the average query time of different methods. To this end, we treat each document in the testing set as a query; we first feed each query to the trained model to generate its hash code and then retrieve relevant documents from the training set. The average query time results are reported in Figure 8 (b), from which we observe that STH takes much longer to answer a query, while RBSH and our approach WISH are very efficient. These results demonstrate the efficiency of our approach.

Comparison of Long-Bits Hashing
As described in Section 5.1, our approach WISH consistently outperforms all baseline methods in few-bits hashing. Here we examine the performance when the hash codes are longer. Specifically, we vary the number of hash bits over $\{16, 20, 24, 28, 32\}$ and conduct experiments on TMC and Agnews. We choose these two datasets because TMC has the most ground-truth classes while Agnews has the fewest. In this experiment, the topic vector size is fixed at 50 for both datasets, and the learning rate is set to 0.001 for TMC and 0.0001 for Agnews. The results are illustrated in Figure 9, from which we observe that WISH outperforms all baseline methods on both datasets. These results further confirm the effectiveness of our approach.

Visualization of Hash Codes
To intuitively see whether the learned hash codes preserve the semantics of the original documents, we further perform a qualitative visualization analysis using UMAP (McInnes et al., 2018) on the 20Newsgroups dataset. Specifically, we first project the 8-bit hash codes learned by VDSH and by our approach WISH into 2D space and then generate the scatter plots. Figure 10 illustrates the results: each point denotes a document associated with one of the 20 classes, and different colors represent different classes. As can be observed, our approach WISH generates more clearly separated clusters, while the clusters produced by VDSH are highly overlapped. We can also observe that the points generated by our approach are closer to each other when they share the same class label. This visualization analysis again verifies the effectiveness of our approach and demonstrates that it can preserve the semantics of documents even though it is an unsupervised method.

Conclusion
In this paper, we have presented a simple but effective unsupervised neural generative semantic hashing method with a focus on few-bits hashing.
To address the problem of information loss in few-bits hashing, we have introduced a set of auxiliary implicit topic vectors. With the aid of these topic vectors, our approach can well capture the semantics of texts; hence the learned hash codes are not only low-dimensional representations of the original texts but also capture their implicit topics.
We have further shown that our approach can be treated as an LTM model, although it is fundamentally different from existing LTM models. To evaluate the effectiveness of our approach, we have conducted a comprehensive set of experiments, and the results demonstrate the superiority of our approach over existing semantic hashing methods.