Document Hashing with Mixture-Prior Generative Models

Hashing is promising for large-scale information retrieval tasks thanks to the efficiency of distance evaluation between binary codes. Generative hashing is often used to generate hashing codes in an unsupervised way. However, existing generative hashing methods have only considered simple priors, such as Gaussian and Bernoulli priors, which limits further improvement of their performance. In this paper, two mixture-prior generative models are proposed, with the objective of producing high-quality hashing codes for documents. Specifically, a Gaussian mixture prior is first imposed on the variational auto-encoder (VAE), followed by a separate step that casts the continuous latent representation of the VAE into binary code. To avoid the performance loss caused by this separate casting, a model using a Bernoulli mixture prior is further developed, which admits end-to-end training by resorting to the straight-through (ST) discrete gradient estimator. Experimental results on several benchmark datasets demonstrate that the proposed methods, especially the one using the Bernoulli mixture prior, consistently outperform existing methods by a substantial margin.


Introduction
Similarity search aims to find the items most similar to a query from a large collection of data, and has extensive applications such as plagiarism analysis (Stein et al., 2007), collaborative filtering (Koren, 2008), content-based multimedia retrieval (Lew et al., 2006), and web services (Dong et al., 2004). Semantic hashing is an effective way to accelerate the search process by representing every document with a compact binary code. In this way, one only needs to evaluate the Hamming distance between binary codes, which is much cheaper than the Euclidean distance calculation in the original feature space.
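The cost advantage comes from the bit-level nature of the comparison: with codes packed into machine words, the distance reduces to an XOR and a population count. A minimal sketch (our own illustration, not code from the paper):

```python
import numpy as np

def hamming_distance(a, b):
    """Hamming distance between two binary codes given as 0/1 arrays.

    Packed into machine words, this is an XOR followed by a popcount,
    far cheaper than a Euclidean distance in the original
    high-dimensional feature space.
    """
    a = np.asarray(a, dtype=np.uint8)
    b = np.asarray(b, dtype=np.uint8)
    return int(np.count_nonzero(a != b))
```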
Existing hashing methods can be roughly divided into data-independent and data-dependent categories. Data-independent methods employ random projections to construct hash functions without any consideration of data characteristics, as in the locality-sensitive hashing (LSH) algorithm (Datar et al., 2004). In contrast, data-dependent hashing seeks to learn a hash function from the given training data in either a supervised or an unsupervised way. In the supervised case, a deterministic function that maps the data to a binary representation is trained using the provided supervision (e.g., labels) (Liu et al., 2012; Shen et al., 2015). However, supervised information is often very difficult to obtain, or is not available at all. Unsupervised hashing seeks to obtain binary representations by leveraging the inherent structure of the data, as in spectral hashing (Weiss et al., 2009), graph hashing (Liu et al., 2011), iterative quantization (Gong et al., 2013), and self-taught hashing (Zhang et al., 2010).
Generative models are often considered the most natural way to perform unsupervised representation learning (Miao et al., 2016; Bowman et al., 2015; Yang et al., 2017), and many efforts have been devoted to hashing with generative models. In (Chaidaroon and Fang, 2017), variational deep semantic hashing (VDSH) is proposed to solve the semantic hashing problem with the variational auto-encoder (VAE) (Kingma and Welling, 2013). However, this model requires two-stage training, since a separate step is needed to cast the continuous representations of the VAE into binary codes. Under the two-stage training strategy, the model is more prone to getting stuck at poor performance (Xu et al., 2015; Zhang et al., 2010; Wang et al., 2013). To address this issue, the neural architecture for generative semantic hashing (NASH) proposed to replace the Gaussian prior in VDSH with a Bernoulli prior, and further to use the straight-through (ST) method (Bengio et al., 2013) to estimate the gradients of functions involving binary variables. It was shown that end-to-end training brings a remarkable performance improvement over the two-stage training used in VDSH. Despite their superior performance, only the simplest priors are used in these models, i.e., a Gaussian in VDSH and a Bernoulli in NASH. However, it is widely known that priors play an important role in the performance of generative models (Goyal et al., 2017; Chen et al., 2016; Jiang et al., 2016).
Motivated by this observation, we propose in this paper to produce high-quality hashing codes by imposing appropriate mixture priors on generative models. Specifically, we first propose to model documents with a VAE under a Gaussian mixture prior. However, similar to VDSH, this method also requires a separate stage to cast the continuous representations into binary form, making it suffer from the same drawbacks of two-stage training. We then further propose to use a Bernoulli mixture as the prior, in the hope of yielding binary representations directly. An end-to-end training method is developed by resorting to the straight-through gradient estimator for neural networks involving binary random variables. Extensive experiments on benchmark datasets show substantial gains of the proposed mixture-prior methods over existing ones, especially for the method with a Bernoulli mixture prior.

Semantic Hashing by Imposing Mixture Priors
In this section, we investigate how to obtain similarity-preserving hashing codes by imposing different mixture priors on variational auto-encoders.

Preliminaries on Generative Semantic Hashing
Let x ∈ Z_+^{|V|} denote the bag-of-words representation of a document and x_i ∈ {0,1}^{|V|} denote the one-hot representation of the i-th word of the document, where |V| denotes the vocabulary size. VDSH (Chaidaroon and Fang, 2017) models a document D, defined as the sequence of one-hot word representations {x_1, ..., x_{|D|}}, by

p(D) = ∫ p(z) p_θ(D|z) dz, (1)

where the prior p(z) is the standard Gaussian distribution N(0, I), and the likelihood has the factorized form p_θ(D|z) = ∏_{i=1}^{|D|} p_θ(x_i|z), with

p_θ(x_i|z) = exp(z^T E x_i + b_i) / Σ_{j=1}^{|V|} exp(z^T E x_j + b_j). (2)

Here E ∈ R^{m×|V|} is a parameter matrix that connects the latent representation z to the one-hot representation x_i of the i-th word, with m being the dimension of z; b_i is a bias term and θ = {E, b_1, ..., b_{|V|}}. It is known that generative models with better modeling capability often yield more informative latent representations.
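The linear softmax decoder in (2) can be sketched as follows; the function name and array shapes are our own illustrative choices:

```python
import numpy as np

def softmax_decoder(z, E, b):
    """Per-word likelihood p(x_i = w | z) under the linear softmax decoder.

    z : (m,)    latent representation
    E : (m, V)  parameter matrix connecting z to the vocabulary
    b : (V,)    bias terms
    Returns a length-V probability vector over the vocabulary.
    """
    logits = z @ E + b
    logits -= logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```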
To increase the modeling capacity of (1), one may resort to a more complex likelihood p_θ(D|z), for example using deep neural networks to relate the latent z to the observations x_i instead of the simple softmax function in (2). However, as has been observed in prior work, employing expressive nonlinear decoders tends to destroy the distance-keeping property of the latent space, which is essential for good hashing codes. In this paper, instead of employing a more complex decoder p_θ(D|z), more expressive priors are leveraged to address this issue.

Semantic Hashing by Imposing Gaussian Mixture Priors
To begin with, we replace the standard Gaussian prior p(z) = N(0, I) in (1) by the following Gaussian mixture prior

p(z) = Σ_{k=1}^{K} π_k N(z; μ_k, diag(σ_k²)), (3)

where K is the number of mixture components; π_k is the probability of choosing the k-th component, with Σ_{k=1}^{K} π_k = 1; μ_k ∈ R^m and σ_k² ∈ R_+^m are the mean and variance vectors of the Gaussian distribution of the k-th component; and diag(·) denotes diagonalizing a vector. Any sample z ∼ p(z) can equivalently be generated by a two-stage procedure: 1) choosing a component c ∈ {1, 2, ..., K} according to the categorical distribution Cat(π) with π = [π_1, π_2, ..., π_K]; 2) drawing a sample from the chosen component's Gaussian distribution.

Figure 1: The architectures of GMSH and BMSH. The generative process of GMSH is as follows: (1) pick a component c ∈ {1, 2, ..., K} from Cat(π) with π = [π_1, π_2, ..., π_K]; (2) draw a sample z from the picked Gaussian distribution N(μ_c, diag(σ_c²)); (3) use the decoder g_θ(z) to decode z into an observation x̂. The generative process of BMSH is as follows: (1) choose a component c from Cat(π); (2) sample a latent vector z from the chosen distribution Bernoulli(γ_c); (3) inject data-dependent noise into z, drawing z' from N(z, diag(σ_c²)); (4) use the decoder g_θ(z') to reconstruct x̂.
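The two-stage sampling procedure can be sketched as follows (an illustrative snippet of our own; `sample_gmm_prior` and its signature are not from the paper):

```python
import numpy as np

def sample_gmm_prior(pi, mu, sigma2, rng):
    """Two-stage draw from the Gaussian mixture prior.

    pi     : (K,)    mixture weights, summing to 1
    mu     : (K, m)  component means
    sigma2 : (K, m)  component variances (diagonal covariances)
    1) pick component c ~ Cat(pi); 2) draw z ~ N(mu_c, diag(sigma2_c)).
    """
    c = rng.choice(len(pi), p=pi)
    return mu[c] + np.sqrt(sigma2[c]) * rng.standard_normal(mu.shape[1])
```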
Thus, the document D is modelled as

p(D) = Σ_{c=1}^{K} ∫ p_θ(D|z) p(z|c) p(c) dz, (4)

where p(c) = Cat(π) and p(z|c) = N(μ_c, diag(σ_c²)). To train the model, we seek to maximize a lower bound of the log-likelihood

L = E_{q_φ(z,c|x)}[log (p_θ(D|z) p(z|c) p(c) / q_φ(z, c|x))], (5)

where q_φ(z, c|x) is the approximate posterior of p(z, c|x), parameterized by φ; here x can be any representation of the document, such as the bag-of-words or TFIDF. For tractability, q_φ(z, c|x) is further assumed to take the factorized form q_φ(z, c|x) = q_φ(z|x) q_φ(c|x). Substituting this into the lower bound gives

L = E_{q_φ(z|x)}[log p_θ(D|z)] − E_{q_φ(c|x)}[KL(q_φ(z|x)||p(z|c))] − KL(q_φ(c|x)||p(c)). (6)

For simplicity, we assume that q_φ(z|x) and q_φ(c|x) take the forms of Gaussian and categorical distributions, respectively, with the distribution parameters defined as the outputs of neural networks. The entire model, including the generative and inference arms, is illustrated in Figure 1(a). Using the properties of Gaussian and categorical distributions, the last two terms in (6) can be expressed in closed form. Combined with the reparameterization trick of the stochastic gradient variational Bayes (SGVB) estimator (Kingma and Welling, 2013), the lower bound L can be optimized with respect to the model parameters {θ, π, μ_k, σ_k, φ} directly by error backpropagation and SGD. Given a document x, its hashing code is obtained in two steps: 1) map x to its latent representation z = μ_φ(x), where μ_φ(·) is the encoder mean; 2) threshold z into binary form. As suggested in (Wang et al., 2013; Chaidaroon et al., 2018; Chaidaroon and Fang, 2017), when hashing a batch of documents, we can use the median value of the elements of z as the critical value and threshold each element of z into 0 or 1 by comparing it against this value. For convenience of presentation, the proposed semantic hashing model with a Gaussian mixture prior is referred to as GMSH.
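The final thresholding step can be sketched as below. The paper leaves the exact median convention implicit; here we assume, as is common in this line of work, a per-bit median taken over the batch (our own reading):

```python
import numpy as np

def binarize_by_median(Z):
    """Threshold continuous codes into binary form using per-bit medians.

    Z : (n_docs, m) matrix of latent means mu_phi(x) for a batch.
    Each element is set to 1 if it exceeds the median of its column
    (bit position) over the batch, else 0, yielding roughly balanced bits.
    """
    med = np.median(Z, axis=0)
    return (Z > med).astype(np.uint8)
```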

Semantic Hashing by Imposing Bernoulli Mixture Priors
To avoid the separate casting step of GMSH, and inspired by NASH, we further propose a semantic hashing model with a Bernoulli mixture prior (BMSH). Specifically, we replace the Gaussian mixture prior in GMSH with the following Bernoulli mixture prior

p(z) = Σ_{k=1}^{K} π_k Bernoulli(z; γ_k), (7)

where γ_k ∈ [0,1]^m contains the probabilities of the elements of z being 1. In addition to generating discrete samples, the Bernoulli mixture prior effectively plays a role similar to that of the Gaussian mixture prior: samples drawn from different components exhibit different patterns. A sample from the Bernoulli mixture can be generated by first choosing a component c ∈ {1, 2, ..., K} from Cat(π) and then drawing a sample from the chosen distribution Bernoulli(γ_c). The entire model can be described as p(D, z, c) = p_θ(D|z) p(z|c) p(c), where p_θ(D|z) is defined as in (2), p(c) = Cat(π), and p(z|c) = Bernoulli(γ_c). As with GMSH, the model can be trained by maximizing the variational lower bound, which retains the form of (6). Unlike in GMSH, where q_φ(z|x) and p(z|c) are both Gaussian, here p(z|c) is a Bernoulli distribution by definition, so q_φ(z|x) is assumed to take a Bernoulli form as well, with the probability of the i-th element z_i taking the value 1 defined as

q_φ(z_i = 1|x) = g_φ^i(x), (8)

for i = 1, 2, ..., m, where g_φ^i(·) denotes the i-th output (lying in [0, 1]) of a neural network parameterized by φ. Similarly, the posterior over which component to choose is defined as

q_φ(c = k|x) = exp(h_φ^k(x)) / Σ_{k'=1}^{K} exp(h_φ^{k'}(x)), (9)

where h_φ^k(x) is the k-th output of a neural network parameterized by φ. Denoting α_i = q_φ(z_i = 1|x) and β_k = q_φ(c = k|x), the last two terms in (6) can be expressed in closed form as

E_{q_φ(c|x)}[KL(q_φ(z|x)||p(z|c))] + KL(q_φ(c|x)||p(c)) = Σ_{k=1}^{K} β_k Σ_{i=1}^{m} [α_i log(α_i/γ_k^i) + (1−α_i) log((1−α_i)/(1−γ_k^i))] + Σ_{k=1}^{K} β_k log(β_k/π_k), (10)

where γ_k^i denotes the i-th element of γ_k. Due to the Bernoulli assumption on the posterior q_φ(z|x), the reparameterization trick commonly used for Gaussian distributions cannot be used to directly estimate the first term E_{q_φ(z|x)}[log p_θ(D|z)] in (6).
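The closed-form expression in (10) is straightforward to compute; a sketch (our own function, under the notation above, with a small epsilon added for numerical safety):

```python
import numpy as np

def mixture_kl_terms(alpha, beta, gamma, pi, eps=1e-8):
    """Closed-form KL terms of the BMSH lower bound (illustrative sketch).

    alpha : (m,)   alpha_i = q(z_i = 1 | x), Bernoulli posterior probs
    beta  : (K,)   beta_k = q(c = k | x), component posterior
    gamma : (K, m) gamma_k, Bernoulli prior probs of component k
    pi    : (K,)   mixture weights of the prior
    Returns E_{q(c|x)}[KL(q(z|x) || p(z|c))] + KL(q(c|x) || p(c)).
    """
    a, g = alpha[None, :], gamma
    # per-component KL between Bernoulli(alpha) and Bernoulli(gamma_k)
    kl_bern = (a * np.log((a + eps) / (g + eps))
               + (1 - a) * np.log((1 - a + eps) / (1 - g + eps))).sum(axis=1)
    kl_z = float(beta @ kl_bern)                        # weighted by q(c|x)
    kl_c = float((beta * np.log((beta + eps) / (pi + eps))).sum())
    return kl_z + kl_c
```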
Fortunately, inspired by the straight-through gradient estimator of (Bengio et al., 2013), we can parameterize the i-th element of a binary sample z from q_φ(z|x) as

z_i = (sign(q_φ(z_i = 1|x) − ξ_i) + 1) / 2,

where sign(·) is the sign function, equal to 1 for nonnegative inputs and −1 otherwise, and ξ_i ∼ Uniform(0, 1) is a uniform random sample between 0 and 1. This reparameterization is guaranteed to generate binary samples. However, backpropagation cannot be used directly to optimize the lower bound L, since the gradient of sign(·) with respect to its input is zero almost everywhere. To address this problem, the straight-through (ST) estimator (Bengio et al., 2013) is employed to estimate the gradient of the binary random variables: the derivative of z_i with respect to φ is simply approximated by treating the sign function as the identity in the backward pass, so that gradients can be backpropagated through the discrete variables. Similar to NASH, data-dependent noise is also injected into the latent variables when reconstructing the document x, so as to obtain more robust binary representations. The entire BMSH model, including generative and inference parts, is illustrated in Figure 1(b).
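A minimal illustration of the straight-through reparameterization (our own sketch; a real implementation would wire the backward rule into an autodiff framework rather than call it by hand):

```python
import numpy as np

def st_sample(alpha, xi):
    """Forward pass: z_i = (sign(alpha_i - xi_i) + 1) / 2.

    alpha : probabilities q(z_i = 1 | x); xi : Uniform(0, 1) samples.
    Ties (alpha_i == xi_i exactly) have probability zero and are ignored.
    """
    return (np.sign(alpha - xi) + 1.0) / 2.0

def st_grad(grad_z):
    """Backward pass of the straight-through estimator: the
    zero-almost-everywhere gradient of sign(.) is replaced by the
    identity, so gradients w.r.t. alpha are passed through unchanged."""
    return grad_z
```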
To understand how the mixture-prior model works differently from a simple-prior model, we examine the distinctive term E_{q_φ(c|x)}[KL(q_φ(z|x)||p(z|c))] in (6), where q_φ(c|x) is the approximate posterior probability that document x was generated by the c-th component distribution, with c ∈ {1, 2, ..., K}. In the mixture-prior model, the approximate posterior q_φ(z|x) is compared to all mixture components p(z|c), and the term E_{q_φ(c|x)}[KL(q_φ(z|x)||p(z|c))] can be understood as the average of these KL divergences weighted by the probabilities q_φ(c|x). Compared with a simple-prior model, the mixture-prior model is thus endowed with more flexibility, allowing documents to be regularized by different mixture components according to their context.

Extensions to Supervised Hashing
When label information is available, it can be leveraged to yield more effective hashing codes, since labels provide extra information about the similarities of documents. Specifically, a mapping from the latent representation z to the corresponding label y is learned for each document. This mapping encourages the latent representations of documents with the same label to be close in latent space, and those with different labels to be distant. A classifier built from a two-layer MLP is employed to parameterize this mapping, with its cross-entropy loss denoted by L_dis(z, y). Taking the supervised objective into account, the total loss is defined as

L_total = −L + α L_dis(z, y), (11)

where L is the lower bound arising in the GMSH or BMSH model and α controls the relative weight of the two losses. Minimizing L_total encourages the model to learn a representation z that accounts not only for the unsupervised content similarities of documents, but also for the supervised similarities derived from the extra label information.

Related Work
Existing hashing methods can be categorized into data-independent and data-dependent methods. A typical example of data-independent hashing is locality-sensitive hashing (LSH) (Datar et al., 2004). However, such methods usually require long hashing codes to achieve satisfactory performance. To yield more effective hashing codes, research has increasingly focused on data-dependent hashing methods, which include unsupervised and supervised methods. Unsupervised hashing methods use only unlabeled data to learn hash functions. For example, spectral hashing (SpH) (Weiss et al., 2009) learns the hash function by imposing balanced and uncorrelated constraints on the learned codes. Iterative quantization (ITQ) (Gong et al., 2013) generates hashing codes by simultaneously maximizing the variance of each binary bit and minimizing the quantization error.
In (Zhang et al., 2010), the authors proposed to decompose the learning procedure into two steps: first learning hashing codes for documents via unsupervised learning, and then training binary classifiers to predict the individual bits of the hashing codes. Since labels provide useful guidance for learning effective hash functions, supervised hashing methods have been proposed to leverage label information. For instance, binary reconstruction embedding (BRE) (Kulis and Darrell, 2009) learns the hash function by minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding hashing codes. Supervised hashing with kernels (KSH) (Liu et al., 2012) is a kernel-based method that utilizes pairwise information between samples to generate hashing codes, minimizing the Hamming distances of similar pairs and maximizing those of dissimilar pairs. Recently, VDSH (Chaidaroon and Fang, 2017) proposed to use a VAE to learn the latent representations of documents and then a separate stage to cast the continuous representations into binary codes. While fairly successful, this generative hashing model requires two-stage training. NASH tackled this problem by substituting the Gaussian prior in VDSH with a Bernoulli prior and using a straight-through estimator (Bengio et al., 2013) to estimate the gradients of the neural network involving binary variables, so that the model can be trained in an end-to-end manner. Our models differ from VDSH and NASH in that mixture priors are employed to yield better hashing codes, whereas only the simplest priors are used in both VDSH and NASH.

Training Details We experiment with the four models proposed in this paper, i.e., GMSH and BMSH for unsupervised hashing, and GMSH-S and BMSH-S for supervised hashing. The same network architectures as in VDSH and NASH are used in our experiments to admit a fair comparison.
Specifically, a two-layer feed-forward neural network with 500 hidden units and the ReLU activation function is employed as the encoder, and as the extra classifier in the supervised case, while the decoder is as stated in (2). As in VDSH and NASH (Chaidaroon and Fang, 2017), the TFIDF feature of a document is used as the input to the encoder. The Adam optimizer (Kingma and Ba, 2014) is used to train our models, with the learning rate set to 1 × 10⁻³ and a decay rate of 0.96 for every 10000 iterations. The component number K and the parameter α in (11) are determined on the validation set.
Evaluation Metrics For every document in the testing set, we retrieve similar documents from the training set based on the Hamming distances between their hashing codes. For each query, the 100 closest documents are retrieved, among which those sharing the same label as the query are deemed relevant. The ratio of the number of relevant documents to the total number retrieved (100) is computed as the similarity search precision, and the value averaged over all testing documents is reported. Retrieval precisions with 16-bit, 32-bit, 64-bit, and 128-bit hashing codes are evaluated, respectively.
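The precision metric can be sketched as follows (our own illustrative implementation of the procedure just described):

```python
import numpy as np

def retrieval_precision(query_code, query_label,
                        train_codes, train_labels, top=100):
    """Precision of the top-`top` retrieved documents for one query.

    Training documents are ranked by Hamming distance to the query code;
    precision is the fraction of retrieved documents sharing the
    query's label. Averaging over all test queries gives the reported metric.
    """
    dists = np.count_nonzero(train_codes != query_code[None, :], axis=1)
    nearest = np.argsort(dists, kind="stable")[:top]
    return float(np.mean(train_labels[nearest] == query_label))
```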

Performance Evaluation of Unsupervised Semantic Hashing

Table 1 shows the performance of the proposed and baseline models on three datasets under the unsupervised setting, with the number of hashing bits ranging from 16 to 128. From the experimental results, it can be seen that GMSH outperforms previous models under all considered scenarios on both TMC and Reuters. It also achieves better performance on 20Newsgroups when the hashing codes are long, e.g., 64 or 128 bits. Compared with VDSH, which uses a simple Gaussian prior, the proposed GMSH with a Gaussian mixture prior exhibits better overall retrieval performance. This strongly demonstrates the benefit of mixture priors for semantic hashing. One possible explanation is that the mixture prior enables documents from different categories to be regularized by different distributions, guiding the model to learn more distinguishable representations for documents from different categories. It can further be observed that, among all methods, BMSH consistently achieves the best performance across datasets and hashing code lengths. This may be attributed to the imposed Bernoulli mixture prior, which offers both the advantage of producing more distinguishable codes with a mixture prior and that of end-to-end training enabled by a Bernoulli prior. BMSH integrates the merits of NASH and GMSH, and is thus more suitable for the hashing task. Figure 2 shows how the retrieval precisions vary with the number of hashing bits on the three datasets. It can be observed that as the number increases from 32 to 128, the retrieval precisions of most previous models tend to decrease. This phenomenon is especially obvious for VDSH, whose precision drops by a significant margin on all three datasets.
This interesting phenomenon has been reported in previous works (Chaidaroon and Fang, 2017; Wang et al., 2013; Liu et al., 2012), and the reason could be overfitting, since models with long hashing codes are more likely to overfit (Chaidaroon and Fang, 2017). Our models, however, are more robust to the number of hashing bits: when the number is increased to 64 or 128, their performance remains almost unchanged. This may also be attributed to the mixture priors, which regularize the models more effectively.

Performance Evaluation of Supervised Semantic Hashing
We evaluate the performance of supervised hashing in this section. Table 2 shows the performance of different supervised hashing models on the three datasets under different lengths of hashing codes. We observe that all of the VAE-based generative hashing models (i.e., VDSH, NASH, GMSH and BMSH) exhibit better performance, demonstrating the effectiveness of generative models for semantic hashing. It can also be seen that BMSH-S achieves the best performance, suggesting that the advantages of the Bernoulli mixture prior extend to supervised scenarios.
To gain a better understanding of the relative performance gains of the four proposed models, the retrieval precisions of GMSH, BMSH, GMSH-S and BMSH-S with 32-bit hashing codes on the three datasets are plotted together in Figure 4. GMSH-S and BMSH-S clearly outperform GMSH and BMSH by a substantial margin, respectively. This suggests that the proposed generative hashing models can also leverage label information to improve the quality of the hashing codes.

Impacts of the Component Number
To investigate the impact of the component number, experiments are conducted for GMSH and BMSH under different values of K. For convenience of demonstration, the length of the hashing codes is fixed at 32. Table 3 shows the precision of the top 100 retrieved documents as the number of components K is set to different values. The retrieval precisions of the proposed models, especially BMSH, are quite robust to this parameter. For BMSH, the differences between the best and worst precisions on the three datasets are 0.0123, 0.0052 and 0.0134, respectively, which are small compared with the gains BMSH achieves. One exception is the performance of GMSH on the 20Newsgroups dataset. However, as seen from Table 3, as long as K is not too small, the performance loss is still acceptable. It is worth noting that the worst performance of GMSH on 20Newsgroups is 0.4708, which is still better than VDSH's 0.4327 in Table 1. For the BMSH model, the performance is stable across all considered datasets and values of K.

Visualization of Learned Embeddings
To better understand the performance gains of the proposed models, we visualize the learned representations of VDSH-S, GMSH-S and BMSH-S on the 20Newsgroups dataset. UMAP (McInnes et al., 2018) is used to project the 32-dimensional latent representations into a 2-dimensional space, as shown in Figure 3. Each data point in the figure denotes a document, with each color representing one category; the number shown with each color is the ground-truth category ID. It can be observed from Figures 3(a) and (b) that more embeddings are clustered correctly when the Gaussian mixture prior is used, confirming the advantage of mixture priors for the hashing task. Furthermore, the latent embeddings learned by BMSH-S can be clustered almost perfectly, whereas many embeddings are clustered incorrectly under the other two models. This observation is consistent with the conjecture that a mixture prior and end-to-end training are both useful for semantic hashing.

Conclusions
In this paper, deep generative models with mixture priors were proposed for the task of semantic hashing. We first proposed to use a Gaussian mixture prior, instead of the standard Gaussian prior of the VAE, to learn document representations, with a separate step to cast the continuous latent representations into binary hashing codes. To eliminate this separate casting step, we further proposed to use a Bernoulli mixture prior, which offers the advantages of both a mixture prior and end-to-end training. Experimental results against strong baselines on three public datasets indicate that the proposed mixture-prior methods outperform existing models by a substantial margin. In particular, the semantic hashing model with the Bernoulli mixture prior (BMSH) achieves state-of-the-art results on all three datasets considered in this paper.