Generative Semantic Hashing Enhanced via Boltzmann Machines

Generative semantic hashing is a promising technique for large-scale information retrieval thanks to its fast retrieval speed and small memory footprint. For tractability of training, existing generative-hashing methods mostly assume a factorized form for the posterior distribution, enforcing independence among the bits of hash codes. From the perspectives of both model representation and code-space size, however, independence is not always the best assumption. In this paper, to introduce correlations among the bits of hash codes, we propose to employ the Boltzmann-machine distribution as the variational posterior. To address the resulting intractability of training, we first develop an approximate method to reparameterize the Boltzmann-machine distribution by augmenting it as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Based on that, an asymptotically-exact lower bound is further derived for the evidence lower bound (ELBO). With these techniques, the entire model can be optimized efficiently. Extensive experimental results demonstrate that by effectively modeling correlations among the different bits within a hash code, our model can achieve significant performance gains.


Introduction
Similarity search, also known as nearest-neighbor search, aims to find items that are similar to a query from a large dataset. It plays an important role in modern information retrieval systems and has been used in various applications, ranging from plagiarism analysis (Stein et al., 2007) to content-based multimedia retrieval (Lew et al., 2006). However, looking for nearest neighbors in the Euclidean space is often computationally prohibitive for large-scale datasets, since calculating cosine similarity with high-dimensional vectors is expensive. Semantic hashing circumvents this problem by representing semantically similar documents with compact binary codes. Accordingly, similar documents can be retrieved much more efficiently by evaluating the Hamming distances of their hash codes.
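To make the retrieval step concrete, a minimal NumPy sketch of Hamming-distance ranking is given below (the function names are ours, purely illustrative):

```python
import numpy as np

def hamming_distances(query_code, code_db):
    """Hamming distances between one binary code and a database of codes.

    query_code: (m,) array of {0, 1}; code_db: (n, m) array of {0, 1}.
    """
    return np.sum(code_db != query_code, axis=1)

def retrieve(query_code, code_db, top_k=100):
    """Indices of the top_k database codes closest to the query."""
    d = hamming_distances(query_code, code_db)
    return np.argsort(d, kind="stable")[:top_k]
```

Because the distances are small integers computed by bitwise comparison, this ranking is far cheaper than cosine similarity over dense high-dimensional vectors.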
To obtain similarity-preserving hash codes, extensive efforts have been made to learn hash functions that preserve the similarity information of original documents in the binary embedding space (Shen et al., 2015; Liu et al., 2016). Existing methods often require the availability of label information, which is expensive to obtain in practice. To avoid the use of labels, generative semantic hashing methods have been developed. Specifically, the variational autoencoder (VAE) was first employed for semantic hashing in (Chaidaroon and Fang, 2017), and the model is termed VDSH. As a two-step process, the continuous document representations obtained from the VAE are directly converted into binary hash codes. To resolve this two-step training problem, NASH (Shen et al., 2018) replaces the continuous Gaussian prior in VDSH with a Bernoulli prior. By utilizing the straight-through (ST) technique (Bengio et al., 2013), the model can be trained in an end-to-end manner while keeping the merits of VDSH. Recently, to further improve the quality of hash codes, mixture priors are investigated in BMSH (Dong et al., 2019), while more accurate gradient estimators are studied in Doc2hash (Zhang and Zhu, 2019), both under a framework similar to NASH.
Due to the training-tractability issue, the aforementioned generative hashing methods all assume a factorized variational form for the posterior, e.g., independent Gaussian in VDSH and independent Bernoulli in NASH, BMSH and Doc2hash. This assumption prevents the models from capturing dependencies among the bits of hash codes. Although uncorrelated bits are sometimes preferred in hashing, as reported in (Zhang and Li, 2014), this may not apply to generative semantic hashing. This is because the independence assumption could severely limit a model's ability to yield meaningful representations and thereby produce high-quality hash codes. Moreover, as the code length increases (to, e.g., 128 bits), the number of possible codes (the code space) becomes too large for a dataset with a limited number of data points. As a result, we advocate that correlations among the bits of a hash code should be modeled properly to restrict the embedding space, enabling a model to work effectively under a broad range of code lengths.
To introduce correlations among the bits of hash codes, we propose to adopt the Boltzmann-machine (BM) distribution (Ackley et al., 1985) as the variational posterior, which can capture various complex correlations. One issue with this choice is that training becomes inefficient compared to existing methods. To address this issue, we first prove that the BM distribution can be augmented as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Using this result, we then show that samples from BM distributions can be reparameterized easily. To enable efficient learning, an asymptotically-exact lower bound of the standard evidence lower bound (ELBO) is further developed to deal with the notorious normalization term of Boltzmann machines. With the proposed reparameterization and the new lower bound, our model can be trained as efficiently as previous generative hashing models that preserve no bit correlations. Extensive experiments are conducted to evaluate the performance of the proposed model. On all three public datasets considered, the proposed model achieves the best performance among all comparable models. In particular, thanks to the introduced correlations, the performance of the proposed model does not deteriorate as the code length increases, which is somewhat contrary to what has been observed in other generative hashing models.

Preliminaries
Generative Semantic Hashing In the context of generative semantic hashing, each document is represented by a sequence of words x = {w_1, w_2, · · · , w_{|x|}}, where w_i is the i-th word and is denoted by a |V|-dimensional one-hot vector; |x| and |V| denote the document size (number of words) and the vocabulary size, respectively. Each document x is modeled by a joint probability

p_θ(x, s) = p_θ(x|s) p(s), (1)

where s is a latent variable representing the document's hash code. With the probability p_θ(x, s) trained on a set of documents, the hash code for a document x can be derived directly from the posterior distribution p_θ(s|x). In existing works, the likelihood function, or the decoder, takes the form

p_θ(x|s) = ∏_{i=1}^{|x|} p_θ(w_i|s), with p_θ(w_i = e_j|s) = exp(s^T E e_j) / Σ_{j'=1}^{|V|} exp(s^T E e_{j'}), (2)

where E ∈ R^{m×|V|} is the matrix connecting the latent code s and the one-hot representations of words, and e_j is the one-hot vector whose only '1' is located at the j-th position. Documents could be modeled better by using more expressive likelihood functions, e.g., deep neural networks, but as explained in (Shen et al., 2018), such functions are more likely to destroy the crucial distance-keeping property for semantic hashing. Thus, the simple form of (2) is often preferred in generative hashing. As for the prior distribution p(s), it is often chosen as the standard Gaussian distribution as in VDSH (Chaidaroon and Fang, 2017), or the Bernoulli distribution as in NASH and BMSH (Shen et al., 2018; Dong et al., 2019).
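For concreteness, the linear softmax decoder described above can be sketched as follows (a minimal NumPy illustration; `log_likelihood` and its arguments are our own names, not from the referenced works):

```python
import numpy as np

def log_likelihood(s, E, word_ids):
    """log p(x|s) under the linear softmax decoder: each word w_j of the
    document is scored by s^T E e_j and normalized over the vocabulary.

    s: (m,) binary code; E: (m, V) decoding matrix; word_ids: indices of
    the document's words in the vocabulary.
    """
    logits = s @ E                                    # (V,) scores for all words
    logits = logits - logits.max()                    # numerical stabilization
    log_probs = logits - np.log(np.exp(logits).sum())
    return log_probs[word_ids].sum()
```

The decoder is deliberately linear in s, matching the distance-keeping argument above: documents with nearby codes receive similar word distributions.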
Inference Probabilistic models can be trained by maximizing the log-likelihood log p_θ(x) with p_θ(x) = ∫ p_θ(x, s) ds. However, due to the intractability of calculating p_θ(x), we instead optimize its evidence lower bound (ELBO):

L = E_{q_φ(s|x)}[log p_θ(x|s)] − KL(q_φ(s|x) || p(s)), (3)

where q_φ(s|x) is the proposed variational posterior parameterized by φ. It can be shown that log p_θ(x) ≥ L holds for any q_φ(s|x), and that the closer q_φ(s|x) is to the true posterior p_θ(s|x), the tighter the bound L becomes. Training then reduces to maximizing the lower bound L w.r.t. θ and φ. In VDSH (Chaidaroon and Fang, 2017), q_φ(s|x) takes the form of an independent Gaussian distribution

q_φ(s|x) = N(s; μ_φ(x), diag(σ²_φ(x))), (4)

where μ_φ(x) and σ_φ(x) are two vector-valued functions parameterized by multi-layer perceptrons (MLPs) with parameters φ. Later, in NASH and BMSH (Shen et al., 2018; Dong et al., 2019), q_φ(s|x) is defined as an independent Bernoulli distribution, i.e.,

q_φ(s|x) = Bernoulli(g_φ(x)), (5)

where g_φ(x) is also a vector-valued function parameterized by an MLP, and the value at each dimension represents the probability of the bit being 1 at that position. The MLP used to parameterize the posterior q_φ(s|x) is also referred to as the encoder network.
One key requirement for efficient end-to-end training of generative hashing methods is the availability of a reparameterization for the variational distribution q_φ(s|x). For example, when q_φ(s|x) is a Gaussian distribution as in (4), a sample s from it can be efficiently reparameterized as

s = μ_φ(x) + σ_φ(x) ⊙ ε, (6)

with ε ∼ N(0, I). When q_φ(s|x) is a Bernoulli distribution as in (5), a sample from it can be reparameterized as

s = (sign(g_φ(x) − ε) + 1) / 2, (7)

where ε ∈ R^m with elements ε_i ∼ Uniform(0, 1). With these reparameterization tricks, the lower bound in (3) can be estimated with a single sample s as

L ≈ log p_θ(x|s_φ) − KL(q_φ(s|x) || p(s)), (8)

where s has been denoted as s_φ to explicitly indicate its dependence on φ. To train these hashing models, the backpropagation algorithm can be employed to estimate the gradient of (8) w.r.t. θ and φ easily. However, it is worth noting that in order to use the reparameterization trick, all existing methods assume a factorized form for the proposed posterior q_φ(s|x), as shown in (4) and (5). This means the binary bits in hash codes are independent of each other, which is not the best setting for generative semantic hashing.
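The two reparameterizations in (6) and (7) can be sketched as follows (a NumPy illustration with our own helper names):

```python
import numpy as np

def gaussian_reparam(mu, sigma, rng):
    """Eq. (6): s = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + sigma * eps

def bernoulli_reparam(g, rng):
    """Eq. (7): s = (sign(g - eps) + 1) / 2 with eps_i ~ Uniform(0, 1).

    Bit s_i is 1 exactly when eps_i < g_i, i.e. with probability g_i.
    """
    eps = rng.uniform(size=g.shape)
    return (np.sign(g - eps) + 1) / 2
```

In both cases the randomness is isolated in a noise variable that does not depend on φ, which is what allows gradients to flow through μ_φ, σ_φ or g_φ.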

Correlation-Enhanced Generative Semantic Hashing
In this section, we present a scalable and efficient approach to introducing correlations into the bits of hash codes, by using a Boltzmann-machine distribution as the variational posterior with approximate reparameterization.

Boltzmann Machine as the Variational Posterior
Many probability distributions defined over binary variables s ∈ {0, 1}^m are able to capture dependencies among the variables. Among them, the most famous one is the Boltzmann-machine distribution (Ackley et al., 1985), which takes the following form:

q(s) = (1/Z) exp(½ s^T Σ s + μ^T s), (9)

where Σ ∈ R^{m×m} and μ ∈ R^m are the distribution parameters, and Z = Σ_s exp(½ s^T Σ s + μ^T s) is the normalization constant. The Boltzmann-machine distribution can be adopted to model correlations among the bits of a hash code. Specifically, we restrict the posterior to the Boltzmann form

q_φ(s|x) = (1/Z_φ) exp(½ s^T Σ_φ(x) s + μ_φ(x)^T s), (10)

where Σ_φ(x) and μ_φ(x) are functions parameterized by the encoder network with parameters φ and input x. Substituting (10) into the lower bound (3), we can write the lower bound as

L = E_{q_φ(s|x)}[log p_θ(x|s)p(s) − ½ s^T Σ_φ(x) s − μ_φ(x)^T s] + log Z_φ. (11)

One problem with such modeling is that the expectation term E_{q_φ(s|x)}[·] in (11) cannot be expressed in a closed form due to the complexity of q_φ(s|x). Consequently, one cannot directly optimize the lower bound L w.r.t. θ and φ.

Reparameterization
An alternative is to approximate the expectation term using a reparameterized sample s from q_φ(s|x), as was done in previous uncorrelated generative hashing models (see (6) and (7)). However, unlike those simple variational distributions, there is no existing work on how to reparameterize the complicated Boltzmann-machine distribution. To this end, we first show that the Boltzmann-machine distribution can be equivalently written as the composition of an approximately Gaussian distribution and a Bernoulli distribution.

Proposition 1. A Boltzmann-machine distribution b(s) = (1/Z) e^{½ s^T Σ s + μ^T s} with Σ ≻ 0 can be equivalently expressed as the composition of two distributions, that is,

b(s) = ∫ p(s|r) p(r) dr, (12)

where p(r) = (1/Z) ∏_{i=1}^m (e^{r_i} + 1) · N(r; μ, Σ); p(s|r) = ∏_{i=1}^m p(s_i|r_i), with s_i and r_i denoting the i-th elements of s and r; and p(s_i|r_i) = Bernoulli(σ(r_i)), with σ(·) being the sigmoid function.
Proof. See Appendix A.1 for details.
Based on Proposition 1, a sample from the Boltzmann-machine posterior q_φ(s|x) in (10) can be drawn hierarchically as

r ∼ q_φ(r|x), s ∼ Bernoulli(σ(r)), (13)

where

q_φ(r|x) = (1/Z_φ) ∏_{i=1}^m (e^{r_i} + 1) · N(r; μ_φ(x), Σ_φ(x)), (14)

and σ(·) is applied to its argument element-wise. From the expression of q_φ(r|x), we can see that for small values of r_i, the influence of the factor (e^{r_i} + 1) on the overall distribution is negligible, and thus q_φ(r|x) is well approximated by the Gaussian distribution N(r; μ_φ(x), Σ_φ(x)). For relatively large r_i, the term (e^{r_i} + 1) mainly influences the distribution mean, roughly shifting the Gaussian distribution N(r; μ_φ(x), Σ_φ(x)) by an amount approximately equal to its variance. For the problems of interest in this paper, the variances of the posterior distribution are often small, hence it is reasonable to approximate samples from q_φ(r|x) by samples from N(r; μ_φ(x), Σ_φ(x)).
With this approximation, we can now draw samples from the Boltzmann-machine distribution q_φ(s|x) in (10) approximately by the two steps below:

r ∼ N(r; μ_φ(x), Σ_φ(x)), (15)
s ∼ Bernoulli(σ(r)). (16)

For the Gaussian sample r ∼ N(r; μ_φ(x), Σ_φ(x)), similar to (6), it can be reparameterized as

r = μ_φ(x) + L_φ(x) ε, ε ∼ N(0, I), (17)

where L_φ(x) is a matrix satisfying L_φ(x) L_φ(x)^T = Σ_φ(x). It should be noted that in practice, we can define the function L_φ(x) in advance and then obtain Σ_φ(x) = L_φ(x) L_φ(x)^T; thus the Cholesky decomposition is not needed.
Given the Gaussian sample r, similar to the reparameterization of Bernoulli variables in (7), we can reparameterize the Bernoulli sample s ∼ Bernoulli(σ(r)) as s = (sign(σ(r) − u) + 1)/2, where u ∈ R^m with each element u_i ∼ Uniform(0, 1). By combining the above reparameterizations, a sample from the Boltzmann-machine distribution q_φ(s|x) can then be approximately reparameterized as

s_φ = (sign(σ(μ_φ(x) + L_φ(x) ε) − u) + 1) / 2, (18)

where the subscript φ explicitly indicates that the sample is expressed in terms of φ.
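Putting these steps together, the approximate reparameterization of a Boltzmann-machine sample can be sketched as follows (a NumPy illustration with our own naming):

```python
import numpy as np

def boltzmann_reparam(mu, L, rng):
    """Approximate reparameterized sample from the Boltzmann posterior.

    mu: (m,) mean mu_phi(x); L: (m, m) factor with Sigma_phi(x) = L L^T.
    Follows the two-step scheme: a correlated Gaussian sample, then a
    reparameterized Bernoulli draw through a sigmoid.
    """
    eps = rng.normal(size=mu.shape)            # Gaussian noise
    r = mu + L @ eps                           # correlated Gaussian sample
    u = rng.uniform(size=mu.shape)             # uniform noise for binarization
    sigmoid = 1.0 / (1.0 + np.exp(-r))
    return (np.sign(sigmoid - u) + 1) / 2      # binary code with correlated bits
```

The correlations among bits enter entirely through the off-diagonal structure of L: if L is diagonal, the scheme reduces to the factorized case.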
With the reparameterization s_φ, the expectation term in (11) can be approximated by its single-sample estimate log p_θ(x|s_φ)p(s_φ) − ½ s_φ^T Σ_φ(x) s_φ − μ_φ(x)^T s_φ. Consequently, the gradients of this term w.r.t. both θ and φ can be evaluated efficiently by backpropagation, with the only difficulty lying in the non-differentiable function sign(·) of s_φ in (18). Many works have been devoted to estimating gradients involving discrete random variables (Bengio et al., 2013; Jang et al., 2017; Tucker et al., 2017; Grathwohl et al., 2018; Yin and Zhou, 2019). Here, we adopt the simple straight-through (ST) technique (Bengio et al., 2013), which has been found to perform well in many applications. By simply treating the hard threshold function sign(·) as the identity function, the ST technique estimates the gradient as

∂ sign(x) / ∂x ≈ 1. (19)

Then, the gradient of the first term in the ELBO L w.r.t. φ can be computed efficiently by backpropagation.
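As a sketch of the ST idea, separated into its forward and backward rules (illustrative names; real implementations fold this into an autodiff framework):

```python
import numpy as np

def st_binarize_forward(p, u):
    """Forward pass: hard threshold, s = (sign(p - u) + 1) / 2."""
    return (np.sign(p - u) + 1) / 2

def st_binarize_backward(grad_s):
    """Backward pass under the ST estimator: sign(.) is treated as the
    identity, so the incoming gradient is passed through unchanged."""
    return grad_s
```

The forward pass keeps the codes exactly binary, while the backward pass is biased but low-variance, which is the usual trade-off motivating the ST estimator.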

An Asymptotically-Exact Lower Bound
To optimize the ELBO in (11), we still need the gradient of log Z_φ, which is notoriously difficult to calculate. A common way is to estimate the gradient ∂ log Z_φ / ∂φ by MCMC methods (Tieleman, 2008; Desjardins et al., 2010; Su et al., 2017a,b), which are computationally expensive and often of high variance. By noticing the special form of the ELBO (11), we instead develop a lower bound of the ELBO L in which the log Z_φ term is conveniently cancelled out. Specifically, we introduce another probability distribution h(s) and lower bound the original ELBO as

L̃(θ, φ) = L − KL(h(s) || q_φ(s|x)). (20)

Since KL(·||·) ≥ 0, L̃(θ, φ) ≤ L holds for all h(s), i.e., L̃ is a lower bound of L, and it equals the ELBO L when h(s) = q_φ(s|x). The choice of h(s) should reduce the gap between L̃ and L as much as possible, while keeping the optimization tractable. Balancing the two sides, a mixture distribution is used:

h_k(s) = (1/k) Σ_{i=1}^k p(s|r^(i)), (21)

where k denotes the number of components; p(s|r^(i)) is the factorized multivariate Bernoulli distribution defined in Proposition 1, and r^(i) is the i-th sample drawn from q_φ(r|x) as defined in (14). By substituting h_k(s) into (20) and taking the expectation w.r.t. the samples r^(i), we obtain

L_k = E_{q_φ(r^(1···k)|x)}[L − KL(h_k(s) || q_φ(s|x))]. (22)

It can be proved that the bound L_k gradually approaches the ELBO L as k increases, and equals it as k → ∞. Specifically, we have

Proposition 2. For any integer k, the lower bound L_k of the ELBO satisfies: 1) L_{k+1} ≥ L_k; 2) lim_{k→∞} L_k = L.
Proof. See Appendix A.2 for details.
By substituting L in (11) and h_k(s) in (21) into (22), the bound can be further written as

L_k = E_{q_φ(s|x)}[log p_θ(x|s)p(s) − ½ s^T Σ_φ(x) s − μ_φ(x)^T s] − E_{q_φ(r^(1···k)|x)} E_{h_k(s)}[log h_k(s) − ½ s^T Σ_φ(x) s − μ_φ(x)^T s], (23)

where the log Z_φ term is cancelled out since it appears in both terms with opposite signs. The first term in (23), as discussed at the end of the Reparameterization subsection, can be approximated as log p_θ(x|s_φ)p(s_φ) − ½ s_φ^T Σ_φ(x) s_φ − μ_φ(x)^T s_φ using the reparameterized sample s_φ in (18). For the second term, each sample r^(i) for i = 1, · · · , k can be approximately reparameterized as in (17). Given r^(i) for i = 1, · · · , k, samples from h_k(s) can also be reparameterized in a similar way as for the Bernoulli distribution in (7). Thus, samples drawn from r^(1···k) ∼ q_φ(r^(1···k)|x) and s ∼ h_k(s) are also reparameterizable, as detailed in Appendix A.3. Denoting this reparameterized sample by s̃_φ, we can approximate the second term in (23) as log h_k(s̃_φ) − ½ s̃_φ^T Σ_φ(x) s̃_φ − μ_φ(x)^T s̃_φ. Thus the lower bound (23) becomes

L̂_k = log p_θ(x|s_φ)p(s_φ) − ½ s_φ^T Σ_φ(x) s_φ − μ_φ(x)^T s_φ − log h_k(s̃_φ) + ½ s̃_φ^T Σ_φ(x) s̃_φ + μ_φ(x)^T s̃_φ. (24)

With discrete gradient estimation techniques such as the ST method, the gradient of L̂_k w.r.t. θ and φ can then be evaluated efficiently by backpropagation. Proposition 2 indicates that the exact L_k gets closer to the ELBO as k increases, so a better bound can also be expected for the approximated L̂_k as k increases. In practice, a moderate value of k is found to be sufficient to deliver good performance.
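For illustration, the mixture log-probability log h_k(s) of (21) can be computed stably with a log-sum-exp, as sketched below (the helper name is ours):

```python
import numpy as np

def log_hk(s, R):
    """Numerically stable log h_k(s) for the uniform Bernoulli mixture.

    s: (m,) binary code; R: (k, m) matrix whose rows are the samples r^(i);
    each mixture component is a factorized Bernoulli with probs sigmoid(r^(i)).
    """
    p = 1.0 / (1.0 + np.exp(-R))                                     # (k, m)
    comp_log = (s * np.log(p) + (1 - s) * np.log1p(-p)).sum(axis=1)  # (k,)
    c = comp_log.max()                  # log-sum-exp over the uniform mixture
    return c + np.log(np.exp(comp_log - c).mean())
```

Unlike log Z_φ, this quantity involves only the k sampled components, which is what makes the bound tractable.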

Low-Rank Perturbation for the Covariance Matrix
In the reparameterization of a Gaussian sample in (17), an m × m matrix L_φ(x) is required, with m denoting the length of hash codes. The elements of L_φ(x) are often designed as the outputs of neural networks parameterized by φ. Therefore, if m is large, the number of neural network outputs will be too large. To overcome this issue, a more parameter-efficient strategy called low-rank perturbation is employed, which restricts the covariance matrix to the form

Σ = D + U U^T, (25)

where D is a diagonal matrix with positive entries and U = [u_1, u_2, · · · , u_v] is a low-rank perturbation matrix with u_i ∈ R^m and v ≪ m. Under this low-rank perturbed Σ, the Gaussian samples can be reparameterized as

r = μ_φ(x) + D^{1/2} ε_1 + U ε_2, (26)

where ε_1 ∼ N(0, I_m) and ε_2 ∼ N(0, I_v). We can simply replace (17) with the above expression wherever r is used. In this way, the number of neural network outputs can be dramatically reduced from m² to mv.
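A sample under the low-rank perturbed covariance can be drawn without ever forming the full m × m factor, as sketched below (NumPy, illustrative names):

```python
import numpy as np

def low_rank_gaussian_sample(mu, d, U, rng):
    """Sample r ~ N(mu, D + U U^T) without forming an m x m factor.

    mu: (m,) mean; d: (m,) positive diagonal of D; U: (m, v) perturbation.
    Uses r = mu + sqrt(d) * eps1 + U eps2, eps1 ~ N(0, I_m), eps2 ~ N(0, I_v),
    whose covariance is exactly D + U U^T.
    """
    eps1 = rng.normal(size=mu.shape)
    eps2 = rng.normal(size=U.shape[1])
    return mu + np.sqrt(d) * eps1 + U @ eps2
```

Since the two noise vectors are independent, the covariance of r is Cov(√d ⊙ ε_1) + Cov(U ε_2) = D + U U^T, exactly the restricted form above.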

Related Work
Semantic hashing (Salakhutdinov and Hinton, 2009) is a promising technique for fast approximate similarity search. Locality-sensitive hashing (Datar et al., 2004), one of the most popular hashing methods, projects documents into low-dimensional hash codes in a randomized manner. However, this method does not leverage any information of the data, and thus generally performs much worse than data-dependent methods. Among the data-dependent methods, one mainstream direction is supervised hashing, which learns a function that outputs similar hash codes for semantically similar documents by making effective use of label information (Shen et al., 2015; Liu et al., 2016). Different from supervised methods, unsupervised hashing pays more attention to the intrinsic structure of data, without making use of labels. Spectral hashing (Weiss et al., 2009), for instance, learns balanced and uncorrelated hash codes by seeking to preserve a global similarity structure of documents. Self-taught hashing (Zhang et al., 2010), on the other hand, focuses more on preserving local similarities among documents and presents a two-stage training procedure to obtain such hash codes. In contrast, to generate high-quality hash codes, iterative quantization (Gong et al., 2013) aims to minimize the quantization error while maximizing the variance of each bit.
Among the unsupervised hashing methods, the idea of generative semantic hashing has gained much interest in recent years. Under the VAE framework, VDSH (Chaidaroon and Fang, 2017) was proposed to first learn continuous latent representations of documents, which are then cast into binary codes. While semantic hashing is achieved nicely with generative models, the two-stage training procedure is problematic and prone to yielding poor local optima. To address this issue, NASH (Shen et al., 2018) went one step further and presented an integrated framework that enables end-to-end training by using a discrete Bernoulli prior and the ST technique, which estimates the gradient of functions of discrete variables. Since then, various directions have been explored to improve the performance of NASH. (Dong et al., 2019) proposed to employ mixture priors to improve the model's capability to distinguish documents from different categories, thereby improving the quality of hash codes. On the other hand, a more accurate gradient estimator, Gumbel-Softmax (Jang et al., 2017), is explored in Doc2hash (Zhang and Zhu, 2019) to replace the ST estimator in NASH. More recently, to better model the similarities between different documents, (Hansen et al., 2019) investigated the combination of generative models and ranking schemes to generate hash codes. Different from the aforementioned generative semantic hashing methods, in this paper we focus on how to incorporate correlations into the bits of hash codes.

Experimental Setup
Datasets Following previous works, we evaluate our model on three public benchmark datasets: i) Reuters21578, which consists of 10788 documents with 90 categories; ii) 20Newsgroups, which contains 18828 newsgroup posts from 20 different topics; iii) TMC, which is a collection of 21519 documents categorized into 22 classes.
Training Details For the convenience of comparison, we use the same network architecture as in NASH and BMSH. Specifically, a 2-layer feed-forward neural network with 500 hidden units and ReLU activation functions is used as the inference network, which receives the TF-IDF of a document as input and outputs the mean and covariance matrix of the Gaussian random variable r. During training, dropout (Srivastava et al., 2014) is used to alleviate overfitting, with the keep probability selected from {0.8, 0.9} based on performance on the validation set. The Adam optimizer (Kingma and Ba, 2014) is used to train our model, with the learning rate set to 0.001 initially and then decayed every 10000 iterations. For all experiments, across datasets and lengths of hash codes, the rank v of the matrix U is set to 10 and the number of components k in the distribution h_k(s) is set to 10 consistently; a systematic ablation study is conducted below to investigate their impact on the final performance.

Evaluation Metrics
The performance of our proposed approach is measured by retrieval precision, i.e., the ratio of the number of relevant documents to the number of retrieved documents. A retrieved document is deemed relevant if its label is the same as that of the query. Specifically, during the evaluation phase, we first pick out the top 100 most similar documents for each query document according to the Hamming distances of their hash codes, from which the precision is calculated. The precision averaged over all query documents is reported as the final performance.
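The evaluation protocol above can be sketched as follows (a NumPy illustration with our own function name; k = 100 in the experiments):

```python
import numpy as np

def precision_at_k(query_codes, query_labels, db_codes, db_labels, k=100):
    """Average retrieval precision over all queries: the fraction of the k
    Hamming-nearest database codes whose label matches the query label."""
    precisions = []
    for code, label in zip(query_codes, query_labels):
        d = np.sum(db_codes != code, axis=1)          # Hamming distances
        top = np.argsort(d, kind="stable")[:k]        # k nearest codes
        precisions.append(np.mean(db_labels[top] == label))
    return float(np.mean(precisions))
```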

Results of Generative Semantic Hashing
The retrieval precisions on the TMC, Reuters and 20Newsgroups datasets are reported in Tables 1, 2 and 3, respectively, under different lengths of hash codes. Compared to NASH, a generative hashing method that does not consider correlations, the proposed method, which introduces correlations among bits simply by employing the Boltzmann-machine distribution as the posterior, performs significantly better on all three datasets. This strongly corroborates the benefits of taking correlations into account when learning hash codes. From the tables, we also observe that the proposed model even outperforms BMSH, an enhanced variant of NASH that employs a more complicated mixture distribution as the prior. Since only the simplest prior is used in the proposed model, larger performance gains can be expected if mixture priors are used as in BMSH. Notably, a recent work named RBSH (Hansen et al., 2019) improves NASH by specifically ranking the documents according to their similarities. However, since it employs a data preprocessing technique different from that of existing works, we cannot include its results for a direct comparison here. Nevertheless, we trained our model on their preprocessed datasets and found that our method still outperforms it. For details, please refer to Appendix A.4.
Moreover, when examining the retrieval performance of hash codes under different lengths, it is observed that the performance of our proposed method never deteriorates as the code length increases, while the other models start to perform poorly after the code length reaches a certain level. For the most comparable methods, VDSH, NASH and BMSH, the performance at 128 bits is generally much worse than that at 64 bits. This phenomenon is illustrated more clearly in Figure 1. This may be attributed to the fact that for hash codes without correlations, the number of possible codes increases exponentially with the code length. Because the code space becomes too large, the probability of assigning similar items to nearby binary codes may decrease significantly. For the proposed model, in contrast, since the bits of hash codes are correlated with each other, the effective number of codes is determined by the strength of correlations among bits, effectively restricting the size of the code space. Therefore, even as the code length continues to increase, the performance of our proposed model does not deteriorate.

Empirical Study of Computational Efficiency
To show the computational efficiency of our proposed method, we report in Table 4 the average running time per epoch on GPU on the TMC dataset, the largest of the considered datasets. As a benchmark, the average training time of vanilla NASH is 2.553s per epoch. It can be seen that, because of the low-rank parameterization of the covariance matrix, the proposed model can be trained almost as efficiently as vanilla NASH, while delivering much better performance.

Hash Codes Visualization
To further investigate the capability of different models to generate semantics-preserving binary codes, we project the hash codes produced by VDSH, NASH and our proposed model on the 20Newsgroups dataset onto a two-dimensional plane using the widely adopted UMAP technique (McInnes et al., 2018).

Analyses on the Impacts of v and k
Ranks v The low-rank perturbed covariance matrix enables the proposed model to trade off between complexity and performance: a larger v allows the model to capture more dependencies among the latent variables, but the required computational complexity also increases. To investigate its impact, we evaluate the performance of the 64-bit hash codes obtained from the proposed model under different values of v, with the other key parameter k fixed to 10. The results are listed in the left half of Table 5. Notably, the proposed model with v = 0 is equivalent to NASH, since no correlation between the binary random variables is modeled. It can be seen that as the rank increases, the retrieval precision also increases, justifying the hypothesis that employing posteriors with correlations increases the model's representational capacity and thereby improves the quality of the hash codes. It is worth noting that the most significant performance improvement is observed between the models with v = 0 and v = 1; as v continues to increase, the improvement becomes relatively small. This indicates that v can be set to a relatively small value to save computational resources while retaining competitive performance.
The number of mixture components k As stated earlier, increasing the number of components k in the mixture distribution h_k(s) reduces the gap between the lower bound L_k and the ELBO L. To investigate the impact of k, the retrieval precisions of the proposed model are evaluated under different values of k, with the other key parameter set to v = 10. It can be seen from the right half of Table 5 that as the number of components k increases, the retrieval precision also increases gradually, suggesting that a tighter lower bound L_k generally leads to better hash codes. Hence, if more mixture components are used, better hash codes can be expected. For the sake of computational complexity, at most 10 components are used in the experiments.

Conclusion
In this paper, by employing the Boltzmann-machine distribution as the posterior, we show that correlations can be efficiently introduced into the bits of hash codes. To facilitate training, we first show that the BM distribution can be augmented as a hierarchical concatenation of a Gaussian-like distribution and a Bernoulli distribution. Then, an asymptotically-exact lower bound of the ELBO is developed to tackle the tricky normalization term of Boltzmann machines. Significant performance gains are observed in the experiments after introducing correlations into the bits of hash codes.

A Appendices
A.1 Proof of Proposition 1 Proof. Note that σ(r_i)^{s_i} (1 − σ(r_i))^{1−s_i} (e^{r_i} + 1) = e^{r_i s_i}. Making use of the completing-the-square technique, the joint distribution of r and s can therefore be decomposed as

p(s|r) p(r) = (1/Z) e^{r^T s} N(r; μ, Σ) = q(r|s) q(s),

where q(r|s) = N(r; Σs + μ, Σ) and q(s) = (1/Z) e^{μ^T s + ½ s^T Σ s}.
From the above, the marginal distribution q(s) is exactly the Boltzmann-machine distribution b(s), which completes the proof.
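As a quick numerical sanity check of Proposition 1 in the one-dimensional case (with arbitrarily chosen scalar parameters; this script is ours, not part of the original proof):

```python
import numpy as np

# One-dimensional check of Proposition 1: marginalizing r out of
# p(s|r) p(r) should recover the Boltzmann probabilities.
Sigma, mu = 0.7, -0.3

# Boltzmann probability of s = 1: b(1) = e^{Sigma/2 + mu} / (1 + e^{Sigma/2 + mu})
w1 = np.exp(0.5 * Sigma + mu)
b1 = w1 / (1.0 + w1)

# Grid-based marginalization of the augmented model:
# p(r) ∝ (e^r + 1) N(r; mu, Sigma),  p(s = 1 | r) = sigmoid(r)
r = np.linspace(-20.0, 20.0, 200001)
gauss = np.exp(-(r - mu) ** 2 / (2.0 * Sigma)) / np.sqrt(2.0 * np.pi * Sigma)
unnorm = (np.exp(r) + 1.0) * gauss
sigmoid = 1.0 / (1.0 + np.exp(-r))
marginal_s1 = (sigmoid * unnorm).sum() / unnorm.sum()   # grid widths cancel
```

The agreement follows analytically from σ(r)(e^r + 1) = e^r and the Gaussian moment-generating function, so the grid computation should match b(1) to high accuracy.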

A.2 Proof of Proposition 2
We show the following facts about the proposed lower bound L_k of the ELBO. First, for any integer k, we have L_{k+1} ≥ L_k. For brevity, we denote E_{q_φ(r^(1,···,k)|x)} as E_{r^{1..k}}. Due to the symmetry of the indices, the following equality holds for any i:

E_{r^{1..k}} E_{q(s|r^(1))}[log h_k(s)] = E_{r^{1..k}} E_{q(s|r^(i))}[log h_k(s)]. (27)
From this, applying the equality (27) together with Jensen's inequality gives L_{k+1} ≥ L_k. We now show that lim_{k→∞} L_k = L. According to the strong law of large numbers, h_k(s) = (1/k) Σ_j q(s|r^(j)) converges to E_{q(r|x)}[q(s|r)] = q(s|x) almost surely. We then have lim_{k→∞} E_{r^{1..k}}[KL(h_k(s) || q(s|x))] = 0.
Therefore, L k approaches L as k approaches infinity.

A.3 Derivation of reparameterization for h k (s)
Recall that h_k(s) = (1/k) Σ_{j=1}^k q(s|r^(j)_φ). We show that it can be easily reparameterized. Specifically, we can sample from such a mixture distribution through a two-stage procedure: (i) choose a component c ∈ {1, 2, · · · , k} from a uniform discrete distribution, represented as a k-dimensional one-hot vector c̃; (ii) draw a sample from the selected component, i.e., q(s|r^(c)_φ). Moreover, we define a matrix R_φ(x) ∈ R^{m×k} whose columns consist of r^(1)_φ, r^(2)_φ, · · · , r^(k)_φ, each of which can also be reparameterized. In this way, a sample s̃_φ from the distribution h_k(s) can be simply expressed as

s̃_φ = (sign(σ(R_φ(x) c̃) − u) + 1) / 2,

where u ∈ R^m with u_i ∼ Uniform(0, 1). This can be seen as selecting a sample r^(c)_φ and then passing it through a perturbed sigmoid function. Therefore, during training, the gradients w.r.t. φ are simply back-propagated through the chosen sample r^(c)_φ.
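The two-stage sampling procedure above can be sketched as follows (NumPy, with our own helper name):

```python
import numpy as np

def sample_hk(R, rng):
    """Reparameterized sample from h_k(s): pick a mixture component uniformly
    (as a one-hot vector c), then binarize sigmoid(R c) with uniform noise.

    R: (m, k) matrix whose columns are the reparameterized samples r^(i).
    """
    m, k = R.shape
    c = np.zeros(k)
    c[rng.integers(k)] = 1.0                  # one-hot component selector
    r = R @ c                                 # the selected column r^(c)
    u = rng.uniform(size=m)
    return (np.sign(1.0 / (1.0 + np.exp(-r)) - u) + 1) / 2
```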

A.4 Comparisons between RBSH and our method
As discussed before, the main reason that we cited (Hansen et al., 2019) but did not compare with it is that their datasets are preprocessed differently from ours. Therefore, it is inappropriate to directly include the performance of their model in the comparisons of our paper. Our work is a direct extension along the research line of VDSH and NASH. In our experiments, we followed their setups and used the preprocessed datasets that they made public. In (Hansen et al., 2019), however, the datasets are preprocessed by the authors themselves. The preprocessing procedure greatly influences the final performance, as observed in the reported results.
To see how our model performs compared to RBSH, we evaluate it on the 20Newsgroups and TMC datasets preprocessed by the method in (Hansen et al., 2019). The results are reported in Table 6, where RBSH is the model from (Hansen et al., 2019). Using the same preprocessed datasets, our model overall performs better than RBSH, especially for long codes. It should be emphasized that the correlation-introducing method proposed in this paper can be used with all existing VAE-based hashing models. In this paper, the base model is NASH, and when the two are used together, we see a significant performance improvement. Since RBSH is also a VAE-based hashing model, the proposed method can also be used with it to introduce correlations into the code bits, and significant improvements can be expected there as well.