NASH: Toward End-to-End Neural Architecture for Generative Semantic Hashing

Semantic hashing has become a powerful paradigm for fast similarity search in many information retrieval systems. While fairly successful, previous techniques generally require two-stage training, and the binary constraints are handled ad hoc. In this paper, we present an end-to-end Neural Architecture for Semantic Hashing (NASH), where the binary hashing codes are treated as Bernoulli latent variables. A neural variational inference framework is proposed for training, where gradients are directly backpropagated through the discrete latent variable to optimize the hash function. We also draw connections between the proposed method and rate-distortion theory, which provides a theoretical foundation for the effectiveness of our framework. Experimental results on three public datasets demonstrate that our method significantly outperforms several state-of-the-art models in both unsupervised and supervised scenarios.


Introduction
The problem of similarity search, also called nearest-neighbor search, consists of finding the documents in a large collection, or corpus, that are most similar to a query document of interest. Fast and accurate similarity search is at the core of many information retrieval applications, such as plagiarism analysis (Stein et al., 2007), collaborative filtering (Koren, 2008), content-based multimedia retrieval (Lew et al., 2006) and caching (Pandey et al., 2009). Semantic hashing is an effective approach for fast similarity search (Salakhutdinov and Hinton, 2009; Zhang et al., 2010; Wang et al., 2014). By representing every document in the corpus as a similarity-preserving discrete (binary) hashing code, the similarity between two documents can be evaluated by simply calculating pairwise Hamming distances between hashing codes, i.e., the number of bits that differ between two codes. Given that an ordinary PC today can execute millions of Hamming distance computations in just a few milliseconds (Zhang et al., 2010), this semantic hashing strategy is very computationally attractive.
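As a small illustration of why Hamming-distance search is cheap, the following sketch stores each hash code as a Python integer and counts differing bits with an XOR. The function names and the toy 8-bit codes are hypothetical and chosen only for this example; they are not part of the proposed method.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of bit positions at which two hash codes differ."""
    # XOR leaves a 1 exactly where the codes disagree; count those ones.
    return bin(code_a ^ code_b).count("1")


def nearest_codes(query: int, corpus: list[int], k: int) -> list[int]:
    """Indices of the k corpus codes closest to the query in Hamming distance."""
    order = sorted(range(len(corpus)), key=lambda i: hamming_distance(query, corpus[i]))
    return order[:k]


# Toy 8-bit codes (illustrative values only).
corpus = [0b10110010, 0b10110011, 0b01001101, 0b11110010]
query = 0b10110000
print([hamming_distance(query, c) for c in corpus])
print(nearest_codes(query, corpus, k=2))
```

Ties are broken by corpus order here because Python's sort is stable; a real system would likely use packed bit arrays and hardware popcount instead.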
While considerable research has been devoted to text (semantic) hashing, existing approaches typically require two-stage training procedures. These methods can be generally divided into two categories: (i) binary codes for documents are first learned in an unsupervised manner, then l binary classifiers are trained via supervised learning to predict the l-bit hashing code (Zhang et al., 2010; Xu et al., 2015); (ii) continuous text representations are first inferred, which are binarized as a second (separate) step during testing (Wang et al., 2013; Chaidaroon and Fang, 2017). Because the model parameters are not learned in an end-to-end manner, these two-stage training strategies may result in suboptimal local optima. This happens because different modules within the model are optimized separately, preventing the sharing of information between them. Further, in existing methods, binary constraints are typically handled ad hoc by truncation, i.e., the hashing codes are obtained via direct binarization from continuous representations after training. As a result, the information contained in the continuous representations is lost during the (separate) binarization process. Moreover, training different modules (mapping and classifier/binarization) separately often requires additional hyperparameter tuning for each training stage, which can be laborious and time-consuming.
In this paper, we propose a simple and generic neural architecture for text hashing that learns binary latent codes for documents in an end-to-end manner. Inspired by recent advances in neural variational inference (NVI) for text processing (Miao et al., 2016; Yang et al., 2017; Shen et al., 2017b), we approach semantic hashing from a generative model perspective, where binary (hashing) codes are represented as either deterministic or stochastic Bernoulli latent variables. The inference (encoder) and generative (decoder) networks are optimized jointly by maximizing a variational lower bound to the marginal distribution of input documents (corpus). By leveraging a simple and effective method to estimate the gradients with respect to discrete (binary) variables, the loss term from the generative (decoder) network can be directly backpropagated into the inference (encoder) network to optimize the hash function.
Motivated by rate-distortion theory (Berger, 1971; Theis et al., 2017), we propose to inject data-dependent noise into the latent codes during the decoding stage, which adaptively accounts for the tradeoff between minimizing rate (number of bits used, or effective code length) and distortion (reconstruction error) during training. The connection between the proposed method and rate-distortion theory is further elucidated, providing a theoretical foundation for the effectiveness of our framework.
Summarizing, the contributions of this paper are: (i) to the best of our knowledge, we present the first semantic hashing architecture that can be trained in an end-to-end manner; (ii) we propose a neural variational inference framework to learn compact (regularized) binary codes for documents, achieving promising results on both unsupervised and supervised text hashing; (iii) the connection between our method and rate-distortion theory is established, from which we demonstrate the advantage of injecting data-dependent noise into the latent variable during training.

Related Work
Models with discrete random variables have attracted much attention in the deep learning community (Jang et al., 2016; Maddison et al., 2016; van den Oord et al., 2017; Li et al., 2017; Shu and Nakayama, 2017). Some of these structures are more natural choices for language or speech data, which are inherently discrete. More specifically, van den Oord et al. (2017) combined VAEs with vector quantization to learn discrete latent representations, and demonstrated the utility of these learned representations on images, videos, and speech data. Li et al. (2017) leveraged both pairwise label and classification information to learn discrete hash codes, which exhibit state-of-the-art performance on image retrieval tasks.
Figure 1: NASH for end-to-end semantic hashing. The inference network maps $x \to z$ using an MLP and the generative network recovers $x$ as $z \to \hat{x}$.
For natural language processing (NLP), although significant research effort has been devoted to learning continuous deep representations for words or documents (Mikolov et al., 2013; Kiros et al., 2015), discrete neural representations have been mainly explored in learning word embeddings (Shu and Nakayama, 2017; Chen et al., 2017). In these recent works, words are represented as vectors of discrete numbers, which are very efficient storage-wise, while showing comparable performance on several NLP tasks relative to continuous word embeddings. However, discrete representations that are learned in an end-to-end manner at the sentence or document level have rarely been explored. There is also a lack of rigorous evaluation of their effectiveness. Our work focuses on learning discrete (binary) representations for text documents. Further, we employ semantic hashing (fast similarity search) as a mechanism to evaluate the quality of learned binary latent codes.

Hashing under the NVI Framework
Inspired by the recent success of variational autoencoders for various NLP problems (Miao et al., 2016; Bowman et al., 2015; Yang et al., 2017; Miao et al., 2017; Shen et al., 2017b), we approach the training of discrete (binary) latent variables from a generative perspective. Let $x$ and $z$ denote the input document and its corresponding binary hash code, respectively. Most of the previous text hashing methods focus on modeling the encoding distribution $p(z|x)$, or hash function, so the local/global pairwise similarity structure of documents in the original space is preserved in latent space (Zhang et al., 2010; Wang et al., 2013; Xu et al., 2015; Wang et al., 2014). However, the generative (decoding) process of reconstructing $x$ from the binary latent code $z$, i.e., modeling the distribution $p(x|z)$, has rarely been considered. Intuitively, latent codes learned from a model that accounts for the generative term should naturally encapsulate key semantic information from $x$ because the generation/reconstruction objective is a function of $p(x|z)$. In this regard, the generative term provides a natural training objective for semantic hashing.
We define a generative model that simultaneously accounts for both the encoding distribution, $p(z|x)$, and decoding distribution, $p(x|z)$, by defining approximations $q_\phi(z|x)$ and $q_\theta(x|z)$, via inference and generative networks, $g_\phi(x)$ and $g_\theta(z)$, parameterized by $\phi$ and $\theta$, respectively.
Specifically, $x \in \mathbb{Z}_+^{|V|}$ is the bag-of-words (count) representation for the input document, where $|V|$ is the vocabulary size. Notably, we can also employ other count weighting schemes as input features $x$, e.g., the term frequency-inverse document frequency (TFIDF) (Manning et al., 2008). For the encoding distribution, a latent variable $z$ is first inferred from the input text $x$, by constructing an inference network $g_\phi(x)$ to approximate the true posterior distribution $p(z|x)$ as $q_\phi(z|x)$. Subsequently, the decoder network $g_\theta(z)$ maps $z$ back into input space to reconstruct the original sequence $x$ as $\hat{x}$, approximating $p(x|z)$ as $q_\theta(x|z)$ (as shown in Figure 1). This cyclic strategy, $x \to z \to \hat{x} \approx x$, provides the latent variable $z$ with a better ability to generalize (Miao et al., 2016).
To tailor the NVI framework for semantic hashing, we cast $z$ as a binary latent variable and assume a multivariate Bernoulli prior on $z$:
$$p(z) \sim \mathrm{Bernoulli}(\gamma) = \prod_{i=1}^{l} \gamma_i^{z_i} (1 - \gamma_i)^{1 - z_i},$$
where $\gamma_i \in [0, 1]$ is the i-th component of $\gamma$. Thus, the encoding (approximate posterior) distribution $q_\phi(z|x)$ is restricted to take the form $q_\phi(z|x) = \mathrm{Bernoulli}(h)$, where $h = \sigma(g_\phi(x))$, $\sigma(\cdot)$ is the sigmoid function, and $g_\phi(\cdot)$ is the (nonlinear) inference network specified as a multilayer perceptron (MLP). As illustrated in Figure 1, we can obtain samples from the Bernoulli posterior either deterministically or stochastically. Suppose $z$ is an l-bit hash code. For the deterministic binarization, we have, for $i = 1, 2, \ldots, l$:
$$z_i = \mathbb{1}_{\sigma(g_\phi^i(x)) > 0.5} = \frac{\mathrm{sign}\big(\sigma(g_\phi^i(x)) - 0.5\big) + 1}{2}, \tag{1}$$
where $z$ is the binarized variable, and $z_i$ and $g_\phi^i(x)$ denote the i-th dimension of $z$ and $g_\phi(x)$, respectively. The binarization in (1) can be understood as setting a hard threshold at 0.5 for each representation dimension; the binary latent code is therefore generated deterministically. Another strategy to obtain the discrete variable is to binarize $h$ in a stochastic manner:
$$z_i = \mathbb{1}_{\sigma(g_\phi^i(x)) > \mu_i} = \frac{\mathrm{sign}\big(\sigma(g_\phi^i(x)) - \mu_i\big) + 1}{2}, \tag{2}$$
where $\mu_i \sim \mathrm{Uniform}(0, 1)$. Because of this sampling process, we do not have to assume a predefined threshold value as in (1).
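The two binarization strategies referred to as (1) and (2) can be sketched as follows. This is a minimal illustration, assuming the encoder outputs (pre-sigmoid activations) are already available as a list of floats; the function names are hypothetical.

```python
import math
import random


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def binarize_deterministic(logits: list[float]) -> list[int]:
    """Hard threshold at 0.5 on h_i = sigmoid(logit_i), as in (1)."""
    return [1 if sigmoid(a) > 0.5 else 0 for a in logits]


def binarize_stochastic(logits: list[float], rng: random.Random) -> list[int]:
    """Compare h_i with mu_i ~ Uniform(0, 1), so that P(z_i = 1) = h_i, as in (2)."""
    return [1 if sigmoid(a) > rng.random() else 0 for a in logits]


logits = [2.0, -1.0, 0.3, -0.2]      # illustrative encoder outputs
rng = random.Random(0)               # seeded only to make the sketch repeatable
print(binarize_deterministic(logits))
print(binarize_stochastic(logits, rng))
```

The stochastic variant draws a fresh threshold per bit, so bits whose activation is close to 0.5 flip often while confident bits rarely change.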

Training with Binary Latent Variables
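To estimate the parameters of the encoder and decoder networks, we would ideally maximize the marginal likelihood $p(x) = \int p(z)\, p(x|z)\, dz$. However, computing this marginal is intractable in most cases of interest. Instead, we maximize a variational lower bound, an approach typically employed in the VAE framework (Kingma and Welling, 2013):
$$\mathcal{L}_{vae} = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{q_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}\big[\log q_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big), \tag{3}$$
where the Kullback-Leibler (KL) divergence $D_{KL}(q_\phi(z|x)\,\|\,p(z))$ encourages the approximate posterior distribution $q_\phi(z|x)$ to be close to the multivariate Bernoulli prior $p(z)$. In this case, $D_{KL}(q_\phi(z|x)\,\|\,p(z))$ can be written in closed form as a function of $g_\phi(x)$. With $h_i = \sigma(g_\phi^i(x))$ and prior parameter $\gamma_i$ for the i-th bit:
$$D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big) = \sum_{i=1}^{l} \left[ h_i \log \frac{h_i}{\gamma_i} + (1 - h_i) \log \frac{1 - h_i}{1 - \gamma_i} \right]. \tag{4}$$
Note that the gradient for the KL divergence term above can be evaluated easily.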
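The closed-form KL term between a factorized Bernoulli posterior and a Bernoulli prior can be computed directly from the posterior probabilities. The sketch below assumes a single prior parameter `gamma` shared by all bits, with a default of 0.5 chosen only for illustration, and clamps probabilities with a small `eps` to avoid taking the logarithm of zero; both choices are assumptions of this example.

```python
import math


def bernoulli_kl(h: list[float], gamma: float = 0.5, eps: float = 1e-12) -> float:
    """KL( Bernoulli(h) || Bernoulli(gamma) ), summed over the code bits."""
    total = 0.0
    for h_i in h:
        h_i = min(max(h_i, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += h_i * math.log(h_i / gamma)
        total += (1.0 - h_i) * math.log((1.0 - h_i) / (1.0 - gamma))
    return total


# A posterior equal to the prior costs nothing; confident bits cost more.
print(bernoulli_kl([0.5, 0.5]))
print(bernoulli_kl([0.9, 0.1]))
```

Because the expression is a smooth function of the posterior probabilities, its gradient with respect to the encoder parameters needs no special estimator.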
For the first term in (3), we should in principle estimate the influence of $\mu_i$ in (2) on $q_\theta(x|z)$ by averaging over the entire (uniform) noise distribution. However, a closed-form expression does not exist, since it is not possible to enumerate all possible configurations of $z$, especially when the latent dimension is large. Moreover, discrete latent variables are inherently incompatible with backpropagation, since the derivative of the sign function is zero for almost all input values. As a result, the exact gradients of $\mathcal{L}_{vae}$ with respect to the inputs before binarization would be essentially all zero.
To estimate the gradients for binary latent variables, we utilize the straight-through (ST) estimator, which was first introduced by Hinton (2012). So motivated, the strategy here is to simply backpropagate through the hard threshold by approximating the gradient through the threshold, $\partial z / \partial \sigma(g_\phi(x))$, as 1. Thus, we have:
$$\nabla_\phi\, \mathbb{E}_{q_\phi(z|x)}\big[\log q_\theta(x|z)\big] = \frac{\partial\, \mathbb{E}_{q_\phi(z|x)}[\log q_\theta(x|z)]}{\partial z} \frac{\partial z}{\partial \sigma(g_\phi(x))} \frac{\partial \sigma(g_\phi(x))}{\partial \phi} \approx \frac{\partial\, \mathbb{E}_{q_\phi(z|x)}[\log q_\theta(x|z)]}{\partial z} \frac{\partial \sigma(g_\phi(x))}{\partial \phi}. \tag{5}$$
Although this is clearly a biased estimator, it has been shown to be a fast and efficient method relative to other gradient estimators for discrete variables, especially for the Bernoulli case (Bengio et al., 2013; Hubara et al., 2016; Theis et al., 2017). With the ST gradient estimator, the first loss term in (3) can be backpropagated into the encoder network to fine-tune the hash function $g_\phi(x)$. For the approximate generator $q_\theta(x|z)$ in (3), let $x_i$ denote the one-hot representation of the i-th word within a document. Note that $x = \sum_i x_i$ is thus the bag-of-words representation for document $x$. To reconstruct the input $x$ from $z$, we utilize a softmax decoding function written as:
$$q(x_i = w \mid z) = \frac{\exp(z^\top E x_w + b_w)}{\sum_{j=1}^{|V|} \exp(z^\top E x_j + b_j)}, \tag{6}$$
where $q(x_i = w|z)$ is the probability that $x_i$ is word $w \in V$, $x_w$ is the one-hot vector of word $w$, $q_\theta(x|z) = \prod_i q(x_i = w|z)$ and $\theta = \{E, b_1, \ldots, b_{|V|}\}$. Note that $E \in \mathbb{R}^{d \times |V|}$ can be interpreted as a word embedding matrix to be learned, and $\{b_i\}_{i=1}^{|V|}$ denote bias terms. Intuitively, the objective in (6) encourages the discrete vector $z$ to be close to the embeddings of every word that appears in the input document $x$. As shown in Section 5.3.1, meaningful semantic structures can be learned and manifested in the word embedding matrix $E$.
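The interaction between the softmax decoder and the straight-through estimator can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: the matrix `E` is stored as d rows of |V| entries, `counts` is a bag-of-words vector, deterministic thresholding is used in the forward pass, and all function names are hypothetical.

```python
import math


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def decoder_log_probs(z, E, b):
    """log q(x_i = w | z) for every word w, with E given as d rows of |V| entries."""
    scores = [sum(z[k] * E[k][w] for k in range(len(z))) + b[w] for w in range(len(b))]
    top = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def log_likelihood(z, counts, E, b):
    """log q(x | z) for a bag-of-words document with word counts `counts`."""
    return sum(c * lp for c, lp in zip(counts, decoder_log_probs(z, E, b)))


def grad_wrt_code(z, counts, E, b):
    """Exact gradient of log q(x | z) with respect to a (relaxed) code z."""
    probs = [math.exp(lp) for lp in decoder_log_probs(z, E, b)]
    n_words = sum(counts)
    return [sum(E[k][w] * (counts[w] - n_words * probs[w]) for w in range(len(b)))
            for k in range(len(z))]


def straight_through_grad(logits, counts, E, b):
    """Gradient w.r.t. encoder logits, treating the hard threshold as identity."""
    h = [sigmoid(a) for a in logits]
    z = [1.0 if h_k > 0.5 else 0.0 for h_k in h]   # forward pass: hard threshold
    grad_z = grad_wrt_code(z, counts, E, b)
    # Backward pass: pretend dz/dh = 1, then apply dh/da = h * (1 - h).
    return [g * h_k * (1.0 - h_k) for g, h_k in zip(grad_z, h)]


# Toy sizes: d = 2 code bits, |V| = 3 words (illustrative values only).
E = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
b = [0.0, 0.1, -0.1]
counts = [2, 0, 1]
print(straight_through_grad([2.0, -1.0], counts, E, b))
```

The forward pass uses the discrete code, while the backward pass reuses the sigmoid derivative, which is the source of the bias mentioned in the text.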

Injecting Data-dependent Noise to z
To reconstruct text data $x$ from the sampled binary representation $z$, a deterministic decoder is typically utilized (Miao et al., 2016; Chaidaroon and Fang, 2017). Inspired by the success of employing stochastic decoders in image hashing applications (Dai et al., 2017; Theis et al., 2017), in our experiments, we found that injecting random Gaussian noise into $z$ makes the decoder a more favorable regularizer for the binary codes, which in practice leads to stronger retrieval performance. Below, we invoke rate-distortion theory to perform some further analysis, which leads to interesting findings.
Learning binary latent codes $z$ to represent a continuous distribution $p(x)$ is a classical information theory concept known as lossy source coding. From this perspective, semantic hashing, which compresses an input document into compact binary codes, can be cast as a conventional rate-distortion tradeoff problem (Theis et al., 2017; Ballé et al., 2016):
$$\min\ \underbrace{-\log_2 R(z)}_{\text{Rate}} + \beta \cdot \underbrace{D(x, \hat{x})}_{\text{Distortion}},$$
where rate and distortion denote the effective code length, i.e., the number of bits used, and the distortion introduced by the encoding/decoding sequence, respectively. Further, $\hat{x}$ is the reconstructed input and $\beta$ is a hyperparameter that controls the tradeoff between the two terms.
Consider the case where we have a Bernoulli prior on $z$, $p(z) \sim \mathrm{Bernoulli}(\gamma)$, and $x$ is conditionally drawn from a Gaussian distribution $p(x|z) \sim \mathcal{N}(Ez, \sigma^2 I)$. Here, $E = \{e_i\}_{i=1}^{|V|}$, where $e_i \in \mathbb{R}^d$, can be interpreted as a codebook with $|V|$ codewords. In our case, $E$ corresponds to the word embedding matrix as in (6).
For the case of a stochastic latent variable $z$, the objective function in (3) can be written in a form similar to the rate-distortion tradeoff:
$$\min\ \mathbb{E}_{q_\phi(z|x)}\Big[ \log q_\phi(z|x) + \underbrace{\tfrac{1}{2\sigma^2}}_{\beta}\, \underbrace{\lVert x - Ez \rVert_2^2}_{\text{Distortion}} \Big] + C,$$
where $C$ is a constant that encapsulates the prior distribution $p(z)$ and the Gaussian distribution normalization term. Notably, the trade-off hyperparameter $\beta = \sigma^{-2}/2$ is closely related to the variance of the distribution $p(x|z)$. In other words, by controlling the variance $\sigma^2$, the model can adaptively explore different trade-offs between the rate and distortion objectives. However, the optimal trade-offs for distinct samples may be different.
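To make the origin of $\beta = \sigma^{-2}/2$ explicit, the negative log-density of the Gaussian decoder can be expanded as below. Here $n$ denotes the dimension of $x$, a symbol introduced only for this worked step.

```latex
\begin{aligned}
-\log \mathcal{N}\!\left(x;\, Ez,\, \sigma^2 I\right)
  &= \frac{1}{2\sigma^2}\,\lVert x - Ez \rVert_2^2
   + \frac{n}{2}\,\log\!\left(2\pi\sigma^2\right).
\end{aligned}
```

The first term is the squared reconstruction error scaled by $1/(2\sigma^2)$, which plays the role of $\beta$ times the distortion; the second term does not depend on $z$ and, for a fixed $\sigma$, is the normalization constant absorbed into $C$.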
Inspired by the observations above, we propose to inject data-dependent noise into the latent variable $z$, rather than setting the variance term $\sigma^2$ to a fixed value (Dai et al., 2017; Theis et al., 2017). Specifically, $\log \sigma^2$ is obtained via a one-layer MLP transformation from $g_\phi(x)$. Afterwards, we sample $z'$ from $\mathcal{N}(z, \sigma^2 I)$, which then replaces $z$ in (6) to infer the probability of generating individual words (as shown in Figure 1). As a result, the variances are different for every input document $x$, and thus the model is provided with additional flexibility to explore various trade-offs between rate and distortion for different training observations. Although our decoder, as in (6), is not strictly Gaussian, we found empirically that injecting data-dependent noise into $z$ yields strong retrieval results (see Section 5.1).
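The noise-injection step can be sketched as follows. The example assumes a scalar variance per document, obtained from a single linear layer on the encoder features; the weight values, feature values, and function names are hypothetical and serve only to illustrate the sampling step.

```python
import math
import random


def log_variance(features: list[float], weights: list[float], bias: float) -> float:
    """One linear layer mapping encoder features to a scalar log sigma^2."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def inject_noise(z: list[float], log_var: float, rng: random.Random) -> list[float]:
    """Sample a noisy code from N(z, sigma^2 I) with sigma^2 = exp(log_var)."""
    sigma = math.exp(0.5 * log_var)
    return [z_k + sigma * rng.gauss(0.0, 1.0) for z_k in z]


rng = random.Random(0)                       # seeded only for repeatability
features = [0.4, -1.2, 0.7]                  # illustrative encoder outputs
z = [1.0, 0.0, 1.0]                          # binary code from the encoder
log_var = log_variance(features, weights=[0.1, 0.3, -0.2], bias=-1.0)
print(inject_noise(z, log_var, rng))         # noisy code passed to the decoder
```

Predicting the log-variance rather than the variance keeps $\sigma^2$ positive without an explicit constraint.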

Supervised Hashing
The proposed Neural Architecture for Semantic Hashing (NASH) can be extended to supervised hashing, where a mapping from latent variable $z$ to labels $y$ is learned, here parametrized by a two-layer MLP followed by a fully-connected softmax layer. To allow the model to explore and balance between maximizing the variational lower bound in (3) and minimizing the discriminative loss, the following joint training objective is employed:
$$\mathcal{L} = -\mathcal{L}_{vae}(\theta, \phi; x) + \alpha\, \mathcal{L}_{dis}(\eta; z, y),$$
where $\eta$ refers to the parameters of the MLP classifier and $\alpha$ controls the relative weight between the variational lower bound ($\mathcal{L}_{vae}$) and the discriminative loss ($\mathcal{L}_{dis}$), defined as the cross-entropy loss. The parameters $\{\theta, \phi, \eta\}$ are learned end-to-end via Monte Carlo estimation.
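A minimal sketch of how the two terms might be combined for one document is shown below. It assumes the quantity being minimized is the negative lower bound plus `alpha` times a single-label cross-entropy, and that the lower bound and classifier logits have already been computed elsewhere; the function names and numbers are hypothetical.

```python
import math


def cross_entropy(class_logits: list[float], label: int) -> float:
    """Negative log-probability of the true label under a softmax."""
    top = max(class_logits)
    log_norm = top + math.log(sum(math.exp(s - top) for s in class_logits))
    return log_norm - class_logits[label]


def joint_loss(elbo: float, class_logits: list[float], label: int, alpha: float) -> float:
    """Loss to minimize: negative lower bound plus weighted discriminative loss."""
    return -elbo + alpha * cross_entropy(class_logits, label)


# Illustrative numbers only: a lower bound of -3.0 and a two-class classifier.
print(joint_loss(elbo=-3.0, class_logits=[0.2, 1.5], label=1, alpha=0.5))
```

Setting `alpha` to zero recovers the unsupervised objective, while larger values push the codes toward label separability.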

Datasets
We use the following three standard publicly available datasets for training and evaluation: (i) Reuters21578, containing 10,788 news documents, which have been classified into 90 different categories. (ii) 20Newsgroups, a collection of 18,828 newsgroup documents, which are categorized into 20 different topics. (iii) TMC (which stands for SIAM text mining competition), containing air traffic reports provided by NASA. TMC consists of 21,519 training documents divided into 22 different categories. To make a direct comparison with prior work, we employed the TFIDF features on these datasets supplied by Chaidaroon and Fang (2017), where the vocabulary sizes for the three datasets are set to 10,000, 7,164 and 20,000, respectively.

Training Details
For the inference network, we employ a feedforward neural network with 2 hidden layers (both with 500 units) and ReLU activations, which transforms the input documents, i.e., TFIDF features in our experiments, into a continuous representation. Empirically, we found that stochastic binarization as in (2) shows stronger performance than deterministic binarization, and thus use the former in our experiments. However, we further conduct a systematic ablation study in Section 5.2 to compare the two binarization strategies.
Our model is trained using Adam (Kingma and Ba, 2014), with a learning rate of $1 \times 10^{-3}$ for all parameters. We decay the learning rate by a factor of 0.96 every 10,000 iterations. Dropout (Srivastava et al., 2014) is employed on the output of the encoder networks, with the rate selected from {0.7, 0.8, 0.9} on the validation set. To facilitate comparisons with previous methods, we set the dimension of $z$ (i.e., the number of bits within the hashing code) to 8, 16, 32, 64, or 128.
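The learning-rate schedule described above can be written as a one-line function. The sketch assumes a staircase decay, i.e., the rate changes only at multiples of 10,000 iterations; the text does not say whether the decay is stepwise or continuous, so this is an assumption, and the function name is hypothetical.

```python
def learning_rate(step: int, base_lr: float = 1e-3,
                  decay: float = 0.96, every: int = 10_000) -> float:
    """Staircase exponential decay: multiply by `decay` once per `every` steps."""
    return base_lr * decay ** (step // every)


for step in (0, 9_999, 10_000, 25_000):
    print(step, learning_rate(step))
```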

Evaluation Metrics
To evaluate the hashing codes for similarity search, we consider each document in the testing set as a query document. Documents similar to the query are retrieved from the corresponding training set based on the Hamming distance between their hashing codes, i.e., the number of differing bits. To facilitate comparison with prior work (Wang et al., 2013; Chaidaroon and Fang, 2017), the performance is measured with precision. Specifically, during testing, for a query document, we first retrieve the 100 nearest documents according to the Hamming distances of the corresponding hash codes. We then examine the percentage of documents among these 100 retrieved ones that share the same label (topic) as the query document (we consider documents having the same label as relevant pairs). The ratio of the number of relevant documents to the number of retrieved documents (fixed at 100) is calculated as the precision score. The precision scores are further averaged over all test (query) documents.
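The evaluation protocol can be sketched as a precision-at-k computation over integer-encoded hash codes. The sketch assumes one label per document and breaks distance ties by training-set order; both are assumptions of this example, and the toy data uses k = 2 instead of 100 to stay small.

```python
def precision_at_k(query_code: int, query_label: str,
                   train_codes: list[int], train_labels: list[str], k: int = 100) -> float:
    """Fraction of the k Hamming-nearest training documents sharing the query label."""
    ranked = sorted(range(len(train_codes)),
                    key=lambda i: bin(query_code ^ train_codes[i]).count("1"))
    relevant = sum(1 for i in ranked[:k] if train_labels[i] == query_label)
    return relevant / k


def mean_precision(queries: list[tuple[int, str]],
                   train_codes: list[int], train_labels: list[str], k: int = 100) -> float:
    """Average precision-at-k over all (code, label) query pairs."""
    scores = [precision_at_k(c, y, train_codes, train_labels, k) for c, y in queries]
    return sum(scores) / len(scores)


# Toy 4-bit example with k = 2 (illustrative values only).
train_codes = [0b0000, 0b0001, 0b1111, 0b0011]
train_labels = ["a", "a", "b", "b"]
queries = [(0b0000, "a"), (0b1111, "a")]
print(mean_precision(queries, train_codes, train_labels, k=2))
```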

Experimental Results
We experimented with four variants of our NASH model: (i) NASH: with deterministic decoder; (ii) NASH-N: with fixed random noise injected into the decoder; (iii) NASH-DN: with data-dependent noise injected into the decoder; (iv) NASH-DN-S: NASH-DN with supervised information during training. Table 1 presents the results of all models on the Reuters dataset. For unsupervised semantic hashing, all the NASH variants consistently outperform the baseline methods by a substantial margin, indicating that our model makes more effective use of unlabeled data and manages to assign similar hashing codes, i.e., codes with small Hamming distance to each other, to documents that belong to the same label. We also observe that injecting noise into the decoder networks improves the robustness of the learned binary representations, resulting in better retrieval performance. More importantly, by making the variance of the noise adaptive to the specific input, NASH-DN achieves even better results than NASH-N, highlighting the importance of learning the trade-off between the rate and distortion objectives from the data itself. We observe the same trend, and the superiority of our NASH-DN models, on the other two benchmarks, as shown in Tables 3 and 4.
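The difference between the three unsupervised variants can be illustrated with a small sketch. It assumes the injected noise is additive, zero-mean Gaussian applied to the binary code before decoding, with a constant standard deviation for NASH-N and a per-document one for NASH-DN. The noise form, the values of sigma and all names are assumptions made for illustration; in particular, the per-document values below are hard-coded where a learned function of the input would be used.

```python
import random

def inject_noise(z, sigma, rng):
    """Add zero-mean Gaussian noise to each bit (training time only)."""
    return [bit + s * rng.gauss(0.0, 1.0) for bit, s in zip(z, sigma)]

rng = random.Random(0)
z = [1, 0, 1, 1]  # a 4-bit hashing code

# NASH: the decoder sees the code unchanged.
z_nash = [float(bit) for bit in z]

# NASH-N: the same fixed noise level for every document (value assumed).
z_n = inject_noise(z, [0.5] * len(z), rng)

# NASH-DN: noise level depends on the input document; these numbers
# stand in for the output of a hypothetical learned function of x.
sigma_x = [0.1, 0.9, 0.3, 0.2]
z_dn = inject_noise(z, sigma_x, rng)
```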

Semantic Hashing Evaluation
Another observation is that the retrieval results tend to drop a bit when we set the length of hashing codes to be 64 or larger, which also happens for some baseline models. This phenomenon has been reported previously in Liu et al. (2012); Wang et al. (2013); Chaidaroon and Fang (2017), and the reasons could be twofold: (i) for longer codes, the number of data points that are assigned to a certain binary code decreases exponentially. As a result, many queries may fail to return any neighbor documents; (ii) considering the size of training data, it is likely that the model may overfit with long hash codes (Chaidaroon and Fang, 2017). However, even with longer hashing codes, our NASH models perform stronger than the baselines in most cases (except for the 20Newsgroups dataset), suggesting that NASH can effectively allocate documents to informative/meaningful hashing codes even with limited training data.

Table 2: The five nearest words in the semantic space learned by NASH, compared with the results from NVDM (Miao et al., 2016).

NASH
gun      treatment  company    definition  israeli   books
guns     disease    market     defined     arabs     english
weapon   drugs      afford     explained   arab      references
armed    health     products   discussion  jewish    learning
assault  medicine   money      knowledge   jews      reference

NVDM
guns     medicine   expensive  defined     israeli   books
weapon   health     industry   definition  arab      reference
gun      treatment  company    printf      arabs     guide
militia  disease    market     int         lebanon   writing
armed    patients   buy        sufficient  lebanese  pages
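Reason (i) for the drop with long codes can be made concrete with a little arithmetic. The sketch below assumes documents are spread evenly over all possible codes, which is an idealization, and uses a hypothetical corpus size of 10,000 documents.

```python
def mean_docs_per_code(num_docs, num_bits):
    """Average documents per code if spread evenly over all 2**bits codes."""
    return num_docs / 2 ** num_bits

NUM_DOCS = 10_000  # hypothetical corpus size, not taken from the paper
occupancy = {bits: mean_docs_per_code(NUM_DOCS, bits)
             for bits in (8, 16, 32, 64, 128)}
```

Under these assumptions an 8-bit code has about 39 documents per bucket, while from 16 bits onward the average falls below one document per code, so exact or near-exact matches become rare.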
We also evaluate the effectiveness of NASH in a supervised scenario on the Reuters dataset, where the label or topic information is utilized during training. As shown in Figure 2, our NASH-DN-S model consistently outperforms several supervised semantic hashing baselines across various numbers of hashing bits. Notably, our model exhibits higher Top-100 retrieval precision than VDSH-S and VDSH-SP, proposed by Chaidaroon and Fang (2017). This may be attributed to the fact that in VDSH models, the continuous embeddings are not optimized with their future binarization in mind, which could hurt the relevance of the learned binary codes. In contrast, our model is optimized in an end-to-end manner, where the gradients are directly backpropagated to the inference network (through the binary/discrete latent variable), and thus gives rise to a more robust hash function.

The effect of stochastic sampling
As described in Section 3, the binary latent variables z in NASH can be either deterministically (via (1)) or stochastically (via (2)) sampled. We compare these two types of binarization functions in the case of unsupervised hashing. As illustrated in Figure 3, stochastic sampling shows stronger retrieval results on all three datasets, indicating that endowing the sampling process of latent variables with more stochasticity improves the learned representations.
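The two binarization strategies can be contrasted in a short sketch. It assumes that (1) thresholds each per-bit probability at 0.5 and that (2) draws each bit as an independent Bernoulli sample; both readings, and all names below, are assumptions for illustration. The gradient estimator used during training is not shown.

```python
import random

def binarize_deterministic(probs, threshold=0.5):
    """Deterministic rule: a bit is 1 when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

def binarize_stochastic(probs, rng):
    """Stochastic rule: each bit is an independent Bernoulli(p) draw."""
    # Comparing p with a uniform sample in [0, 1) yields 1 with probability p.
    return [1 if p > rng.random() else 0 for p in probs]

probs = [0.9, 0.2, 0.55, 0.45]
rng = random.Random(0)
z_det = binarize_deterministic(probs)    # identical on every call
z_sto = binarize_stochastic(probs, rng)  # varies with the random draws
```

Bits whose probability is close to 0.5 are the ones that differ most often between the two rules, since the stochastic rule flips them nearly half of the time.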

The effect of encoder/decoder networks
Under the variational framework introduced here, the encoder network, i.e., hash function, and decoder network are jointly optimized to abstract semantic features from documents. An interesting question concerns what types of network should be leveraged for each part of our NASH model. In this regard, we further investigate the effect of using an encoder or decoder with different degrees of non-linearity, ranging from a linear transformation to two-layer MLPs. We employ a base model with an encoder of two-layer MLPs and a linear decoder (the setup described in Section 3), and the ablation study results are shown in Table 6. It is observed that for the encoder networks, increasing the non-linearity by stacking MLP layers leads to better empirical results. In other words, endowing the hash function with more modeling capacity is advantageous to retrieval tasks. However, when we employ a non-linear network for the decoder, the retrieval precision drops dramatically. It is worth noting that the only difference between a linear transformation and a one-layer MLP is whether a non-linear activation function is employed or not.
This observation may be attributed to the fact that the decoder networks can be considered as a similarity measure between the latent variable z and the word embeddings E_k for every word, and the probabilities of words that are present in the document are maximized to ensure that z is informative. As a result, if we allow the decoder to be too expressive (e.g., a one-layer MLP), it is likely that we will end up with a very flexible similarity measure but relatively less meaningful binary representations. This finding is consistent with several image hashing methods, such as SGH (Dai et al., 2017) or binary autoencoder (Carreira-Perpinán and Raziperchikolaei, 2015), where a linear decoder is typically adopted to obtain promising retrieval results. However, our experiments may not speak for other choices of encoder-decoder architectures, e.g., LSTM-based sequence-to-sequence models or DCNN-based autoencoder (Zhang et al., 2017).
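The view of a linear decoder as a similarity measure can be sketched as follows. The sketch assumes the decoder scores each vocabulary word by the inner product between z and that word's embedding plus a bias, and normalizes the scores with a softmax; this form, the per-word layout of the embeddings and all names are assumptions for illustration.

```python
import math

def word_log_probs(z, embeddings, bias):
    """Log-softmax over inner products between z and each word embedding."""
    logits = [sum(zi * ei for zi, ei in zip(z, e_k)) + b
              for e_k, b in zip(embeddings, bias)]
    m = max(logits)  # subtract the maximum for numerical stability
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_norm for l in logits]

def doc_log_likelihood(z, embeddings, bias, word_ids):
    """Sum of log-probabilities of the words present in the document."""
    log_probs = word_log_probs(z, embeddings, bias)
    return sum(log_probs[w] for w in word_ids)

# Toy example: 3-bit code, vocabulary of 3 words (one embedding per word).
z = [1, 0, 1]
embeddings = [[2.0, 0.0, 2.0], [0.0, 3.0, 0.0], [-1.0, 0.0, -1.0]]
bias = [0.0, 0.0, 0.0]
log_probs = word_log_probs(z, embeddings, bias)
```

Because the score is a plain inner product, a word can only be made likely by moving its embedding toward the codes of documents that contain it, which keeps the information in z rather than in the decoder.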

Analysis of Semantic Information
To understand what information has been learned in our NASH model, we examine the matrix E ∈ R^{d×l} in (6). Similar to (Miao et al., 2016; Larochelle and Lauly, 2012), we select the 5 nearest words according to the word vectors learned from NASH and compare with the corresponding results from NVDM.
As shown in Table 2, although our NASH model contains a binary latent variable, rather than a continuous one as in NVDM, it also effectively groups semantically similar words together in the learned vector space. This further demonstrates that the proposed generative framework manages to bypass the binary/discrete constraint and is able to abstract useful semantic information from documents.
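A nearest-word lookup of this kind can be sketched as follows. The sketch assumes cosine similarity as the closeness measure, which the text does not state, and uses made-up two-dimensional vectors in place of the learned columns of E; the names and values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(query, embeddings, k=5):
    """Return the k words whose vectors are most similar to the query's."""
    q = embeddings[query]
    others = [w for w in embeddings if w != query]
    others.sort(key=lambda w: cosine(q, embeddings[w]), reverse=True)
    return others[:k]

# Made-up toy vectors, not values learned by NASH or NVDM.
toy = {
    "gun": [1.0, 0.1],
    "guns": [1.0, 0.0],
    "weapon": [0.8, 0.4],
    "book": [0.0, 1.0],
    "pages": [0.1, 0.9],
}
neighbors = nearest_words("gun", toy, k=2)
```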

Case Study
In Table 5, we show some examples of the learned binary hashing codes on the 20Newsgroups dataset. We observe that for both the 8-bit and 16-bit cases, NASH typically compresses documents with shared topics into very similar binary codes. In contrast, the hashing codes for documents with different topics exhibit much larger Hamming distances. As a result, relevant documents can be efficiently retrieved by simply computing their Hamming distances.

Conclusions
This paper presents a first step towards end-to-end semantic hashing, where the binary/discrete constraints are carefully handled with an effective gradient estimator. A neural variational framework is introduced to train our model. Motivated by the connections between the proposed method and rate-distortion theory, we inject data-dependent noise into the Bernoulli latent variable at the training stage. The effectiveness of our framework is demonstrated with extensive experiments.