Compressing Neural Language Models by Sparse Word Representations

Neural networks are among the state-of-the-art techniques for language modeling. Existing neural language models typically map discrete words to distributed, dense vector representations. After the preceding context words are processed by hidden layers, an output layer estimates the probability of the next word. Such approaches are time- and memory-intensive because of the large numbers of parameters for word embeddings and the output layer. In this paper, we propose to compress neural language models by sparse word representations. In our experiments, the number of parameters in the model grows only imperceptibly with the vocabulary size. Moreover, our approach not only reduces the parameter space to a large extent, but also improves the performance in terms of the perplexity measure.


Introduction
Language models (LMs) play an important role in a variety of applications in natural language processing (NLP), including speech recognition and document recognition. In recent years, neural network-based LMs have achieved significant breakthroughs: they can model language more precisely than traditional n-gram statistics (Mikolov et al., 2011); it is even possible to generate new sentences from a neural LM, benefiting various downstream tasks like machine translation, summarization, and dialogue systems (Devlin et al., 2014; Rush et al., 2015; Sordoni et al., 2015; Mou et al., 2015b).1

Existing neural LMs typically map a discrete word to a distributed, real-valued vector representation (called an embedding) and use a neural model to predict the probability of each word in a sentence. Such approaches necessitate a large number of parameters to represent the embeddings and the output layer's weights, which is unfavorable in many scenarios. First, with the wider application of neural networks in resource-restricted systems (Hinton et al., 2015), such approaches are too memory-consuming and may fail to be deployed on mobile phones or embedded systems. Second, as each word is assigned a dense vector, which is tuned by gradient-based methods, neural LMs are unlikely to learn meaningful representations for infrequent words. The reason is that the gradient of an infrequent word is only occasionally computed during training; thus its vector representation can hardly be tuned adequately.

1 Code released on https://github.com/chenych11/lm
In this paper, we propose a compressed neural language model where we can reduce the number of parameters to a large extent. To accomplish this, we first represent infrequent words' embeddings with frequent words' by sparse linear combinations. This is inspired by the observation that, in a dictionary, an unfamiliar word is typically defined by common words. We therefore propose an optimization objective to compute the sparse codes of infrequent words. The property of sparseness (only 4-8 values for each word) ensures the efficiency of our model.
Based on the pre-computed sparse codes, we design our compressed language model as follows. A dense embedding is assigned to each common word; an infrequent word, on the other hand, computes its vector representation as a sparse combination of common words' embeddings. We use a long short-term memory (LSTM)-based recurrent neural network (RNN) as the hidden layer of our model. The weights of the output layer are compressed in the same way as the embeddings. Consequently, the number of trainable neural parameters is a constant regardless of the vocabulary size if we ignore the per-word biases. Even counting the sparse codes (which are very small), we find that the memory consumption grows imperceptibly with the vocabulary.
We evaluate our LM on the Wikipedia corpus containing up to 1.6 billion words. During training, we adopt noise-contrastive estimation (NCE) (Gutmann and Hyvärinen, 2012) to estimate the parameters of our neural LMs. However, different from Mnih and Teh (2012), we tailor the NCE method by adding a regression layer (called ZRegression) to predict the normalization factor, which stabilizes the training process. Experimental results show that our compressed LM not only reduces the memory consumption, but also improves the performance in terms of the perplexity measure.
To sum up, the main contributions of this paper are three-fold. (1) We propose an approach to represent uncommon words' embeddings by a sparse linear combination of common ones'. (2) We propose a compressed neural language model based on the pre-computed sparse codes. The memory increases very slowly with the vocabulary size (4-8 values for each word). (3) We further introduce a ZRegression mechanism to stabilize the NCE algorithm, which is potentially applicable to other LMs in general.

Standard Neural LMs
Language modeling aims at estimating the joint probability of a corpus (Jurafsky and Martin, 2014). Traditional n-gram models impose a Markov assumption that a word depends only on the previous n − 1 words and is independent of its position. When estimating the parameters, researchers have proposed various smoothing techniques, including back-off models, to alleviate the problem of data sparsity. Bengio et al. (2003) propose to use a feedforward neural network (FFNN) to replace the multinomial parameter estimation in n-gram models. Recurrent neural networks (RNNs) can also be used for language modeling; they are especially capable of capturing long-range dependencies in sentences (Mikolov et al., 2010; Sundermeyer et al., 2015).

A neural LM can be viewed as composed of three main parts, namely the Embedding, Encoding, and Prediction subnets, as shown in Figure 1. The Embedding subnet maps a word to a dense vector, representing some abstract features of the word (Mikolov et al., 2013). Note that this subnet usually accepts a list of words (known as history or context words) and outputs a sequence of word embeddings.
The Encoding subnet encodes the history of a target word into a dense vector (known as context or history representation). We may either leverage FFNNs  or RNNs (Mikolov et al., 2010) as the Encoding subnet, but RNNs typically yield a better performance (Sundermeyer et al., 2015).
The Prediction subnet outputs a distribution of target words as

P(w_i | h) = exp(s(h, w_i)) / Σ_{j=1}^{V} exp(s(h, w_j))    (1)

s(h, w_i) = W_{:,i}^T h + b_i    (2)

where h ∈ R^C is the vector representation of the context/history h, obtained by the Encoding subnet; W ∈ R^{C×V} and b ∈ R^V are the output weights and biases, b_i being the bias (the prior) of word w_i. s(h, w_i) is a scoring function indicating the degree to which the context h matches a target word w_i. (V is the size of the vocabulary V; C is the dimension of the context/history vector, given by the Encoding subnet.)
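As a minimal sketch of the Prediction subnet (our own illustrative code; the variable names and shapes are assumptions, not the paper's implementation), the softmax of Equation (1) can be written as:

```python
import numpy as np

def predict_distribution(h, W, b):
    """Softmax distribution over all V target words.

    h: (C,) context vector from the Encoding subnet.
    W: (C, V) output weight matrix; b: (V,) biases (the word priors).
    The normalization over all V scores is what makes this layer costly.
    """
    s = W.T @ h + b        # scores s(h, w_i) for every target word
    s -= s.max()           # shift for numerical stability of exp
    e = np.exp(s)
    return e / e.sum()     # (V,) probabilities summing to 1
```

The sum over all V words in the denominator is exactly the normalization factor that later motivates NCE.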

Complexity Concerns of Neural LMs
Neural network-based LMs can capture more precise semantics of natural language than n-gram models, because the regularity of the Embedding subnet extracts meaningful semantics of a word and the high capacity of the Encoding subnet enables complicated information processing. Despite these advantages, neural LMs also suffer from several disadvantages, mainly out of complexity concerns.
Time complexity. Training neural LMs is typically time-consuming, especially when the vocabulary size is large. The normalization factor in Equation (1) contributes most to the time complexity. Morin and Bengio (2005) propose hierarchical softmax, using a Bayesian network so that the probability is self-normalized. Sampling techniques, for example importance sampling (Bengio and Senécal, 2003), noise-contrastive estimation (Gutmann and Hyvärinen, 2012), and target sampling (Jean et al., 2014), are applied to avoid computation over the entire vocabulary. Infrequent normalization maximizes the unnormalized likelihood with a penalty term that favors normalized predictions (Andreas and Klein, 2014).

Memory complexity and model complexity. The number of parameters in the Embedding and Prediction subnets of neural LMs increases linearly with the vocabulary size, which is large (Table 1). As said in Section 1, this is sometimes unfavorable in memory-restricted systems. Even with sufficient hardware resources, it is problematic because we are unlikely to fully tune these parameters. Chen et al. (2015) propose the differentiated softmax model, assigning fewer parameters to rare words than to frequent words. However, their approach only handles the output weights, i.e., W in Equation (2); the input embeddings remain uncompressed.
In this work, we mainly focus on memory and model complexity, i.e., we propose a novel method to compress the Embedding and Prediction subnets in neural language models.

Related Work
Existing work on model compression for neural networks. Buciluǎ et al. (2006) and Hinton et al. (2015) use a well-trained large network to guide the training of a small network for model compression. Jaderberg et al. (2014) compress neural models by matrix factorization; Gong et al. (2014), by quantization. In NLP, Mou et al. (2015a) learn an embedding subspace by supervised training. Our work bears little resemblance to the above methods, as we compress embeddings and output weights using sparse word representations. Existing model compression typically comes with a compromise in performance; on the contrary, our model improves the perplexity measure after compression.

Sparse word representations. We leverage sparse codes of words to compress neural LMs. Faruqui et al. (2015) propose a sparse coding method to represent each word with a sparse vector. They solve an optimization problem to obtain the sparse vectors of words as well as a dictionary matrix simultaneously. By contrast, we do not estimate any dictionary matrix when learning sparse codes, which results in a simple and easy-to-optimize model.

Our Proposed Model
In this section, we describe our compressed language model in detail. Subsection 3.1 formalizes the sparse representation of words, serving as the premise of our model. On such a basis, we compress the Embedding and Prediction subnets in Subsections 3.2 and 3.3, respectively. Finally, Subsection 3.4 introduces NCE for parameter estimation where we further propose the ZRegression mechanism to stabilize our model.

Sparse Representations of Words
We split the vocabulary V into two disjoint subsets (B and C). The first subset B is a base set, containing a fixed number of common words (8k in our experiments). C = V\B is a set of uncommon words. We would like to use B's word embeddings to encode C's.
Our intuition is that oftentimes a word can be defined by a few other words, and that rare words should be defined by common ones. Therefore, it is reasonable to use a few common words' embeddings to represent that of a rare word. Following most work in the literature (Lee et al., 2006; Yang et al., 2011), we represent each uncommon word with a sparse, linear combination of common ones' embeddings. The sparse coefficients are called a sparse code for a given word.
We first train a word representation model like SkipGram (Mikolov et al., 2013) to obtain a set of embeddings for each word in the vocabulary, including both common words and rare words. Suppose U ∈ R^{E×B} is the matrix of common words' embeddings, whose i-th column is the embedding of the i-th word in B (E being the dimension of an embedding). Each word in B has a natural sparse code (denoted as x): it is a one-hot vector with B elements, the i-th dimension being on for the i-th word in B.
For a word w ∈ C, we shall learn a sparse vector x = (x_1, x_2, ..., x_B) as the sparse code of the word. Provided that x has been learned (which will be introduced shortly), the embedding of w is

ŵ = U x    (3)

To learn the sparse representation of a certain word w, we propose the following optimization objective

min_x ||U x − w||_2^2 + α ||x||_1 + β |1^T x − 1| + γ 1^T max{0, −x}    (4)

where max denotes the component-wise maximum, and w is the (pretrained) embedding of the rare word w ∈ C.
The first term (called fitting loss afterwards) evaluates the closeness between a word's coded vector representation and its "true" representation w, which is the general goal of sparse coding.
The second term is an l1 regularizer, which encourages a sparse solution. The last two regularization terms favor a solution that sums to 1 and that is nonnegative, respectively. The nonnegative regularizer is applied as in He et al. (2012) due to psychological interpretation concerns.
It is difficult to determine the hyperparameters α, β, and γ. Therefore we perform several tricks. First, we drop the last term in the problem (4), but clip each element in x so that all the sparse codes are nonnegative during each update of training.
Second, we re-parametrize α and β by balancing the fitting loss and the regularization terms dynamically during training. Concretely, we solve the following optimization problem, which is slightly different from, but closely related to, the conceptual objective (4):

min_x L(x) + α_t R_1(x) + β_t R_2(x)    (5)

where L(x) = ||U x − w||_2^2, R_1(x) = ||x||_1, and R_2(x) = |1^T x − 1|. Here, α_t and β_t are adaptive parameters that are resolved at training time. Suppose x_t is the value we obtain after the update of the t-th step; we expect the relative importance of fitting and regularization to remain unchanged during training. This is equivalent to requiring α_t R_1(x_t) = w_α L(x_t) and β_t R_2(x_t) = w_β L(x_t), or

α_t = w_α · L(x_t) / R_1(x_t)    (6)
β_t = w_β · L(x_t) / R_2(x_t)    (7)

where w_α and w_β are the desired ratios between each regularization loss and the fitting loss. They are much easier to specify than α or β in the problem (4). We have two remarks as follows.
• To learn the sparse codes, we first train the "true" embeddings by word2vec for both common words and rare words. However, these true embeddings are used only at this stage and are discarded during our language modeling.
• As the codes are pre-computed and remain unchanged during language modeling, they are not tunable parameters of our neural model. Counting the learned sparse codes, we need only 4-8 values for each word on average, as the codes contain only 0.05-0.1% nonzero values, which is almost negligible.
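The learning procedure above can be sketched with plain projected gradient descent (a simplified, illustrative sketch: the paper optimizes with Adam, and our step size, initialization, and stopping rule are assumptions of our own):

```python
import numpy as np

def learn_sparse_code(U, w, w_alpha=1.0, w_beta=0.1, lr=0.005, steps=3000):
    """Learn a sparse, nonnegative code x such that U @ x ~= w.

    U: (E, B) matrix whose columns are the common words' embeddings.
    w: (E,) pretrained embedding of a rare word.
    The l1 and sum-to-one penalty weights are re-balanced at every step
    so that their contribution stays at a fixed ratio (w_alpha, w_beta)
    to the fitting loss, as in the adaptive re-parametrization above.
    """
    rng = np.random.default_rng(0)
    x = np.abs(rng.normal(scale=0.01, size=U.shape[1]))
    for _ in range(steps):
        r = U @ x - w                          # residual of the fit
        L = float(r @ r)                       # fitting loss ||Ux - w||^2
        R1 = float(np.abs(x).sum()) + 1e-8     # l1 regularizer
        R2 = abs(float(x.sum()) - 1.0) + 1e-8  # sum-to-one regularizer
        alpha_t = w_alpha * L / R1             # adaptive weights
        beta_t = w_beta * L / R2
        grad = 2.0 * (U.T @ r) + alpha_t * np.sign(x) \
            + beta_t * np.sign(x.sum() - 1.0)
        x = np.clip(x - lr * grad, 0.0, None)  # project onto x >= 0
    if x.max() > 0:                            # drop near-zero coefficients
        x[x < 0.015 * x.max()] = 0.0
    return x
```

Clipping to the nonnegative orthant after each update replaces the γ term of objective (4), as described in the first trick above.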

Parameter Compression for the Embedding Subnet
One main source of LM parameters is the Embedding subnet, which takes a list of words (history/context) as input, and outputs dense, lowdimensional vector representations of the words. We leverage the sparse representation of words mentioned above to construct a compressed Embedding subnet, where the number of parameters is independent of the vocabulary size.
By solving the optimization problem (5) for each word, we obtain a non-negative sparse code x ∈ R^B for each word, indicating the degree to which the word is related to the common words in B. Then the embedding of a word is given by ŵ = U x.
We would like to point out that the embedding ŵ of a word is not sparse, because U is a dense matrix, which serves as a shared parameter for learning all words' vector representations.
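As a storage-level illustration (a hypothetical layout of our own; the class and method names are not from the paper), the compressed Embedding subnet only materializes dense vectors for the B common words:

```python
import numpy as np

class CompressedEmbedding:
    """Embedding table where only the B common words store dense vectors.

    Rare words keep just a few (index, coefficient) pairs; their vectors
    are reconstructed on the fly as sparse combinations of common ones.
    """
    def __init__(self, U):
        self.U = U               # (E, B) dense embeddings of common words
        self.codes = {}          # word id -> (indices, coefficients)

    def add_rare_word(self, word_id, x):
        nz = np.nonzero(x)[0]
        self.codes[word_id] = (nz, x[nz])   # keep only the nonzeros

    def lookup(self, word_id):
        if word_id < self.U.shape[1]:       # common word: direct column
            return self.U[:, word_id]
        idx, coef = self.codes[word_id]     # rare word: w_hat = U x
        return self.U[:, idx] @ coef
```

Storing only the nonzero (index, coefficient) pairs is what keeps the per-word overhead at the 4-8 values mentioned above.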

Parameter Compression for the Prediction Subnet
Another main source of parameters is the Prediction subnet. As Table 1 shows, the output layer contains V target-word weight vectors and biases; this number increases with the vocabulary size. To compress this part of a neural LM, we propose a weight-sharing method that uses words' sparse representations again. Similar to the compression of word embeddings, we define a base set of weight vectors and use them to represent the remaining weights by sparse linear combinations. Without loss of generality, we let D = W_{:,1:B} be the output weights of the B base target words, and c = b_{1:B} be the biases of the B base target words. The goal is to use D and c to represent W and b. However, as the values of W and b are unknown before the training of the LM, we cannot obtain their sparse codes in advance.
We claim that it is reasonable to share the same set of sparse codes to represent word vectors in the Embedding subnet and the output weights in the Prediction subnet. In a given corpus, an occurrence of a word is always accompanied by its context; a word and its contexts thus reflect the same co-occurrence statistics. As both word embeddings and context vectors capture these co-occurrence statistics (Levy and Goldberg, 2014), we can expect context vectors to share the same internal structure as embeddings. Moreover, for a well-trained network, given any word w and its context h, the output layer's weight vector corresponding to w should specify a large inner-product score for the context h; thus these context vectors should approximate the weight vector of w. Therefore, word embeddings and the output weight vectors should share the same internal structures, and it is plausible to use the same set of sparse representations for both words and target-word weight vectors. As we shall show in Section 4, our treatment of compressing the Prediction subnet does make sense and achieves high performance.
Formally, the i-th output weight vector is estimated by

Ŵ_{:,i} = D x_i    (8)

[Figure 2: We apply NCE to estimate the parameters of the Prediction subnet (dashed round rectangle). The SpUnnrmProb layer outputs a sparse, unnormalized probability of the next word. By "sparsity," we mean that, in NCE, the probability is computed only for the "true" next word and a few generated negative samples.]
The biases can also be compressed as

b̂_i = c^T x_i    (9)

where x_i is the sparse representation of the i-th word. (It is shared in the compression of weights and biases.) With the above model, we have managed to compress a language model whose number of parameters is irrelevant to the vocabulary size.
To better estimate a "prior" distribution of words, we may alternatively assign an independent bias to each word, i.e., b is not compressed. In this variant, the number of model parameters grows very slowly and is also negligible because each word needs only one extra parameter. Experimental results show that by not compressing the bias vector, we can even improve the performance while compressing LMs.
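A convenient consequence of the shared codes is that all V output scores can be computed from only B base scores. The following sketch (with our own variable names; it assumes the variant with uncompressed biases) illustrates this:

```python
import numpy as np

def output_scores(h, D, X, b):
    """Scores s(h, w_i) for every word under the compressed output layer.

    D: (C, B) weight vectors of the B base target words (columns).
    X: (V, B) pre-computed sparse codes (row i is word i's code x_i).
    b: (V,) per-word biases, kept uncompressed in this variant.
    Since W_hat_i = D x_i, every score collapses onto B base scores.
    """
    base = D.T @ h          # (B,) base words' scores against context h
    return X @ base + b     # (V,) x_i . base + b_i == (D x_i) . h + b_i
```

Algebraically, X @ (D.T @ h) equals (X @ D.T) @ h, so the full weight matrix Ŵ never needs to be materialized.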

Noise-Contrastive Estimation with ZRegression
We adopt the noise-contrastive estimation (NCE) method to train our model. Compared with maximum likelihood estimation of the softmax, NCE reduces the computational complexity to a large degree. We further propose the ZRegression mechanism to stabilize training.

NCE generates a few negative samples for each positive data sample. During training, we only need to compute the unnormalized probabilities of these positive and negative samples. Interested readers are referred to Gutmann and Hyvärinen (2012) for more information.
Formally, the estimated probability of the word w_i with history/context h is

P(w_i | h; θ) = P_0(w_i | h; θ) / Z_h    (10)

where θ denotes the model parameters and Z_h is a context-dependent normalization factor. P_0(w_i | h; θ) is the unnormalized probability of w_i (given by the SpUnnrmProb layer in Figure 2). The NCE algorithm suggests taking Z_h as a parameter to optimize along with θ, but this is intractable for contexts of variable lengths or large sizes in language modeling. Following Mnih and Teh (2012), we set Z_h = 1 for all h in the base model (without ZRegression).
The objective for each occurrence of a context/history h is

J(θ; h) = log [ P_0(w|h; θ) / (P_0(w|h; θ) + k P_n(w)) ] + Σ_{i=1}^{k} log [ k P_n(w̃_i) / (P_0(w̃_i|h; θ) + k P_n(w̃_i)) ]    (11)

where w is the positive (true) next word, w̃_1, ..., w̃_k are the negative samples, P_n(w) is the probability of drawing a negative sample w, and k is the number of negative samples drawn for each positive sample. The overall objective of NCE is

J(θ) = (1/M) Σ_{i=1}^{M} J(θ; h_i)    (12)

where h_i is an occurrence of the context and M is the total number of context occurrences.

Although setting Z_h to 1 generally works well in our experiments, we find that in certain scenarios the model is unstable. Experiments show that when the true normalization factor is far from 1, the cost function may oscillate. To comply with NCE in general, we therefore propose a ZRegression layer that predicts the normalization constant Z_h from h, instead of treating it as a constant.
The regression layer is computed by

Z_h = exp(W_Z^T h + b_Z)    (13)

where W_Z ∈ R^C and b_Z ∈ R are the weights and bias for ZRegression. Hence, the estimated probability by NCE with ZRegression is given by

P(w_i | h; θ) = P_0(w_i | h; θ) / exp(W_Z^T h + b_Z)    (14)

Note that the ZRegression layer does not guarantee normalized probabilities. During validation and testing, we explicitly normalize the probabilities by Equation (1).
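The two ingredients above can be sketched as follows (an illustrative sketch; the exact parameterization of the normalizer and the function names are our own assumptions):

```python
import numpy as np

def unnormalized_prob(scores, h, Wz, bz):
    """P0(.|h) with a context-dependent normalizer from ZRegression.

    Z_h = exp(Wz . h + bz) is predicted from the context vector h
    rather than fixed to 1; Wz (C,) and bz (scalar) are trained
    jointly with the rest of the model.
    """
    Z_h = np.exp(Wz @ h + bz)
    return np.exp(scores) / Z_h

def nce_loss(p0_pos, p0_neg, pn_pos, pn_neg, k):
    """Negated per-occurrence NCE objective (a loss to minimize).

    p0_pos / p0_neg: unnormalized model probabilities for the positive
    word and the k negative samples; pn_pos / pn_neg: the corresponding
    noise probabilities under Pn.
    """
    pos = np.log(p0_pos / (p0_pos + k * pn_pos))
    neg = np.sum(np.log(k * pn_neg / (p0_neg + k * pn_neg)))
    return -(pos + neg)
```

Only the sampled words' unnormalized probabilities enter the loss, which is what avoids the full summation over the vocabulary.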

Evaluation
In this part, we first describe our dataset in Subsection 4.1. We evaluate our learned sparse codes of rare words in Subsection 4.2 and the compressed language model in Subsection 4.3. Subsection 4.4 provides in-depth analysis of the ZRegression mechanism.

Dataset
We used the freely available Wikipedia dump (2014) as our dataset. We extracted plain sentences from the dump and removed all markup. We further performed several steps of preprocessing such as text normalization, sentence splitting, and tokenization. Sentences were randomly shuffled, so that no information across sentences could be used, i.e., we did not consider cached language models. The resulting corpus contains about 1.6 billion running words.
The corpus was split into three parts for training, validation, and testing. As it is typically time-consuming to train neural networks, we sampled a subset of 100 million running words to train the neural LMs, but the full training set was used to train the backoff n-gram models. We chose hyperparameters on the validation set and reported model performance on the test set. Table 2 presents some statistics of our dataset.

Qualitative Analysis of Sparse Codes
To obtain words' sparse codes, we chose 8k common words as the "dictionary," i.e., B = 8000.

[Figure 3: The sparse representations of selected words. The x-axis is the dictionary of 8k common words; the y-axis is the coefficient of sparse coding. Note that algorithm, secret, and debate are common words, each being coded by itself with a coefficient of 1.]
We had 2k-42k uncommon words in different settings. We first pretrained word embeddings of both rare and common words, and obtained the 200d vectors U and w in Equation (5). The dimension was specified in advance and not tuned. As there is no analytic solution to the objective, we optimized it by Adam (Kingma and Ba, 2014), which is a gradient-based method. To filter out small coefficients around zero, we simply set a value to 0 if it was less than 0.015 · max(x). w_α in Equation (6) was set to 1 because we deemed the fitting loss and the sparsity penalty equally important. We set w_β in Equation (7) to 0.1; the model is insensitive to this hyperparameter. Figure 3 plots the sparse codes of a few selected words. As we can see, algorithm, secret, and debate are common words, and each is (sparsely) coded by itself with a coefficient of 1. We further notice that a rare word like algorithms has a sparse representation with only a few non-zero coefficients.
Moreover, the coefficient in the code of algorithms-corresponding to the base word algorithm-is large (∼ 0.6), showing that the words algorithm and algorithms are similar. Such phenomena are also observed with secret and debate.
The qualitative analysis demonstrates that our approach can indeed learn a sparse code of a word, and that the codes are meaningful.

Quantitative Analysis of Compressed Language Models
We then used the pre-computed sparse codes to compress neural LMs, which provides quantitative analysis of the learned sparse representations of words. We take perplexity as the performance measure of a language model, defined by

PPL = exp( −(1/N) Σ_{i=1}^{N} log P(w_i | h_i) )    (15)

where N is the number of running words in the test corpus.
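As a quick sanity check of the definition (our own helper, not from the paper), perplexity can be computed from per-word predicted probabilities as:

```python
import math

def perplexity(word_probs):
    """Corpus perplexity from per-word predicted probabilities.

    PPL = exp(-(1/N) * sum_i log P(w_i | h_i)); lower is better, and a
    uniform model over V words scores exactly V.
    """
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)
```

For instance, a model that assigns probability 0.1 to every word has perplexity 10, matching the intuition of a 10-way uniform choice.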

Settings
We leveraged an LSTM-RNN as the Encoding subnet, which is a prevailing class of neural networks for language modeling (Sundermeyer et al., 2015; Karpathy et al., 2015). The hidden layer was 200d. We used the Adam algorithm to train our neural models. The learning rate was chosen by validation from {0.001, 0.002, 0.004, 0.006, 0.008}. Parameters were updated with a mini-batch size of 256 words. We trained neural LMs by NCE, where we generated 50 negative samples for each positive data sample in the corpus. All our model variants and baselines were trained with the same pre-defined hyperparameters or tuned over the same candidate set; thus the comparison is fair. We list our compressed LMs and competing methods as follows.
• KN3. We adopted the modified Kneser-Ney smoothing technique to train a 3-gram LM; we used the SRILM toolkit (Stolcke and others, 2002) in our experiment.
• LBL5. The log-bilinear model introduced in Mnih and Hinton (2007). We used 5 preceding words as context.
• LSTM-s. A standard LSTM-RNN language model, as applied in Sundermeyer et al. (2015) and Karpathy et al. (2015). We implemented the LM ourselves based on Theano (Theano Development Team, 2016) and also used NCE for training.
• LSTM-z. An LSTM-RNN enhanced with the ZRegression mechanism described in Subsection 3.4.
• LSTM-z,wb. Based on LSTM-z, we compressed word embeddings in the Embedding subnet as well as the output weights and biases in the Prediction subnet.
• LSTM-z,w. In this variant, we did not compress the bias term in the output layer: each word in C was assigned an independent bias parameter.

Performance
Table 3 shows the perplexity of our compressed models and the baselines. As we can see, LSTM-based LMs significantly outperform the log-bilinear model as well as the backoff 3-gram LM, even though the 3-gram LM is trained on a much larger corpus of 1.6 billion words. The ZRegression mechanism improves the performance of LSTM to a large extent, which is unexpected. Subsection 4.4 provides more in-depth analysis.
Regarding the compression method proposed in this paper, we notice that LSTM-z,wb and LSTM-z,w yield similar performance to LSTM-z. In particular, LSTM-z,w outperforms LSTM-z in all scenarios of different vocabulary sizes. Moreover, both LSTM-z,wb and LSTM-z,w can reduce the memory consumption by up to 80% (Table 4).
We further plot in Figure 4 the model performance (lines) and memory consumption (bars) at a finer granularity of vocabulary sizes. We observe a tendency that the compressed LMs (LSTM-z,wb and LSTM-z,w, yellow and red lines) are generally better than LSTM-z (black line) when the vocabulary is small. However, LSTM-z,wb becomes slightly worse than LSTM-z if the vocabulary size is greater than, say, 20k. LSTM-z,w remains comparable to LSTM-z as the vocabulary grows.
To explain this phenomenon, we may imagine that the compression using sparse codes has two effects: it loses information, but it also enables more accurate estimation of parameters especially for rare words. When the second factor dominates, we can reasonably expect a high performance of the compressed LM.
From the bars in Figure 4, we observe that traditional LMs have a parameter space growing linearly with the vocabulary size. By contrast, the number of parameters in our compressed models does not increase with the vocabulary, or, strictly speaking, increases at an extremely small rate.
These experiments show that our method can largely reduce the parameter space with even performance improvement. The results also verify that the sparse codes induced by our model indeed capture meaningful semantics and are potentially useful for other downstream tasks.

Effect of ZRegression
We next analyze the effect of ZRegression on NCE training. As shown in Figure 5a, the training process becomes unstable after processing 70% of the dataset: the training loss oscillates significantly, whereas the test loss increases.
We find a strong correlation between the instability and the Z_h factor in Equation (10), i.e., the sum of unnormalized probabilities (Figure 5b). Theoretical analysis shows that the Z_h factor tends to be self-normalized even though it is not forced to be (Gutmann and Hyvärinen, 2012). However, problems occur when this self-normalization fails.
In traditional methods, NCE jointly estimates the normalization factor Z and the model parameters (Gutmann and Hyvärinen, 2012). For language modeling, Z_h depends on the context h. Mnih and Teh (2012) propose to estimate a separate Z_h based on two history words (analogous to a 3-gram model), but their approach hardly scales to RNNs because of the exponential number of different combinations of history words.
We propose the ZRegression mechanism in Subsection 3.4, which can estimate the Z_h factor well (Figure 5d) based on the history vector h. In this way, we manage to stabilize the training process (Figure 5c) and improve the performance, as shown in Table 3.
It should be mentioned that ZRegression is not specific to model compression and is generally applicable to other neural LMs trained by NCE.

Conclusion
In this paper, we proposed an approach to represent rare words by sparse linear combinations of common ones. Based on such combinations, we managed to compress an LSTM language model (LM), whose memory does not increase with the vocabulary size except for a bias and a sparse code for each word. Our experimental results also show that the compressed LM yields better performance than the uncompressed base LM.