Scaling Hidden Markov Language Models

The hidden Markov model (HMM) is a fundamental tool for sequence modeling that cleanly separates the hidden state from the emission structure. However, this separation makes it difficult to fit HMMs to large datasets in modern NLP, and they have fallen out of use due to very poor performance compared to fully observed models. This work revisits the challenge of scaling HMMs to language modeling datasets, taking ideas from recent approaches to neural modeling. We propose methods for scaling HMMs to massive state spaces while maintaining efficient exact inference, a compact parameterization, and effective regularization. Experiments show that this approach leads to models that are more accurate than previous HMM and n-gram-based methods, making progress towards the performance of state-of-the-art neural models.


Introduction
Hidden Markov models (HMMs) are a fundamental latent-variable model for sequential data, with a rich history in NLP. They have been used extensively in tasks such as tagging (Merialdo, 1994), alignment (Vogel et al., 1996), and even, in a few cases, language modeling (Kuhn et al., 1994; Huang, 2011). Compared to other sequence models, HMMs are appealing since they fully separate the process of generating hidden states from observations, while allowing for exact posterior inference.
State-of-the-art systems in NLP have moved away from utilizing latent hidden states and toward deterministic deep neural models. We take several lessons from the success of neural models for NLP tasks: (a) model size is critical for accuracy, e.g. large LSTMs (Zaremba et al., 2014) show marked improvements in performance; (b) the right parameterization is critically important for representation learning, e.g. a feedforward model (Bengio et al., 2003) can have the same distributional assumptions as an n-gram model while performing significantly better; (c) dropout is key to achieving strong performance (Zaremba et al., 2014; Merity et al., 2017).
Code available at github.com/harvardnlp/hmm-lm
We revisit HMMs for language modeling as an alternative to modern neural models, while considering key empirical lessons from these approaches. Towards that goal, we introduce three techniques: a modeling constraint that allows us to use a large number of hidden states while maintaining efficient exact inference, a neural parameterization that improves generalization while remaining faithful to the probabilistic structure of the HMM, and a variant of dropout that both improves accuracy and halves the computational overhead during training.
Experiments employ HMMs on two language modeling datasets. Our approach allows us to train an HMM with tens of thousands of states while maintaining efficiency and significantly outperforming past HMMs as well as n-gram models.

Related Work
In order to improve the performance of HMMs on language modeling, several recent papers have combined HMMs with neural networks. Buys et al. (2018) develop an approach to relax HMMs, but their models either perform poorly or alter the probabilistic structure to resemble an RNN. Krakovna and Doshi-Velez (2016) utilize model combination with an RNN to connect both approaches in a small state-space model. Our method instead focuses on scaling pure HMMs to a large number of states.
Prior work has also considered neural parameterizations of HMMs. Tran et al. (2016) demonstrate improvements in POS induction with a neural parameterization of an HMM. They consider small state spaces, as the goal is tag induction rather than language modeling. Most similar to this work are the large HMM models of Dedieu et al. (2019). They introduce a sparsity constraint in order to train a 30K-state non-neural HMM for character-level language modeling; however, their constraint precludes application to large vocabularies. We overcome this limitation and train models with neural parameterizations on word-level language modeling.
Finally, another approach for scaling state spaces is to grow from small to big via a split-merge process (Petrov et al., 2006; Huang, 2011). In particular, Huang (2011) learn an HMM for language modeling via this process. As fixed-size state spaces are amenable to batching on modern hardware, we leave split-merge procedures for future work.

Background: HMMs
We are interested in learning a distribution over observed tokens $x = x_1, \ldots, x_T$, with each token $x_t$ an element of the finite vocabulary $\mathcal{X}$. Hidden Markov models (HMMs) specify a joint distribution over observed tokens $x$ and discrete latent states $z = z_1, \ldots, z_T$, with each $z_t$ from the finite set $\mathcal{Z}$. For notational convenience, we define a fixed starting state $z_0$. This yields the joint distribution,
$$p(x, z; \theta) = \prod_{t=1}^{T} p(x_t \mid z_t)\, p(z_t \mid z_{t-1}).$$
We refer to the transition and emission matrices as the distributional parameters of the HMM. Specifically, let $A \in [0, 1]^{|\mathcal{Z}| \times |\mathcal{Z}|}$ be the transition probabilities and $O \in [0, 1]^{|\mathcal{Z}| \times |\mathcal{X}|}$ the emission probabilities,
$$A_{z_{t-1} z_t} = p(z_t \mid z_{t-1}), \qquad O_{z_t x_t} = p(x_t \mid z_t).$$
We distinguish between two types of model parameterizations: scalar and neural, where the model parameters are given by $\theta$. A scalar parameterization sets the model parameters equal to the distributional parameters, so that $\theta = \{A, O\}$, resulting in $O(|\mathcal{Z}|^2 + |\mathcal{Z}||\mathcal{X}|)$ model parameters. A neural parameterization instead generates the distributional parameters from a neural network (with parameters $\theta$), decoupling the size of $\theta$ from $A, O$. This decoupling gives us the ability to choose between compact or overparameterized $\theta$ (relative to $A, O$). As we scale to large state spaces, we take advantage of compact neural parameterizations.
In order to fit an HMM to data $x$, we must marginalize over the latent states to obtain the likelihood $p(x) = \sum_z p(x, z)$. This sum can be computed in time $O(T|\mathcal{Z}|^2)$ via the forward algorithm, which becomes prohibitive if the number of latent states $|\mathcal{Z}|$ is large. We can then optimize the likelihood with gradient ascent (or alternative variants of expectation maximization).
HMMs and RNNs. Although the forward algorithm resembles the forward pass of a recurrent neural network (RNN) (Buys et al., 2018), there are key representational differences. RNNs do not decouple the latent dynamics from the observations. This often leads to improved accuracy, but precludes posterior inference, which is useful for interpretability. A further benefit of HMMs over RNNs is that their associative structure allows for parallel inference via the prefix-sum algorithm (Ladner and Fischer, 1980). Finally, HMMs bottleneck information from every timestep through a discrete hidden state. NLP has a long history of utilizing discrete representations; for example, recent work has found that discrete latent variables work well in low-resource regimes (Jin et al., 2020).
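As an illustration of the marginalization described above, the following is a minimal NumPy sketch of the forward algorithm in log space, checked against brute-force enumeration on a toy model. The function names, the explicit start distribution `log_pi` (standing in for $p(z_1 \mid z_0)$), and the random parameters are assumptions made for this example and are not taken from the released code.

```python
import itertools
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def forward_log_likelihood(log_pi, log_A, log_O, x):
    """Return log p(x) for an HMM via the forward algorithm.

    log_pi: (Z,)   log p(z_1 | z_0), the start distribution
    log_A:  (Z, Z) log_A[i, j] = log p(z_t = j | z_{t-1} = i)
    log_O:  (Z, X) log_O[i, w] = log p(x_t = w | z_t = i)
    x:      sequence of token ids
    """
    alpha = log_pi + log_O[:, x[0]]  # log p(x_1, z_1)
    for token in x[1:]:
        # Summing over the previous state costs O(Z^2) per timestep.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_O[:, token]
    return logsumexp(alpha, axis=0)


def brute_force_log_likelihood(log_pi, log_A, log_O, x):
    """Enumerate all Z^T state sequences; only feasible for tiny models."""
    num_states = log_pi.shape[0]
    total = 0.0
    for z in itertools.product(range(num_states), repeat=len(x)):
        lp = log_pi[z[0]] + log_O[z[0], x[0]]
        for t in range(1, len(x)):
            lp += log_A[z[t - 1], z[t]] + log_O[z[t], x[t]]
        total += np.exp(lp)
    return np.log(total)


def random_hmm(num_states, vocab_size, seed=0):
    """Draw random row-stochastic parameters (toy values, not from the paper)."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(num_states))
    A = rng.dirichlet(np.ones(num_states), size=num_states)
    O = rng.dirichlet(np.ones(vocab_size), size=num_states)
    return np.log(pi), np.log(A), np.log(O)
```

The recursion performs one $|\mathcal{Z}| \times |\mathcal{Z}|$ reduction per token, which is the source of the $O(T|\mathcal{Z}|^2)$ cost.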

Scaling HMMs
We propose three extensions to scale HMMs for better language modeling performance: blocked emissions, which allow for very large models; neural parameterization, which makes it easy for states to share model parameters; and state dropout, which encourages broader state usage.
Blocked Emissions. Our main goal is to apply an HMM with a large number of hidden states to learn the underlying dynamics of language data. However, the $O(T|\mathcal{Z}|^2)$ complexity of marginal inference practically limits the number of HMM states. We can get around this limit by making an assumption on the HMM emission matrix $O$. As noted by Dedieu et al. (2019), restricting the number of states that can produce each word can improve inference complexity. We utilize a slightly stronger assumption on the model: (a) states are partitioned into $M$ equal-sized groups, and the states within a group all emit the same subset of words, and (b) each word is emitted by only one group of $k = |\mathcal{Z}|/M$ states, which we denote $\mathcal{Z}_x \subset \mathcal{Z}$.
Figure 1: The emission matrix as a set of blocks $O_1, \ldots, O_4$ with a fixed number of states $k$. The number of columns in each block may vary, as there is no constraint on the number of words a state can emit. Each non-zero cell is constructed from an MLP applied to word ($E_x$) and state ($E_z$) embeddings.
We implement this group structure through a set of blocked emissions, each corresponding to one of the $M$ state groups,
$$O = \begin{bmatrix} O_1 & & 0 \\ & \ddots & \\ 0 & & O_M \end{bmatrix}.$$
Figure 1 shows these emission blocks. Each block matrix $O_m$ gives the probabilities for emitting tokens $\mathcal{X}_m$ for states in group $m$, i.e. states $(m - 1)k + 1$ through $mk$.
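To make the block structure concrete, the sketch below builds a toy block-structured emission matrix and runs the forward recursion over only the $k$ states that can emit each observed word, comparing the result with a dense forward pass. It is a sketch under simplifying assumptions: every group is given the same number of words, probabilities are kept in the linear domain (fine only for short toy sequences), and all function names are hypothetical.

```python
import numpy as np


def make_blocked_emissions(num_groups, k, words_per_group, seed=0):
    """Build a block-structured emission matrix.

    Group m owns states m*k .. (m+1)*k - 1 (0-indexed) and emits only the
    words assigned to it. Returns O (|Z| x |X|) and word_group (|X|,).
    """
    rng = np.random.default_rng(seed)
    num_states = num_groups * k
    vocab_size = num_groups * words_per_group
    word_group = np.repeat(np.arange(num_groups), words_per_group)
    O = np.zeros((num_states, vocab_size))
    for m in range(num_groups):
        words = np.flatnonzero(word_group == m)
        O[m * k:(m + 1) * k, words] = rng.dirichlet(np.ones(len(words)), size=k)
    return O, word_group


def dense_forward(pi, A, O, x):
    """Standard forward algorithm over all |Z| states: O(T |Z|^2)."""
    alpha = pi * O[:, x[0]]
    for token in x[1:]:
        alpha = (alpha @ A) * O[:, token]
    return alpha.sum()


def blocked_forward(pi, A, O, word_group, k, x):
    """Forward algorithm restricted to the k states in Z_{x_t}: O(T k^2)."""
    def states(token):
        m = int(word_group[token])
        return slice(m * k, (m + 1) * k)

    prev = states(x[0])
    alpha = pi[prev] * O[prev, x[0]]  # shape (k,)
    for token in x[1:]:
        cur = states(token)
        # Only the k x k sub-block of A between the two groups is needed.
        alpha = (alpha @ A[prev, cur]) * O[cur, token]
        prev = cur
    return alpha.sum()
```

Because every state outside the emitting group has zero emission probability for the observed word, dropping those states from the recursion leaves $p(x)$ unchanged while shrinking each step from $|\mathcal{Z}|^2$ to $k^2$ operations.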
With this constraint, exact marginalization can be computed via
$$p(x) = \sum_{z_1 \in \mathcal{Z}_{x_1}} p(z_1 \mid z_0)\, p(x_1 \mid z_1) \cdots \sum_{z_T \in \mathcal{Z}_{x_T}} p(z_T \mid z_{T-1})\, p(x_T \mid z_T).$$
Since there are only $k$ states with nonzero probability of occurring at every timestep, we only need to consider transitioning from the $|\mathcal{Z}_{x_t}| = k$ previous states to the next $|\mathcal{Z}_{x_{t+1}}| = k$ states, resulting in $O(k^2)$ operations per timestep. This gives a serial complexity of $O(Tk^2)$.
Algorithm 1 HMM Training (a single batch)
Given: block structure and model parameters
Compute grad wrt parameters of $\log p(x)$
Update model parameters $E_z$, $E_x$ and MLP
Neural Parameterization. A larger state space allows for longer HMM memory, but it also may require more parameters. Even with blocked emissions, the scalar model parameterization of an HMM grows as $O(|\mathcal{Z}|^2)$ due to the transition matrix. A neural parameterization allows us to share parameters between words and states to capture common structure.
Our parameterization uses an embedding for each state in $\mathcal{Z}$ ($E_z \in \mathbb{R}^{|\mathcal{Z}| \times h}$) and each token in $\mathcal{X}$ ($E_x \in \mathbb{R}^{|\mathcal{X}| \times h}$). From these we create representations for leaving and entering a state, as well as for emitting a word. The MLP architecture follows Kim et al. (2019), with details in the appendix. This factorized parameterization, shown in Figure 1, reduces the total parameters to $O(h^2 + h|\mathcal{Z}| + h|\mathcal{X}|)$.
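For a rough sense of scale, the arithmetic below compares the two parameter counts using the state count ($2^{15}$) and hidden dimension (256) reported in the experimental setup. The vocabulary size of 10,000 is an assumed value chosen for illustration, and the neural count keeps only the leading terms of the $O(\cdot)$ expression, so the ratio is indicative rather than exact.

```python
def scalar_param_count(num_states, vocab_size):
    """Entries of A and O when they are parameterized directly."""
    return num_states ** 2 + num_states * vocab_size


def neural_param_order(num_states, vocab_size, hidden):
    """Leading terms h^2 + h|Z| + h|X| (constant factors omitted)."""
    return hidden ** 2 + hidden * num_states + hidden * vocab_size


NUM_STATES = 2 ** 15   # reported in the experimental setup
HIDDEN = 256           # reported in the experimental setup
VOCAB_SIZE = 10_000    # assumed vocabulary size, for illustration only

scalar = scalar_param_count(NUM_STATES, VOCAB_SIZE)
neural = neural_param_order(NUM_STATES, VOCAB_SIZE, HIDDEN)
print(f"scalar parameterization: {scalar:,}")
print(f"neural parameterization (leading terms): {neural:,}")
print(f"ratio: {scalar / neural:.0f}x")
```

The quadratic $|\mathcal{Z}|^2$ term dominates the scalar count, which is why the gap widens as the state space grows.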
Note that parameter computation is independent of inference and can be cached completely as the emission and transition matrices, $A$ and $O$, at test time. For the training algorithm, shown in Algorithm 1, we compute $A$ and $O$ once per batch, whereas RNNs and similar models recompute emissions for every token.
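The separation between parameter computation and inference can be sketched as follows. This is not the paper's architecture: the residual network, the dot-product scoring, the initialization, and every name below are placeholder assumptions, and the block constraint on emissions is omitted for brevity. The point is only that $A$ and $O$ are produced by one function call on $\theta$ and can then be reused for every token.

```python
import numpy as np


def softmax_rows(logits):
    """Normalize each row of a matrix into a probability distribution."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def residual_mlp(E, W1, W2):
    """Placeholder residual network; NOT the architecture used in the paper."""
    return E + np.maximum(E @ W1, 0.0) @ W2


def init_params(num_states, vocab_size, hidden, seed=0):
    """Random toy parameters theta: two embedding tables and three small MLPs."""
    rng = np.random.default_rng(seed)

    def weights():
        return (0.1 * rng.standard_normal((hidden, hidden)),
                0.1 * rng.standard_normal((hidden, hidden)))

    return {
        "E_z": rng.standard_normal((num_states, hidden)),
        "E_x": rng.standard_normal((vocab_size, hidden)),
        "out": weights(),
        "in": weights(),
        "emit": weights(),
    }


def distributional_params(theta):
    """Map theta to A and O. Called once per batch, or once at test time."""
    H_out = residual_mlp(theta["E_z"], *theta["out"])    # leaving a state
    H_in = residual_mlp(theta["E_z"], *theta["in"])      # entering a state
    H_emit = residual_mlp(theta["E_z"], *theta["emit"])  # emitting a word
    A = softmax_rows(H_out @ H_in.T)            # A[i, j] = p(z_t = j | z_{t-1} = i)
    O = softmax_rows(H_emit @ theta["E_x"].T)   # O[i, w] = p(x_t = w | z_t = i)
    return A, O


theta = init_params(num_states=8, vocab_size=20, hidden=4)
A, O = distributional_params(theta)  # cache and reuse for every token in the batch
```

At test time the same call would be made once, after which inference only reads the cached matrices.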
Dropout as State Reduction. Finally, to encourage full use of the large state space, we introduce dropout that prevents the model from favoring specific states. We propose a form of HMM state dropout that removes states from use entirely at each batch, which also has the added benefit of speeding up inference.
Figure 2: The computation of $p(x)$ is greatly reduced by blocked emissions and state dropout. In the trellis, each row corresponds to a latent state and each column after the first to a timestep. Each edge between nodes corresponds to a nonzero transition probability. Blocked emissions result in a small subset of all states emitting a given word, as shown by the rectangles. State dropout (leftmost column) allows us to further reduce the number of states we consider, halving the number of (white) states that have nonzero probability in each rectangle. In experiments, the number of possible transitions may be as large as $2^{30}$ while the maximum number of non-zero transitions is $2^{16}$.
State dropout acts on each emission block $O_1, \ldots, O_M$ independently. For each block, we sample a binary dropout mask by sampling $\lambda k$ dropped row indices uniformly without replacement, where $\lambda$ is the dropout rate. We concatenate these into a global vector $b \in \{0, 1\}^{|\mathcal{Z}|}$, which, along with the previous constraints, ensures that a state can emit $x_t$ only if it belongs to $\mathcal{Z}_{x_t}$ and is not dropped by $b$. An example of the HMM lattice after state dropout is shown in Figure 2. In addition to accuracy improvements, state dropout gives a large practical speedup for both parameter computation and inference. For $\lambda = 0.5$ we get a 4× speed improvement for both, due to the reduction in possible transitions. This structured dropout is also easy to exploit on GPU, as it maintains block structure.
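The block-wise mask and the resulting transition counts can be sketched as below. Sampling uses the Gumbel top-k trick with uniform logits, which is one way to draw indices without replacement; the function name and interface are assumptions for this example, while $M = 128$, $k = 256$, and $\lambda = 0.5$ are the values reported in the experimental setup.

```python
import numpy as np


def sample_state_dropout_mask(num_groups, k, rate, rng):
    """Return b in {0,1}^{|Z|} with round(rate * k) dropped states per block.

    Gumbel top-k: perturb (here uniform) logits with Gumbel noise and take
    the largest entries, which samples indices without replacement.
    """
    num_drop = int(round(rate * k))
    logits = np.zeros((num_groups, k))  # uniform over states in a block
    noisy = logits + rng.gumbel(size=(num_groups, k))
    dropped = np.argsort(-noisy, axis=1)[:, :num_drop]
    mask = np.ones((num_groups, k), dtype=np.int64)
    np.put_along_axis(mask, dropped, 0, axis=1)
    return mask.reshape(-1)  # concatenate the per-block masks


rng = np.random.default_rng(0)
M, k, rate = 128, 256, 0.5
b = sample_state_dropout_mask(M, k, rate, rng)

kept_per_block = b.reshape(M, k).sum(axis=1)
all_transitions = (M * k) ** 2                      # |Z|^2 = 2^30
blocked_transitions = k ** 2                        # k^2 = 2^16 per timestep
dropout_transitions = int(kept_per_block[0]) ** 2   # (k/2)^2 = 2^14 per timestep
print(all_transitions, blocked_transitions, dropout_transitions)
print(blocked_transitions // dropout_transitions)   # reduction factor from dropout
```

Halving the states on both sides of each transition quarters the number of state pairs, which matches the 4× figure quoted for $\lambda = 0.5$.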

Experimental Setup
Emission Blocks. The model requires partitioning token types into blocks $\mathcal{X}_m$. While there are many partitioning methods, a natural choice is Brown clusters (Brown et al., 1992; Liang, 2005), which are also based on HMMs. Brown clusters are obtained by assigning every token type in $\mathcal{X}$ a state in an HMM, then merging states until a desired number of partitions $M$ is reached. We construct the Brown clusters on the training portions of the datasets and assume the vocabulary remains identical at test time (with OOV words mapped to unk). We include more background on Brown clusters in the appendix.
State Dropout. We use a dropout rate of $\lambda = 0.5$ at training time. For each block of $k$ states, we sample the states to use in that block each batch (with $\lambda = 0.5$, half of the block). We draw the states for each block from a multivariate hypergeometric distribution using the Gumbel top-k trick for sampling without replacement (Vieira, 2014).
[...] (2011), which lowercases all words and substitutes OOV words with unks. We insert EOS tokens after each sentence. For WIKITEXT-2, casing is preserved, and all OOV words are unked. We insert EOS tokens after each paragraph. In both datasets, OOV words (as unks) and EOS were included in the perplexity (Merity et al., 2017).
Baselines. Baselines include both state-of-the-art language models and other alternative LM styles. These include AWD-LSTM (Merity et al., 2017); a 900-state scalar HMM and HMM+RNN extension, which discards the HMM assumptions (Buys et al., 2018); a traditional Kneser-Ney 5-gram model (Mikolov and Zweig, 2012; Heafield et al., 2013); a 256-dimension feedforward neural model; and a 2-layer 256-dimension LSTM.
We compare these with our approach: the very large neural HMM (VL-HMM). Unless otherwise noted, our model has $|\mathcal{Z}| = 2^{15}$ total states but only considers $k = 256$ states at every timestep at test time, with $M = 128$ groups. The state and word embeddings as well as the MLP have a hidden dimension of 256. We train with a state dropout rate of $\lambda = 0.5$. See the appendix for all hyperparameters.
Table 1 gives the main results. On PTB, the VL-HMM achieves 125.0 perplexity on the validation set, outperforming a FF baseline (159.9) and vastly outperforming the 900-state HMM from Buys et al. (2018) (284.6). The VL-HMM also outperforms the HMM+RNN extension of Buys et al. (2018) (142.3). Buys et al. (2018) only report validation perplexity for the HMM and HMM+RNN models, so we compare accordingly. These results indicate that HMMs are a much stronger model on this benchmark than previously claimed. However, the VL-HMM is still outperformed by LSTMs, which have been extensively studied for this task. This trend persists on WIKITEXT-2, with the VL-HMM outperforming the FF model but underperforming an LSTM.
Figure 3 examines the effect of state size: performance continues to improve significantly as we grow to $2^{16}$ states, justifying the large state space. The marginal improvement does shrink as the number of states increases, implying that the current approach may have limitations in scaling to even larger state spaces.
Table 2 considers other ablations: although neural and scalar parameterizations reach similar training perplexity, the neural model generalizes better on validation with almost 100x fewer model parameters. We find that state dropout results in both [...]


Conclusion
This work demonstrates methods for effectively scaling HMMs to large state spaces on parallel hardware, and shows that this approach results in accuracy gains compared to other HMM models.
In order to scale, we introduce three techniques: a blocked emission constraint, a neural parameterization, and state dropout, which lead to an HMM that outperforms n-gram models and prior HMMs. Once scaled up to take advantage of modern hardware, very large HMMs demonstrate meaningful improvements over smaller HMMs. HMMs are a useful class of probabilistic models with different inductive biases, performance characteristics, and conditional independence structure than RNNs. Future work includes using these approaches to induce model structure, to develop accurate models with better interpretability, and to build models for lower-data regimes.

A.1 Brown Clustering
Brown clustering is an agglomerative clustering approach (Brown et al., 1992; Liang, 2005) that assigns every token type a single cluster. The Brown clustering model aims to find an HMM that maximizes the likelihood of an observed corpus under the constraint that every token type can only be emitted by a single latent class. The cluster for a word is given by the latent class that emits that token type. Clusters are initialized by assigning every token type a unique latent state in an HMM. States are then merged iteratively until a desired number $M$ is reached. Liang (2005) proposes an algorithm that chooses a pair of states to merge at every iteration based on state bigram statistics within a window.

A.2 Hyperparameters
For PENN TREEBANK and WIKITEXT-2, we trained the following baselines: a two-layer FF 256-dim 5-gram model and a two-layer 256-dim LSTM. In the FF model, $E_w$ gives the word embeddings, $W_h \in \mathbb{R}^{h \times 4h}$, and $W_x \in \mathbb{R}^{|\mathcal{X}| \times h}$ is weight-tied to the word embeddings. The LSTM model is a 2-layer LSTM with weight-tied $W_x$ and $E_w$.
For the (5-gram) FF model we use a batch size of 128 and a BPTT length of 64, as we found the model needed a larger batch size to achieve decent performance. For the LSTM, we use a batch size of 16 and a BPTT length of 32. For both baseline models we use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 1e-3 and a dropout rate of 0.3 on the activations in the model. Both models use a hidden dimension of $h = 256$ throughout. These same hyperparameters were applied on both PENN TREEBANK and WIKITEXT-2.
For the HMMs we use a batch size of 16 and a BPTT length of 32. We use state dropout with rate $\lambda = 0.5$. We reset the state distribution to $p(z_1 \mid z_0)$ after encountering the EOS symbol. We use AdamW (Loshchilov and Hutter, 2017) with a learning rate of 1e-2 for PENN TREEBANK, and a learning rate of 1e-3 for WIKITEXT-2.
All weights are initialized with the Kaiming uniform initialization. The FF model was trained for 100 epochs, while all other models were trained for 50. Validation likelihood was checked 4 times per epoch, and learning rates were decayed by a factor of 4 if the validation performance did not improve after 8 consecutive checks.
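The learning-rate schedule described above can be sketched as a small helper. The class name and interface are hypothetical, and two details are assumptions rather than facts from the text: "did not improve" is taken to mean no new best validation perplexity, and the counter is reset after each decay.

```python
class PlateauDecay:
    """Decay the learning rate when validation perplexity stops improving.

    Defaults mirror the text: divide by 4 after 8 consecutive validation
    checks without improvement.
    """

    def __init__(self, lr, factor=4.0, patience=8):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_perplexity):
        """Record one validation check and return the current learning rate."""
        if val_perplexity < self.best:
            self.best = val_perplexity
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.lr /= self.factor
                self.bad_checks = 0  # assumption: start counting afresh
        return self.lr
```

With four checks per epoch, a patience of eight corresponds to two epochs without a new best validation score before the rate is reduced.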
Hyperparameter search was performed manually, using the best validation perplexity achieved in a run.
In order to reduce the number of parameters further, we experiment with factored state embeddings. We factor the state embeddings into a composition of smaller state embeddings ($E_z \in \mathbb{R}^{|\mathcal{Z}| \times h/2}$) as well as block embeddings ($E_m \in \mathbb{R}^{M \times h/2}$), which are shared across all states within the same emission block, i.e. all $z \in \mathcal{Z}_x$ share a block embedding. To compose these embeddings, we introduce new residual networks $f_j$, $j \in \{o, i, e\}$, similar to the above. We ablate the factored state embeddings in Sec. A.5.

A.4 Emission Constraint Ablation
Table 3 shows the results from emission constraint ablations. With a VL-HMM that has $|\mathcal{Z}| = 2^{14}$ states, the model is insensitive to the number of blocks $M$ within the range we could explore given computational constraints. However, with fewer states, $|\mathcal{Z}| = 2^{10}$, we are able to explore a lower number of blocks. With $M = 4$ blocks, the block-sparse HMM matches an unconstrained HMM with the same number of states. When $M = 8$, the block-sparse model underperforms, implying there may be room for improvement with the larger HMMs that use $M > 8$ blocks.
We additionally compare the blocks induced by Brown clustering with a uniform constraint that samples subsets of states of size $n$ independently and uniformly from $\mathcal{Z}$. This does not admit a partitioning, which makes it difficult to apply state dropout. We therefore zero out half of the columns of the transition matrix randomly before normalization. In the bottom of Table 3, we find that models with uniform constraints are consistently outperformed by models with Brown cluster constraints as measured by validation perplexity. The models with uniform constraints also have poor validation performance despite better training performance, a symptom of overfitting.
These ablations demonstrate that the constraints based on Brown clusters used in this work may not be optimal, motivating future work that learns sparsity structure.

A.5 Factored State Representation Ablation
We examine the effect of factoring state representations into block embeddings and independent state embeddings. The results of the factored state ablation are in Figure 4. We find that the performance of independent state embeddings is similar to that of a model with factored embeddings, but slightly worse in perplexity.
In Table 4 we see that although the factored state embeddings reduce the total number of parameters, the computation time and perplexity both get worse.

A.6 Computational Considerations
We reproduce the technique ablation table in Table 4 for reference. As we remove neural components, the number of parameters increases but the time of the forward pass decreases. This is because generating parameters from a neural network takes strictly more time than having those parameters available.
When block embeddings are removed and the full state representations are directly parameterized, the model is faster because it does not need to recompute the full state representations. This contrast is even more pronounced when removing neural components altogether and using a scalar parameterization, with an almost 3x speedup. This is because the distributional parameters do not need to be regenerated by a neural network if they are parameterized directly.