Unsupervised Neural Hidden Markov Models

In this work, we present the first results for neuralizing an unsupervised Hidden Markov Model. We evaluate our approach on tag induction. Our approach outperforms existing generative models and is competitive with the state of the art, though with a simpler model that is easily extended to include additional context.


Introduction
Probabilistic graphical models are among the most important tools available to the NLP community. In particular, the ability to train generative models using Expectation-Maximization (EM), Variational Inference (VI), and sampling methods like MCMC has enabled the development of unsupervised systems for tag and grammar induction, alignment, topic modeling, and more. These latent variable models discover hidden structure in text that aligns with known linguistic phenomena and whose clusters are easily identifiable.
Recently, much of supervised NLP has found great success by augmenting or replacing context, features, and word representations with embeddings derived from deep neural networks. These models allow for learning highly expressive non-convex functions by simply backpropagating prediction errors. Inspired by Berg-Kirkpatrick et al. (2010), who bridged the gap between supervised and unsupervised training with features, we bring neural networks to unsupervised learning by providing evidence that even in unsupervised settings, simple neural network models trained to maximize the marginal likelihood can outperform more complicated models that use expensive inference. (This research was carried out while all authors were at the Information Sciences Institute.)
In this work, we show how a single latent variable sequence model, Hidden Markov Models (HMMs), can be implemented with neural networks by simply optimizing the incomplete data likelihood. The key insight is to perform standard forward-backward inference to compute posteriors of latent variables and then backpropagate the posteriors through the networks to maximize the likelihood of the data.
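This key insight can be sketched concretely: after forward-backward produces posteriors, the quantity whose gradient is backpropagated is the posterior-weighted complete-data log-likelihood. Below is a minimal numpy sketch under our own naming conventions (this function is illustrative and not taken from the paper's released code); in an autograd framework, `log_emit` and `log_trans` would be network outputs and the posteriors would be held fixed (detached) during the backward pass.

```python
import numpy as np

def expected_complete_ll(log_emit, log_trans, gamma, xi):
    """Expected complete-data log-likelihood E_{p(z|x)}[ln p(x, z)].

    log_emit:  (T, K) log p(x_t | z_t = k) for one observed sentence
    log_trans: (K, K) log p(z_t = j | z_{t-1} = i)
    gamma:     (T, K) posteriors p(z_t = k | x)
    xi:        (T-1, K, K) posteriors p(z_{t-1} = i, z_t = j | x)

    Gradients of this scalar w.r.t. the parameters behind log_emit
    and log_trans give the M-step update; gamma and xi are constants.
    """
    return float((gamma * log_emit).sum() + (xi * log_trans[None]).sum())

# Degenerate single-state example: the expectation reduces to the
# plain log-likelihood of the emissions.
value = expected_complete_ll(
    np.full((3, 1), -1.0),      # three emissions, each ln p = -1
    np.zeros((1, 1)),           # a single state, ln p(z'|z) = 0
    np.ones((3, 1)),            # posteriors are all 1
    np.ones((2, 1, 1)))
```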
Using features in unsupervised learning has been a fruitful enterprise (Das and Petrov, 2011; Berg-Kirkpatrick and Klein, 2010; Cohen et al., 2011), and attempts to combine HMMs and neural networks date back to 1991 (Bengio et al., 1991). Additionally, similarity metrics derived from word embeddings have been shown to improve unsupervised word alignment (Songyot and Chiang, 2014).
Interest in the interface of graphical models and neural networks has grown recently as new inference procedures have been proposed (Kingma and Welling, 2014; Johnson et al., 2016). Common to this work and ours is the use of neural networks to produce potentials. The approach presented here is easily applied to other latent variable models where inference is tractable and which are typically trained with EM. We believe it has three important strengths: 1. Using a neural network to produce model probabilities allows for seamless integration of additional context not easily represented by conditioning variables in a traditional model. Our focus in this preliminary work is to present a generative neural approach to HMMs and to demonstrate how this framework lends itself to modularity (e.g. the easy inclusion of morphological information via Convolutional Neural Networks, §5) and the addition of extra conditioning context (e.g. using an RNN to model the sentence, §6). Our approach is demonstrated and evaluated on the simple task of part-of-speech tag induction. Future work should investigate the second and third proposed strengths.

Framework
Graphical models have been widely used in NLP. Typically, potential functions ψ(z, x) over a set of latent variables, z, and observed variables, x, are defined based on hand-crafted features. Moreover, independence assumptions between variables are often made for the sake of tractability. Here, we propose using neural networks (NNs) to produce the potentials, since neural networks are universal approximators. Neural networks can extract useful task-specific abstract representations of data. Additionally, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) based Recurrent Neural Networks (RNNs) allow for modeling unbounded context with far fewer parameters than naive one-hot feature encodings. The reparameterization of potentials with neural networks (NNs) is seamless. The sequence of observed variables is denoted x = {x_1, . . . , x_n}. In unsupervised learning, we aim to find model parameters θ that maximize the evidence p(x | θ). We focus on cases where the posterior is tractable and we can use Generalized EM (Dempster et al., 1977) to estimate θ.
Table 1: Example tagged sentence.
Text:  Pierre  Vinken  will  join  the  board
PTB:   NNP     NNP     MD    VB    DT   NN

We maximize the evidence via the standard decomposition

ln p(x | θ) = E_{q(z)}[ln p(x, z | θ)] + H[q(z)] + KL(q(z) ‖ p(z | x, θ))

where q(z) is an arbitrary distribution, and H is the entropy function. The E-step of EM estimates the posterior p(z | x) based on the current parameters θ.
In the M-step, we choose q(z) to be the posterior p(z | x), setting the KL-divergence to zero. Additionally, the entropy term H[q(z)] is a constant and can therefore be dropped. This means updating θ only requires maximizing

Q(θ) = Σ_z p(z | x) ln p(x, z | θ)

The gradient is therefore defined in terms of the gradient of the joint probability, scaled by the posteriors:

∂/∂θ Q(θ) = Σ_z p(z | x) ∂/∂θ ln p(x, z | θ)    (5)

In order to perform the gradient update in Eq. 5, we need to compute the posterior p(z | x). This can be done efficiently with message passing. Note that in cases where the derivative ∂/∂θ ln p(x, z | θ) is easy to evaluate, we can instead perform direct marginal likelihood optimization (Salakhutdinov et al., 2003). We do not address the question of semi-supervised training here, but believe the framework we present lends itself naturally to the incorporation of constraints or labeled data. Next, we demonstrate the application of this framework to HMMs in the service of part-of-speech tag induction.
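The E-step/M-step interaction above can be verified numerically on a toy problem: when q(z) is set to the exact posterior, the KL term vanishes and the lower bound meets the evidence exactly. A small numpy check (the joint probability values are arbitrary illustrative numbers, not from the paper):

```python
import numpy as np

# Toy model with one latent variable z in {0, 1} and a fixed observation x.
p_joint = np.array([0.12, 0.28])     # p(x, z=0), p(x, z=1)
p_x = p_joint.sum()                  # evidence p(x) = 0.4
q = p_joint / p_x                    # E-step: q(z) = p(z | x)

# ELBO = E_q[ln p(x, z)] + H[q(z)]; with q equal to the exact
# posterior, the KL term is zero and the bound is tight.
elbo = (q * np.log(p_joint)).sum() - (q * np.log(q)).sum()
```

Running this confirms `elbo == ln p(x)` up to floating-point error, which is exactly the tightness property the M-step exploits.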

Part-of-Speech Induction
Part-of-speech tags encode morphosyntactic information about a language and are a fundamental tool in downstream NLP applications. In English, the Penn Treebank (Marcus et al., 1994) distinguishes 36 categories and punctuation. Tag induction is the task of taking raw text and both discovering these latent clusters and assigning them to words in situ. Classes can be highly specific to their syntactic role (e.g. six types of verbs in English). Example tags are shown in Table 1. In this example, board is labeled as a singular noun while Pierre Vinken is a singular proper noun.
Two natural applications of induced tags are as the basis for grammar induction (Spitkovsky et al., 2011; Bisk et al., 2015) or to provide a syntactically informed, though unsupervised, source of word embeddings.

Figure 1: Pictorial representation of a Hidden Markov Model. Latent variable (z_t) transitions depend on the previous value (z_{t−1}), and an observed word (x_t) is emitted at each time step.

The Hidden Markov Model
A common model for this task, and our primary workhorse, is the Hidden Markov Model trained with the unsupervised message passing algorithm, Baum-Welch (Welch, 2003).
Model HMMs model a sentence by assuming that (a) every word token is generated by a latent class, and (b) the current class at time t is conditioned on the class at time t−1. Formally, this gives us an emission probability p(x_t | z_t) and a transition probability p(z_t | z_{t−1}). The graphical model is drawn pictorially in Figure 1, where shaded circles denote observations and empty ones are latent. The probability of a given sequence of observations x and latent variables z is given by multiplying transitions and emissions across all time steps:

p(x, z) = Π_t p(x_t | z_t) p(z_t | z_{t−1})    (6)

Finding the optimal sequence of latent classes corresponds to computing an argmax over the values of z.
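The factorization of the joint probability can be sketched directly in code. The following numpy function (our own illustrative naming; the initial class is scored with an explicit start distribution, a common convention) computes ln p(x, z) for a fixed assignment of latent classes:

```python
import numpy as np

def hmm_log_joint(log_pi, log_trans, log_emit, z, x):
    """ln p(x, z) for an HMM: initial + transition + emission scores.

    log_pi:    (K,) log initial class probabilities
    log_trans: (K, K) log p(z_t = j | z_{t-1} = i)
    log_emit:  (K, V) log p(x_t = v | z_t = k)
    z, x:      latent class ids and observed word ids
    """
    score = log_pi[z[0]] + log_emit[z[0], x[0]]
    for t in range(1, len(x)):
        score += log_trans[z[t - 1], z[t]] + log_emit[z[t], x[t]]
    return score

# Toy example: 2 classes, 2 word types (numbers are illustrative).
joint = np.exp(hmm_log_joint(
    np.log([0.5, 0.5]),
    np.log([[0.9, 0.1], [0.1, 0.9]]),
    np.log([[0.8, 0.2], [0.2, 0.8]]),
    [0, 0], [0, 0]))    # p = 0.5 * 0.8 * 0.9 * 0.8
```

The argmax over z mentioned above is then computed with the Viterbi recursion rather than by enumerating assignments.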
Because our task is unsupervised we do not have a priori access to these distributions, but they can be estimated via Baum-Welch. The algorithm's outline is provided in Algorithm 1.
Training an HMM with EM is highly non-convex and likely to get stuck in local optima (Johnson, 2007). Despite this, sophisticated Bayesian smoothing leads to state-of-the-art performance (Blunsom and Cohn, 2011). Blunsom and Cohn (2011) further extend the HMM by augmenting its emission distributions with character models to capture morphological information and a trigram transition matrix which conditions on the previous two states. Recently, Lin et al. (2015) extended several models, including the HMM, to include pre-trained word embeddings learned by different skip-gram models. Our work fully neuralizes the HMM and learns embeddings during the training of our generative model. There has also been recent work by Rastogi et al. (2016) on neuralizing Finite-State Transducers.

Algorithm 1 Baum-Welch
  Randomly initialize distributions (θ)
  repeat
    Compute forward messages (α) and backward messages (β)
    Compute posteriors p(z_t | x) and p(z_t, z_{t−1} | x)
    Update θ from the posteriors
  until converged
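The E-step of Baum-Welch is the classic forward-backward recursion. A compact numpy sketch (function and variable names are ours; a production version would work in log space for numerical stability):

```python
import numpy as np

def forward_backward(pi, A, B, x):
    """E-step of Baum-Welch: posteriors over latent tags.

    pi: (K,) initial distribution, A: (K, K) transitions,
    B:  (K, V) emissions, x: list of observed word ids.
    Returns gamma (T, K) = p(z_t | x) and the evidence p(x).
    """
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):                    # forward messages
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    for t in range(T - 2, -1, -1):           # backward messages
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    evidence = alpha[-1].sum()
    gamma = alpha * beta / evidence          # posteriors p(z_t | x)
    return gamma, evidence

# Tiny 2-state, 2-word example with illustrative numbers.
gamma, ev = forward_backward(
    np.array([0.6, 0.4]),
    np.array([[0.7, 0.3], [0.2, 0.8]]),
    np.array([[0.9, 0.1], [0.5, 0.5]]),
    [0, 1])
```

Each row of `gamma` sums to one, and `evidence` matches the brute-force sum over all latent paths.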

Additional Comparisons
While the main focus of our paper is the seamless extension of an unsupervised generative latent variable model with neural networks, for completeness we also include comparisons to other techniques which do not adhere to the generative assumption. We include Brown clusters (Brown et al., 1992) as a baseline and two clustering techniques as state-of-the-art comparisons: Christodoulopoulos et al. (2011) and Yatbaz et al. (2012).
Of particular interest to us is the work of Brown et al. (1992). Brown clustering groups word types through greedy agglomerative clustering according to their mutual information across the corpus, based on bigram probabilities. Brown clusters do not account for a word's membership in multiple syntactic classes, but are a very strong baseline for tag induction. It is possible our approach could be improved by augmenting our objective function to include mutual information computations or a bias towards a harder clustering.

Neural HMM
The aforementioned training of an HMM assumes access to two distributions: (1) emissions, with K × V parameters, and (2) transitions, with K × K parameters, where K is the number of clusters and V is the number of word types in our vocabulary. Our neural HMM (NHMM) replaces these matrices with the outputs of simple feed-forward neural networks. All conditioning variables are presented as input to the network, and its final softmax layer provides probabilities. This should replicate the behavior of the standard HMM, but without an explicit representation of the necessary distributions.

Producing Probabilities
Producing emission and transition probabilities allows for standard inference to take place in the model.

Emission Architecture
Let v_k ∈ R^D be the vector embedding of tag z_k, and let w_i ∈ R^D and b_i be the vector embedding and bias of word i, respectively. The emission probability p(w_i | z_k) is given by

p(w_i | z_k) = exp(w_i^T v_k + b_i) / Σ_j exp(w_j^T v_k + b_j)

The emission probability can be implemented by a neural network where w_i is the weight of unit i at the output layer of the network. The tag embeddings v_k are obtained by a simple feed-forward neural network consisting of a lookup table followed by a nonlinear activation (ReLU). When using morphology information (§5), we first use another network to produce the word embeddings w_i.
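A minimal numpy sketch of this emission softmax, with our own function name and shapes (each row of the result is the distribution over the vocabulary for one tag):

```python
import numpy as np

def emission_probs(W, b, v):
    """p(word i | tag k) proportional to exp(w_i . v_k + b_i).

    W: (V, D) output-layer word embeddings, b: (V,) word biases,
    v: (K, D) tag embeddings (in the paper these come from a lookup
    table followed by a ReLU).
    """
    logits = v @ W.T + b                           # (K, V)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)        # row k: p(. | z_k)

# With identical word weights, every tag emits uniformly (V = 3, K = 2).
P = emission_probs(np.ones((3, 2)), np.zeros(3),
                   np.array([[1., 0.], [0., 1.]]))
```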
Transition Architecture We produce the transition probability directly by using a linear layer of size D × K². More specifically, let q ∈ R^D be a query embedding. The unnormalized transition matrix T is computed as

T = qU + b

where U ∈ R^{D×K²} and b ∈ R^{K²}. We then reshape T to a K × K matrix and apply a softmax per row to produce valid transition probabilities.
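The reshape-then-row-softmax step can be sketched as follows (our own naming; with zero weights the layer degenerates to a uniform transition matrix, which makes the normalization easy to check):

```python
import numpy as np

def transition_probs(q, U, b, K):
    """Unnormalized T = qU + b in R^{K^2}, reshaped to K x K, with a
    softmax per row so each row is a valid distribution p(z_t | z_{t-1})."""
    T = (q @ U + b).reshape(K, K)
    T -= T.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(T)
    return e / e.sum(axis=1, keepdims=True)

# Zero weights (D = 4, K = 3) yield the uniform K x K transition matrix.
Tm = transition_probs(np.zeros(4), np.zeros((4, 9)), np.zeros(9), 3)
```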

Training the Neural Network
The probabilities can now be used to perform the aforementioned forward and backward passes over the data to compute posteriors. In this way, we perform the E-step as though we were training a vanilla HMM. Traditionally, these values would simply be re-normalized during the M-step to re-estimate model parameters. Instead, we use them to re-scale our gradients (following the discussion from §2).
Combining the HMM factorization of the joint probability p(x, z) from Eq. 6 with the gradient from Eq. 5 yields the following update rule:

∂/∂θ Q(θ) = Σ_t [ Σ_{z_t} p(z_t | x) ∂/∂θ ln p(x_t | z_t) + Σ_{z_t, z_{t−1}} p(z_t, z_{t−1} | x) ∂/∂θ ln p(z_t | z_{t−1}) ]

The posteriors p(z_t | x) and p(z_t, z_{t−1} | x) are obtained by running Baum-Welch as shown in Algorithm 1. Where traditional supervised training can follow a clear gradient signal towards a specific assignment, here we propagate the model's (un)certainty instead. An additional complication introduced by this paradigm is the question of how many gradient steps to take on a given minibatch. In incremental EM, the posteriors are simply accumulated and normalized; here, we repeatedly recompute gradients on a minibatch until the maximum number of epochs is reached or a convergence threshold is met. Finally, notice that the factorization of the HMM allows us to evaluate the joint distribution p(x, z | θ) easily. We therefore employ Direct Marginal Likelihood (DML) optimization (Salakhutdinov et al., 2003) to optimize the model's parameters. After trying both EM and DML, we found EM to be slower to converge and to perform slightly worse. For this reason, the presented results are all trained with DML.
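The DML objective itself, ln p(x | θ), is just the forward recursion carried out in log space; in an autograd framework this scalar is differentiated directly instead of scaling gradients by posteriors. A numpy sketch under our own naming:

```python
import numpy as np

def log_marginal(log_pi, log_A, log_B, x):
    """ln p(x | theta) via the forward recursion in log space
    (the Direct Marginal Likelihood objective)."""
    def logsumexp(a, axis):
        m = a.max(axis=axis, keepdims=True)
        return (m + np.log(np.exp(a - m).sum(axis=axis,
                                             keepdims=True))).squeeze(axis)
    la = log_pi + log_B[:, x[0]]
    for t in range(1, len(x)):
        # la[i] + log_A[i, j], summed (in probability space) over i
        la = logsumexp(la[:, None] + log_A, axis=0) + log_B[:, x[t]]
    return logsumexp(la, axis=0)

# Toy 2-state HMM with illustrative numbers; ln p(x) for x = [0, 1].
lp = float(log_marginal(np.log([0.6, 0.4]),
                        np.log([[0.7, 0.3], [0.2, 0.8]]),
                        np.log([[0.9, 0.1], [0.5, 0.5]]),
                        [0, 1]))
```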

HMM and Neural HMM Equivalence
An important result we see in Table 2 is that the Neural HMM (NHMM) performs almost identically to the HMM. At this point, we have replaced the underlying machinery, but the model still has the same information bottlenecks as a standard HMM, which limit the amount and type of information carried between words in the sentence. Additionally, both approaches optimize the same objective function, the data likelihood, via the computation of posteriors. The equivalence is an important sanity check. The following two sections will demonstrate the extensibility of this approach.

Figure 2: A character convolutional neural network is used to compute the weights of the linear layer for every minibatch.

Convolutions for Morphology
The first benefit of moving to neural networks is the ease with which new information can be provided to the model. The first experiment we perform is replacing words with embedding vectors derived from a Convolutional Neural Network (CNN) (Kim et al., 2016; Jozefowicz et al., 2016). We use convolutional kernels with widths from 1 to 7, covering up to 7-character n-grams (Figure 2). This allows the model to automatically learn lexical representations based on prefix, suffix, and stem information about a word. No additional changes to learning are required for this extension.
Adding the convolution does not dramatically slow down our model, because the emission distributions can be computed for the whole batch in one operation: we simply pass the entire vocabulary through the convolution at once.
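The character-CNN word embedding can be sketched as follows: convolve filters of each width over the character embedding matrix, apply a ReLU, max-pool over time, and concatenate across widths. This is a simplified illustration in our own notation (the paper uses widths 1 through 7; here a single width keeps the example small):

```python
import numpy as np

def char_cnn_embed(chars, E, filters):
    """Word embedding from characters via convolution + max-over-time.

    chars:   list of character ids for one word
    E:       (C, d) character embedding table
    filters: dict {width w: (F_w, w * d) filter matrix}
    """
    X = E[chars]                                   # (L, d)
    L, d = X.shape
    feats = []
    for w, W in sorted(filters.items()):
        if L < w:                                  # pad short words
            X = np.vstack([X, np.zeros((w - L, d))])
        windows = np.stack([X[i:i + w].ravel()
                            for i in range(X.shape[0] - w + 1)])
        conv = np.maximum(windows @ W.T, 0.0)      # ReLU, (positions, F_w)
        feats.append(conv.max(axis=0))             # max over time
    return np.concatenate(feats)

# Tiny example: 3-character alphabet, d = 2, one width-1 filter bank.
E = np.array([[1., 0.], [0., 1.], [1., 1.]])
f = char_cnn_embed([0, 1], E, {1: np.ones((2, 2))})
```

Batching this over the whole vocabulary, as the paper does, amounts to stacking all words' character matrices and running the convolution once.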

Infinite Context with LSTMs
One of the greatest strengths of neural networks is their ability to create compact representations of data. We explore this here in the creation of transition matrices. In particular, we chose to augment the transition matrix with all preceding words in the sentence: p(z_t | z_{t−1}, w_0, . . . , w_{t−1}). Incorporating this amount of context in a traditional HMM is intractable and impossible to estimate, as the number of parameters grows exponentially.
For this reason, we use a stacked LSTM to form a low-dimensional representation of the sentence (C_{0...t−1}) which can easily be fed to our network when producing a transition matrix: p(z_t | z_{t−1}, C_{0...t−1}) in Figure 3. Because the LSTM only consumes up to the previous word, we do not break any sequential generative model assumptions. In terms of model architecture, the query embedding q is replaced by the hidden state h_{t−1} of the LSTM at time step t−1.
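Replacing the static query with a per-timestep hidden state means the model now produces one transition matrix per position. A numpy sketch in our own notation, with placeholder hidden states standing in for real LSTM outputs:

```python
import numpy as np

def contextual_transitions(H, U, b, K):
    """Time-varying transitions p(z_t | z_{t-1}, w_<t): the static query
    q is replaced by the LSTM state h_{t-1}, giving one K x K matrix
    per time step.

    H: (T, D) hidden states h_0..h_{T-1} (stand-ins for LSTM outputs).
    """
    T = (H @ U + b[None]).reshape(len(H), K, K)
    T -= T.max(axis=2, keepdims=True)
    e = np.exp(T)
    return e / e.sum(axis=2, keepdims=True)     # (T, K, K), rows normalized

# Zero hidden states (T = 5, D = 4, K = 3) give uniform matrices.
Tm = contextual_transitions(np.zeros((5, 4)), np.zeros((4, 9)),
                            np.zeros(9), 3)
```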

Evaluation
Once a model is trained, the single best latent sequence is extracted for every sentence and evaluated on three metrics.
Many-to-One (M-1) Many-to-one computes the most common true part-of-speech tag for each cluster. It then computes tagging accuracy as if the cluster were replaced with that tag. This metric is easily gamed by introducing a large number of clusters.
One-to-One (1-1) One-to-One performs the same computation as Many-to-One but only one cluster is allowed to be assigned to a given tag. This prevents the gaming of M-1.
V-Measure (VM) V-Measure is an F-measure which trades off conditional entropy between the clusters and gold tags. Christodoulopoulos et al. (2010) found VM to be the most informative and consistent metric, in part because it is agnostic to the number of induced tags.
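The many-to-one metric described above is straightforward to implement; a minimal Python sketch (our own function name) makes the mapping step explicit:

```python
from collections import Counter

def many_to_one(pred, gold):
    """M-1 accuracy: map each induced cluster to its most frequent gold
    tag, then score as plain tagging accuracy."""
    best = {}
    for c in set(pred):
        counts = Counter(g for p, g in zip(pred, gold) if p == c)
        best[c] = counts.most_common(1)[0][0]
    return sum(best[p] == g for p, g in zip(pred, gold)) / len(gold)

# One cluster, mostly 'N' gold tags: the cluster maps to 'N', 3/4 correct.
acc = many_to_one([0, 0, 0, 0], ['N', 'N', 'V', 'N'])
```

One-to-one differs only in that each gold tag may be claimed by at most one cluster, which turns the mapping step into an assignment problem.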

Data and Parameters
To evaluate our approaches, we follow the existing literature and train and test on the full WSJ corpus.
There are three components of our models which can be tuned. This requires care, since train and test are the same data; to avoid cheating, no values were tuned in this work.
Architecture The first parameter is the number of hidden units. We chose 512 because it was the largest power of two that fit in memory. When we extended our model to include the convolutional emission network, we used only 128 units, due to the intensive computation of the Char-CNN over the whole vocabulary for every minibatch. The second design choice was the number of LSTM layers. We used a three-layer LSTM, as it worked well for Tran et al. (2016), and we applied dropout (Srivastava et al., 2014) over the vertical connections of the LSTMs (Pham et al., 2014) with a rate of 0.5.
Finally, the maximum number of inner loop updates applied per batch is set to six. We train all the models for five epochs and perform gradient clipping whenever the gradient norm is greater than five. To determine when to stop applying the gradient during training, we simply check whether the log probability has converged ((new − old)/old < 10⁻⁴) or the maximum number of inner loops has been reached. All optimization was done using Adam (Kingma and Ba, 2015) with default hyper-parameters.
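The inner-loop stopping rule is simple enough to state in a few lines; a sketch under our own naming, with the thresholds from the text as defaults:

```python
def converged(new, old, step=0, tol=1e-4, max_inner=6):
    """Stop the inner-loop updates on a minibatch when the relative
    change in log probability falls below tol, or when max_inner
    gradient steps have already been taken."""
    return step >= max_inner or abs((new - old) / old) < tol
```

For example, moving from a log probability of −100.0001 to −100.0 is a relative change of about 10⁻⁶, so the loop stops; a jump from −100 to −90 (10% relative change) keeps iterating unless the step budget is exhausted.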
Initialization In addition to architectural choices, we must initialize all of our parameters. Word embeddings (and character embeddings in the CNN) are drawn from a Gaussian N(0, 1). The weights of all linear layers in the model are drawn from a uniform distribution with mean zero and a standard deviation of 1/n_in, where n_in is the input dimension of the linear layer (the default parameter initialization in Torch). Additionally, weights for the LSTMs are initialized using N(0, 1/(2n)), where n is the number of hidden units, and the bias of the forget gate is set to 1, as suggested by Józefowicz et al. (2015). We present some parameter and modeling ablation analysis in §10.
It is worth emphasizing that parameters are shared at the lower level of our network architectures (see Figure 2 and Figure 3). Sharing parameters not only allows the networks to share statistical strength, but also reduces the computational cost of computing sufficient statistics during training, due to the marginalization over latent variables. In all of our experiments, we use a minibatch size of 256 and sentences of 40 words or fewer, due to memory constraints. Evaluation was performed on all sentence lengths. Additionally, we map all digits to 0, but do not lower-case the data or perform any other preprocessing. All model code is available online for extension and replication at https://github.com/ketranm/neuralHMM.

Table 2: We see significant gains from both morphology (+Conv) and extended context (+LSTM). The combination of these approaches results in a very simple system which is competitive with the best generative model in the literature.

Results
Our results are presented in Table 2 along with two baseline systems and the four top-performing, state-of-the-art approaches. As noted earlier, we are happy to see that our NHMM performs almost identically to the standard HMM. Second, we find that our approach, while simple and fast, is competitive with Blunsom and Cohn (2011). Their Hierarchical Pitman-Yor Process for trigram HMMs with character modeling is a very sophisticated Bayesian approach and the most appropriate comparison to our work.
We see that both extended context (+LSTM) and the addition of morphological information (+Conv) provide substantial boosts to performance. Interestingly, the gains are not completely complementary: the six and twelve point gains of these additions combine to a total gain of only sixteen points.

Parameter Ablation
Our model design decisions and weight initializations were chosen based on best practices from the supervised training literature. Fortunately, these also behaved well in the unsupervised setting. Within unsupervised structure prediction, to the best of our knowledge, there has been no empirical study of neural network architecture design and weight initialization. We therefore provide an initial overview of the topic for several of our decisions.
Weight Initialization If we run our best model (NHMM+Conv+LSTM) with all weights initialized from a uniform distribution U(−10⁻⁴, 10⁻⁴), we find a dramatic drop in V-Measure (61.7 vs. 71.7 in Table 3). This is consistent with the common wisdom that, unlike in supervised learning (Luong et al., 2015), weight initialization is important for achieving good performance on unsupervised tasks. It is possible that performance could be further enhanced via the popular technique of ensembling, which would allow for combining models that converged to different local optima.

LSTM Layers And Dropout
We find that dropout is important in training an unsupervised NHMM.
Removing dropout causes performance to drop six points. To avoid tuning the dropout rate, future work might investigate the effect of variational dropout in unsupervised learning. We also observed that the number of LSTM layers has an impact on V-Measure: had we used a single layer, we would have lost nearly five points. It is possible that more layers, perhaps coupled with more data, would yield even greater gains.

Future Work
In addition to parameter tuning and multilingual evaluation, the biggest open questions for our approach are the effects of additional data and of augmenting the loss function. Neural networks are notoriously data hungry; while we achieve competitive results here, it is possible our model will scale well when run on larger corpora. This would likely require techniques like NCE (Gutmann and Hyvärinen, 2010), which have been shown to be highly effective in related tasks like neural language modeling (Mnih and Teh, 2012; Vaswani et al., 2013). Secondly, despite our focus on ways to augment an HMM, Brown clustering and systems inspired by it perform very well; they aim to maximize mutual information rather than likelihood. It is possible that augmenting or constraining our loss will yield additional performance gains.
Outside of simply maximizing performance on tag induction, a more subtle but powerful contribution of this work may be its demonstration of how easily and effectively neural networks can be used with Bayesian models traditionally trained by EM. We hope this approach scales well to many other domains and tasks.