Trans-dimensional Random Fields for Language Modeling

Language modeling (LM) involves determining the joint probability of words in a sentence. The conditional approach is dominant, representing the joint probability in terms of conditionals. Examples include n-gram LMs and neural network LMs. An alternative approach, called the random ﬁeld (RF) approach, is used in whole-sentence maximum entropy (WSME) LMs. Although the RF approach has potential beneﬁts, the empirical results of previous WSME models are not satisfactory. In this paper, we revisit the RF approach for language modeling, with a number of innovations. We propose a trans-dimensional RF (TDRF) model and develop a training algorithm using joint stochastic approximation and trans-dimensional mixture sampling. We perform speech recognition experiments on Wall Street Journal data, and ﬁnd that our TDRF models lead to performances as good as the recurrent neural network LMs but are computationally more efﬁcient in computing sentence probability.


Introduction
Language modeling is crucial for a variety of computational linguistic applications, such as speech recognition, machine translation, handwriting recognition, information retrieval and so on. It involves determining the joint probability p(x) of a sentence x, which can be denoted as a pair x = (l, x l ), where l is the length and x l = (x 1 , . . . , x l ) is a sequence of l words.
Currently, the dominant approach is conditional modeling, which decomposes the joint probability of x l into a product of conditional probabilities 1 1 And the joint probability of x is modeled as p(x) = by using the chain rule, p(x i |x 1 , . . . , x i−1 ). (1) To avoid degenerate representation of the conditionals, the history of x i , denoted as h i = (x 1 , · · · , x i−1 ), is reduced to equivalence classes through a mapping φ(h i ) with the assumption p(x i |h i ) ≈ p(x i |φ(h i )).
( 2) Language modeling in this conditional approach consists of finding suitable mappings φ(h i ) and effective methods to estimate p(x i |φ(h i )). A classic example is the traditional n-gram LMs with φ(h i ) = (x i−n+1 , . . . , x i−1 ). Various smoothing techniques are used for parameter estimation (Chen and Goodman, 1999). Recently, neural network LMs, which have begun to surpass the traditional n-gram LMs, also follow the conditional modeling approach, with φ(h i ) determined by a neural network (NN), which can be either a feedforward NN (Schwenk, 2007) or a recurrent NN (Mikolov et al., 2011).
Remarkably, an alternative approach is used in whole-sentence maximum entropy (WSME) language modeling (Rosenfeld et al., 2001). Specifically, a WSME model has the form: Here f (x) is a vector of features, which can be arbitrary computable functions of x, λ is the corresponding parameter vector, and Z is the global normalization constant. Although WSME models have the potential benefits of being able to naturally express sentence-level phenomena and integrate features from a variety of knowledge p(x l )p( EOS |x l ), where EOS is a special token placed at the end of every sentence. Thus the distribution of the sentence length is implicitly modeled. sources, their performance results ever reported are not satisfactory (Rosenfeld et al., 2001;Amaya and Benedí, 2001;Ruokolainen et al., 2010).
The WSME model defined in (3) is basically a Markov random field (MRF). A substantial challenge in fitting MRFs is that evaluating the gradient of the log likelihood requires high-dimensional integration and hence is difficult even for moderately sized models (Younes, 1989), let alone the language model (3). The sampling methods previously tried for approximating the gradient are the Gibbs sampling, the Independence Metropolis-Hasting sampling and the importance sampling (Rosenfeld et al., 2001). Simple applications of these methods are hardly able to work efficiently for the complex, high-dimensional distribution such as (3), and hence the WSME models are in fact poorly fitted to the data. This is one of the reasons for the unsatisfactory results of previous WSME models.
In this paper, we propose a new language model, called the trans-dimensional random field (TDRF) model, by explicitly taking account of the empirical distributions of lengths. This formulation subsequently enables us to develop a powerful Markov chain Monte Carlo (MCMC) technique, called trans-dimensional mixture sampling and then propose an effective training algorithm in the framework of stochastic approximation (SA) (Benveniste et al., 1990;Chen, 2002). The SA algorithm involves jointly updating the model parameters and normalization constants, in conjunction with trans-dimensional MCMC sampling. Section 2 and 3 present the model definition and estimation respectively. Furthermore, we make several additional innovations, as detailed in Section 4, to enable successful training of TDRF models. First, the diagonal elements of hessian matrix are estimated during SA iterations to rescale the gradient, which significantly improves the convergence of the SA algorithm. Second, word classing is introduced to accelerate the sampling operation and also improve the smoothing behavior of the models through sharing statistical strength between similar words. Finally, multiple CPUs are used to parallelize the training of our RF models.
In Section 5, speech recognition experiments are conducted to evaluate our TDRF LMs, compared with the traditional 4-gram LMs and the recurrent neural network LMs (RNNLMs) (Mikolov et al., 2011) which have emerged as a new stateof-art of language modeling. We explore the use of a variety of features based on word and class information in TDRF LMs. In terms of word error rates (WERs) for speech recognition, our TDRF LMs alone can outperform the KN-smoothing 4gram LM with 9.1% relative reduction, and perform comparably to the RNNLM with a slight 0.5% relative reduction. To our knowledge, this result represents the first strong empirical evidence supporting the power of using the whole-sentence language modeling approach. Our open-source TDRF toolkit is released publicly 2 .

Model Definition
Throughout, we denote 3 by x l = (x 1 , . . . , x l ) a sentence (i.e., word sequence) of length l ranging from 1 to m. Each element of x l corresponds to a single word. For l = 1, . . . , m, we assume that sentences of length l are distributed from an exponential family model: . . , f d (x l )) T is the feature vector and λ = (λ 1 , λ 2 , . . . , λ d ) T is the corresponding parameter vector, and Z l (λ) is the normalization constant: Moreover, we assume that length l is associated with probability π l for l = 1, . . . , m. Therefore, the pair (l, x l ) is jointly distributed as p(l, x l ; λ) = π l p l (x l ; λ).
We provide several comments on the above model definition. First, by making explicit the role of lengths in model definition, it is clear that the model in (6) is a mixture of random fields on sentences of different lengths (namely on subspaces of different dimensions), and hence will be called a trans-dimensional random field (TDRF). Different from the WSME model (3), a crucial aspect of the TDRF model (6) is that the mixture weights π l can be set to the empirical length probabilities in the training data. The WSME model (3) is essentially also a mixture of RFs, but the mixture weights implied are proportional to the normalizing constants Z l (λ): where Z(λ) = m l=1 Z l (λ). A motivation for proposing (6) is that it is very difficult to sample from (3), namely (7), as a mixture distribution with unknown weights which typically differ from each other by orders of magnitudes, e.g. 10 40 or more in our experiments. Setting mixture weights to the known, empirical length probabilities enables us to develop a very effective learning algorithm, as introduced in Section 3. Basically, the empirical weights serve as a control device to improve sampling from multiple distributions (Liang et al., 2007;Tan, 2015) .
Second, it can be shown that if we incorporate the length features 4 in the vector of features f (x) in (3), then the distribution p(x; λ) in (3) under the maximum entropy (ME) principle will take the form of (6) and the probabilities (π 1 , . . . , π m ) in (6) implied by the parameters for the length features are exactly the empirical length probabilities.
Third, a feature f i (x l ), 1 ≤ i ≤ d, can be any computable function of the sentence x l , such as n-grams. In our current experiments, the features f i (x l ) and their corresponding parameters λ i are defined to be position-independent and lengthindependent. For example, f i (x l ) = k f i (x l , k), where f i (x l , k) is a binary function of x l evaluated at position k. As a result, the feature f i (x l ) takes values in the non-negative integers.

Model Estimation
We develop a stochastic approximation algorithm using Markov chain Monte Carlo to estimate the parameters λ and the normalization constants Z 1 (λ), ..., Z m (λ) (Benveniste et al., 1990;Chen, 2002). The core algorithms newly designed in this paper are the joint SA for simultaneously estimating parameters and normalizing constants (Section 3.2) and trans-dimensional mixture sampling (Section 3.3) which is used as Step I of the joint SA. The most relevant previous works that we borrowed from are (Gu and Zhu, 2001) on SA for fitting a single RF, (Tan, 2015) on sampling and estimating normalizing constants from multiple RFs of the same dimension, and (Green, 1995) on trans-dimensional MCMC.

Maximum likelihood estimation
Suppose that the training dataset consists of n l sentences of length l for l = 1, . . . , m. First, the maximum likelihood estimate of the length probability π l is easily shown to be n l /n, where n = m l=1 n l . By abuse of notation, we set π l = n l /n hereafter. Next, the log-likelihood of λ given the empirical length probabilities is where D l is the collection of sentences of length l in the training set. By setting to 0 the derivative of (8) with respect to λ, we obtain that the maximum likelihood estimate of λ is determined by the following equation: wherep[f ] is the expectation of the feature vector f with respect to the empirical distribution: and p λ [f ] is the expectation of f with respect to the joint distribution (6) with π l = n l /n: and p λ, . Eq.(9) has the form of equating empirical expectationsp[f ] with theoretical expectations p λ [f ], as similarly found in maximum likelihood estimation of single random field models.

Joint stochastic approximation
Training random field models is challenging due to numerical intractability of the normalizing constants Z l (λ) and expectations p λ,l [f ]. We propose a novel SA algorithm for estimating the parameters λ by (9) and, simultaneously, estimating the log ratios of normalization constants: Algorithm 1 Joint stochastic approximation Input: training set 1: set initial values λ (0) = (0, . . . , 0) T and Step I: MCMC sampling 5: Step II: SA updating 9: Compute λ (t) based on (14) 10: Compute ζ (t) based on (15) and (16)  11: end for where Z 1 (λ) is chosen as the reference value and can be calculated exactly. The algorithm can be obtained by combining the standard SA algorithm for training single random fields (Gu and Zhu, 2001) and a trans-dimensional extension of the self-adjusted mixture sampling algorithm (Tan, 2015).
The joint SA algorithm, whose pseudo-code is shown in Algorithm 1, consists of two steps at each time t as follows.
Step II: SA updating. Compute where γ λ is a learning rate of λ; compute where γ ζ is a learning rate of ζ, and δ l (B (t) ) is the relative frequency of length l appearing in B (t) : The rationale in (15) is to adjust ζ based on how the relative frequencies of lengths δ l (B (t) ) are compared with the desired length probabilities π l . Intuitively, if the relative frequency of some length l in the sample set B (t) is greater (or respectively smaller) than the desired length probability π l , then the hypothesized value ζ (t−1) l is an underestimate (or overestimate) of ζ * l (λ (t−1) ) and hence should be increased (or decreased).
Following Gu & Zhu (2001) and Tan (2015), we set the learning rates in two stages: where 0.5 < β λ , β ζ < 1. In the first stage (t ≤ t 0 ), a slow-decaying rate of t −β is used to introduce large adjustments. This forces the estimates λ (t) and ζ (t) to fall reasonably fast into the true values.
In the second stage (t > t 0 ), a fast-decaying rate of t −1 is used. The iteration number t is multiplied by 0.1 in (19), to make the the learning rate of ζ decay more slowly than λ. Commonly, t 0 is selected to ensure there is no more significant adjustment observed in the first stage.

Trans-dimensional mixture sampling
We describe a trans-dimensional mixture sampling algorithm to simulate from the joint distribution p(l, x l ; λ, ζ), which is used with (λ, ζ) = (λ (t−1) , ζ (t−1) ) at time t for MCMC sampling in the joint SA algorithm. The name "mixture sampling" reflects the fact that p(l, x l ; λ, ζ) represents a labeled mixture, because l is a label indicating that x l is associated with the distribution p l (x l ; ζ).
With fixed (λ, ζ), this sampling algorithm can be seen as formally equivalent to reversible jump MCMC (Green, 1995), which was originally proposed for Bayes model determination.
The trans-dimensional mixture sampling algorithm consists of two steps at each time t: local jump between lengths and Markov move of sentences for a given length. In the following, we denote by L (t−1) and X (t−1) the length and sequence before sampling, but use the short notation (λ, ζ) for (λ (t−1) , ζ (t−1) ).
Step I: Local jump. The Metropolis-Hastings method is used in this step to sample the length. Assuming L (t−1) = k, first we draw a new length j ∼ Γ(k, ·). The jump distribution Γ(k, l) is defined to be uniform at the neighborhood of k : where m is the maximum length. Eq.(20) restricts the difference between j and k to be no more than one. If j = k, we retain the sequence and perform the next step directly, i.e. set L (t) = k and X (t) = X (t−1) . If j = k + 1 or j = k − 1, the two cases are processed differently. If j = k + 1, we first draw an element (i.e., word) Y from a proposal distribution: where {X (t−1) , Y } denotes a sequence of length k + 1 whose first k elements are X (t−1) and the last element is Y . If j = k − 1, we set L (t) = j (= k − 1) and where X (t−1) 1:j is the first j elements of X (t−1) and is the kth element of X (t−1) . In (21) and (22), g k+1 (y|x k ) can be flexibly specified as a proper density function in y. In our application, we find the following choice works reasonably well: Step II: Markov move. After the step of local jump, we obtain Then we perform Gibbs sampling on X (t) , from the first element to the last element (Algorithm 2)

Algorithm Optimization and Acceleration
The joint SA algorithm may still suffer from slow convergence, especially when λ is highdimensional. We introduce several techniques for improving the convergence of the algorithm and reducing computational cost.

Improving SA recursion
We propose two techniques to effectively improve the convergence of SA recursion. The first technique is to incorporate Hessian information, similarly as in related works on stochastic approximation (Gu and Zhu, 2001) and stochastic gradient descent algorithms (Byrd et al., 2014). But we only use the diagonal elements of the Hessian matrix to re-scale the gradient, due to high-dimensionality of λ.
Taking the second derivatives of L(λ) yields where H i denotes the ith diagonal element of Hessian matrix. At time t, before updating the parameter λ (Step II in Section 3.2), we compute The second technique is to introduce the "minibatch" on the training set. At each iteration, a subset D (t) of K sentences are randomly selected from the training set. Then the gradient is approximated with the overall empirical expectationp[f ] being replaced by the empirical expectation over the subset D (t) . This technique is reminiscent of stochastic gradient descent using a random subsample of training data to achieve fast convergence of optimization algorithms (Bousquet and Bottou, 2008).
By combining the two techniques, we revise the updating equation (14) of λ to where 0 < h < 1 is a threshold to avoid H (t) i being too small or even zero. Moreover, a constant t c is added to the denominator of (18), to avoid too large adjustment of λ, i.e. Fig.1(a) shows the result after introducing hessian estimation, and Fig.1(b) shows the effect of training set mini-batching.

Sampling acceleration
For MCMC sampling in Section 3.3, the Gibbs sampling operation of drawing X (t) i (Step 2 in Algorithms 2) involves calculating the probabilities of all the possible elements in position i. This is computationally costly, because the vocabulary size |V| is usually 10 thousands or more in practice. As a result, the Gibbs sampling operation presents a bottleneck limiting the efficiency of sampling algorithms.
The idea of using class information to accelerate training has been proposed in various contexts of language modeling, such as maximum entropy models (Goodman, 2001b) and RNN LMs (Mikolov et al., 2011). However, the realization of this idea is different for training our models.
The pseudo-code of the new sampling method is shown in Algorithm 3. Denote by V c the subset of V containing all the words belonging to class c. In the Markov move step (Step 13 to 18 in Algorithm 3), at each position i, we first generate a class C from a proposal distribution Q i (c) and then accept C as the new c (t) i with probability i+1:L (t) }, but this is suppressed in the notation. Then we normalize the probabilities of words belonging to class c in Algorithm 3), we first generate C ∼ Q k+1 (c) and then generate Y from class C by g k+1 (y|x k , C) = p(k + 1, {x k , y}; λ, ζ) with x k = X (t−1) . Then we set L (t) = j and X (t) = {X (t−1) , Y } with probability as defined in (21), by setting If the proposal j = k − 1, similarly we use acceptance probability (22) with (32). In our application, we construct Q i (c) dynamically as follows. Write x l for {X (t−1) , Y } in Step 8 or for X (t) in Step 11 of Algorithm 3. First, we construct a reduced model p c l (x l ), by including only the features that depend on x l i through its class and retaining the corresponding parameters in p l (x l ; λ, ζ). Then we define the distribution which can be directly calculated without knowing the value of x l i .

Parallelization of sampling
The sampling operation can be easily parallelized in SA Algorithm 1. At each time t, both the parameters λ and log normalization constants ζ are fixed at λ (t−1) and ζ (t−1) . Instead of simulating one Markov Chain, we simulate J Markov Chains on J CPU cores separately. As a result, to generate a sample set B (t) of size K, only K/J sampling steps need to be performed on each CPU core. By parallelization, the sampling operation is completed J times faster than before.

PTB perplexity results
In this section, we evaluate the performance of LMs by perplexity (PPL). We use the Wall Street Journal (WSJ) portion of Penn Treebank (PTB). Sections 0-20 are used as the training data (about 930K words), sections 21-22 as the development data (74K) and section 23-24 as the test data (82K). The vocabulary is limited to 10K words, with one special token U N K denoting words not in the vocabulary. This setting is the same as that used in other studies (Mikolov et al., 2011). The baseline is a 4-gram LM with modified Kneser-Ney smoothing (Chen and Goodman,  1999), denoted by KN4. We use the RNNLM toolkit 5 to train a RNNLM (Mikolov et al., 2011). The number of hidden units is 250 and other configurations are set by default 6 . Word classing has been shown to be useful in conditional ME models (Chen, 2009). For our TDRF models, we consider a variety of features as shown in Table 1, mainly based on word and class information. Each word is deterministically assigned to a single class, by running the automatic clustering algorithm proposed in (Martin et al., 1998) on the training data.
In Table 1, w i , c i , i = 0, −1, . . . , −5 denote the word and its class at different position offset i, e.g. w 0 , c 0 denotes the current word and its class. We first introduce the classic word/class n-gram features (denoted by "w"/"c") and the word/class skipping n-gram features (denoted by "ws"/"cs") (Goodman, 2001a). Second, to demonstrate that long-span features can be naturally integrated in TDRFs, we introduce higher-order features "wsh"/"csh", by considering two words/classes separated with longer distance. Third, as an example of supporting heterogenous features that combine different information, the crossing features "cpw" (meaning class-predict-word) are introduced. Note that for all the feature types in Table 1, only the features observed in the training data are used.
The joint SA (Algorithm 1) is used to train the TDRF models, with all the acceleration methods described in Section 4 applied. The minibatch size K = 300. The learning rates γ λ and γ ζ are configured as (29) and (19) respectively with β λ = β ζ = 0.6 and t c = 3000. For t 0 , it is first initialized to be 10 4 . During iterations, we monitor the smoothed log-likelihood (moving average of 1000 iterations) on the PTB development data.  We set t 0 to the current iteration number once the rising percentage of the smoothed log-likelihoods within 100 iterations is below 20%, and then continue 5000 further iterations before stopping. The configuration of hessian estimation (Section 4.1) is γ H = γ λ and h = 10 −4 . L 2 regularization with constant 10 −5 is used to avoid over-fitting. 8 CPU cores are used to parallelize the algorithm, as described in Section 4.3, and the training of each TDRF model takes less than 20 hours. The perplexity results on the PTB test data are given in Table 2. As the normalization constants of TDRF models are estimated stochastically, we report the Monte Carlo mean and standard deviation from the last 1000 iterations for each PPL. The TDRF model using the basic "w+c" features performs close to the RNNLM in perplexity. To be compact, results with more features are presented in the following WSJ experiment.

WSJ speech recognition results
In this section, we continue to use the LMs obtained above (using PTB training and development data), and evaluate their performance measured by WERs in speech recognition, by rescoring 1000-best lists from WSJ'92 test data (330 sentences). The oracle WER of the 1000-best lists is 3.4%, which are generated from using the Kaldi toolkit 7 with a DNN-based acoustic model. TDRF LMs using a variety of features and different number of classes are tested. The results are shown in Table 3. Different types of features, like the skipping features, the higher-order features and the crossing features can all be easily supported in TDRF LMs, and the performance is improved to varying degrees. Particularly, the TDRF using the "w+c+ws+cs+cpw" features with class number 200 performs comparable to the RNNLM in both perplexity and WER. Numerically, the relative reduction is 9.1% compared with the KN4 LMs, and 0.5% compared with the RNN LM.

Comparison and discussion
TDRF vs WSME. For comparison, Table 3 also presents the results from our implementation of the WSME model (3), using the same features as in Table 1. This WSME model is the same as in (Rosenfeld, 1997), but different from (Rosenfeld et al., 2001), which uses the traditional n-gram LM as the priori distribution p 0 . For the WSME model (3), we can still use a SA training algorithm, similar to that developed in Section 3.2, to estimate the parameters λ. But in this case, there is no need to introduce ζ l , because the normalizing constants Z l (λ) are canceled out as seen from (7). Specifically, the learning rate γ λ and the L 2 regularization are configured the same as in TDRF training. A fixed number of iterations with t 0 = 5000 is performed. The total iteration number is 10000, which is similar to the iteration number used in TDRF training.
In order to calculate perplexity, we need to estimate the global normalizing constant Z(λ) = m l=1 Z l (λ) for the WSME model. Similarly as in (Tan, 2015), we apply the SA algorithm in Section 3.2 to estimate the log normalizing constants ζ, while fixing the parameters λ to be those already estimated from the WSME model and using uniform probabilities π l ≡ m −1 .
The resulting PPLs of these WSME models are extremely poor. The average test log-likelihoods per sentence for these two WSME models are −494 and −509 respectively. However, the W-ERs from using the trained WSME models in hypothesis re-ranking are not as poor as would be expected from their PPLs. This appears to indicate that the estimated WSME parameters are not so bad for relative ranking. Moreover, when the estimated λ and ζ are substituted into our TDRF model (6) with the empirical length probabilities π l , the "corrected" average test log-likelihoods per sentence for these two sets of parameters are improved to be −152 and −119 respectively. The average test log-likelihoods are both −96 for the two corresponding TDRF models in Table 3. This is some evidence for the model deficiency of the WSME distribution as defined in (3), and introducing the empirical length probabilities gives a more reasonable model assumption.
TDRF vs conditional ME. After training, TDRF models are computationally more efficient in computing sentence probability, simply summing up weights for the activated features in the sentence. The conditional ME models (Khudanpur and Wu, 2000;Roark et al., 2004) suffer from the expensive computation of local normalization factors. This computational bottleneck hinders their use in practice (Goodman, 2001b;Rosenfeld et al., 2001). Partly for this reason, although building conditional ME models with sophisticated features as in Table 1 is theoretically possible, such work has not been pursued so far.
TDRF vs RNN. The RNN models suffer from the expensive softmax computation in the output layer 8 . Empirically in our experiments, the average time costs for re-ranking of the 1000-best list for a sentence are 0.16 sec vs 40 sec, based on TDRF and RNN respectively (no GPU used).

Related Work
While there has been extensive research on conditional LMs, there has been little work on the whole-sentence LMs, mainly in (Rosenfeld et al., 2001;Amaya and Benedí, 2001;Ruokolainen et al., 2010). Although the whole-sentence approach has potential benefits, the empirical results of previous WSME models are not satisfactory, almost the same as traditional n-gram models. After incorporating lexical and syntactic information, a mere relative improvement of 1% and 0.4% respectively in perplexity and in WER is reported for the resulting WSEM (Rosenfeld et al., 2001). Subsequent studies of using WSEMs with grammatical features, as in (Amaya and Benedí, 2001) and (Ruokolainen et al., 2010), report perplexity improvement above 10% but no WER improvement when using WSEMs alone.
Most RF modeling has been restricted to fixeddimensional spaces 9 . Despite recent progress, fitting RFs of moderate or large dimensions remains to be challenging (Koller and Friedman, 2009;Mizrahi et al., 2013). In particular, the work of (Pietra et al., 1997) is inspiring to us, but the improved iterative scaling (IIS) method for parameter estimation and the Gibbs sampler are not suitable for even moderately sized models. Our TDRF model, together with the joint SA algorithm and trans-dimensional mixture sampling, are brand new and lead to encouraging results for language modeling.

Conclusion
In summary, we have made the following contributions, which enable us to successfully train T-DRF models and obtain encouraging performance improvement.
• The new TDRF model and the joint SA training algorithm, which simultaneously updates the model parameters and normalizing constants while using trans-dimensional mixture sampling. • Several additional innovations including accelerating SA iterations by using Hessian information, introducing word classing to accelerate the sampling operation and improve the smoothing behavior of the models, and parallelization of sampling. In this work, we mainly explore the use of features based on word and class information. Future work with other knowledge sources and largerscale experiments is needed to fully exploit the advantage of TDRFs to integrate richer features.