Interpolated Spectral NGram Language Models

Spectral models for learning weighted non-deterministic automata have nice theoretical and algorithmic properties. Despite this, it has been challenging to obtain competitive results in language modeling tasks, for two main reasons. First, in order to capture long-range dependencies of the data, the method must use statistics from long substrings, which results in very large matrices that are difficult to decompose. The second is that the loss function behind spectral learning, based on moment matching, differs from the probabilistic metrics used to evaluate language models. In this work we employ a technique for scaling up spectral learning, and use interpolated predictions that are optimized to minimize perplexity. Our experiments in character-based language modeling show that our method matches the performance of state-of-the-art ngram models, while being very fast to train.


Introduction
In recent years we have witnessed the development of spectral methods based on matrix decompositions to learn Probabilistic Non-deterministic Finite Automata (PNFA) and related models (Hsu et al., 2009, 2012; Bailly et al., 2009; Balle et al., 2011; Cohen et al., 2012). Essentially, PNFA can be regarded as recurrent neural networks where the function that predicts the dynamic state representation from previous states is linear. Despite the expressiveness of PNFA and the strong theoretical properties of spectral learning algorithms, it has been challenging to obtain competitive results on language modeling tasks. We argue, and confirm with our experiments, that there are two main reasons why using spectral methods for language modeling is challenging. The first is a scalability problem in handling long-range dependencies. The spectral method is based on computing a Hankel matrix that contains statistics of expectations over substrings generated by the target language. If we want to incorporate long-range dependencies we need to consider long substrings. As a consequence, the Hankel matrix can become too large to make algebraic decompositions practical. To address this problem we use the basis selection technique of Quattoni et al. (2017) to scale spectral learning and model long-range dependencies. Our experiments confirm that modeling long-range dependencies is essential to obtain competitive language models.
The second limitation of classical spectral methods when applied to language modeling is that the loss function that the learning algorithm minimizes is not aligned with the loss function used to evaluate model performance. Spectral methods minimize the ℓ2 distance on the prediction of expectations of substrings up to a certain length (see Balle et al. (2012) for a formulation of spectral learning in terms of loss minimization), while language models are usually evaluated using conditional perplexity. There have been some proposals on generalizing the fundamental ideas of spectral learning to other loss functions (Parikh et al., 2014). However, while these approaches are promising, they have the downside that they lead to relatively expensive iterative convex optimizations, and it is still a challenge to scale them to model long-range dependencies. In this paper we propose a simpler yet effective alternative to the iterative optimization. We use the classical spectral method based on low-rank matrix decomposition to learn a PNFA that computes substring expectations. Then we use these expectations as features in an interpolated ngram model, and we learn the weights of the interpolation so as to minimize perplexity. This interpolation step is iterative, but it is a simple and very efficient convex optimization: the weights of the interpolation can be trained in a few seconds, or minutes at most. The refinement step allows us to leverage all the moments computed by the learned PNFA and to align the spectral method with the perplexity evaluation metric. Our experiments on character-level language modeling show that: (1) modeling long-range dependencies is important; and (2) with the simple interpolation step we can obtain competitive results. Our perplexity results are significantly better than feed-forward NNs, as good as or better than sophisticated interpolation techniques such as Kneser-Ney estimation, and close to the performance of RNNs on two datasets.
The main contribution of our work consists in combining two simple ideas: incorporating long-range dependencies via basis selection of long substring moments (Section 2), and refining the predictions of the PNFA with an iterative interpolation step (Section 3). Our experiments show that these two simple ideas bring us one step closer to making spectral methods for PNFA reach state-of-the-art performance on language modeling tasks (Section 4). The advantage of these methods over other popular approaches to language modeling is their simplicity and the fact that they rely on efficient convex optimizations for training the model parameters. Furthermore, PNFA are probabilistic models for which efficient inference methods can be easily derived for computing all sorts of expectations. These expectations can then be used as features to learn predictive interpolation models. In this paper we present experiments with one type of expectation and interpolation model that illustrates the potential of this approach.

Probabilistic Non-Deterministic Finite Automata
We start by describing the general class of Weighted Automata over strings. Let x = x_1 · · · x_n be a sequence of length n over some finite alphabet Σ. We denote by Σ* the set of all finite sequences, which serves as the domain of our functions, and we use x · x' to denote the concatenation of two strings x and x'. A Non-Deterministic Weighted Automaton (WA) with k states is defined as a tuple A = ⟨α_0, α_∞, {A_σ}_{σ∈Σ}⟩, where α_0, α_∞ ∈ R^k are the initial and final weight vectors, and A_σ ∈ R^{k×k} is the transition matrix associated with each symbol σ ∈ Σ. The function f_A : Σ* → R realized by a WA A is defined as

f_A(x) = α_0^⊤ A_{x_1} · · · A_{x_n} α_∞ .

Probabilistic Non-Deterministic Finite Automata (PNFA) are WA that compute a probability distribution over strings. One can easily transform a PNFA into another automaton that computes substring expectations via simple transformations of the model parameters, and the reverse is also true; see prior work for details. In this paper we will directly learn and use automata that compute expectations. With these expectations we calculate the conditional probabilities of a language model:

P(σ | x_{1:n}) = f(x_{1:n} · σ) / Σ_{σ'∈Σ} f(x_{1:n} · σ')

Here, n is the length of the left context, analogous to the order of an ngram model, but we compute the expectations not from counts but from a PNFA.
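As a sketch of how a WA realizes f_A, the following toy example applies the linear state updates and the final projection. The parameters and function names here are ours, chosen purely for illustration:

```python
import numpy as np

# Minimal sketch of a weighted automaton (WA) and the function f_A it realizes.
# The names mirror the tuple <alpha_0, alpha_inf, {A_sigma}> from the text;
# the 2-state parameters below are toy values, not learned from data.

def wa_score(x, alpha0, alpha_inf, A):
    """Compute f_A(x) = alpha0^T A_{x_1} ... A_{x_n} alpha_inf."""
    state = alpha0
    for sigma in x:
        state = state @ A[sigma]   # linear state update, one matrix per symbol
    return float(state @ alpha_inf)

# Toy 2-state automaton over the alphabet {'a', 'b'}
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.5, 0.5])
A = {'a': np.array([[0.3, 0.2], [0.0, 0.4]]),
     'b': np.array([[0.1, 0.4], [0.3, 0.1]])}

print(wa_score("ab", alpha0, alpha_inf, A))
```

The state is a k-dimensional vector updated by a matrix product per symbol, which is exactly the "RNN with linear transitions" view mentioned above.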

The Spectral Method
We now give a brief description of the spectral method for estimating a PNFA that computes expectations over substrings. We only provide a high-level description of the method; for a complete derivation and the theory justifying the algorithm we refer the reader to the work by Hsu et al. (2009), among others. Assume a distribution over strings on some discrete alphabet; our target function f(x) is the expected number of times that x appears as a substring of a string sampled from the distribution. At training time, we are given strings T from the distribution and we want to estimate f. We denote by f_T(x) the empirical substring expectation of x in T. Using f_T, the spectral method estimates a WA A with k states, where k is a parameter of the algorithm, such that f_A is a good approximation of f. The method reduces the learning problem to computing an SVD of a special type of matrix called the Hankel matrix, which collects the observed expectations f_T. The method is described by the following steps: (1) Select a set of prefixes P and suffixes S that will serve as indices of the rows and columns, respectively, of the Hankel matrix. A typical choice is to select all substrings up to a certain size n, but this set quickly grows, and in practice prior work uses a small n. Instead we use the basis selection technique presented by Quattoni et al. (2017), which allows us to capture long-range dependencies (analogous to having a large n) while keeping the number of prefixes and suffixes manageable.
(2) Compute the Hankel matrix H ∈ R^{P×S}, with entries H[p, s] = f_T(p · s), together with the vectors h_P ∈ R^P, with h_P[p] = f_T(p), and h_S ∈ R^S, with h_S[s] = f_T(s).
(3) Compute a rank-k factorization of H. Compute the truncated SVD of H, i.e. H ≈ UΣV^⊤, resulting in a matrix F = UΣ ∈ R^{P×k} and a matrix B = V^⊤ ∈ R^{k×S}. Thus H ≈ FB is a rank-k factorization of H.
(4) Recover the WA A with k states. Let M^+ denote the Moore-Penrose pseudo-inverse of a matrix M. The elements of A are recovered as follows. Initial vector: α_0^⊤ = h_S^⊤ B^+. Final vector: α_∞ = F^+ h_P. Transition matrices: A_σ = F^+ H_σ B^+, where H_σ is the Hankel matrix with entries H_σ[p, s] = f_T(p · σ · s).
The computation is dominated by step (3), the SVD of the Hankel matrix, which is at most cubic in the size of the matrix. In practice, this method is scalable and fast to train.
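Under the simplifying assumptions that the basis is complete and the Hankel block has full rank k, the steps above can be sketched end to end as follows. The corpus, basis, and all names are ours and purely illustrative; on this toy example the recovery is exact:

```python
import numpy as np
from collections import Counter

# Empirical substring expectations f_T from a toy corpus. We count the empty
# string once per position (len + 1 occurrences) so that f_T("") is defined
# for the Hankel row/column indexed by the empty string.
corpus = ["ab"]
counts = Counter()
for s in corpus:
    counts[""] += len(s) + 1
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            counts[s[i:j]] += 1
f_T = lambda x: counts.get(x, 0) / len(corpus)

# (1) A tiny basis of prefixes and suffixes (both include the empty string).
P, S = ["", "a", "b"], ["", "a", "b"]
alphabet, k = "ab", 3

# (2) Hankel matrix and prefix/suffix expectation vectors.
H = np.array([[f_T(p + s) for s in S] for p in P])
h_P = np.array([f_T(p) for p in P])
h_S = np.array([f_T(s) for s in S])

# (3) Rank-k factorization H ~ F B via truncated SVD.
U, Sv, Vt = np.linalg.svd(H, full_matrices=False)
F = U[:, :k] * Sv[:k]            # F = U Sigma, shape |P| x k
B = Vt[:k, :]                    # B = V^T,     shape k x |S|
Fp, Bp = np.linalg.pinv(F), np.linalg.pinv(B)

# (4) Recover the WA: alpha0^T = h_S^T B^+, alpha_inf = F^+ h_P, and
#     A_sigma = F^+ H_sigma B^+ with H_sigma[p, s] = f_T(p + sigma + s).
alpha0 = h_S @ Bp
alpha_inf = Fp @ h_P
A = {sig: Fp @ np.array([[f_T(p + sig + s) for s in S] for p in P]) @ Bp
     for sig in alphabet}

def f_A(x):
    v = alpha0
    for sig in x:
        v = v @ A[sig]
    return float(v @ alpha_inf)

print(f_A("ab"))   # ~= f_T("ab") = 1.0 on this toy corpus
```

In practice P and S come from basis selection rather than full enumeration, and k is truncated well below the matrix rank; the recovered model then approximates rather than reproduces f_T.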

Interpolated Predictions
One limitation of the spectral method is that the loss it minimizes is not aligned with the probabilistic metrics used in language modeling, such as perplexity. Instead, the spectral method minimizes the ℓ2 loss over the observed empirical moments, i.e. those substrings collected in the Hankel matrix. To align the loss function with a perplexity measure we propose a simple refinement step, where we use the expected counts computed by the learned PNFA as features of a log-linear model, and learn interpolation weights. In contrast to Equation 2, which uses the longest context of length n to compute the conditional probability, the interpolated model leverages the ability of the PNFA to model substring expectations of all lengths up to n. This is similar to classic interpolation of language models (Rosenfeld, 1994; Chen, 2009).
Given a function f computing substring expectations, the interpolated model combines the conditional predictions of all context lengths in a log-linear fashion:

P(σ | x_{1:n}) ∝ exp( Σ_{j=0}^{n-1} w_{σ,j} log P_j(σ | x_{n-j+1:n}) ),

where P_j(σ | c) = f(c · σ) / Σ_{σ'} f(c · σ') is the conditional prediction using a context c of length j, x_{1:n} is a context of size n, σ is the output symbol, and w_{σ,j} are the interpolation weights, with one parameter per output symbol σ and context length j, with 0 ≤ j < n.
As is standard with interpolation models, we train the weights by maximizing the conditional log-likelihood of the development set. We assume that f is fixed, which results in a convex optimization, and we solve it with L-BFGS.
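A hedged sketch of this training step follows: a log-linear model over per-order log-conditionals, fit by maximizing conditional log-likelihood with L-BFGS. The exact feature parameterization and all names are our assumptions; only the overall recipe (fixed expectations, convex weight optimization) comes from the text, and the per-order conditionals are random stand-ins for those a PNFA would produce:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
V, n, T = 4, 3, 50          # alphabet size, max context length, dev-set events

# logP[t, j, s]: log conditional probability of symbol s at event t using a
# context of length j (random placeholders for the PNFA-derived conditionals).
logits = rng.normal(size=(T, n, V))
logP = logits - np.log(np.exp(logits).sum(axis=2, keepdims=True))
gold = rng.integers(0, V, size=T)       # observed next symbols on the dev set

def neg_log_likelihood(w_flat):
    w = w_flat.reshape(V, n)                   # one weight per (symbol, order)
    scores = np.einsum('tjs,sj->ts', logP, w)  # log-linear combination
    logZ = np.log(np.exp(scores).sum(axis=1))  # per-event normalizer
    return -(scores[np.arange(T), gold] - logZ).sum()

res = minimize(neg_log_likelihood, np.ones(V * n), method='L-BFGS-B')
```

Because the scores are linear in the weights and the normalizer is a log-sum-exp of linear functions, the objective is convex, so L-BFGS reaches the global optimum.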

Experiments
We present experiments in character-based language modeling. Our spectral ngram models work with a fixed context length, and we show results varying this length up to relatively large values.
Following standard practice, the goal is to learn a language model that predicts the next symbol given a sentence prefix, including the prediction of sentence ends. As datasets we use the Penn Treebank (PTB) prepared by Mikolov et al. (2012), and the "War and Peace" (WP) dataset prepared by Karpathy et al. (2016). We use two probabilistic evaluation metrics that are standard in language modeling tasks: Cross Entropy and Bits per Character (BpC). Depending on the dataset, we use one or the other so that we can directly compare to published results.
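For reference, both metrics are averages of the negative log-probability the model assigns to the correct next character; BpC simply uses base-2 logarithms. A minimal sketch (function name and inputs are ours):

```python
import math

def bits_per_character(probs):
    """Average -log2 p over the model's probabilities for the correct
    next character; using math.log instead gives cross entropy in nats."""
    return sum(-math.log2(p) for p in probs) / len(probs)

print(bits_per_character([0.5, 0.25, 0.5]))  # average of 1, 2, and 1 bits
```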
Tables 1 and 2 present results in terms of the context size (n) for the PTB and WP tests respectively. The column "UB" shows an upper bound on the performance metric using a context of size n. It is computed by using the expected counts on the test set itself to compute the conditional distribution; if we were able to estimate these expectations perfectly, we would achieve the reported performance. As the two tables show, a context of size 10 already gives a high upper bound, suggesting that we can achieve good performance using a fixed but large horizon.
The tables show results of the spectral language model for different context sizes, using expectations from the "longest" context or "interpolated" expectations. A clear trend is that the results improve with the context length, reaching stable performance at n = 10. It is also clear that the interpolated predictions work much better than simply using the longest context. Table 2 also compares to a MaxEnt model (labeled "ME"), which is the interpolation model of Eq. 3 but using empirical expectations f_T(x) computed from training counts instead of those given by the spectral PNFA. Clearly, the expectations given by the PNFA generalize better and lead to improvements.
The last column of the two tables shows the number of rows (and columns) of the (square) Hankel matrix we factorize for each context size. This gives an idea of the cost of the estimation algorithm, which goes from a few seconds to a few hours, depending on the matrix size. Following the theory behind Quattoni et al. (2017), this number is an upper bound on the size of the minimal PNFA that reproduces exactly the expected counts of training substrings.
The tables include a column "KN" with the results of an ngram language model estimated with Kneser-Ney interpolation (Kneser and Ney, 1995; Chen and Goodman, 1999). Looking at the results on the PTB data in Table 1, our interpolated model performs as well as, and sometimes better than, the KN models using the same context length. Mikolov et al. (2012) report the performance of other models: a feed-forward neural network obtains 1.57, which our model improves upon with contexts of n = 6 or larger; an RNN achieves 1.41, slightly better than our best result of 1.45. Their best result is 1.37, for a MaxEnt model with a context length of n = 14 engineered for scalability.
For the WP test in Table 2, our model and the KN model perform similarly, with slight improvements by the KN model. The table also includes the results of a feed-forward neural network (FNN) for increasing orders, by Karpathy et al. (2016). Our interpolated model works better, with our best result at 1.24. They also report the results of an RNN, which obtains 1.24, and of LSTM and GRU models, which both obtain 1.08.

Conclusions
In this paper we presented experiments with character-based spectral ngram language models. We combine two key ideas: a) modeling of long-range dependencies via the basis selection of long substring moments by Quattoni et al. (2017); and b) efficient optimization of arbitrary prediction losses (e.g. cross-entropy) via a loss refinement step. With these two ideas, we can improve the performance of spectral learning for PNFA, and bring the results of spectral models closer to the state of the art.
The ability of the spectral method for PNFA to estimate substring expectations can be exploited in other contexts. For example, we are interested in word-level language models that make use of character-level PNFA to compute expectations, which is useful to make predictions on words and substrings which do not appear in training.
It is also interesting to consider a PNFA as a special case of an RNN with linear transitions. Given that we obtain results similar to feed-forward NNs and some RNNs, this suggests that some forms of non-linearities can be approximated by linear models, with the advantage that some computations (mainly, expectations) can be done exactly.