Normalized Log-Linear Interpolation of Backoff Language Models is Efficient

We prove that log-linearly interpolated backoff language models can be efficiently and exactly collapsed into a single normalized backoff model, contradicting Hsu (2007). While prior work reported that log-linear interpolation yields lower perplexity than linear interpolation, normalizing at query time was impractical. We normalize the model offline in advance, which is efficient due to a recurrence relationship between the normalizing factors. To tune interpolation weights, we apply Newton's method to this convex problem and show that the derivatives can be computed efficiently in a batch process. These findings are combined in a new open-source interpolation tool, which is distributed with KenLM. With 21 out-of-domain corpora, log-linear interpolation yields 72.58 perplexity on TED talks, compared to 75.91 for linear interpolation.


Introduction
Log-linearly interpolated backoff language models yielded better perplexity than linearly interpolated models (Klakow, 1998; Gutkin, 2000), but experiments and adoption were limited due to the impractically high cost of querying. This cost comes from normalizing to form a probability distribution, which requires a brute-force sum over the entire vocabulary for each query. Instead, we prove that the log-linearly interpolated model can be normalized offline in advance and exactly expressed as an ordinary backoff language model. This contradicts Hsu (2007), who claimed that log-linearly interpolated models "cannot be efficiently represented as a backoff n-gram model." We show that offline normalization is efficient due to a recurrence relationship between the normalizing factors (Whittaker and Klakow, 2002). This forms the basis for our open-source implementation, which is part of KenLM: https://kheafield.com/code/kenlm/.
Linear interpolation (Jelinek and Mercer, 1980) combines several language models $p_i$ into a single model $p_L$
$$p_L(w_n \mid w_1^{n-1}) = \sum_i \lambda_i \, p_i(w_n \mid w_1^{n-1})$$
where $\lambda_i$ are weights and $w_1^n$ are words. Because each component model $p_i$ is a probability distribution and the non-negative weights $\lambda_i$ sum to 1, the interpolated model $p_L$ is also a probability distribution. This presumes that the models have the same vocabulary, an issue we discuss in §3.1. A log-linearly interpolated model $p_{LL}$ uses the weights $\lambda_i$ as powers (Klakow, 1998).
$$p_{LL}(w_n \mid w_1^{n-1}) \propto \prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}$$
The weights $\lambda_i$ are unconstrained real numbers, allowing parameters to soften or sharpen distributions. Negative weights can be used to divide a mixed-domain model by an out-of-domain model. To form a probability distribution, the product is normalized
$$p_{LL}(w_n \mid w_1^{n-1}) = \frac{\prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}}{Z(w_1^{n-1})}$$
where normalizing factor $Z$ is given by
$$Z(w_1^{n-1}) = \sum_x \prod_i p_i(x \mid w_1^{n-1})^{\lambda_i}$$
The sum is taken over all words $x$ in the combined vocabulary of the underlying models, which can number in the millions or even billions. Computing $Z$ efficiently is a key contribution in this work. Our proofs assume the component models $p_i$ are backoff language models (Katz, 1987) that memorize probability for seen n-grams and charge a backoff penalty $b_i$ for unseen n-grams.
$$p_i(w_n \mid w_1^{n-1}) = \begin{cases} p_i(w_n \mid w_1^{n-1}) & \text{if } w_1^n \text{ is seen} \\ p_i(w_n \mid w_2^{n-1}) \, b_i(w_1^{n-1}) & \text{otherwise} \end{cases}$$
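As a minimal sketch of the definitions above, the following Python queries a toy backoff model and performs brute-force online log-linear interpolation by summing over the whole vocabulary. The dictionary layout, the function names, and the two toy models are assumptions made for illustration; they are not KenLM's data structures.

```python
import math

def backoff_prob(model, context, word):
    """Return p(word | context) from a toy backoff model (tuples of words as keys)."""
    ngram = context + (word,)
    if ngram in model["prob"]:
        return model["prob"][ngram]
    if not context:
        raise KeyError(word)  # toy models must list every vocabulary word as a unigram
    # Unseen n-gram: charge the context's backoff (implicitly 1.0) and shorten the context.
    return model["backoff"].get(context, 1.0) * backoff_prob(model, context[1:], word)

def loglinear_online(models, weights, context, word, vocab):
    """Normalized log-linear interpolation, normalizing by brute force at query time."""
    def unnormalized(x):
        return math.prod(backoff_prob(m, context, x) ** lam
                         for m, lam in zip(models, weights))
    z = sum(unnormalized(x) for x in vocab)  # Z(context): one term per vocabulary word
    return unnormalized(word) / z

# Hypothetical toy models over a three-word vocabulary; each is normalized.
VOCAB = ("A", "B", "C")
MODEL_1 = {"prob": {("A",): 0.5, ("B",): 0.3, ("C",): 0.2, ("A", "C"): 0.4},
           "backoff": {("A",): 0.75}}
MODEL_2 = {"prob": {("A",): 0.2, ("B",): 0.2, ("C",): 0.6, ("B", "A"): 0.5},
           "backoff": {("B",): 0.625}}
```

Every query costs one pass over the vocabulary per component model, which is the cost that offline normalization is meant to remove. The weights need not be positive; for example, `loglinear_online([MODEL_1, MODEL_2], [0.7, -0.3], ("A",), "B", VOCAB)` is a valid query.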
While linearly or log-linearly interpolated models can be queried online by querying the component models (Stolcke, 2002; Federico et al., 2008), doing so costs RAM to store duplicated n-grams and CPU time to perform lookups. Log-linear interpolation is particularly slow due to normalizing over the entire vocabulary. Instead, it is preferable to combine the models offline into a single backoff model containing the union of n-grams. Doing so is impossible for linear interpolation (§3.2); SRILM (Stolcke, 2002) and MITLM (Hsu and Glass, 2008) implement an approximation. In contrast, we prove that offline log-linear interpolation requires no such approximation.

Related Work
Instead of building separate models then weighting, Zhang and Chiang (2014) show how to train Kneser-Ney models (Kneser and Ney, 1995) on weighted data. Their work relied on prescriptive weights from domain adaptation techniques rather than tuning weights, as we do here.
Our exact normalization approach relies on the backoff structure of component models. Several approximations support general models: ignoring normalization, noise-contrastive estimation (Vaswani et al., 2013), and self-normalization (Andreas and Klein, 2015). In future work, we plan to exploit the structure of other features in high-quality unnormalized log-linear language models (Sethy et al., 2014).
Ignoring normalization is particularly common in speech recognition and machine translation. This is one of our baselines. Unnormalized models can also be compiled into a single model by multiplying the weighted probabilities and backoffs. Many use unnormalized models because weights can be jointly tuned along with other feature weights. However, Haddow (2013) showed that linear interpolation weights can be jointly tuned by pairwise ranked optimization (Hopkins and May, 2011). In theory, normalized log-linear interpolation weights can be jointly tuned in the same way.
Dynamic interpolation weights (Weintraub et al., 1996) give more weight to models familiar with a given query. Typically the weights are a function of the contexts that appear in the combined language model, which is compatible with our approach. However, normalizing factors would need to be calculated in each context.

Linear Interpolation
To motivate log-linear interpolation, we examine two issues with linear interpolation: normalization when component models have different vocabularies and offline interpolation.

Vocabulary Differences
Language models are normalized with respect to their vocabulary, including the unknown word.
If two models have different vocabularies, then the combined vocabulary is larger and the sum is taken over more words. Component models assign their unknown word probability to these new words, leading to an interpolated model that sums to more than one. An example is shown in Table 1.
Table 1: Linearly interpolating two models $p_1$ and $p_2$ with equal weight yields an unnormalized model $p_L$. If gaps are filled with zeros instead, the model is normalized.
To work around this problem, SRILM (Stolcke, 2002) uses zero probability instead of the unknown word probability for new words. This produces a model that sums to one, but differs from what users might expect.
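The effect can be reproduced with a small Python sketch. The two unigram models and their probabilities below are hypothetical values chosen for illustration (they are not the entries of Table 1), and `interpolate` is a helper name introduced here.

```python
def interpolate(p1, p2, vocab, fill):
    """Equal-weight linear interpolation; fill(p) supplies a value for words missing from p."""
    def lookup(p, word):
        return p[word] if word in p else fill(p)
    return {word: 0.5 * lookup(p1, word) + 0.5 * lookup(p2, word) for word in vocab}

# Hypothetical unigram models with different vocabularies.
p1 = {"A": 0.6, "<unk>": 0.4}
p2 = {"B": 0.7, "<unk>": 0.3}
vocab = ["A", "B", "<unk>"]

# Filling gaps with each model's unknown-word probability over-counts mass.
with_unk = interpolate(p1, p2, vocab, fill=lambda p: p["<unk>"])
# Filling gaps with zero, as described for SRILM, restores normalization.
with_zero = interpolate(p1, p2, vocab, fill=lambda p: 0.0)

total_with_unk = sum(with_unk.values())    # 1.35 for these toy values
total_with_zero = sum(with_zero.values())  # 1.0 for these toy values
```

With zero filling, the word B receives no contribution from `p1` at all, which is the sense in which the result differs from what users might expect.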
IRSTLM (Federico et al., 2008) asks the user to specify a common large vocabulary size. The unknown word probability is downweighted so that all models sum to one over the large vocabulary.
A component model can also be renormalized with respect to a larger vocabulary. For unigrams, the extra mass is the number of new words times the unknown word probability. For longer contexts, if we assume the typical case where the unknown word appears only as a unigram, then queries for new words will back off to unigrams. The total mass in context $w_1^{n-1}$ is
$$1 + |\text{new}| \; p(\langle\text{unk}\rangle) \prod_{i=1}^{n-1} b(w_i^{n-1})$$
where new is the set of new words. This is efficient to compute online or offline. While there are tools to renormalize models, we are not aware of a tool that does this for linear interpolation.
Log-linear interpolation is normalized by construction. Nonetheless, in our experiments we extend IRSTLM's approach by training models with a common vocabulary size, rather than retrofitting it at query time.

Offline Linear Interpolation
Given an interpolated model, offline interpolation seeks a combined model meeting three criteria: (i) encoding the same probability distribution, (ii) being a backoff model, and (iii) containing the union of n-grams from component models.
Theorem 1. The three offline criteria cannot be satisfied for general linearly interpolated backoff models.
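Proof. By counterexample. Consider the models given in Table 2 interpolated with equal weight. The probabilities shown for $p_L$ result from encoding the same distribution. Taking the union of n-grams implies that $p_L$ only has entries for A, B, C, and A C. Since the models have the same vocabulary, they are all normalized to one.
Since all models have backoff structure, a model containing A C satisfies
$$p(A \mid A) + p(B \mid A) + p(C \mid A) = b(A)\,p(A) + b(A)\,p(B) + p(C \mid A) = 1$$
which when solved for backoff $b(A)$ gives the values shown in Table 2. We then query $p_L(B \mid A)$ online and offline. Online interpolation yields
$$p_L(B \mid A) = \tfrac{1}{2}\, p_1(B \mid A) + \tfrac{1}{2}\, p_2(B \mid A) = \tfrac{1}{2}\, b_1(A)\, p_1(B) + \tfrac{1}{2}\, b_2(A)\, p_2(B)$$
Offline interpolation yields
$$p_L(B \mid A) = b_L(A)\, p_L(B)$$
With the values in Table 2, these two results differ, so the offline model does not encode the same distribution.
The same problem happens with real language models. To understand why, we attempt to solve for the backoff $b_L(w_1^{n-1})$. Supposing $w_1^n$ is not in either model, we query
$$p_L(w_n \mid w_1^{n-1}) = \lambda_1\, p_1(w_n \mid w_2^{n-1})\, b_1(w_1^{n-1}) + \lambda_2\, p_2(w_n \mid w_2^{n-1})\, b_2(w_1^{n-1})$$
An offline backoff model would instead return $p_L(w_n \mid w_2^{n-1})\, b_L(w_1^{n-1})$. Equating the two and solving,
$$b_L(w_1^{n-1}) = \frac{\lambda_1\, p_1(w_n \mid w_2^{n-1})\, b_1(w_1^{n-1}) + \lambda_2\, p_2(w_n \mid w_2^{n-1})\, b_2(w_1^{n-1})}{\lambda_1\, p_1(w_n \mid w_2^{n-1}) + \lambda_2\, p_2(w_n \mid w_2^{n-1})}$$
which is a weighted average of the backoff weights $b_1(w_1^{n-1})$ and $b_2(w_1^{n-1})$. The weights depend on $w_n$, so $b_L$ is no longer a function of $w_1^{n-1}$.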
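The dependence on the queried word can be checked numerically. The sketch below computes the backoff that an offline linear model would need in context A for two different words, each unseen after A in both components. All probabilities and backoffs are hypothetical toy values (not those of Table 2), and `implied_backoff` is a name introduced here.

```python
def implied_backoff(word, lam1=0.5, lam2=0.5):
    """Backoff b_L(A) that would make offline linear interpolation exact for this word."""
    # Hypothetical unigram probabilities and context backoffs b_i(A).
    p1 = {"A": 0.5, "B": 0.3}
    b1 = 0.75
    p2 = {"A": 0.2, "B": 0.2}
    b2 = 1.0  # model 2 has no n-gram starting with A, so its backoff is implicitly 1
    online = lam1 * p1[word] * b1 + lam2 * p2[word] * b2  # exact online probability
    lower = lam1 * p1[word] + lam2 * p2[word]             # interpolated lower-order probability
    return online / lower

needed_for_a = implied_backoff("A")  # about 0.821
needed_for_b = implied_backoff("B")  # 0.85
```

Because the two values differ, no single stored backoff for context A reproduces the online model for both words.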
In the SRILM approximation (Stolcke, 2002), probabilities for n-grams that exist in the model are computed exactly. The backoff weights are chosen to produce a model that sums to one. However, newer versions of SRILM (Stolcke et al., 2011) interpolate by ingesting one component model at a time. For example, the first two models are approximately interpolated before adding a third model. An n-gram appearing only in the third model will have an approximate probability. Therefore, the output depends on the order in which users specify models. Moreover, weights were optimized for correct linear interpolation, not the approximation. Stolcke (2002) finds that the approximation actually decreases perplexity, which we also see in the experiments (§6). However, approximation only happens when the model backs off, which is less likely to happen in fluent sentences used for perplexity scoring.

Offline Log-Linear Interpolation
Log-linearly interpolated backoff models $p_i$ can be collapsed into a single offline model $p_{LL}$. The combined model takes the union of n-grams in component models. For those n-grams, it memorizes the correct probability $p_{LL}$.
$$p_{LL}(w_n \mid w_1^{n-1}) = \frac{\prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}}{Z(w_1^{n-1})} \quad \text{if } w_1^n \text{ is seen}$$
When $w_1^n$ does not appear, the backoff $b_{LL}(w_1^{n-1})$ modifies $p_{LL}(w_n \mid w_2^{n-1})$ to make an appropriately normalized probability. To do so, it cancels out the shorter query's normalization term $Z(w_2^{n-1})$ then applies the correct term $Z(w_1^{n-1})$. It also applies the component backoff terms.
$$b_{LL}(w_1^{n-1}) = \frac{Z(w_2^{n-1})}{Z(w_1^{n-1})} \prod_i b_i(w_1^{n-1})^{\lambda_i}$$
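Almost by construction, the model satisfies two of our criteria (§3.2): being a backoff model and containing the union of n-grams. However, backoff models require that the backoff weight of an unseen n-gram be implicitly 1.
Lemma 1. If $w_1^{n-1}$ is unseen in the combined model, then the backoff weight $b_{LL}(w_1^{n-1}) = 1$.
Proof. Because we have taken the union of entries, $w_1^{n-1}$ is unseen in component models. These components are backoff models, so implicitly $b_i(w_1^{n-1}) = 1 \; \forall i$. Focusing on the normalization term $Z(w_1^{n-1})$,
$$Z(w_1^{n-1}) = \sum_x \prod_i p_i(x \mid w_1^{n-1})^{\lambda_i} = \sum_x \prod_i p_i(x \mid w_2^{n-1})^{\lambda_i} \, b_i(w_1^{n-1})^{\lambda_i} = \sum_x \prod_i p_i(x \mid w_2^{n-1})^{\lambda_i} = Z(w_2^{n-1})$$
All of the models back off because $w_1^{n-1}x$ is unseen, being a superstring of $w_1^{n-1}$. Substituting into the definition of the backoff,
$$b_{LL}(w_1^{n-1}) = \frac{Z(w_2^{n-1})}{Z(w_1^{n-1})} \prod_i b_i(w_1^{n-1})^{\lambda_i} = 1$$
We now have a backoff model containing the union of n-grams. It remains to show that the offline model produces correct probabilities.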
Theorem 2. The proposed offline model agrees with online log-linear interpolation.
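Proof. By induction on the number of words backed off in offline interpolation. To disambiguate, we will use $p_{on}$ to refer to online interpolation and $p_{off}$ to refer to offline interpolation. Base case: the queried n-gram is in the offline model and we have memorized the online probability by construction. Inductive case: Let $p_{off}(w_n \mid w_1^{n-1})$ be a query that backs off. In online interpolation,
$$p_{on}(w_n \mid w_1^{n-1}) = \frac{\prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}}{Z(w_1^{n-1})}$$
Because $w_1^n$ is unseen in the offline model and we took the union, it is unseen in every model $p_i$.
$$p_{on}(w_n \mid w_1^{n-1}) = \frac{\prod_i p_i(w_n \mid w_2^{n-1})^{\lambda_i} \, b_i(w_1^{n-1})^{\lambda_i}}{Z(w_1^{n-1})}$$
Recognizing the unnormalized probability $Z(w_2^{n-1}) \, p_{on}(w_n \mid w_2^{n-1})$,
$$p_{on}(w_n \mid w_1^{n-1}) = \frac{Z(w_2^{n-1}) \, p_{on}(w_n \mid w_2^{n-1}) \prod_i b_i(w_1^{n-1})^{\lambda_i}}{Z(w_1^{n-1})} = p_{on}(w_n \mid w_2^{n-1}) \, b_{off}(w_1^{n-1})$$
The last equality follows from the definition of $b_{off}$ and Lemma 1, which extended the domain of $b_{off}$ to any $w_1^{n-1}$. By the inductive hypothesis, $p_{on}(w_n \mid w_2^{n-1}) = p_{off}(w_n \mid w_2^{n-1})$ because it backs off one less time.
$$p_{on}(w_n \mid w_2^{n-1}) \, b_{off}(w_1^{n-1}) = p_{off}(w_n \mid w_2^{n-1}) \, b_{off}(w_1^{n-1}) = p_{off}(w_n \mid w_1^{n-1})$$
The offline model $p_{off}(w_n \mid w_1^{n-1})$ backs off because that is the case we are considering. Combining our chain of equalities, $p_{on}(w_n \mid w_1^{n-1}) = p_{off}(w_n \mid w_1^{n-1})$. By induction, the claim holds for all $w_1^n$.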
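The agreement between offline and online interpolation can be checked on toy models. The sketch below builds the collapsed model by brute force, storing the normalized probability for each n-gram in the union and the backoff $b_{LL}(c) = Z(c')/Z(c) \prod_i b_i(c)^{\lambda_i}$, where $c'$ drops the first word of $c$. The model layout, helper names, and numeric values are assumptions for illustration; a real implementation would not recompute $Z$ by brute force for every entry.

```python
import math

def backoff_prob(model, context, word):
    """Return p(word | context) from a toy backoff model (tuples of words as keys)."""
    ngram = context + (word,)
    if ngram in model["prob"]:
        return model["prob"][ngram]
    if not context:
        raise KeyError(word)
    return model["backoff"].get(context, 1.0) * backoff_prob(model, context[1:], word)

def unnormalized(models, weights, context, word):
    return math.prod(backoff_prob(m, context, word) ** lam for m, lam in zip(models, weights))

def z(models, weights, context, vocab):
    return sum(unnormalized(models, weights, context, x) for x in vocab)

def online(models, weights, context, word, vocab):
    return unnormalized(models, weights, context, word) / z(models, weights, context, vocab)

def collapse(models, weights, vocab):
    """Build a single backoff model over the union of n-grams (brute-force Z)."""
    ngrams = set().union(*(m["prob"] for m in models))
    prob = {g: online(models, weights, g[:-1], g[-1], vocab) for g in ngrams}
    backoff = {}
    for c in ngrams:  # every stored n-gram may later serve as a context
        component = math.prod(m["backoff"].get(c, 1.0) ** lam for m, lam in zip(models, weights))
        backoff[c] = z(models, weights, c[1:], vocab) / z(models, weights, c, vocab) * component
    return {"prob": prob, "backoff": backoff}

# Hypothetical toy models over a three-word vocabulary.
VOCAB = ("A", "B", "C")
MODEL_1 = {"prob": {("A",): 0.5, ("B",): 0.3, ("C",): 0.2, ("A", "C"): 0.4},
           "backoff": {("A",): 0.75}}
MODEL_2 = {"prob": {("A",): 0.2, ("B",): 0.2, ("C",): 0.6, ("B", "A"): 0.5},
           "backoff": {("B",): 0.625}}
WEIGHTS = [0.7, -0.3]
OFFLINE = collapse([MODEL_1, MODEL_2], WEIGHTS, VOCAB)
```

Querying `OFFLINE` with the ordinary `backoff_prob` routine should then match `online` for any context, including contexts such as `("B", "A")` that appear in neither component and therefore rely on the implicit backoff of 1.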

Normalizing Efficiently
In order to build the offline model, the normalization factor $Z$ needs to be computed in every seen context. To do so, we extend the tree-structure method of Whittaker and Klakow (2002), which they used to compute and cache normalization factors on the fly. It exploits the sparsity of language models: when summing over the vocabulary, most queries will back off. Formally, we define $s(w_1^n)$ to be the set of words $x$ where $p_i(x \mid w_1^n)$ does not back off in some model.
$$s(w_1^n) = \{x : w_1^n x \text{ is seen in any model}\}$$
To exploit this, we use the normalizing factor $Z(w_2^n)$ from a lower order and patch it up by summing over $s(w_1^n)$.
Theorem 3. The normalization factors $Z$ obey a recurrence relationship:
$$Z(w_1^n) = \sum_{x \in s(w_1^n)} \prod_i p_i(x \mid w_1^n)^{\lambda_i} + \left( Z(w_2^n) - \sum_{x \in s(w_1^n)} \prod_i p_i(x \mid w_2^n)^{\lambda_i} \right) \prod_i b_i(w_1^n)^{\lambda_i}$$
Proof. The first term handles seen n-grams while the second term handles unseen n-grams. The definition of $Z$ can be partitioned by cases.
$$Z(w_1^n) = \sum_{x \in s(w_1^n)} \prod_i p_i(x \mid w_1^n)^{\lambda_i} + \sum_{x \notin s(w_1^n)} \prod_i p_i(x \mid w_1^n)^{\lambda_i}$$
The first term agrees with the claim, so we focus on the case where $x \notin s(w_1^n)$. By definition of $s$, all models back off.
$$\sum_{x \notin s(w_1^n)} \prod_i p_i(x \mid w_1^n)^{\lambda_i} = \sum_{x \notin s(w_1^n)} \prod_i p_i(x \mid w_2^n)^{\lambda_i} \, b_i(w_1^n)^{\lambda_i} = \left( Z(w_2^n) - \sum_{x \in s(w_1^n)} \prod_i p_i(x \mid w_2^n)^{\lambda_i} \right) \prod_i b_i(w_1^n)^{\lambda_i}$$
This is the second term of the claim.
Figure 1: Multi-stage streaming pipeline for offline log-linear interpolation. Bold arrows indicate sorting is performed.
The recurrence structure of the normalization factors suggests a computational strategy: compute $Z(\epsilon)$ by summing over the unigrams, $Z(w_n)$ by summing over bigrams $w_n x$, $Z(w_{n-1}^n)$ by summing over trigrams $w_{n-1}^n x$, and so on.
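A small sketch of this strategy follows: the normalizer for a context is obtained from the normalizer of the shortened context by summing only over seen continuations, and the result is compared against a brute-force sum over the vocabulary. The toy models, the dictionary layout, and the function names are assumptions for illustration; the recursion here recomputes lower orders on each call, whereas a real implementation would cache or stream them.

```python
import math

def backoff_prob(model, context, word):
    """Return p(word | context) from a toy backoff model (tuples of words as keys)."""
    ngram = context + (word,)
    if ngram in model["prob"]:
        return model["prob"][ngram]
    if not context:
        raise KeyError(word)
    return model["backoff"].get(context, 1.0) * backoff_prob(model, context[1:], word)

def unnormalized(models, weights, context, word):
    return math.prod(backoff_prob(m, context, word) ** lam for m, lam in zip(models, weights))

def z_brute(models, weights, context, vocab):
    """Definition of Z: a sum over the entire vocabulary."""
    return sum(unnormalized(models, weights, context, x) for x in vocab)

def z_recurrence(models, weights, context, vocab):
    """Z via the recurrence: patch the lower-order Z using only seen continuations."""
    if not context:
        return sum(unnormalized(models, weights, (), x) for x in vocab)  # unigrams
    seen = [x for x in vocab if any(context + (x,) in m["prob"] for m in models)]
    z_lower = z_recurrence(models, weights, context[1:], vocab)
    backoffs = math.prod(m["backoff"].get(context, 1.0) ** lam
                         for m, lam in zip(models, weights))
    seen_sum = sum(unnormalized(models, weights, context, x) for x in seen)
    lower_seen_sum = sum(unnormalized(models, weights, context[1:], x) for x in seen)
    return seen_sum + (z_lower - lower_seen_sum) * backoffs

# Hypothetical toy models over a three-word vocabulary.
VOCAB = ("A", "B", "C")
MODEL_1 = {"prob": {("A",): 0.5, ("B",): 0.3, ("C",): 0.2, ("A", "C"): 0.4},
           "backoff": {("A",): 0.75}}
MODEL_2 = {"prob": {("A",): 0.2, ("B",): 0.2, ("C",): 0.6, ("B", "A"): 0.5},
           "backoff": {("B",): 0.625}}
```

Above the unigram level, the work per context is proportional to the number of seen continuations rather than the vocabulary size, which is where the savings come from on sparse models.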

Streaming Computation
Part of the point of offline interpolation is that there may not be enough RAM to fit all the component models. Moreover, with compression techniques that rely on immutable models (Whittaker and Raj, 2001; Talbot and Osborne, 2007), a mutable version of the combined model may not fit in RAM. Instead, we construct the offline model with disk-based streaming algorithms, using the framework we designed for language model estimation (Heafield et al., 2013). Our pipeline (Figure 1) has four conceptual steps: merge probabilities, apply backoffs, normalize, and output. Applying backoffs and normalization are performed in the same pass, so there are three total passes.

Merge Probabilities
This step takes the union of n-grams and multiplies probabilities from component models. We assume that the component models are sorted in suffix order (Figure 4), which is true of models produced by lmplz (Heafield et al., 2013) or stored in a reverse trie. Moreover, despite having different word indices, the models are consistently sorted using the string word, or a hash thereof.
Table 3: Merging probabilities processes n-grams in lexicographic order by suffix. Column headings indicate precedence.

Table 3 rows, in processing order: A; A A; A A A; B A A; B.
The algorithm processes n-grams in lexicographic (depth-first) order by suffix (Table 3). In this way, the algorithm processes $p_i(A)$ before it might be used as a backoff point for $p_i(A \mid B)$ in one of the models. It jointly streams through all models, so that $p_1(A \mid B)$ and $p_2(A \mid B)$ are available at the same time. Ideally, we would compute unnormalized probabilities.
$$\prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}$$
However, these queries back off when models contain different n-grams. The appropriate backoff weights $b_i(w_1^{n-1})$ are not available in a streaming fashion. Instead, we proceed without charging backoffs
$$\prod_i p_i(w_n \mid w_{m_i(w_1^n)}^{n-1})^{\lambda_i}$$
where $m_i(w_1^n)$ records what backoffs should be charged later.
The normalization step (§4.2.3) also uses lower-order probabilities $\prod_i p_i(w_n \mid w_2^{n-1})^{\lambda_i}$ and needs to access them in a streaming fashion, so we also output
$$\prod_i p_i(w_n \mid w_{m_i(w_2^n)}^{n-1})^{\lambda_i}$$
Table 4: Suffix and context sort orders (Heafield et al., 2013). In suffix order, the last word is primary. In context order, the penultimate word is primary. Column headings indicate precedence.
Each output tuple has the form
$$\left\langle w_1^n, \; m(w_1^n), \; \prod_i p_i(w_n \mid w_{m_i(w_1^n)}^{n-1})^{\lambda_i}, \; \prod_i p_i(w_n \mid w_{m_i(w_2^n)}^{n-1})^{\lambda_i} \right\rangle$$
where $m(w_1^n)$ is a vector of backoff requests, from which $m(w_2^n)$ can be computed.
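The two sort orders used by the pipeline can be expressed as sort keys. This is a sketch for in-memory tuples of words, with `suffix_key` and `context_key` being names introduced here; the actual pipeline sorts on disk and compares word identifiers or hashes rather than Python strings.

```python
def suffix_key(ngram):
    """Suffix order: the last word is primary, then the one before it, and so on."""
    return tuple(reversed(ngram))

def context_key(ngram):
    """Context order: the penultimate word is primary; the last word is least significant."""
    return tuple(reversed(ngram[:-1])) + ngram[-1:]

mixed = [("B",), ("B", "A", "A"), ("A", "A"), ("A",), ("A", "A", "A")]
in_suffix_order = sorted(mixed, key=suffix_key)

trigrams = [("A", "B", "A"), ("B", "A", "A"), ("A", "A", "B"), ("A", "A", "A")]
in_context_order = sorted(trigrams, key=context_key)
```

In suffix order a unigram such as A precedes every longer n-gram ending in A, so it is available before it is needed as a backoff point. In context order, n-grams sharing a context are adjacent, and the contexts themselves come out in suffix order.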

Apply Backoffs
This step fulfills the backoff requests from merging probabilities. The merged probabilities are sorted in context order (Table 4) so that n-grams $w_1^n$ sharing the same context $w_1^{n-1}$ are consecutive. Moreover, contexts $w_1^{n-1}$ appear in suffix order. We use this property to stream through the component models again in their native suffix order, this time reading backoff weights $b_i(w_1^{n-1}), b_i(w_2^{n-1}), \ldots, b_i(w_{n-1})$. Multiplying the appropriate backoff weights by $\prod_i p_i(w_n \mid w_{m_i(w_1^n)}^{n-1})^{\lambda_i}$ yields unnormalized probability
$$\prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}$$
The same applies to the lower order.
This step also merges backoffs from component models, with output still in context order.
$$\left\langle w_1^n, \; \prod_i b_i(w_1^{n-1})^{\lambda_i}, \; \prod_i p_i(w_n \mid w_1^{n-1})^{\lambda_i}, \; \prod_i p_i(w_n \mid w_2^{n-1})^{\lambda_i} \right\rangle$$
The implementation is combined with normalization, so the tuple is only conceptual.

Normalize
This step computes normalization factor $Z$ for all contexts, which it applies to produce $p_{LL}$ and $b_{LL}$. Recalling §4.1, $Z(w_1^{n-1})$ is efficient to compute in a batch process by processing suffixes $Z(\epsilon), Z(w_{n-1}), \ldots, Z(w_2^{n-1})$ first. In order to minimize memory consumption, we chose to evaluate the contexts in depth-first order by suffix, so that $Z(A)$ is computed immediately before it is needed to compute $Z(A\,A)$ and forgotten at $Z(B)$.
Computing $Z(w_1^{n-1})$ by applying Theorem 3 requires the sum
$$\sum_{x \in s(w_1^{n-1})} \prod_i p_i(x \mid w_1^{n-1})^{\lambda_i}$$
where $s(w_1^{n-1})$ restricts to seen n-grams. For this, we stream through the output of the apply backoffs step in context order, which makes the various values of $x$ consecutive. Theorem 3 also requires a sum over the lower-order unnormalized probabilities
$$\sum_{x \in s(w_1^{n-1})} \prod_i p_i(x \mid w_2^{n-1})^{\lambda_i}$$
We placed these terms in the input tuple for $w_1^{n-1}x$. Otherwise, it would be hard to access these values while streaming in context order.
While we have shown how to compute $Z(w_1^{n-1})$, we still need to normalize the probabilities. Unfortunately, $Z(w_1^{n-1})$ is only known after streaming through all records of the form $w_1^{n-1}x$, which are the very same records to normalize. We therefore buffer up to the vocabulary size for each order in memory to allow rewinding. Processing context $w_1^{n-1}$ thus yields normalized probabilities $p_{LL}(x \mid w_1^{n-1})$ for all seen $w_1^{n-1}x$.
These records are generated in context order, the same order as the input. The normalization step also computes backoffs.
$$b_{LL}(w_1^{n-1}) = \frac{Z(w_2^{n-1})}{Z(w_1^{n-1})} \prod_i b_i(w_1^{n-1})^{\lambda_i}$$
The denominator $Z(w_1^{n-1})$ is computed by this step, numerator $Z(w_2^{n-1})$ is available due to depth-first search, and the backoff terms $\prod_i b_i(w_1^{n-1})^{\lambda_i}$ are present in the input. The backoffs $b_{LL}$ are generated in suffix order, since each context produces a backoff value. These are written to a sidechannel stream as bare values without keys.
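The buffering described above can be sketched as follows. Records arrive grouped by context; each group is buffered, the normalizer is computed with the recurrence, and the buffered records are then rewound and normalized while the backoff goes to a side channel. The record layout, the function name, and the use of dictionaries for the lower-order normalizer and merged backoffs are simplifying assumptions; the sketch covers non-empty contexts only and does not perform the depth-first bookkeeping of the real pipeline.

```python
import math
from itertools import groupby

def normalize(records, z_lower, backoff_product):
    """Normalize records given in context order.

    records: (context, word, unnormalized, lower_unnormalized) tuples.
    z_lower[context]: Z of the context with its first word removed.
    backoff_product[context]: product over models of b_i(context) ** lambda_i.
    Returns (normalized records, bare backoff values in context arrival order).
    """
    normalized, side_channel = [], []
    for context, group in groupby(records, key=lambda r: r[0]):
        buffered = list(group)  # buffer so the records can be rewound once Z is known
        seen = sum(r[2] for r in buffered)
        lower_seen = sum(r[3] for r in buffered)
        z = seen + (z_lower[context] - lower_seen) * backoff_product[context]
        normalized.extend((context, word, u / z) for _, word, u, _ in buffered)
        side_channel.append(z_lower[context] / z * backoff_product[context])
    return normalized, side_channel

# Hypothetical input: context A with one seen continuation C, equal weights of 0.5.
RECORDS = [(("A",), "C", math.sqrt(0.4 * 0.6), math.sqrt(0.2 * 0.6))]
Z_LOWER = {("A",): math.sqrt(0.5 * 0.2) + math.sqrt(0.3 * 0.2) + math.sqrt(0.2 * 0.6)}
BACKOFFS = {("A",): math.sqrt(0.75 * 1.0)}
NORMALIZED, SIDE_CHANNEL = normalize(RECORDS, Z_LOWER, BACKOFFS)
```

The buffer holds at most one context's worth of records at a time, which is bounded by the vocabulary size.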

Output
Language model toolkits store probability $p_{LL}(w_n \mid w_1^{n-1})$ and backoff $b_{LL}(w_1^n)$ together as values for the key $w_1^n$. To reunify them, we sort $\langle w_1^n, p_{LL}(w_n \mid w_1^{n-1}) \rangle$ in suffix order and merge with the backoff sidechannel from normalization, which is already in suffix order. Suffix order is also preferable because toolkits can easily build a reverse trie data structure.

Tuning
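Weights are tuned to maximize the log probability of held-out data. This is a convex optimization problem (Klakow, 1998). Iterations are expensive due to the need to normalize over the vocabulary at least once. However, the number of weights is small, which makes the Hessian matrix cheap to store and invert. We therefore selected Newton's method. The log probability of tuning data $w$ is
$$\log \prod_n p_{LL}(w_n \mid w_1^{n-1}) = \sum_n \left( \sum_i \lambda_i \log p_i(w_n \mid w_1^{n-1}) - \log Z(w_1^{n-1}) \right)$$
Its partial derivative with respect to $\lambda_i$ is
$$\frac{\partial \log p_{LL}(w)}{\partial \lambda_i} = \sum_n \left( \log p_i(w_n \mid w_1^{n-1}) + C_H(p_{LL}, p_i \mid w_1^{n-1}) \right)$$
where $C_H(p_{LL}, p_i \mid w_1^{n-1}) = -\sum_x p_{LL}(x \mid w_1^{n-1}) \log p_i(x \mid w_1^{n-1})$ is cross entropy. However, computing the cross entropy directly would entail a sum over the vocabulary for every word in the tuning data. Instead, we apply Theorem 3 to express $Z(w_1^{n-1})$ in terms of $Z(w_2^{n-1})$ before taking the derivative. This allows us to perform the same depth-first computation as before (§4.2.3), this time computing $\partial Z(w_1^{n-1}) / \partial \lambda_i$ in terms of $\partial Z(w_2^{n-1}) / \partial \lambda_i$. The same argument applies when taking the Hessian with respect to $\lambda_i$ and $\lambda_j$. Rather than compute it directly in the form
$$\frac{\partial^2 \log p_{LL}(w)}{\partial \lambda_i \, \partial \lambda_j} = -\sum_n \left( \sum_x p_{LL}(x \mid w_1^{n-1}) \log p_i(x \mid w_1^{n-1}) \log p_j(x \mid w_1^{n-1}) - C_H(p_{LL}, p_i \mid w_1^{n-1}) \, C_H(p_{LL}, p_j \mid w_1^{n-1}) \right)$$
which again sums over the vocabulary, we apply Theorem 3 in the same way.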
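To make the optimization concrete, here is a toy sketch of Newton's method for two interpolation weights. It is deliberately simplified: the component models are context-free unigram distributions, the gradient and Hessian are computed by brute force over a three-word vocabulary rather than through Theorem 3, and a step-halving safeguard is added that the text does not describe. The function names and all numeric values are hypothetical.

```python
import math

def loglinear(components, lam):
    """Normalized log-linear combination of context-free distributions."""
    vocab = sorted(components[0])
    scores = {x: math.exp(sum(l * math.log(p[x]) for l, p in zip(lam, components)))
              for x in vocab}
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

def log_likelihood(components, lam, data):
    dist = loglinear(components, lam)
    return sum(math.log(dist[w]) for w in data)

def tune(components, data, iterations=50):
    """Newton's method for exactly two weights, with step halving as a safeguard."""
    vocab = sorted(components[0])
    feats = {x: [math.log(p[x]) for p in components] for x in vocab}
    lam, n = [0.0, 0.0], len(data)
    for _ in range(iterations):
        dist = loglinear(components, lam)
        mean = [sum(dist[x] * feats[x][i] for x in vocab) for i in range(2)]
        grad = [sum(feats[w][i] for w in data) - n * mean[i] for i in range(2)]
        # Hessian of the log likelihood: minus n times the covariance of the log probabilities.
        h = [[-n * sum(dist[x] * (feats[x][i] - mean[i]) * (feats[x][j] - mean[j])
                       for x in vocab) for j in range(2)] for i in range(2)]
        det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
        # Newton direction d = -H^{-1} g, using the closed-form 2x2 inverse.
        d = [-(h[1][1] * grad[0] - h[0][1] * grad[1]) / det,
             -(-h[1][0] * grad[0] + h[0][0] * grad[1]) / det]
        step, current = 1.0, log_likelihood(components, lam, data)
        while step > 1e-8 and log_likelihood(
                components, [l + step * di for l, di in zip(lam, d)], data) < current:
            step /= 2
        lam = [l + step * di for l, di in zip(lam, d)]
    return lam

# Hypothetical components and tuning data (three A, two B, five C).
COMPONENTS = [{"A": 0.5, "B": 0.3, "C": 0.2}, {"A": 0.2, "B": 0.2, "C": 0.6}]
DATA = ["A"] * 3 + ["B"] * 2 + ["C"] * 5
TUNED = tune(COMPONENTS, DATA)
```

In this toy setting the two log-probability features span all distributions over three words, so the tuned model reproduces the empirical frequencies of the tuning data; real tuning data and backoff models would not fit this exactly.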

Experiments
We perform experiments for perplexity, query speed, memory consumption, and effectiveness in a machine translation system. Individual language models were trained on English corpora from the WMT 2016 news translation shared task (Bojar et al., 2016). This includes the seven newswires (afp, apw, cna, ltw, nyt, wpb, xin) from English Gigaword Fifth Edition (Parker et al., 2011); the 2007-2015 news crawls; News discussion; News commentary v11; English from Europarl v8 (Koehn, 2005); the English side of the French-English parallel corpus (Bojar et al., 2013); and the English side of SETIMES2 (Tiedemann, 2009). We additionally built one language model trained on the concatenation of all of the above corpora. All corpora were preprocessed using the standard Moses (Koehn et al., 2007) scripts to perform normalization, tokenization, and truecasing. To prevent SRILM from running out of RAM, we excluded the large monolingual CommonCrawl data, but included English from the parallel CommonCrawl data.
All language models are 5-gram backoff language models trained with modified Kneser-Ney smoothing (Chen and Goodman, 1998) using lmplz (Heafield et al., 2013). Also to prevent SRILM from running out of RAM, we pruned singleton trigrams and above.
For linear interpolation, we tuned weights using IRSTLM. To work around SRILM's limitation of ten models, we interpolated the first ten then carried the combined model and added nine more component models, repeating this last step as necessary. Weights were normalized within batches to achieve the correct final weighting. This simply extends the way SRILM internally carries a combined model and adds one model at a time.

Perplexity experiments
We experiment with two domains: TED talks, which is out of domain, and news, which is in-domain for some corpora. For TED, we tuned on the IWSLT 2010 English dev set and tested on the 2010 test set. For news, we tuned on the English side of the WMT 2015 Russian-English evaluation set and tested on the WMT 2014 Russian-English evaluation set. To measure generalization, we also evaluated news on models tuned for TED and vice-versa. Results are shown in Table 5.
Table 6: Speed and memory consumption of LM combination methods. Interpolated models include the concatenated model. Tuning and compiling times are in minutes, memory consumption in gigabytes, and query time in microseconds per query (on 1G of held-out Common Crawl monolingual data).
Log-linear interpolation performs better on TED (72.58 perplexity versus 75.91 for offline linear interpolation). However, it performs worse on news. In future work, we plan to investigate whether log-linear wins when all corpora are out-of-domain since it favors agreement by all models.
Table 6 compares the speed and memory performance of the competing methods. While the log-linear tuning is much slower, its compilation is faster compared to the offline linear model's long run time. Since the model formats are the same for the concatenation and log-linear, they share the fastest query speeds. Query speed was measured using KenLM's (Heafield, 2011) faster probing data structure.

MT experiments
We trained a statistical phrase-based machine translation system for Romanian-English on the Romanian-English parallel corpora released as part of the 2016 WMT news translation shared task. We trained three variants of this MT system. The first used a single language model trained on the concatenation of the 21 individual LM training corpora. The second used 22 language models, with each LM presented to Moses as a separate feature. The third used a single language model which is an interpolation of all 22 models. This variant was run with offline linear, online linear, and log-linear interpolation. All MT system variants were optimized using IWSLT 2011 Romanian-English TED test as the development set, and were evaluated using the IWSLT 2012 Romanian-English TED test set.
Results are shown in Table 7.
Table 7: Machine translation performance comparison in an end-to-end system.
We leave jointly tuned normalized log-linear interpolation to future work.

Conclusion
Normalized log-linear interpolation is now a tractable alternative to linear interpolation for backoff language models. Contrary to Hsu (2007), we proved that these models can be exactly collapsed into a single backoff language model. This solves the query speed problem. Empirically, compiling the log-linear model is faster than SRILM can collapse its approximate offline linear model. In future work, we plan to improve the performance of feature weight tuning and investigate more general features.