Richer Interpolative Smoothing Based on Modified Kneser-Ney Language Modeling

In this work we present a generalisation of Modified Kneser-Ney interpolative smoothing for richer smoothing via additional discount parameters. We provide mathematical underpinning for the estimators of the new discount parameters, and showcase the utility of our rich MKN language models on several European languages. We further explore the interdependency among the training data size, language model order, and number of discount parameters. Our empirical results illustrate that a larger number of discount parameters i) allows for better allocation of mass in the smoothing process, particularly in the small data regime where statistical sparsity is severe, and ii) leads to significant reductions in perplexity, particularly for out-of-domain test sets which introduce a higher ratio of out-of-vocabulary words.


Introduction
Probabilistic language models (LMs) are at the core of many natural language processing tasks, such as machine translation and automatic speech recognition. m-gram models, the cornerstone of language modeling, decompose the probability of an utterance into conditional probabilities of words given a fixed-length context. Due to the sparsity of events in natural language, smoothing techniques are critical for generalisation beyond the training text when estimating the parameters of m-gram LMs. This is particularly important when the training text is small, e.g. when building language models for translation or speech recognition in low-resource languages.
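To make this decomposition concrete, the following sketch scores a sentence under an m-gram model by chaining conditional probabilities over fixed-length histories; the cond_prob estimator, the padding symbols, and the default order are illustrative assumptions rather than part of any particular toolkit.

import math

def sentence_log_prob(words, cond_prob, m=3):
    # Chain-rule score of a sentence under an m-gram model: each word is
    # conditioned on at most the m-1 preceding words (fixed-length history).
    # `cond_prob(word, context)` is a hypothetical smoothed estimator.
    log_prob = 0.0
    padded = ["<s>"] * (m - 1) + list(words) + ["</s>"]
    for i in range(m - 1, len(padded)):
        context = tuple(padded[i - m + 1:i])
        log_prob += math.log(cond_prob(padded[i], context))
    return log_prob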
A widely used and successful smoothing method is interpolated Modified Kneser-Ney (MKN) (Chen and Goodman, 1999). This method linearly interpolates higher- and lower-order m-gram probabilities, redistributing probability mass via absolute discounting. In this paper, we extend MKN by introducing additional discount parameters, leading to a richer smoothing scheme. This is particularly important when statistical sparsity is more severe, i.e., when building high-order LMs on small data, or when out-of-domain test sets are used.
Previous research in MKN language modeling, and more generally m-gram models, has mainly dedicated efforts to making them faster and more compact (Stolcke et al., 2011; Heafield, 2011; Shareghi et al., 2015) using advanced data structures such as succinct suffix trees. An exception is the hierarchical Pitman-Yor process LM (Teh, 2006a; Teh, 2006b), which provides a rich Bayesian smoothing scheme for which Kneser-Ney smoothing corresponds to an approximate inference method. Inspired by this work, we directly enrich MKN smoothing, realising some of its perplexity reductions while remaining more efficient in learning and inference.
We provide estimators for our additional discount parameters by extending the discount bounds in MKN. We empirically analyse our enriched MKN LMs on several European languages in in-domain and out-of-domain settings. The results show that our discounting mechanism significantly improves perplexity compared to MKN and offers a more elegant way of dealing with out-of-vocabulary (OOV) words and domain mismatch.

Modified Kneser-Ney
MKN uses lower order k-gram probabilities to smooth higher order probabilities. $P(w|u)$ is defined as

$$P(w \mid u) = \frac{c(uw) - D_m\big(c(uw)\big)}{c(u)} + \gamma(u)\,\bar{P}\big(w \mid \pi(u)\big)$$

where $c(u)$ is the frequency of the pattern $u$, $\gamma(\cdot)$ is a constant ensuring the distribution sums to one, and $\bar{P}(w \mid \pi(u))$ is the smoothed probability computed recursively by a similar formula conditioned on the suffix of the pattern $u$, denoted $\pi(u)$. (Note that in all but the top layer of the hierarchy, continuation counts, which count the number of unique contexts, are used in place of the frequency counts (Chen and Goodman, 1999).) Of particular interest are the discount parameters $D_m(\cdot)$, which remove some probability mass from the maximum likelihood estimate of each event; this mass is redistributed over the smoothing distribution. The discounts are estimated as

$$D_m(i) = \begin{cases} 0 & \text{if } i = 0 \\ i - (i+1)\, Y_m\, \dfrac{n_{i+1}[m]}{n_i[m]} & \text{if } 1 \le i \le 2 \\ 3 - 4\, Y_m\, \dfrac{n_4[m]}{n_3[m]} & \text{if } i \ge 3 \end{cases} \qquad Y_m = \frac{n_1[m]}{n_1[m] + 2\,n_2[m]}$$

where $n_i[m]$ is the number of unique m-grams of frequency $i$ (continuation counts are used for the lower layers). This effectively leads to three discount parameters $\{D_m(1), D_m(2), D_m(3+)\}$ for the distributions on a particular context length $m$. Ney et al. (1994) characterised the data sparsity using the following empirical inequalities,

$$4\, n_4[m] < 3\, n_3[m] < 2\, n_2[m] < n_1[m].$$
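A minimal sketch of these standard MKN discount estimators, assuming the count-of-counts $n_i[m]$ for one context length are already available; it mirrors the closed forms above rather than any specific toolkit's implementation, and the example counts are hypothetical.

def mkn_discounts(n):
    # Standard MKN discounts from count-of-counts: n[i] is n_i[m], the number
    # of unique m-grams (continuation counts at lower levels) seen exactly i times.
    Y = n[1] / (n[1] + 2 * n[2])              # Y_m = n_1 / (n_1 + 2 n_2)
    return {
        0: 0.0,                                # D_m(0)
        1: 1 - 2 * Y * n[2] / n[1],            # D_m(1)
        2: 2 - 3 * Y * n[3] / n[2],            # D_m(2)
        3: 3 - 4 * Y * n[4] / n[3],            # D_m(3+), applied to all counts >= 3
    }

# e.g. with hypothetical counts-of-counts for one context length:
# mkn_discounts({1: 50000, 2: 18000, 3: 9000, 4: 5500})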

Generalised MKN
It can be shown (see Appendix A) that these empirical inequalities can be extended to higher frequencies and larger contexts m > 3,

$$\cdots < (i+1)\, n_{i+1}[m] < i\, n_i[m] < \cdots < 2\, n_2[m] < n_1[m] < \sum_{i>0} n_i[m] < n_0[m] < \sigma^m$$

where $\sigma^m$ is the possible number of m-grams over a vocabulary of size $\sigma$, $n_0[m]$ is the number of m-grams that never occurred, and $\sum_{i>0} n_i[m]$ is the number of m-grams observed in the training data. We use these inequalities to extend the discount depth of MKN, resulting in new discount parameters. The additional discount parameters increase the flexibility of the model in altering a wider range of raw counts, resulting in a more elegant way of assigning the mass in the smoothing process. In our experiments, we set the number of discounts to 10 for all levels of the hierarchy (compared with 3 in MKN). This results in the following estimators for the discounts,

$$D_m(i) = \begin{cases} 0 & \text{if } i = 0 \\ i - (i+1)\, Y_m\, \dfrac{n_{i+1}[m]}{n_i[m]} & \text{if } 1 \le i < 10 \\ 10 - 11\, Y_m\, \dfrac{n_{11}[m]}{n_{10}[m]} & \text{if } i \ge 10 \end{cases}$$

It can be shown that the above estimators for our discount parameters are derived by maximising a lower bound on the leave-one-out likelihood of the training set, following (Ney et al., 1994; Chen and Goodman, 1999) (see Appendix B for a proof sketch).
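The same recipe extends directly to a deeper discount table. The sketch below assumes count-of-counts up to $n_{K+1}[m]$ are available for the given context length and uses K = 10 as in our experiments; it is an illustration of the estimator above, not a particular implementation.

def generalised_discounts(n, K=10):
    # Generalised MKN discounts D_m(1), ..., D_m(K+), following
    # D_m(i) = i - (i+1) * Y_m * n_{i+1}[m] / n_i[m]; the last level is shared
    # by all counts >= K, analogous to D_m(3+) in standard MKN.
    Y = n[1] / (n[1] + 2 * n[2])
    D = {0: 0.0}
    for i in range(1, K + 1):
        D[i] = i - (i + 1) * Y * n[i + 1] / n[i]
    return D        # D[K] plays the role of D_m(K+)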

Experiments
We compare the effect of using different numbers of discount parameters on perplexity using the Finnish (FI), Spanish (ES), German (DE), and English (EN) portions of the Europarl v7 corpus (Koehn, 2005) as training data. We consider two types of test set: i) mild domain shift, using the corresponding sections of news-test 2015 (all denoted as NT), and ii) extreme domain shift, using a 24-hour period of streamed Finnish and Spanish tweets (denoted as TW), and the German and English sections of the patent descriptions from the medical translation task (denoted as MED). See Table 1 for statistics of the training and test sets. Adding more discount parameters consistently reduces perplexity on the out-of-domain test sets. This effect is consistent across the Europarl corpora and for all LM orders, and we observe substantial improvements even for m = 10-gram models (see Figure 1). On the medical test set, which has a 9 times higher OOV ratio, the perplexity reduction shows a similar trend. However, these reductions vanish when an in-domain test set is used. Note that we use the same treatment of OOV words for computing the perplexities as is used in KenLM (Heafield, 2013).
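For reference, the sketch below computes token-level perplexity over a test set; it assumes that OOV test words are mapped to an <unk> token before scoring (a common KenLM-style convention), that the end-of-sentence symbol is included in the token count, and that log_prob is a hypothetical model interface returning natural-log probabilities.

import math

def perplexity(sentences, log_prob, vocab, m=5):
    # Token-level perplexity; OOV test words are mapped to <unk> before scoring,
    # and the end-of-sentence symbol is included in the token count.
    total_logp, total_tokens = 0.0, 0
    for words in sentences:
        words = [w if w in vocab else "<unk>" for w in words]
        padded = ["<s>"] * (m - 1) + words + ["</s>"]
        for i in range(m - 1, len(padded)):
            total_logp += log_prob(padded[i], tuple(padded[i - m + 1:i]))
            total_tokens += 1
    return math.exp(-total_logp / total_tokens)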

Analysis
Out-of-domain and out-of-vocabulary. We selected Finnish, for which the number and ratio of OOVs are close on its out-of-domain and in-domain test sets (NT and EU), while still showing a substantial reduction in perplexity on the out-of-domain test set; see the FI bars in Figure 1. Figure 2 (left) shows the full perplexity results for Finnish for vanilla MKN and our extensions when tested on in-domain (EU) and out-of-domain (NT) test sets. The discount plot, Figure 2 (middle), illustrates the behaviour of the various discount parameters. We also measured the average hit length for queries by varying m on the in-domain and out-of-domain test sets. As illustrated in Figure 2 (right), the in-domain test set allows for longer matches to the training data as m grows. This indicates that having more discount parameters is not only useful for test sets with an extremely high number of OOVs, but also allows for a more elegant way of assigning mass in the smoothing process when there is a domain mismatch.
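Here "hit length" means the length of the longest suffix of each query (context plus predicted word) that also occurs in the training data. A rough sketch, assuming the training k-grams are stored in a plain set rather than the compressed suffix structures used in practice:

def average_hit_length(queries, train_ngrams, m=10):
    # Average length of the longest query suffix found in training.
    # `train_ngrams` is a set containing all training k-grams (tuples) for k <= m,
    # and each query is a tuple of at most m words.
    total = 0
    for query in queries:
        hit = 0
        for k in range(1, len(query) + 1):
            if tuple(query[-k:]) in train_ngrams:
                hit = k                   # longest matching suffix so far
            else:
                break                     # longer suffixes cannot match either
        total += hit
    return total / len(queries)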
Interdependency of m, data size, and discounts. To explore the correlation between these factors, we selected German and investigated this correlation on two different training data sizes: Europarl (61M words) and CommonCrawl 2014 (984M words). Figure 3 illustrates the correlation between these factors using the same test set but with small and large training sets. The slopes of the surfaces indicate that the small training data regime (left), which has higher sparsity and more OOVs at test time, benefits substantially from the more accurate discounting, whereas for the large training set (right) the gain from discounting is slight.

Conclusions
In this work we proposed a generalisation of Modified Kneser-Ney interpolative language modeling by introducing new discount parameters. We provide a mathematical proof for the discount bounds used in Modified Kneser-Ney, extend it further, and illustrate the impact of our extension empirically on different Europarl languages using in-domain and out-of-domain test sets. The empirical results on various training and test sets show that our proposed approach allows for a more elegant way of treating OOVs and mass assignment in interpolative smoothing. In future work, we will integrate our language model into the Moses machine translation pipeline to measure its impact on translation quality, which is of particular use in out-of-domain scenarios.

A. Inequalities proof sketch
Assume that the frequencies of m-gram patterns in natural language follow a power law (Clauset et al., 2009),

$$P\big(C(u) = i\big) \propto i^{-s_m}$$

where $s_m > 1$ is the parameter of the distribution and $C(u)$ is the random variable denoting the frequency of the m-gram pattern $u$. We now compute the expected number of unique patterns having a specific frequency, $\mathbb{E}\big[n_i[m]\big]$. Corresponding to each m-gram pattern $u$, let us define a random variable $X_u$ which is 1 if the frequency of $u$ is $i$ and zero otherwise. It is not hard to see that $n_i[m] = \sum_u X_u$, and

$$\mathbb{E}\big[n_i[m]\big] = \sum_u P\big(C(u) = i\big) \propto \sigma^m\, i^{-s_m}.$$

We can verify that

$$i\, \mathbb{E}\big[n_i[m]\big] \propto i^{1-s_m} > (i+1)^{1-s_m} \propto (i+1)\, \mathbb{E}\big[n_{i+1}[m]\big],$$

which completes the proof of the inequalities.
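As a quick sanity check of the inequality $i\,n_i[m] > (i+1)\,n_{i+1}[m]$, the sketch below draws pattern counts from a truncated discrete power law and compares $i\,n_i$ against $(i+1)\,n_{i+1}$ for small $i$; the exponent, truncation, and sample size are illustrative assumptions, not estimates from our corpora.

import random
from collections import Counter

def sample_power_law_counts(num_patterns=100000, s=1.5, max_count=1000):
    # Draw pattern frequencies C(u) with P(C(u) = i) proportional to i**(-s),
    # truncated at max_count; purely illustrative.
    support = list(range(1, max_count + 1))
    weights = [i ** (-s) for i in support]
    return random.choices(support, weights=weights, k=num_patterns)

n = Counter(sample_power_law_counts())    # n[i] = number of patterns with count i
for i in range(1, 6):
    print(i, i * n[i], (i + 1) * n[i + 1], i * n[i] > (i + 1) * n[i + 1])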

B. Discount bounds proof sketch
The leave-one-out (leaving out those m-grams which occurred only once) log-likelihood of the interpolative smoothing is lower bounded by that of the backoff model (Ney et al., 1994); hence the estimated discounts for the latter can be considered as an approximation to the discounts of the former. Consider a backoff model with absolute discounting parameter $D$, where $P(w_i \mid w_{i-m+1}^{i-1})$ is defined as:

$$P(w_i \mid w_{i-m+1}^{i-1}) = \begin{cases} \dfrac{c(w_{i-m+1}^{i}) - D}{c(w_{i-m+1}^{i-1})} & \text{if } c(w_{i-m+1}^{i}) > 0 \\[2ex] \dfrac{D\, N_{1+}(w_{i-m+1}^{i-1}\,\bullet)}{c(w_{i-m+1}^{i-1})}\,\bar{P}(w_i \mid w_{i-m+2}^{i-1}) & \text{otherwise} \end{cases}$$

where $N_{1+}(w_{i-m+1}^{i-1}\,\bullet)$ is the number of unique right contexts for the $w_{i-m+1}^{i-1}$ pattern. Assume that for any choice of $0 < D < 1$ we can define $\bar{P}$ such that $P(w_i \mid w_{i-m+1}^{i-1})$ sums to 1. For readability we use $\lambda(w_{i-m+1}^{i-1})$ as a shorthand for the backoff weight $D\, N_{1+}(w_{i-m+1}^{i-1}\,\bullet)\,/\,c(w_{i-m+1}^{i-1})$. Following (Chen and Goodman, 1999), the leave-one-out log-likelihood for KN (Ney et al., 1994) is rewritten to include more discounts (in this proof up to $D_4$); after simplification, and after taking the $c(w_{i-m+1}^{i}) = 5$ terms out of the summation for $D_4$, maximising the resulting lower bound with respect to each discount yields the estimators given in the Generalised MKN section.
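To make the backoff form above concrete, here is a small sketch of a single-discount absolute-discounting backoff probability; counts are plain Python dicts keyed by word tuples and the lower-order distribution is passed in as a function, which is a simplification of the recursive model used in the proof rather than the model itself.

def backoff_prob(word, context, count, followers, lower_prob, D=0.75):
    # Single-discount absolute-discounting backoff: `count` maps word tuples to
    # frequencies, `followers[context]` is the number of unique words observed
    # after `context` (the N_1+ continuation count), and `lower_prob` is the
    # lower-order distribution used when the full m-gram was never seen.
    c_ctx = count.get(context, 0)
    c_full = count.get(context + (word,), 0)
    if c_full > 0:
        return (c_full - D) / c_ctx
    if c_ctx == 0:
        return lower_prob(word, context[1:])     # nothing known about this history
    lam = D * followers.get(context, 0) / c_ctx  # mass reserved by discounting
    return lam * lower_prob(word, context[1:])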