Large Margin Neural Language Model

We propose a large margin criterion for training neural language models. Conventionally, neural language models are trained by minimizing perplexity (PPL) on grammatical sentences. However, we demonstrate that PPL may not be the best metric to optimize in some tasks, and further propose a large margin formulation. The proposed method aims to enlarge the margin between the “good” and “bad” sentences in a task-specific sense. It is trained end-to-end and can be widely applied to tasks that involve re-scoring of generated text. Compared with minimum-PPL training, our method gains up to 1.1 WER reduction for speech recognition and 1.0 BLEU increase for machine translation.


Introduction
Language models (LMs) estimate the likelihood of a symbol sequence {x t } T t=0 , based on the joint probability, p(x 0 , . . . , x T ) = p(x 0 ) T t=1 p(x t |x 0 , . . . , x t−1 ). (1) To measure the quality of an LM, a commonly adopted metric is perplexity (PPL), defined as log p(x t |x 0 , . . . , x t−1 ) , A good language model has a small PPL, being able to assign higher likelihoods to sentences that are more likely to appear. LMs are widely applied in automatic speech recognition (ASR) (Yu and Deng, 2014) and machine translation (MT) (Koehn, 2009). Following Koehn (2009), one may interpret the language * Contributions were made while at Baidu Research. model as prior knowledge on the text to be inferred, which provides information complementary to the ASR or MT system itself. In practice, there are several ways to incorporate the language model. The simplest way may be re-scoring an n-best list returned by the ASR or MT system (Mikolov et al., 2010;Sundermeyer et al., 2012). A slightly more sophisticated way is to jointly consider the ASR/MT and language model in a beam search decoder (Amodei et al., 2016). Specifically, at each time step, the decoder appends every symbol in the vocabulary to each sequence in the current candidate set. For every hypothesis, a score is calculated as a linear combination of the log-likelihoods given by both the ASR/MT and language models. Then, only the top K hypotheses with the highest scores are retained, as an updated candidate set. More recently, Gulcehre et al. (2015) and Sriram et al. (2017) propose to predict the next symbol based on a fusion of the hidden states in the ASR/MT and language models. A gating mechanism is jointly trained to determine how much the language model should contribute.
The afore-discussed language models are generative in the sense that they merely model the joint distribution of a symbol sequence (Eq. (1)). While the research community is mostly focused on pushing the limit of PPL (e.g., Jozefowicz et al., 2016), very limited attention has been paid to the discrimination power of language models when they are applied to real tasks, such as ASR and MT (Li and Khudanpur, 2008). By contrast, discriminative language modeling aims at enhancing the performance in downstream applications. For example, existing works (Roark et al., 2004(Roark et al., , 2007 often target at improving ASR accuracy. The key motivation underlying them is that the model should be able to discriminate between "good" and "bad" sentences in a task-specific sense, instead of just modeling grammatical ones. The common methodology (Dikic et al., 2013) is to build a binary classifier upon hand-crafted features extracted from the sentences. However, it is not obvious how these methods can utilize large unannotated corpus, which is often easily available, and the hand-crafted features are also ad hoc and may result in suboptimal performance.
In this work, we study how to improve the discrimination ability of a recurrent network-based neural language model (RNNLM). The goal is to enlarge the difference between the log-likelihoods of "good" and "bad" sentences. In an contrast to the existing works (Roark et al., 2004(Roark et al., , 2007, our method does not rely on hand-crafted features, and is trained in end-to-end manner and able to take advantage of large external text corpus. In fact, it is a general training criterion that is transparent to the network architecture of the RNNLM, and can be applied to various text generation tasks, including ASR and MT. Experiments on state-of-art ASR and MT systems show its significant advantage over an LM trained by minimizing PPL.

Background on RNNLM
We first give some background knowledge on RNNLMs. The prototypical RNNLM (Mikolov et al., 2010) has one layer of recurrent cell and works as follows. Denote a sentence as x = [x 0 , . . . , x t , . . . ], where the x t 's are words. Let x t be the embedding vector for x t . The recurrent cell takes in the embedding and produces a hidden state h t by where σ(z) = 1 1+e −z is sigmoid activation function. h t−1 is the hidden state at the last timestep. U and V are learnable parameters. The h t is then passed into a multi-way classifier to produce a probability distribution over the vocabulary (for the next word), The W and b are also trainable parameters. The training objective is to maximize the loglikelihood of the next word, and the parameters are learned by back-propagation algorithm.
The vanilla recurrent cell can also be replaced by one or multiple layers of LSTM cells, which produces better results (Zaremba et al., 2014). In a more general form, the RNNLM can be represented as a conditional probability, p θ (x t |x 0 , . . . , x t−1 ), parameterized by θ. In the prototypical case, θ = [U, V, W, b]. We could define the LM-score of a sentence x as The RNNLM is trained by maximizing the average LM-score over all the x's in a corpus, or equivalently, minimizing the PPL on the corpus.

Problem Formulation
We motivate and formulate a large margin training criterion in this section. Suppose for every reference sentence x i , we have a collection of hypotheses x i,j , j = 1, . . . , K, usually obtained as the top-K candidates by a beam search decoder.

A Motivating Example
An RNNLM trained by minimizing PPL cannot guarantee a higher score on the "gold" reference than the inferior hypothesis, which is undesirable. One example is given in Tab. 1. The reference is taken from the text labels of dev93' set of Wall Street Journal (WSJ) dataset. The hypothesis is generated by a CTC-based (Graves et al., 2006) ASR system trained on WSJ training set. Words in red are mistakes made by the hypothesis. We then train an RNNLM on Common Crawl 1 copora by minimizing PPL. Training follows a typical setup (Jozefowicz et al., 2016) with a vocabulary of 400K the most frequent words. Any out-ofvocabulary word is replaced by an UNK token. The RNNLM is then employed to score the sentences. The LM-score of the erroneous hypothesis is higher than that of the reference. In fact, this is reasonable as "a decade as concerns" seems to be a more common phrase. In the training corpus, we find that "a decade as concerns" appears once, but "its defeat is confirmed" does not appear. Moreover, "a decade as" appears 2,280 times, but "its defeat is" appears only 24 times. However, this is undesirable because if there is another hypothesis that happens to be the same as reference, which will not be ranked as the best candidate.
It would be helpful if the LM can also learn from the imperfect hypotheses so that it can tell Sentence LM-score reference coniston declined to discuss its plans for its defeat is confirmed but indicated that it doesn't plan to simply walk away -116.52 hypothesis coniston declined to discuss its plans for a decade as concerns but indicated that it doesn't plan to simply walk away -112.65 Table 1: Reference and one hypothesis, scored by an RNNLM. Words in red are mistakes in the hypothesis. The RNNLM is trained on Common Crawl copora by minimizing PPL. We want the reference to be higher scored than the hypothesis, but it does not happen here. apart "good" and "bad" candidates. With this motivation, we train to assign larger LM-scores for the x i 's but smaller ones for the (imperfect) x i,j 's. A quantity of particular interest is log p(x i ) − log p(x i,j ), the margin/difference between the LM-scores of the references and the (imperfect) hypotheses. The intuition is that the more positive the margin, the better the LM is at discrimination.

Straightforward but Failed Formulation
Without loss of generality, we assume that all the x i,j 's are imperfect and different from x i . A straightforward way is to adopt the following objective: (2) Similar formulation is also seen in (Tachioka and Watanabe, 2015), where they only utilize one beam candidate, i.e., K = 1. Optimization can be carried out by mini-batch stochastic gradient descent (SGD). Each iteration, SGD randomly samples a batch of i's and j's, computes stochastic gradient w.r.t. θ, and takes an update step. However, a potential problem with this formulation is that the second term (corresponding to the inferior hypotheses) may dominate the optimization. Specifically, the training is almost always driven by the x i,j 's, but does not effectively enhance the discrimination. We illustrate this fact in the following experiment.
Using the ASR system in section 3.1, we extract 256 beam candidates for every training example in Wall Street Journal (WSJ) dataset. Warm started from the pre-trained RNNLM in section 3.1, we apply SGD to minimize the loss in Eq. (2), with a mini-batch size of 128. The training loss is shown in Fig. 1a. We observe that the learning dynamic is very unstable, and deceases to be negative. The unbound decreasing is due to the second term in Eq. (2) being negative and dominating the training process. Next, we inspect log p θ (x i ) − log p θ (x i,j ), the margin between the scores of a ground-truth and a candidate. In Fig. 2a, we histogram the margins for all the i, j's in a dev set. The distribution appears to be symmetric around zero, which indicates poor discrimination ability. Given these facts, we conclude that the straightforward formulation in Eq. (2) is not effective.

Large Margin Formulation
To effectively utilize all the imperfect beam candidates, we propose the following objective, is the margin between the scores of a ground-truth x i and a can-didate x i,j . The hinge loss on the margin encourages the log-likelihood of the ground-truth to be at least τ larger than that of the imperfect hypothesis. We call an LM trained by the above formulation as Large Margin Language Model (LMLM).
We repeat the same experiment in section 3.2, but change the objective function to Eq. (3) and set τ = 1. Fig. 1b shows the training loss, which steadily decreases and approaches zero rapidly. Compared with the learning curve of naive formulation (Fig. 1a), the large margin based training is much more stable. In Fig. 2b, we also examine the histogram of log is now the LM learned by LMLM. Compared with the histogram by the conventional RNNLM, LMLM significantly moves the distribution to the positive side, indicating more discrimination.

Ranking Loss Type Formulation
In most cases, all beam candidates are imperfect. It may be beneficial to exploit the information that some candidates are relatively better than the others. We consider ranking them according to some metrics w.r.t. the ground-truth sentences. For ASR, the metric is WER, and for MT, the metric is BLEU score. We define x i,0 x i and assume that the candidates {x i,j } K j=1 are sorted such that for ASR, and for MT. In other words, x i,j−1 has better quality than x i,j . We then enforce the "better" sentences to have a score at least τ larger than those "worse" ones. This leads to the following formulation, Compared with LMLM formulation Eq. (3), the above introduces more comparisons among the candidates, and hence more computational cost during training. We call this formulation rankingloss-based LMLM (rLMLM).
To summarize this section, we have proposed LMLM and rLMLM that aim at discriminating between hypotheses in a task-specific (e.g., WER or BLEU) sense, instead of minimizing PPL.

Experiments on ASR
We apply the LMs trained under different criteria to rescore the beams in various ASR systems. In particular, we are interested in knowing which of the two training mechanisms is better: minimizing PPL (e.g., the RNNLM in Section 3.1), or fitting to the WER metric by the proposed methods.
Adapting an RNNLM to a specific domain has been of interest, especially to the speech community (Park et al., 2010;Chen et al., 2015;Ma et al., 2017). We adopt Ma et al. (2017) that fine-tune the softmax layer of RNNLM by minimizing the PPL on the text labels of training set. According to Ma et al. (2017), the reason not to fine-tune all the layers is due to the limited text labels in the target domain. Indeed, we also observe overfitting if adapting all layers, but adapting only the softmax layer effectively decreases the PPL on the text labels of dev sets. We refer to this fine-tuning as RNNLM-adapted in the following sections.
To make a fair comparison with the adapted model, we also use the RNNLM as an initialization for our LMLM and rLMLM. In total, there are four language models for rescoring the beams. RNNLM and its adapted version that aim at reducing PPL; and the two proposed methods, LMLM and rLMLM that try to fit to WER.

WSJ Dataset
The WSJ corpora consists of about 80 hours of read speech with texts drawn from a machinereadable corpus of Wall Street Journal news. We use the standard configuration of train si284 dataset for training, dev93 for development and eval92 for testing.
Our ASR model has one convolution layer, followed by 5 bidirectional RNNs and one fully connected layer, with a CTC loss on top. The text labels of the training set are used to train a 4-gram language model, which is employed in the ASR decoder. The beam search decoder has a beam width of 2000. Before beam rescoring, this ASR system achieves a WER of 12.16 on dev93 set and 7.69 on eval92 set. To put this into perspective, we list some previous state-of-the-art system in Tab. 2. Compared with them, our baseline is already very competitive.

WERs and PPLs
The out-of-vocabulary rate of WSJ text is only 0.28%, making the RNNLM reasonable to use.
We apply the RNNLM, RNNLM-adapted (Ma et al., 2017), LMLM and rLMLM to rescore the beams on dev and test set. The final score assigned to a beam is a weighted sum of the ASR and language model scores. The weight is found by minimizing the WER on the dev set.
Tab. 3 reports the WERs on dev93 and eval92 sets. All methods reduce the WER over the baseline without rescoring. However, LMLM and rLMLM are notably better than the other two methods. Moreover, although RNNLM and RNNLM-adapted achieve smaller PPLs on the text labels, the advantage does not transfer to WER. ASR Models WER dev93 eval92 EESEN (Miao et al., 2015) N/A 7.34 Attention (Bahdanau et al., 2016)

Correlation between scores and WERs
To better understand the proposed methods, we calculate the correlation coefficients between the hypotheses' WERs and their scores (by different language models). In specific, for every utterance in the test set, we have a set of beam candidates, their word level accuracies (100-WER) and scores given by an LM, from which a Pearson correlation coefficient can be calculated. We calculate the coefficients for all the utterances in the test set, and boxplot these coefficients in Fig. 3. The correlation coefficients by LMLM and rLMLM tend Whiskers are 5% and 95% quantiles. Lower and upper box boundaries are 25% and 75% quantiles.
to be higher than RNNLM and RNNLM-adapted. This indicates that LMLM and rLMLM are more aligned with the goal of reducing WER.

Case Study
Tab. 4 posts some examples from the test set. The first column lists the ground-truth labels, and their corresponding best candidates as re-ranked by the four LMs (see notes in the second column). Words in red are mistakes made by the candidate sentences. Scores of these sentences are listed in the last four columns. We have the following observations: 1. LMLM and rLMLM give worse scores on the ground-truth labels than RNNLM and RNNLM-adapted, which explains their higher PPL in Tab Table 4: Some "gold" references and best hypotheses (after rescoring by different language models) for eval92 set. In red are errors or missing word (denoted as ' ').
some weakness in the ASR, which is not achieved by RNNLM and RNNLM-adapted.

10K Speech Dataset
We further validate our methods on a larger noisy dataset collected by Liu et al. (2017). The dataset has about 10K hours of spontaneous speech. The utterances are corrupted by background noise, and a large portion of them are accented. Therefore it is much more challenging than WSJ. We adopt the same training-dev-test split as in Liu et al. (2017  The ASR we build has the same architecture as in Liu et al. (2017), except that its decoder integrates an in-domain 5-gram language model. This system achieves a WER of 19.17 on dev set, better than the reported 19.77 baseline in Liu et al. (2017). Based on the ASR, we repeat the same experiments in section 4.1. Tab. 5 reports WERs and PPLs on dev and test sets. Both LMLM and rLMLM outperform the other methods in WER, although their PPLs are higher. This trend is similar to that in Tab. 3.

Experiments on NMT
In this section, we experiment the large-margin criterion trained LM with a competitive Chineseto-English NMT system. The NMT model is trained from 2M parallel sentence pairs. Following Shen et al. (2016), we use NIST 06 newswire portion (616 sentences) for development and NIST 08 newswire portion (691 sentences) for testing. We use OpenNMT-py 2 package with the default configuration to train the model: batch size is 64; word embedding size is 500; dropout rate is 0.3; target vocabulary size is 50K; number of epochs is 20, after which a minimum dev perplexity of 7.72 is achieved.

BLEUs and PPLs
We use a beam size of 10 for decoding, and report case-insensitive 4-reference BLEU-4 scores (by calling "multi bleu.perl" 3 ). The NMT model achieves 35.18 BLEU score on dev set and 31.52 on test set (see table 6). To put this into perspective, Shen et al. (2016) trains their models on 2.56M pairs of sentences and reports a dev BLEU score of 32.7 (via MOSES) or 30.7 (via RNNsearch, beam size of 10). So our NMT model is already very competitive.
To construct the training data for LMLM and rLMLM, 10 beam candidates are extracted for every sentence in the training set. We then follow the same experimental steps outlined in section 4.1, except that the ASR score is now changed to NMT score. In addition, we also find that normalizing the LM score by sentence length can improve the re-scoring performance substantially. Tab. 6 compares the BLEU score after re-ranking by the different LMs. LMLM and rLMLM both improve upon the baseline significantly, and outperform RNNLM and RNNLM-adapted by a notable margin. We also observe that the PPLs of LMLM and rLMLM are much larger than those of RNNLM and RNNLM-adapted, suggesting that the PPL metric may be very poorly correlated with BLEU.
Interestingly, RNNLM-adapted does not show any gain in BLEU score over RNNLM. To understand this, we recall that NMT is trained by minimizing PPL on target text. Its decoder is implicitly an RNNLM on target language. We conjecture that adapting an LM to the target domain can only duplicate the functionality of the NMT decoder, which does not bring any additional benefit.

Correlation between scores and BLEUs
We measure the correlation between the LM scores and BLUEs. The calculation is done on dev06 set in the same way as Section 4.1.2, but now we change the WERs to BLEUs. The boxplot of the correlation coefficients are shown in Fig. 4. Compared with the boxplot in Fig. 3  and rLMLM, however, is considerably higher than those by RNNLM and RNNLM-adapted.

Related Work
"Language modeling is an art of determining the probability of a sequence of words" (Goodman, 2001). In the past decades, there has been a trend of increasing the context that an LM can condition on. N-gram models (Chen and Goodman, 1996) assume that each symbol depends on the previous N − 1 symbols. Feed forward neural network based LMs (Bengio et al., 2003) are not count based but they inherit the restrictive assumption.
To model longer-term dependencies, RNNLMs (Mikolov et al., 2010) are proposed. RNNLMs often achieve smaller PPLs than the N-gram counterparts (Sundermeyer et al., 2012;Zaremba et al., 2014;Jozefowicz et al., 2016). This paper focuses on RNNLM-type architectures. While these works all adopt PPL as the metric to optimize, sometimes one may optimize a taskspecific objective. For example, Kuo et al. (2002); Roark et al. (2007) and Dikic et al. (2013) propose discriminative LMs to improve speech recognition. The common methodology therein is to fit a probabilistic model, e.g., conditional random field (Roark et al., 2004), to the space of text candidates, and maximize the probability at the desired candidate. The problem is often solved by perceptron algorithm. However, these methods all rely on ad-hoc choice of features, e.g., counts of n-grams where n varies in a small range (e.g.,1 to 3). Moreover, it is also not clear how these methods would take advantage of an existing language model (trained on large unsupervised corpus). Nevertheless, the same methodology can be extended to RNNLMs, thus avoiding the aforementioned limitations. For example, Auli and Gao (2014) train an RNNLM by favoring sentences with high BLEU scores and integrate it into a phrase-based MT decoder. If we cast the problem of picking the best text sequence as a ranking problem, the aforementioned works can be considered as "pointwise" learning-to-rank approaches (Cossock and Zhang, 2008). In contrast, the proposed method is a "pairwise" approach (Liu, 2009), as it learns a neural language model by comparison between pairs of sentences. Earlier works in this fashion may date back to (Collins and Koo, 2005), which improves a semantic parser. Learning "by pairwise comparison" is also seen in several MT literatures. For example, Hopkins and May (2011) propose to train a phrase-based MT system by minimizing a pairwise ranking loss. Wiseman and Rush (2016) optimize the beam search process in a Neural Machine Translation (NMT) system. They enforce the score of a reference to be higher than that of its decoded k-th candidate by at least a unit margin.
Rather than optimizing the MT system itself, this work proposes a general method of training recurrent neural language models, which can benefit various text generation tasks, including speech recognition and machine translation.

Conclusions
We have proposed a large margin criterion for training recurrent neural language models. Rather than minimizing PPL, the proposed criterion is based on comparison between pairs of sen-tences. We have formulated two algorithms that implement the training criterion. One compares between references and imperfect hypotheses (LMLM), the other compares between all pairs of hypotheses (rLMLM). We applied the language models trained by these two algorithms to speech recognition and machine translation. Both of them demonstrate superior performance over their minimum-PPL counterparts. However, the performance gain from LMLM to rLMLM is small, although rLMLM is built on more pairwise comparisons and requires more training efforts. The efficiency with respect to the number of pairs is a future research topic.