Greedy Search with Probabilistic N-gram Matching for Neural Machine Translation

Neural machine translation (NMT) models are usually trained with the word-level loss using the teacher forcing algorithm, which not only evaluates the translation improperly but also suffers from exposure bias. Sequence-level training under the reinforcement framework can mitigate the problems of the word-level loss, but its performance is unstable due to the high variance of the gradient estimation. On these grounds, we present a method with a differentiable sequence-level training objective based on probabilistic n-gram matching, which avoids the reinforcement framework. In addition, this method performs greedy search during training, using the predicted words as context just as at inference, to alleviate the problem of exposure bias. Experimental results on the NIST Chinese-to-English translation tasks show that our method significantly outperforms the reinforcement-based algorithms and achieves an improvement of 1.5 BLEU points on average over a strong baseline system.


Introduction
Neural machine translation (NMT) (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) has now achieved impressive performance (Gehring et al., 2017; Vaswani et al., 2017; Hassan et al., 2018; Lample et al., 2018) and draws more attention. NMT models are built on the encoder-decoder framework, where the encoder network encodes the source sentence into distributed representations and the decoder network reconstructs the target sentence from the representations word by word.
Currently, NMT models are usually trained with the word-level loss (i.e., cross-entropy) under the teacher forcing algorithm (Williams and Zipser, 1989), which forces the model to generate a translation strictly matching the ground truth at the word level. In practice, however, it is impossible to generate a translation identical to the ground truth. Once different target words are generated, the word-level loss cannot evaluate the translation properly and usually underestimates it. In addition, the teacher forcing algorithm suffers from exposure bias (Ranzato et al., 2015) because it uses different inputs at training and inference: ground-truth words during training and previously predicted words during inference. Kim and Rush (2016) proposed sequence-level knowledge distillation, which uses teacher outputs to direct the training of the student model, but the student model still has no access to its own predicted words. Scheduled sampling (SS) (Bengio et al., 2015; Venkatraman et al., 2015) attempts to alleviate the exposure bias problem by mixing ground-truth words and previously predicted words as inputs during training. However, the sequence generated by SS may not be aligned with the target sequence, which is inconsistent with the word-level loss.
In contrast, sequence-level objectives, such as BLEU (Papineni et al., 2002), GLEU, TER (Snover et al., 2006), and NIST (Doddington, 2002), evaluate the translation at the sentence or n-gram level and allow for greater flexibility, and thus can mitigate the above problems of the word-level loss. However, because sequence-level objectives are non-differentiable, previous works on sequence-level training (Ranzato et al., 2015; Shen et al., 2016; Bahdanau et al., 2016; Wu et al., 2017; Yang et al., 2017) mainly rely on reinforcement learning algorithms (Williams, 1992; Sutton et al., 2000) to find an unbiased gradient estimator for the gradient update. Sparse rewards in this situation often cause high variance in the gradient estimation, which consequently leads to unstable training and limited improvements. Lamb et al. (2016), Gu et al. (2017) and Ma et al. (2018) respectively use a discriminator, a critic and a bag-of-words target as sequence-level training objectives, all of which are directly connected to the generation model and hence enable a direct gradient update. However, these methods do not allow for direct optimization with respect to evaluation metrics.
In this paper, we propose a method that combines the strengths of word-level and sequence-level training, that is, the direct gradient update without gradient estimation from word-level training and the greater flexibility from sequence-level training. Our method introduces probabilistic n-gram matching, which makes sequence-level objectives (e.g., BLEU, GLEU) differentiable. During training, it abandons teacher forcing and performs greedy search instead, so that the predicted words are taken into consideration. Experimental results show that our method significantly outperforms word-level training with the cross-entropy loss and sequence-level training under the reinforcement framework. The experiments also indicate that the greedy search strategy is indeed superior to teacher forcing.

Background
NMT is based on an end-to-end framework which directly models the translation probability from the source sentence $x$ to the target sentence $\hat{y}$:

$$P(\hat{y} \mid x) = \prod_{j=1}^{T} p(\hat{y}_j \mid \hat{y}_{<j}, x, \theta), \tag{1}$$

where $T$ is the target length and $\theta$ denotes the model parameters. Given the training set $D = \{X^M, Y^M\}$ with $M$ sentence pairs, the training objective is to maximize the log-likelihood of the training data:

$$\mathcal{L}(\theta) = \sum_{m=1}^{M} \sum_{j=1}^{l_m} \log p(\hat{y}^m_j \mid \hat{y}^m_{<j}, x^m, \theta), \tag{2}$$

where the superscript $m$ indicates the $m$-th sentence in the dataset and $l_m$ is the length of the $m$-th target sentence. In the above model, the probability of each target word $p(\hat{y}^m_j \mid \hat{y}^m_{<j}, x^m, \theta)$ is conditioned on the previous target words. During training, the teacher forcing algorithm is employed and the ground-truth words from the target sentence are fed as context, while during inference the ground-truth words are not available and the previously predicted words are fed as context instead. This discrepancy is called exposure bias.

Sequence-Level Objectives
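Many automatic evaluation metrics of machine translation, such as BLEU, GLEU and NIST, are based on n-gram matching. Assuming that $y$ and $\hat{y}$ are the output sentence and the ground-truth sentence with lengths $T$ and $T'$ respectively, the count of an n-gram $g = (g_1, \dots, g_n)$ in sentence $y$ is calculated as

$$C_y(g) = \sum_{t=0}^{T-n} 1\{g_1 = y_{t+1}, \dots, g_n = y_{t+n}\}, \tag{3}$$

where $1\{\cdot\}$ is the indicator function. The matching count of the n-gram $g$ between $\hat{y}$ and $y$ is given by

$$C_y^{\hat{y}}(g) = \min\big(C_y(g), C_{\hat{y}}(g)\big). \tag{4}$$

Then the precision $p_n$ and the recall $r_n$ of the predicted n-grams are calculated as follows:

$$p_n = \frac{\sum_{g \in y} C_y^{\hat{y}}(g)}{\sum_{g \in y} C_y(g)}, \tag{5}$$

$$r_n = \frac{\sum_{g \in y} C_y^{\hat{y}}(g)}{\sum_{g \in \hat{y}} C_{\hat{y}}(g)}. \tag{6}$$

BLEU, the most widely used metric for machine translation evaluation, is defined based on the n-gram precision as follows:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \tag{7}$$

where BP stands for the brevity penalty and $w_n$ is the weight for the n-gram. In contrast, GLEU is the minimum of recall and precision of 1-4 grams, where 1-4 grams are counted together:

$$\mathrm{GLEU} = \min(p_{1\text{-}4}, r_{1\text{-}4}). \tag{8}$$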
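The n-gram counts, clipped matching counts, precision, BLEU and GLEU described above can be sketched in plain Python. This is a minimal illustration rather than the evaluation script used in the experiments: the function names are hypothetical, BLEU is computed at the sentence level with uniform weights and no smoothing, and a single reference is assumed.

```python
import math
from collections import Counter


def ngram_counts(sentence, n):
    """Count every n-gram (as a tuple of tokens) in a tokenized sentence."""
    return Counter(tuple(sentence[t:t + n]) for t in range(len(sentence) - n + 1))


def matched_and_totals(hyp, ref, n):
    """Return (clipped matches, total hypothesis n-grams, total reference n-grams)."""
    hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
    # Each hypothesis n-gram is matched at most as often as it occurs in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched, sum(hyp_counts.values()), sum(ref_counts.values())


def precision(hyp, ref, n):
    """n-gram precision: clipped matches over the number of hypothesis n-grams."""
    matched, hyp_total, _ = matched_and_totals(hyp, ref, n)
    return matched / hyp_total if hyp_total else 0.0


def bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with uniform weights 1/max_n (assumption: no smoothing)."""
    precisions = [precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    # Brevity penalty: 1 if the hypothesis is longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


def gleu(hyp, ref, max_n=4):
    """GLEU: minimum of precision and recall with 1..max_n grams pooled together."""
    matched = hyp_total = ref_total = 0
    for n in range(1, max_n + 1):
        m, h, r = matched_and_totals(hyp, ref, n)
        matched, hyp_total, ref_total = matched + m, hyp_total + h, ref_total + r
    if hyp_total == 0 or ref_total == 0:
        return 0.0
    return min(matched / hyp_total, matched / ref_total)


hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(precision(hyp, ref, 2), gleu(hyp, ref), bleu(hyp, ref))
```

In this toy pair no 4-gram matches, so unsmoothed sentence-level BLEU collapses to zero while GLEU still gives partial credit, which is one reason pooled n-gram metrics are convenient at the sentence level.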

Probabilistic Sequence-Level Objectives
In the output sentence $y$, the prediction probability varies among words. Some words are translated by the model with high confidence while others are translated with high uncertainty. However, when calculating the count of n-grams in Eq. (3), all the words in the output sentence are treated equally, regardless of their respective prediction probabilities.

Figure 1: The overview of our model with greedy search. At each decoding step, the predicted word which has the highest probability in the probability vector is selected as context and fed into the RNN, and meanwhile this word and its probability are also used to calculate the probabilistic n-gram count.
To give a more precise description of n-gram counts which considers the variety of prediction probabilities, we use the prediction probability $p(y_j \mid y_{<j}, x, \theta)$ as the count of word $y_j$. Correspondingly, the count of an n-gram occurrence is no longer one but the product of the probabilistic counts of all the words in the n-gram. The probabilistic count of $g = (g_1, \dots, g_n)$ is then calculated by summing over the output sentence $y$ as

$$\tilde{C}_y(g) = \sum_{t=0}^{T-n} 1\{g_1 = y_{t+1}, \dots, g_n = y_{t+n}\} \cdot \prod_{i=1}^{n} p(y_{t+i} \mid y_{<t+i}, x, \theta). \tag{9}$$

Now the probabilistic sequence-level objective can be obtained by replacing $C_y(g)$ with $\tilde{C}_y(g)$ (the tilde indicates the probabilistic version) and keeping the rest unchanged. Here, we take BLEU as an example and show how the probabilistic BLEU (denoted as P-BLEU) is defined. For this purpose, the matching count of n-gram $g$ in Eq. (4) is modified as follows:

$$\tilde{C}_y^{\hat{y}}(g) = \min\big(\tilde{C}_y(g), C_{\hat{y}}(g)\big), \tag{10}$$

and the predicted precision of n-grams changes into

$$\tilde{p}_n = \frac{\sum_{g \in y} \tilde{C}_y^{\hat{y}}(g)}{\sum_{g \in y} \tilde{C}_y(g)}. \tag{11}$$

Finally, the probabilistic BLEU (P-BLEU) is defined as

$$\text{P-BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log \tilde{p}_n\Big). \tag{12}$$

Probabilistic GLEU (P-GLEU) can be defined in a similar way. Specifically, we denote the probabilistic precision of n-grams as P-Pn. The probabilistic precision is more reasonable than recall since the denominator in Eq. (11) plays a normalization role, so we modify the definition in Eq. (8) and define P-GLEU as simply the probabilistic precision of 1-4 grams.
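As a rough illustration of the probabilistic n-gram count and the probabilistic precision P-Pn, the sketch below uses plain Python floats. It is written under stated assumptions: the function names are hypothetical, the per-word probabilities are supplied as a list, and a single reference is used. In actual training these probabilities would be differentiable outputs of the decoder so that gradients can flow through the objective.

```python
from collections import Counter, defaultdict


def prob_ngram_counts(tokens, probs, n):
    """Probabilistic count: each n-gram occurrence contributes the product
    of the prediction probabilities of its n words instead of 1."""
    counts = defaultdict(float)
    for t in range(len(tokens) - n + 1):
        weight = 1.0
        for p in probs[t:t + n]:
            weight *= p
        counts[tuple(tokens[t:t + n])] += weight
    return counts


def prob_precision(tokens, probs, ref, n):
    """P-Pn: clipped probabilistic matches over the total probabilistic count."""
    hyp_counts = prob_ngram_counts(tokens, probs, n)
    ref_counts = Counter(tuple(ref[t:t + n]) for t in range(len(ref) - n + 1))
    # The reference side keeps ordinary integer counts; only the output is probabilistic.
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


# Toy values chosen for illustration only.
output = ["the", "cat", "sat"]
output_probs = [0.9, 0.8, 0.5]
reference = ["the", "cat", "slept"]
print(prob_precision(output, output_probs, reference, 1))  # 1.7 / 2.2
print(prob_precision(output, output_probs, reference, 2))  # 0.72 / 1.12
```

When every probability equals 1, the probabilistic count reduces to the ordinary n-gram count, so P-Pn coincides with the usual clipped n-gram precision.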
The general probabilistic loss function is:

$$\mathcal{L}(\theta) = -\sum_{m=1}^{M} P(y^m, \hat{y}^m), \tag{13}$$

where $P$ represents the probabilistic sequence-level objective, and $y^m$ and $\hat{y}^m$ are the predicted translation and the ground truth for the $m$-th sentence respectively. The calculation of the probabilistic objective is illustrated in Figure 1. This probabilistic loss can work with decoding strategies such as greedy search and teacher forcing. In this paper we employ greedy search rather than teacher forcing so as to use the previously predicted words as context and alleviate the exposure bias problem.

[...] (Papineni et al., 2002) for the translation task. We apply our method to an attention-based NMT system implemented in PyTorch. Both source and target vocabularies are limited to 30K. All word embedding sizes are set to 512, and the sizes of hidden units in both encoder and decoder RNNs are also set to 512. All parameters are initialized from a uniform distribution over [−0.1, 0.1]. The mini-batch stochastic gradient descent (SGD) algorithm is employed to train the model with a batch size of 40. In addition, the learning rate is adjusted by the Adadelta optimizer (Zeiler, 2012) with ρ = 0.95 and ε = 1e-6. Dropout is applied on the output layer with a dropout rate of 0.5. The beam size is set to 10.

Performance
Systems. We first pretrain the baseline model by maximum likelihood estimation (MLE) and then refine the model using probabilistic sequence-level objectives, including P-BLEU, P-GLEU and P-P2 (probabilistic 2-gram precision). In addition, we reproduce previous works which train the NMT model through minimum risk training (MRT) (Shen et al., 2016) and the REINFORCE algorithm (RF) (Ranzato et al., 2015). When reproducing their works, we set BLEU, GLEU and 2-gram precision as training objectives respectively and find that GLEU yields the best performance. In the following, we only report the results with GLEU as the training objective.

Performance. Table 1 shows the translation performance on the test sets measured in BLEU score. Simply training the NMT model with the probabilistic 2-gram precision achieves an improvement of 1.5 BLEU points, which significantly outperforms the reinforcement-based algorithms. We also test the precision of other n-grams and their combinations, but do not notice significant improvements over P-P2. Notice that our method only changes the loss function, without any modification to the model structure or the training data.

Why Pretraining
We use the probabilistic loss to finetune the baseline model rather than training from scratch. This is in line with our motivation: to alleviate the exposure bias and make the model exposed to its own output during training. In the very beginning of the training, the model's translation capability is nearly zero and the generated sentences are often meaningless and do not contain useful information for the training, so it is unreasonable to directly apply the greedy search strategy. Therefore, we first apply the teacher forcing algorithm to pretrain the model, and then we let the model generate the sentences itself and learn from its own outputs.
Another reason favoring pretraining is that it lowers the training cost. The training cost of the introduced probabilistic loss is about three times higher than that of cross-entropy, so training from scratch with it would take much longer than usual. By contrast, the cost is acceptable when the probabilistic loss is used only for fine-tuning.

Effect of Decoding Strategy
The probabilistic loss, defined in Eq. (13), is computed from the model output $y$ and the reference $\hat{y}$. In this section, we apply two different decoding strategies to generate $y$: (1) teacher forcing, which uses the ground truth as decoder input, and (2) greedy search, which feeds the word with maximum probability. By conducting this experiment, we attempt to figure out where the improvements come from: the modification of the loss or the mitigation of exposure bias. Figure 2 shows the learning curves of the two decoding strategies with training objective P-P2. Teacher forcing yields an improvement of about 0.5 BLEU points, and greedy search outperforms teacher forcing by nearly 1 BLEU point. We conclude that the probabilistic loss has its own advantage even when trained with the teacher forcing algorithm, and that greedy search is effective in alleviating exposure bias.
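The difference between the two decoding strategies can be illustrated with a toy decoder. Everything in this sketch is hypothetical: `TOY_MODEL` is a hand-written lookup table standing in for the decoder, it conditions only on the previous word, and decoding is capped at the reference length for simplicity. In both strategies the output word at each step is the most probable one; they differ only in which word is fed back as context.

```python
# Hypothetical toy "decoder": next-word distribution given only the previous word.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "dog": 0.3},
    "a": {"dog": 0.8, "cat": 0.2},
    "cat": {"sleeps": 0.9, "</s>": 0.1},
    "dog": {"barks": 0.9, "</s>": 0.1},
    "sleeps": {"</s>": 1.0},
    "barks": {"</s>": 1.0},
}


def decode(reference, strategy):
    """Return the predicted words and their probabilities.

    strategy == "teacher_forcing": feed the ground-truth word as next context.
    strategy == "greedy": feed the model's own most probable word as next context.
    """
    context, words, probs = "<s>", [], []
    for step in range(len(reference)):
        dist = TOY_MODEL[context]
        word = max(dist, key=dist.get)  # most probable word at this step
        words.append(word)
        probs.append(dist[word])
        context = reference[step] if strategy == "teacher_forcing" else word
        if context == "</s>":  # stop once the end-of-sentence token is fed back
            break
    return words, probs


reference = ["a", "dog", "barks"]
print(decode(reference, "teacher_forcing"))  # (['the', 'dog', 'barks'], [0.6, 0.8, 0.9])
print(decode(reference, "greedy"))           # (['the', 'cat', 'sleeps'], [0.6, 0.7, 0.9])
```

Under teacher forcing the first mistake is corrected by the ground-truth context at the next step, whereas under greedy search the model continues from its own prediction, which is the situation it faces at inference. The words and probabilities returned here are exactly the inputs a probabilistic n-gram objective needs.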
Notice that the greedy search strategy relies heavily on the probabilistic loss and cannot be used independently of it. Greedy search together with the word-level loss is very similar to scheduled sampling (SS). However, SS is inconsistent with the word-level loss, since the word-level loss requires strict alignment between hypothesis and reference, which can only be accomplished by the teacher forcing algorithm.

Correlation with Evaluation Metrics
In this section, we explore how the probabilistic objective correlates with the real evaluation metric. We randomly sample 100 pairs of sentences from the training set and compute their P-GLEU and GLEU scores (GLEU has been reported to perform better than BLEU in sentence-level evaluation).
Directly computing the correlation between GLEU and P-GLEU gives a correlation coefficient of 0.86, which indicates strong correlation. In addition, we draw the scatter diagram of the 100 pairs of sentences in Figure 3, with GLEU as the x-axis and P-GLEU as the y-axis. Figure 3 shows that P-GLEU correlates well with GLEU, suggesting that it is reasonable to directly train the NMT model with P-GLEU.
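A correlation of this kind can be computed as sketched below. The text does not state which coefficient is used, so the Pearson coefficient here is an assumption, and the score lists are placeholder values for illustration only, not the sampled data behind the reported result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder sentence-level scores (hypothetical, for illustration only).
gleu_scores = [0.10, 0.25, 0.40, 0.55, 0.70]
p_gleu_scores = [0.08, 0.20, 0.37, 0.41, 0.66]
print(pearson(gleu_scores, p_gleu_scores))
```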

Conclusion
The word-level loss cannot evaluate the translation properly and suffers from exposure bias, while sequence-level objectives are usually non-differentiable and require gradient estimation. We propose probabilistic sequence-level objectives based on n-gram matching, which remove the dependence on gradient estimation and can directly train the NMT model. Experimental results show that our method significantly outperforms previous sequence-level training works and successfully alleviates exposure bias by performing greedy search.