Simple Fusion: Return of the Language Model

Neural Machine Translation (NMT) typically leverages monolingual data in training through backtranslation. We investigate an alternative simple method to use monolingual data for NMT training: We combine the scores of a pre-trained and fixed language model (LM) with the scores of a translation model (TM) while the TM is trained from scratch. To achieve that, we train the translation model to predict the residual probability of the training data added to the prediction of the LM. This enables the TM to focus its capacity on modeling the source sentence since it can rely on the LM for fluency. We show that our method outperforms previous approaches to integrate LMs into NMT while being architecturally simpler, as it does not require a gating network to balance TM and LM. We observe gains of between +0.24 and +2.36 BLEU on all four test sets (English-Turkish, Turkish-English, Estonian-English, Xhosa-English) on top of ensembles without LM. We compare our method with alternative ways to utilize monolingual data, such as backtranslation, shallow fusion, and cold fusion.


Introduction
Machine translation (MT) relies on parallel training data, which is difficult to acquire. In contrast, monolingual data is abundant for most languages and domains. Traditional statistical machine translation (SMT) effectively leverages monolingual data using language models (LMs) (Brants et al., 2007). The combination of LM and TM in SMT can be traced back to the noisy-channel model, which applies the Bayes rule to decompose a translation system (Brown et al., 1993):

$$\hat{y} = \arg\max_y P(y|x) = \arg\max_y P_{TM}(x|y) \cdot P_{LM}(y) \quad (1)$$

where $x = (x_1, \ldots, x_m)$ is the source sentence, $y = (y_1, \ldots, y_n)$ is the target sentence, and $P_{TM}(\cdot)$ and $P_{LM}(\cdot)$ are the translation model and language model probabilities.
In contrast, NMT (Sutskever et al., 2014; Bahdanau et al., 2014) uses a discriminative model and learns the distribution P(y|x) directly end-to-end. Therefore, the vanilla training regimen for NMT is not amenable to integrating an LM or monolingual data in a straightforward manner.
An early attempt to use LMs for NMT, also known as shallow fusion, combines LM and NMT scores at inference time in a log-linear model (Gulcehre et al., 2015, 2017). In contrast, we integrate the LM scores during NMT training. Our training procedure first trains an LM on a large monolingual corpus. We then hold the LM fixed and train the NMT system to optimize the combined score of LM and NMT on the parallel training set. This allows the NMT model to focus on modeling the source sentence, while the LM handles the generation based on the target-side history. Sriram et al. (2017) explored a similar idea for speech recognition using a gating network for controlling the relative contribution of the LM. We show that our simpler architecture without an explicit control mechanism is effective for machine translation. We observe gains of more than 2 BLEU points in the best case from adding the LM to TM training. We also show that our method can be combined with backtranslation (Sennrich et al., 2016a), yielding further gains over systems without LM.

Related Work

Inference-time Combination
Shallow fusion (Gulcehre et al., 2015) integrates an LM by changing the decoding objective to:

$$\hat{y} = \arg\max_y \log P_{TM}(y|x) + \lambda \log P_{LM}(y) \quad (2)$$
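The per-step combination is easy to sketch in code. Below is a minimal PyTorch illustration, assuming both models expose logits over a shared target vocabulary; the tensor names and the default value of `lam` (playing the role of λ) are assumptions for illustration:

```python
import torch

def shallow_fusion_step(tm_logits, lm_logits, lam=0.1):
    """One decoding step of shallow fusion: normalize both models' scores
    and add the LM log-probabilities weighted by lam.
    tm_logits, lm_logits: (batch, vocab) unnormalized scores."""
    log_p_tm = torch.log_softmax(tm_logits, dim=-1)
    log_p_lm = torch.log_softmax(lm_logits, dim=-1)
    # Beam search accumulates these per-token scores over the hypothesis.
    return log_p_tm + lam * log_p_lm
```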

Cold Fusion
Shallow fusion combines a fixed TM with a fixed LM at inference time. Sriram et al. (2017) proposed to keep the LM fixed, but train a sequence-to-sequence (Seq2Seq) model from scratch which includes the LM as a fixed part of the network. They argue that this approach allows the Seq2Seq network to use its model capacity for conditioning on the source sequence since the language modeling aspect is already covered by the LM. Their cold fusion architecture includes a gating network which learns to regulate the contributions of the LM at each time step. They demonstrated superior performance of cold fusion on a speech recognition task. Gulcehre et al. (2015, 2017) suggest combining a pre-trained RNN-LM with a pre-trained NMT system using a controller network that dynamically adjusts the weights between RNN-LM and NMT at each time step (deep fusion). Both deep fusion and n-best reranking with count-based LMs have been used in WMT evaluation systems (Jean et al., 2015; Wang et al., 2017). An important limitation of these approaches is that LM and TM are trained independently.
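For concreteness, a rough sketch of such a gated combination is given below. It follows the general shape of cold fusion (project the frozen LM's logits, gate them with the decoder state, fuse, and predict), but the layer sizes and exact wiring are illustrative assumptions rather than the architecture of Sriram et al. (2017):

```python
import torch
import torch.nn as nn

class ColdFusionLayer(nn.Module):
    """Sketch of a cold-fusion-style output layer: the LM is frozen, its
    logits are projected to a hidden vector, a gate computed from the TM
    decoder state modulates that vector, and the fused representation is
    projected to the output vocabulary."""
    def __init__(self, dec_dim, vocab_size, lm_hidden=256):
        super().__init__()
        self.lm_proj = nn.Linear(vocab_size, lm_hidden)
        self.gate = nn.Linear(dec_dim + lm_hidden, lm_hidden)
        self.out = nn.Linear(dec_dim + lm_hidden, vocab_size)

    def forward(self, dec_state, lm_logits):
        h_lm = self.lm_proj(lm_logits.detach())           # LM stays fixed
        g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
        fused = torch.cat([dec_state, g * h_lm], dim=-1)  # gated LM features
        return torch.log_softmax(self.out(fused), dim=-1)
```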

Other Approaches
A second line of research augments the parallel training data with additional synthetic data from a monolingual corpus in the target language. The source sentences can be generated with a separate translation system (Schwenk, 2008; Sennrich et al., 2016a) (backtranslation), or simply copied over from the target side (Currey et al., 2017). Since data augmentation methods rely on some balance between real and synthetic data (Sennrich et al., 2016a; Currey et al., 2017; Poncelas et al., 2018), they can often only use a small fraction of the available monolingual data. A third class of approaches changes the NMT training loss function to incorporate monolingual data. For example, Cheng et al. (2016) and Tu et al. (2017) proposed to add autoencoder terms to the training objective which capture how well a sentence can be reconstructed from its translated representation. However, training with respect to the new loss is often computationally intensive and requires approximations. Alternatively, multi-task learning has been used to incorporate source-side (Zhang and Zong, 2016) and target-side (Domhan and Hieber, 2017) monolingual data. Another way of utilizing monolingual data in both the source and target language is to warm-start Seq2Seq training from pre-trained encoder and decoder networks (Ramachandran et al., 2017; Skorokhodov et al., 2018). We note that pre-training can be used in combination with our approach.
An extreme form of leveraging monolingual training data is unsupervised NMT (Lample et al., 2017;Artetxe et al., 2017) which removes the need for parallel training data entirely. In this work, we assume to have access to some amount of parallel training data, but aim to improve the translation quality even further by using a language model.

Translation Model Training under Language Model Predictions
In the spirit of the cold fusion technique of Sriram et al. (2017), we also keep the LM fixed when training the translation network. However, we greatly simplify the architecture by removing the need for a gating network. We follow the usual left-to-right factorization in NMT:

$$P(y|x) = \prod_{t=1}^{n} P(y_t|y_1^{t-1}, x) \quad (3)$$

Let $S_{TM}(y_t|y_1^{t-1}, x)$ be the output of the TM projection layer without softmax, i.e. what we would normally call the logits. We investigate two different ways to parameterize $P(y_t|y_1^{t-1}, x)$ using $S_{TM}(y_t|y_1^{t-1}, x)$ and a fixed and pre-trained language model $P_{LM}(\cdot)$: POSTNORM and PRENORM.
POSTNORM This variant is directly inspired by shallow fusion (Eq. 2): we turn $S_{TM}(y_t|y_1^{t-1}, x)$ into a probability distribution using a softmax layer and sum its log-probabilities with the log-probabilities of the LM, i.e. we multiply their probabilities:

$$P(y_t|y_1^{t-1}, x) = \text{softmax}(S_{TM}(y_t|y_1^{t-1}, x)) \cdot P_{LM}(y_t|y_1^{t-1}) \quad (4)$$

PRENORM Another option is to apply normalization after combining the raw $S_{TM}(y_t|y_1^{t-1}, x)$ scores with the LM log-probability:

$$P(y_t|y_1^{t-1}, x) = \text{softmax}(S_{TM}(y_t|y_1^{t-1}, x) + \log P_{LM}(y_t|y_1^{t-1})) \quad (5)$$

Theoretical Discussion of POSTNORM and PRENORM

Note that $P(y_t|y_1^{t-1}, x)$ might not represent a valid probability distribution under the POSTNORM criterion since, as the component-wise product of two distributions, it is not guaranteed to sum to 1. A way to fix this issue would be to combine TM and LM probabilities in probability space rather than in log space. However, we have found that probability-space combination does not work as well as POSTNORM in our experiments. We can describe $S_{TM}(y_t|y_1^{t-1}, x)$ under POSTNORM informally as the residual probability added to the prediction of the LM.
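Both strategies amount to a one-line change in how per-token log scores are computed before the training loss. The snippet below is a minimal PyTorch sketch, assuming the frozen LM provides log-probabilities over the same subword vocabulary as the TM; the tensor names are hypothetical:

```python
import torch
import torch.nn.functional as F

def postnorm_log_scores(tm_logits, lm_log_probs):
    # POSTNORM (Eq. 4): normalize the TM scores first, then add the LM
    # log-probabilities, i.e. multiply the two distributions.
    return torch.log_softmax(tm_logits, dim=-1) + lm_log_probs

def prenorm_log_probs(tm_logits, lm_log_probs):
    # PRENORM (Eq. 5): add the LM log-probabilities to the raw TM scores,
    # then normalize with a single softmax.
    return torch.log_softmax(tm_logits + lm_log_probs, dim=-1)

# Training sketch: the LM is pre-trained and kept fixed, so gradients flow
# only into tm_logits. With hypothetical tensors tm_logits (batch, vocab),
# lm_log_probs (batch, vocab) and gold token ids y_t of shape (batch,):
#   loss = F.nll_loss(prenorm_log_probs(tm_logits, lm_log_probs), y_t)
# (or the POSTNORM scores in place of the PRENORM ones).
```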
It is interesting to investigate what signal is actually propagated into $S_{TM}(y_t|y_1^{t-1}, x)$ when training with the PRENORM strategy. We can rewrite $P(y_t|y_1^{t-1}, x)$ as:

$$P(y_t|y_1^{t-1}, x) = \frac{P(y_t, x|y_1^{t-1})}{P(x|y_1^{t-1})} = P_{LM}(y_t|y_1^{t-1}) \cdot \frac{P(x|y_1^{t})}{P(x|y_1^{t-1})} \quad (6)$$

Alternatively, we can decompose $P(y_t|y_1^{t-1}, x)$ as follows using Eq. 5:

$$P(y_t|y_1^{t-1}, x) \propto P_{LM}(y_t|y_1^{t-1}) \cdot \exp(S_{TM}(y_t|y_1^{t-1}, x)) \quad (7)$$

Combining Eq. 6 and Eq. 7 leads to:

$$\exp(S_{TM}(y_t|y_1^{t-1}, x)) \propto \frac{P(x|y_1^{t})}{P(x|y_1^{t-1})} \quad (8)$$

This means that $S_{TM}(y_t|y_1^{t-1}, x)$ under PRENORM is trained to predict how much more likely the source sentence becomes when a particular target token $y_t$ is revealed.

Experimental Setup
We evaluate our method on a variety of publicly available and proprietary data sets. For our Turkish-English (tr-en), English-Turkish (en-tr), and Estonian-English (et-en) experiments we use all available parallel data from the WMT18 evaluation campaign to train the translation models. Our language models are trained on News Crawl 2017. We use news-test2017 as development ("dev") set and news-test2018 as test set.
Additionally, we collected our own proprietary corpus of public posts on Facebook. We refer to it as the 'INTERNAL' data set. This corpus consists of monolingual English in-domain sentences and parallel data in Xhosa-English. Training set sizes are summarized in Tables 1 and 2. Our preprocessing consists of lower-casing, tokenization, and subword segmentation using joint byte pair encoding (Sennrich et al., 2016b) with 16K merge operations. On Turkish, we additionally remove diacritics from the text. On WMT we use lower-cased SacreBLEU (Post, 2018) to be comparable with the literature. On our internal data we report tokenized BLEU scores.
Our Seq2Seq models are encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2014) with dot-product attention (Luong et al., 2015b) trained with our PyTorch Translate library. Both decoder and encoder consist of two 512-dimensional LSTM layers and 256-dimensional embeddings. The first encoder layer is bidirectional; the second one runs from right to left. Our training and architecture hyperparameters are summarized in Tab. 3. Our LSTM-based LMs have the same size and architecture as the decoder networks, but do not use attention and do not condition on the source sentence. We run beam search with a beam size of 6 in all our experiments.
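As a point of reference, dot-product attention of the kind used in the decoder can be sketched as follows; this is a generic Luong-style implementation under assumed tensor shapes, not code from PyTorch Translate:

```python
import torch

def dot_product_attention(dec_state, enc_outputs, src_mask):
    """dec_state: (batch, hidden), enc_outputs: (batch, src_len, hidden),
    src_mask: (batch, src_len) with 1 for real tokens and 0 for padding."""
    scores = torch.bmm(enc_outputs, dec_state.unsqueeze(-1)).squeeze(-1)
    scores = scores.masked_fill(src_mask == 0, float('-inf'))
    alpha = torch.softmax(scores, dim=-1)              # attention weights
    context = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)
    return context, alpha                              # (batch, hidden), (batch, src_len)
```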
For each setup we train five models using SGD (batch size of 32 sentences) with learning rate decay and label smoothing, and either select the best one (single system) or ensemble the four best models based on dev set BLEU score.
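Ensembling in NMT is typically implemented by averaging the per-step predictive distributions of the selected models during beam search. A minimal sketch, assuming each model is a callable returning logits for the current step:

```python
import torch

def ensemble_step_log_probs(models, step_inputs):
    """Average the probability distributions of several models for one
    decoding step and return log-probabilities for beam search."""
    probs = [torch.softmax(m(step_inputs), dim=-1) for m in models]
    return torch.log(torch.stack(probs, dim=0).mean(dim=0))
```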

Results
Tab. 4 compares our methods PRENORM and POSTNORM on the tested language pairs. Shallow fusion (Sec. 2.1) often leads to minor improvements over the baseline for both single systems and ensembles. We also reimplemented the cold fusion technique (Sec. 2.2) for comparison. For our machine translation experiments we report mixed results with cold fusion, with performance ranging from a 0.33 BLEU gain on Xhosa-English to slight BLEU degradation in most of our Turkish-English experiments.

Table 4: Comparison of our PRENORM and POSTNORM combination strategies with shallow fusion (Gulcehre et al., 2015) and cold fusion (Sriram et al., 2017) under an RNN-LM.
Both of our methods, PRENORM and POSTNORM, yield significant improvements in BLEU across the board. We report more consistent gains with POSTNORM than with PRENORM. All our POSTNORM systems outperform both shallow fusion and cold fusion on all language pairs, yielding test set gains of up to +2.36 BLEU (Xhosa-English ensembles).

Discussion and Analysis
Backtranslation A very popular technique to use monolingual data for NMT is backtranslation (Sennrich et al., 2016a). Backtranslation uses a reverse NMT system to translate monolingual target language sentences into the source language, and adds the newly generated sentence pairs to the training data. The amount of monolingual data which can be used for backtranslation is usually limited by the size of the parallel corpus, as translation quality suffers when the mixing ratio between synthetic and real source sentences is too large (Poncelas et al., 2018). This is a severe limitation particularly for low-resource MT. Fig. 1 shows that both our baseline system without LM and our POSTNORM system benefit greatly from backtranslation up to a mixing ratio of 1:8, but degrade slightly if this ratio is exceeded. POSTNORM is significantly better than the baseline even when used in combination with backtranslation.
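A minimal sketch of how such a synthetic-to-real cap could be applied when assembling the training corpus (the list-of-pairs representation and helper name are assumptions for illustration; the 1:8 cap follows Fig. 1):

```python
import random

def mix_with_backtranslation(real_pairs, synthetic_pairs, max_ratio=8):
    """Cap backtranslated data at max_ratio times the real parallel data
    and shuffle the combined corpus. Both arguments are lists of
    (source, target) sentence pairs."""
    capped = synthetic_pairs[:max_ratio * len(real_pairs)]
    mixed = real_pairs + capped
    random.shuffle(mixed)
    return mixed
```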
Training convergence We have found that training converges faster under the POSTNORM loss. Fig. 2 plots the training curves of our systems. The baseline (orange curve) reaches its maximum of 19.39 BLEU after 28 training epochs. POSTNORM surpasses this BLEU score already after 12 epochs.
Language model type So far we have used recurrent neural network language models (Mikolov et al., 2010, RNN-LM) with LSTM cells in all our experiments. We can also parameterize an n-gram language model with a feedforward neural network (Bengio et al., 2003, FFN-LM). In order to compare both language model types, we trained a 4-gram feedforward LM with two 512-dimensional hidden layers and 256-dimensional embeddings on Turkish monolingual data. Tab. 5 shows that the PRENORM strategy works particularly well for the n-gram LM. However, using an RNN-LM with the POSTNORM strategy still gives the best overall performance. Using both RNN and n-gram LMs at the same time does not improve translation quality any further (Tab. 6).
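A feedforward n-gram LM of the kind described here can be sketched as follows; the sizes follow the text (4-gram context, 256-dimensional embeddings, two 512-dimensional hidden layers), while the activation function and other details are assumptions:

```python
import torch
import torch.nn as nn

class FeedforwardNgramLM(nn.Module):
    """Sketch of a 4-gram feedforward LM in the style of Bengio et al. (2003):
    the three preceding tokens are embedded, concatenated, and passed through
    two hidden layers to predict the next token."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512, context=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(context * emb_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, context_tokens):             # (batch, 3) previous token ids
        e = self.embed(context_tokens).flatten(1)  # (batch, 3 * emb_dim)
        return torch.log_softmax(self.mlp(e), dim=-1)
```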
Table 9: Example translations produced by the baseline without LM and by our PRENORM system, compared to the reference.

Reference: He says that years later, he still lives in fear.
Baseline (no LM): He says that, for years, he still lives in fear.
This work (PRENORM): He says that many years later he still lives in fear.

Reference: "I'm afraid," he says.
Baseline (no LM): "I fear," says he.
This work (PRENORM): "I am afraid," he says.

Impact on the TM distribution With the POSTNORM strategy, the TM still produces a distribution over the target vocabulary as the scores are normalized before the combination with the LM. This raises a natural question: How different are the distributions generated by a TM trained under the POSTNORM loss from the distributions of the baseline system without LM? Tab. 7 gives some insight into that question. As expected, the RNN-LM has a higher perplexity than the baseline, as it is a weaker model of translation. The RNN-LM also has a higher average entropy, which indicates that the LM distributions are smoother than those from the baseline translation model. The TM trained under the POSTNORM loss has a much higher perplexity, which suggests that it relies strongly on the LM predictions and performs poorly when not combined with it. However, its average entropy is much lower (1.82) than that of both other models, i.e. it produces much sharper distributions.
Language models improve fluency A traditional interpretation of the role of an LM in MT is that it is (also) responsible for the fluency of translations (Koehn, 2009). Thus, we would expect more fluent translations from our method than from a system without LM. Tab. 8 breaks down the BLEU score of the baseline and the PRENORM ensembles on Estonian-English into n-gram precisions. Most of the BLEU gains can be attributed to the increase in precision of higher order n-grams, indicating improvements in fluency. Tab. 9 shows some examples where our PRENORM system produces a more fluent translation than the baseline.
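The n-gram precisions in Tab. 8 are the standard modified (clipped) precisions underlying BLEU. A single-sentence sketch for one order n is shown below; BLEU itself aggregates these counts over the whole corpus and adds a brevity penalty:

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Modified n-gram precision of one tokenized hypothesis against one
    tokenized reference: clipped n-gram matches divided by hypothesis n-grams."""
    hyp = Counter(tuple(hypothesis[i:i + n]) for i in range(len(hypothesis) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / max(sum(hyp.values()), 1)
```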
Training set size We artificially reduced the size of the English-Turkish training set even further to investigate how well our method performs in low-resource settings (Fig. 3). Our POSTNORM strategy outperforms the baseline regardless of the number of training sentences, but the gains are smaller on very small training sets.

Conclusion
We have presented a simple yet very effective method to use language models in NMT which incorporates the LM directly into NMT training. We reported significant and consistent gains from our method in four language directions over two alternative ways to integrate LMs into NMT (shallow fusion and cold fusion), and showed that our approach works well even in combination with backtranslation and on top of ensembles. Our method leads to faster training convergence and more fluent translations than a baseline system without LM.