Using Context in Neural Machine Translation Training Objectives

We present Neural Machine Translation (NMT) training using document-level metrics with batch-level documents. Previous sequence-objective approaches to NMT training focus exclusively on sentence-level metrics like sentence BLEU, which do not correspond to the desired evaluation metric, typically document BLEU. Meanwhile research into document-level NMT training focuses on data or model architecture rather than training procedure. We find that each of these lines of research has a clear space in it for the other, and propose merging them with a scheme that allows a document-level evaluation metric to be used in the NMT training objective. We first sample pseudo-documents from sentence samples. We then approximate the expected document BLEU gradient with Monte Carlo sampling for use as a cost function in Minimum Risk Training (MRT). This two-level sampling procedure gives NMT performance gains over sequence MRT and maximum-likelihood training. We demonstrate that training is more robust with document-level metrics than with sequence metrics. We further demonstrate improvements on NMT with TER and Grammatical Error Correction (GEC) using GLEU, both metrics used at the document level for evaluations.


Introduction
Neural Machine Translation (NMT) research has explored token-level likelihood functions (Sutskever et al., 2014; Bahdanau et al., 2015) and sequence-level objectives inspired by reinforcement learning (Ranzato et al., 2016; Bahdanau et al., 2016) or expected Minimum Risk Training (MRT). A typical sequence objective in these cases is based on sentence-level BLEU (sBLEU) (Edunov et al., 2018). However, sBLEU, even if aggregated over sentences, is only an approximation of the desired metric, document-level BLEU. Beyond translation, many metrics for natural language tasks do not have robust sentence-level approximations. A logical progression is the extension of sequence-level NMT training objectives to include context from outside the sentence.
* Now at Google
Document-based NMT, by contrast, aims to use out-of-sentence context to improve translation. Recent research explores lexical consistency by providing additional sentences during training (Maruf et al., 2019; Voita et al., 2018, 2019) or inference (Voita et al., 2019; Stahlberg et al., 2019), potentially with adjustments to model architecture. However, to the best of our knowledge, no attempt has been made to extend sequence-level neural training objectives to include document-level reward functions. This is despite document-level BLEU being arguably the most common NMT metric, and being the function originally optimised by Minimum Error Rate Training (MERT) for Statistical Machine Translation (SMT) (Och, 2003).
We propose merging lines of research on training objectives and document-level translation. We achieve this by presenting a document-level approach to sequence-level objectives which brings the training objective closer to the actual evaluation metric, using MRT as a representative example. We demonstrate MRT under document-level BLEU as well as Translation Edit Rate (TER) (Snover, 2006), which, while decomposable to sentence level, is less noisy when used over documents. We consider both pseudo-documents, where sentences are assigned randomly to a mini-batch, and true document context, where all sentences in the batch are from the same document.
We finally apply our scheme to supervised Grammatical Error Correction, for which using neural models is becoming increasingly popular (Xie et al., 2016;Stahlberg et al., 2019).

Related Work
Minimum Error Rate Training was introduced for phrase-based SMT with document-level BLEU (Och, 2003). Subsequent work extends these ideas to NMT, using expected minimum risk at the sequence level with an sBLEU cost for end-to-end NMT training. Edunov et al. (2018) explore random and beam sampling for NMT sequence-MRT, as well as other sequence-level training losses.
Related developments in NMT include combined reinforcement-learning/cross-entropy approaches such as MIXER (Ranzato et al., 2016), which itself has origins in the REINFORCE algorithm described by Williams (1992). We do not explore such approaches, although our document-sampling and document-metric schemes could in principle be extended to them.
Sequence-level MRT has seen success outside NMT. Ayana et al. (2016) use sequence MRT for summarization, while Shannon (2017) uses a related approach for speech recognition. MRT can be seen as a special case of neural reinforcement learning, which has been applied to GEC with sequence-level costs. Closest to our approach is the work of Jean and Cho (2019) on NMT with a minibatch-context-sensitive training procedure. However, they do not optimize on document metrics over those contexts. They also sample contexts randomly, while we find diverse context sampling is important for the success of document-MRT.

Sequence-level MRT
Sentence-level MRT for NMT aims to minimize the expected loss on training data with a loss function between sampled target sentences $y$ and gold reference sentences $y^*$. For NMT a common sentence-level cost function $\Delta(y, y^*)$ is $1 - \mathrm{sBLEU}$, where sBLEU is smoothed by setting initial n-gram counts to 1 (Edunov et al., 2018).
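We take $N$ samples for each of the $S$ sentences in a mini-batch. We write the cost function between the $s$-th reference in a mini-batch, $y^{(s)*}$, and its $n$-th sample, $y_n^{(s)}$, as $\Delta_n^{(s)} = \Delta(y_n^{(s)}, y^{(s)*})$. With $x^{(s)}$ the $s$-th source sentence, the risk gradient for end-to-end NMT with MRT, with sample-count scaling, is then:
$$\nabla_\theta R(\theta) = \frac{1}{N} \sum_{s=1}^{S} \sum_{n=1}^{N} \Delta_n^{(s)} \, \nabla_\theta \log P(y_n^{(s)} \mid x^{(s)}; \theta) \qquad (1)$$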
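As a minimal sketch of the sequence-level cost, the following computes 1 - sBLEU for each sample and scales it by the sample count, giving the scalar that multiplies the gradient of each sample's log-probability. It assumes whitespace tokenization and add-one smoothing applied to every n-gram order, which is one possible reading of the smoothing described above; the function names are illustrative and do not come from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_sbleu(hyp, ref, max_n=4):
    """Sentence BLEU with add-one smoothing on all n-gram orders (assumed variant)."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    if not hyp_toks:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp_toks, n), ngrams(ref_toks, n)
        matches = sum((h & r).values())  # clipped n-gram matches
        total = sum(h.values())
        log_prec += math.log((matches + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(ref_toks) / len(hyp_toks)))
    return brevity * math.exp(log_prec / max_n)


def seq_mrt_weights(samples, references):
    """samples[s][n] is the n-th sample for sentence s.

    Returns weights[s][n] = cost / N, the factor applied to the gradient of
    log P(sample | source) for that sample.
    """
    weights = []
    for sents, ref in zip(samples, references):
        n_samples = len(sents)
        weights.append([(1.0 - smoothed_sbleu(y, ref)) / n_samples for y in sents])
    return weights


weights = seq_mrt_weights(
    [["the cat sat on the mat", "a dog"]],
    ["the cat sat on the mat"],
)
```

A sample identical to its reference receives zero weight, while poor samples receive weights approaching 1/N.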

Document-level MRT
By analogy with sequence-level MRT, we consider MRT over batches of S sentence pairs, which we treat as a pseudo-document. In practice we experiment both with sentences chosen randomly from all training data, and with true context where all sentences per batch are from a single document.
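Let $X = [x^{(1)}, \ldots, x^{(S)}]$ be the source document, $Y = [y^{(1)}, \ldots, y^{(S)}]$ a document of candidate translations, and $Y^* = [y^{(1)*}, \ldots, y^{(S)*}]$ the reference translations. A document-level metric $D(Y, Y^*)$, which may be non-differentiable, replaces the sequence-level metric $\Delta(y, y^{(s)*})$. We define the document-level risk:
$$R(\theta) = \sum_{Y} P(Y \mid X; \theta) \, D(Y, Y^*)$$
Using $p_\theta \nabla_\theta \log p_\theta = \nabla_\theta p_\theta$, and defining $L(Y) = \log P(Y \mid X; \theta)$ for brevity:
$$\nabla_\theta R(\theta) = \sum_{Y} P(Y \mid X; \theta) \, D(Y, Y^*) \, \nabla_\theta L(Y) = \mathbb{E}\left[ D(Y, Y^*) \, \nabla_\theta L(Y) \mid X; \theta \right]$$
Using simple Monte-Carlo, after Shannon (2017), we replace the expectation by an average taken over $N$ sampled translation documents $Y_n \sim P(Y \mid X; \theta)$:
$$\nabla_\theta R(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} D(Y_n, Y^*) \, \nabla_\theta L(Y_n)$$
The $n$-th sample for the $s$-th sentence in the batch-level document, $y_n^{(s)}$, contributes the following term to the overall gradient:
$$\frac{1}{N} \sum_{n' : \, y_n^{(s)} \in Y_{n'}} D(Y_{n'}, Y^*) \, \nabla_\theta \log P(y_n^{(s)} \mid x^{(s)}; \theta)$$
In other words the gradient of each sample is weighted by the aggregated document-level scores for documents in which the sample appears.
Figure 1: Sample-ordering schemes for MRT with S = 2 sentences / batch and N = 3 samples / sentence, showing sample costs. In sequence-MRT each sample has its own cost (e.g. sBLEU). For doc-MRT (ordered), samples are ordered and sorted into N-wise 'documents', each with a combined cost (e.g. document BLEU). The ordered assignment enforces an extreme range of combined costs. In doc-MRT (random), samples are randomly assigned, making documents on average less diverse with less distinct scores, with a low likelihood of extreme distributions.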
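The per-sample weighting described above can be sketched as follows. This is an illustration under two assumptions that are not stated in the text: sentences are translated independently, so a document's log-probability is the sum of its sentences' log-probabilities, and each sampled document is described by the index of the sample it uses for each sentence. The function names are hypothetical.

```python
def doc_mrt_weights(documents, doc_costs, num_sentences, num_samples):
    """Return weights[s][n], the scalar multiplying grad log P(y_n^(s) | x^(s)).

    documents[k][s] is the index of the sample of sentence s used in sampled
    document k; doc_costs[k] is that document's cost, e.g. 1 - document BLEU.
    """
    num_docs = len(documents)
    weights = [[0.0] * num_samples for _ in range(num_sentences)]
    for doc, cost in zip(documents, doc_costs):
        for s, n in enumerate(doc):
            # A sample accumulates the cost of every document it appears in.
            weights[s][n] += cost / num_docs
    return weights


def surrogate_loss(weights, log_probs):
    """Sum of weight * log-probability over all samples.

    With the weights held constant, the gradient of this quantity with respect
    to the model parameters is the Monte Carlo risk-gradient estimate.
    """
    return sum(w * lp
               for ws, lps in zip(weights, log_probs)
               for w, lp in zip(ws, lps))


# S = 2 sentences, N = 3 samples, one document per sample index.
documents = [[0, 0], [1, 1], [2, 2]]
doc_costs = [0.2, 0.5, 0.9]
weights = doc_mrt_weights(documents, doc_costs, num_sentences=2, num_samples=3)
```

When every sample appears in exactly one document, both sentences in a document share that document's cost divided by N. If a sample were reused across documents, its weight would be the sum of their scaled costs.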

Mini-batch level document sampling
To generate sample documents we first sample sentences. Sentence sampling for NMT generates new tokens in a left-to-right manner.
In left-to-right generation each token is sampled from a distribution conditioned on previously sampled tokens, minimizing exposure bias to gold references which the model is unlikely to see at inference time (Ranzato et al., 2016). Sampling can be via beam search, or random sampling from the model distribution given previously sampled tokens. Beam search produces more likely samples which may be less diverse compared to random sampling (Edunov et al., 2018).
Here we only consider sampling during training. While samples can be more easily generated offline with respect to fixed model parameters, such samples are not representative of the current model.
With $N$ sample translations for each of the $S$ sentence pairs per batch we can construct $N^S$ possible sample documents as sequences of $S$ sentences. Considering all possible documents is intractable unless $N$ and $S$ are small. It also carries the risk that a single sentence will appear in multiple sampled documents, giving it undue weight.
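A small enumeration makes the growth concrete. The sketch below uses placeholder strings in place of real sample translations and counts both the number of possible documents and how often each individual sample recurs across them.

```python
from collections import Counter
from itertools import product


def all_documents(samples):
    """Enumerate every document formed by picking one sample per sentence."""
    return list(product(*samples))


# N = 3 placeholder samples for each of S = 2 sentences.
samples = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
docs = all_documents(samples)

# Each sample appears in N^(S-1) of the N^S documents.
occurrences = Counter(sentence for doc in docs for sentence in doc)

# For larger, still modest settings the count is far beyond enumeration.
num_docs_large = 8 ** 20
```

With N = 3 and S = 2 there are 9 documents and every sample appears in 3 of them, which illustrates the repeated-sentence weighting concern raised above.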
Instead we propose creating $N$ documents by first ordering samples for each sentence (e.g. by sBLEU), then creating the $n$-th sample document $Y_n$ by concatenating the $n$-th sample from each sentence. This gives a set of $N$ diverse documents sampled from $N^S$ possibilities. We expect the sampled documents to be diverse in contents, since a given sentence will only ever occur in a single document context, and diverse in score. We refer to this scheme as ordered document sampling. Figure 1 illustrates ordered document sampling by comparison to a scheme which randomly samples sentences to form documents.
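The two assignment schemes can be sketched as below. The sentence costs are made-up numbers standing in for 1 - sBLEU, and the random scheme is assumed to shuffle each sentence's samples so that every sample is still used exactly once. Both functions assume every sentence has the same number of samples.

```python
import random


def ordered_documents(samples, sentence_costs):
    """Document n holds the n-th lowest-cost sample of every sentence.

    samples[s][n] is a sampled translation and sentence_costs[s][n] its cost.
    """
    ranked = []
    for sents, costs in zip(samples, sentence_costs):
        order = sorted(range(len(sents)), key=lambda i: costs[i])
        ranked.append([sents[i] for i in order])
    return [list(doc) for doc in zip(*ranked)]


def random_documents(samples, rng):
    """Assign each sentence's samples to documents in a random order."""
    shuffled = [rng.sample(sents, len(sents)) for sents in samples]
    return [list(doc) for doc in zip(*shuffled)]


samples = [["s1_a", "s1_b", "s1_c"], ["s2_a", "s2_b", "s2_c"]]
sentence_costs = [[0.5, 0.1, 0.9], [0.3, 0.8, 0.2]]  # illustrative values

ordered = ordered_documents(samples, sentence_costs)
shuffled = random_documents(samples, random.Random(0))
```

In the ordered case the first document collects the best sample of each sentence and the last collects the worst, so document-level costs are spread as widely as the samples allow.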

Experiments
We report on English-German NMT. We initialize with a baseline trained on 17.5M sentence pairs from WMT19 news task datasets (Barrault et al., 2019), on which we learn a 32K-merge joint BPE vocabulary (Sennrich et al., 2016). We validate on newstest2017, and evaluate on newstest2018.
We apply MRT only during fine-tuning, following previous work (Edunov et al., 2018). In early experiments, we found that training from scratch with discriminative objectives (sequence- or document-based) is ineffective. We suspect samples produced early in training are so unlike the references that the model never receives a strong enough signal for effective training.
We fine-tune on old WMT news task test sets (2008-2016) in two settings. With random batches, sentences from different documents are shuffled randomly into mini-batches. In this case doc-MRT metrics are over pseudo-documents. With document batches, each batch contains only sentences from one document, and doc-MRT uses true document context. We use the same sampling temperatures and the same risk sharpness factors for both forms of MRT for each experiment.
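The two batching settings might be constructed as in the sketch below. It makes several simplifying assumptions: batches are sized by sentence count rather than tokens, each sentence pair carries a hypothetical `doc_id` field, and a document longer than one batch is split across consecutive batches.

```python
import random
from collections import defaultdict


def random_batches(pairs, batch_size, rng):
    """Shuffle sentence pairs across documents, then cut into batches."""
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


def document_batches(pairs, batch_size):
    """Build batches whose sentence pairs all come from a single document."""
    by_doc = defaultdict(list)
    for pair in pairs:
        by_doc[pair["doc_id"]].append(pair)
    batches = []
    for doc_pairs in by_doc.values():
        for i in range(0, len(doc_pairs), batch_size):
            batches.append(doc_pairs[i:i + batch_size])
    return batches


# Two toy documents with three sentence pairs each.
pairs = [{"doc_id": d, "src": f"src {d}-{i}", "tgt": f"tgt {d}-{i}"}
         for d in ("docA", "docB") for i in range(3)]

doc_batches = document_batches(pairs, batch_size=2)
rand_batches = random_batches(pairs, batch_size=2, rng=random.Random(0))
```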
For Grammatical Error Correction (GEC) we train on sentences from NUCLE (Dahlmeier et al., 2013) and Lang-8 Learner English (Mizumoto et al., 2012) with at least one correction, a total of 660K sentences. We evaluate on the JFLEG (Napoles et al., 2017) and CoNLL sets.
For all models we use a Transformer model (Vaswani et al., 2017) with the 'base' Tensor2Tensor parameters (Vaswani et al., 2018).
We train to validation set BLEU convergence on a single GPU. The batch size for baselines and MLE is 4096 tokens. For MRT, where each sentence in the batch is sampled N times, we reduce batch size by N while delaying gradient updates by the same factor to keep the effective batch size constant (Saunders et al., 2018). At inference time we decode using beam size 4. All BLEU scores are for cased, detokenized output, calculated using SacreBLEU (Post, 2018).
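The batch-size adjustment can be sketched as follows. This reflects one reading of the description above: the per-step batch shrinks by a factor of N and gradients from N consecutive steps are combined into a single update. Averaging the accumulated gradients, rather than summing them, is an assumption, and the function names are illustrative.

```python
def mrt_batch_plan(base_batch_tokens, samples_per_sentence):
    """Return (tokens per MRT batch, number of batches accumulated per update)."""
    if base_batch_tokens % samples_per_sentence != 0:
        raise ValueError("batch size must be divisible by the sample count")
    return base_batch_tokens // samples_per_sentence, samples_per_sentence


def accumulate(step_gradients, steps_per_update):
    """Average consecutive groups of per-batch gradients into single updates."""
    updates = []
    last_start = len(step_gradients) - steps_per_update
    for start in range(0, last_start + 1, steps_per_update):
        group = step_gradients[start:start + steps_per_update]
        updates.append([sum(vals) / steps_per_update for vals in zip(*group)])
    return updates


# Baseline batches of 4096 tokens with N = 8 samples per sentence.
tokens_per_batch, steps_per_update = mrt_batch_plan(4096, 8)
```

With these settings each MRT step processes 512 source tokens and 8 steps contribute to every update, so each update still covers 4096 tokens.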

Computation and sample count
Our proposed document-MRT approach is more complex than sequence-MRT due to the additional score-aggregation and context-sampling steps. In practice we find that the extra computation of ordering and aggregating sequence scores is negligible when compared to the computational cost of sentence sampling, required for all forms of MRT.
Our MRT experiments use N = 8 random samples per sentence unless otherwise stated. In this we choose the highest N we can practically experiment with, since previous work finds MRT performance increasing steadily with more samples per sentence.
That we see improvements with so few samples is in contrast to previous work which finds BLEU gains only with 20 or more samples per sentence for sequence-MRT (Edunov et al., 2018). However, we find that document-MRT allows improvements with far fewer samples, perhaps because the aggregation of scores over sentences in a context increases robustness to variation in individual samples.
Relatedly, we find that add-one BLEU smoothing (Lin and Och, 2004) is required for sequence-MRT, as in previous work. However, we find that doc-MRT can achieve good results without smoothing, perhaps because n-gram precisions are far less likely to be 0 when calculated over a document.
In Table 1, we fine-tune an en-de baseline on documents from past news sets. We compare sentence-BLEU and document-BLEU MRT to fine-tuning with Maximum Likelihood Estimation (MLE).
MLE fine-tuning degrades the baseline. This suggests the baseline is well-converged, as is desirable for applying MRT. The degradation is smaller with batches containing only sentences from the same document. We connect this to the idea that NMT batches with fewer sentence pairs have 'noisier' estimated gradients, harming training (Saunders et al., 2018). We expect batches of sentences from a single document to be similar and therefore give less noisy gradient estimates.
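The observation that n-gram precisions are rarely zero over a document can be illustrated with a small unsmoothed BLEU computation. This is a simplified sketch assuming whitespace tokenization and a single reference; it is not the SacreBLEU implementation used for the reported scores, and the example sentences are invented.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def unsmoothed_bleu(hyps, refs, max_n=4):
    """BLEU with n-gram statistics pooled over all sentences, no smoothing."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngram_counts(h, n), ngram_counts(r, n)
            matches[n - 1] += sum((hc & rc).values())  # clipped matches
            totals[n - 1] += sum(hc.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # any zero precision drives unsmoothed BLEU to zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return brevity * math.exp(log_prec)


hyps = ["the cat sat on the mat today", "a dog barked"]
refs = ["the cat sat on the mat", "the dog barked"]

short_sentence_bleu = unsmoothed_bleu(hyps[1:], refs[1:])
document_bleu = unsmoothed_bleu(hyps, refs)
```

The three-word sentence has no 4-grams, so its unsmoothed sentence score is zero and carries no training signal, whereas the pooled document score remains positive.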

MRT for NMT
Both seq-MRT and doc-MRT improve over the baseline with random sampling and N = 8. We also explore MRT at N = 4, with batch size adjusted as described in section 3 for the same effective batch size per update, and with fewer training steps such that the model 'sees' a similar proportion of the overall dataset. We do not report beam sampling results as early experiments indicate beam sampling gives similarly poor results for both seq-MRT and doc-MRT. This may be because beam search produces insufficiently diverse samples for this task (Freitag and Al-Onaizan, 2017).
Sequence-MRT gives a 0.8 BLEU gain over the baseline with both batching schemes using N = 8 samples, but starts to degrade the baseline with N = 4 samples. With document batches and N = 8, doc-MRT (ordered) outperforms seq-MRT by a further 0.4 BLEU. With N = 4, doc-MRT (ordered) still achieves a 0.7 BLEU improvement over the baseline, or a 0.8 BLEU improvement over seq-MRT. We suggest therefore that doc-MRT (ordered) may be a computationally more efficient alternative to seq-MRT when large sample counts are not practical.
For contrast with the ordered document sampling approach of Section 2.3, we give results for doc-MRT (random), which uses randomly sampled contexts. This approach falls significantly behind doc-MRT (ordered) with either batching scheme. Since doc-MRT (random) with random batches is exposed to randomness at the batch construction, sentence sampling and document sampling stages, these results are averages over 3 experimental runs, which gave fairly consistent results (<0.2 BLEU range). In general we do find that results with random batches and random ordering are variable and sensitive to batch size and batching scheme.
We interpret these results by considering the effect on the per-sentence cost for the different schemes. We find MRT works well when sample scores are different enough to be discriminated, but suffers if scores are too different. This is in line with the findings of Edunov et al. (2018) that including the gold reference causes the model to assign low relative probabilities to every other sample.
Doc-MRT aggregates scores over many samples, while seq-MRT uses individual scores. We believe this explains the stronger performance of doc-MRT for small values of N , especially for the ordered document scheme, which ensures scores are still different enough for MRT to discriminate.
Our approach can also be used with document-level metrics that are not intended to be used with individual sentences. In Table 2 we demonstrate this with TER, which estimates the edit rate required to correct a set of translation hypotheses. Document-TER MRT improves over a strong baseline, although batching scheme has less of an impact here. Notably seq-level MRT does not improve TER over the baseline, indicating TER may be too noisy a metric for use at the sentence level.
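To illustrate how an edit-rate cost aggregates over a document, the sketch below pools word-level edits over all sentences before normalizing by total reference length. It is a deliberate simplification: it uses plain Levenshtein distance and omits the block shifts that full TER allows, so it should not be read as a TER implementation. It also assumes at least one non-empty reference.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete a hypothesis word
                           cur[j - 1] + 1,           # insert a reference word
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def document_edit_rate(hyps, refs):
    """Total edits over all sentences divided by total reference length."""
    edits = sum(edit_distance(h.split(), r.split()) for h, r in zip(hyps, refs))
    ref_len = sum(len(r.split()) for r in refs)
    return edits / ref_len


hyps = ["the cat sat", "a dog"]
refs = ["the cat sat", "the dog barked loudly"]

per_sentence = [document_edit_rate([h], [r]) for h, r in zip(hyps, refs)]
pooled = document_edit_rate(hyps, refs)
```

The per-sentence rates here are 0 and 0.75, while the pooled rate is 3/7, showing how aggregation dampens the swings of individual sentence scores.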

MRT for GEC
Finally, we apply our MRT approach to the GEC GLEU metric (Napoles et al., 2015), an n-gram edit measure typically used at the document level. Table 3 shows that document MRT fine-tuning improves GLEU over the baseline, MLE fine-tuning, and a sequence-GLEU MRT formulation. Also notable is the change in M2, which finds the phrase-level edit sequence achieving the highest overlap with the gold-standard (Dahlmeier and Ng, 2012). MLE and sequence-MRT improve recall at a detriment to precision, suggesting over-generation of spurious corrections. Document-MRT likewise improves recall, but with a precision score closer to the baseline for more balanced performance. There is clear indication of a tension between M2 and GLEU: a small increase in GLEU under doc-MRT on CoNLL leads to a large increase in M2, while a large increase in GLEU under doc-MRT on JFLEG leads to a small decrease in M2.
We note that our improvements on JFLEG are similar to improvements previously reported for neural reinforcement learning with a sequence-GLEU cost metric. However, those results involve N=20 samples and 600k updates, compared to N=8 and 3k updates with our approach.

Conclusions and future work
We present a novel approach for structured loss training with document-level objective functions. Our approach relies on a procedure for sampling a set of diverse batch-level contexts using N-wise sample ordering. As well as randomly selecting training data, we assess training with mini-batches consisting only of single document contexts. While the scope of this work does not extend to sampling sentences given document context, this would be an interesting direction for future work.
We demonstrate improvements covering three document-level evaluation metrics: BLEU and TER for NMT and GLEU for GEC. We finish by noting that the original MERT procedure developed for SMT optimised document-level BLEU and with our procedure we reintroduce this to NMT.