Log-Linear Reformulation of the Noisy Channel Model for Document-Level Neural Machine Translation

We seek to maximally use various data sources, such as parallel and monolingual data, to build an effective and efficient document-level translation system. In particular, we start by considering a noisy channel approach (CITATION) that combines a target-to-source translation model and a language model. By applying Bayes' rule strategically, we reformulate this approach as a log-linear combination of translation, sentence-level and document-level language model probabilities. In addition to using static coefficients for each term, this formulation allows learning dynamic per-token weights to more finely control the impact of the language models. Using either static or dynamic coefficients leads to improvements over a context-agnostic baseline and a context-aware concatenation model.


Introduction
Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2015) has been reported to reach near human-level performance on sentence-by-sentence translation (Läubli et al., 2018). Going beyond the sentence level, document-level NMT aims to translate sentences by taking into account neighboring source or target sentences in order to produce a more cohesive output (Jean et al., 2017; Wang et al., 2017; Maruf et al., 2019). These approaches often train new models from scratch using parallel data.
In this paper, in a similar spirit to Voita et al. (2019a) and Yu et al. (2020), we seek a document-level approach that maximally uses the various available corpora, such as parallel and monolingual data, leveraging models trained at the sentence and document levels, while also striving for computational efficiency. We start from the noisy channel model (Yu et al., 2020), which combines a target-to-source translation model and a document-level language model. By applying Bayes' rule, we reformulate this approach into a log-linear model consisting of a translation model as well as sentence-level and document-level language models. This reformulation admits an auto-regressive expression of token-by-token target document probabilities, facilitating the use of existing inference algorithms such as beam search. In this log-linear model, coefficients modulate the impact of the language models. We first consider static coefficients and, for more fine-grained control, we then train a merging module that dynamically adjusts the LM weights.
With either static or dynamic coefficients, we observe improvements over a context-agnostic baseline, as well as a context-aware concatenation model (Tiedemann and Scherrer, 2017). Similarly to the noisy channel model, our approach reuses off-the-shelf models and benefits from future translation or language modelling improvements.
Log-linear reformulation of the noisy channel model
Given the availability of various heterogeneous data sources that could be used for document-level translation, we seek a strategy to maximally use them. These sources include parallel data, at either the sentence or document level, as well as more broadly available monolingual data.
As the starting point, we consider the noisy channel approach proposed by Yu et al. (2020). Given a source document $(X^{(1)}, \ldots, X^{(N)})$ and its translation $(Y^{(1)}, \ldots, Y^{(N)})$, they assume a generation process where target sentences are produced from left to right, and where each source sentence is translated only from the corresponding target sentence. Under these assumptions, the probability of a source-target document pair is given by
$$P(X^{(1)}, \ldots, X^{(N)}, Y^{(1)}, \ldots, Y^{(N)}) = \prod_{n=1}^{N} P(X^{(n)} \mid Y^{(n)})\, P(Y^{(n)} \mid Y^{(<n)}).$$
As such, applying Bayes' rule to each term $P(X^{(n)} \mid Y^{(n)})$, the conditional probability of the target document given the source is expressed by
$$P(Y^{(1)}, \ldots, Y^{(N)} \mid X^{(1)}, \ldots, X^{(N)}) = C(X) \prod_{n=1}^{N} \frac{P(Y^{(n)} \mid X^{(n)})\, P(Y^{(n)} \mid Y^{(<n)})}{P(Y^{(n)})},$$
where $C(X)$ does not depend on the target document (see Appendix A).
We therefore generate context-aware translations by combining a translation model (TM) $P(Y^{(n)} \mid X^{(n)})$ with both a sentence-level $P(Y^{(n)})$ and a document-level $P(Y^{(n)} \mid Y^{(<n)})$ language model (LM). To calibrate the generation process, we introduce coefficients $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}$, tuned on a validation set, to control the contribution of each language model:
$$P(Y^{(n)} \mid X^{(n)}, Y^{(<n)}) = \frac{1}{Z} \prod_{i=1}^{L_n} P(y_i^{(n)} \mid y_{<i}^{(n)}, X^{(n)})\; P(y_i^{(n)} \mid y_{<i}^{(n)}, Y^{(<n)})^{\alpha}\; P(y_i^{(n)} \mid y_{<i}^{(n)})^{-\beta}, \quad (1)$$
where $Z$ is a normalization constant and $L_n$ is the target sentence length.
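As an illustration, the per-token combination amounts to summing weighted log-probabilities from the three models. The sketch below assumes Eq. 1 takes the form above (up to the normalization $Z$); the function and tensor names are illustrative, not the paper's implementation.

```python
import torch

def combined_token_scores(tm_logprobs, doc_lm_logprobs, sent_lm_logprobs,
                          alpha=0.5, beta=0.5):
    """Per-token log-linear combination (Eq. 1, up to the normalization Z).

    Each input has shape (sentence_length, vocab_size) and holds the
    log-probability assigned to every candidate next token by, respectively,
    the translation model, the document-level LM and the sentence-level LM.
    """
    # The document-level LM is added with weight alpha, while the sentence-level
    # LM is subtracted with weight beta, counter-balancing the document-level LM.
    return tm_logprobs + alpha * doc_lm_logprobs - beta * sent_lm_logprobs


# Toy usage: scores for a 7-token hypothesis over a 100-word vocabulary.
scores = combined_token_scores(torch.randn(7, 100).log_softmax(-1),
                               torch.randn(7, 100).log_softmax(-1),
                               torch.randn(7, 100).log_softmax(-1))
```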
Similarly to the noisy channel approach (Yu et al., 2020), we use off-the-shelf translation and language models. As such, future improvements to either translation or language modelling can easily be leveraged. Our reformulation, however, admits a more efficient search procedure than that of Yu et al. (2020).

Model parameterization
The translation model is implemented as any auto-regressive neural translation model. We use the Transformer encoder-decoder architecture (Vaswani et al., 2017). Given a source sentence $x_1, \ldots, x_L$, each token and its position are projected into a continuous embedding $s_{0,1}, \ldots, s_{0,L}$. These representations are passed through a sequence of $M$ encoder layers that each comprise self-attention and feed-forward modules, resulting in the final representations $s_{M,1}, \ldots, s_{M,L}$. The decoder updates target embeddings through similar layers, which additionally attend to the encoder output, to obtain final hidden states $t_{M,1}, \ldots, t_{M,L}$. Token probabilities may be obtained by projecting these representations and applying softmax normalization.
Language models are implemented as Transformer decoders without cross-attention. We use a single language model trained on sequences of consecutive sentences to obtain both sentence-level and document-level probabilities.
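For concreteness, a single document-trained LM can provide both probabilities by scoring the current sentence with and without the preceding target sentences as a prefix. The sketch below assumes a generic `lm_score(prefix, continuation)` helper returning summed token log-probabilities; this interface is hypothetical, not the paper's API.

```python
def lm_scores(lm_score, prev_target_sents, current_sent):
    """Sentence-level and document-level log-probabilities of `current_sent`
    from a single language model. `lm_score(prefix, continuation)` is an
    assumed helper returning the sum of token log-probabilities of
    `continuation` given `prefix` (both are lists of BPE tokens)."""
    sent_lm = lm_score([], current_sent)                    # P(Y^(n))
    context = [tok for sent in prev_target_sents for tok in sent]
    doc_lm = lm_score(context, current_sent)                # P(Y^(n) | Y^(<n))
    return sent_lm, doc_lm
```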

Dynamic merging
As extra-sentential information is not uniformly useful for translation, we propose dynamic coefficients for the different models by generalizing Eq. 1:
$$P(Y^{(n)} \mid X^{(n)}, Y^{(<n)}) = \frac{1}{Z} \prod_{i=1}^{L_n} P(y_i^{(n)} \mid y_{<i}^{(n)}, X^{(n)})\; P(y_i^{(n)} \mid y_{<i}^{(n)}, Y^{(<n)})^{\alpha_i^{(n)}}\; P(y_i^{(n)} \mid y_{<i}^{(n)})^{-\beta_i^{(n)}}.$$
With the translation and language models kept fixed, the coefficients $\alpha_i^{(n)}$ and $\beta_i^{(n)}$ are computed by an auxiliary neural network which uses $Y^{(<n)}$, $Y^{(n)}$ and $X^{(n)}$. We call this network a merging module and implement it as a feed-forward network on top of the translation and language models.

Dynamic coefficient computation
For every token, the corresponding last hidden states of the translation model, sentence-level LM and document-level LM are concatenated into an input vector $h_0$. Each non-final layer ($k = 1, \ldots, K-1$) is a feed-forward block
$$h_k = \mathrm{LN}\!\left(h_{k-1} + \mathrm{drop}\!\left(W_{k,2}\,\mathrm{ReLU}(W_{k,1} h_{k-1})\right)\right),$$
where LN and drop respectively denote layer normalization and dropout (Ba et al., 2016; Srivastava et al., 2014). The final layer is similar, but there is no residual connection (and no dropout), as the final linear transformation projects the result to 2 dimensions, so that $(\alpha, \beta) = W_{K,2}\,\mathrm{ReLU}(W_{K,1} h_{K-1})$.
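A possible PyTorch realization is sketched below, using a 2-layer module with $d_{ff} = 1536$ as in the settings. The exact placement of layer normalization and the handling of the concatenated input dimension are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class MergingModule(nn.Module):
    """Sketch of the merging module: per target token, the last hidden states
    of the TM, sentence-level LM and document-level LM are concatenated,
    passed through residual feed-forward blocks, and a final layer projects
    the result to the two coefficients (alpha, beta)."""

    def __init__(self, d_model=512, d_ff=1536, num_layers=2, dropout=0.1):
        super().__init__()
        d_in = 3 * d_model  # concatenation of the three hidden states
        self.ffs = nn.ModuleList()
        self.norms = nn.ModuleList()
        for _ in range(num_layers - 1):
            self.ffs.append(nn.Sequential(
                nn.Linear(d_in, d_ff), nn.ReLU(), nn.Linear(d_ff, d_in)))
            self.norms.append(nn.LayerNorm(d_in))
        self.dropout = nn.Dropout(dropout)
        # Final layer: no residual connection and no dropout; the last linear
        # transformation projects to 2 dimensions, i.e. (alpha, beta).
        self.final = nn.Sequential(
            nn.Linear(d_in, d_ff), nn.ReLU(), nn.Linear(d_ff, 2))

    def forward(self, tm_h, sent_lm_h, doc_lm_h):
        # Each input: (batch, length, d_model) hidden states per target token.
        h = torch.cat([tm_h, sent_lm_h, doc_lm_h], dim=-1)
        for ff, norm in zip(self.ffs, self.norms):
            h = norm(h + self.dropout(ff(h)))        # residual FF block
        alpha, beta = self.final(h).unbind(dim=-1)   # per-token coefficients
        return alpha, beta
```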

Settings
Data We run experiments on English-Russian data from OpenSubtitles (Lison et al., 2018), which has been used in many recent studies on document-level translation (Voita et al., 2019b,a; Mansimov et al., 2020; Jean et al., 2019). Language models are trained on approximately 30M sequences of 4 consecutive sentences (Voita et al., 2019a). The parallel data was originally preprocessed by Voita et al. (2019b), yielding 6M examples. For 1.5M of these data points, the 3 preceding source and target sentences are provided. We use this subset to train the merging module that predicts the per-token coefficients for each model. During training, we uniformly sample the number of contextual sentences between 1 and 3 to match the test condition.
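A minimal sketch of this context sampling, with illustrative names (the actual preprocessing pipeline is that of Voita et al., 2019b):

```python
import random

def sample_context(prev_sents, max_context=3):
    """Keep a uniformly sampled number (1 to max_context) of the
    immediately preceding sentences for one training example."""
    if not prev_sents:
        return []
    k = random.randint(1, min(max_context, len(prev_sents)))
    return prev_sents[-k:]
```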
We apply byte-pair encoding (BPE) (Sennrich et al., 2016), with 32k merge operations, learned separately for each language, as Russian and English use different alphabets.
Models Translation models are standard Transformers in their base configuration (Vaswani et al., 2017). The language model is implemented as a Transformer decoder of the same size, except for a smaller feed-forward dimension $d_{ff} = 1024$. The merging module has 2 layers, with $d_{ff} = 1536$.
Learning The translation and language models, as well as the merging module, are trained with label smoothing set to 10%. The TM is trained with 20% dropout, while it is set to 10% for the LMs and merging module.
Evaluation Translation quality is evaluated with tokenized BLEU on lowercased data, using beam search with its width set to 5. We average 5 checkpoints for the translation models. Sentences are generated from left to right, and the beam is reset for every sentence.
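Concretely, inference proceeds sentence by sentence, with the document-level LM conditioned on the translations produced so far. The sketch below assumes generic `beam_search` and `combine_score` interfaces; these names are placeholders, not the actual decoder.

```python
def translate_document(source_sents, beam_search, combine_score, beam_size=5):
    """Left-to-right document translation: the beam is reset for each
    sentence, and hypotheses are scored with the log-linear combination."""
    target_sents = []
    for src in source_sents:
        # combine_score(src, hyp, context) evaluates the TM, the sentence-level
        # LM and the document-level LM conditioned on previous translations.
        def score_fn(hyp, src=src, context=tuple(target_sents)):
            return combine_score(src, hyp, context)
        target_sents.append(beam_search(src, score_fn, beam_size=beam_size))
    return target_sents
```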

Results
With our approach, using static coefficients, we reach a BLEU score of 34.31, a modest gain of 0.21 BLEU over the baseline and 0.8 BLEU over a model trained on concatenated sentences (Table 1). DocRepair (Voita et al., 2019a), a two-pass method that post-edits the output of a baseline system, obtains a slightly higher BLEU score of 34.60. Both approaches could be combined by post-editing the output of our models instead, which we leave for future investigation.

BLEU-NLL correlation
We observe limited correlation between BLEU and reference NLL (Och, 2003; Lee et al., 2020). On the validation set, the per-token baseline loss (with label smoothing) is 13.09. Using static coefficients, it actually increases to 13.23, while it decreases to 12.86 with dynamic coefficients.

Contribution of each language model (static)
Table 2 presents the BLEU scores on the validation set, obtained with greedy search, for different static values of α and β. Only using the document-level LM (α > 0, β = 0) leads to worse performance than the baseline. It is critical to counter-balance the document-level LM with the sentence-level LM.
Dynamic coefficients The dynamic coefficients α and β predicted by the merging module are highly correlated (Figure 1, left). As a conjecture, this high correlation may be explained by the use of the same language model to obtain both sentence and document-level scores. Figure 1 (right) shows the average value of the dynamic coefficient α for frequent words within the validation reference set. In particular, Ты and Вы, which are translations of you that depend on plurality and formality, are assigned high weights.
Challenge sets While static and dynamic coefficients lead to similar BLEU, using dynamic coefficients often results in better performance on multiple-choice, scoring-based challenge sets targeting specific translation phenomena (Table 3) (Voita et al., 2019b). We conjecture this is because dynamic coefficients can more narrowly focus on the particular subsets of target sentences that benefit from document-level context.

Related work
Document-level NMT Neural machine translation may be extended to include extra-sentential information in many ways, as surveyed by Maruf et al. (2019). The model architecture may be modified, for example by encoding previous source sentences or generated translations and attending to them (Jean et al., 2017; Wang et al., 2017; Voita et al., 2018; Zhang et al., 2018; Miculicich et al., 2018; Maruf and Haffari, 2018; Tu et al., 2018). Otherwise, by simply concatenating multiple sentences together as input, existing model architectures may be used without additional changes (Tiedemann and Scherrer, 2017; Junczys-Dowmunt, 2019). Voita et al. (2019b) and Voita et al. (2019a) propose refining the output of a context-agnostic baseline, using a new model trained from either document-level parallel data or from round-trip translated monolingual data. The noisy channel approach similarly uses large-scale monolingual data (Yu et al., 2020) to refine translations, while using arbitrary, and potentially pre-trained, translation or language models, as discussed in Sec. 2.
Our approach shares many similarities with the above, but admits a more straightforward generation process. If desired, we could still rerank the beam search output with a channel model, which might improve general translation quality for reasons not necessarily related to context.
Language modelling Language model probabilities have been used to rerank NMT hypotheses (see, e.g., Stahlberg et al., 2019). Additionally, direct integration of a language model into a translation model, using various fusion techniques, improves generation quality and admits the use of single-pass search algorithms (Gulcehre et al., 2015). To promote diversity in dialogue systems, model scores may be adjusted by negatively weighing a language model (Li et al., 2015).

Conclusion
In this paper, we set out to use heterogeneous data sources in an effective and efficient manner for document-level NMT. We reformulated the noisy channel approach (Yu et al., 2020), arriving at a left-to-right log-linear model that combines a baseline machine translation model with sentence-level and document-level language models.
To modulate the impact of the language models, we dynamically adapt their coefficients at each time step with a merging module that takes the translation and language models into account. We observe improvements over a context-agnostic baseline, and using dynamic coefficients helps better capture document-level linguistic phenomena.
Future directions include combining our approach with MT models trained on back-translated documents, exploring its applicability to other modalities such as vision and speech, and considering deeper fusion of the models.

A Expanded derivation
The conditional probability of the target document given the source is expressed by
$$P(Y^{(1)}, \ldots, Y^{(N)} \mid X^{(1)}, \ldots, X^{(N)}) = C(X) \prod_{n=1}^{N} \frac{P(Y^{(n)} \mid X^{(n)})\, P(Y^{(n)} \mid Y^{(<n)})}{P(Y^{(n)})},$$
where $C(X) = \frac{\prod_{n=1}^{N} P(X^{(n)})}{P(X^{(1)}, \ldots, X^{(N)})}$ does not affect the optimal target sentences given a source document.
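For reference, a sketch of the intermediate steps, following the assumptions of Sec. 2 (sentence-by-sentence channel model and left-to-right target generation):

```latex
\begin{align*}
P(Y^{(1:N)} \mid X^{(1:N)})
  &= \frac{1}{P(X^{(1:N)})} \prod_{n=1}^{N} P(X^{(n)} \mid Y^{(n)})\, P(Y^{(n)} \mid Y^{(<n)}) \\
  &= \frac{1}{P(X^{(1:N)})} \prod_{n=1}^{N}
     \frac{P(Y^{(n)} \mid X^{(n)})\, P(X^{(n)})}{P(Y^{(n)})}\, P(Y^{(n)} \mid Y^{(<n)})
     && \text{(Bayes' rule)} \\
  &= C(X) \prod_{n=1}^{N} \frac{P(Y^{(n)} \mid X^{(n)})\, P(Y^{(n)} \mid Y^{(<n)})}{P(Y^{(n)})}.
\end{align*}
```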

B Hyper-parameters
Translation model We validate models with greedy search. We use the base Transformer configuration (Vaswani et al., 2017). We use effective batches of approximately 31,500 source tokens and optimize models with Adam (Kingma and Ba, 2014). We follow a learning rate schedule similar to that of Vaswani et al. (2017), with 16,000 warm-up steps and the learning rate scaled by 4. We experimented with 10% and 20% dropout, obtaining higher validation BLEU with the latter. We use pre-LN Transformer layers (Xiong et al., 2020).

Language model
We use a similar configuration to the translation model, except with 64,000 warmup steps and post-LN transformer layers (Xiong et al., 2020).
Dynamic coefficients We varied the number of layers between 1 and 3. We also considered adding cross-attention within the merging module, but we did not observe improvements in preliminary experiments.

C Label smoothing
If we train the merging module without label smoothing (instead of 10%), greedy validation BLEU drops by approximately 1 BLEU point. We also observe much higher variability in the coefficients, which may be caused by the unbounded optimal value of α when a target token is the most likely according to the document-level LM.