Domain Adaptive Inference for Neural Machine Translation

We investigate adaptive ensemble weighting for Neural Machine Translation, addressing the case of improving performance on a new and potentially unknown domain without sacrificing performance on the original domain. We adapt sequentially across two Spanish-English and three English-German tasks, comparing unregularized fine-tuning, L2 and Elastic Weight Consolidation. We then report a novel scheme for adaptive NMT ensemble decoding by extending Bayesian Interpolation with source information, and show strong improvements across test domains without access to the domain label.


Introduction
Neural Machine Translation (NMT) models are effective when trained on broad domains with large datasets, such as news translation (Bojar et al., 2017). However, test data may be drawn from a different domain, on which general models can perform poorly (Koehn and Knowles, 2017). We address the problem of adapting to one or more domains while maintaining good performance across all domains. Crucially, we assume the realistic scenario where the domain is unknown at inference time.
One solution is ensembling models trained on different domains (Freitag and Al-Onaizan, 2016). This approach has two main drawbacks. Firstly, obtaining models for each domain is challenging. Training from scratch on each new domain is impractical, while continuing training on a new domain can cause catastrophic forgetting of previous tasks (French, 1999), even in an ensemble (Freitag and Al-Onaizan, 2016). Secondly, ensemble weighting requires knowledge of the test domain.
We address the model training problem with regularized fine-tuning, using an L2 regularizer and Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017). We fine-tune sequentially to translate up to three domains with the same model.
We then develop an adaptive inference scheme for NMT ensembles by extending Bayesian Interpolation (BI) (Allauzen and Riley, 2011) to sequence-to-sequence models. This lets us calculate ensemble weights adaptively over time without needing the domain label, giving strong improvements over uniform ensembling for baseline and fine-tuned models.

Adaptive training
In NMT fine-tuning, a model is first trained on a task A, typically translating a large general-domain corpus (Luong and Manning, 2015). The optimized parameters $\theta^*_A$ are then fine-tuned on task B, a new domain. Without regularization, catastrophic forgetting can occur: performance on task A degrades as the parameters adjust to the new objective. A regularized objective is:
$$L(\theta) = L_B(\theta) + \Lambda \sum_j F_j (\theta_j - \theta^*_{A,j})^2 \tag{1}$$
where $L_A(\theta)$ and $L_B(\theta)$ are the likelihood of tasks A and B. We compare three cases: no-reg, where $\Lambda = 0$; L2, where $F_j = 1$ for each parameter index $j$; and EWC, where $F_j = \mathbb{E}[\nabla^2 L_A(\theta_j)]$, a sample estimate of task A Fisher information. This effectively measures the importance of $\theta_j$ to task A.
For L2 and EWC we tune Λ on the validation sets for new and old tasks to balance forgetting against new-domain performance.
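The three fine-tuning variants differ only in how a quadratic penalty around the task A parameters is weighted. The following is a minimal sketch, not the authors' implementation: it assumes the objective has the form $L_B(\theta) + \Lambda \sum_j F_j (\theta_j - \theta^*_{A,j})^2$, treats `loss_b` as a quantity to be minimized, and uses flat Python lists in place of model tensors. The function name and all numeric values are hypothetical.

```python
def regularized_loss(loss_b, theta, theta_a, fisher, lam):
    """Return loss_b + lam * sum_j fisher[j] * (theta[j] - theta_a[j])**2.

    loss_b  : task B loss at the current parameters (assumed: lower is better)
    theta   : current parameters (flat list)
    theta_a : parameters optimized on task A (flat list)
    fisher  : per-parameter importance F_j (all ones for plain L2)
    lam     : regularization strength (0 disables the penalty, i.e. no-reg)
    """
    penalty = sum(f * (t - ta) ** 2 for f, t, ta in zip(fisher, theta, theta_a))
    return loss_b + lam * penalty


# Toy values, chosen only for illustration.
theta_a = [0.5, -1.0, 2.0]
theta = [0.7, -1.0, 1.0]
loss_b = 3.0

no_reg = regularized_loss(loss_b, theta, theta_a, [1.0, 1.0, 1.0], 0.0)
l2 = regularized_loss(loss_b, theta, theta_a, [1.0, 1.0, 1.0], 0.1)
# Hypothetical Fisher estimates: the first parameter matters most to task A.
ewc = regularized_loss(loss_b, theta, theta_a, [4.0, 0.5, 0.01], 0.1)
```

In this toy case the third parameter moves furthest from its task A value but has a small importance estimate, so the EWC-style penalty is smaller than the L2 penalty: parameters judged unimportant to task A are left free to adapt.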

Adaptive decoding
We extend the BI formalism to condition on a source sequence, letting us apply it to adaptive NMT ensemble weighting. We consider models $p_k(y|x)$ trained on $K$ distinct domains, used for tasks $t = 1, \ldots, T$. In our case a task is decoding from one domain, so $T = K$. We assume throughout that $p(t) = 1/T$, i.e. that tasks are equally likely absent any other information.
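A standard, fixed-weight ensemble would translate with:
$$\hat{y} = \arg\max_y \sum_i \log \sum_{k=1}^{K} W_k \, p_k(y_i | y_{1:i-1}, x) \tag{2}$$
The BI formalism assumes that we have tuned sets of ensemble weights $\lambda_{k,t}$ for each task. This defines a task-conditional ensemble
$$p(y|x,t) = \sum_k \lambda_{k,t} \, p_k(y|x) \tag{3}$$
which can be used as a fixed-weight ensemble if the task is known. However, if the task $t$ is not known, we wish to translate with:
$$p(y|x) = \sum_t p(t|x) \, p(y|x,t) = \sum_k p_k(y|x) \sum_t p(t|x) \, \lambda_{k,t} \tag{4}$$
At step $i$, where $h_i$ is the history $y_{1:i-1}$:
$$p(y_i|h_i,x) = \sum_k p_k(y_i|h_i,x) \sum_t p(t|h_i,x) \, \lambda_{k,t} = \sum_k W_{k,i} \, p_k(y_i|h_i,x) \tag{5}$$
This has the form of an adaptively weighted ensemble where, by comparison with Eq. 2:
$$W_{k,i} = \sum_t p(t|h_i,x) \, \lambda_{k,t} \tag{6}$$
In decoding, at each step $i$ adaptation relies on a recomputed estimate of the task posterior:
$$p(t|h_i,x) = \frac{p(h_i|t,x) \, p(t|x)}{\sum_{t'} p(h_i|t',x) \, p(t'|x)} \tag{7}$$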
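The per-step weight update can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the decoder used in the experiments: it assumes the task posterior is obtained by Bayes' rule from a prior $p(t|x)$ and a history likelihood $p(h_i|t,x)$, that the history likelihood is itself the $\lambda$-weighted mixture of per-model history probabilities, and that each model weight is $W_{k,i} = \sum_t p(t|h_i,x)\lambda_{k,t}$. All function names are hypothetical, and probabilities are kept in linear space for readability (a real decoder would work in log space to avoid underflow).

```python
def task_history_likelihood(model_hist, lam):
    """p(h|t,x) = sum_k lam[k][t] * p_k(h|x) for each task t."""
    num_tasks = len(lam[0])
    return [sum(lam[k][t] * model_hist[k] for k in range(len(lam)))
            for t in range(num_tasks)]


def task_posterior(prior, hist_lik):
    """Bayes' rule: p(t|h,x) is proportional to p(h|t,x) * p(t|x)."""
    joint = [p * l for p, l in zip(prior, hist_lik)]
    z = sum(joint)
    return [j / z for j in joint]


def ensemble_weights(posterior, lam):
    """W_k = sum_t p(t|h,x) * lam[k][t] for each model k."""
    return [sum(lam_k[t] * posterior[t] for t in range(len(posterior)))
            for lam_k in lam]


def adaptive_weights(step_probs, lam, prior):
    """Return the model weights used at every decoding step.

    step_probs[i][k] is the probability model k assigned to the token
    actually chosen at step i; it extends that model's history probability.
    """
    model_hist = [1.0] * len(lam)  # empty history: every model scores 1
    weights = []
    for probs in step_probs:
        post = task_posterior(prior, task_history_likelihood(model_hist, lam))
        weights.append(ensemble_weights(post, lam))
        model_hist = [h * p for h, p in zip(model_hist, probs)]
    return weights


# Two models, two tasks, one model per task (identity weights).
identity = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
prior = [0.5, 0.5]
# Model 2 explains the second emitted token much better than model 1.
step_probs = [[0.5, 0.5], [0.1, 0.4], [0.05, 0.5]]

w_identity = adaptive_weights(step_probs, identity, prior)
w_uniform = adaptive_weights(step_probs, uniform, prior)
```

With identity weights the ensemble starts at an even split and shifts to roughly 0.2 / 0.8 once the second model has explained the history better. With uniform $\lambda_{k,t}$ the weights stay at 0.5 throughout, which is the uniform-ensemble special case.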

Static decoder configurations
In static decoding (Eq. 2), the weights $W_k$ are constant for each source sentence $x$. BI simplifies to a uniform ensemble when $\lambda_{k,t} = p(t|x) = 1/T$. This leads to $W_{k,i} = 1/K$ (see Eq. 6), a fixed equal-weight interpolation of the component models.
Static decoding can also be performed with task posteriors conditioned only on the source sentence, which reflects the assumption that the history can be disregarded and that $p(t|h_i,x) = p(t|x)$. In the most straightforward case, we assume that only domain $k$ is useful for task $t$: $\lambda_{k,t} = \delta_k(t)$ (1 for $k = t$, 0 otherwise). Model weighting simplifies to a fixed ensemble:
$$W_k = p(t = k|x) \tag{8}$$
and decoding proceeds according to Eq. 2. We refer to this as decoding with an informative source (IS). We propose using $G_t$, a collection of n-gram language models trained on source language sentences from each task $t$, to estimate $p(t|x)$:
$$p(t|x) = \frac{p(x|t) \, p(t)}{\sum_{t'} p(x|t') \, p(t')} = \frac{G_t(x)}{\sum_{t'} G_{t'}(x)} \tag{9}$$
In this way we use source language n-gram language models to estimate $p(t = k|x)$ in Eq. 8 for static decoding with an informative source.
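As a small illustration of the informative-source estimate, the sketch below normalizes per-task source language model scores into a task posterior. It assumes a uniform task prior, so that $p(t|x)$ is each task's language model probability of $x$ divided by the sum over tasks, and it assumes the scores arrive as base-10 log probabilities (the convention used by many n-gram toolkits). The function name and the scores are hypothetical.

```python
import math


def source_task_posterior(log10_scores):
    """Normalize per-task LM scores log10 G_t(x) into p(t|x).

    The maximum is subtracted before exponentiating so that very
    negative sentence log-probabilities do not underflow to zero.
    """
    natural = [s * math.log(10.0) for s in log10_scores]
    top = max(natural)
    exps = [math.exp(v - top) for v in natural]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical scores for one source sentence under two task LMs:
# the first LM finds the sentence ten times more likely than the second.
posterior = source_task_posterior([-20.0, -21.0])
```

Under the identity assumption $\lambda_{k,t} = \delta_k(t)$, this posterior can be used directly as the fixed ensemble weights for the sentence.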

Adaptive decoder configurations
For adaptive decoding with Bayesian Interpolation, as in Eq. 5, the model weights vary during decoding according to Eq. 6 and Eq. 7. We assume here that $p(t|x) = p(t) = 1/T$. This corresponds to the approach in Allauzen and Riley (2011), which considers only language model combination for speech recognition. We refer to this in experiments simply as BI. A refinement is to incorporate Eq. 9 into Eq. 7, which would be Bayesian Interpolation with an informative source (BI+IS).
We now address the choice of $\lambda_{k,t}$. A simple but restrictive approach is to take $\lambda_{k,t} = \delta_k(t)$. We refer to this as identity-BI, and it embodies the assumption that only one domain is useful for each task.
Alternatively, if we have validation data $V_t$ for each task $t$, parameter search can be done to optimize $\lambda_{k,t}$ for BLEU over $V_t$ for each task. This is straightforward but relatively costly. We propose a simpler approach based on the source language n-gram language models from Eq. 9. We assume that each $G_t$ is also a language model for its corresponding domain $k$. With $G_{k,t} = \sum_{x \in V_t} G_k(x)$, we take:
$$\lambda_{k,t} = \frac{G_{k,t}}{\sum_{k'} G_{k',t}} \tag{10}$$
$\lambda_{k,t}$ can be interpreted as the probability that task $t$ contains sentences $x$ drawn from domain $k$, as estimated over $V_t$.
Figure 1 demonstrates this adaptive decoding scheme when weighting a biomedical and a general (news) domain model to produce a biomedical sentence under BI. The model weights $W_{k,i}$ are even until biomedical-specific vocabulary is produced, at which point the in-domain model dominates.
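The domain-task weights can be estimated with a simple normalization once the language model scores are available. The sketch below is illustrative only. It assumes that the score of domain language model $k$ on task $t$ is the sum of its per-sentence probabilities over the validation set $V_t$, and that $\lambda_{k,t}$ is that score normalized over domains. The function name and the toy probabilities are hypothetical; real sentence probabilities are far smaller and would need log-space handling.

```python
def domain_task_weights(sentence_probs):
    """Estimate lam[k][t] from language model scores on validation data.

    sentence_probs[k][t] lists the probabilities that domain LM k assigns
    to each sentence of task t's validation set (same sentences for all k).
    Each column t of the result sums to 1 over the domains k.
    """
    num_domains = len(sentence_probs)
    num_tasks = len(sentence_probs[0])
    # Aggregate score of LM k on validation set t (assumed: a plain sum).
    g = [[sum(sentence_probs[k][t]) for t in range(num_tasks)]
         for k in range(num_domains)]
    lam = [[0.0] * num_tasks for _ in range(num_domains)]
    for t in range(num_tasks):
        z = sum(g[k][t] for k in range(num_domains))
        for k in range(num_domains):
            lam[k][t] = g[k][t] / z
    return lam


# Two domain LMs, two tasks, two validation sentences per task (toy values).
toy_probs = [
    [[0.3, 0.1], [0.05, 0.05]],  # LM 0 on V_0 and V_1
    [[0.1, 0.1], [0.2, 0.1]],    # LM 1 on V_0 and V_1
]
lam = domain_task_weights(toy_probs)
```

Here each language model scores its own validation set higher, so the estimated weights favor the matching domain (2/3 and 3/4) while still giving the other domain some weight, unlike the identity setting.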

Summary
We summarize our approaches to decoding in Table 1.
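Decoder        $p(t|x)$    $\lambda_{k,t}$
Uniform        $1/T$       $1/T$
IS             Eq. 9       $\delta_k(t)$
Identity-BI    $1/T$       $\delta_k(t)$
BI             $1/T$       Eq. 10
Table 1: Setting task posterior $p(t|x)$ and domain-task weight $\lambda_{k,t}$ for $T$ tasks under decoding schemes in this work. Note that IS can be combined with either Identity-BI or BI by simply adjusting $p(t|h_i,x)$ according to Eq. 7.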

Related Work
Approaches to NMT domain adaptation include training data selection or generation (Sennrich et al., 2016a; Wang et al., 2017; Sajjad et al., 2017) and fine-tuning output distributions (Dakwale and Monz, 2017). Vilar (2018) regularizes parameters with an importance network, while other work freezes subsets of the model parameters before fine-tuning. Both observe forgetting with the adapted model on the general domain data in the realistic scenario where the test data domain is unknown. L2 regularization has also been used during fine-tuning to reduce forgetting. Concurrently with our work, Thompson et al. (2019) apply EWC to reduce forgetting during NMT domain adaptation.
During inference, Garmash and Monz (2016) use a gating network to learn weights for a multi-source NMT ensemble. Freitag and Al-Onaizan (2016) use uniform ensembles of general and no-reg fine-tuned models.

Experiments
We report on Spanish-English (es-en) and English-German (en-de). For es-en we use the Scielo corpus (Neves et al., 2016), with Health as the general domain, adapting to Biological Sciences ('Bio'). We evaluate on the domain-labeled Health and Bio 2016 test data.
The en-de general domain is the WMT18 News task (Bojar et al., 2017), with all data except ParaCrawl oversampled by 2. We validate on newstest17 and evaluate on newstest18. We adapt first to the IWSLT 2016 TED task (Cettolo et al., 2016), and then sequentially to the APE 2017 IT task (Turchi et al., 2017).
We filter training sentences for minimum three tokens and maximum 120 tokens, and remove sentence pairs with length ratios higher than 4.5:1 or lower than 1:4.5. Table 2 shows filtered training sentence counts. Each language pair uses a 32K-merge source-target BPE vocabulary trained on the general domain (Sennrich et al., 2016b).
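The length and ratio filter can be expressed as a short predicate. The sketch below is one possible reading, not the preprocessing script used for the experiments: it assumes whitespace tokenization, applies the token limits to both sides of each pair, and treats the two ratio bounds as a single symmetric limit of 4.5. The function name is hypothetical.

```python
def keep_pair(src, tgt, min_len=3, max_len=120, max_ratio=4.5):
    """Return True if a sentence pair passes the length and ratio filters."""
    n_src = len(src.split())
    n_tgt = len(tgt.split())
    # Both sides must have between min_len and max_len tokens (assumption).
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Symmetric ratio check covers both 4.5:1 and 1:4.5.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


pairs = [
    ("a b c", "x y z"),                  # kept
    ("a b", "x y z"),                    # dropped: source too short
    ("a b c", " ".join(["w"] * 13)),     # kept: ratio 13/3 is about 4.33
    ("a b c", " ".join(["w"] * 14)),     # dropped: ratio 14/3 is about 4.67
]
kept = [keep_pair(s, t) for s, t in pairs]
```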
We implement in Tensor2Tensor (Vaswani et al., 2018) and use its base Transformer model (Vaswani et al., 2017) for all NMT models. At inference time we decode with beam size 4 in SGNMT (Stahlberg et al., 2017) and evaluate with case-sensitive detokenized BLEU using SacreBLEU (Post, 2018). For BI, we use 4-gram KenLM models (Heafield, 2011).
We wish to improve performance on new domains without reduced performance on the general domain, to give strong models for adaptive decoding. For es-en, the Health and Bio tasks overlap, but catastrophic forgetting still occurs under no-reg (Table 3). Regularization reduces forgetting and allows further improvements on Bio over no-reg fine-tuning. We find EWC outperforms L2 regularization both in learning the new task and in reducing forgetting.
In the en-de News/TED task (Table 4), all fine-tuning schemes give similar improvements on TED. However, EWC outperforms no-reg and L2 on News, not only reducing forgetting but giving a 0.5 BLEU improvement over the baseline News model.
The IT task is very small: training on IT data alone results in over-fitting, with a 17 BLEU improvement under fine-tuning. However, no-reg fine-tuning rapidly forgets previous tasks. EWC reduces forgetting on two previous tasks while further improving on the target domain.

Adaptive decoding results
At inference time we may not know the test data domain to match with the best adapted model, let alone optimal weights for an ensemble on that domain. Table 5 shows improvements on data without domain labelling using our adaptive decoding schemes with unadapted models trained only on one domain (models 1+2 from Table 3 and 1+2+3 from Table 4). We compare with the 'oracle' model trained on each domain, which we can only use if we know the test domain.
Uniform ensembling under-performs all oracle models except es-en Bio, especially on general domains. Identity-BI strongly improves over uniform ensembling, and BI with λ as in Eq. 10 improves further for all but es-en Bio. BI and IS both individually outperform the oracle for all but IS-News, indicating these schemes do not simply learn to select a single model.
The combined scheme of BI+IS outperforms either BI or IS individually, except in en-de IT. We speculate IT is a distinct enough domain that p(t|x) has little effect on adapted BI weights.
In Table 6 we apply the best adaptive decoding scheme, BI+IS, to models fine-tuned with EWC. The es-en ensemble consists of models 1+6 from Table 3 and the en-de ensemble models 1+7+10 from Table 4. As described in Section 2.1, EWC models perform well over multiple domains, so the improvement over uniform ensembling is less striking than for unadapted models. Nevertheless, adaptive decoding improves over both uniform ensembling and the oracle model in most cases.
With adaptive decoding, we do not need to assume whether a uniform ensemble or a single model might perform better for some potentially unknown domain. We highlight this in Table 7 by reporting results with the ensembles of Tables 5 and 6 over concatenated test sets, to mimic the realistic scenario of unlabelled test data. We additionally include the uniform no-reg ensembling approach given in Freitag and Al-Onaizan (2016) using models 1+4 from Table 3 and 1+5+8 from Table 4.
Uniform no-reg ensembling outperforms unadapted uniform ensembling, since fine-tuning gives better in-domain performance. EWC achieves similar or better in-domain results to no-reg while reducing forgetting, resulting in better uniform ensemble performance than no-reg. BI+IS decoding with single-domain trained models achieves gains over both the naive uniform approach and over oracle single-domain models. BI+IS with EWC-adapted models gives a 0.9 / 3.4 BLEU gain over the strong uniform EWC ensemble, and a 2.4 / 10.2 overall BLEU gain over the approach described in Freitag and Al-Onaizan (2016).

Conclusions
We report on training and decoding techniques that adapt NMT to new domains while preserving performance on the original domain. We demonstrate that EWC effectively regularizes NMT fine-tuning, outperforming other schemes reported for NMT. We extend Bayesian Interpolation with source information and apply it to NMT decoding with unadapted and fine-tuned models, adaptively weighting ensembles to outperform the oracle case, without relying on test domain labels. We suggest our approach, reported for domain adaptation, is broadly useful for NMT ensembling.