Unsupervised Domain Adaptation for Neural Machine Translation with Domain-Aware Feature Embeddings

The recent success of neural machine translation models relies on the availability of high quality, in-domain data. Domain adaptation is required when domain-specific data is scarce or nonexistent. Previous unsupervised domain adaptation strategies include training the model with in-domain copied monolingual or back-translated data. However, these methods use generic representations for text regardless of domain shift, which makes it infeasible for translation models to control outputs conditional on a specific domain. In this work, we propose an approach that adapts models with domain-aware feature embeddings, which are learned via an auxiliary language modeling task. Our approach allows the model to assign domain-specific representations to words and output sentences in the desired domain. Our empirical results demonstrate the effectiveness of the proposed strategy, achieving consistent improvements in multiple experimental settings. In addition, we show that combining our method with back translation can further improve the performance of the model.


Introduction
While neural machine translation (NMT) systems have proven to be effective in scenarios where large amounts of in-domain data are available (Gehring et al., 2017; Vaswani et al., 2017; Chen et al., 2018), they have been demonstrated to perform poorly when the test domain does not match the training data (Koehn and Knowles, 2017). Collecting large amounts of parallel data in every domain of interest is costly, and in many cases impossible. Therefore, it is essential to explore effective methods to train models that generalize well to new domains.
Domain adaptation for neural machine translation has attracted much attention in the research community, with the majority of work focusing on the supervised setting where a small amount of in-domain data is available (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016; Chu et al., 2017; Vilar, 2018). An established approach is to use domain tags as additional input, with the domain representations learned over parallel data (Kobus et al., 2017).
In this work, we focus on unsupervised adaptation, where no in-domain parallel data are available. Within this paradigm, Currey et al. (2017) copy the in-domain monolingual data from the target side to the source side and Sennrich et al. (2016a) concatenate back-translated data with the original corpus. However, these methods learn generic representations for all the text, as the learned representations are shared for all the domains and synthetic and natural data are treated equally. Sharing embeddings may be sub-optimal as data from different domains are inherently different. This problem is exacerbated when words have different senses in different domains.
In this work, we propose a method of Domain-Aware Feature Embedding (DAFE) that performs unsupervised domain adaptation by disentangling representations into different parts. Because we have no in-domain parallel data, we learn DAFEs via an auxiliary task, namely language modeling. Specifically, our proposed model consists of a base network, whose parameters are shared across settings, as well as both domain and task embedding learners. By separating the model into different components, DAFE can learn representations tailored to specific domains and tasks, which can then be utilized for domain adaptation.
We evaluate our method in a Transformer-based NMT system (Vaswani et al., 2017) under two different data settings. Our approach demonstrates consistent improvements of up to 5 BLEU points over unadapted baselines, and up to 2 BLEU points over strong back-translation models. Combining our method with back-translation can further improve the performance of the model, suggesting that the proposed approach is orthogonal to methods that rely on synthesized data.

Methods
In this section, we first illustrate the architecture of DAFE, then describe the overall training strategy.

Architecture
DAFE disentangles hidden states into different parts so that the network can learn representations for particular domains or tasks, as illustrated in Figure 1. Specifically, it consists of three parts: a base network with parameters $\theta_{\text{base}}$ that learns common features across different tasks and domains, a domain-aware feature embedding learner that generates embeddings $\theta^{\tau}_{\text{domain}}$ given input domain $\tau$, and a task-aware feature embedding learner that outputs task representations $\theta^{\gamma}_{\text{task}}$ given input task $\gamma$. The final outputs for each layer are obtained by a combination of the base network outputs and feature embeddings.
The base network is implemented in the encoder-decoder framework (Sutskever et al., 2014; Cho et al., 2014). Both the task and domain embedding learners directly output feature embeddings with look-up operations.
In this work, the domain-aware embedding learner learns domain representations $\theta^{\text{in}}_{\text{domain}}$ and $\theta^{\text{out}}_{\text{domain}}$ from in-domain and out-of-domain data respectively, and the task-aware embedding learner learns task embeddings $\theta^{\text{mt}}_{\text{task}}$ and $\theta^{\text{lm}}_{\text{task}}$ for machine translation and language modeling.
The feature embeddings are generated at each encoding layer (including the source word embedding layer) and have the same size as the hidden states of the base model. It should be noted that feature embedding learners generate different embeddings at different layers.
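Formally, given a specific domain $\tau$ and task $\gamma$, the output of the $l$-th encoding layer $H_e^{(l)}$ would be:
$$H_e^{(l)} = \text{LAYER}_e\big(H_e^{(l-1)}; \theta_{\text{base}}^{(l)}\big) + \theta_{\text{domain}}^{\tau,(l)} + \theta_{\text{task}}^{\gamma,(l)},$$
where $\theta_{\text{domain}}^{\tau,(l)}$ and $\theta_{\text{task}}^{\gamma,(l)}$ are single vectors and $\text{LAYER}_e(\cdot)$ can be any layer encoding function, such as an LSTM (Hochreiter and Schmidhuber, 1997) or Transformer (Vaswani et al., 2017).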
In this paper, we adopt a simple addition operation to combine the outputs of the different parts, which already achieves satisfactory performance, as shown in Section 3. We leave investigating more sophisticated combination strategies for future work.
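The additive combination described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the class name `ToyDafeEncoder`, the scalar "base parameters", and the tanh stand-in for the layer function are all hypothetical, and a real system would use Transformer layers with learned parameters.

```python
import math
import random
from typing import Dict, List, Sequence

Matrix = List[List[float]]  # one row of hidden units per token


def base_layer(hidden: Matrix, weight: float) -> Matrix:
    """Toy stand-in for LAYER_e: element-wise tanh of the scaled input."""
    return [[math.tanh(weight * h) for h in row] for row in hidden]


class ToyDafeEncoder:
    """Base layers plus per-layer domain and task vectors (illustrative only)."""

    def __init__(self, num_layers: int, hidden_size: int,
                 domains: Sequence[str], tasks: Sequence[str],
                 seed: int = 0) -> None:
        rng = random.Random(seed)

        def new_vector() -> List[float]:
            return [rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]

        self.num_layers = num_layers
        # Shared base parameters: a single scalar per layer in this toy model.
        self.base = [rng.uniform(0.5, 1.5) for _ in range(num_layers)]
        # Look-up tables holding a different vector for every layer.
        self.domain: Dict[str, Matrix] = {
            d: [new_vector() for _ in range(num_layers)] for d in domains}
        self.task: Dict[str, Matrix] = {
            t: [new_vector() for _ in range(num_layers)] for t in tasks}

    def encode(self, hidden: Matrix, domain: str, task: str) -> Matrix:
        for layer in range(self.num_layers):
            out = base_layer(hidden, self.base[layer])
            d_vec = self.domain[domain][layer]
            t_vec = self.task[task][layer]
            # Add the same two vectors to the base output at every token position.
            hidden = [[o + d + t for o, d, t in zip(row, d_vec, t_vec)]
                      for row in out]
        return hidden


encoder = ToyDafeEncoder(num_layers=2, hidden_size=3,
                         domains=["in", "out"], tasks=["mt", "lm"])
tokens = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
h_out_mt = encoder.encode(tokens, domain="out", task="mt")
h_in_mt = encoder.encode(tokens, domain="in", task="mt")
```

Because the domain vector is selected by a look-up key, the same base parameters yield different hidden states for the `"in"` and `"out"` domains, which is the mechanism that lets the desired domain be chosen at inference time.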

Training Objectives
In the unsupervised domain adaptation setting, we assume access to an out-of-domain parallel training corpus $(X_{\text{out}}, Y_{\text{out}})$ and target-language in-domain monolingual data $Y_{\text{in}}$.
Neural machine translation. Our target task is machine translation. Both the base network and embedding learners are jointly trained with the objective:
$$\mathcal{L}_{\text{mt}}(\theta) = \sum_{(x, y) \in (X_{\text{out}}, Y_{\text{out}})} \log p(y \mid x; \theta),$$
where $\theta = \{\theta_{\text{base}}, \theta^{\text{mt}}_{\text{task}}, \theta^{\text{out}}_{\text{domain}}\}$.
Language modeling. We choose masked language modeling (LM) as our auxiliary task. Following Lample et al. (2018a,b), we create corrupted versions $C(y)$ for each target sentence $y$ by randomly dropping and slightly shuffling words. During training, gradient ascent is used to maximize the objective:
$$\mathcal{L}_{\text{lm}}(\theta) = \sum_{y \in Y} \log p(y \mid C(y); \theta),$$
where $Y$ is either $Y_{\text{out}}$ or $Y_{\text{in}}$, and $\theta = \{\theta_{\text{base}}, \theta^{\text{lm}}_{\text{task}}, \theta^{\text{out}}_{\text{domain}}\}$ for out-of-domain data and $\{\theta_{\text{base}}, \theta^{\text{lm}}_{\text{task}}, \theta^{\text{in}}_{\text{domain}}\}$ for in-domain data.
Baselines. We compare our method with two baseline models: 1) the copied monolingual data model (Currey et al., 2017), which copies target in-domain monolingual data to the source side; 2) back-translation (Sennrich et al., 2016a), which enriches the training data by generating synthetic in-domain parallel data via a target-to-source NMT model. We characterize the two baselines as data-centric methods, as they rely on synthesized data. In contrast, our approach is model-centric, as we mainly focus on modifying the model architecture. We also perform an ablation study by removing the embedding learners (denoted as "DAFE w/o Embed"), in which case the model simply performs multi-task learning.
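The corruption function C(y) used by the language-modeling objective can be illustrated with a short sketch. The text only states that words are randomly dropped and slightly shuffled, so the function names, the default drop probability, and the window-based shuffle below are assumptions chosen for illustration rather than the settings used in the experiments.

```python
import random
from typing import List, Optional, Tuple


def corrupt(tokens: List[str], drop_prob: float = 0.1, shuffle_window: int = 3,
            rng: Optional[random.Random] = None) -> List[str]:
    """Return a noisy copy of `tokens`: random word drops plus a local shuffle."""
    if rng is None:
        rng = random.Random()
    # 1) Drop each word independently, but never return an empty sentence.
    kept = [tok for tok in tokens if rng.random() >= drop_prob]
    if tokens and not kept:
        kept = [rng.choice(tokens)]
    # 2) Local shuffle: sort by "position + bounded random offset", so a word
    #    moves by fewer than `shuffle_window` positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    order = sorted(range(len(kept)), key=lambda i: keys[i])
    return [kept[i] for i in order]


def make_lm_example(sentence: str,
                    rng: Optional[random.Random] = None
                    ) -> Tuple[List[str], List[str]]:
    """Build one (C(y), y) training pair from a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return corrupt(tokens, rng=rng), tokens


noisy, clean = make_lm_example("please report this bug to the developers .",
                               rng=random.Random(0))
```

Each pair feeds the corrupted sentence to the encoder and asks the decoder to reproduce the clean sentence, so only target-side monolingual text is needed.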

Main Results
Adapting between domains. As shown in the first six result columns of Table 1, the unadapted baseline model (row 1) performs poorly when adapting between domains. The copy method (row 2) and back-translation (row 3) both improve the model significantly, with back-translation being a better alternative to copying. DAFE (row 5) achieves superior performance compared to back-translation, with improvements of up to 2 BLEU points. Also, removing the embedding learners leads to degraded performance (row 4), indicating that they are necessary.
Adapting from a general to a specific domain. In the second data setting (last six columns of Table 1), with relatively large amounts of general-domain data, back-translation achieves competitive performance. In this setting, DAFE improves the unadapted baseline significantly but does not outperform back-translation. We hypothesize this is because the quality of the back-translated data is relatively good.

Combining DAFE with Back-Translation
We conjecture that DAFE is complementary to the data-centric methods. We attempt to support this intuition by combining DAFE with back-translation (the best data-centric approach). We try three different strategies to combine DAFE with back-translation, outlined in Table 1.
Simply using back-translated data to train DAFE (row 6) already achieves notable improvements of up to 4 BLEU points. We can also generate back-translated data using target-to-source models trained with DAFE, with which we train the forward model (Back-DAFE, row 7). By doing so, the back-translated data will be of higher quality and thus the performance of the source-to-target model can be improved. The overall best strategy is to use Back-DAFE to generate synthetic in-domain data and train the DAFE model with the back-translated data (Back-DAFE+DAFE, row 8). Across almost all adaptation settings, Back-DAFE+DAFE leads to higher translation quality, as per our intuition. An advantage of this setting is that the back-translated data allow us to learn $\theta^{\text{in}}_{\text{domain}}$ with the translation task.

Reference   please report this bug to the developers .
MED-embed   please report this to the EMEA .
IT-embed    please report this bug to the developers .

Reference   for intramuscular use .
MED-embed   for intramuscular use .
IT-embed    for the use of the product .
Table 3: Controlling the output domain by providing different domain embeddings. We use compare-mt (Neubig et al., 2019) to select examples.
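The data flow behind the back-translation strategies above can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, and `toy_reverse_model` is a placeholder for a trained target-to-source system, which could be a plain NMT model or, for Back-DAFE, a model trained with DAFE.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]             # (source sentence, target sentence)
Translator = Callable[[str], str]  # maps a target sentence to a source sentence


def back_translate(target_to_source: Translator,
                   mono_target: List[str]) -> List[Pair]:
    """Create synthetic (source, target) pairs from target-side monolingual text."""
    return [(target_to_source(y), y) for y in mono_target]


def with_synthetic_data(out_parallel: List[Pair], in_mono_target: List[str],
                        target_to_source: Translator) -> List[Pair]:
    """Concatenate the original corpus with back-translated in-domain pairs."""
    return list(out_parallel) + back_translate(target_to_source, in_mono_target)


def toy_reverse_model(sentence: str) -> str:
    """Placeholder "translator" that reverses word order (not a real model)."""
    return " ".join(reversed(sentence.split()))


out_parallel = [("ein Fehler", "a bug")]
in_mono = ["for intramuscular use ."]
train_set = with_synthetic_data(out_parallel, in_mono, toy_reverse_model)
```

Swapping the `target_to_source` callable changes which reverse model produces the synthetic pairs, while the forward model trained on `train_set` can be chosen independently.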

Analysis
Low-resource scenarios. One advantage of DAFE over back-translation is that we do not need a good target-to-source translation model, which can be difficult to acquire in low-resource scenarios. We randomly sample different amounts of training data and evaluate the performance of DAFE and back-translation on the development set. As shown in Figure 2, DAFE significantly outperforms back-translation in data-scarce scenarios, as low-quality back-translated data can actually be harmful to downstream performance.
Controlling the output domain. An added perk of our model is the ability to control the output domain by providing the desired domain embeddings. As shown in Table 2, providing mismatched domain embeddings leads to degraded performance, and the examples in Table 3 show how the output changes when different domain embeddings are provided.

Related Work
Domain adaptation approaches for NMT can be categorized as data-centric or model-centric (Chu and Wang, 2018). Data-centric approaches mainly focus on selecting or generating domain-related data using existing in-domain monolingual data. Both the copy method (Currey et al., 2017) and back-translation (Sennrich et al., 2016a) are representative data-centric methods. In addition, Moore and Lewis (2010), Axelrod et al. (2011), and Duh et al. (2013) use LMs to score the out-of-domain data, based on which they select data similar to in-domain text. Model-centric methods have not been fully investigated yet. Gulcehre et al. (2015) propose to fuse LMs and NMT models, but their methods require querying two models during inference and have been demonstrated to underperform the data-centric ones (Chu et al., 2018). There is also work on adaptation via retrieving sentences or n-grams in the training data similar to the test set (Farajian et al., 2017; Bapna and Firat, 2019). However, it can be difficult to find similar parallel sentences in domain adaptation settings.

Conclusion
In this work, we propose a simple yet effective unsupervised domain adaptation technique for neural machine translation, which adapts the model by domain-aware feature embeddings learned with language modeling. Experimental results demonstrate the effectiveness of the proposed approach across settings. In addition, analysis reveals that our method allows us to control the output domain of translation results. Future work includes designing more sophisticated architectures and combination strategies, as well as validating our model on other language pairs and datasets.

Figure 1: Main architecture of DAFE. Embedding learners generate domain- and task-specific features at each layer, which are then integrated into the output of the base network.
Currey et al. (2017) copy the in-domain monolingual data from the target side to the source side and Sennrich et al. (2016a) concatenate back-translated data with the original corpus. However, these methods learn generic representations for all the text, as the learned representations are shared for all the domains and synthetic and natural data are treated equally. Sharing embeddings may be sub-optimal as data from different domains are inherently different. This problem is exacerbated when words have different senses in different domains.
arXiv:1908.10430v1 [cs.CL] 27 Aug 2019

Table 1: Translation accuracy (BLEU) under different settings. The second and third rows list source and target domains respectively. "DAFE w/o Embed" denotes DAFE without embedding learners and "Back-DAFE" denotes back-translation by a target-to-source model trained with DAFE. DAFE outperforms other approaches when adapting between domains (rows 1-5, columns 2-7) and is complementary to back-translation (rows 6-8).
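For language modeling, $\theta = \{\theta_{\text{base}}, \theta^{\text{lm}}_{\text{task}}, \theta^{\text{out}}_{\text{domain}}\}$ for out-of-domain data and $\{\theta_{\text{base}}, \theta^{\text{lm}}_{\text{task}}, \theta^{\text{in}}_{\text{domain}}\}$ for in-domain data.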

Table 2: Providing mismatched domain embeddings leads to degraded performance.