Iterative Domain-Repaired Back-Translation

In this paper, we focus on domain-specific translation in low-resource settings, where in-domain parallel corpora are scarce or nonexistent. One common and effective strategy for this case is to exploit in-domain monolingual data with the back-translation method. However, the synthetic parallel data are very noisy because they are generated by imperfect out-of-domain systems, resulting in poor domain adaptation performance. To address this issue, we propose a novel iterative domain-repaired back-translation framework, which introduces a Domain-Repair (DR) model to refine the translations in synthetic bilingual data. To this end, we construct training data for the DR model by round-trip translating monolingual sentences, and then design a unified training framework to jointly optimize paired DR and NMT models. Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our approach, achieving 15.79 and 4.47 BLEU improvements on average over unadapted models and back-translation, respectively.


Introduction
Neural Machine Translation (NMT) has achieved impressive performance when large amounts of parallel sentences are available (Vaswani et al., 2017; Hassan et al., 2018). However, previous work has shown that NMT models perform poorly in specific domains, especially when they are trained on corpora from very distinct domains (Koehn and Knowles, 2017; Chu and Wang, 2018). The fine-tuning method (Luong and Manning, 2015) is a popular way to mitigate the effect of domain drift.1 However, it is not realistic to collect large amounts of high-quality parallel data in every domain we are interested in. Since monolingual in-domain data are usually abundant and easy to obtain, it is essential to explore the unsupervised domain adaptation scenario, which utilizes large amounts of out-of-domain bilingual data together with in-domain monolingual data.

1 Our code is released at https://github.com/whr94621/Iterative-Domain-Repaired-Back-Translation
One straightforward and effective solution for unsupervised domain adaptation is to build in-domain synthetic parallel data, either by copying monolingual target sentences to the source side (Currey et al., 2017) or by back-translating in-domain monolingual target sentences (Sennrich et al., 2016). Although back-translation has proven highly effective in exploiting monolingual data, directly applying it in this scenario yields low-quality in-domain synthetic data. Table 1 gives two incorrect translations generated by the back-translation method. The main reason is that the synthetic parallel data are built by imperfect out-of-domain NMT systems, which leads to inappropriate word choices or outright wrong translations. Fine-tuning on such synthetic data is very likely to hurt the performance of domain adaptation. In this paper, we extend back-translation with a Domain-Repair (DR) model to explicitly remedy this issue. Specifically, the DR model is designed to re-generate in-domain source sentences given the synthetic data. In this way, the source side of the pseudo-parallel data can be re-written in the in-domain style, and some wrong translations are fixed. To optimize the DR model, we use round-trip translations of monolingual source sentences to construct the corresponding training data.
Table 1: Examples of incorrect translations generated by back-translation.
SRC: eine Gewichtszunahme wurde nach Markteinführung bei Patienten berichtet , denen ABILIFY verschrieben wurde .
REF: weight gain has been reported post-marketing among patients prescribed ABILIFY .
SRC: es werden möglicherweise nicht alle Packungsgrößen in den Verkehr gebracht .
REF: not all pack sizes may be marketed .

Since source monolingual data is involved, it is natural to extend the back-translation method to the bidirectional setting, which jointly optimizes source-to-target and target-to-source NMT models. Based on this setting, we propose the iterative domain-repaired back-translation (iter-DRBT) framework to fully exploit both source and target in-domain monolingual data. The whole framework starts from pre-trained out-of-domain bidirectional NMT models, which are first used to perform round-trip translation on monolingual data to obtain the initial bidirectional DR models. Next, as illustrated in Figure 1, we design a unified training algorithm consisting of translation repair and round-trip translation procedures to jointly update the DR and NMT models. In the translation repair stage, the back-translated synthetic data are rewritten as in-domain sentences by the trained DR models to further improve the NMT models. The enhanced NMT models then run round-trip translation on the monolingual data to build domain-mapping data, which helps the DR models better identify mistakes made by the latest NMT models. This training process is carried out iteratively to make full use of the DR models to improve the NMT models. We evaluate our proposed method on German-English multi-domain datasets (Tiedemann, 2012). Experimental results on adapting NMT models between specific domains and from the general domain to specific domains show that our method obtains 15.79 and 4.47 BLEU improvements on average over unadapted models and back-translation, respectively. Further analysis demonstrates the ability of DR models to repair synthetic parallel data.
Figure 1: The training process of the iterative domain-repaired back-translation (iter-DRBT) framework at epoch k, where x and y represent the source and target sentences respectively, and x̂ and ŷ denote the translations generated by the NMT models. The whole framework consists of translation repair and round-trip translation procedures, which generate the corresponding training data for the NMT and DR models respectively.

Related Work
Since in-domain parallel corpora are usually hard to obtain, many studies attempt to improve the performance of NMT models without any in-domain parallel sentences. One research line is to extract pseudo in-domain data from large amounts of out-of-domain parallel data. Biçici and Yuret (2011) use an in-domain held-out set to obtain parallel sentences from out-of-domain parallel sentences by computing n-gram overlaps. Instead, Moore and Lewis (2010), Axelrod et al. (2011) and Duh et al. (2013) use language model scores to select data similar to in-domain text. Recently, Chen et al. (2017) train a domain classifier to weight the out-of-domain training samples. There is also work on adaptation via retrieving sentences or n-grams in the training data similar to the test set (Farajian et al., 2017; Bapna and Firat, 2019). However, these methods cannot always guarantee finding domain-specific samples in out-of-domain data. Another research direction is to exploit the plentiful in-domain monolingual data, e.g., integrating a language model during decoding (Gülçehre et al., 2015), the copy method (Currey et al., 2017), back-translation (Sennrich et al., 2016), or obtaining domain-aware feature embeddings via an auxiliary language modeling task. Among these approaches, back-translation is a widely used and effective method for exploiting monolingual data. Our proposed method is also based on back-translation and makes the most of it by improving the data quality with the DR model.
The methods for exploiting monolingual data in NMT can naturally be applied to unsupervised domain adaptation. Some studies exploit source-side monolingual data by self-training (Zhang and Zong, 2016; Chinea-Ríos et al., 2017) or pre-training (Yang et al., 2019; Weng et al., 2020; Ji et al., 2020), while others leverage both source and target monolingual data simultaneously via semi-supervised learning (Cheng et al., 2016), dual learning, and joint training (Hoang et al., 2018). Our method utilizes both source and target data as well, with the difference that we use the monolingual data to train bidirectional DR models, which are then used to fix the pseudo data.
Although back-translation is widely considered more effective than self-training, several works find that its performance degrades due to insufficiently diverse translations or domain mismatch on the source side of the synthetic data (Edunov et al., 2018; Caswell et al., 2019). Edunov et al. (2018) use sampling instead of maximum a posteriori decoding with the reverse-direction model. Imamura et al. (2018) add noise to the results of beam search. Caswell et al. (2019) propose adding a tag token on the source side of the synthetic data. Unlike these methods, ours leverages the DR model to re-generate the source side of the synthetic data, which can also increase translation diversity and mitigate the effect of domain differences.

Iterative Domain-Repaired Back-Translation
In this section, we first give an overview of the iter-DRBT framework, and then describe the architecture of the DR model and the joint training strategy.

Overview
Suppose that we have non-parallel in-domain monolingual sentences X = {x^(s)} and Y = {y^(t)} in the two languages, as well as two pre-trained out-of-domain translation models NMT^0_{x→y} and NMT^0_{y→x}, where x and y denote source and target sentences respectively. The purpose of unsupervised domain adaptation is to train in-domain models NMT_{x→y} and NMT_{y→x}.
In this work, we incorporate a Domain-Repair (DR) model into the iterative back-translation process to fully exploit in-domain monolingual data; the DR model is used to refine the translated sentences in the synthetic bilingual data. The whole framework consists of translation repair and round-trip translation procedures, which generate the corresponding training data for the NMT and DR models, respectively. For convenience, we take source-to-target translation (x → y) as an example to explain our proposed method.
Translation Repair Stage. The basic back-translation procedure is to first translate y^(t) into x̂^(t) with NMT^0_{y→x}, and then fine-tune NMT^0_{x→y} on the resulting synthetic pairs. As the model NMT^0_{y→x} is not trained on truly in-domain bilingual data, there is a domain mismatch between x̂^(t) and genuine in-domain sentences x. Given the synthetic parallel data Y* = {x̂^(t), y^(t)}, we apply the corresponding DR model (DR_{(x̂,y)→x}) to repair errors in the translated sentences, e.g., wrong translations of in-domain phrases or domain-inconsistent expressions, and then obtain the new synthetic parallel data Ỹ = {x̃^(t), y^(t)} to train NMT_{x→y}, initialized with NMT^0_{x→y}.
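The data flow of this stage can be sketched as follows. The two model functions are toy stand-ins with hypothetical names (the real NMT^0_{y→x} and DR_{(x̂,y)→x} are Transformer models); only the shape of the pipeline matches the paper.

```python
def translate_y2x(y):
    """Stand-in for NMT^0_{y->x}: back-translate one target sentence."""
    return "<noisy source of: %s>" % y

def repair(x_hat, y):
    """Stand-in for DR_{(x_hat,y)->x}: rewrite x_hat given the pair (x_hat, y)."""
    return "<repaired source of: %s>" % y

def build_repaired_bt_data(target_mono):
    """Back-translate target monolingual data, then repair the source side."""
    y_star = [(translate_y2x(y), y) for y in target_mono]      # Y* = {(x_hat, y)}
    y_tilde = [(repair(x_hat, y), y) for x_hat, y in y_star]   # repaired data
    return y_tilde
```

The target side is always the genuine in-domain sentence; only the synthetic source side is rewritten.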

Round-Trip Translation Stage.
In order to optimize DR_{(x̂,y)→x}, we use round-trip translations of the monolingual source sentences X = {x^(s)} to construct the corresponding training data X̂ = {x̂^(s), ŷ^(s), x^(s)}, where ŷ^(s) and x̂^(s) are generated by NMT^0_{x→y} and NMT^0_{y→x} respectively (x^(s) → ŷ^(s) → x̂^(s)). In this way, DR_{(x̂,y)→x} learns to identify the mistakes made by NMT^0_{y→x} and the corresponding mapping rules, which helps it better fix the errors in synthetic parallel data.
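Constructing the DR training triples can be sketched as below; the translator arguments are placeholders for the pre-trained NMT models, and the function name is ours.

```python
def build_dr_training_data(source_mono, nmt_x2y, nmt_y2x):
    """Round-trip translate x -> y_hat -> x_hat and keep (x_hat, y_hat, x).

    The DR model's dual-source input is (x_hat, y_hat); its training
    target is the original in-domain sentence x.
    """
    triples = []
    for x in source_mono:
        y_hat = nmt_x2y(x)      # forward translation
        x_hat = nmt_y2x(y_hat)  # translate back to the source language
        triples.append((x_hat, y_hat, x))
    return triples
```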
Similarly, these two stages are also applied in the reverse translation direction to train the target-to-source NMT model (NMT_{y→x}) and the corresponding DR model (DR_{(ŷ,x)→y}). As illustrated in Figure 1, it is natural to extend this training process into a joint training framework, which alternates between the translation repair and round-trip translation procedures to make full use of the DR models to improve the NMT models.
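The alternating procedure just described can be written as a training loop. All arguments (`train_nmt`, `train_dr`, the model callables) are hypothetical stand-ins for the actual fine-tuning and decoding steps; this is a sketch of the control flow, not an implementation of the models.

```python
def round_trip(nmt_fwd, nmt_bwd, mono):
    """Build DR training triples (x_hat, y_hat, x) by round-trip translation."""
    return [(nmt_bwd(nmt_fwd(x)), nmt_fwd(x), x) for x in mono]

def repair_pairs(dr, pairs):
    """Apply a DR model to the translated side of synthetic (translation, mono) pairs."""
    return [(dr(t, m), m) for t, m in pairs]

def iter_drbt(nmt_x2y, nmt_y2x, train_nmt, train_dr, X, Y, T):
    """One possible rendering of the joint iter-DRBT loop."""
    # Initial DR models from round-trip translation with the pre-trained NMT models.
    dr_x = train_dr(round_trip(nmt_x2y, nmt_y2x, X))  # repairs source-language text
    dr_y = train_dr(round_trip(nmt_y2x, nmt_x2y, Y))  # repairs target-language text
    for _ in range(T):
        # Translation repair stage: back-translate, repair, update NMT models.
        y_star = [(nmt_y2x(y), y) for y in Y]   # (x_hat, y) pairs
        x_star = [(nmt_x2y(x), x) for x in X]   # (y_hat, x) pairs
        nmt_x2y = train_nmt(repair_pairs(dr_x, y_star))
        nmt_y2x = train_nmt(repair_pairs(dr_y, x_star))
        # Round-trip translation stage: refresh the DR training data.
        dr_x = train_dr(round_trip(nmt_x2y, nmt_y2x, X))
        dr_y = train_dr(round_trip(nmt_y2x, nmt_x2y, Y))
    return nmt_x2y, nmt_y2x
```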

Domain-Repair Model
Since the DR model takes synthetic bilingual sentences as input to produce in-domain sentences, we parameterize it as a dual-source sequence-to-sequence model. As illustrated in Figure 2, the dual-source Transformer naturally extends the original architecture of Vaswani et al. (2017) by adding another encoder for the translated sentences and stacking an additional multi-head attention component above the multi-head self-attention component. As usual for the Transformer architecture, each block is followed by a skip connection from the previous input and layer normalization. For simplicity, we omit these details in Figure 2.
Our proposed framework involves two DR models (DR_{(x̂,y)→x} and DR_{(ŷ,x)→y}), both of which are optimized by maximizing the conditional log-likelihood on the training corpora X̂ = {x̂^(s), ŷ^(s), x^(s)} and Ŷ = {x̂^(t), ŷ^(t), y^(t)} built by round-trip translation:

$$\mathcal{L}(\theta_1) = \sum_{(\hat{x}, \hat{y}, x) \in \hat{X}} \log P(x \mid \hat{x}, \hat{y}; \theta_1), \qquad \mathcal{L}(\theta_2) = \sum_{(\hat{x}, \hat{y}, y) \in \hat{Y}} \log P(y \mid \hat{y}, \hat{x}; \theta_2),$$

where θ_1 and θ_2 denote the model parameters of DR_{(x̂,y)→x} and DR_{(ŷ,x)→y} respectively.

Joint Training Strategy
We design an iterative training framework to jointly optimize the DR and NMT models, as illustrated in Algorithm 1.

Algorithm 1: Iterative domain-repaired back-translation
Input: pre-trained NMT^0_{x→y} and NMT^0_{y→x}, in-domain monolingual sentences X = {x^(s)} and Y = {y^(t)}, maximum iteration number T
1: Use NMT^0_{x→y} and NMT^0_{y→x} to perform round-trip translation on X and Y to construct datasets X̂ and Ŷ;
2: Train DR^0_{(x̂,y)→x} and DR^0_{(ŷ,x)→y} with X̂ and Ŷ;
3: k = 0;
4: while k ≤ T do
5:   Translation Repair Stage:
6:   Use NMT^k_{x→y} and NMT^k_{y→x} to build synthetic data X* = {x^(s), ŷ^(s)} and Y* = {x̂^(t), y^(t)} for X and Y respectively;
7:   Use DR^k_{(ŷ,x)→y} and DR^k_{(x̂,y)→x} to repair X* and Y*, constructing in-domain synthetic data X̃ = {x^(s), ỹ^(s)} and Ỹ = {x̃^(t), y^(t)};
8:   Update NMT models: fine-tune on X̃ and Ỹ to obtain NMT^{k+1}_{x→y} and NMT^{k+1}_{y→x};
9:   Round-Trip Translation Stage:
10:  Use NMT^{k+1}_{x→y} and NMT^{k+1}_{y→x} to perform round-trip translation on X and Y to re-construct X̂ and Ŷ, and train DR^{k+1}_{(x̂,y)→x} and DR^{k+1}_{(ŷ,x)→y};
11:  k = k + 1;

The whole training framework starts with pre-trained out-of-domain bidirectional NMT models (NMT^0_{x→y} and NMT^0_{y→x}) and in-domain monolingual data (X = {x^(s)} and Y = {y^(t)}). To train the initial DR models, we use NMT^0_{x→y} and NMT^0_{y→x} to run round-trip translation on X and Y to construct the datasets X̂ and Ŷ. Based on the initial NMT and DR models, a joint training process is then iteratively carried out to further optimize these models. This process consists of translation repair and round-trip translation stages. In the translation repair stage, we first use the NMT models to translate the monolingual data, and the DR models then re-write the translated sentences as in-domain sentences. In this way, we obtain better in-domain synthetic data with which to further improve the NMT models. Next, in the round-trip translation stage, we perform round-trip translation on the monolingual data with the enhanced NMT models to re-build the training data for the DR models. DR models trained on these datasets can better identify the mistakes made by the latest NMT models (NMT^{k+1}_{x→y} and NMT^{k+1}_{y→x}) and learn the corresponding mapping rules, which helps them better fix the synthetic parallel data in the next iteration. Note that we fine-tune the NMT and DR models in each iteration to speed up the whole training process.

Experiments

Datasets. Earlier work (2019) randomly shuffles the bi-text data and splits it into halves, which may bring more overlap than occurs in natural monolingual data, i.e., bilingual sentences from the same document can be selected into the monolingual data (e.g., one sentence in the source split and its translation in the target split).
To address the impact of the above issues, we re-collect the in-domain monolingual data and test sets with the following steps:
• Download the XML files from OPUS 2, extract the parallel corpus from each document, and record the document boundaries.
• Randomly take some documents as dev/test sets and use the rest as training data.
• Divide the training set into two parts of similar size. The source sentences of the first half and the target sentences of the second half are then used as monolingual data, respectively.
• De-duplicate all overlapping sentences within the train/dev/test sets.
We choose the medical (EMEA) and law (JRC-Acquis) domains for our experiments. All data statistics are shown in Table 2.
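The document-level splitting procedure above can be sketched as follows; the held-out document count and random seed are illustrative choices, not values from the paper.

```python
import random

def split_by_document(documents, n_heldout_docs=2, seed=0):
    """documents: list of documents, each a list of (src, tgt) sentence pairs.

    Holding out whole documents keeps a sentence and its translation from
    ending up on opposite sides of the monolingual split.
    """
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    heldout = [p for d in docs[:n_heldout_docs] for p in d]  # dev/test pairs
    train = [p for d in docs[n_heldout_docs:] for p in d]
    half = len(train) // 2                                   # two similar-sized parts
    src_mono = [s for s, _ in train[:half]]                  # source side of part 1
    tgt_mono = [t for _, t in train[half:]]                  # target side of part 2
    # De-duplicate sentences that also occur in the held-out sets.
    held_src = {s for s, _ in heldout}
    held_tgt = {t for _, t in heldout}
    src_mono = [s for s in src_mono if s not in held_src]
    tgt_mono = [t for t in tgt_mono if t not in held_tgt]
    return heldout, src_mono, tgt_mono
```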
Experimental Details. We implement all NMT models as Transformer-base (Vaswani et al., 2017). More specifically, the number of layers in both the encoder and decoder is set to 6, with 8 attention heads in each layer. Each layer in both the encoder and decoder has the same input and output dimension d_model = 512 and a feed-forward inner-layer dimension d_hidden = 2048. The DR models follow the same settings as the NMT models.
The Adam optimizer (Kingma and Ba, 2014) is used to update the DR and NMT models. For training the initial NMT and DR models, following the setting of previous work, we set the dropout to 0.1 and the label smoothing coefficient to 0.2. Besides, we adopt the Fairseq (Ott et al., 2019) settings for IWSLT'14 German-to-English to fine-tune the NMT and DR models. During training, we schedule the learning rate with the inverse square root decay scheme, in which the warm-up step is set to 4000 and the maximum learning rate is set to 1e-3 and 5e-4 for pre-training and fine-tuning, respectively.
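This schedule can be written down directly; below is one common form of inverse square root decay with linear warm-up (the values match the text; the function name is ours).

```python
def inverse_sqrt_lr(step, warmup=4000, max_lr=1e-3):
    """Linear warm-up to max_lr over `warmup` steps, then 1/sqrt(step) decay."""
    if step < warmup:
        return max_lr * step / warmup
    return max_lr * (warmup / step) ** 0.5
```

With these settings, the rate peaks at 1e-3 at step 4000 and decays to 5e-4 by step 16000.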
For the joint training strategy, we set the maximum iteration number T in Algorithm 1 as 2 for balancing speed and performance. In practice, we train our framework on 2 Tesla P100 GPUs for all tasks, and it takes 2 days to finish the whole training.
Methods. We compare our approach with several baseline methods in our experiments, including the unadapted out-of-domain model (Base), back-translation (BT), iterative back-translation (iter-BT), and our domain-repaired variants (DRBT and iter-DRBT). We evaluate with sacreBLEU (Post, 2018) in terms of case-sensitive tokenized BLEU (Papineni et al., 2002).

Main Results
Adapting between Specific Domains. We first verify our approach by adapting NMT models from one distinct domain to another, as illustrated in the left four columns of Table 3. As the back-translation method suffers from low-quality synthetic data, iter-BT is used to improve the quality of the synthetic data and achieves a 2.73 BLEU improvement on average, but it still lags 0.46 BLEU behind DRBT. This result indicates that the DR model is better at repairing the imperfections of synthetic data. Jointly training the DR and NMT models (iter-DRBT) obtains a further 3.15 BLEU improvement over DRBT. This also shows that the joint training process helps the DR models better identify mistakes made by the latest NMT models and fix the synthetic parallel data in the following iteration.

Adapting from General to Specific Domains. We further evaluate our method when adapting a model trained on large amounts of general-domain data. We use out-of-domain models trained on the WMT14 German-English dataset and adapt them to the Medical and Law domains, respectively. All results are shown in the right half of Table 3. These results show a similar pattern to the previous experiments, except that the gap between our method and BT/iter-BT is reduced. We attribute this reduction to the stronger in-domain translation of the general models. Even so, iter-DRBT yields the best performance on all test sets, with 11.03 and 0.79 BLEU improvements on average over Base and iter-BT, respectively.
Semi-supervised Adaptation. Our method can easily be applied to semi-supervised domain adaptation, where a limited amount of in-domain parallel data is available. In this setting, we mix the in-domain parallel data with the generated synthetic data when training the NMT models. In addition to round-trip translation on the monolingual data, we run back-translation on the parallel data to construct corresponding training data for the DR models.
We conduct experiments on adapting German-to-English NMT models from the Law domain; the results are shown in Table 4. We observe consistent improvements from our proposed method. It is worth noting that, given 50K in-domain parallel sentences, the gap between using repaired synthetic data and using the actual parallel data is rapidly reduced from 12.58 to 3.52 BLEU, and further decreased to only 2.78 BLEU by joint training with one more iteration, demonstrating the effectiveness of our method in the semi-supervised scenario.

Effect of Joint Training
We further investigate the effect of joint training with more iterations. Specifically, we conduct experiments on adapting German-to-English models from the Medical domain to the Law domain, with iterative back-translation used for comparison.
We plot the BLEU curves of these two methods over the number of iterations. From Figure 3, we observe that our proposed method (iter-DRBT) consistently outperforms iterative back-translation (iter-BT) at the same number of iterations. As the number of iterations increases, the BLEU improvements achieved by both iter-DRBT and iter-BT gradually decrease, but the gap between them remains.

Analysis of Domain Repair Models
In this section, we discuss how DR models repair the source side of the synthetic data to improve its quality. Compared to the original back-translation data, we find that the changes come from three main points: an improvement in the overall quality of the source side, an improvement in the accuracy of in-domain lexical translation, and a source-side style that is closer to the in-domain style.
Improvement of Translation Quality. We first assess the change in translation quality on the source side of the back-translation data, reporting the BLEU change on all development sets before and after applying the DR model. All results are listed in Table 5. We can see that the source side of the back-translation data generated by the out-of-domain model is initially of low quality. The DR model significantly improves it, which in turn improves the effectiveness of back-translation.
Improvement of Lexical Translation. We then assess the change in lexical translation on the source side of the synthetic data before and after domain repair. Based on the frequency of words in the out-of-domain training data, we allocate the target-side words of the development sets into three buckets (< 1, [1, 20), and ≥ 20, representing zero-shot, few-shot, and frequent words, respectively), and compute the word translation F-scores within each bucket. We use compare-mt for all analyses and plot the results in Figure 4. We can see that the synthetic data repaired by DR models show better word translation in all buckets. It is worth noting that the improvement of word translation F-scores on zero-/few-shot (< 20) words dramatically exceeds that on frequent words, which shows that DR models are especially good at repairing in-domain lexical mistranslations.
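A simplified, bag-of-words version of this bucketed analysis can be sketched as below (the numbers in the paper come from compare-mt, which matches words more carefully).

```python
from collections import Counter

BUCKETS = ("zero-shot", "few-shot", "frequent")

def bucket_of(freq):
    """Frequency buckets from the analysis: <1, [1, 20), and >=20."""
    if freq < 1:
        return "zero-shot"
    return "few-shot" if freq < 20 else "frequent"

def bucketed_word_f1(refs, hyps, train_counts):
    """Word-level F1 per bucket, matched as bags of words (a simplification)."""
    stats = {b: Counter() for b in BUCKETS}
    for ref, hyp in zip(refs, hyps):
        rc, hc = Counter(ref.split()), Counter(hyp.split())
        for w in set(rc) | set(hc):
            s = stats[bucket_of(train_counts.get(w, 0))]
            s["match"] += min(rc[w], hc[w])   # clipped matches
            s["ref"] += rc[w]
            s["hyp"] += hc[w]
    f1 = {}
    for b, s in stats.items():
        p = s["match"] / s["hyp"] if s["hyp"] else 0.0
        r = s["match"] / s["ref"] if s["ref"] else 0.0
        f1[b] = 2 * p * r / (p + r) if p + r else 0.0
    return f1
```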
Table 6: Example synthetic source sentences without and with domain repair.
SRC: Arzneimittel , deren Plasmaspiegel bei gemeinsamer Anwendung mit Telzir erhöht sein können
REF: Medicinal products whose plasma levels may be increased when co-administered with Telzir
w/o DR: Medicinal products whose plasma ponds may be increased if they are commonly used by telzir
w/ DR: Medicinal products whose plasma aspiegel may be increased when co-administered with Telzir
SRC: Johanniskraut ( Hypericum perforatum ) Die Serumspiegel von Amprenavir und Ritonavir können durch die gleichzeitige Anwendung von pflanzlichen Zubereitungen mit Johanniskraut ( Hypericum perforatum ) erniedrigt werden .
REF: St John's wort ( Hypericum perforatum ) Serum levels of amprenavir and ritonavir can be reduced by concomitant use of the herbal preparation St John's wort ( Hypericum perforatum ) .
w/o DR: Johanniskraut ( Hypericum perforatum ) The serum levels of Amprenavir and Ritonavir can be reduced by the simultaneous use of plant preparations with currant ( hypericum perforatum ) .
w/ DR: St. John's wort ( Hypericum perforatum ) Serum levels of amprenavir and ritonavir can be stratified by concomitant use of herbal preparations containing St John's wort ( Hypericum perforatum ) .

Improvement of Domain-Consistent Style. We further evaluate how DR models remedy the domain mismatch on the source side of back-translated data, including domain-inconsistent word selection and language style. We evaluate this by observing the change in perplexity measured by in-domain and out-of-domain language models before and after repair, where all language models are trained with KenLM (Heafield, 2011).
The out-of-domain language models are trained on the out-of-domain training data, while the in-domain language models are trained on the original translations of the in-domain monolingual data. We list all perplexity scores in Table 7. On both MED2LAW and WMT2LAW, we observe a consistent shift of perplexity scores towards the in-domain language models, which demonstrates that the DR models make the expression of the source side of the synthetic data more domain-consistent.

Case Study. We provide some examples of how DR models improve the synthetic data. As shown in Table 6, the DR model can reduce mistranslations, such as correcting the translation of "Johanniskraut" into "St John's wort", and can generate more domain-related expressions, such as "co-administered" and "concomitant use of herbal preparations". This shows the ability of domain-repair models to improve the quality and domain consistency of synthetic data generated by imperfect out-of-domain NMT models.
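The perplexity comparison can be illustrated with a toy add-one-smoothed unigram model, a stand-in for the KenLM n-gram models used in the paper.

```python
import math
from collections import Counter

def train_unigram_lm(corpus):
    """Add-one-smoothed unigram LM over whitespace-tokenized lines."""
    counts = Counter(w for line in corpus for w in line.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    return lambda w: (counts.get(w, 0) + 1) / (total + vocab)

def perplexity(lm, sentence):
    """Per-word perplexity of a sentence under a unigram LM."""
    words = sentence.split()
    return math.exp(-sum(math.log(lm(w)) for w in words) / len(words))
```

A repaired source sentence should score lower perplexity under the in-domain LM than under the out-of-domain one, which is exactly the shift reported in Table 7.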

Conclusion
In this paper, we argue that back-translation, the predominant unsupervised domain adaptation method in neural machine translation, suffers from domain shift, which restricts its performance. We propose to remedy this mismatch with a domain-repair model that corrects errors in back-translated sentences, and design the iterative domain-repaired back-translation framework to make full use of this model. Experiments on adapting translation models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our method, achieving significant improvements over strong back-translation baselines.
In the future, we would like to extend our method to enhance back-translation in multi-domain settings.