Domain Adaptation of Neural Machine Translation by Lexicon Induction

It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines.


Introduction
Neural machine translation (NMT) has demonstrated impressive performance when trained on large-scale corpora (Bojar et al., 2018). However, it has also been noted that NMT models trained on corpora in a particular domain tend to perform poorly when translating sentences in a significantly different domain (Chu and Wang, 2018; Koehn and Knowles, 2017). Previous work in the context of phrase-based statistical machine translation (Daumé III and Jagarlamudi, 2011) has noted that unseen (OOV) words account for a large portion of translation errors when switching to new domains. However, this problem of OOV words in cross-domain transfer is under-examined in the context of NMT, where both training methods and experimental results will differ greatly. In this paper, we try to fill this gap, examining domain adaptation methods for NMT with a specific focus on correctly translating unknown words.
As noted by Chu and Wang (2018), there are two important distinctions to make in adaptation methods for MT. The first is data requirements; supervised adaptation relies on in-domain parallel data, and unsupervised adaptation has no such requirement. There is also a distinction between model-based and data-based methods. Model-based methods make explicit changes to the model architecture such as jointly learning domain discrimination and translation (Britz et al., 2017), interpolation of language modeling and translation (Gulcehre et al., 2015; Domhan and Hieber, 2017), and domain control by adding tags and word features (Kobus et al., 2017). On the other hand, data-based methods perform adaptation either by combining in-domain and out-of-domain parallel corpora for supervised adaptation (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016) or by generating pseudo-parallel corpora from in-domain monolingual data for unsupervised adaptation (Sennrich et al., 2016a; Currey et al., 2017).
Specifically, in this paper we tackle the task of data-based, unsupervised adaptation, where representative methods include creation of a pseudo-parallel corpus by back-translation of in-domain monolingual target sentences (Sennrich et al., 2016a), or construction of a pseudo-parallel in-domain corpus by copying monolingual target sentences to the source side (Currey et al., 2017). However, while these methods have the potential to strengthen the target-language decoder through the addition of in-domain target data, they do not explicitly provide direct supervision of domain-specific words, which we argue is one of the major difficulties caused by domain shift.
To remedy this problem, we propose a new data-based method for unsupervised adaptation that specifically focuses on the unknown word problem: domain adaptation by lexicon induction (DALI). Our proposed method leverages large amounts of monolingual data to find translations of in-domain unseen words, and constructs a pseudo-parallel in-domain corpus via word-for-word back-translation of monolingual in-domain target sentences into source sentences. More specifically, we leverage existing supervised (Xing et al., 2015) and unsupervised (Conneau et al., 2018) lexicon induction methods that project source word embeddings to the target embedding space, and find translations of unseen words by their nearest neighbors. For supervised lexicon induction, we learn such a mapping function under the supervision of a seed lexicon extracted from out-of-domain parallel sentences using word alignment. For unsupervised lexicon induction, we follow Conneau et al. (2018) to infer a lexicon by adversarial training and iterative refinement.
In the experiments on German-to-English translation across five domains (Medical, IT, Law, Subtitles, and Koran), we find that DALI improves both RNN-based (Bahdanau et al., 2015) and Transformer-based (Vaswani et al., 2017) models trained on an out-of-domain corpus with gains as high as 14 BLEU. When the proposed method is combined with back-translation, we can further improve performance by up to 4 BLEU. Further analysis shows that the areas in which gains are observed are largely orthogonal to back-translation; our method is effective in translating in-domain unseen words, while back-translation mainly improves the fluency of source sentences, which helps the training of the NMT decoder.

Domain Adaptation by Lexicon Induction
Our method works in two steps: (1) we use lexicon induction methods to learn an in-domain lexicon from in-domain monolingual source data D src-in and target data D tgt-in as well as out-of-domain parallel data D parallel-out , (2) we use this lexicon to create a pseudo-parallel corpus for MT.

Lexicon Induction
Given separate source and target word embeddings, X, Y ∈ R^{d×N}, trained on all available monolingual source and target sentences across all domains, we leverage existing lexicon induction methods that perform supervised (Xing et al., 2015) or unsupervised (Conneau et al., 2018) learning of a mapping f(X) = WX that transforms source embeddings to the target space, and then select nearest neighbors in embedding space to extract translation lexicons.
Supervised Embedding Mapping Supervised learning of the mapping function requires a seed lexicon of size n, denoted as L = {(s, t)_i}_{i=1}^{n}. We represent the source and target word embeddings of the i-th translation pair (s, t)_i by the i-th column vectors of X^{(n)}, Y^{(n)} ∈ R^{d×n} respectively. Xing et al. (2015) show that by enforcing an orthogonality constraint on W ∈ O_d(R), we can obtain a closed-form solution from a singular value decomposition (SVD) of Y^{(n)} X^{(n)T}:

$$W^{*} = \arg\min_{W \in O_d(\mathbb{R})} \lVert Y^{(n)} - W X^{(n)} \rVert_F = U V^{\top}, \qquad U \Sigma V^{\top} = \mathrm{SVD}\left(Y^{(n)} X^{(n)\top}\right)$$

In a domain adaptation setting we have parallel out-of-domain data D parallel-out , which can be used to extract a seed lexicon. Algorithm 1 shows the procedure of extracting this lexicon. We use the word alignment toolkit GIZA++ (Och and Ney, 2003) to extract word translation probabilities P(t|s) and P(s|t) in both forward and backward directions from D parallel-out , and extract lexicons L_fw = {(s, t), ∀P(t|s) > 0} and L_bw = {(s, t), ∀P(s|t) > 0}. We take the union of the lexicons in both directions and further prune out translation pairs containing punctuation that is non-identical. To avoid multiple translations of either a source or target word, we sort translation pairs by the number of times they occur in D parallel-out in descending order, and keep only the pairs with the highest frequency in D parallel-out .
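The seed lexicon extraction described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: the function name `extract_seed_lexicon`, the input format (sets of word pairs with non-zero alignment probability plus a dictionary of pair counts), the reading of the punctuation rule (drop a pair if either side contains punctuation unless both sides are the same string), and the greedy one-to-one selection are all assumptions made for the example.

```python
import string

PUNCT = set(string.punctuation)


def has_punct(word):
    """True if the word contains any ASCII punctuation character."""
    return any(ch in PUNCT for ch in word)


def extract_seed_lexicon(forward_pairs, backward_pairs, pair_counts):
    """Hypothetical helper: build a one-to-one seed lexicon.

    forward_pairs / backward_pairs: sets of (source, target) pairs with
        non-zero probability in each alignment direction (assumed input).
    pair_counts: dict mapping (source, target) -> count in the parallel data.
    """
    # Union of the lexicons from both alignment directions.
    candidates = set(forward_pairs) | set(backward_pairs)
    # Assumed pruning rule: pairs with punctuation survive only if identical.
    candidates = {
        (s, t) for s, t in candidates
        if not (has_punct(s) or has_punct(t)) or s == t
    }
    # Most frequent pairs first; ties are broken alphabetically.
    ranked = sorted(candidates, key=lambda p: (-pair_counts.get(p, 0), p))
    used_src, used_tgt, lexicon = set(), set(), []
    for s, t in ranked:
        # Keep a pair only if neither word already has a translation.
        if s not in used_src and t not in used_tgt:
            lexicon.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return lexicon


forward = {("haus", "house"), ("haus", "home"), (",", "."), (".", ".")}
backward = {("haus", "house"), ("heim", "home")}
counts = {("haus", "house"): 5, ("haus", "home"): 2, ("heim", "home"): 3,
          (",", "."): 4, (".", "."): 9}
seed = extract_seed_lexicon(forward, backward, counts)
print(seed)
```

In this toy input, the lower-frequency pair ("haus", "home") is dropped because "haus" is already paired with "house", which mirrors the goal of avoiding multiple translations per word.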
Unsupervised Embedding Mapping For unsupervised training, we follow Conneau et al. (2018) in mapping source word embeddings to the target word embedding space through adversarial training. Details can be found in the reference, but briefly, a discriminator is trained to distinguish between an embedding sampled from WX and Y, and W is trained to prevent the discriminator from identifying the origin of an embedding by making WX and Y as close as possible.
Induction Once we obtain the matrix W either from supervised or unsupervised training, we map all the possible in-domain source words to the target embedding space. We compute the nearest neighbors of an embedding using Cross-Domain Similarity Local Scaling (CSLS; Conneau et al. (2018)):

$$\mathrm{CSLS}(Wx, y) = 2\cos(Wx, y) - r_T(Wx) - r_S(y)$$

where r_T(Wx) and r_S(y) measure the average cosine similarity between Wx and its K nearest neighbors in the target space, and between y and its K nearest neighbors in the mapped source space, respectively.
To ensure the quality of the extracted lexicons, we only consider mutual nearest neighbors, i.e., pairs of words that are mutually nearest neighbors of each other according to CSLS.This significantly decreases the size of the extracted lexicon, but improves the reliability.
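To make the induction step concrete, the following sketch scores every pair of mapped source and target vectors with CSLS, using the form 2cos(Wx, y) − r_T(Wx) − r_S(y) from Conneau et al. (2018), and keeps only mutual nearest neighbors. It is an illustration in pure Python on toy two-dimensional vectors; the function names, the choice K = 2, and the example vectors are assumptions, and a real system would use a vectorized implementation over the full vocabulary.

```python
import math


def cos(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_top_k(values, k):
    """Average of the k largest values (fewer if the list is shorter)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def csls_matrix(mapped_src, tgt, k=2):
    """CSLS score for every (mapped source, target) pair."""
    sim = [[cos(x, y) for y in tgt] for x in mapped_src]
    # r_T: mean similarity of each mapped source vector to its k nearest targets.
    r_t = [mean_top_k(row, k) for row in sim]
    # r_S: mean similarity of each target vector to its k nearest mapped sources.
    r_s = [mean_top_k(col, k) for col in zip(*sim)]
    return [[2 * sim[i][j] - r_t[i] - r_s[j] for j in range(len(tgt))]
            for i in range(len(mapped_src))]


def mutual_nearest_neighbors(scores):
    """Index pairs (i, j) where i and j are each other's best match."""
    n_src, n_tgt = len(scores), len(scores[0])
    best_tgt = [max(range(n_tgt), key=lambda j: scores[i][j]) for i in range(n_src)]
    best_src = [max(range(n_src), key=lambda i: scores[i][j]) for j in range(n_tgt)]
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy example: three mapped source vectors (rows of WX) and two target vectors.
mapped_src = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
pairs = mutual_nearest_neighbors(csls_matrix(mapped_src, tgt, k=2))
print(pairs)
```

The third source vector is close to both targets but is nobody's best match, so the mutual-nearest-neighbor filter leaves it out of the lexicon. This is the size-versus-reliability trade-off noted above.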

NMT Data Generation and Training
Finally, we use this lexicon to create pseudo-parallel in-domain data to train NMT models. Specifically, we follow Sennrich et al. (2016a) in back-translating the in-domain monolingual target sentences to the source language, but instead of using a pre-trained target-to-source NMT system, we simply perform word-for-word translation using the induced lexicon L. Each target word in the target side of L can be deterministically back-translated to a source word, since we take the nearest neighbor of a target word as its translation according to CSLS. If a target word is not mutually nearest to any source word, we cannot find a translation in L and we simply copy this target word to the source side. We find that more than 80% of the words can be translated by the induced lexicons. We denote the constructed pseudo-parallel in-domain corpus as D pseudo-parallel-in .
During training, we first pre-train an NMT system on an out-of-domain parallel corpus D parallel-out , and then fine-tune the NMT model on a constructed parallel corpus. More specifically, to avoid overfitting to the extracted lexicons, we sample from D parallel-out a fixed subset D′ parallel-out of the same size as the pseudo-parallel corpus, i.e., |D′ parallel-out | = |D pseudo-parallel-in |. We concatenate D′ parallel-out with D pseudo-parallel-in , and fine-tune the NMT model on the combined corpus.
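The word-for-word back-translation step can be sketched in a few lines. The function names and the whitespace tokenization are assumptions for illustration; the lexicon is taken to be a list of (source, target) pairs in which each target word appears at most once.

```python
def word_for_word_back_translate(target_sentence, tgt_to_src):
    """Replace each target token by its source translation, or copy it."""
    tokens = target_sentence.split()
    # Tokens without an entry in the lexicon are copied to the source side.
    return " ".join(tgt_to_src.get(tok, tok) for tok in tokens)


def build_pseudo_parallel(target_sentences, lexicon):
    """Return (pseudo source, target) pairs for monolingual target sentences."""
    tgt_to_src = {t: s for s, t in lexicon}
    return [(word_for_word_back_translate(sent, tgt_to_src), sent)
            for sent in target_sentences]


# Toy lexicon of (source, target) pairs; the entries are made up for the example.
lexicon = [("arzneimittel", "medicine"), ("wirkstoff", "substance")]
corpus = build_pseudo_parallel(
    ["abilify is a medicine containing the active substance"], lexicon)
print(corpus[0][0])
```

The resulting source side is not fluent German, but it pairs each in-domain target word with an in-domain source word, which is the supervision signal the method is after.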
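A minimal sketch of how the fine-tuning corpus could be assembled is given below. The function name, the fixed random seed, the final shuffle, and the fallback to the whole out-of-domain corpus when it is smaller than the pseudo-parallel corpus are assumptions that the text does not specify.

```python
import random


def build_finetuning_corpus(out_domain_pairs, pseudo_in_domain_pairs, seed=0):
    """Mix an equal-sized out-of-domain sample with the pseudo in-domain data."""
    rng = random.Random(seed)
    # Sample as many out-of-domain pairs as there are pseudo in-domain pairs
    # (capped at the corpus size, which is an assumption for small corpora).
    n = min(len(pseudo_in_domain_pairs), len(out_domain_pairs))
    subset = rng.sample(out_domain_pairs, n)
    combined = subset + list(pseudo_in_domain_pairs)
    # Shuffling is an assumption; the text only says the two sets are concatenated.
    rng.shuffle(combined)
    return combined


out_domain = [(f"src{i}", f"tgt{i}") for i in range(10)]
pseudo = [("x", "y"), ("z", "w"), ("u", "v")]
combined = build_finetuning_corpus(out_domain, pseudo)
print(len(combined))
```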

Data
We follow the same setup and train/dev/test splits of Koehn and Knowles (2017), using a German-to-English parallel corpus that covers five different domains. Data statistics are shown in Table 2. Note that these domains are very distant from each other. Following Koehn and Knowles (2017), we process all the data with byte-pair encoding (Sennrich et al., 2016b) to construct a vocabulary of 50K subwords. To build an unaligned monolingual corpus for each domain, we randomly shuffle the parallel corpus and split the corpus into two parts with equal numbers of parallel sentences. We use the target sentences of the first half and the source sentences of the second half. We combine all the unaligned monolingual source and target sentences on all five domains to train a skip-gram model using fastText (Bojanowski et al., 2017). We obtain source and target word embeddings in 512 dimensions by running 10 epochs with a context window of 10, and 10 negative samples.
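The construction of the unaligned monolingual data can be illustrated as follows. The function name and the fixed seed are assumptions; the point of the sketch is that no sentence pair contributes both of its sides, so the two monolingual sets are not translations of each other.

```python
import random


def split_unaligned(parallel_pairs, seed=0):
    """Shuffle a parallel corpus and keep disjoint source/target halves."""
    pairs = list(parallel_pairs)
    random.Random(seed).shuffle(pairs)
    half = len(pairs) // 2
    # Target side of the first half, source side of the second half.
    mono_tgt = [t for _, t in pairs[:half]]
    mono_src = [s for s, _ in pairs[half:]]
    return mono_src, mono_tgt


toy_parallel = [(f"s{i}", f"t{i}") for i in range(10)]
mono_src, mono_tgt = split_unaligned(toy_parallel)
print(len(mono_src), len(mono_tgt))
```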

Main Results
We first compare DALI with other adaptation strategies on both RNN-based and Transformer-based NMT models.
Table 1 shows the performance of the two models when trained on one domain (columns) and tested on another domain (rows). We fine-tune the unadapted baselines using pseudo-parallel data created by DALI. We use the unsupervised lexicon here for all settings, and leave a comparison across lexicon creation methods to Table 3. Based on the last two columns in Table 1, DALI substantially improves both NMT models with average gains of 2.79-7.54 BLEU over the unadapted baselines.
We further compare DALI with two popular data-based unsupervised adaptation methods that leverage in-domain monolingual target sentences: (1) a method that copies target sentences to the source side (Copy; Currey et al. (2017)) and (2) back-translation (BT; Sennrich et al. (2016a)), which translates target sentences to the source language using a backward NMT model. We compare DALI with supervised (DALI-S) and unsupervised (DALI-U) lexicon induction. Finally, we (1) experiment with directly extracting a lexicon from an in-domain corpus using GIZA++ and Algorithm 1 (DALI-GIZA++), and (2) list scores for systems trained directly on in-domain data (In-domain). For simplicity, we test the adaptation performance of the LSTM-based NMT model, and train an LSTM-based NMT model with the same architecture on the out-of-domain corpus for English-to-German back-translation. First, DALI is competitive with BT, outperforming it on the medical domain, and underperforming it on the other three domains. Second, the gain from DALI is orthogonal to that from BT: when combining the pseudo-parallel in-domain corpus obtained from DALI-U with that from BT, we can further improve by 2-5 BLEU points on three of four domains. Third, the gains through usage of both DALI-U and DALI-S are surprisingly similar, although the lexicons induced by these two methods have only about 50% overlap. Detailed analysis of the two lexicons can be found in Section 3.5.

Word-level Translation Accuracy
Since our proposed method focuses on leveraging word-for-word translation for data augmentation, we analyze the word-for-word translation accuracy for unseen in-domain words. A source word is considered an unseen in-domain word when it never appears in the out-of-domain corpus. We examine two questions: (1) How much does each adaptation method improve the translation accuracy of unseen in-domain words? (2) How does the frequency of the in-domain word affect its translation accuracy?
To fairly compare various methods, we use a lexicon extracted from the in-domain parallel data with the GIZA++ alignment toolkit as a reference lexicon L_g. For each unseen in-domain source word in the test file, when the corresponding target word in L_g occurs in the output, we consider it as a "hit" for the word pair. First, we compare the percentage of successful in-domain word translations across all adaptation methods. Specifically, we scan the source and reference of the test set to count the number of valid hits C, then scan the output file to get the count C_t in the same way. Finally, the hit percentage is calculated as C_t / C. The results of experiments adapting IT to other domains are shown in Figure 2. The hit percentage of the unadapted output is extremely low, which confirms our assumption that in-domain word translation poses a major challenge in adaptation scenarios. We also find that all augmentation methods can improve the translation accuracy of unseen in-domain words, but our proposed method outperforms all others in most cases. The unseen in-domain word translation accuracy is correlated with the BLEU scores, which shows that correctly translating in-domain unseen words is a major factor contributing to the improvements seen by these methods.
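The hit percentage can be computed along the following lines. This is a sketch under stated assumptions: sentences are whitespace-tokenized, the reference lexicon maps each source word to a single target word, and the function names and toy sentences are made up for the example.

```python
def count_hits(sources, targets, ref_lexicon, unseen_words):
    """Count unseen source words whose reference translation occurs in the
    paired target-side sentence (either the reference or a system output)."""
    hits = 0
    for src, tgt in zip(sources, targets):
        tgt_tokens = set(tgt.split())
        for word in src.split():
            if word in unseen_words and ref_lexicon.get(word) in tgt_tokens:
                hits += 1
    return hits


def hit_percentage(sources, references, outputs, ref_lexicon, unseen_words):
    """C_t / C: hits in the system output relative to hits in the reference."""
    c = count_hits(sources, references, ref_lexicon, unseen_words)
    c_t = count_hits(sources, outputs, ref_lexicon, unseen_words)
    return c_t / c if c else 0.0


sources = ["abilify ist ein arzneimittel", "der wirkstoff"]
references = ["abilify is a medicine", "the substance"]
outputs = ["abilify is a arzneimittel", "the substance"]
ref_lexicon = {"arzneimittel": "medicine", "wirkstoff": "substance"}
unseen = {"arzneimittel", "wirkstoff"}
score = hit_percentage(sources, references, outputs, ref_lexicon, unseen)
print(score)
```

In the toy data the reference contains both expected translations (C = 2), while the output contains only "substance" (C_t = 1).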
Second, to investigate the effect of frequency on word-for-word translation, we bucket the unseen in-domain words by their frequency percentile in the pseudo-in-domain training dataset, and calculate the average translation accuracy of unseen in-domain words within each bucket. The results are plotted in Figure 3. As the frequency percentile increases, the translation accuracy also increases, which is consistent with our intuition that the neural network would be able to remember high-frequency tokens better.
Since the absolute value of the occurrences are different among all domains, the numerical values of accuracy within each bucket vary across domains, but all lines follow the ascending pattern.
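The bucketing analysis can be sketched as below. The input format (one frequency and one accuracy value per unseen in-domain word), the number of buckets, and the rank-based percentile assignment are assumptions, since the text does not state how the percentiles were computed.

```python
def bucket_accuracy(word_stats, n_buckets=4):
    """Average accuracy per frequency-percentile bucket.

    word_stats: list of (frequency, accuracy) tuples, one per unseen word.
    Returns one average per bucket, lowest-frequency bucket first
    (None for an empty bucket).
    """
    ranked = sorted(word_stats, key=lambda fa: fa[0])
    n = len(ranked)
    buckets = [[] for _ in range(n_buckets)]
    for rank, (_, acc) in enumerate(ranked):
        # Rank-based percentile: the word at rank r falls in bucket r * B // n.
        idx = min(rank * n_buckets // n, n_buckets - 1)
        buckets[idx].append(acc)
    return [sum(b) / len(b) if b else None for b in buckets]


stats = [(1, 0.0), (9, 1.0), (2, 0.5), (5, 0.5)]
averages = bucket_accuracy(stats, n_buckets=2)
print(averages)
```

With these made-up numbers, the two lower-frequency words average 0.25 and the two higher-frequency words average 0.75, which is the kind of ascending pattern the analysis looks for.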

When do Copy, BT and DALI Work?
From Figure 2, we can see that Copy, BT and DALI all improve the translation accuracy of in-domain unseen words. In this section, we explore exactly what types of words each method improves on. We randomly pick some in-domain unseen word pairs which are translated 100% correctly in the translation outputs of systems trained with each method. We also count these word pairs' occurrences in the pseudo-in-domain training set. The examples are shown in Table 5. We find that in the case of Copy, over 80% of the successful word translation pairs have the same spelling for both source and target words, and almost all of the rest of the pairs share subword components. In short, and as expected, Copy excels at improving the accuracy of words that have identical forms on the source and target sides.
As expected, our proposed method mainly increases the translation accuracy of the pairs in our induced lexicon. It also leverages the subword components to successfully translate compound words. For example, "monotherapie" does not occur in our induced lexicon, but the model is still able to translate it correctly based on its subwords "mono@@" and "therapie" by leveraging the successfully induced pair "therapie" and "therapy".
It is more surprising to find that adding a back-translated corpus significantly improves the model's ability to translate in-domain unseen words correctly, even if the source word never occurs in the pseudo-in-domain corpus. Even more surprisingly, we find that the majority of the correctly translated source words are not segmented at all, which means that the model does not leverage the subword components to make correct translations. In fact, for most of the correctly translated in-domain word pairs, the source words are never seen during training. To further analyze this, we use our BT model to do word-for-word translation for these individual words without any other context, and the results turn out to be extremely bad, indicating that the model does not actually find the correspondence of these word pairs. Rather, it relies solely on the decoder to make the correct translation on the target side for test sentences with related target sentences in the training set. To verify this, Table 4 shows an example extracted from the pseudo-in-domain training set. BT-T shows a monolingual in-domain target sentence and BT-S is the back-translated source sentence. Though the back-translation fails to generate any in-domain words and the meaning is unfaithful, it succeeds in generating a sentence pattern similar to that of the correct source sentence, which is "... ist eine (ein) ... , die (das) ... enthält .". The model can easily detect the pattern through the attention mechanism and translate the highly related word "medicine" correctly.
From the above analysis, it can be seen that the improvements brought by augmentation with BT and DALI are largely orthogonal. The former utilizes the highly related contexts to translate unseen in-domain words, while the latter directly injects reliable word translation pairs into the training corpus. This explains why we get further improvements over either single method alone.

Lexicon Coverage
Intuitively, with a larger lexicon, we would expect better adaptation performance. To examine this hypothesis, we run experiments using pseudo-in-domain training sets generated by our induced lexicon with various coverage levels. Specifically, we split the lexicon into 5 folds randomly and use cumulative portions of it comprising folds 1 through k for k = 1, ..., 5, which correspond to 20%, 40%, 60%, 80% and 100% of the original data. We calculate the coverage of the words in the Medical test set compared with each pseudo-in-domain training set. We use each training set to train a model and get its corresponding BLEU score. From Figure 4, we find that the proportion of the used lexicon is highly correlated with both the known word coverage in the test set and its BLEU score, indicating that by inducing a larger and more accurate lexicon, further improvements can likely be made.

Table 4:
BT-S | es ist eine Nachricht , die die aktive Substanz enthält .
BT-T | Invirase is a medicine containing the active substance saquinavir .
Test-S | ABILIFY ist ein Arzneimittel , das den Wirkstoff Aripiprazol enthält .
Test-T | Prevenar is a medicine containing the design of Arixtra .
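The lexicon coverage experiment can be illustrated with the sketch below. The function names, the fixed seed, and the token-level definition of coverage (share of test tokens that also occur in the training set) are assumptions; the text does not say whether coverage was measured over tokens or over word types.

```python
import random


def lexicon_portions(lexicon, n_folds=5, seed=0):
    """Shuffle the lexicon and return nested portions: folds 1..k for each k."""
    pairs = list(lexicon)
    random.Random(seed).shuffle(pairs)
    # With 5 folds the portions hold 20%, 40%, 60%, 80% and 100% of the pairs.
    return [pairs[:len(pairs) * (k + 1) // n_folds] for k in range(n_folds)]


def word_coverage(test_sentences, train_sentences):
    """Fraction of test tokens that appear somewhere in the training set."""
    train_vocab = {w for sent in train_sentences for w in sent.split()}
    test_tokens = [w for sent in test_sentences for w in sent.split()]
    if not test_tokens:
        return 0.0
    return sum(w in train_vocab for w in test_tokens) / len(test_tokens)


toy_lexicon = [(f"s{i}", f"t{i}") for i in range(10)]
portions = lexicon_portions(toy_lexicon)
coverage = word_coverage(["a b c d"], ["a b x"])
print([len(p) for p in portions], coverage)
```

Each portion would then be used to generate its own pseudo-in-domain training set, whose coverage and BLEU score can be compared across portions.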

Semi-supervised Adaptation
Although we target unsupervised domain adaptation, it is also common to have a limited number of in-domain parallel sentences in a semi-supervised adaptation setting. To measure the efficacy of DALI in this setting, we first pre-train an NMT model on a parallel corpus in the IT domain, and adapt it to the medical domain. The pre-trained NMT model obtains a BLEU score of 7.43 on the medical test set.
During fine-tuning, we sample 330,278 out-of-domain parallel sentences, and concatenate them with 547,325 pseudo-in-domain sentences generated by DALI and the real in-domain sentences.
We also compare the performance of fine-tuning on the combination of the out-of-domain parallel sentences with only real in-domain sentences. We vary the number of real in-domain sentences in the range of [20K, 40K, 80K, 160K, 320K, 480K].
In Figure 5(a), semi-supervised adaptation outperforms unsupervised adaptation after we add more than 20K real in-domain sentences. As the number of real in-domain sentences increases, the BLEU scores on the in-domain test set improve, and fine-tuning on both the pseudo and real in-domain sentences further improves over fine-tuning solely on the real in-domain sentences. In other words, given a reasonable number of real in-domain sentences in a common semi-supervised adaptation setting, DALI is still helpful in leveraging a large number of monolingual in-domain sentences.

Effect of Out-of-Domain Corpus
The size of the data that we use to train the unadapted NMT and BT NMT models varies from hundreds of thousands to millions, and covers a wide range of popular domains. Nonetheless, the unadapted NMT and BT NMT models can both benefit from training on a large out-of-domain corpus. We examine the question: how does fine-tuning on weak and strong unadapted NMT models affect the adaptation performance? To this end, we compare DALI and BT on adapting from the subtitles to the medical domain, where the largest corpora in the subtitles and medical domains have 13.9 and 1.3 million sentences respectively. We vary the size of the out-of-domain corpus in a range of [0.5, 1, 2, 4, 13.9] million, and fix the number of in-domain target sentences to 0.6 million. In Figure 5(b), as the number of out-of-domain parallel sentences increases, we have a stronger unadapted NMT model, which consistently improves the BLEU score on the in-domain test set. Both DALI and BT also benefit from adapting a stronger NMT model to the new domain. Combining DALI with BT further improves the performance, which again confirms our finding that the gains from DALI and BT are orthogonal to each other. Having a stronger BT model improves the quality of synthetic data, while DALI aims at improving the translation accuracy of OOV words by explicitly injecting their translations.

Table 6: Translation outputs from various data augmentation methods and our method for IT→Medical adaptation.
System | Output | BLEU
Source | ABILIFY ist ein Arzneimittel , das den Wirkstoff Aripiprazol enthält . |
Reference | abilify is a medicine containing the active substance aripiprazole . | 1.000
Unadapted | the time is a figure that corresponds to the formula of a formula . | 0.204
Copy | abilify is a casular and the raw piprexpression offers . | 0.334
BT | prevenar is a medicine containing the design of arixtra . | 0.524
DALI | abilify is a arzneimittel that corresponds to the substance ariprazole . | 0.588
DALI+BT | abilify is a arzneimittel , which contains the substance aripiprazole . | 0.693

Effect of Domain Coverage
We further test the adaptation performance of DALI when we train our base NMT model on the WMT14 German-English parallel corpus. The corpus is a combination of Europarl v7, the Common Crawl corpus and News Commentary, and consists of 4,520,620 parallel sentences from a wider range of domains. In Table 7, we compare the BLEU scores on the test sets between the unadapted NMT system and the NMT system adapted using DALI-U. We also show the percentage of source words or subwords in the training corpus of each of the five domains that are covered by the WMT14 corpus. Although the unadapted NMT system trained on the WMT14 corpus obtains higher scores than that trained on the corpus of each individual domain, DALI still improves the adaptation performance over the unadapted NMT system by up to 5 BLEU.

Qualitative Examples
Finally, we show outputs generated by various data augmentation methods. Starting with the unadapted output, we can see that the output is totally unrelated to the reference. By adding the copied corpus, words that have the same spelling in the source and target languages (e.g., "abilify") are correctly translated. With back-translation, the output is more fluent; though keywords like "abilify" are not well translated, in-domain words that are highly related to the context like "medicine" are correctly translated. DALI manages to translate in-domain words like "abilify" and "substance", which are added by DALI using the induced lexicon. By combining both BT and DALI, the output becomes fluent and also contains correctly translated in-domain keywords of the sentence.

Related Work
There is much work on the supervised domain adaptation setting, where we have large out-of-domain parallel data and much smaller in-domain parallel data. Luong and Manning (2015) propose training a model on an out-of-domain corpus and fine-tuning it with small-sized in-domain parallel data to mitigate the domain shift problem. Instead of naively mixing out-of-domain and in-domain data, Britz et al. (2017) circumvent the domain shift problem by jointly learning domain discrimination and translation. Joty et al. (2015) and Wang et al. (2017) address the domain adaptation problem by assigning higher weight to out-of-domain parallel sentences that are close to the in-domain corpus. Our proposed method focuses on solving the adaptation problem with no in-domain parallel sentences, a strict unsupervised setting.
Prior work on using monolingual data to do data augmentation could be easily adapted to the domain adaptation setting. Early studies on data-based methods such as self-enhancing (Schwenk, 2008; Lambert et al., 2011) translate monolingual source sentences with a statistical machine translation system, and continue training the system on the synthetic parallel data. Recent data-based methods such as back-translation (Sennrich et al., 2016a) and copy-based methods (Currey et al., 2017) mainly focus on improving the fluency of the output sentences and the translation of identical words, whereas our method targets OOV word translation. In addition, there have been several attempts to do data augmentation using monolingual source sentences (Zhang and Zong, 2016; Chinea-Rios et al., 2017). Besides, model-based methods change model architectures to leverage monolingual corpora by introducing an extra learning objective, such as an auto-encoder objective (Cheng et al., 2016) or a language modeling objective (Ramachandran et al., 2017). Another line of research on using monolingual data is unsupervised machine translation (Artetxe et al., 2018; Lample et al., 2018b,a; Yang et al., 2018). These methods use word-for-word translation as a component, but require a careful design of model architectures, and do not explicitly tackle the domain adaptation problem. Our proposed data-based method does not depend on model architectures, which makes it orthogonal to these model-based methods.
Our work shows that apart from strengthening the target-side decoder, direct supervision over the in-domain unseen words is essential for domain adaptation. Similarly, a variety of methods focus on solving OOV problems in translation. Daumé III and Jagarlamudi (2011) induce lexicons for unseen words and construct phrase tables for statistical machine translation. However, it is nontrivial to integrate a lexicon into NMT models that lack explicit use of phrase tables. With regard to NMT, Arthur et al. (2016) use a lexicon to bias the probability of the NMT system and show promising improvements. Luong and Manning (2015) propose to emit OOV target words by their corresponding source words and do post-translation for those OOV words with a dictionary. Fadaee et al. (2017) propose an effective data augmentation method that generates sentence pairs containing rare words in synthetically created contexts, but this requires parallel training data not available in the fully unsupervised adaptation setting. Arcan and Buitelaar (2017) leverage a domain-specific lexicon to replace unknown words after decoding. Zhao et al. (2018) design a contextual memory module in an NMT system to memorize translations of rare words. Kothur et al. (2018) treat an annotated lexicon as parallel sentences and continue training the NMT system on the lexicon. Though all these works leverage a lexicon to address the problem of OOV words, none specifically target translating in-domain OOV words under a domain adaptation setting.

Conclusion
In this paper, we propose a data-based, unsupervised adaptation method that focuses on domain adaptation by lexicon induction (DALI) for mitigating unknown word problems in NMT. We conduct extensive experiments to show consistent improvements of two popular NMT models through the usage of our proposed method. Further analysis shows that our method is effective in fine-tuning a pre-trained NMT model to correctly translate unknown words when switching to new domains.

Table 12 :
Out-of-Vocabulary statistics of English Words across five domains. Each row indicates the OOV statistics of the out-of-domain (row) corpus against the in-domain (columns) corpus. The second column shows the vocabulary size of the out-of-domain corpus in each row. The remaining columns (3rd-7th) show the number of domain-specific words in each in-domain corpus with respect to the out-of-domain corpus, and the ratio between the number of out-of-domain corpus and the domain specific words.

Figure 1 :
Figure 1: Work flow of domain adaptation by lexicon induction (DALI).

Figure 2 :
Figure 2: Translation accuracy of in-domain words of the test set for several data augmentation baselines and our proposed method, with IT as the out-of-domain corpus.

Figure 3 :
Figure 3: Translation accuracy of in-domain unseen words in the test set with regard to the frequency percentile of lexicon words inserted in the pseudo-in-domain training corpus.

Figure 4 :
Figure 4: Word coverage and BLEU score of the Medical test set when the pseudo-in-domain training set is constructed with different levels of lexicon coverage.

Figure 5 :
Figure 5: Effect of training on increasing number of in-domain (a) and out-of-domain (b) parallel sentences.

Table 1 :
BLEU scores of LSTM based and Transformer (XFMR) based NMT models when trained on one domain (columns), and tested on another domain (rows). The last two columns show the average performance of unadapted baselines and DALI, and the average gains.

Table 2 :
Corpus statistics over five domains.

Table 4 :
An example that shows why BT could translate the OOV word "Arzneimittel" correctly into "medicine". "enthält" corresponds to the English word "contain". Though BT can't translate a correct source sentence for augmentation, it generates sentences with certain patterns that could be identified by the model, which helps translate in-domain unseen words.

Table 5 :
100% successful word translation examples from the output of the IT to Medical adaptation task.The Count column shows the number of occurrences of word pairs in the pseudo-in-domain training set.

Table 7 :
BLEU scores of LSTM based NMT models when trained on WMT14 De-En data (Base), and adapted to one domain (DALI).The last two columns show the percentage of source word/subword overlap between the training data on the WMT domain and other five domains.

Table 11 :
Out-of-Vocabulary statistics of German Words across five domains.Each row indicates the OOV statistics of the out-of-domain (row) corpus against the in-domain (columns) corpus.The second column shows the vocabulary size of the out-of-domain corpus in each row.The remaining columns (3rd-7th) show the number of domain-specific words in each in-domain corpus with respect to the out-of-domain corpus, and the ratio between the number of out-of-domain corpus and the domain specific words.