Vocabulary Adaptation for Domain Adaptation in Neural Machine Translation

Neural network methods exhibit strong performance only in a few resource-rich domains. Practitioners therefore employ domain adaptation from resource-rich domains that are, in most cases, distant from the target domain. Domain adaptation between distant domains (e.g., movie subtitles and research papers), however, cannot be performed effectively due to mismatches in vocabulary; it will encounter many domain-specific words (e.g., “angstrom”) and words whose meanings shift across domains (e.g., “conductor”). In this study, aiming to solve these vocabulary mismatches in domain adaptation for neural machine translation (NMT), we propose vocabulary adaptation, a simple method for effective fine-tuning that adapts embedding layers in a given pretrained NMT model to the target domain. Prior to fine-tuning, our method replaces the embedding layers of the NMT model by projecting general word embeddings induced from monolingual data in a target domain onto a source-domain embedding space. Experimental results indicate that our method improves the performance of conventional fine-tuning by 3.86 and 3.28 BLEU points in En-Ja and De-En translation, respectively.


Introduction
The performance of neural machine translation (NMT) models drops remarkably in domains different from the training data (Koehn and Knowles, 2017). Since a massive amount of parallel data is available only in a limited number of domains, domain adaptation is often required to employ NMT in practical applications. Researchers have therefore developed fine-tuning, a dominant approach for this problem (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016; Chu et al., 2017; Bapna and Firat, 2019). Given a large amount of source-domain and a small amount of target-domain parallel data, fine-tuning adjusts the parameters of a model pre-trained in the source domain to the target domain. However, in fine-tuning, inheriting the embedding layers of the model pre-trained in the source domain causes vocabulary mismatches; namely, a model can handle neither domain-specific words that are not covered by a small amount of target-domain parallel data (unknown words) nor words that have different meanings across domains (semantic shift). Moreover, adopting the standard subword tokenization (Sennrich et al., 2016b; Kudo, 2018) accelerates the semantic shift. Target-domain-specific words are often finely decomposed into source-domain subwords (e.g., "alloy" → " all" + "o" + "y"), which introduces improper subword meanings and hinders adaptation (Table 7 in § 5).
To resolve these vocabulary-mismatch problems in domain adaptation, we propose vocabulary adaptation (Figure 1), a method of directly adapting the vocabulary (and embedding layers) of a pre-trained NMT model to a target domain, to perform effective fine-tuning (§ 3). Given an NMT model pre-trained in a source domain, we first induce a wide coverage of target-domain word embeddings from target-domain monolingual data. We then fit the obtained target-domain word embeddings to the embedding space of the pre-trained NMT model by inducing a cross-domain projection from the target-domain embedding space to the source-domain embedding space. To perform this cross-domain embedding projection, we explore two methods: cross-lingual (Xing et al., 2015) and cross-task embedding projection (Sakuma and Yoshinaga, 2019).
We evaluate fine-tuning with the proposed vocabulary adaptation for two domain pairs: 1) from JESC (Pryzant et al., 2018) to ASPEC (Nakazawa et al., 2016) for English to Japanese translation (En→Ja) and 2) from the IT domain to Law domain (Koehn and Knowles, 2017) for German to English translation (De→En). Experimental results demonstrate that our vocabulary adaptation improves the BLEU scores (Papineni et al., 2002) of fine-tuning (Luong and Manning, 2015) by 3.86 points (21.45 to 25.31) for En→Ja and 3.28 points (24.59 to 27.87) for De→En ( § 5). Moreover, it shows further improvements when combined with back-translation (Sennrich et al., 2016a).
The contributions of this paper are as follows.
• We empirically confirmed that vocabulary mismatches hindered domain adaptation.
• We established an effective, model-free fine-tuning for NMT that adapts the vocabulary of a pre-trained model to a target domain.
• We showed that vocabulary adaptation exhibited additive improvements over back-translation that uses monolingual corpora.

Related Work
In this section, we first review two approaches to supervised domain adaptation in NMT: multi-domain learning and fine-tuning. We then introduce unsupervised domain adaptation using target-domain monolingual data and approaches to unknown word problems in NMT.
Multi-domain learning induces an NMT model from parallel data in both source and target domains (Kobus et al., 2017; Wang et al., 2017; Britz et al., 2017). Since this approach requires training with a massive amount of source-domain parallel data, the training cost becomes problematic when we perform adaptation to many target domains.
Fine-tuning (or continued learning) is a standard domain adaptation method in NMT. Given an NMT model pre-trained with a massive amount of source-domain parallel data, it continues the training of this pre-trained model with a small amount of target-domain parallel data (Luong and Manning, 2015; Chu et al., 2017; Bapna and Firat, 2019; Gu et al., 2019). Due to the small cost of training, research trends have shifted to fine-tuning from multi-domain learning. Recent studies focus on model architectures, training objectives, and strategies in training. Meanwhile, no attempts have been made to resolve the vocabulary mismatch problem in domain adaptation.
Unsupervised domain adaptation exploits target-domain monolingual data to train a language model to support the model's decoder in generating natural sentences in a target domain (Gülçehre et al., 2015; Domhan and Hieber, 2017). Data augmentation using back-translation (Sennrich et al., 2016a; Hu et al., 2019) is another approach to using target-domain monolingual data. These approaches can partly address the problem of semantic shift. However, it is possible that the source-domain encoder will fail to handle target-domain-specific words. In such cases, a decoder with the target-domain language model becomes less helpful in the former approach, and the generated pseudo-parallel corpus has low-quality sentences on the encoder side in the latter approach.
Handling unknown words has been extensively studied for NMT since the vocabulary size of an NMT model is limited due to practical requirements (e.g., GPU memory) (Jean et al., 2015). The current standard approach to the unknown word problem is to use token units shorter than words such as characters (Ling et al., 2015; Luong and Manning, 2016) and subwords (Sennrich et al., 2016b; Kudo, 2018) to handle rare words as a sequence of known tokens. However, more drastic semantic shifts will occur for characters or subwords than for words because they are shorter than words and naturally ambiguous.
Besides these studies mentioned above, Aji et al. (2020) reported that transferring embeddings and vocabulary mismatches between parent and child models significantly affected the performance of models also in cross-lingual transfer learning.
In this study, we aim to provide pre-trained NMT models with functionality that directly handles both target-domain-specific unknown words and semantic shifts by exploiting cross-domain embeddings learned from target-domain data.

Vocabulary Adaptation for Domain Adaptation in NMT
As we have discussed (§ 1), vocabulary mismatches between source and target domains are an important challenge in domain adaptation for NMT. This section proposes fine-tuning-based methods of directly resolving this problem. Although our methods are applicable to any NMT model with embedding layers, we assume here subword-based encoder-decoder models (Bahdanau et al., 2015; Vaswani et al., 2017) for clarity.

Vocabulary Adaptation Prior to Pre-training
One simple approach is to use target-domain vocabularies in pre-training. Specifically, we first construct vocabularies from target-domain data for each language. We then pre-train an NMT model in a source domain with the target-domain vocabularies and embeddings. Finally, we fine-tune the pre-trained model with target-domain parallel data.
In this approach, however, employing the target-domain vocabularies will hinder pre-training in the source domain. In addition, since the embeddings induced from the target-domain data are tuned to the source domain, the problem of semantic shifts still remains and will hinder fine-tuning.

Vocabulary Adaptation Prior to Fine-tuning
Another approach is to replace the encoder's embeddings and the decoder's embeddings of the pretrained NMT model with word embeddings induced from target-domain data before fine-tuning. However, as in transplanting organs from a donor to a recipient, this causes rejection; the embedding space of a pre-trained model is irrelevant to the space of the target-domain word embeddings. We therefore project the target-domain word embeddings onto the embedding space of the pretrained model in order to make the embeddings compatible with the pre-trained model ( Figure 1 in § 1). This approach is inspired by cross-lingual and cross-task word embeddings that bridge word embeddings across languages and tasks.
An overview of our proposed method is given as follows.
Step 1 (Inducing target-domain embeddings) We induce word embeddings from monolingual data in the target domain for each language. Although we can use any method for induction, we adopt Continuous Bag-of-Words (CBOW) (Mikolov et al., 2013) here since CBOW is effective for initializing embeddings in NMT (Neishi et al., 2017), which suggests embedding spaces of CBOW and NMT are topologically similar.
Step 2 (Projecting embeddings across domains) We project the target-domain embeddings of the source and target languages into the embedding spaces of the pre-trained encoder and decoder, respectively, to obtain cross-domain embeddings ( § 3.2.1, § 3.2.2).
Step 3 (Fine-tuning) We replace the vocabularies and the embedding layers with the cross-domain embeddings and apply fine-tuning using the target-domain parallel data.
To induce cross-domain embedding projection, we regard the two domains as different languages/tasks and explore the use of methods for inducing cross-lingual (Xing et al., 2015) and cross-task word embeddings (Sakuma and Yoshinaga, 2019). In what follows, we explain each method.

Vocabulary Adaptation by Linear Transformation
The first method exploits an orthogonal linear transformation (Xing et al., 2015) to obtain cross-lingual word embeddings. We use subwords shared across two domains for inducing an orthogonal linear transformation from the embeddings of the target domain to the embeddings of the source domain. The obtained linear transformation is used to map all embeddings of the target domain to the embedding space of the source domain to address semantic shift across domains.
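As an illustration of the idea, the following minimal sketch fits an orthogonal map on the shared subwords and then applies it to every target-domain embedding. It assumes the closed-form SVD solution of the orthogonal Procrustes problem as the solver, which may differ from the optimization procedure actually used with Xing et al. (2015); the function name `fit_orthogonal_map` and the toy data are hypothetical.

```python
import numpy as np

def fit_orthogonal_map(tgt_shared, src_shared):
    """Return the orthogonal W minimizing ||tgt_shared @ W - src_shared||_F.

    Rows of both matrices are embeddings of the same shared subwords,
    in the same order (assumed to have equal dimensionality here).
    """
    # Cross-covariance between the two spaces over the shared subwords.
    cross = tgt_shared.T @ src_shared
    u, _, vt = np.linalg.svd(cross)
    # Orthogonal Procrustes solution: the orthogonal polar factor of `cross`.
    return u @ vt

# Toy data (assumption): the source space is an exact orthogonal
# transformation of the target space for the shared subwords.
rng = np.random.default_rng(0)
dim, n_shared, n_tgt_only = 8, 50, 5
true_map, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tgt_all = rng.normal(size=(n_shared + n_tgt_only, dim))
src_shared = tgt_all[:n_shared] @ true_map

w = fit_orthogonal_map(tgt_all[:n_shared], src_shared)
# Map all target-domain embeddings, including target-only subwords.
projected = tgt_all @ w
```

Because the map is global, every target-domain subword is moved by the same transformation, which is the property the next subsection relaxes.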

Vocabulary Adaptation by Locally Linear Mapping
Due to the difference between the domains and tasks (CBOW and NMT) in inducing the embeddings, the linear transformation is likely to fail. Thus, we employ a recent method for cross-task embedding projection called "locally linear mapping" (LLM) (Sakuma and Yoshinaga, 2019). An overview is illustrated in Figure 1 (lower left). LLM learns a projection that preserves the local topology (positional relationships) of the original embeddings after mapping while disregarding the global topology. This property of LLM is suited to our situation because the local topology is expected to be the same across the semantic spaces of two domains, while globally, they can be significantly different due to semantic shift between domains as illustrated in Figure 2.
Here, we explain the essence of LLM. Interested readers may consult Sakuma and Yoshinaga (2019) for details. Suppose that $T^{\mathrm{LM}}$ is the word embeddings of the target domain induced by a language model task, and $S^{\mathrm{NMT}}$ is the word embeddings of the source domain induced by the translation task (the embedding layer of the pre-trained model). We denote the vocabulary of $T^{\mathrm{LM}}$ by $V_T$, the vocabulary of $S^{\mathrm{NMT}}$ by $V_S$, and the vocabulary of words shared across both domains by $V_{\mathrm{shared}} = V_T \cap V_S$. Our goal is to produce embeddings $T^{\mathrm{NMT}}$ with a vocabulary of $V_T$ in the embedding space of $S^{\mathrm{NMT}}$. We accomplish this by computing the $T^{\mathrm{NMT}}$ that best preserves the local topology of $T^{\mathrm{LM}}$ in the embedding space of $S^{\mathrm{NMT}}$. Concretely, for each word $w_i$ in $V_T$, we first take the $k$-nearest neighbors $N(w_i) \subset V_{\mathrm{shared}}$ in $T^{\mathrm{LM}}$. We use cosine similarity as the metric for the nearest neighbor search.
Second, we learn the local topology around $w_i$ by reconstructing $T^{\mathrm{LM}}_{w_i}$ from the embeddings of its nearest neighbors as a weighted average. For this purpose, we minimize the following objective:
$$\hat{\alpha}_i = \operatorname*{argmin}_{\alpha_i} \left\| T^{\mathrm{LM}}_{w_i} - \sum_{w_j \in N(w_i)} \alpha_{ij} T^{\mathrm{LM}}_{w_j} \right\|^2$$
with the constraint of $\sum_{w_j \in N(w_i)} \alpha_{ij} = 1$; the method of Lagrange multipliers gives the analytical solution.
We then compute the embedding $T^{\mathrm{NMT}}_{w_i}$ that preserves the local topology by minimizing the following objective function:
$$\hat{T}^{\mathrm{NMT}}_{w_i} = \operatorname*{argmin}_{T^{\mathrm{NMT}}_{w_i}} \left\| T^{\mathrm{NMT}}_{w_i} - \sum_{w_j \in N(w_i)} \hat{\alpha}_{ij} S^{\mathrm{NMT}}_{w_j} \right\|^2$$
This optimization problem has the trivial solution:
$$T^{\mathrm{NMT}}_{w_i} = \sum_{w_j \in N(w_i)} \hat{\alpha}_{ij} S^{\mathrm{NMT}}_{w_j}$$
Note that subwords shared across domains will have different embeddings after projection ($T^{\mathrm{NMT}}_w \neq S^{\mathrm{NMT}}_w$ for $w \in V_{\mathrm{shared}}$). This captures the semantic shift of subwords across domains. We conduct a detailed analysis of this matter in § 6.3.
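The locally linear mapping described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the released implementation: the function name `llm_project`, the small ridge term `reg` (added only to keep the local Gram matrix invertible), and the choice to exclude a shared word from its own neighborhood are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def llm_project(t_lm, s_nmt_shared, shared_idx, k=10, reg=1e-3):
    """Project target-domain embeddings into the source NMT space.

    t_lm:         (|V_T|, d_lm) target-domain embeddings.
    s_nmt_shared: (|V_shared|, d_nmt) source NMT embeddings of shared words.
    shared_idx:   row indices into t_lm of the shared words, aligned with
                  the rows of s_nmt_shared.
    """
    t_shared = t_lm[shared_idx]
    # Cosine similarity between every target word and every shared word.
    sims = l2_normalize(t_lm) @ l2_normalize(t_shared).T
    # Assumption: a shared word is not allowed to be its own neighbor.
    sims[shared_idx, np.arange(len(shared_idx))] = -np.inf
    out = np.empty((t_lm.shape[0], s_nmt_shared.shape[1]))
    for i in range(t_lm.shape[0]):
        nbrs = np.argpartition(-sims[i], k - 1)[:k]
        # With sum(alpha) = 1, the reconstruction error equals
        # alpha^T G alpha, where G is the Gram matrix of the differences.
        diff = t_shared[nbrs] - t_lm[i]
        gram = diff @ diff.T
        gram += (reg * np.trace(gram) + 1e-12) * np.eye(k)
        # Lagrange-multiplier solution: alpha is proportional to G^{-1} 1.
        alpha = np.linalg.solve(gram, np.ones(k))
        alpha /= alpha.sum()
        # Reuse the same weights on the neighbors' source NMT embeddings.
        out[i] = alpha @ s_nmt_shared[nbrs]
    return out

# Toy example: words a, b, c are shared; x exists only in the target domain
# and sits midway between a and b in the target space.
t_lm = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [0.5, 0.5]])
shared_idx = np.array([0, 1, 2])
s_nmt_shared = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 5.0]])
t_nmt = llm_project(t_lm, s_nmt_shared, shared_idx, k=2)
```

In this toy case, x is mapped midway between the source embeddings of a and b, while the shared word a receives an embedding that differs from its original source-domain embedding, mirroring the note on shared subwords.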

Experimental Setup
We conducted fine-tuning with our vocabulary adaptation for domain adaptation in En→Ja and De→En machine translation. In what follows, we describe the setup of our experiments.

Datasets and Preprocessing
We selected domain pairs to simulate a plausible situation where the target domain is specialized and similar source-domain parallel data is not available.
For En→Ja translation, we chose the Japanese-English Subtitle Corpus (JESC) (Pryzant et al., 2018) as the source domain and Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016) as the target domain. JESC was constructed from subtitles of movies and TV shows, while ASPEC was constructed from abstracts of scientific papers. These domains are substantially distant, and ASPEC contains many technical terms that are unknown in the JESC domain. We followed the official splitting of training, development, and test sets, except that the last 1,000,000 sentence pairs were omitted in the training set of the ASPEC corpus as they contain low-quality translations.
For De→En translation, we adopted the dataset constructed by Koehn and Knowles (2017) from the OPUS corpus (Tiedemann, 2012). This dataset includes multiple domains that are distant from each other and is suitable for experiments on realistic domain adaptation. We chose the IT domain and the Law domain from the dataset as the source and target domain, respectively. We followed the same splitting of training, development, and test sets as Koehn and Knowles (2017).
Preprocessing. As preprocessing for the En→Ja datasets, we first tokenized the parallel data using the Moses toolkit (v4.0) for English sentences and KyTea (v0.4.2) for Japanese sentences. We then truecased the English sentences by using the script in the Moses toolkit. As for the De→En datasets, we used the same tokenization and truecasing as Koehn and Knowles (2017). The statistics of the datasets are listed in Table 1. We applied SentencePiece (v0.1.83) (Kudo and Richardson, 2018) trained from the monolingual data in each domain to the tokenized datasets. The number of subwords was 16,000 for all languages. In the training of SentencePiece, we did not concatenate the input language and output language to maximize the portability of the pre-trained model.
From each of the preprocessed datasets, we used 1) 100,000 randomly sampled sentence pairs or 2) all sentence pairs in the training set for training in the target domain. This was for evaluating models in both cases where we have a small/large target-domain dataset.
To prepare reproducible target-domain monolingual data, we shuffled and divided all sentence pairs of the target-domain training set except the 100,000 sentence pairs into two equal portions. We then used the first half and the second half as simulated monolingual data for the source language and the target language, respectively. The monolingual data was used for training SentencePiece and CBOW vectors in the target domain and data augmentation by back-translation. When models did not use the monolingual data, the data used for training SentencePiece and CBOW vectors was exactly identical to the training set in each domain.
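The data split described above can be sketched as follows. This is a minimal illustration, assuming that the 100,000 parallel pairs are taken as the first items of a seeded shuffle; the function name `simulate_monolingual` and the toy sentence pairs are hypothetical.

```python
import random

def simulate_monolingual(pairs, n_parallel=100_000, seed=0):
    """Split (source, target) pairs into parallel and simulated monolingual data."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    parallel, rest = shuffled[:n_parallel], shuffled[n_parallel:]
    half = len(rest) // 2
    # First half keeps only the source side, second half only the target
    # side, so no sentence pair contributes to both monolingual corpora.
    mono_src = [src for src, _ in rest[:half]]
    mono_tgt = [tgt for _, tgt in rest[half:]]
    return parallel, mono_src, mono_tgt

# Toy usage with ten sentence pairs and four kept as parallel data.
pairs = [(f"s{i}", f"t{i}") for i in range(10)]
parallel, mono_src, mono_tgt = simulate_monolingual(pairs, n_parallel=4)
```

Keeping the two halves disjoint means the simulated monolingual corpora cannot be re-aligned into parallel data.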

Models and Embeddings
We adopted Transformer-base (Vaswani et al., 2017) implemented in fairseq (v0.8.0) (Ott et al., 2019) as the core architecture for the NMT models. Major hyperparameters are shown in Table 2. We evaluated the performance of the models on the basis of BLEU (Papineni et al., 2002). Before pre-training the models, we induced subword embeddings from the monolingual corpus by Continuous Bag-of-Words (CBOW) (Mikolov et al., 2013) to initialize the embedding layers of the NMT models.
To evaluate the effect of vocabulary adaptation, we compared the following settings (and their combinations) that used either or both of the source- and target-domain parallel data.
Out-/In-domain trains a model only from the training set in the source/target domain.
Fine-tuning w/ source-domain vocab. (FT-srcV) continues to train the Out-domain model using the training set in the target domain without any vocabulary adaptation (Luong and Manning, 2015).
Fine-tuning w/ target-domain vocab. (FT-tgtV): refer to § 3.1.
Multi-domain learning (MDL) trains a model from both source and target domain training sets. We employed domain token mixing (Britz et al., 2017) as a method of multi-domain learning. In this setting, we jointly used the source and target domain training sets for training subword tokenization models, CBOW vectors, and training NMT models (e.g., 2797k + 100k for En→Ja translation).
Vocabulary Adaptation (VA): refer to § 3.2. We compared two projection methods: linear orthogonal transformation (VA-Linear, § 3.2.1) and locally linear mapping (VA-LLM, § 3.2.2). For VA-LLM, the number of nearest neighbors, k, was fixed to 10. To highlight the importance of embedding projection for the proposed method, we also evaluated settings using the target-domain CBOW vectors for the re-initialization as is (VA-CBoW).
Back-translation (BT) applies a backward translation to target-domain monolingual data in the target language. We employed the most standard back-translation proposed by Sennrich et al. (2016a). For this back-translation, a backward model (e.g., Ja → En) is independently trained from the source-domain parallel data with the same setting and data as Out-domain. The subsequent fine-tuning is applied with the generated pseudo-parallel target-domain corpora and a target-domain training set.
Among the above methods, Out-domain and In-domain do not perform domain adaptation. FT-srcV, FT-tgtV, and MDL are baseline domain adaptation methods. BT is applied to FT-srcV, FT-tgtV, and VA for data augmentation.
Note that FT-tgtV and MDL assume that the target domain is given before training with the source-domain data. Although this assumption enables us to build a suitable vocabulary for the target domain, it sacrifices the domain portability of trained models. As a result, it requires us to perform long training for each combination of a source and a target domain.
We used Adam (Kingma and Ba, 2015) to train each model with the above settings. During both pre-training and fine-tuning, the learning rate linearly increased for warm-up for the first 4,000 training steps and then decayed proportionally to the inverse square root of the number of updates. Prior to fine-tuning, we reset the optimizer and the learning rate and then continued training on the training set in the target domain.
We evaluated VA-LLM with k = {1, 5, 10, 20}, and the default value (k = 10) was the best.
... (Koehn and Knowles, 2017). There were large differences in the performance among VA-* models that perform vocabulary adaptation prior to fine-tuning. The results confirmed that not only the differences in the vocabulary (set of subwords) but also the initial embeddings matter in fine-tuning NMT models. VA-* methods did not work well in En→Ja translation when only the 100k target-domain parallel data was used. This is probably because the noisier embeddings (ambiguous subwords) introduced by the large number of domain-specific words in the ASPEC dataset (Table 1) hinder the embedding projection of VA-LLM and VA-Linear with low-quality CBOW vectors trained from the 100k sentences. In this setting, we need more parallel data for fine-tuning to adjust the noisy initial embeddings.
Table 4 shows results of ablation tests to examine for which side (encoder or decoder) VA-LLM benefited. The results confirmed that the poor performance in En→Ja translation with the 100k target-domain parallel data is due to the failure of handling semantic shifts in the decoder.
Table 5: Case-sensitive BLEU scores when employing target-domain monolingual data (950k for En→Ja and 308k for De→En). +BT indicates that monolingual data was used also for data augmentation.

BLEU Scores
The improvements obtained by VA-Linear were modest overall. This was due to the nature of the linear projection employed for cross-domain embedding mapping as discussed in § 3.2.2. We analyze the difference between the two types of projected embeddings in § 6.3.
Table 5 shows how employing target-domain monolingual data affected domain adaptation. In the settings, the SentencePiece and CBOW vectors of the target domain were trained from both the 100k parallel data and the monolingual data (950k and 308k for En→Ja and De→En, respectively). We also evaluated the orthogonality of the proposed method to BT since both methods exploit target-domain monolingual data.

Effects of Monolingual Data
Interestingly, the results of FT-tgtV and VA-Linear were worse than the results in Table 3. We consider the reason to be as follows. When additionally using the target-domain monolingual data, the resulting SentencePiece model and CBOW vectors become more suitable for the target domain thanks to the increase of data. However, this also means that target-domain-specific words appearing only in the monolingual data accelerated the vocabulary mismatches, the semantic shifts, and the difference of topology in the embedding space. As a result, the vocabulary mismatches degraded the pre-trained model of the source domain for FT-tgtV, and the linear transformation failed to handle the semantic shifts for VA-Linear.
In contrast, due to the capability of the projection method, the performance of VA-LLM was successfully improved by the use of the monolingual data. Table 5 also shows the orthogonality of VA-LLM to BT, since the increases in BLEU scores of VA-LLM + BT over FT-srcV + BT were substantial (5.10 pt and 2.61 pt for En→Ja and De→En translation, respectively).
Table 6 shows the number of updates until convergence in En→Ja translation with the 100,000 target-domain training set. We confirmed that all models were trained over a sufficient number of steps. The validation loss did not improve over at least five epochs after the best model was chosen. We used four GPUs (NVIDIA Quadro P6000) for training, and it took 0.9 sec/update on average. Here, we emphasize that VA-LLM achieved superior performance with a small number of updates (3,200 steps, less than 50 minutes) similarly to FT-srcV. Note that the overhead time of our vocabulary adaptation was negligible since embedding projection took only several minutes. Meanwhile, FT-srcV + BT took 31,280 steps due to the size of the augmented data even when we ignore the time taken to generate back-translated parallel data.

On Efficiency: Training Steps
Additionally, our proposed method is based on fine-tuning, and, unlike MDL, the target domain is not supposed to be given before pre-training in the source domain. Therefore, the pre-trained Out-domain model can be reused whenever the target domain or settings are changed, which enables us to omit the long training time (28,750 steps, about 7.2 hours) per model training. As the training steps of VA-LLM + BT show, the overhead caused by employing the proposed method with back-translation was also small. Nevertheless, the improvements of VA-LLM + BT compared with FT-srcV + BT were substantial (Table 5).

Input (JESC vocab.)    3 cases of the lu m bar spinal_1 can al ste no s is_2 · · ·
Input (ASPEC vocab.)   3 cases of the lumbar spinal_1 canal stenosis_2 · · ·

Input (IT vocab.)      falls der Austausch der Rat if ik ation s ur ku nden_1 zwischen · · ·
Input (Law vocab.)     falls der Austausch der Ratifikation surkunde n_1 zwischen · · ·
Reference              should the instruments of ratification_1 be exchanged between · · ·
FT-srcV                if the exchange of the ratification of ratification between · · ·
FT-srcV + BT           where the exchange of the Council takes place between · · ·
VA-LLM + BT            if the instruments of ratification_1 are met between · · ·

Table 7: Translation examples of the models with 100k target-domain parallel data in Table 3 and Table 5. Bolded words are rare or unknown in source domain. Underlined words and subscript numbers indicate correspondence. Input (JESC, IT) and Input (ASPEC, Law) were fed to FT-srcV/FT-srcV + BT and VA-LLM + BT, respectively.

Analysis

Translation Examples
Table 7 shows translation examples generated by FT-srcV in Table 3, FT-srcV + BT and VA-LLM + BT in Table 5. The size of target-domain parallel data for training was 100k.
FT-srcV and FT-srcV + BT often failed to translate target-domain-specific words that were tokenized into short subwords. In such cases, the models tended to ignore or transliterate them. For instance, the De→En examples (lower) show that FT-srcV and FT-srcV + BT failed in translating "Ratifikationsurkunden (instruments of ratification)." Moreover, in the En→Ja examples (upper), the decomposed target-domain-specific words "脊柱 (spinal)" and "狭窄症 (stenosis)" contained target-domain-specific subwords such as "脊" and "窄." The models without vocabulary adaptation also failed to handle these subwords when both the source-domain training set and the target-domain 100k training set rarely contained them.
Meanwhile, VA-LLM + BT successfully translated both of the cases with the help of target-domain monolingual data. These examples imply the difficulty in translating target-domain-specific words without vocabulary adaptation.
We observed that VA-LLM + BT generated various target-domain-specific words. To quantitatively confirm this, we calculated the percentage of distinct words included in both the generated outputs and the references. The outputs in En→Ja translation generated by VA-LLM + BT, FT-srcV + BT, and FT-srcV contained 57.9%, 53.4%, and 49.5% of distinct words in the references, respectively.
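The coverage figure reported above can be read as the share of distinct reference words that also appear in the system outputs. The sketch below shows one way to compute it, assuming whitespace-tokenized text and corpus-level type sets; the function name `distinct_word_coverage` and the toy sentences are hypothetical, and the exact tokenization used for the reported numbers is not specified here.

```python
def distinct_word_coverage(outputs, references):
    """Percentage of distinct reference words that also occur in the outputs."""
    out_types = {w for sent in outputs for w in sent.split()}
    ref_types = {w for sent in references for w in sent.split()}
    if not ref_types:
        return 0.0
    return 100.0 * len(out_types & ref_types) / len(ref_types)

# Toy usage: three of the four distinct reference words are generated.
references = ["a b c", "c d"]
outputs = ["a b", "e c"]
coverage = distinct_word_coverage(outputs, references)
```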

Effect of Vocabulary Size in Fine Tuning
As reported by Sennrich and Zhang (2019), the vocabulary size of an NMT model can affect its translation quality in a low-resource setting. How about in fine-tuning? To explore this, we varied only the target-domain vocabulary size of VA-LLM before fine-tuning by vocabulary adaptation. Figure 3 shows that VA-LLM preferred large vocabulary sizes when additional target-domain monolingual data was used for training CBOW, whereas it preferred small vocabulary sizes when the data was not used. We consider the reason to be as follows. Without the monolingual data, a large vocabulary contains low-frequency subwords whose representations are unlikely to be well trained, as discussed by Sennrich and Zhang (2019). With the monolingual data, however, the additional text can cover such low-frequency subwords.
As this analysis showed, the vocabulary size also had large effects on fine-tuning (3.52 pt difference at most). Beyond resolving the vocabulary mismatch problem, our vocabulary adaptation could yield further improvements, at a low training cost, if the vocabulary size were adjusted depending on the amount of target-domain parallel and monolingual data.

Quality of Cross-domain Embeddings
The advantage of our approach is that it adjusts the meanings of subwords (embeddings) as well as the vocabulary (set of subwords) to the target domain. We thus examined to what extent our vocabulary adaptation captures the semantic shift. We first observed the nearest neighbors based on cosine similarity for each of the subword embeddings in the target domain (hereafter, ASPEC-CBOW). Note that the nearest neighbors should be unchanged even after embedding projection to keep the meanings learned in the target domain.
Next, we computed cosine similarities between each of the projected ASPEC-CBOW embeddings and the embeddings of Out-domain to find their nearest neighbors in the embedding space of Out-domain (hereafter, JESC-NMT). The obtained nearest neighbors show how the ASPEC-CBOW embeddings projected by linear transformation or LLM performed during fine-tuning. Table 8 shows the nearest neighbors of two words: " branches," which appears in both domains and can have different meanings across domains, and " experimentally," which is only in the ASPEC domain. Throughout this analysis, the candidates of nearest neighbors were limited to the subwords shared across the JESC and ASPEC domains for clear comparison.
While the CBOW vector for " branches" and the embedding projected by LLM have the meaning of " veins" and " arteries", the embedding projected by linear transformation lost it. " experimentally" is a subword that only the target-domain (ASPEC) vocabulary contains. As illustrated in Figure 2, the mapping of target-domain-specific subword embeddings is likely to fail due to the difference of topology in the embedding space. We found that LLM relatively accurately computed its embedding in the JESC-NMT space while linear transformation failed. This tendency was also observed when using only the 100k parallel data for training of SentencePiece and CBOW vectors. These observations demonstrate the capability of LLM in cross-task/domain embedding projection.
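The nearest-neighbor inspection used in this analysis can be sketched as a cosine-similarity ranking over a restricted candidate set. The helper name `nearest_neighbors` and the toy vectors below are hypothetical; in the analysis, the candidates would be the embeddings of subwords shared across the two domains.

```python
import numpy as np

def nearest_neighbors(query, candidates, names, top_n=3):
    """Rank candidate embeddings by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ q
    order = np.argsort(-sims)[:top_n]  # highest similarity first
    return [(names[i], float(sims[i])) for i in order]

# Toy usage with three candidate subwords in a 2-dimensional space.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
names = ["x", "y", "xy"]
neighbors = nearest_neighbors(np.array([2.0, 0.1]), candidates, names, top_n=2)
```

Running the same query before and after projection, against the same candidate set, gives a direct check of whether the neighbors learned in the target domain are preserved.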

Conclusion
In this study, we tackled the vocabulary mismatch problem in domain adaptation for NMT, and we proposed vocabulary adaptation, a simple but direct solution to this problem. It adapts the vocabulary of a pre-trained NMT model to a target domain for performing effective fine-tuning. Regarding domains as independent languages/tasks, our method makes wide-coverage word embeddings induced from target-domain monolingual data compatible with a model pre-trained in a source domain.
We explored two methods for projecting word embeddings across two domains: linear transformation and locally linear mapping (LLM). The experimental results for English to Japanese translation and German to English translation confirmed that our domain adaptation method with LLM dramatically improved the translation performance.
Although the vocabulary adaptation was evaluated only for NMT, it is also applicable to a wider range of neural network models and tasks, and it can even be combined with existing fine-tuning-based domain adaptations. We will release all code to promote the reproducibility of our results.