Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation

To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component’s contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, compared to a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.


Introduction
Neural Machine Translation (NMT) has supplanted Phrase-Based Machine Translation (PBMT) as the standard for high-resource machine translation. This has necessitated new domain adaptation methods, because PBMT adaptation methods primarily rely on adapting the language model and phrase table using interpolation or back-off schemes (see §2). Continued training (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016), also referred to as fine-tuning, is one of the most popular methods for NMT adaptation, due to its strong performance.
In contrast to the PBMT literature, little research has focused on why continued training is effective or on what happens to NMT models during continued training. Motivated by domain adaptation analysis in PBMT (Haddow and Koehn, 2012; Duh et al., 2010; Irvine et al., 2013), this work proposes a simple freezing subnetworks technique and uses it to gain insight into how the various components of an NMT system behave during continued training. We segment the model into five subnetworks, which we refer to as components, shown in Figure 1: the source embeddings, encoder, decoder (which includes the attention mechanism), the softmax (used to denote the decoder output embeddings and biases), and the target embeddings.
We freeze components one at a time during continued training to see how much the adaptation depends on each component. We also experiment with freezing everything except one component to determine each component's capacity to adapt to the new domain on its own.
In order to further analyze continued training, we examine the magnitude of change in model components during continued training of the network, under both normal and freezing training conditions. We also conduct sensitivity analysis of each component to assist in interpreting these magnitudes.
Our NMT adaptation experiments are performed across three languages: we translate from German, Korean, and Russian into English.

Related Work
Continued training has recently become a standard for domain or cross-lingual adaptation in several neural NLP applications. In PBMT, the most prominent methods focus on adapting the language model component (Moore and Lewis, 2010), and/or the translation model (Matsoukas et al., 2009; Mansour and Ney, 2014; Axelrod et al., 2011), or on interpolating in-domain and out-of-domain models (Lu et al., 2007; Foster et al., 2010; Koehn and Schroeder, 2007). In contrast, the methods employed in NMT tend to utilize continued training, which involves initializing the model with pre-trained weights (trained on out-of-domain data) and training/adapting it to the in-domain data. Among others, Luong and Manning (2015) and Freitag and Al-Onaizan (2016) applied this method for domain adaptation. Chu et al. (2017) mix in-domain and out-of-domain data during continued training in order to adapt to multiple domains. Continued training has also been applied to cross-lingual transfer learning for NMT, with Zoph et al. (2016) and Nguyen and Chiang (2017) using it for transfer between high- and low-resource language pairs. Continued training is effective on a range of data sizes. In-domain gains have been shown with as few as dozens of in-domain training sentences (Miceli Barone et al., 2017), and recent work has explored continued training on single sentences (Farajian et al., 2017; Kothur et al., 2018).
Similar adaptation techniques are also employed in the field of Automatic Speech Recognition, where continued training has been the basis of cross-lingual transfer learning approaches (Grézl et al., 2014; Kunze et al., 2017). Usually, the lower layers of the network, which perform acoustic modeling, are frozen and only the upper layers are updated. In a similar vein, other works (Swietojanski and Renals, 2014; Vilar, 2018) adapt a network to a new domain by learning additional weights that re-scale the hidden units.

Data
Our experiments are carried out across three language pairs, from Russian, Korean, and German into English. Basic statistics on the datasets used for our experiments are summarized in Table 2. The three languages represent three different domain adaptation scenarios:

Table 3: Example sentences to illustrate domain differences.
OpenSubtitles: You're gonna need a bigger boat.
WMT: Intensified communication and sharing of information between the project partners enables the transfer of expertise in rural tourism.
WIPO: The films coated therewith, in particular polycarbonate films coated therewith, have improved properties with regard to scratch resistance, solvent resistance, and reduced oiling effect, said films thus being especially suitable for use in producing plastic parts in film insert molding methods.

Out-of-domain Data
For our out-of-domain dataset we utilize the OpenSubtitles2018 corpus (Tiedemann, 2016; Lison and Tiedemann, 2016), which consists of translated movie subtitles. 1 For De-En and Ru-En, we also use data from WMT 2017 (Bojar et al., 2017), 2 which contains data from several sources: Europarl (parliamentary proceedings) (Koehn, 2005), 3 News Commentary (political and economic news commentary), 4 Common Crawl (web-crawled parallel corpus), and the EU Press Releases.
We use the final 2500 lines of OpenSubtitles2018 for the development set. For German and Russian we also concatenate newstest2016 as part of the development set. newstest2016 consists of translated news articles released by WMT for its shared task. In Korean, we rely only on the OpenSubtitles2018 data. See Table 3 for example sentences from WMT and OpenSubtitles.

In-domain Data
We perform adaptation into the World Intellectual Property Organization (WIPO) COPPA-V2 dataset (Junczys-Dowmunt et al., 2016). 5 The WIPO data consist of parallel sentences from international patent application abstracts. We reserve 3000 lines each for the in-domain development and test sets. See Table 3 for an example WIPO sentence.

Data Preprocessing
All our datasets were tokenized using the Moses 6 tokenizer. Additionally, Korean text was segmented into words using the KoNLPy wrapper of the Mecab-Ko segmenter. 7 As a final preprocessing step, we train Byte Pair Encoding (BPE) segmentation models (Sennrich et al., 2016) on the out-of-domain training corpus. We train separate BPE models for each language, each with a vocabulary size of 30,000. For each language, BPE is trained on the out-of-domain corpus only and then applied to the training, development, and test data for both out-of-domain and in-domain datasets. This mimics the realistic setting where a generic, computationally-expensive-to-train NMT model is trained once. This NMT model is then adapted to new domains as they emerge, without retraining on the out-of-domain corpus. Training BPE on the in-domain data would change the vocabulary and thus require re-building the model.

1 www.opensubtitles.org
2 statmt.org/wmt17
3 statmt.org/europarl
4 casmacat.eu/corpus/news-commentary.html
5 wipo.int/patentscope/en/data
6 statmt.org/moses/

Experimental Setup
For all language pairs, we train systems on the out-of-domain data and select the best model parameters based on perplexity on the out-of-domain development set. We then adapt the systems to our smaller, in-domain training sets. We select the best model based on the WIPO development set perplexity and report results on the WIPO test sets.

Continued Training
We define continued training as:
1. Train a model until convergence on large out-of-domain bitext.
2. Initialize a new model with the final parameters of Step 1.
3. Train the model from Step 2 until convergence on in-domain bitext.
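The three steps can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the SOCKEYE setup used in our experiments: the two-parameter linear model, the synthetic "out-of-domain" and "in-domain" data, and the names `train`, `mse`, and `frozen` are all hypothetical and chosen only for illustration.

```python
import copy


def train(params, data, lr=0.05, steps=2000, frozen=()):
    """Fit y = w * x + b by gradient descent on mean squared error.

    Parameter names listed in `frozen` are never updated; this is a
    hypothetical stand-in for freezing a subnetwork.
    """
    params = copy.deepcopy(params)
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (params["w"] * x + params["b"] - y) * x for x, y in data) / n
        grad_b = sum(2 * (params["w"] * x + params["b"] - y) for x, y in data) / n
        if "w" not in frozen:
            params["w"] -= lr * grad_w
        if "b" not in frozen:
            params["b"] -= lr * grad_b
    return params


def mse(params, data):
    """Mean squared error of the linear model on `data`."""
    return sum((params["w"] * x + params["b"] - y) ** 2 for x, y in data) / len(data)


# Toy stand-ins for the large out-of-domain and the small in-domain bitext.
out_of_domain = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(-20, 21)]
in_domain = [(x / 10, 2.5 * (x / 10) + 0.5) for x in range(-5, 6)]

# Step 1: train until (approximate) convergence on out-of-domain data.
general = train({"w": 0.0, "b": 0.0}, out_of_domain)
# Steps 2 and 3: initialize from the Step 1 parameters, then continue on in-domain data.
adapted = train(general, in_domain)
# Variant: continued training with one parameter held at its out-of-domain value.
adapted_frozen_b = train(general, in_domain, frozen=("b",))
```

In this toy setting, `adapted_frozen_b` keeps `b` at its out-of-domain value while `w` alone absorbs as much of the domain shift as it can.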

NMT Implementation and Settings
Our neural machine translation systems are trained using SOCKEYE (Hieber et al., 2017). 8 We use SOCKEYE's built-in functionality for freezing parameters. We build RNN-based encoder-decoder models with attention (Bahdanau et al., 2015), using a bidirectional RNN for the encoder. The encoder and decoder both have 2 layers with LSTM hidden sizes of 512. Source and target word vectors are also of size 512. The number of parameters in each component is given in Table 1. While training the out-of-domain models, we apply dropout with 10% probability on the RNN layers. We apply label smoothing of 0.1. We use ADAM (Kingma and Ba, 2014) as the optimizer, using a learning rate of 0.0003 and a learning rate reduce factor of 0.7. We use a batch size of 4096 words and create a checkpoint every 4000 minibatches.
We do not use dropout or label smoothing during continued training because we do not want regularization to bias our measurements of magnitude changes during continued training (see §5.3). We note, however, that each would likely increase in-domain performance. Our batch size during continued training is 128 sentences, and we create a checkpoint every half epoch. Our learning rate reduce factor for continued training is 0.5. We run each continued training experiment over a set of learning rates (0.1, 0.01, 0.001, 0.0001, 0.00001) and choose the best result based on the perplexity on the development set, as previous work has suggested that even when using ADAM, continued training can be sensitive to learning rate (Farajian et al., 2017; Li et al., 2018; Kothur et al., 2018). We use dot product attention (Luong et al., 2015), which means we do not have a separate attention component; the attention is implicitly built into the decoder.

Results and Analysis
For De-En and Ru-En, the out-of-domain models have reasonable performance on the in-domain test set. In these language pairs, freezing any single component has little impact on in-domain BLEU. The worst change is −0.9 BLEU (when freezing the De-En encoder), and in some cases we see small gains of up to 0.4 BLEU. We interpret these gains as trivial (and possibly the result of variance), but there may be an NMT continued training scenario in which freezing could increase performance by acting as a regularizer (see Ghahremani et al., 2017).
In Ko-En, where the out-of-domain model does poorly on the in-domain test set, we see more substantial drops when freezing a component during continued training. Freezing the decoder and encoder does the most harm (−3.8 and −3.3 BLEU, respectively), followed by the source embeddings and softmax components (−1.7 and −1.5 BLEU, respectively).
In all cases, freezing the target embeddings has very little impact (at most −0.2 BLEU, in Ko-En), suggesting that it is relatively unimportant during adaptation. These results show that the model and training procedure are very robust; continued training is able to find a local minimum for the new domain which has (nearly) equal performance to the one found in full training, even though an entire component is fixed to the initial out-of-domain model's values.
This robustness suggests that caution is in order when attempting to interpret changes of any single component; in particular, changes in the surrounding components must also be considered. For example, it appears that when the source embeddings are fixed, the encoder is able to compensate for the non-adapted source embeddings and adapt the system to interpret source tokens correctly in the new domain. Conversely, it appears that when the encoder is fixed, the source embeddings are able to adapt to produce vectors for source tokens which are interpreted correctly by the un-adapted encoder. Note that adaptation to source tokens in the new domain could theoretically occur in any un-frozen component, an idea further explored in the next section.

Freezing All But One Component
In our second set of experiments, we freeze all but one component during continued training to see how much each component, in isolation, is able to adapt the NMT system to the new domain. The results are shown in Figure 2 (right striped bars).
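The two experimental conditions differ only in which components are excluded from parameter updates. As a rough illustration, the full set of runs for one language pair can be enumerated as follows; the component identifiers and function names are hypothetical labels chosen for this sketch and are not SOCKEYE parameter names.

```python
# Hypothetical labels for the five components described in the introduction.
COMPONENTS = [
    "source_embeddings",
    "encoder",
    "decoder",
    "softmax",
    "target_embeddings",
]


def freeze_one(component):
    """First condition: freeze a single component, adapt the other four."""
    return {component}


def freeze_all_but_one(component):
    """Second condition: adapt a single component, freeze the other four."""
    return set(COMPONENTS) - {component}


def experiment_grid():
    """Map each experiment name to the set of components it keeps frozen."""
    grid = {"full_continued_training": set()}
    for component in COMPONENTS:
        grid[f"freeze_{component}"] = freeze_one(component)
        grid[f"adapt_only_{component}"] = freeze_all_but_one(component)
    return grid


for name, frozen in experiment_grid().items():
    print(f"{name}: frozen={sorted(frozen)}")
```

Under these assumptions, each language pair requires eleven continued training configurations: one unconstrained baseline, five with a single frozen component, and five with a single adapted component.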
We find that only adapting a single component is, somewhat surprisingly, not catastrophic in most cases. Adapting only the encoder, for example, still gives a gain of 20.1 BLEU over the out-of-domain model (3.8 BLEU worse than full continued training) in German and 11.4 BLEU (0.2 BLEU worse than full continued training) in Russian.
In De-En and Ko-En, we see that adapting just the encoder does the best, followed by the decoder, source embeddings, softmax, and target embeddings. The trend in Russian is similar but with the decoder and source embeddings switched. These experiments suggest the encoder is most able to adapt the model to a new domain in isolation. It is worth noting that the encoder achieves this despite being the component with the fewest parameters (3.7M). The target embeddings are least able to adapt the model to a new domain (consistent with §5.1).
These experiments also show that the upper bound for adapting a single component is quite high, suggesting that the upper bound for adaptation techniques using monolingual data to adapt individual components could be quite high as well. Of course, it seems unlikely that techniques using only monolingual data can achieve the same level of performance as when directly optimizing on bitext.

Magnitude of Changes During Continued Training
We are interested in the overall magnitude of the changes experienced by each component during continued training (i.e., how far each moves from the out-of-domain model) and how those changes compare to the cases where only a single component was adapted. We had two opposing hypotheses that could predict adaptation behavior when only one component is being adapted (as in §5.2):
1. The portion of the network producing the component's input is fixed, as is the portion of the network that interprets the component's output. This suggests the component will be somewhat constrained, in contrast to full continued training where the components may adapt jointly over time.
2. Since all other components are fixed, the adapting component has to bear all the responsibility for changing the entire model's behavior, requiring more drastic changes than it would have undergone during full continued training.
The root mean square (RMS) of the differences between each component in the initial out-of-domain model and the same component after continued training is shown in Table 4 (normal continued training) and Table 5 (trained individually). While further work would be required to make any definitive statements, the results favor the second hypothesis. The movement of individually adapted components tends to be larger than that of their counterparts in fully adapted models.
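The RMS measure can be sketched as follows. This is an illustrative sketch only: it assumes each component's parameters have been flattened into a single list of floats, and the function names and the toy values are hypothetical.

```python
import math


def rms_change(before, after):
    """Root mean square of element-wise differences between two flat parameter lists."""
    if len(before) != len(after) or not before:
        raise ValueError("parameter lists must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for b, a in zip(before, after))
    return math.sqrt(squared / len(before))


def rms_by_component(model_before, model_after):
    """Apply rms_change to every component shared by two {name: parameters} dicts."""
    return {
        name: rms_change(model_before[name], model_after[name])
        for name in model_before
    }


# Toy values, not taken from any trained model.
out_of_domain_model = {"encoder": [0.0, 0.0, 0.0, 0.0], "decoder": [1.0, 2.0]}
adapted_model = {"encoder": [1.0, -1.0, 1.0, -1.0], "decoder": [1.0, 2.0]}
changes = rms_by_component(out_of_domain_model, adapted_model)
```

Because the measure averages over the number of parameters, components of very different sizes can be compared on the same scale.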

Sensitivity Analysis
To assist in interpreting the overall magnitude of changes experienced during continued training, we perform sensitivity analysis of each component of the initial, out-of-domain model. In each experiment, zero-mean, independent Gaussian noise with fixed variance is added to every parameter in a single component of the model. By varying noise levels, we show how much (random) movement is required to produce a given decrease in performance. 9 Figure 3 shows the sensitivity plots for each component. Table 6 shows, for each component, the (linearly interpolated) BLEU score decrease that would result from adding random noise of the same magnitude as the change observed in full continued training.
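The two steps of this procedure, perturbing one component and reading off the BLEU decrease at the magnitude observed in continued training, can be sketched as below. This is a hedged illustration: the function names are hypothetical, the parameters are assumed to be a flat list of floats, and the noise levels and BLEU decreases in the example are made-up numbers rather than measurements.

```python
import random


def add_gaussian_noise(params, std, seed=0):
    """Add zero-mean, independent Gaussian noise with fixed standard deviation to every parameter."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, std) for p in params]


def interpolate_bleu_drop(noise_levels, bleu_drops, observed_change):
    """Linearly interpolate the BLEU decrease at `observed_change`.

    `noise_levels` must be strictly ascending and aligned with `bleu_drops`.
    """
    if not noise_levels[0] <= observed_change <= noise_levels[-1]:
        raise ValueError("observed change lies outside the measured noise range")
    points = list(zip(noise_levels, bleu_drops))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= observed_change <= x1:
            return y0 + (y1 - y0) * (observed_change - x0) / (x1 - x0)


# Made-up sensitivity curve for one component: noise level -> BLEU decrease.
levels = [0.0, 0.01, 0.02]
drops = [0.0, 1.0, 5.0]
estimated_drop = interpolate_bleu_drop(levels, drops, 0.015)
noisy = add_gaussian_noise([0.5, -0.25, 0.1], std=0.01)
```

For zero-mean noise with standard deviation σ, the expected squared RMS change of the component equals σ², which is what makes the noise level comparable to an RMS change measured after continued training.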
Considering the sensitivity of each component reveals several patterns. First, the most significant

9 Bojar et al. (2010) show that very low BLEU scores are not trustworthy. Due to the very low BLEU score (2.7) of the out-of-domain Ko-En system on the in-domain test set, we use out-of-domain test sets for each language, where BLEU scores fall between 11 and 30. This means that the BLEU scores for continued training (computed on the in-domain test set) are not directly comparable to the BLEU scores produced for sensitivity analysis. However, as the sensitivity analysis is used only as an aid in interpreting the general magnitude of BLEU shifts, we view this as an acceptable compromise.

Conclusions
This work presents and applies a simple freezing subnetworks method to analyze continued training. Freezing any single component during continued training has negligible effect on performance compared to full continued training. Furthermore, adapting only a single component via continued training produces surprisingly strong performance in most cases, achieving most of the performance gain of full continued training. That is, continued training is able to adapt the overall system to a new domain by modifying only parameters in a single component. This finding goes against the intuitive hypothesis that source embeddings must account for domain changes in the source vocabulary, target embeddings must account for changes in the target vocabulary, etc.
We note that the encoder and decoder, despite having the fewest parameters (3.7M and 6.8M, respectively, out of 56M), perform strongly across all languages. This suggests further work on adapting only a subset of parameters may be warranted (see also Vilar, 2018; Michel and Neubig, 2018).
We also perform sensitivity analysis of components and find that continued training does not move the model very far from the out-of-domain model. This suggests that the out-of-domain model, while not performing very well on the in-domain test set, is close to a good local minimum on the in-domain error surface. This finding may explain the recent success of techniques which regularize a continued training model using the initial, out-of-domain model (Miceli Barone et al., 2017; Dakwale and Monz, 2017; Khayrallah et al., 2018).