Pruning-then-Expanding Model for Domain Adaptation of Neural Machine Translation

Domain adaptation is widely used in practical applications of neural machine translation, where the goal is to achieve good performance on both general-domain and in-domain data. However, existing methods for domain adaptation usually suffer from catastrophic forgetting, large domain divergence, and model explosion. To address these three problems, we propose a "divide and conquer" method based on the importance of neurons or parameters in the translation model. In this method, we first prune the model and keep only the important neurons or parameters, making them responsible for both general-domain and in-domain translation. Then we further train the pruned model, supervised by the original whole model with knowledge distillation. Finally, we expand the model to its original size and fine-tune the added parameters for in-domain translation. We conducted experiments on different language pairs and domains, and the results show that our method achieves significant improvements over several strong baselines.


Introduction
Neural machine translation (NMT) models (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Gehring et al., 2017; Vaswani et al., 2017) are data-driven and hence require large-scale training data to achieve good performance (Zhang et al., 2019a). In practical applications, NMT models usually need to produce translations for specific domains with only a small quantity of in-domain data available, so domain adaptation is applied to address this problem. A typical domain adaptation scenario, as discussed in Freitag and Al-Onaizan (2016), is that an NMT model has been trained with large-scale general-domain data and is then adapted to specific domains, hoping that the model fits the in-domain data well while its performance does not degrade too much on the general domain.

* Corresponding author: Yang Feng. Reproducible code: https://github.com/ictnlp/PTE-NMT.
Towards this end, many researchers have made attempts. The fine-tuning method (Luong and Manning, 2015) performs in-domain training based on the general-domain model: the model is first trained on general-domain data and then continues training on in-domain data. Despite its convenience and the high quality of its in-domain translations, this method suffers from catastrophic forgetting, which leads to poor performance on the previous domains. Regularization-based methods (Dakwale and Monz, 2017; Thompson et al., 2019; Barone et al., 2017; Khayrallah et al., 2018) instead add an additional loss to the original objective so that the translation model can trade off between the general domain and the in-domain. These methods usually share all parameters between the general domain and the in-domain, under the assumption that the optimal parameter spaces for all domains overlap with each other and that retaining these overlapping parameters can balance over all domains. This assumption is feasible when the domains are similar, but it no longer holds when the divergence between domains is large. In contrast, methods with domain-specific networks (Dakwale and Monz, 2017; Wang et al., 2019; Bapna and Firat, 2019; Gu et al., 2019) are often (but not always) immune to domain divergence, as they can capture domain-specific features. Unfortunately, as the number of domains increases, the number of parameters in these methods surges. Besides, the structure of these networks needs to be carefully designed and tuned, which prevents them from being used in many cases.
Given the above, we propose a domain adaptation method that can not only deal with large domain divergence during domain transfer but also keep a stable model size even with multiple domains. Inspired by analysis work on NMT (Bau et al., 2019; Voita et al., 2019; Gu and Feng, 2020), we find that only some important parameters in a well-trained NMT model play an important role in generating the translation, and that unimportant parameters can be erased without affecting the translation quality too much. According to these findings, we can preserve the important parameters for general-domain translation while tuning the unimportant parameters for in-domain translation. To achieve this, we first train a model on the general domain and then shrink the model with neuron pruning or weight pruning, retaining only the important neurons/parameters. To ensure the model can still perform well on general-domain data, we adjust the pruned model on in-domain data with knowledge distillation, where the original whole model is used as the teacher and the pruned model as the student. Finally, we expand the model to its original size and fine-tune the added parameters on the in-domain data. Experimental results on different languages and domains show that our method can avoid catastrophic forgetting on general-domain data and achieve significant improvements over strong baselines on multiple in-domain data sets.
Our contributions can be summarized as follows:
• We prove that the parameters that are unimportant for general-domain data can be utilized to improve in-domain translation quality.
• Our model keeps superior performance over the baselines even when continually transferring to multiple domains.
• Our model fits the continual learning scenario, where the data for the previous domains is no longer available, which is the common situation in practice.

The Transformer
In our work, we apply our method within the TRANSFORMER framework (Vaswani et al., 2017), which we briefly introduce here. However, we note that our method can also be combined with other NMT architectures. We denote the input sequence of symbols as $x = (x_1, \ldots, x_J)$, the ground-truth sequence as $y^* = (y^*_1, \ldots, y^*_{K^*})$, and the translation as $y = (y_1, \ldots, y_K)$.

The Encoder & Decoder The encoder is composed of $N$ identical layers. Each layer has two sublayers: the first is a multi-head self-attention sublayer and the second is a fully connected feed-forward network. Both sublayers are followed by a residual connection and a layer normalization operation. The input sequence $x$ is first converted to a sequence of vectors, where each vector is the sum of the word embedding and position embedding of the source word $x_j$. This sequence of vectors is then fed into the encoder, and the output of the $N$-th layer is taken as the source hidden states, which we denote as $H$. The decoder is also composed of $N$ identical layers. In addition to the same two sublayers in each encoder layer, a cross-attention sublayer is inserted between them, which performs multi-head attention over the output of the encoder. The final output of the $N$-th layer gives the target hidden states $S = [s_1; \ldots; s_{K^*}]$, where $s_k$ is the hidden state of $y_k$.

The Objective We obtain the predicted probability of the $k$-th target word over the target vocabulary by applying a linear transformation and a softmax operation to the target hidden states:

$$p(y_k \mid y_{<k}, x) = \mathrm{softmax}(s_k W_o),$$

where $W_o \in \mathbb{R}^{d_{\mathrm{model}} \times |V_t|}$ and $|V_t|$ is the size of the target vocabulary. The model is optimized by minimizing the cross-entropy loss of the ground-truth sequence with teacher-forcing training:

$$L_{ce}(\theta) = -\frac{1}{K^*} \sum_{k=1}^{K^*} \log p(y^*_k \mid y^*_{<k}, x; \theta),$$

where $K^*$ is the length of the target sentence and $\theta$ denotes the model parameters.
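As a concrete illustration of the output layer and training objective above, the following is a minimal NumPy sketch. It is our own simplification, not tied to any particular toolkit; the names `S`, `W_o`, and `targets` follow the notation of this section.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def output_distribution(S, W_o):
    # p(y_k | y_<k, x) = softmax(s_k W_o), applied at every target position k
    return softmax(S @ W_o)

def cross_entropy_loss(S, W_o, targets):
    # teacher-forced cross-entropy: -(1/K*) * sum_k log p(y*_k | y*_<k, x)
    probs = output_distribution(S, W_o)
    K = len(targets)
    return -np.log(probs[np.arange(K), targets]).mean()
```

Here `S` stands in for the decoder's target hidden states and `targets` for the ground-truth token indices; a real system would compute `S` with the full Transformer stack.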

Knowledge Distillation
The knowledge distillation (KD) method (Hinton et al., 2015) distills knowledge from a teacher network into a student network. Normally, the teacher network is considered to have higher capability. A smaller student network can be trained to perform comparably, or even better, by mimicking the output distribution of the teacher network on the same data. This is usually done by minimizing the cross-entropy between the two distributions:

$$L_{KD}(\theta) = -\sum_{k=1}^{K} \sum_{v \in V_t} q(y_k = v \mid y_{<k}, x; \theta_T) \log p(y_k = v \mid y_{<k}, x; \theta),$$

where $q$ denotes the output distribution of the teacher network, and $\theta$ and $\theta_T$ denote the parameters of the student and teacher networks, respectively.
The parameters of the teacher network are usually kept fixed during the KD process.

Method
The main idea of our method is that different neurons or parameters have different importance to the translation model and hence play different roles in domain adaptation. Based on this, we distinguish them into important and unimportant ones, making the important neurons or parameters compromise between domains while the unimportant ones focus on the in-domain. Specifically, our method involves the steps shown in Figure 1. First, we train a model on the general domain and then evaluate the importance of the different neurons or parameters. Then we erase the unimportant neurons or parameters and keep only the ones related to the general domain, so that our method will not be subject to domain divergence. Next, we further adjust our model on the in-domain data under the framework of knowledge distillation (Hinton et al., 2015), with the unpruned model as the teacher and the pruned model as the student. In this way, the pruned model can regain some of the performance lost to pruning. Finally, we expand the pruned model to its original size and fine-tune the added parameters for the in-domain.

Model Pruning
Model pruning aims to find a good subset of neurons and parameters of the general-domain model while maintaining the original performance as much as possible. Therefore, under the premise of retaining most of the model's capability, we first remove the unimportant neurons or parameters to reduce the size of the whole model. To achieve this, we adopt two pruning schemes. The first is neuron pruning, where we evaluate the importance of neurons directly and then prune the unimportant neurons and the relevant parameters. The second is weight pruning, where we evaluate and prune each parameter directly.

Neuron Pruning To evaluate the importance of each neuron, we adopt a criterion based on the Taylor expansion (Molchanov et al., 2017), where we directly approximate the change in loss when removing a particular neuron. Let $h_i$ be the output produced by neuron $i$ and $H$ represent the set of other neurons. Assuming the independence of each neuron in the model, the change of loss when removing a certain neuron can be represented as:

$$|\Delta L(h_i)| = |L(H, h_i = 0) - L(H, h_i)|.$$

Then, approximating $L(H, h_i = 0)$ with a first-order Taylor polynomial around $h_i = 0$:

$$L(H, h_i = 0) = L(H, h_i) - \frac{\partial L}{\partial h_i} h_i + R_1(h_i = 0).$$

The remainder $R_1$ can be represented in the Lagrange form:

$$R_1(h_i = 0) = \frac{\partial^2 L}{\partial (h_i^2)}\bigg|_{h_i = \delta h_i} \frac{h_i^2}{2},$$

where $\delta \in (0, 1)$. Considering the use of the ReLU activation function in the model, the first derivative of the loss function tends to be constant, so the second-order term tends to be zero at the end of training. Thus, we can ignore the remainder and get the importance evaluation function:

$$\Theta(h_i) = |\Delta L(h_i)| = \left|\frac{\partial L}{\partial h_i} h_i\right|.$$

In practice, we need to accumulate the product of the activation and the gradient of the objective function w.r.t. the activation, which is easily computed during back-propagation. Finally, the evaluation function is:

$$\Theta(h_i^l) = \frac{1}{T} \sum_{t=1}^{T} \left|\frac{\partial L_t}{\partial h_i^l} h_i^l\right|,$$

where $h_i^l$ is the activation value of the $i$-th neuron of the $l$-th layer and $T$ is the number of training examples. The criterion is computed on the general-domain data and averaged over $T$.
Finally, we prune a certain percentage of neurons and relevant parameters in each target layer based on this criterion.
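The Taylor-expansion criterion above can be sketched in a few lines of NumPy. This is an illustrative simplification under the stated independence assumption, not the paper's implementation; in practice the activation-gradient products are accumulated during back-propagation rather than stored per example.

```python
import numpy as np

def taylor_importance(activations, gradients):
    # activations, gradients: arrays of shape (T, n_neurons) holding
    # h_i and dL/dh_i for each of the T training examples.
    # Theta(h_i) = (1/T) * sum_t | (dL_t/dh_i) * h_i |
    return np.abs(activations * gradients).mean(axis=0)

def prune_neurons(importance, ratio=0.1):
    # return the indices of the `ratio` fraction of neurons with the
    # lowest importance; these neurons (and their parameters) are removed
    k = int(len(importance) * ratio)
    return np.argsort(importance)[:k]
```

With a pruning ratio of 10%, as used in the experiments, the returned indices would cover the 10% least important neurons of a layer.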
Weight Pruning We adopt the magnitude-based weight pruning scheme (See et al., 2016), where the absolute value of each parameter is treated as its importance:

$$\Theta(w_{mn}) = |w_{mn}|,$$

where $w_{mn}$ denotes the parameter in the $m$-th row and $n$-th column of the weight matrix $W$. The weight matrix $W$ represents different parts of the model, e.g., the embedding layer, attention layers, output layer, etc. Finally, a certain percentage of parameters in each target parameter matrix are pruned.
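A minimal sketch of the magnitude-based criterion, assuming a dense NumPy matrix; the returned boolean mask marks the entries to keep, and the pruning ratio is the only free parameter.

```python
import numpy as np

def magnitude_prune_mask(W, ratio=0.3):
    # True = keep; the `ratio` fraction of entries of W with the
    # smallest absolute value (importance |w_mn|) is marked for pruning.
    k = int(W.size * ratio)
    keep = np.ones(W.size, dtype=bool)
    if k > 0:
        prune_idx = np.argsort(np.abs(W).ravel())[:k]
        keep[prune_idx] = False
    return keep.reshape(W.shape)
```

Applying `W * magnitude_prune_mask(W)` zeroes the pruned entries; in the method described here, the surviving entries are the ones later held fixed for the general domain.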

Knowledge Distillation
Though removing the unimportant neurons or parameters brings only limited degradation in performance, we want to further reduce this loss. To achieve this, we minimize the difference between the output distributions of the unpruned and pruned models. In this work, the general-domain model (parameters denoted as $\theta^*_G$) acts as the teacher model and the pruned model (parameters denoted as $\theta_G$) acts as the student model. So, the objective in this training phase is:

$$L_{KD}(\theta_G) = -\sum_{k=1}^{K} \sum_{v \in V_t} q(y_k = v \mid y_{<k}, x; \theta^*_G) \log p(y_k = v \mid y_{<k}, x; \theta_G).$$

Considering that the general-domain data is not always available in some scenarios when adapting the model to new domains, e.g., continual learning, we adopt the word-level knowledge distillation method using the in-domain data. Because the teacher model is trained on the general domain, it can still transfer general-domain knowledge to the student model even with in-domain data. If the general-domain data is available, we can instead fine-tune the pruned model on it, which simplifies the training procedure. We have also tried the sentence-level knowledge distillation method, but the results were much worse. The parameters of the teacher model are kept fixed during this training phase, and the parameters of the pruned model are updated with this KD loss. After convergence, the parameters of the pruned model ($\theta_G$) will be solely responsible for the general domain and will also participate in the translation of in-domain data. These parameters are kept fixed during the following training phase, so our model will not suffer catastrophic forgetting on the general domain during the fine-tuning process.
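The word-level KD objective above can be sketched as follows. The logits here are hypothetical stand-ins for the outputs of the unpruned (teacher) and pruned (student) models; note that the teacher distribution `q` is treated as a constant target, mirroring the fixed teacher parameters.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def word_level_kd_loss(student_logits, teacher_logits):
    # L_KD = -(1/K) * sum_k sum_v q(v) log p(v)
    # shapes: (K, |V_t|); the teacher side receives no gradient updates
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return -(q * np.log(p + 1e-12)).sum(axis=-1).mean()
```

The loss is minimized exactly when the student distribution matches the teacher's, which is what lets the pruned model recover general-domain behavior from in-domain inputs alone.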

Model Expansion
After getting the well-trained pruned model, we add new parameters (denoted as $\theta_I$) to it, which expands the model to its original size. Then we fine-tune these newly added parameters with in-domain data, supervised by the ground-truth sequences. As indicated above, the parameters of the pruned model (denoted as $\theta_G$), which are responsible for generating the general-domain translation, are kept fixed during this training phase. The objective function is:

$$L(\theta_I) = -\frac{1}{K^*} \sum_{k=1}^{K^*} \log p(y^*_k \mid y^*_{<k}, x; \theta_G, \theta_I).$$

After convergence, the parameters of the pruned model ($\theta_G$) and the new parameters ($\theta_I$) are combined for generating the in-domain translation.

Datasets

English→French. For this task, the general-domain data is from the UN corpus of the WMT 2014 En-Fr translation task, which contains 12.78M sentence pairs mainly related to the News domain. We choose newstest2013 and newstest2014 as our development and test sets, respectively. The in-domain data, with 53K sentence pairs, is from the WMT 2019 biomedical translation task and is mainly related to the Biomedical domain. We choose 1K and 1K sentences randomly from the corpora as our development and test data, respectively. We tokenize and truecase the corpora.
English→German. For this task, the general-domain data is from the WMT16 En-De translation task, which is mainly News text and contains about 4.5M sentence pairs. We choose newstest2013 for validation and newstest2014 for testing. For the in-domain data, we use the parallel training data from IWSLT 2015, which is mainly from the Spoken domain and contains about 194K sentences. We use the 2014 test set for validation and the 2015 test set for testing. We tokenize and truecase the corpora.
Besides, merge operations of 32K, 32K, and 30K are performed to learn BPE (Sennrich et al., 2016) on the general-domain data and then applied to both the general-domain and in-domain data. Then we filter out the sentences which are longer than 128 sub-words. For the Zh-En translation task, a 44K Chinese vocabulary and a 33K English vocabulary are built based on the general-domain data. For the En-Fr and En-De tasks, 32K vocabularies for the source and target languages are also built on the corresponding general-domain data.

[1] http://www.statmt.org/moses/
[2] https://nlp.stanford.edu/

Systems
We use the open-source toolkit fairseq-py (Ott et al., 2019), released by Facebook, as our Transformer system. The baseline methods can be divided into two categories: the models of the first category are capacity-fixed, while those of the second are capacity-increased. The first category includes the following systems:
• General This baseline system is trained only with the general-domain training data.
• In This baseline system is trained only with the in-domain training data.
• Multi-objective Learning (MOL) (Dakwale and Monz, 2017) Besides minimizing the loss between the ground-truth words and the output distribution of the network, this method also minimizes the cross-entropy between the output distribution of the general-domain model and that of the network. The final objective is:

$$L(\theta) = L_{ce}(\theta) + \alpha L_{KD}(\theta),$$

where $\alpha$ is a hyper-parameter which controls the contribution of the two parts. The larger its value, the less the degradation on the general domain.
• Elastic Weight Consolidation (EWC) (Thompson et al., 2019) This method models the importance of the parameters with the Fisher information matrix and puts more constraints on the important parameters to keep them close to their original values during the fine-tuning process. The training objective is:

$$L(\theta) = L_{ce}(\theta) + \alpha \sum_i F_i (\theta_i - \theta^*_{G,i})^2,$$

where $i$ indexes the parameters and $F_i$ is the modeled importance of the $i$-th parameter.

[Table 1 caption: 'NP', 'WP', 'KD', and 'FT' represent neuron pruning, weight pruning, knowledge distillation, and fine-tuning, respectively. The numbers on the right of 'PTE' denote that the training phase is based on the previous corresponding models. After knowledge distillation, the parameters in the pruned model (systems 10, 13) are fixed, so the general-domain BLEU is unchanged after fine-tuning (systems 11, 14). * and ** mean the improvement over the MLL method is statistically significant (p < 0.05 and p < 0.01, respectively) (Collins et al., 2005).]

The second category includes the following three systems:
• Full Bias (Michel and Neubig, 2018) This method adds a domain-specific bias term to the output softmax layer and only updates this term while the other parts of the general-domain model are kept fixed.
• Adapter (Bapna and Firat, 2019) This method injects domain-specific adapter modules into each layer of the general-domain model. Each adapter contains a normalization layer and two linear projection layers. The adapter size is set to 64.
• Multiple-output Layer Learning (MLL) (Dakwale and Monz, 2017) This method modifies the general-domain model by adding a domain-specific output layer for the in-domain and learning these domain-specific parameters with their respective learning objectives. The training objective is:

$$L(\theta) = L_{ce}(\theta_S, \theta_I) + \alpha L_{KD}(\theta_S, \theta_G),$$

where $\theta_S$ denotes the domain-shared parameters, and $\theta_G$ and $\theta_I$ denote the domain-specific parameters for the general domain and the in-domain, respectively.
• Our Method - Pruning Then Expanding (PTE) Our model is trained just as the Method section describes.
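For concreteness, the EWC objective of Thompson et al. (2019) can be sketched as below; `fisher` holds precomputed diagonal Fisher values $F_i$, and the function merely adds the quadratic penalty to a given cross-entropy value. The names are ours, not from any released code.

```python
import numpy as np

def ewc_loss(nll, params, old_params, fisher, alpha=1.0):
    # L = L_ce + alpha * sum_i F_i * (theta_i - theta*_i)^2
    # params / old_params / fisher: parallel lists of arrays, one per tensor
    penalty = sum(float((F * (p - p0) ** 2).sum())
                  for p, p0, F in zip(params, old_params, fisher))
    return nll + alpha * penalty
```

Parameters with large Fisher values are pulled strongly toward their general-domain values, while the rest move freely toward the in-domain optimum; $\alpha$ tunes this trade-off.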
For the neuron pruning scheme, we prune the 10% least important neurons; for the weight pruning scheme, we prune the 30% least important parameters. To better show the ability of our method, we report the general- and in-domain performance after each training phase.

Implementation Details All the systems strictly follow the base model configuration in Vaswani et al. (2017). We set the hyper-parameter $\alpha$ to 1 for MOL, EWC, and MLL, and we analyze the impact of this hyper-parameter further in the next section. We set the learning rate during the fine-tuning process to 7.5 × 10^-5 for all the systems, after having tried different values from 1.5 × 10^-6 to 1.5 × 10^-3. In both of our methods, we do not prune the layer-normalization layers in the encoder and decoder, which makes training faster and more stable. For the neuron pruning method, we also do not prune the first layer of the encoder or the last layer of the decoder. As in the work of Dakwale and Monz (2017), the domain of the test data is known in our experiments. Besides, we use beam search with a beam size of 4 during decoding.

Main Results
The final translation is detokenized and then evaluated with 4-gram case-sensitive BLEU (Papineni et al., 2002) using the SacreBLEU tool (Post, 2018). The results are given in Table 1. On all the datasets, our weight pruning method outperforms all the baselines. Furthermore, we draw the following conclusions: First, the capacity-fixed baselines cannot handle large domain divergence and still suffer from catastrophic forgetting. They perform well on the En-De translation task, where the data distributions are similar: they can significantly improve the in-domain translation quality without excessive damage to the general-domain translation quality. However, they perform worse on the En-Fr and Zh-En translation tasks, whose data distributions differ more; there, the in-domain data contains many tokens that are low-frequency or out-of-vocabulary in the general-domain data. In this situation, these methods either bring limited in-domain improvements or degrade the general-domain performance too much. In contrast, our method is superior to them on all tasks, especially on the more distant domains, which also validates our motivation. Second, the capacity-increased methods can better deal with domain divergence. Compared with them, our method achieves larger improvements on the in-domain data, since we actually allocate more parameters to the in-domain than the capacity-increased methods do. Besides, our methods are more convenient to use in practice because we do not need to specialize the model architecture; the pruning ratio is the only hyper-parameter that needs tuning.
Lastly, both of our methods are immune to large domain divergence. Moreover, the knowledge distillation can bring modest improvements on the general domain. Compared with the neuron pruning method, the weight pruning method is more effective since it can prune and reutilize more parameters with smaller performance degradation.

Adapting to Multi-Domain
We conduct experiments under the multi-domain scenario, which lets the model adapt to several different domains. Besides the training data used in the main experiments of the Zh-En task, which is related to the News and Thesis domains, we add two datasets from other domains, namely Spoken and Education. Both are chosen randomly from the UM-Corpus. Each of them contains about 75K, 1K, and 1K sentence pairs in its training, development, and test sets, respectively. We test our weight-pruning-based method and still prune the 30% least important parameters. We compare our method with the basic fine-tuning system and the more effective capacity-increased method. The results are shown in Table 2: our method achieves significant improvements on all the domains.

Effects of Different Hyper-parameters
For the MOL, EWC, and MLL methods, the hyper-parameter $\alpha$ controls the trade-off between the general- and in-domain performance. For our method, the proportion of pruned model parameters has a similar effect. To show the full general-/in-domain performance trade-off, we conduct experiments with different hyper-parameters. We compare our method with the best capacity-fixed method (EWC) and the best capacity-increased method (MLL). For the EWC and MLL methods, we vary $\alpha$ from 0.25 to 2.5. We vary the pruning proportion from 5% to 30% for our neuron-pruning method and from 10% to 50% for our weight-pruning method. The results are shown in Figure 2: our method outperforms EWC significantly at all operating points. Besides, our neuron-pruning method achieves comparable results to MLL, and our weight-pruning method surpasses it with fewer parameters.

Ablation Study
To further understand the impact of each step of our method, we perform further studies by removing or replacing certain steps. We first investigate the necessity of parameter importance evaluation: we train another three models following our method but with the parameters pruned randomly. The results are given in Table 3 and show that random pruning causes excessive damage to the general domain. Besides, we also train a model that skips the model pruning and knowledge distillation steps and directly fine-tunes the unimportant parameters; we then perform translation with the whole model on both the general and in-domain test sets. The results show that changing the unimportant parameters also leads to catastrophic forgetting on the general domain, which demonstrates the necessity of "divide and conquer".

Effects of Data Distribution Divergence
To further prove that our method is better at dealing with large domain divergence, we conduct experiments on the En-Fr translation task. Following the method of Moore and Lewis (2010), we score and rank each in-domain sentence pair by calculating the per-word cross-entropy difference between the general- and in-domain language models:

$$\mathrm{score}(s, t) = [H_I(s) - H_G(s)] + [H_I(t) - H_G(t)],$$

where $H_I$ and $H_G$ denote the per-word cross-entropy under the in-domain and general-domain language models, which are trained with SRILM (Stolcke, 2002), and $s$ and $t$ denote the source and target sentences. Then, we split the in-domain data into four equally sized parts and train new models on each of them separately. We compare our weight-pruning-based method with the EWC and MLL methods. The results are shown in Figure 3: the improvements we obtain grow larger as the data divergence grows.
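The cross-entropy-difference score can be sketched as below; `logprob_fn` is a hypothetical stand-in for a per-token log-probability query against an SRILM-trained language model, and tokenization is simplified to whitespace splitting. Lower scores indicate sentence pairs closer to the in-domain distribution.

```python
import math

def per_word_ce(logprob_fn, sentence):
    # average negative log-probability per token under a language model
    words = sentence.split()
    return -sum(logprob_fn(w) for w in words) / len(words)

def moore_lewis_score(src, tgt, lm_in_src, lm_gen_src, lm_in_tgt, lm_gen_tgt):
    # score(s, t) = [H_I(s) - H_G(s)] + [H_I(t) - H_G(t)]
    return (per_word_ce(lm_in_src, src) - per_word_ce(lm_gen_src, src)
            + per_word_ce(lm_in_tgt, tgt) - per_word_ce(lm_gen_tgt, tgt))
```

Ranking the in-domain pairs by this score and splitting them into equal bins gives subsets of increasing divergence from the general domain, as used in the experiment above.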

Related Work
Domain Adaptation Recent work on domain adaptation can be divided into two categories according to the use of training data. The first category, also referred to as multi-domain adaptation, needs the training data from all of the domains. Chu et al. (2017) fine-tunes the model on a mix of the general-domain data and over-sampled in-domain data. Kobus et al. (2017) adds domain-specific tags to each sentence. Zhang et al. (2019b) applies curriculum learning to the domain adaptation problem. Britz et al. (2017) adds a discriminator to extract common features across domains. There is also some work (Zeng et al., 2018, 2019; Gu et al., 2019) that adds domain-specific modules to the model to preserve domain-specific features. Currey et al. (2020) distills multiple expert models into a single student model. The work of Liang et al. (2020) has a similar motivation to ours: it also fixes the important parameters and prunes the unimportant ones. Compared with their method, ours does not need to store the general-domain training data and suffers less degradation on the general domain because we adopt the knowledge distillation method.
The second category, also referred to as continual learning, only needs the data from the new domain and the model in use. The biggest challenge for this kind of work is catastrophic forgetting. Luong and Manning (2015) fine-tunes the general-domain model with the in-domain data. Freitag and Al-Onaizan (2016) ensembles the general-domain model and the fine-tuned model for generation. Saunders et al. (2019) investigates adaptive ensemble weighting for inference. Khayrallah et al. (2018) and Thompson et al. (2019) add regularization terms to keep the model parameters close to their original values. Dakwale and Monz (2017) minimizes the cross-entropy between the output distributions of the general-domain model and the fine-tuned model. Michel and Neubig (2018) adds a domain-specific softmax bias term to the output layer. Bapna and Firat (2019) injects domain-specific adapter modules into each layer of the general-domain model. Wuebker et al. (2018) only saves a domain-specific offset relative to the general-domain model. Wang et al. (2020b) achieves efficient lifelong learning by establishing complementary learning systems. Sato et al. (2020) adapts the vocabulary of a pre-trained NMT model to the target domain.
Overall, our work belongs to the second type of approach, which is more flexible and convenient in practice. The works of Thompson et al. (2019) and Dakwale and Monz (2017) are the most closely related to ours. Compared with Thompson et al. (2019), our work is better at dealing with large domain divergence, since we add domain-specific parts to the model. In contrast to Dakwale and Monz (2017), our model divides each layer of the model into domain-shared and domain-specific parts, which intuitively increases the depth of the in-domain model. Besides, our method does not need to add parameters, but it can be easily extended when necessary.
Model Pruning Model pruning usually aims to reduce the model size or improve the inference efficiency. See et al. (2016) examines three magnitude-based pruning schemes. Zhu and Gupta (2018) demonstrates that large-sparse models outperform comparably-sized small-dense models. Wang et al. (2020a) improves the utilization efficiency of parameters by introducing a rejuvenation approach. Lan et al. (2020) presents two parameter reduction techniques to lower memory consumption and increase the training speed of BERT.

Conclusion
In this work, we propose a domain adaptation method based on the importance of the neurons and parameters of the NMT model. We make the important ones compromise between domains while the unimportant ones focus on the in-domain. Based on this, our method consists of several steps, namely model pruning, knowledge distillation, model expansion, and fine-tuning. The experimental results on different languages and domains show that our method achieves significant improvements while keeping the model capacity fixed. Further experiments show that our method also improves the overall performance under the multi-domain scenario.