Factorized Transformer for Multi-Domain Neural Machine Translation

Multi-Domain Neural Machine Translation (NMT) aims at building a single system that performs well on a range of target domains. However, the extreme diversity of cross-domain wording and phrasing styles, the imperfections of the training data distribution, and the inherent defects of the current sequential learning process all make multi-domain NMT very challenging. To mitigate these problems, we propose the Factorized Transformer, which factorizes the parameters of an NMT model (the Transformer in this paper) in depth into two categories: domain-shared parameters that encode common cross-domain knowledge and domain-specific parameters that are private to each constituent domain. We experiment with various designs of our model and conduct extensive validation on an open English-to-French multi-domain dataset. Our approach achieves state-of-the-art performance and opens up new perspectives for multi-domain and open-domain applications.


Introduction
Recent advances in Neural Machine Translation (NMT) (Bahdanau et al., 2015; Vaswani et al., 2017) have led to significant improvements in translation quality (Wu et al., 2016; Hassan et al., 2018), opening new perspectives for Machine Translation in real-world scenarios. To deliver trustworthy translations to end users, an NMT system is often required to reach expert-level translation quality in one or several related target domains while performing well enough on a range of generic subjects, just as human experts do.
However, requiring a single NMT system to perform well on multiple distant domains simultaneously is very challenging. First, languages are highly polysemous: the same words or expressions may have different meanings in different contexts. Wording and syntactic style may also vary significantly across domains. Second, a multi-domain NMT system generally suffers from two major issues: Domain Bias and Catastrophic Forgetting (McCloskey and Cohen, 1989; Kirkpatrick et al., 2016; Thompson et al., 2019). While the former biases the model toward well-represented domains to the detriment of low-resource ones, the latter makes the sequential learning process difficult, as the model keeps forgetting previously learned knowledge when exposed to new training examples.
Most existing NMT systems rely on the same network to model all domains, which means that the same word embedding represents all the meanings of a word and the same set of parameters models the contexts it depends on. This type of configuration generally maximizes knowledge transfer but overlooks the specificity of each domain (Koehn and Knowles, 2017). An obvious solution to this problem is to dedicate an individual model to each constituent domain, which is unrealistic in practice as it dramatically increases the number of model parameters. Moreover, the recent success of multilingual applications (Johnson et al., 2017) shows that a single NMT model in which all parameters are shared can handle translation between hundreds of language pairs, suggesting that model capacity may not be the key weakness of current NMT models when dealing with multi-domain problems. This motivates the need for a compact architecture with better parameter efficiency.
We propose the Factorized Transformer framework to deal with the multi-domain NMT problem. The Factorized Transformer partially or fully factorizes the basic components (embedding, attention and FFN layers) of a conventional Transformer architecture into domain-specific blocks and domain-shared blocks. This dual structure has several advantages: 1) It allows the model to leverage all available data, labeled or unlabeled, to build a generic model during an early stage of domain-agnostic training. 2) Domain singularities can be effectively learned by the domain-specific components from the respective in-domain training data during the stage of domain-aware training, so the domain bias issue naturally disappears. 3) Domain-specific components are optimized independently, without any interference between target domains. The original performance of the generic model on un-adapted source domains is also preserved, which overcomes the limitation of catastrophic forgetting. 4) The design of the Factorized Transformer is orthogonal to any data-driven approach, so the benefits of both approaches can be combined.
Our contributions can be summarized as follows:
• We address the weaknesses of existing NMT systems in multi-domain scenarios by proposing the Factorized Transformer, which separately models domain-shared and domain-specific information via its dual structure.
• We validate our method on a large-scale English-to-French multi-domain setting. We study three variants of the Factorized Transformer that meet different requirements in terms of performance and parameter budget. Our approach outperforms all previous state-of-the-art multi-domain systems and comes close to the combined performance of individually fine-tuned models.
• Our proposed architecture opens new perspectives for open-domain applications.

Related Work
Multi-domain NMT has been an active research area. Prior work in this area can be divided into two main categories: data-driven and model-driven, although they are usually complementary.
Data-Driven Approaches Many studies focus on the exploration of data-driven approaches (van der Wees et al., 2017; Sajjad et al., 2017; Wang et al., 2018). Chu et al. (2017)
Soft-Constraints-Based Approaches This subcategory consists in injecting domain information into the model parameters by means of side-constraints or domain embeddings, so as to endow these parameters with domain knowledge and make them domain-aware. Kobus et al. (2017) added an artificial token to the end of the input sequence to indicate the required target domain and exploited the domain as a tag or a feature. Britz et al. (2017) employed discriminators, training objectives or GAN-like techniques to incorporate domain knowledge into the encoder or decoder. Chu and Dabre (2019) treated text domains as distinct languages in order to use multilingual approaches when implementing multi-domain NMT. Zeng et al. (2018) combined source-target domain classifiers and an adversarial domain classifier during training. However, since the main model parameters (embeddings, encoder, decoder) remain shared across all domains, the capacity of these methods to deal with inter-domain conflicts might be limited.
Hard-Constraints-Based Approaches involve dedicating extra parameters to directly model domain-specific knowledge. Michel and Neubig (2018) introduce a speaker-specific softmax bias to handle adaptation to a large number of speakers; the idea of parameter factorization is also exploited. Adapter tuning is a recent approach to transfer learning (Rebuffi et al., 2017, 2018; Houlsby et al., 2019; Stickland and Murray, 2019). Each task or domain is equipped with its own set of parameters in order to model and capture domain specificity, and these parameters are decoupled across tasks. Bapna et al. (2019) successfully apply this approach to domain adaptation and multilingual NMT models.
Our work falls into the second sub-category of the model-driven approaches, and we hypothesize that introducing decoupled domain-specific parameters is crucial. We conduct experiments and analysis in the following sections to validate this hypothesis.

Approach
All basic components (embedding, attention and FFN matrices) of a conventional Transformer are factorized into multiple domain-specific blocks (Figure 1), one for each domain (colored ones) and a domain-shared block (white ones), common across all domains.
It is worth noting that domain information is necessary for both training and inference, and that it could be obtained from external sources. Nevertheless, domain prediction is not the main purpose of this work, and we assume throughout the paper, unless otherwise mentioned, that domain information is known and passed as input to the model during training and inference.

Training Curriculum
We first briefly explain the training curriculum before moving to the detailed schemes of factorization, as the former is complementary to the latter and is designed to take advantage of the latter. The training curriculum can be theoretically divided into two stages: an early stage of domain-agnostic training and a later stage of domain-specific training, even though in practice, it could be achieved in an end-to-end curriculum.
Domain-Agnostic Training aims at building a generic model by sharing the model parameters across all available training domains. Using all available training data benefits the model's overall performance, as it allows the model to leverage knowledge from other domains that are related or close to the target domains. For example, the "JRC Acquis" domain (a collection of legislative texts of the European Union) would probably benefit from adding training data from the "europarl" domain (a collection of European Parliament texts). Many data weighting schemes exist in the literature; however, they are beyond the scope of this paper. More importantly, the design of the Factorized Transformer is orthogonal to any data-driven approach, so the benefits of both approaches can be combined.

Domain-Specific Training Once the generic model has converged, its domain-shared parameters are frozen. We unfold all domain-specific components to the number of target domains and initialize each of them with the corresponding matrix trained during the first stage. The specialization step is straightforward: each set of domain-specific parameters can be optimized independently using the relevant in-domain data.
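The unfolding step described above can be illustrated with a minimal sketch. The toy 2x2 matrix, the plain SGD update and the helper names `unfold` and `sgd_step` are assumptions made for illustration only; the text does not specify the optimizer interface or the tensor layout.

```python
import copy


def unfold(pretrained, num_domains):
    """Replicate one stage-1 matrix into independent per-domain copies."""
    return [copy.deepcopy(pretrained) for _ in range(num_domains)]


def sgd_step(matrix, grad, lr):
    """In-place SGD update of a single matrix stored as a list of rows."""
    for row, grad_row in zip(matrix, grad):
        for j, g in enumerate(grad_row):
            row[j] -= lr * g


# Stage 1 result: a toy matrix standing in for any domain-specific block.
stage1 = [[1.0, 0.0], [0.0, 1.0]]

# Stage 2: one copy per target domain, all starting from the stage-1 values.
domain_params = unfold(stage1, num_domains=3)

# A batch from domain 1 only updates the copy that belongs to domain 1.
grad = [[0.5, 0.5], [0.5, 0.5]]
sgd_step(domain_params[1], grad, lr=0.1)

print(domain_params[0])  # untouched copy for domain 0
print(domain_params[1])  # adapted copy for domain 1
```

Because every domain owns its copy, an update computed on one domain cannot change the parameters used by another, which is the independence property the paragraph relies on.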
Each domain-specific matrix is initialized with the corresponding parameters of the underlying pre-trained network, so no transitional performance degradation is observed when the extra modules, if any, are integrated. In the case where an additional adaptation layer is involved (Fig. 1 (F6)), we initialize it to a block identity tensor to maintain the exact model performance obtained from the domain-agnostic training. This property is of great practical value, as it allows the network to adapt directly on top of a set of well-optimized parameters. A similar design can be found in adapter modules (Rebuffi et al., 2018; Houlsby et al., 2019; Stickland and Murray, 2019), which rely on skip connections or residual connections to obtain a near-identity initialization. Moreover, Houlsby et al. (2019) observed that if the initialization deviates too far from the identity function, the model may fail to train with adapter modules when transferring BERT-style parameters across NLP tasks. Our proposed Factorized Transformer does not suffer from this problem, as it has the exact identity initialization property.
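The exact identity initialization can be checked on a toy example. The sketch below assumes that the additional adaptation layer is a purely linear, per-domain square matrix; the layer size, the number of domains and the helper names are hypothetical and chosen only for illustration.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def block_identity(num_domains, n):
    """One identity matrix per domain (the 'block identity tensor')."""
    return [identity(n) for _ in range(num_domains)]


def vec_matmul(vec, matrix):
    """Row vector times matrix."""
    cols = len(matrix[0])
    return [sum(v * matrix[i][j] for i, v in enumerate(vec)) for j in range(cols)]


# Hidden state produced by the frozen, domain-agnostic network (toy values).
hidden = [0.3, -1.2, 2.0]

# Freshly initialized adaptation layers, one per domain.
adapters = block_identity(num_domains=5, n=3)
outputs = [vec_matmul(hidden, adapter) for adapter in adapters]

# Every domain initially reproduces the generic model's hidden state.
print(all(out == hidden for out in outputs))
```

Under this assumption the adapted network is numerically identical to the generic one before any domain-specific update, whereas a residual adapter with small random weights is only approximately so.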

Factorization Schemes of Basic Components
Throughout this section, we ignore all bias terms, as they may or may not exist depending on the variant or block of the Transformer architecture and do not add significantly to the parameter count. We first introduce some notation before describing the architecture. $d_m$ refers to the dimension of the model, which is equal to the embedding size $d_e$ and the hidden size $d_h$ in a conventional Transformer. $V$ refers to the vocabulary size; without loss of generality, we suppose that the source and target sides share the same vocabulary size for the theoretical considerations. $d_{filter}$ refers to the filter dimension used in the FFN layers. $h$ denotes the number of heads used in multi-head attention. $N_d$ represents the number of constituent domains.
Finally, we introduce an extra dimension $d_{inner}$ as the inner dimension used for linear factorization, which we explain in the following paragraphs.
Factorization of Embedding Blocks A conventional Transformer network has three wide embedding matrices $W_e$ of dimensions $d_m \times V$, which are often tied or partially tied (Press and Wolf, 2016) to reduce model size. NMT models usually require a large vocabulary size $V$, of the order of $100 \times d_m$. This can easily result in an embedding matrix with millions of parameters, many of which are only sparsely updated during training. We follow the work of Lan et al. (2019) to factorize these blocks (Fig. 1 (F1) and (F4)). More specifically, we decompose each embedding matrix $M_e$ along an inner dimension $d_{inner}$ (Eq. 1), where $W_C$ is a shared matrix and $W_i$ is a domain-specific matrix for $i \in \{1, \ldots, N_d\}$. The advantage of such a decomposition is two-fold. First, instead of sharing the same word embedding across all domains, the domain-specific sub-matrices give the model the capacity to assign a domain-specific meaning to each word embedding. Second, from a practical perspective, this decomposition reduces the number of embedding parameters: if $d_{inner}$ is sufficiently smaller than $d_m$, the parameter cost of the factorized form remains lower than that of the original embedding block, resulting in better usage of model parameters.
Where the weight matrices are of dimension:
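As a rough illustration of the embedding factorization, the sketch below counts parameters and performs a toy lookup. It assumes that the shared matrix $W_C$ has shape $V \times d_{inner}$ and that each domain-specific matrix $W_i$ has shape $d_{inner} \times d_m$; these shapes, the function names and the choice of five domains are assumptions, while $V = 50{,}000$, $d_m = 512$ and $d_{inner} = 280$ are the values reported in the experimental settings.

```python
def embedding_params(vocab, d_model):
    """Parameter count of a conventional embedding matrix."""
    return vocab * d_model


def factorized_embedding_params(vocab, d_model, d_inner, num_domains):
    """Parameter count of the factorized form under the assumed shapes."""
    shared = vocab * d_inner                    # W_C, assumed V x d_inner
    specific = num_domains * d_inner * d_model  # one W_i per domain
    return shared + specific


def embed(token_id, domain, w_shared, w_specific):
    """Look up the shared row, then project it with the domain's matrix."""
    row = w_shared[token_id]
    proj = w_specific[domain]
    cols = len(proj[0])
    return [sum(r * proj[k][j] for k, r in enumerate(row)) for j in range(cols)]


V, D_M, D_INNER, N_D = 50_000, 512, 280, 5  # N_D = 5 is an assumption
full = embedding_params(V, D_M)
fact = factorized_embedding_params(V, D_M, D_INNER, N_D)
print(full, fact)  # 25600000 14716800

# Toy lookup with V = 3, d_inner = 2, d_m = 3 and two domains.
w_shared = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_specific = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 2.0], [0.0, 3.0, 0.0]],
]
print(embed(2, 0, w_shared, w_specific))  # same token, domain 0
print(embed(2, 1, w_shared, w_specific))  # same token, domain 1
```

The same token index yields a different vector in each domain, which is the "domain-specific meaning" effect, while the total parameter count stays below that of a single conventional embedding matrix for these values.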

Factorization of Attention Blocks
The factorization of the attention blocks operates differently from that of the embedding blocks, as each attention block is composed of four relatively small weight matrices $W_Q$, $W_K$, $W_V$, $W_O$. Within the Multi-Head Attention (MHA) of a conventional Transformer, they are square matrices with $d_m^2$ parameters each. In the case of Multi-Query Attention (MQA) (Shazeer, 2019), the same key and value sub-matrices are shared by all heads, so the dimensions of $W_K$ and $W_V$ are reduced to $d_m \times d_k = d_m^2 / h$. We consider two schemes for introducing domain-specific components: a "full" scheme (Fig. 1 (F5)), which assigns a different matrix to each domain for each of the transformations $W_Q$, $W_K$, $W_V$, $W_O$ in multi-head style attention, and a "light" scheme (Fig. 1 (F2)), which only parallelizes the relatively small matrices $W_K$, $W_V$ of multi-query style attention. Concretely, if we denote the conventional attention mechanism as follows: where $[\cdot, \ldots, \cdot]$ stands for concatenation and $\langle \cdot, \cdot \rangle$ for dot product. The factorization of the attention block in the full scheme with multi-head style attention can be written as: And in the case of the light scheme with Multi-Query Attention: While the latter remains parameter efficient unless $N_d$ is large relative to $h$, the former significantly increases the number of model parameters.
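The parameter cost of the two attention schemes can be compared with a short calculation. This is a sketch under stated assumptions: in the full scheme every domain is assumed to own all four matrices with no shared copy, and in the light scheme $W_Q$ and $W_O$ are assumed shared while each domain owns one $W_K$ and one $W_V$ of size $d_m \times d_k$. Bias terms are ignored, as in the text, and the function names are illustrative.

```python
def mha_params(d_model):
    """Conventional multi-head attention: four d_m x d_m matrices."""
    return 4 * d_model * d_model


def full_scheme_params(d_model, num_domains):
    """Full scheme: W_Q, W_K, W_V, W_O replicated for every domain."""
    return num_domains * mha_params(d_model)


def light_scheme_params(d_model, num_heads, num_domains):
    """Light scheme: shared W_Q, W_O plus per-domain multi-query W_K, W_V."""
    d_k = d_model // num_heads
    shared = 2 * d_model * d_model
    specific = num_domains * 2 * d_model * d_k
    return shared + specific


D_M, H = 512, 8
for n_d in (5, 8, 32):
    print(
        n_d,
        mha_params(D_M),
        light_scheme_params(D_M, H, n_d),
        full_scheme_params(D_M, n_d),
    )
```

Under these assumptions the light scheme stays at or below the cost of a single conventional attention block as long as $N_d \le h$, while the full scheme grows linearly with the number of domains.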
Factorization of FFN Blocks FFN blocks are composed of coupled linear matrices joined via a ReLU activation on their amplifying inner dimension $d_{filter}$. We could either apply the linear factorization twice, as for the embedding matrices (Fig. 1 (F3)), or introduce an extra layer of square matrices, one for each domain (Fig. 1 (F6)). In general, few additional parameters are needed for the factorization of the FFN blocks unless where the weight matrices are of dimension: The first factorization scheme (Fig. 1 (F3)) for the FFN block can be written as: where the weight matrices are of dimension: The second factorization scheme (Fig. 1 (F6)) for the FFN block can be formulated as:

Overall Architecture Designs of Factorized Transformer
We consider three architecture designs of the Factorized Transformer for multi-domain NMT in this paper, namely Deep Factorization (DF), Shallow Factorization (SF) and Parallel Attention (PA). These designs have been deliberately chosen as extreme cases to provide insights into the limits of the Factorized Transformer under different requirements in terms of performance and parameter budget. Other, more gradual combination schemes could also be worth investigating, depending on the final goal and the constraints of the application.
Deep Factorization (DF) We combine the factorization schemes (F1), (F2), (F3) and (F4). We call this design deep factorization, since factorization is applied to all the main blocks and the combination of domain-shared and domain-specific parameters occurs throughout the whole model. We set $d_{inner}$ to 280 to obtain the same model capacity as the Transformer base setting for a fair comparison.

Shallow Factorization (SF)
We rely on the entire original Transformer architecture to encode domain-shared knowledge, as in a conventional Transformer, so that we do not lose any knowledge transfer capacity compared to the original Transformer. The domain-specific components are plugged into the main architecture as lightweight add-on modules. We also duplicate the key and value matrices as domain-specific components. This design corresponds to the combination of factorization schemes (F2) and (F6) in Figure 1.

Parallel Attention (PA)
We parallelize all the attention matrices of the original multi-head attention (Vaswani et al., 2017) to boost the model capacity reserved for each domain. This configuration (Fig. 1 (F5)) can be seen as a factorization of the entire network into domain-shared non-attention blocks and domain-specific blocks.
(Sajjad et al., 2017), we oversampled the low-resource domains to match the order of magnitude of the high-resource domains; out-of-domain sentences are not affected by the oversampling. All sentence pairs are then concatenated and shuffled into the final training data. We tokenize English and French sentences using the MOSES script. Byte-pair encoding (Sennrich et al., 2016) is employed in the experiment with 50,000 joint pairs, and the source and target vocabularies are set to the 50,000 most frequent tokens. Table 1 provides the statistics of the corpora used in our experiments.
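The data balancing described above can be sketched as follows. The sketch makes a stronger assumption than the text: each in-domain corpus is repeated until it matches the size of the largest one exactly, whereas the text only requires the same order of magnitude. The function name, the domain labels and the toy sentence pairs are hypothetical.

```python
import math
import random


def build_training_data(in_domain, out_of_domain, seed=0):
    """Oversample in-domain corpora, add out-of-domain pairs as-is, shuffle."""
    target = max(len(pairs) for pairs in in_domain.values())
    mixed = []
    for domain, pairs in in_domain.items():
        # Repeat the corpus enough times, then cut it to the target size.
        repeats = math.ceil(target / len(pairs))
        mixed.extend((domain, pair) for pair in (pairs * repeats)[:target])
    # Out-of-domain sentences are not oversampled.
    mixed.extend(("general", pair) for pair in out_of_domain)
    random.Random(seed).shuffle(mixed)
    return mixed


in_domain = {
    "law": [(f"en-law-{i}", f"fr-law-{i}") for i in range(6)],
    "medical": [(f"en-med-{i}", f"fr-med-{i}") for i in range(2)],
}
out_of_domain = [(f"en-gen-{i}", f"fr-gen-{i}") for i in range(3)]

data = build_training_data(in_domain, out_of_domain)
counts = {}
for domain, _ in data:
    counts[domain] = counts.get(domain, 0) + 1
print(counts)
```

Keeping the domain label next to each pair is a convenience of the sketch; it is what allows the domain information to be passed to the model during the domain-aware stage.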

Systems Settings
We employ the Transformer (Vaswani et al., 2017) as our base architecture. Six layers are stacked in both the encoder and the decoder, and the dimensions of the embedding vectors and all hidden vectors are set to 512. The inner layer of the feed-forward sublayer has a dimension of 2048. We use 8 heads in the multi-head or multi-query attention. The target embedding and the output embedding are shared in our experiments. We use the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.997$, $\epsilon = 10^{-9}$ during training. The initial learning rate is 0.0003. A learning rate decay schedule is applied for initial warm-up and annealing (Vaswani et al., 2017). During training, each mini-batch contains 4096 tokens, and we use a dropout rate of 0.1 on all datasets, including attention dropout. During evaluation, we employ lowercase token BLEU (Papineni et al., 2002) as our evaluation metric and use the mteval-13a script. In addition, during decoding, we use the beam search algorithm with a beam size of 4.
Benchmark Systems We compare our system with multi-domain systems previously reported in the literature. A system is considered a multi-domain system if all its parameters can be contained within a unified and deployment-friendly framework. Such candidates are Domain Control (DC) (Kobus et al., 2017) and Target Token Mixing (TTM) (Britz et al., 2017), which are pioneering side-constraint-based works that use domain information for multi-domain training; the Multitask Learning (ML) method (Britz et al., 2017) and the Word-level Domain Context (WDC) method (Zeng et al., 2018), which both add classifiers to the training so that the network can distinguish multi-domain contexts. As mentioned in the introduction, the adapter-based method is also considered: we use the "bottleneck" Residual Adapters (RA) reported in Bapna et al. (2019) with an inner dimension set to 2048. We re-implement all previously reported RNN-based approaches with the Transformer architecture for a fair comparison. We omit data-driven approaches, as they are orthogonal to our approach and can be naturally combined with it. As a strong data-mixing baseline, we choose the balanced scheme described above, which was the best system in several preliminary experiments.

Experimental Results and Analysis
The results of our system are shown at the bottom of   tuning (Luong and Manning, 2015) with in-domain data for the News and IWSLT domains. We refer to the average score over the 5 target domains (AVG-5) as multi-domain performance. We also report the combined performance of 5 fully fine-tuned models as the upper-bound performance (+2.65 BLEU on average) for multi-domain approaches. Our proposed Factorized Transformer systems clearly outperform the baseline and other multi-domain systems in terms of multi-domain performance (AVG-5) as well as individual performance in most settings: our Deep Factorization, Shallow Factorization and Parallel Attention systems yield gains of +1.34, +1.40 and +2.13 BLEU over the baseline system, respectively. Substantial gains are observed for the JRC (legal text), EMEA (medical text) and SUB (subtitles) domains, which have very specific terminology and syntactic style. No significant improvement is observed for the NEWS and IWSLT domains, which remain fairly general domains.
Surprisingly, most of the previous multi-domain techniques, except the adapter-based approach, yield only marginal gains over the Transformer baseline in our experimental setting. As all these techniques are re-implemented under the Transformer architecture, we hypothesize that the Transformer may have a stronger out-of-the-box expressive ability than its RNN-based counterparts. Also, all soft-constraint-based systems perform better for domains that are close to general domains (News, IWSLT), with a large amount of out-of-domain data, than for the low-resource and over-sampled ones. This supports the assumption that models with a single shared set of parameters are more likely to be biased toward high-resource domains to the detriment of low-resource ones. The adapter-based system has the closest overall performance to ours, demonstrating the benefit of separating the training process into domain-shared and domain-specific stages with the corresponding shared or domain-specific parameters.
Parameter Efficiency All of our systems demonstrate better parameter efficiency, measured by the ratio between the performance gain and the parameter scale factor (Fig 2).

Impact of Catastrophic Forgetting
Our Factorized Transformer can also be used for domain adaptation tasks. One of the main concerns of domain adaptation is how to limit the degradation caused by the catastrophic forgetting problem. Table 3 shows the benchmark results for one of our Factorized Transformer systems (PA) and some popular domain adaptation techniques. The fine-tuned system achieves the best in-domain performance (Subtitle); however, it suffers from a severe catastrophic forgetting problem, as its performance in the EMEA domain is nearly halved. Our Factorized Transformer can operate on top of the fine-tuned system to recover most of the performance drop while preserving the optimal in-domain performance. The resulting system outperforms the fine-tuned system by +13.12 BLEU and the baseline system by +1.20 BLEU in overall performance. Introducing regularization techniques such as L2 (Barone et al., 2017), EWC (Kirkpatrick et al., 2016; Thompson et al., 2019) and mix-finetuning (Chu et al., 2017) can alleviate the drop in the IWSLT domain; however, it limits the in-domain performance.
Table 3: Benchmark for Domain Adaptation Techniques. The SUB domain is fine-tuned using in-domain data. The results for the JRC and NEWS domains are omitted for space reasons but are taken into account in the average score (AVG-5). FactorTrans-PA refers to the Parallel Attention design of our approach using the fine-tuned model as the pre-trained model.

Towards Open-Domain NMT
In many real-world scenarios, the domain information is unknown at inference time, and even worse, the test inputs may also be out-of-domain, which means the model has never seen data from the same domains during training. For such unknown domains, NMT systems are known to have poor performance, especially adapted ones (Freitag and Al-Onaizan, 2016;Koehn and Knowles, 2017). Model ensembling is a reasonable approach to deal with unknown domains (Freitag and Al-Onaizan, 2016;Saunders et al., 2019). The compact and unified architecture of Factorized Transformer makes it ideal for this purpose as at each step all domain-specific representations can be   Table 4 for details). We ensemble all of the 5 adapted domains' output and that of the "general" domain, which corresponds to the base model before any domain-aware training and is more likely to have good performance for unknown domains than its adapted counterparts (Freitag and Al-Onaizan, 2016;Saunders et al., 2019).
The results (Table 4) demonstrate the potential of our Factorized Transformer for open-domain applications: not surprisingly, a naive combination of adapted systems (ens-uniform) results in degradation in all domains. The ens-soft and ens-learnable systems both manage to preserve the in-domain performance for known domains while still performing reasonably well for the unknown IT domain.
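To make the idea of combining per-domain outputs concrete, the sketch below averages next-token distributions. Uniform averaging is one plausible reading of the naive ens-uniform combination; the weighted variant is only a generic illustration and is not claimed to match ens-soft or ens-learnable, whose exact definitions are not given in this section. The function name, the toy distributions and the weights are hypothetical.

```python
def ensemble(distributions, weights=None):
    """Weighted average of per-domain next-token distributions."""
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n  # uniform combination
    total = sum(weights)
    vocab = len(distributions[0])
    return [
        sum(w * dist[k] for w, dist in zip(weights, distributions)) / total
        for k in range(vocab)
    ]


# Toy next-token distributions over a 2-word vocabulary from two
# domain-specific heads that disagree with each other.
p_law = [0.8, 0.2]
p_medical = [0.2, 0.8]

uniform = ensemble([p_law, p_medical])
weighted = ensemble([p_law, p_medical], weights=[0.9, 0.1])
print(uniform, weighted)
```

With uniform weights the two confident but conflicting predictions cancel out, which gives an intuition for why a naive combination can degrade in-domain performance, while weights that favor the relevant domain keep most of its prediction.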

Conclusion
In this paper, we propose the Factorized Transformer framework to overcome the limits of traditional multi-domain NMT approaches, which model all domain knowledge within a single shared set of parameters. By judiciously factorizing the parameters of the Transformer model into domain-shared and domain-specific parts, we significantly improve the model's parameter efficiency and provide new perspectives for open-domain applications.