Multi-Source Domain Adaptation with Mixture of Experts

We propose a mixture-of-experts approach for unsupervised domain adaptation from multiple sources. The key idea is to explicitly capture the relationship between a target example and different source domains. This relationship, expressed by a point-to-set metric, determines how to combine predictors trained on various domains. The metric is learned in an unsupervised fashion using meta-training. Experimental results on sentiment analysis and part-of-speech tagging demonstrate that our approach consistently outperforms multiple baselines and can robustly handle negative transfer.


Introduction
Typical domain adaptation methods are designed to transfer supervision from a single source domain. However, in many practical applications, we have access to multiple sources. For instance, in sentiment analysis of product reviews, we can often transfer from a wide range of product domains, rather than one. This can be particularly promising for target domains which do not match any one available source well. For example, the Kitchen product domain may include reviews on pans, cookbooks or electronic devices, which cannot be perfectly aligned to a single source such as Cookware, Books or Electronics. By intelligently aggregating distinct and complementary information from multiple sources, we may be able to better fit the target distribution.
A straightforward approach to utilizing data from multiple sources is to combine them into a single domain. This strategy, however, does not account for distinct relations between individual sources and the target example. Constructing a common feature space for this heterogeneous collection may wash out informative characteristics of individual domains and also lead to negative transfer (Rosenstein et al., 2005). Therefore, we propose to explicitly model the relationship between different source domains and target examples. We hypothesize that different source domains are aligned to different sub-spaces of the target domain. Specifically, in this paper, we model the domain relationship with a mixture-of-experts (MoE) approach (Jacobs et al., 1991b). For each target example, the predicted posterior is a weighted combination of all the experts' predictions. The weights reflect the proximity of the example to each source domain. Our model learns this point-to-set metric automatically, without additional supervision.
We define the point-to-set metric using the Mahalanobis distance (Weinberger and Saul, 2009) between an individual example and a set (i.e., a domain), computed within the hidden representation space of our model. The main challenge is to learn this metric in an unsupervised setting. We address it through a meta-training procedure, in which we create multiple meta-tasks of domain adaptation from the source domains. In each meta-task, we pick one of the source domains as the meta-target, and the remaining source domains as meta-sources. By minimizing the loss of the MoE predictions on the meta-target, we are able to learn both the model and the metric simultaneously. To further improve transfer quality, we align the encoding space of our target and source domains via adversarial learning.
We evaluate our approach on sentiment analysis using the benchmark multi-domain Amazon reviews dataset (Chen et al., 2012; Ziser and Reichart, 2017) as well as on part-of-speech (POS) tagging using the SANCL dataset (Petrov and McDonald, 2012). Experiments show that our approach consistently improves the adaptation results over the best single-source model and a unified multi-source model. On average, we achieve a 7% relative error reduction on the Amazon reviews dataset, and 13% on the SANCL dataset. Importantly, the POS tagging experiments on the SANCL dataset demonstrate that our method is able to robustly handle negative transfer from unrelated sources (e.g., Twitter) and utilize it effectively to consistently improve performance.

Related Work
Unsupervised domain adaptation Most existing domain adaptation methods focus on aligning the feature space between source and target domains to reduce the domain shift (Ben-David et al., 2007; Blitzer et al., 2006; Pan et al., 2010). Our approach is close to the representation learning approaches, such as the denoising autoencoder (Glorot et al., 2011), the marginalized stacked denoising autoencoders (Chen et al., 2012), and domain adversarial networks (Tzeng et al., 2014; Ganin et al., 2016; Zhang et al., 2017; Shen et al., 2018).
In contrast to these previous approaches, however, our approach not only learns a shared representation space that generalizes well to the target domain, but also captures informative characteristics of individual source domains.
Multi-Source domain adaptation The main challenge in using multiple sources for domain adaptation is in learning domain relations. Some approaches assume that all source domains are equally important to the target domain (Li and Zong, 2008; Luo et al., 2008; Crammer et al., 2008). Others learn a global domain similarity metric using labeled data in a supervised fashion (Yang et al., 2007; Duan et al., 2009). Alternatively, Mansour et al. (2009) and Bhatt et al. (2016) utilize unlabeled data of the target domain to find a distribution-weighted combination of the source domains or to construct an auxiliary training set of the source domain instances close to the target domain instances. Recent adversarial methods on multi-source domain adaptation (Zhao et al., 2018; Chen and Cardie, 2018) align source domains to the target domain globally, without accounting for the distinct importance of each source with respect to a specific target example.
The work most related to ours is by Kim et al. (2017). They also model the example-to-domain relations, but use an attention mechanism. The attention module is learned using limited training data from the target domain in a supervised fashion. Our method, however, works in an unsupervised setting without utilizing any labeled data from the target domain.
Figure 1: Architecture of the MoE model. $E$ is the encoder, which maps an input $x$ to a hidden representation $E(x)$; $F^{S_i}$ is the classifier on the $i$-th source domain; $D$ is the critic, which is only used during adversarial training; $M$ is the metric learning component, which takes the encoding of $x$ and the source domains ($S_{1:K}$) as input and computes α.

Methodology
Problem definition We follow the unsupervised multi-source domain adaptation setup, assuming access to labeled training data from K source domains, $\{S_i\}_{i=1}^{K}$ with $S_i = \{(x_t^{S_i}, y_t^{S_i})\}_{t=1}^{|S_i|}$, and (optionally) unlabeled data from a target domain, $T = \{x_t^{T}\}_{t=1}^{|T|}$. The goal is to learn a model using the source domain data that generalizes well to the target domain.
Notations For the rest of the paper, we denote an individual example as $x$, and a batch of examples as $\mathbf{x}$ (bold). We use a superscript to denote the domain from which an example is sampled, and a subscript to denote the index of an example.

Overview of Our Approach
We model the multiple source domains as a mixture of experts, and learn a point-to-set metric α to weight the experts for different target examples. The metric is learned in an unsupervised manner.
Our model consists of four key components, as shown in Figure 1: the encoder (E), the classifier (F), the metric (M) and the adversary (D). We use a typical neural multi-task learning architecture (Caruana, 1997), with a shared encoder across all sources and domain-specific classifiers $\{F^{S_i}\}_{i=1}^{K}$. Each input is first encoded with E, and then fed to each classifier to obtain the domain-specific predictions (i.e., posteriors). The final predictions are then weighted based on the metric (see Equation 1).
We start by describing the representation learning component.

Representation
Our goal is to design an encoder that supports transfer while maintaining source domain-specific information. Depending on the task and dataset, we select an appropriate encoder: MLP, CNN or LSTM (see Section 4.3 for details).
We further add an adversarial module (D) on top of the encoder, in order to align the target domain with the sources. D is typically designed as a parameterized classifier in domain adversarial networks (Ganin et al., 2016;Zhang et al., 2017), which is trained jointly with the encoder and the classifiers through a minimax game. Here, we instead use Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as our adversary. This distance metric measures the discrepancy between two distributions explicitly in a non-parametric manner, greatly simplifying the training procedure compared to domain adversarial networks which use an additional domain classifier module.

Mixture of Experts
Given an example x from the target domain, we model its posterior distribution as a mixture of posteriors produced by models trained on different source domain data:

$$p_{moe}(y \mid x) = \sum_{i=1}^{K} \alpha(x, S_i) \cdot p^{S_i}(y \mid x) = \sum_{i=1}^{K} \alpha(x, S_i) \cdot \mathrm{softmax}\big(W^{S_i \top} E(x)\big) \tag{1}$$

Here $p^{S_i}$ is the posterior distribution produced by the $i$-th source classifier $F^{S_i}$ (the $i$-th expert), $W^{S_i}$ denotes the output layer weights of $F^{S_i}$, and α is a parameterized metric function that measures how much confidence we put in the specific source expert for a given example x.

To derive α, we first define a point-to-set Mahalanobis distance metric between an example x and a set S:

$$d(x, S) = \Big( \big(E(x) - \mu^{S}\big)^{\top} M^{S} \big(E(x) - \mu^{S}\big) \Big)^{1/2}$$

where $\mu^{S}$ is the mean encoding of S. In its original form, the matrix $M^{S}$ plays the role of the inverse covariance matrix. However, computing the inverse of the covariance matrix is both time-consuming and numerically unstable in practice. Here we allow M to denote any positive semi-definite matrix, which is estimated during training (Weinberger and Saul, 2009). To guarantee the positive semi-definiteness of M, we parameterize it as $M = UU^{\top}$, where $U \in \mathbb{R}^{h \times r}$, h is the dimension of the hidden representations and r is a hyper-parameter controlling the rank of M.
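The point-to-set distance above can be illustrated with a minimal sketch in plain Python. The function names (`mean_encoding`, `point_to_set_distance`) and the list-of-lists layout of U are illustrative assumptions, not part of the original model code; the sketch only relies on the stated parameterization M = UU^T, under which the quadratic form equals the squared norm of U^T (E(x) − µ).

```python
import math

def mean_encoding(encodings):
    """Column-wise mean of a list of h-dimensional encodings (the set mean mu_S)."""
    n = len(encodings)
    return [sum(col) / n for col in zip(*encodings)]

def point_to_set_distance(z, set_encodings, U):
    """Distance between an encoding z = E(x) and a set S of encodings.

    U is an h x r matrix given as a list of h rows (hypothetical layout).
    Since M = U U^T, (z - mu)^T M (z - mu) = ||U^T (z - mu)||^2, so M is
    positive semi-definite by construction and never has to be inverted.
    """
    mu = mean_encoding(set_encodings)
    diff = [zi - mi for zi, mi in zip(z, mu)]
    r = len(U[0])
    projected = [sum(U[k][j] * diff[k] for k in range(len(diff))) for j in range(r)]
    return math.sqrt(sum(p * p for p in projected))

# Toy usage: with U set to the identity, the metric reduces to the Euclidean
# distance to the set mean; a rank-1 U only measures one direction.
source_set = [[0.0, 0.0], [2.0, 2.0]]
full_rank = point_to_set_distance([4.0, 5.0], source_set, [[1.0, 0.0], [0.0, 1.0]])
rank_one = point_to_set_distance([4.0, 5.0], source_set, [[1.0], [0.0]])
```

Choosing a small rank r reduces the number of metric parameters from h×h to h×r, which is the practical motivation for the low-rank factorization.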
Based on the distance metric, we further derive a confidence score $e(x, S_i) = f(d(x, S_i))$ for each specific expert. The final metric values $\alpha(x, S_i)$ are then obtained by normalizing these confidence scores:

$$\alpha(x, S_i) = \frac{\exp\big(e(x, S_i)\big)}{\sum_{j=1}^{K} \exp\big(e(x, S_j)\big)} \tag{2}$$

Below, we explain our design of $e(x, S)$ for two tasks, binary classification and sequence tagging, which are also used for evaluation in this paper (Section 4).
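As a small illustration of how confidence scores become expert weights and then a mixed posterior, consider the following sketch. It assumes the normalization is a softmax over the scores; the helper names `expert_weights` and `moe_posterior` are hypothetical.

```python
import math

def expert_weights(scores):
    """Softmax over confidence scores e(x, S_i), giving alpha(x, S_i)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def moe_posterior(alphas, expert_posteriors):
    """Weighted combination of the experts' class posteriors."""
    n_classes = len(expert_posteriors[0])
    return [sum(a * p[c] for a, p in zip(alphas, expert_posteriors))
            for c in range(n_classes)]

# Two experts with equal confidence contribute equally to the mixture.
alphas = expert_weights([0.0, 0.0])
mixed = moe_posterior(alphas, [[0.9, 0.1], [0.3, 0.7]])
```

Because the weights sum to one and each expert outputs a valid distribution, the mixture is itself a valid distribution over labels.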

Binary classification
The point-to-set Mahalanobis distance metric measures the distance between an example x and the mean encoding of S, i.e., $\mu^{S}$, while taking into account the (pseudo) covariance of S. In binary classification, however, the mean vector $\mu^{S}$ is likely to be located near the decision boundary, particularly under a balanced setting. Therefore, a small $d(x, S)$ actually implies lower confidence of the corresponding classifier, which is counter-intuitive. To address this, we instead define the confidence $e(x, S)$ as the difference between the distances from x to each category of S:

$$e(x, S) = \big| d(x, S^{+}) - d(x, S^{-}) \big|$$

Here $S^{+}$ and $S^{-}$ stand for the positive space and negative space of S, respectively. Consequently, if x is either far away from S (i.e., x is not in the manifold of S) or near the classification boundary, we get a small $e(x, S)$, indicating low confidence in the corresponding prediction. Conversely, if x is much closer to one category of S than to the other, the classifier receives a higher confidence.
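The three cases described above (close to one class, on the boundary, far from the domain) can be checked numerically. The sketch below is illustrative only: it assumes the confidence is the absolute difference of the two class distances and, for simplicity, uses plain Euclidean distance to hypothetical class means instead of a learned metric.

```python
import math

def euclid(a, b):
    """Euclidean distance, standing in for the learned point-to-set metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def binary_confidence(z, mu_pos, mu_neg):
    """Confidence as the gap between distances to the two class means."""
    return abs(euclid(z, mu_pos) - euclid(z, mu_neg))

mu_pos, mu_neg = [2.0, 0.0], [-2.0, 0.0]  # toy class means of one source domain
near_positive = binary_confidence([2.0, 0.5], mu_pos, mu_neg)  # clearly positive
on_boundary = binary_confidence([0.0, 1.0], mu_pos, mu_neg)    # equidistant
far_away = binary_confidence([1.0, 50.0], mu_pos, mu_neg)      # off the domain
```

In this toy setup the example near the positive mean gets a large score, while both the boundary example and the far-away example get scores close to zero, matching the intended behavior.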
Sequence tagging For sequence tagging tasks (e.g., POS tagging), we compute the distance metric at the token level. Unlike in binary classification, the decision boundary here is more complicated, and the label distribution is typically imbalanced. The mean vector $\mu^{S}$ is therefore unlikely to be located at the decision boundary, so we directly use the negative distance as the confidence value for each token x:

$$e(x, S) = -d(x, S)$$

Training
Since we do not have annotated data in the target domain, we have to learn our model in an unsupervised fashion. Inspired by the recent progress on few-shot learning with metric-based models such as the matching network (Vinyals et al., 2016) and the prototypical network (Snell et al., 2017), we propose the following meta-training approach. Given K source domains, each source domain is considered in turn as a target, referred to as the meta-target, with the rest of the source domains as meta-sources. This way, we obtain K (meta-sources, meta-target) training pairs for domain adaptation. Then, we apply our MoE formulation over these meta-training pairs to learn the metric. At test time, the metric is applied to all K source domains for each example in the target domain.
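The construction of the K (meta-sources, meta-target) pairs is a simple leave-one-out enumeration over the source domains. A minimal sketch, with a hypothetical helper name and domain names taken from the Amazon reviews setting:

```python
def meta_training_pairs(source_names):
    """Return one (meta_sources, meta_target) pair per source domain."""
    pairs = []
    for i, meta_target in enumerate(source_names):
        # Every other source domain acts as a meta-source for this meta-target.
        meta_sources = source_names[:i] + source_names[i + 1:]
        pairs.append((meta_sources, meta_target))
    return pairs

# With Kitchen as the real target, the three remaining domains are the sources.
pairs = meta_training_pairs(["Books", "DVDs", "Electronics"])
for meta_sources, meta_target in pairs:
    print(meta_sources, "->", meta_target)
```

Each pair has K − 1 meta-sources, which is why the expert weights during meta-training are normalized over K − 1 domains rather than K.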
We optimize two main objectives: the MoE objective and the multi-task learning (MTL) objective.
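MoE objective For each example in each meta-target domain, we compute its MoE posterior using the corresponding meta-sources. Therefore, we get the following MoE loss over the entire multi-source training data:

$$\mathcal{L}_{moe} = -\sum_{i=1}^{K} \sum_{j=1}^{|S_i|} \log p_{moe}\big(y_j^{S_i} \mid x_j^{S_i};\, \mathcal{S}'_i\big) \tag{3}$$

where $\mathcal{S}'_i = \{S_l\}_{l \neq i}$ denotes the meta-sources for the meta-target $S_i$. Note that α is normalized over the meta-sources for each meta-target, rather than over all the K sources.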
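To make the MoE objective concrete, the following sketch computes the negative log-likelihood of the gold labels under the mixed posterior for a batch of meta-target examples. It is an illustration under assumptions: the function name `moe_nll` and the tuple layout are hypothetical, the loss is summed rather than averaged, and the weights are taken as already normalized over the meta-sources.

```python
import math

def moe_nll(examples):
    """Summed negative log-likelihood under the mixture of experts.

    Each example is (alphas, expert_posteriors, gold_label), where alphas are
    the weights over the meta-sources and expert_posteriors[i][c] is the
    probability that the i-th meta-source expert assigns to class c.
    """
    loss = 0.0
    for alphas, posteriors, gold in examples:
        p_gold = sum(a * p[gold] for a, p in zip(alphas, posteriors))
        loss -= math.log(p_gold)
    return loss

# One meta-target example, two meta-source experts, gold label 0.
batch = [([0.5, 0.5], [[0.9, 0.1], [0.3, 0.7]], 0)]
loss = moe_nll(batch)
```

Because the gradient flows through the weights as well as the expert posteriors, minimizing this loss trains the metric and the classifiers together.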
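MTL objective For each meta-target, we further optimize a supervised cross-entropy loss using the corresponding labels. All supervised objectives are optimized jointly with the encoder being shared, resulting in the following multi-task learning objective:

$$\mathcal{L}_{mtl} = -\sum_{i=1}^{K} \sum_{j=1}^{|S_i|} \log p^{S_i}\big(y_j^{S_i} \mid x_j^{S_i}\big) \tag{4}$$

Adversary-augmented MoE We use MMD (Gretton et al., 2012) as the adversary to minimize the divergence between the marginal distributions of the target domain and the source domains. Specifically, at each training epoch, given the K batches $\{\mathbf{x}^{S_1}, \mathbf{x}^{S_2}, \dots, \mathbf{x}^{S_K}\}$ from all the source domains, we sample an unlabeled batch $\mathbf{x}^{T}$ from our target domain, and minimize the MMD:

$$\mathcal{L}_{adv} = \mathrm{MMD}^2\big(\mathbf{x}^{S_1} \cup \dots \cup \mathbf{x}^{S_K},\ \mathbf{x}^{T}\big) \tag{5}$$

where

$$\mathrm{MMD}(D_S, D_T) = \Big\| \frac{1}{|D_S|} \sum_{x_s \in D_S} \phi(x_s) - \frac{1}{|D_T|} \sum_{x_t \in D_T} \phi(x_t) \Big\|_{\mathcal{H}}$$

measures the discrepancy between $D_S$ and $D_T$ in a Reproducing Kernel Hilbert Space (RKHS), and $\phi(\cdot)$ is the feature map induced by a universal kernel. We follow Bousmalis et al. (2016) and use a linear combination of multiple RBF kernels:

$$k(x, x') = \sum_{n} \exp\Big\{ -\frac{1}{2\sigma_n} \lVert x - x' \rVert^2 \Big\}$$

Algorithm 1
    ...
        Compute cross-entropy loss over $T_{meta}$, and add to $\mathcal{L}_{mtl}$
        Compute Mahalanobis metric $\alpha(x, S')$ for each $x \in T_{meta}$ and $S' \in \mathcal{S}_{meta}$ (Eq. (2))
        Compute MoE loss over $(\mathcal{S}_{meta}, T_{meta})$ using α, and add to $\mathcal{L}_{moe}$
        Compute entropy of α over $\mathcal{S}$, and add to $\mathcal{R}_h$
    end for
    Compute MMD between $\mathbf{x}^{T}$ and $\cup_{i=1}^{K} \mathbf{x}^{S_i}$, and add to $\mathcal{L}_{adv}$ (Eq. (5))
    Update parameters via backpropagating gradients of the total loss $\mathcal{L}$ (Eq. (7))
until convergence

Entropy regularization In the meta-training process, for each example x in the meta-target, we know exactly from which source x is sampled. This tells us that the α distribution should be skewed, which can be used as a soft constraint. Therefore, we propose to regularize the entropy of the α distribution over all the sources, rather than only the meta-sources (footnote 4):

$$\mathcal{R}_h = \sum_{i=1}^{K} \sum_{j=1}^{|S_i|} H\big(\alpha(x_j^{S_i};\, \mathcal{S})\big) \tag{6}$$

where $H(\cdot)$ is the entropy and $\mathcal{S} = \{S_i\}_{i=1}^{K}$ is the set of all sources.

Joint learning Our final objective is the weighted combination of the individual component losses:

$$\mathcal{L} = \lambda \cdot \mathcal{L}_{moe} + (1 - \lambda) \cdot \mathcal{L}_{mtl} + \gamma \cdot \mathcal{L}_{adv} + \eta \cdot \mathcal{R}_h \tag{7}$$

where λ controls the balance of the MoE loss and MTL loss.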
γ is set to 0 in the non-adversarial setting, when unlabeled data from the target domain is not provided. Additionally, it would be straightforward to add an MoE loss for labeled data in the target domain if they are available, thus extending our framework to a setting where we have few-shot target annotations. The training process is shown in Algorithm 1.
Footnote 4: Alternatively, we can directly exploit this supervision and minimize the KL divergence between the α distribution and its ground-truth one-hot distribution. In practice, however, we found it beneficial to allow examples from one domain to attend to different sources. This observation may be attributed to the fact that each domain indeed consists of multiple latent sub-domains.
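The way the component losses interact can be sketched in a few lines. This is a hedged illustration rather than the training code: it assumes the joint objective weights the MoE and MTL losses by λ and (1 − λ) and adds the adversarial loss and entropy regularizer scaled by γ and η; the function names are hypothetical.

```python
import math

def entropy(alphas):
    """Shannon entropy of an alpha distribution; low entropy means skewed weights."""
    return -sum(a * math.log(a) for a in alphas if a > 0.0)

def total_loss(l_moe, l_mtl, l_adv, r_h, lam, gamma, eta):
    """Assumed form of the joint objective: a weighted sum of the four terms."""
    return lam * l_moe + (1.0 - lam) * l_mtl + gamma * l_adv + eta * r_h

# A uniform alpha has maximal entropy, a one-hot alpha has zero entropy.
uniform_h = entropy([0.5, 0.5])
one_hot_h = entropy([1.0, 0.0])

# Non-adversarial setting: gamma = 0 removes the MMD term entirely.
loss = total_loss(l_moe=1.0, l_mtl=2.0, l_adv=3.0, r_h=4.0,
                  lam=0.5, gamma=0.0, eta=0.1)
```

Penalizing entropy pushes each example toward a few relevant experts without forcing a hard one-hot assignment, which leaves room for an example to draw on several sources.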

Task and Dataset
Sentiment classification We use the multi-domain Amazon reviews dataset, one of the standard benchmark datasets for domain adaptation. It contains reviews on four domains: Books (B), DVDs (D), Electronics (E), and Kitchen appliances (K).
We follow the specific experiment settings proposed by Chen et al. (2012), denoted CHEN12, and Ziser and Reichart (2017), denoted ZISER17. For each dataset, we conduct experiments by selecting the target domain in a round-robin fashion. Following the protocol in previous work, we use cross-validation over source domains for hyper-parameter selection for each adaptation task (Zhao et al., 2018). When training with an adversary, we use the 2,000-example training set of the target domain as the unlabeled data in both settings. In ZISER17, the same data is also used for testing, resulting in a transductive setting.
Part-of-Speech tagging We further consider a sequence tagging task, where the metric is computed over the token-level encodings and multi-class predictions are made at the token (word) level. We use the SANCL dataset (Petrov and McDonald, 2012), which contains part-of-speech (POS) tagging annotations in 5 web domains: Emails, Weblogs, Answers, Newsgroups, and Reviews. Among these, Newsgroups, Reviews, and Answers have both a validation and a test set, and are used as target domains. The test sets from Weblogs and Emails are used as individual source domains. The tagging is performed using the Universal POS tagset. We also use Twitter (Liu et al., 2018) as an additional training source. Since it differs substantially from the other sources and the target domains, it allows us to assess our model's ability to handle negative transfer. We consider 750 sentences from each SANCL source domain for training, and up to 2,250 sentences from the Twitter dataset to magnify the negative transfer. The validation set in the standard split of each target domain is used for hyper-parameter selection and early stopping in our experiments.

Baselines
We verify the efficacy of our approach (MoE) in non-adversarial and adversarial settings respectively. In both settings, we compare our approach against the following two baselines:
• best-SS: the best single-source adaptation model among all the sources.
• uni-MS: the unified multi-source adaptation model, which is trained using the combination of all the source domain data with single-source transfer methods. uni-MS is a common and strong baseline for multi-source domain adaptation (Zhao et al., 2018).
For the rest of the paper, we name the adversarial counterparts of the models as *-A.
In the adversarial setting on CHEN12, in addition to best-SS and uni-MS with adversarial loss, we further compare with the following two systems that also utilize unlabeled data from target domain.
• MDAN: the multi-source domain adversarial network (Zhao et al., 2018). MDAN gives the state-of-the-art performance for multi-source domain adaptation on CHEN12. It generalizes the domain adversarial network to multiple source domain adaptation by selectively backpropagating the domain discrimination loss according to domain classification error.

Implementation Details
For CHEN12, since the dataset is in TF-IDF format and the word ordering information is not available, we use a multilayer perceptron (MLP) with an input layer of 5,000 dimensions and one hidden layer of 500 dimensions as our encoder. For ZISER17, we instead use a convolutional neural network (CNN) encoder with a combination of kernel widths 3 and 5 (Kim, 2014), each with one hidden layer of size 150, which are then concatenated into a 300-dimensional representation. For the POS tagging encoder, we use a hierarchical bidirectional LSTM (BiLSTM) network, which contains a character-level BiLSTM for generating individual word representations, followed by a word-level BiLSTM that generates contextualized word representations.
For MMD, we follow Bousmalis et al. (2016) and use 19 RBF kernels with the standard deviation parameters ranging from $10^{-6}$ to $10^{6}$. All the models were trained using Adam with weight decay. The learning rate is set to $10^{-4}$ for CHEN12 and $10^{-3}$ for ZISER17 and POS tagging. We use mini-batches of 32 samples from each domain. We tune the coefficients λ and η for each adaptation task. γ is set to 1 for all experiments.
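The multi-kernel MMD setup described above can be sketched in plain Python. Several details here are assumptions made for illustration: the 19 bandwidths are spaced logarithmically between 10^-6 and 10^6, all kernels are weighted equally, each kernel has the form exp(−‖x − x'‖² / (2σ)), and the biased (V-statistic) estimator of the squared MMD is used. The function names are hypothetical.

```python
import math

# 19 bandwidths from 1e-6 to 1e6; log-spacing is an assumption.
SIGMAS = [10.0 ** (-6 + 12 * i / 18) for i in range(19)]

def rbf_mix(x, y, sigmas):
    """Equal-weight sum of RBF kernels evaluated at (x, y)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq / (2.0 * s)) for s in sigmas)

def mmd_squared(source, target, sigmas):
    """Biased estimate of squared MMD between two batches of encodings."""
    def mean_kernel(A, B):
        return sum(rbf_mix(a, b, sigmas) for a in A for b in B) / (len(A) * len(B))
    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

source_batch = [[0.0, 0.0], [1.0, 0.0]]  # union of source encodings (toy data)
target_batch = [[5.0, 5.0], [6.0, 5.0]]  # unlabeled target encodings (toy data)
same = mmd_squared(source_batch, source_batch, SIGMAS)
shifted = mmd_squared(source_batch, target_batch, SIGMAS)
```

Using a wide range of bandwidths avoids having to tune a single kernel width: at any scale of the encoding space, at least some kernels remain sensitive to the gap between the two batches.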

Sentiment Analysis on Amazon Reviews
We report our results on the Amazon reviews datasets in Table 1 (CHEN12) and Table 2 (ZISER17). Our approach (MoE) consistently achieves the best performance across different settings and tasks. The results clearly demonstrate the value of using multiple sources. In most cases, even a unified model performs better than the oracle best single source. By smartly combining all the sources, our model outperforms the unified model significantly. One exception is the task of "B,D,K-E" in CHEN12, where the unified multi-source model doesn't improve over the best single source model, constituting a negative transfer scenario. However, even in this scenario, our approach still performs significantly better, demonstrating its robustness in handling negative transfer.
Impact of adversarial adaptation We achieve consistent improvements over the baseline systems with the addition of the adversarial loss. In most cases, MoE also achieves additional improvement (e.g., 79.42% vs. 80.87% in "D,E,K-B"). We notice that in some cases, e.g., "B,D,K-E" in CHEN12 and "B,E,K-D" in ZISER17, the adversarial loss doesn't help MoE. This might be attributed to the fact that by aligning the target distribution with the source domains, the representation space becomes more compact, thus making it more difficult to capture source domain-specific characteristics and increasing the difficulty of metric learning in MoE.
Analysis on the metric (α) Figure 2 Figure 3 exemplifies the above point. For instance, the first review about "charger" and "battery" is closer to the Electronics source domain. This relation is successfully captured by the α distribution produced by our model.
We further investigate the impact of entropy regularization over α on CHEN12 and ZISER17. The results show that entropy regularization benefits our model under both non-adversarial and adversarial settings.
Example reviews:
• best 2 quart pot in the world . … with the glorious one pot meals cookbook . it has wonderful recipies , and the pot works wonderful .
• great kit however the book that comes with the kit needs some work . the photos in the book are not accurate with the descriptions . …

Part-of-Speech Tagging
Table 4 summarizes our results on POS tagging. Again, our approach consistently achieves the best performance across different settings and tasks. Adding Twitter as a source leads to a drop in performance for the unified model, as a result of negative transfer. Our method, however, robustly handles negative transfer and even manages to benefit from this additional source. Table 5 presents the α distribution learned by the metric, averaged over all tokens of the target domain. As we can see, our model (MoE-A) effectively learns to decrease the weights on Twitter, demonstrating again its ability to alleviate negative transfer. We further study the impact of this outlier source by varying the amount of Twitter data used during training, increasing the number of Twitter instances in steps of 750. As shown in Table 6, the additional Twitter data does not benefit the unified multi-source model (uni-MS-A), and even amplifies negative transfer for the Answers and Reviews domains. In contrast, the performance of our model (MoE-A) remains robust and improves consistently as more Twitter data is added, showing its ability to handle negative transfer.

Conclusion
In this paper, we propose a novel mixture-of-experts (MoE) approach for unsupervised domain adaptation from multiple diverse source domains. We model the domain relations through a point-to-set distance metric, and introduce a meta-training mechanism to learn this metric. Experimental results on sentiment classification and part-of-speech tagging demonstrate that our approach consistently outperforms various baselines and can robustly handle negative transfer. The effectiveness of our approach suggests its potential application to a broader range of domain adaptation tasks in NLP and other areas.