Multinomial Adversarial Networks for Multi-Domain Text Classification

Many text classification tasks are known to be highly domain-dependent. Unfortunately, the availability of training data can vary drastically across domains. Worse still, for some domains there may not be any annotated data at all. In this work, we propose a multinomial adversarial network (MAN) to tackle this real-world problem of multi-domain text classification (MDTC) in which labeled data may exist for multiple domains, but in insufficient amounts to train effective classifiers for one or more of the domains. We provide theoretical justifications for the MAN framework, proving that different instances of MANs are essentially minimizers of various f-divergence metrics (Ali and Silvey, 1966) among multiple probability distributions. MANs are thus a theoretically sound generalization of traditional adversarial networks that discriminate over two distributions. More specifically, for the MDTC task, MAN learns features that are invariant across multiple domains by resorting to its ability to reduce the divergence among the feature distributions of each domain. We present experimental results showing that MANs significantly outperform the prior art on the MDTC task. We also show that MANs achieve state-of-the-art performance for domains with no labeled data.


Introduction
Text classification is one of the most fundamental tasks in Natural Language Processing, and has found its way into a wide spectrum of NLP applications, ranging from email spam detection and social media analytics to sentiment analysis and data mining. Over the past couple of decades, supervised statistical learning methods have become the dominant approach for text classification (e.g. McCallum et al. (1998); Kim (2014); Iyyer et al. (2015)). Unfortunately, many text classification tasks are highly domain-dependent in that a text classifier trained using labeled data from one domain is likely to perform poorly on another. In the task of sentiment classification, for example, the phrase "runs fast" is usually associated with positive sentiment in the sports domain; not so when a user is reviewing the battery of an electronic device. In real applications, therefore, an adequate amount of training data from each domain of interest is typically required, and this is expensive to obtain. (The source code of MAN can be found at https://github.com/ccsasuke/man.)
Two major lines of work attempt to tackle this challenge: domain adaptation (Blitzer et al., 2007) and multi-domain text classification (MDTC) (Li and Zong, 2008). In domain adaptation, the assumption is that there is some domain with abundant training data (the source domain), and the goal is to utilize knowledge learned from the source domain to help perform classifications on another lower-resourced target domain (see §5 for other variants of domain adaptation). The focus of this work, MDTC, instead simulates an arguably more realistic scenario, where labeled data may exist for multiple domains, but in insufficient amounts to train an effective classifier for one or more of the domains. Worse still, some domains may have no labeled data at all. The objective of MDTC is to leverage all the available resources in order to improve the system performance over all domains simultaneously.
One state-of-the-art system for MDTC, the CMSC of Wu and Huang (2015), combines a classifier that is shared across all domains (for learning domain-invariant knowledge) with a set of classifiers, one per domain, each of which captures domain-specific text classification knowledge. This paradigm is sometimes known as the Shared-Private model (Bousmalis et al., 2016). CMSC, however, lacks an explicit mechanism to ensure that the shared classifier captures only domain-independent knowledge: the shared classifier may well also acquire some domain-specific features that are useful for a subset of the domains. We hypothesize that better performance can be obtained if this constraint is explicitly enforced.
arXiv:1802.05694v1 [cs.CL] 15 Feb 2018
In this paper, we thus propose Multinomial Adversarial Networks (henceforth, MANs) for the task of multi-domain text classification. In contrast to standard adversarial networks (Goodfellow et al., 2014), which serve as a tool for minimizing the divergence between two distributions (Nowozin et al., 2016), MANs represent a family of theoretically sound adversarial networks that leverage a multinomial discriminator to directly minimize the divergence among multiple probability distributions. And just as binomial adversarial networks have been applied to numerous tasks (e.g. image generation (Goodfellow et al., 2014), domain adaptation (Ganin et al., 2016), cross-lingual sentiment analysis (Chen et al., 2016)), we anticipate that MANs will make a versatile machine learning framework with applications beyond the MDTC task studied in this work.
We introduce the MAN architecture in §2 and prove in §3 that it directly minimizes the (generalized) f-divergence among multiple distributions so that they are indistinguishable upon successful training. Specifically for MDTC, MAN is used to overcome the aforementioned limitation in prior art where domain-specific features may sneak into the shared model. This is done by relying on MAN's power of minimizing the divergence among the feature distributions of each domain. The high-level idea is that MAN will make the extracted feature distributions of each domain indistinguishable from one another, thus learning general features that are invariant across domains.
We then validate the effectiveness of MAN in experiments on two MDTC data sets. We find first that MAN significantly outperforms the state-of-the-art CMSC method (Wu and Huang, 2015) on the widely used multi-domain Amazon review dataset, and does so without relying on external resources such as sentiment lexica (§4.1). When applied to the FDU-MTL dataset (§4.3), we obtain similar results: MAN achieves substantially higher accuracy than the previous top-performing method, ASP-MTL (Liu et al., 2017). ASP-MTL, proposed for a multi-task learning setting, is the first empirical attempt to use a multinomial adversarial network, but it is more restricted and can be viewed as a special case of MAN. In addition, we provide, for the first time, theoretical guarantees for MAN (§3) that were absent in ASP-MTL. Finally, while many MDTC methods such as CMSC require labeled data for each domain, MANs can be applied in cases where no labeled data exists for a subset of domains. To evaluate MAN in this semi-supervised setting, we compare MAN to a method that can accommodate unlabeled data for (only) one domain (Zhao et al., 2017), and show that MAN achieves performance comparable to the state of the art (§4.2).

Model
In this paper, we strive to tackle the text classification problem in a real-world setting in which texts come from a variety of domains, each with a varying amount of labeled data. Specifically, assume we have a total of $N$ domains: $N_1$ labeled domains (denoted as $\Delta_L$) for which there is some labeled data, and $N_2$ unlabeled domains ($\Delta_U$) for which no annotated training instances are available. Denote $\Delta = \Delta_L \cup \Delta_U$ as the collection of all domains, with $N = N_1 + N_2$ being the total number of domains we are faced with. The goal of this work, and MDTC in general, is to improve the overall classification performance across all $N$ domains, measured in this paper as the average classification accuracy across the $N$ domains in $\Delta$.

Model Architecture
As shown in Figure 1, the Multinomial Adversarial Network (MAN) adopts the Shared-Private paradigm of Bousmalis et al. (2016) and consists of four components: a shared feature extractor $F_s$, a domain feature extractor $F_{d_i}$ for each labeled domain $d_i \in \Delta_L$, a text classifier $C$, and finally a domain discriminator $D$. The main idea of MAN is to explicitly model the domain-invariant features that are beneficial to the main classification task across all domains (i.e. shared features, extracted by $F_s$), as well as the domain-specific features that mainly contribute to the classification in its own domain (domain features, extracted by $F_d$). Here, the adversarial domain discriminator $D$ takes a shared feature vector as input and has a multinomial output that predicts the likelihood of that sample coming from each domain. As seen in Figure 1, during the training flow of $F_s$ (green arrows), $F_s$ aims to confuse $D$ by minimizing $J^D_{F_s}$, which is anticorrelated with $J_D$ (detailed in §2.2), so that $D$ cannot predict the domain of a sample given its shared features. The intuition is that if even a strong discriminator $D$ cannot tell the domain of a sample from the extracted features, the features learned by $F_s$ are essentially domain-invariant. By enforcing domain-invariant features to be learned by $F_s$, when trained jointly via backpropagation, the set of domain feature extractors $F_d$ will each learn domain-specific features beneficial within its own domain.
The architecture of each component is relatively flexible, and can be decided by the practitioners to suit their particular classification tasks. For instance, the feature extractors can adopt the form of Convolutional Neural Nets (CNN), Recurrent Neural Nets (RNN), or a Multi-Layer Perceptron (MLP), depending on the input data (see §4). The input of MAN will also depend on the choice of feature extractor. The output of a (shared/domain) feature extractor is a fixed-length vector, which is considered the (shared/domain) hidden features of some given input text. On the other hand, the outputs of $C$ and $D$ are label probabilities for class and domain prediction, respectively. For example, both $C$ and $D$ can be MLPs with a softmax layer on top. In §3, we provide alternative architectures for $D$ and their mathematical implications. We now present detailed descriptions of the MAN training in §2.2 as well as the theoretical grounds in §3.

Training
Denote the annotated corpus in a labeled domain $d_i \in \Delta_L$ as $X_i$; $(x, y) \sim X_i$ is a sample drawn from the labeled data in domain $d_i$, where $x$ is the input and $y$ is the task label. On the other hand, for any domain $d_i \in \Delta$, denote the unlabeled corpus as $U_i$. Note that for a labeled domain, one can use a separate unlabeled corpus or simply use the labeled data (or use both).
In Figure 1, the arrows illustrate the training flows of various components. Due to the adversarial nature of the domain discriminator $D$, it is trained with a separate optimizer (red arrows), while the rest of the networks are updated with the main optimizer (green arrows). $C$ is only trained on labeled domains, and it takes as input the concatenation of the shared and domain feature vectors. At test time for unlabeled domains with no $F_d$, the domain features are set to the zero vector for $C$'s input. In contrast, $D$ only takes the shared features as input, for both labeled and unlabeled domains. The MAN training is described in Algorithm 1.
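The way $C$'s input is assembled can be illustrated with a minimal sketch. The function name `classifier_input` and the plain-list representation of feature vectors are assumptions made for illustration only; they are not taken from the released MAN code.

```python
from typing import List, Optional


def classifier_input(shared: List[float],
                     domain: Optional[List[float]],
                     domain_dim: int) -> List[float]:
    """Build C's input: shared features concatenated with domain features.

    For an unlabeled domain there is no domain feature extractor, so
    `domain` is None and the domain part is filled with zeros.
    (Hypothetical helper, for illustration only.)
    """
    if domain is None:
        domain = [0.0] * domain_dim
    if len(domain) != domain_dim:
        raise ValueError("domain feature vector has unexpected length")
    return list(shared) + list(domain)


# A labeled domain supplies both feature vectors.
labeled = classifier_input([0.2, -1.0], [0.5, 0.1, 0.3], domain_dim=3)
# An unlabeled domain supplies only the shared features.
unlabeled = classifier_input([0.2, -1.0], None, domain_dim=3)
```

Keeping the input width fixed in this way lets a single classifier $C$ serve both labeled and unlabeled domains.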
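In Algorithm 1, $L_C$ and $L_D$ are the loss functions of the text classifier $C$ and the domain discriminator $D$, respectively. As mentioned in §2.1, $C$ has a softmax layer on top for classification. We hence adopt the canonical negative log-likelihood (NLL) loss:
$$L_C(\hat{y}, y) = -\log P(\hat{y} = y) \tag{1}$$
where $y$ is the true label and $\hat{y}$ is the softmax prediction. For $D$, we consider two variants of MAN. The first uses the same NLL loss as $C$, which suits the classification task; the other uses the Least-Square (L2) loss, which was shown to alleviate the vanishing-gradient problem that arises when using the NLL loss in the adversarial setting (Mao et al., 2017):
$$L_D^{NLL}(\hat{d}, d) = -\log P(\hat{d} = d) \tag{2}$$
$$L_D^{L2}(\hat{d}, d) = \sum_{i=1}^{N} \left(\hat{d}_i - \mathbb{1}_{\{d = i\}}\right)^2 \tag{3}$$
where $d$ is the domain index of some sample and $\hat{d}$ is the prediction. Without loss of generality, we normalize $\hat{d}$ so that $\sum_{i=1}^{N} \hat{d}_i = 1$ and $\forall i: \hat{d}_i \ge 0$. Therefore, the objectives of $C$ and $D$ that we are minimizing are:
$$J_C = \sum_{d_i \in \Delta_L} \mathbb{E}_{(x,y) \sim X_i} \left[ L_C\big(C(F_s(x), F_{d_i}(x)),\, y\big) \right] \tag{4}$$
$$J_D = \sum_{i=1}^{N} \mathbb{E}_{x \sim U_i} \left[ L_D\big(D(F_s(x)),\, i\big) \right] \tag{5}$$
For the feature extractors, the training of domain feature extractors is straightforward, as their sole objective is to help $C$ perform better within their own domain. Hence, $J_{F_d} = J_C$ for any domain $d$. Finally, the shared feature extractor $F_s$ has two objectives: to help $C$ achieve higher accuracy, and to make the feature distribution invariant across all domains. It thus leads to the following bipartite loss:
$$J_{F_s} = J_C + \lambda \cdot J^D_{F_s}$$
where $\lambda$ is a hyperparameter balancing the two parts. $J^D_{F_s}$ is the domain loss of $F_s$, anticorrelated with $J_D$:
$$\text{(NLL)} \quad J^D_{F_s} = -J_D \tag{6}$$
$$\text{(L2)} \quad J^D_{F_s} = \sum_{i=1}^{N} \mathbb{E}_{x \sim U_i} \left[ \sum_{j=1}^{N} \left( D_j(F_s(x)) - \frac{1}{N} \right)^2 \right] \tag{7}$$
If $D$ adopts the NLL loss (6), the domain loss is simply $-J_D$. For the L2 loss (7), $J^D_{F_s}$ intuitively translates to pushing $D$ to make random predictions. See §3 for theoretical justifications.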
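To make the per-sample losses concrete, the following sketch evaluates them on a single normalized discriminator output. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and plain Python lists stand in for the outputs of $D$.

```python
import math
from typing import List


def nll_loss(probs: List[float], target: int) -> float:
    """NLL loss: negative log-probability assigned to the true index."""
    return -math.log(probs[target])


def l2_loss(probs: List[float], target: int) -> float:
    """L2 loss of D: squared distance to the one-hot domain indicator."""
    return sum((p - (1.0 if i == target else 0.0)) ** 2
               for i, p in enumerate(probs))


def fs_domain_loss_nll(probs: List[float], target: int) -> float:
    """Domain loss of F_s under NLL: the negation of D's loss."""
    return -nll_loss(probs, target)


def fs_domain_loss_l2(probs: List[float]) -> float:
    """Domain loss of F_s under L2: distance of D's output to uniform."""
    n = len(probs)
    return sum((p - 1.0 / n) ** 2 for p in probs)


def shared_extractor_loss(j_c: float, j_d_fs: float, lam: float) -> float:
    """Bipartite loss of F_s: classification loss plus weighted domain loss."""
    return j_c + lam * j_d_fs


d_hat = [0.7, 0.2, 0.1]          # toy prediction over N = 3 domains
d_loss_nll = nll_loss(d_hat, 0)  # D's loss if the sample is from domain 0
d_loss_l2 = l2_loss(d_hat, 0)
fs_loss_l2 = fs_domain_loss_l2(d_hat)
fs_loss_l2_uniform = fs_domain_loss_l2([1 / 3, 1 / 3, 1 / 3])
```

The L2 domain loss of $F_s$ vanishes exactly when $D$ outputs the uniform distribution, which is the sense in which $F_s$ pushes $D$ toward random predictions.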

Theories of Multinomial Adversarial Networks
Binomial adversarial nets are known to have theoretical connections to the minimization of various f-divergences between two distributions (Nowozin et al., 2016). However, for adversarial training among multiple distributions, although a similar idea has been explored empirically (Liu et al., 2017), no theoretical justification has been provided to the best of our knowledge.
In this section, we present a theoretical analysis showing the validity of MAN. In particular, we show that MAN's objective is equivalent to minimizing the total f-divergence between each of the shared feature distributions of the $N$ domains and the centroid of the $N$ distributions. The choice of loss function determines which specific f-divergence is minimized. Furthermore, with adequate model capacity, MAN achieves its optimum for either loss function if and only if all $N$ shared feature distributions are identical, hence learning an invariant feature space across all domains.
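First consider the distribution of the shared features $f$ for instances in each domain $d_i \in \Delta$:
$$P_i(f) \triangleq P\big(f = F_s(x) \mid x \in d_i\big)$$
Combining (5) with the two loss functions (2), (3), the objective of $D$ can be written as:
$$J_D^{NLL} = -\sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \log D_i(f) \right]$$
$$J_D^{L2} = \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \sum_{j=1}^{N} \left( D_j(f) - \mathbb{1}_{\{i = j\}} \right)^2 \right]$$
where $D_i(f)$ is the $i$-th dimension of $D$'s (normalized) output vector, which conceptually corresponds to the probability of $D$ predicting that $f$ is from domain $d_i$.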
We first derive the optimal $D$ for any fixed $F_s$.
Lemma 1. For any fixed $F_s$, with either NLL or L2 loss, the optimum domain discriminator $D^*$ is:
$$D_i^*(f) = \frac{P_i(f)}{\sum_{j=1}^{N} P_j(f)}$$
The proof involves an application of the Lagrangian Multiplier to solve for the minimum value of $J_D$, and the details can be found in the Appendix. We then have the following main theorems for the domain loss for $F_s$:
Theorem 1. Let $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$. When $D$ is trained to its optimality, if $D$ adopts the NLL loss:
$$J^D_{F_s} = -J_{D^*} = -N \log N + N \cdot JSD(P_1, P_2, \dots, P_N) = -N \log N + \sum_{i=1}^{N} KL(P_i \,\|\, \bar{P})$$
where $JSD(\cdot)$ is the generalized Jensen-Shannon Divergence (Lin, 1991) among multiple distributions, defined as the average Kullback-Leibler divergence of each $P_i$ to the centroid $\bar{P}$ (Aslam and Pavlu, 2007).
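The NLL identity can be checked numerically on small discrete distributions. The sketch below is only a sanity check under simplifying assumptions: the feature space is a finite set, the three toy distributions are made up for illustration, and all helper names are hypothetical.

```python
import math


def centroid(dists):
    """Pointwise average of the N distributions."""
    n = len(dists)
    return [sum(p[k] for p in dists) / n for k in range(len(dists[0]))]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def generalized_jsd(dists):
    """Average KL divergence of each distribution to the centroid."""
    p_bar = centroid(dists)
    return sum(kl(p, p_bar) for p in dists) / len(dists)


def optimal_discriminator(dists, k):
    """D*(f_k): the vector of P_i(f_k) / sum_j P_j(f_k) over domains i."""
    total = sum(p[k] for p in dists)
    return [p[k] / total for p in dists]


def nll_domain_loss_at_optimum(dists):
    """-J_{D*} = sum_i E_{f ~ P_i}[log D*_i(f)] on a finite feature space."""
    loss = 0.0
    for i, p in enumerate(dists):
        for k, pk in enumerate(p):
            if pk > 0:
                loss += pk * math.log(optimal_discriminator(dists, k)[i])
    return loss


# Three made-up feature distributions over four feature values.
dists = [[0.1, 0.2, 0.3, 0.4],
         [0.25, 0.25, 0.25, 0.25],
         [0.4, 0.3, 0.2, 0.1]]
n = len(dists)
lhs = nll_domain_loss_at_optimum(dists)
rhs = -n * math.log(n) + n * generalized_jsd(dists)
same = [dists[0]] * n  # identical distributions: the divergence vanishes
lhs_same = nll_domain_loss_at_optimum(same)
```

On this toy input `lhs` and `rhs` agree up to floating-point error, and identical distributions reach the lower bound $-N \log N$.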
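Theorem 2. If $D$ uses the L2 loss:
$$J^D_{F_s} = \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \sum_{j=1}^{N} \left( D_j^*(f) - \frac{1}{N} \right)^2 \right] = \frac{1}{N} \sum_{i=1}^{N} \chi^2_{Neyman}(P_i \,\|\, \bar{P})$$
where $\chi^2_{Neyman}(\cdot \,\|\, \cdot)$ is the Neyman $\chi^2$ divergence (Nielsen and Nock, 2014). The proofs of both theorems can be found in the Appendix.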
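The L2 case admits the same kind of numerical check. The sketch assumes a finite feature space and takes the $\chi^2$ term in the form $\sum_f (P_i(f) - \bar{P}(f))^2 / \bar{P}(f)$, which is what the algebra yields; naming conventions for Neyman versus Pearson $\chi^2$ differ between references, so the function name below is an assumption. The toy distributions and helper names are likewise hypothetical.

```python
def centroid(dists):
    """Pointwise average of the N distributions."""
    n = len(dists)
    return [sum(p[k] for p in dists) / n for k in range(len(dists[0]))]


def chi2_to_centroid(p, p_bar):
    """Sum over feature values of (p - p_bar)^2 / p_bar."""
    return sum((pk - qk) ** 2 / qk for pk, qk in zip(p, p_bar) if qk > 0)


def l2_domain_loss_at_optimum(dists):
    """sum_i E_{f ~ P_i}[ sum_j (D*_j(f) - 1/N)^2 ] on a finite space."""
    n = len(dists)
    loss = 0.0
    for p in dists:
        for k, pk in enumerate(p):
            if pk > 0:
                total = sum(q[k] for q in dists)
                # D*_j(f_k) = P_j(f_k) / sum over all domains of P(f_k)
                loss += pk * sum((q[k] / total - 1.0 / n) ** 2
                                 for q in dists)
    return loss


# Three made-up feature distributions over four feature values.
dists = [[0.1, 0.2, 0.3, 0.4],
         [0.25, 0.25, 0.25, 0.25],
         [0.4, 0.3, 0.2, 0.1]]
n = len(dists)
p_bar = centroid(dists)
lhs = l2_domain_loss_at_optimum(dists)
rhs = sum(chi2_to_centroid(p, p_bar) for p in dists) / n
lhs_same = l2_domain_loss_at_optimum([dists[0]] * n)
```

Here too the two sides coincide up to rounding, and the loss is zero when all distributions are identical.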
Consequently, by the non-negativity and joint convexity of the f-divergence (Csiszar and Korner, 1982), we have:
Corollary 1. The optimum of $J^D_{F_s}$ is $-N \log N$ when using the NLL loss, and 0 for the L2 loss. The optimum value above is achieved if and only if $P_1 = P_2 = \dots = P_N = \bar{P}$.
Therefore, the loss of $F_s$ can be interpreted as simultaneously minimizing the classification loss $J_C$ as well as the divergence among feature distributions of all domains. It can thus learn a shared feature mapping that is invariant across domains upon successful training while being beneficial to the main classification task.
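As a sanity check on the NLL case, setting $N = 2$ should recover the familiar two-distribution result. A short worked instance, assuming a discriminator with $D_2 = 1 - D_1$ and $D_1^*(f) = P_1(f) / (P_1(f) + P_2(f))$:

```latex
\begin{align*}
J^{D}_{F_s}\Big|_{N=2}
  &= \mathbb{E}_{f\sim P_1}\big[\log D_1^*(f)\big]
   + \mathbb{E}_{f\sim P_2}\big[\log\big(1 - D_1^*(f)\big)\big] \\
  &= -2\log 2 + 2\cdot \mathrm{JSD}(P_1, P_2) \\
  &= -\log 4 + \mathrm{KL}(P_1 \,\|\, \bar{P}) + \mathrm{KL}(P_2 \,\|\, \bar{P}),
  \qquad \bar{P} = \tfrac{1}{2}(P_1 + P_2)
\end{align*}
```

This has the same form as the value attained at the optimal discriminator in the two-distribution analysis of Goodfellow et al. (2014), which is consistent with viewing binomial adversarial networks as the $N = 2$ instance of MAN.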

Multi-Domain Text Classification
In this experiment, we compare MAN to state-of-the-art MDTC systems on the multi-domain Amazon review dataset (Blitzer et al., 2007), which is one of the most widely used MDTC datasets. Note that this dataset was already preprocessed into a bag of features (unigrams and bigrams), losing all word order information. This prohibits the usage of CNNs or RNNs as feature extractors, limiting the potential performance of the system. Nonetheless, we adopt the same dataset for fair comparison and employ an MLP as our feature extractor. In particular, we take the 5000 most frequent features and represent each review as a 5000-dimensional feature vector, where feature values are raw counts of the features. Our MLP feature extractor therefore has an input size of 5000 in order to process the reviews. The Amazon dataset contains 2000 samples for each of the four domains: book, DVD, electronics, and kitchen, with binary labels (positive, negative). Following Wu and Huang (2015), we conduct 5-fold cross-validation. Three of the five folds are treated as the training set, one serves as the validation set, and the remaining one is the test set. The 5-fold average test accuracy is reported.
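The 3/1/1 fold protocol can be sketched as follows. The text does not say which fold serves as validation for a given test fold, nor how samples are assigned to folds; the rotation (validation fold = next fold after the test fold) and the interleaved assignment below are assumptions chosen for illustration, and `rotating_splits` is a hypothetical name.

```python
def rotating_splits(n_samples, n_folds=5):
    """Yield (train, val, test) index lists, one triple per test fold.

    Assumption: fold t is the test set and fold (t + 1) mod n_folds is
    the validation set; the remaining folds form the training set.
    """
    # Interleaved fold assignment; a real setup would shuffle first.
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for t in range(n_folds):
        v = (t + 1) % n_folds
        train = [idx for j in range(n_folds) if j not in (t, v)
                 for idx in folds[j]]
        yield train, folds[v], folds[t]


# 2000 reviews per domain, as in the Amazon dataset described above.
splits = list(rotating_splits(2000))
sizes = [(len(tr), len(va), len(te)) for tr, va, te in splits]
```

With 2000 samples per domain this gives 1200 training, 400 validation and 400 test reviews in each of the five rounds.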
Table 1 shows the main results. Three types of models are shown: Domain-Specific Models Only, where only in-domain models are trained; Shared Model Only, where a single model is trained with all data; and Shared-Private Models, a combination of the previous two. Within each category, various architectures are examined, such as Least Square (LS), SVM, and Logistic Regression (LR).
As explained before, we use an MLP as the feature extractor for all our models (bold ones). Among our models, the ones with the MAN prefix use adversarial training, and MAN-L2 and MAN-NLL indicate the L2 loss and NLL loss MAN, respectively.
From Table 1, we can see that by adopting modern deep neural networks, our methods achieve superior performance within the first two model categories even without adversarial training. This is corroborated by the fact that our SP-MLP model performs comparably to CMSC, while the latter relies on external resources such as sentiment lexica. Moreover, when our multinomial adversarial nets are introduced, further improvement is observed. With both loss functions, MAN outperforms all Shared-Private baseline systems on each domain, and achieves statistically significantly higher overall performance. For our MAN-SP models, we provide the mean accuracy as well as the standard errors over five runs, to illustrate the performance variance and to conduct significance tests. It can be seen that MAN's performance is relatively stable, and consistently outperforms CMSC.
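The mean, standard error and one-sample t statistic over five runs can be computed as in the sketch below. The accuracies and the reference value are HYPOTHETICAL placeholders, not numbers from Table 1, and the comparison against a tabulated critical value (2.776 for a two-sided test at the 0.05 level with 4 degrees of freedom) stands in for a full p-value computation.

```python
import math
import statistics


def one_sample_t(samples, reference):
    """Return (mean, standard error, t statistic) against a fixed reference."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # sample std / sqrt(n)
    return mean, sem, (mean - reference) / sem


runs = [88.1, 87.9, 88.4, 88.0, 88.2]  # HYPOTHETICAL accuracies of 5 runs
baseline = 87.5                        # HYPOTHETICAL reference accuracy
mean, sem, t_stat = one_sample_t(runs, baseline)

T_CRIT = 2.776  # two-sided critical value, alpha = 0.05, 4 degrees of freedom
significant = abs(t_stat) > T_CRIT
```

A one-sample test fits this setting because only a single reported number is available for the baseline, whereas the new model is run several times.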

Experiments for Unlabeled Domains
As CMSC requires labeled data for each domain, their experiments were naturally designed this way. In reality, however, many domains may not have any annotated corpora available. It is therefore also important to look at the performance of an MDTC system in these unlabeled domains. Fortunately, as described before, MAN's adversarial training only utilizes unlabeled data from each domain to learn the domain-invariant features, and can thus be used on unlabeled domains as well. During testing, only the shared feature vector is fed into $C$, while the domain feature vector is set to 0.
In order to validate MAN's effectiveness, we compare to state-of-the-art multi-source domain adaptation (MS-DA) methods (see §5). Compared to standard domain adaptation methods with one source and one target domain, MS-DA allows the adaptation from multiple source domains to a single target domain. Analogously, MDTC can be viewed as multi-source multi-target domain adaptation, which is superior when multiple target domains exist. With multiple target domains, MS-DA will need to treat each one as an independent task, which is more expensive and cannot utilize the unlabeled data in other target domains.
In this work, we compare MAN with one recent MS-DA method, MDAN (Zhao et al., 2017). Their experiments only have one target domain to suit their approach, and we follow this setting for fair comparison. However, it is worth noting that MAN is designed for the MDTC setting, and can deal with multiple target domains at the same time, which can potentially improve the performance by taking advantage of more unlabeled data from multiple target domains during adversarial training. We adopt the same setting as Zhao et al. (2017), which is based on the same multi-domain Amazon review dataset. Each of the four domains in the dataset is treated as the target domain in four separate experiments, while the remaining three are used as source domains.
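The four experiments can be enumerated with a small helper. This is a sketch of the experimental bookkeeping only; the function name and the dictionary layout are assumptions, and the choice to list the target among the unlabeled domains reflects the earlier statement that $D$ sees shared features from both labeled and unlabeled domains.

```python
DOMAINS = ["books", "dvd", "electronics", "kitchen"]


def leave_one_domain_out(domains):
    """Yield one experiment per target domain; the rest act as sources."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield {
            "target": target,               # no labeled data is used here
            "labeled": sources,             # train C and the F_d extractors
            "unlabeled": sources + [target],  # adversarial training of F_s, D
        }


experiments = list(leave_one_domain_out(DOMAINS))
```

Each configuration trains on three labeled source domains and reports test accuracy on the held-out target.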
In Table 2, the target domain is shown on top, and the test set accuracy is reported for various systems. It shows that MAN outperforms several baseline systems, such as an MLP trained on the source domains, as well as single-source domain adaptation methods such as mSDA (Chen et al., 2012) and DANN (Ganin et al., 2016), where the training data in the multiple source domains are combined and viewed as a single domain. Finally, when compared to MDAN, MAN and MDAN each achieve higher accuracy on two out of the four target domains, and the average accuracy of MAN is similar to that of MDAN. Therefore, MAN achieves competitive performance for the domains without annotated corpora. Nevertheless, unlike MS-DA methods, MAN can handle multiple target domains at one time.

Experiments on the MTL Dataset
To make fair comparisons, the previous experiments follow the standard settings in the literature, where the widely adopted Amazon review dataset is used. However, this dataset has a few limitations. First, it has only four domains. In addition, the reviews are already tokenized and converted to a bag of features consisting of unigrams and bigrams. Raw review texts are hence not available in this dataset, making it impossible to use certain modern neural architectures such as CNNs and RNNs. To provide more insight into how well MAN works with other feature extractor architectures, we provide a third set of experiments on the FDU-MTL dataset (Liu et al., 2017). The dataset was created as a multi-task learning dataset with 16 tasks, where each task is essentially a different domain of reviews. It has 14 Amazon domains: books, electronics, DVD, kitchen, apparel, camera, health, music, toys, video, baby, magazine, software, and sports, in addition to two movie review domains from the IMDb and the MR datasets. Each domain has a development set of 200 samples and a test set of 400 samples. The amounts of training and unlabeled data vary across domains but are roughly 1400 and 2000, respectively.
We compare MAN with ASP-MTL (Liu et al., 2017) on this FDU-MTL dataset. ASP-MTL also adopts adversarial training for learning a shared feature space, and can be viewed as a special case of MAN when adopting the NLL loss (MAN-NLL). Furthermore, while Liu et al. (2017) do not provide any theoretical justification, we prove in §3 the validity of MAN for not only the NLL loss, but also an additional L2 loss. Besides this theoretical advantage, we show in this section that MAN also substantially outperforms ASP-MTL in practice due to the choice of feature extractor.
In particular, Liu et al. (2017) choose an LSTM as their feature extractor, yet we found a CNN (Kim, 2014) to achieve much better accuracy while being roughly 10 times faster. Indeed, as shown in Table 3, with or without adversarial training, our CNN models outperform the LSTM ones by a large margin. When MAN is introduced, we attain state-of-the-art performance on every domain with an 88.4% overall accuracy, surpassing ASP-MTL by a significant margin of 2.3%.
We hypothesize that the LSTM's much lower accuracy relative to the CNN is due to the lack of an attention mechanism. In ASP-MTL, only the last hidden unit is taken as the extracted features. While an LSTM is effective for representing the context of each token, it might not be powerful enough for directly encoding the entire document (Bahdanau et al., 2015). Therefore, various attention mechanisms have been introduced on top of the vanilla LSTM to select the words (and contexts) most relevant for making the predictions. In our preliminary experiments, we find that a bi-directional LSTM with dot-product attention (Luong et al., 2015) yields better performance than the vanilla LSTM in ASP-MTL. However, it still does not outperform the CNN and is much slower. As a result, we conclude that, for text classification tasks, a CNN is both effective and efficient in extracting local and higher-level features for making a single categorization.
Finally, we observe that MAN-NLL achieves slightly higher overall performance than MAN-L2, providing evidence for the claim in a recent study (Lucic et al., 2017) that the original GAN loss (NLL) may not be inherently inferior. Moreover, the two variants excel in different domains, suggesting the possibility of further performance gains from ensembling.

Related Work
Multi-Domain Text Classification. The MDTC task was first examined by Li and Zong (2008), who proposed to fuse the training data from multiple domains either on the feature level or the classifier level. The prior art of MDTC (Wu and Huang, 2015) decomposes the text classifier into a general one and a set of domain-specific ones. However, the general classifier is learned by parameter sharing and domain-specific knowledge may sneak into it. They also require external resources to help improve accuracy and compute domain similarities.
Domain Adaptation. Domain adaptation attempts to transfer the knowledge from a source domain to a target one, and the traditional form is the single-source, single-target (SS,ST) adaptation (Blitzer et al., 2006). Another variant is the SS,MT adaptation (Yang and Eisenstein, 2015), which tries to simultaneously transfer the knowledge to multiple target domains from a single source. However, it cannot take full advantage of the training data if it comes from multiple source domains. MS,ST adaptation (Mansour et al., 2009; Zhao et al., 2017) can deal with multiple source domains but only transfers to a single target domain. Therefore, when multiple target domains exist, they need to be treated as independent problems, which is more expensive and cannot utilize the additional unlabeled data in these domains. Finally, MDTC can be viewed as MS,MT adaptation, which is arguably more general and realistic.
Adversarial Networks. The idea of adversarial networks was proposed by Goodfellow et al. (2014) for image generation, and has been applied to various NLP tasks as well (Chen et al., 2016; Li et al., 2017). Ganin et al. (2016) first used it for SS,ST domain adaptation, followed by many others. Bousmalis et al. (2016) utilized adversarial training in a shared-private model for domain adaptation to learn domain-invariant features, but still focused on the SS,ST setting. Finally, the idea of using adversarial nets to discriminate over multiple distributions was empirically explored in a very recent work (Liu et al., 2017) under the multi-task learning setting, and can be considered a special case of our MAN framework with the NLL domain loss. Nevertheless, we propose a more general framework with alternative architectures for the adversarial component, and for the first time provide theoretical justifications for the multinomial adversarial nets. Moreover, Liu et al. (2017) used an LSTM without attention as their feature extractor, which we found to perform sub-optimally in the experiments. We instead chose Convolutional Neural Nets as our feature extractor, which achieve higher accuracy while running an order of magnitude faster (see §4.3).

Conclusion
In this work, we propose a family of Multinomial Adversarial Networks (MAN) that generalize the traditional binomial adversarial nets in the sense that MAN can simultaneously minimize the difference among multiple probability distributions instead of two. We provide theoretical justifications for two instances of MAN, MAN-NLL and MAN-L2, showing they are minimizers of two different f-divergence metrics among multiple distributions, respectively. This indicates MAN can be used to make multiple distributions indistinguishable from one another. It can hence be applied to a variety of tasks, similar to the versatile binomial adversarial nets, which have been used in many areas for making two distributions alike.
In this paper, we design a MAN model for the MDTC task, following the shared-private paradigm that has a shared feature extractor to learn domain-invariant features and domain feature extractors to learn domain-specific ones. MAN is used to force the shared feature extractor to learn only domain-invariant knowledge, by relying on MAN's power of making the shared feature distributions of samples from each domain indistinguishable. We conduct extensive experiments, demonstrating that our MAN model outperforms the prior art systems in MDTC, and achieves state-of-the-art performance on domains without labeled data when compared to multi-source domain adaptation methods.

Appendix A Proofs
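A.1 Proofs for MAN-NLL
Assume we have $N$ domains, and consider the distribution of the shared features $F_s$ for instances in each domain $d_i$:
$$P_i(f) \triangleq P\big(f = F_s(x) \mid x \in d_i\big)$$
The objective that $D$ attempts to minimize is:
$$J_D = -\sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \log D_i(f) \right]$$
where $D_i(f)$ is the $i$-th dimension of $D$'s output vector, which conceptually corresponds to the softmax probability of $D$ predicting that $f$ is from domain $d_i$. We therefore have the property that for any $f$:
$$\sum_{i=1}^{N} D_i(f) = 1 \tag{13}$$
Lemma 2. For any fixed $F_s$, the optimum domain discriminator $D^*$ is:
$$D_i^*(f) = \frac{P_i(f)}{\sum_{j=1}^{N} P_j(f)}$$
Proof. For a fixed $F_s$, the optimum
$$D^* = \arg\min_D J_D = \arg\max_D \sum_{i=1}^{N} \int_f P_i(f) \log D_i(f)\, df = \arg\max_D \int_f \sum_{i=1}^{N} P_i(f) \log D_i(f)\, df$$
We employ the Lagrangian Multiplier to derive $\arg\max_D \sum_{i=1}^{N} P_i(f) \log D_i(f)$ under the constraint of (13). Let
$$L(D_1, \dots, D_N, \lambda) = \sum_{i=1}^{N} P_i \log D_i - \lambda \left( \sum_{i=1}^{N} D_i - 1 \right)$$
where $\lambda$ here denotes the Lagrange multiplier rather than the hyperparameter of the bipartite loss. Let $\nabla L = 0$:
$$\frac{P_i}{D_i} = \lambda \quad (\forall i), \qquad \sum_{i=1}^{N} D_i = 1$$
Solving the two equations, we have:
$$D_i^*(f) = \frac{P_i(f)}{\sum_{j=1}^{N} P_j(f)}$$
On the other hand, the loss function of the shared feature extractor $F_s$ consists of two additive components, the loss from the text classifier $C$, and the loss from the domain discriminator $D$:
$$J_{F_s} = J_C + \lambda \cdot J^D_{F_s}$$
We have the following theorem for the domain loss for $F_s$:
Theorem 3. When $D$ is trained to its optimality:
$$J^D_{F_s} = -J_{D^*} = -N \log N + N \cdot JSD(P_1, P_2, \dots, P_N) \tag{16}$$
where $JSD(\cdot)$ is the generalized Jensen-Shannon Divergence (Lin, 1991) among multiple distributions.
Proof. Let $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$. There are two equivalent definitions of the generalized Jensen-Shannon divergence: the original definition based on Shannon entropy (Lin, 1991), and a reshaped one expressed as the average Kullback-Leibler divergence of each $P_i$ to the centroid $\bar{P}$ (Aslam and Pavlu, 2007). We adopt the latter one here:
$$JSD(P_1, P_2, \dots, P_N) \triangleq \frac{1}{N} \sum_{i=1}^{N} KL(P_i \,\|\, \bar{P}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \log \frac{P_i(f)}{\bar{P}(f)} \right]$$
Now substituting $D^*$ into $J^D_{F_s}$:
$$J^D_{F_s} = -J_{D^*} = \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \log \frac{P_i(f)}{\sum_{j=1}^{N} P_j(f)} \right] = -N \log N + \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \log \frac{P_i(f)}{\bar{P}(f)} \right] = -N \log N + N \cdot JSD(P_1, P_2, \dots, P_N)$$
Consequently, by the non-negativity of JSD (Lin, 1991), we have the following corollary:
Corollary 2. The optimum of $J^D_{F_s}$ is $-N \log N$, and is achieved if and only if $P_1 = P_2 = \dots = P_N = \bar{P}$.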

A.2 Proofs for MAN-L2
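The proof is similar for MAN with the L2 loss. The loss function used by $D$ is, for a sample from domain $d_i$ with shared feature vector $f$:
$$L_D(D(f), i) = \sum_{j=1}^{N} \left( D_j(f) - \mathbb{1}_{\{i = j\}} \right)^2$$
So the objective that $D$ minimizes is:
$$J_D = \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \sum_{j=1}^{N} \left( D_j(f) - \mathbb{1}_{\{i = j\}} \right)^2 \right]$$
For simplicity, we further constrain $D$'s outputs to be on a simplex:
$$\sum_{i=1}^{N} D_i(f) = 1 \quad (\forall f)$$
Lemma 3. For any fixed $F_s$, the optimum domain discriminator $D^*$ is:
$$D_i^*(f) = \frac{P_i(f)}{\sum_{j=1}^{N} P_j(f)}$$
Proof. For a fixed $F_s$, the optimum
$$D^* = \arg\min_D J_D = \arg\min_D \int_f \sum_{i=1}^{N} P_i(f) \sum_{j=1}^{N} \left( D_j(f) - \mathbb{1}_{\{i = j\}} \right)^2 df$$
As in the NLL case, we apply the Lagrangian Multiplier under the simplex constraint and set the gradient to zero:
$$2 \left( D_i \sum_{j=1}^{N} P_j - P_i \right) = \lambda \quad (\forall i), \qquad \sum_{i=1}^{N} D_i = 1$$
Solving the two equations, we have $\lambda = 0$ and:
$$D_i^*(f) = \frac{P_i(f)}{\sum_{j=1}^{N} P_j(f)}$$
For the domain loss of $F_s$:
Theorem 4. Let $\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$. When $D$ is trained to its optimality:
$$J^D_{F_s} = \sum_{i=1}^{N} \mathbb{E}_{f \sim P_i} \left[ \sum_{j=1}^{N} \left( D_j^*(f) - \frac{1}{N} \right)^2 \right] = \frac{1}{N} \sum_{i=1}^{N} \chi^2_{Neyman}(P_i \,\|\, \bar{P})$$
where $\chi^2_{Neyman}(\cdot \,\|\, \cdot)$ is the Neyman $\chi^2$ divergence (Nielsen and Nock, 2014).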
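The step from the optimal discriminator to the $\chi^2$ form is not spelled out above. One way to fill it in, offered as a sketch rather than as the authors' own proof, is to substitute $D_j^*(f) = P_j(f) / (N \bar{P}(f))$ and use $\sum_i P_i(f) = N \bar{P}(f)$:

```latex
\begin{align*}
J^{D}_{F_s}
  &= \sum_{i=1}^{N} \mathbb{E}_{f\sim P_i}\left[\sum_{j=1}^{N}
     \left(\frac{P_j(f)}{N\bar{P}(f)} - \frac{1}{N}\right)^{2}\right] \\
  &= \int_f \left(\sum_{i=1}^{N} P_i(f)\right) \frac{1}{N^{2}}
     \sum_{j=1}^{N} \frac{\big(P_j(f) - \bar{P}(f)\big)^{2}}{\bar{P}(f)^{2}}\, df \\
  &= \frac{1}{N} \sum_{j=1}^{N} \int_f
     \frac{\big(P_j(f) - \bar{P}(f)\big)^{2}}{\bar{P}(f)}\, df
\end{align*}
```

Each integral in the last line is a $\chi^2$-type divergence between $P_j$ and the centroid $\bar{P}$; it is non-negative and vanishes only when $P_j = \bar{P}$, which gives the optimum value 0 for the L2 loss.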

Figure 1: MAN for MDTC. The figure demonstrates the training on a mini-batch of data from one domain. One training iteration consists of one such mini-batch training from each domain. The parameters of $F_s$, $F_d$, $C$ are updated together, and the training flows are illustrated by the green arrows. The parameters of $D$ are updated separately, shown in red arrows. Solid lines indicate forward passes while dotted lines are backward passes. $J^D_{F_s}$ is the domain loss for $F_s$, which is anticorrelated with $J_D$ (e.g. $J^D_{F_s} = -J_D$). (See §2, §3)

Table 1: Results on the Amazon dataset. Models in bold are ours, while the performance of the rest is taken from Wu and Huang (2015). Numbers in parentheses indicate standard errors, calculated based on 5 runs. Bold numbers indicate the highest performance in each domain, and * shows statistical significance (p < 0.05) over CMSC under a one-sample T-Test.

Table 2: Results on unlabeled domains. Models in bold are our models, while the rest is taken from Zhao et al. (2017). Highest domain performance is shown in bold.
Table 3: Results on the FDU-MTL dataset. Bolded models are ours, while the rest is from Liu et al. (2017). The highest performance in each domain is highlighted. For our full MAN models, standard errors are shown in parentheses and statistical significance (p < 0.01) over ASP-MTL is indicated by *. Columns: books, elec., dvd, kitchen, apparel, camera, health, music, toys, video, baby, magaz., softw., sports, IMDb, MR, Avg.