Weighed Domain-Invariant Representation Learning for Cross-domain Sentiment Analysis

Cross-domain sentiment analysis is currently a hot topic in both research and industry. One of the most popular frameworks for the task is domain-invariant representation learning (DIRL), which aims to learn a distribution-invariant feature representation across domains. However, in this work, we find that applying DIRL may degrade domain adaptation performance when the label distribution P(Y) changes across domains. To address this problem, we propose a modification to DIRL, obtaining a novel weighted domain-invariant representation learning (WDIRL) framework. We show that it is easy to transfer existing models of the DIRL framework to the WDIRL framework. Empirical studies on extensive cross-domain sentiment analysis tasks verify our statements and show the effectiveness of our proposed solution.


Introduction
Sentiment analysis aims to predict the sentiment polarity of user-generated data with emotional orientation, such as movie reviews. The exponential increase of online reviews makes it an interesting topic in both research and industry. However, reviews can span many different domains, and collecting and preprocessing large amounts of data for new domains is often time-consuming and expensive. Therefore, cross-domain sentiment analysis, which aims to transfer knowledge from a label-rich source domain (S) to a label-scarce target domain (T), is currently a hot topic.
In recent years, one of the most popular frameworks for cross-domain sentiment analysis has been the domain-invariant representation learning (DIRL) framework (Glorot et al., 2011; Fernando et al., 2013; Ganin et al., 2016; Zellinger et al., 2017; Li et al., 2017). Methods of this framework follow the idea of extracting a domain-invariant feature representation, in which the data distributions of the source and target domains are similar. Based on the resulting representations, they learn the supervised classifier using the richly labeled source data. The main difference among these methods is the technique applied to force the feature representations to be domain-invariant.
However, in this work, we discover that applying DIRL may harm domain adaptation when the label distribution P(Y) shifts across domains. Specifically, let X and Y denote the input and label random variables, respectively, and let G(X) denote the feature representation of X. We find that when P(Y) changes across domains while P(X|Y) stays the same, forcing G(X) to be domain-invariant makes G(X) uninformative about Y. This, in turn, harms the generalization of the supervised classifier to the target domain. In addition, for the more general condition that both P(Y) and P(X|Y) shift across domains, we deduce a conflict between the objective of making the classification error small and that of making G(X) domain-invariant.
We argue that the problem is worth studying since the shift of P(Y) exists in many real-world cross-domain sentiment analysis tasks (Glorot et al., 2011). For example, the marginal distribution of the sentiment of a product can be affected by the overall social environment and change over time, and the marginal sentiment distributions of different products naturally differ. Moreover, many factors, such as the original data distribution, the data collection time, and the data cleaning method, can affect P(Y) of the collected unlabeled target domain dataset. Note that in real-world cross-domain tasks, we do not know the labels of the collected target domain data. Thus, we cannot align its label distribution P_T(Y) with that of the labeled source domain data P_S(Y) in advance, as done in many previous works (Glorot et al., 2011; Ganin et al., 2016; Tzeng et al., 2017; Li et al., 2017; He et al., 2018; Peng et al., 2018).
To address the problem of DIRL resulting from the shift of P(Y), we propose a modification to DIRL, obtaining a weighted domain-invariant representation learning (WDIRL) framework. This framework additionally introduces a class weight w to weigh source domain examples by class, with the aim of making P(Y) of the weighted source domain close to that of the target domain.
Based on w, it resolves domain shift in two steps. In the first step, it forces the marginal distribution P(X) to be domain-invariant between the target domain and the weighted source domain instead of the original source domain, obtaining a supervised classifier P_S(Y|X; Φ) and a class weight w. In the second step, it resolves the shift of P(Y|X) by adjusting P_S(Y|X; Φ) using w for label prediction in the target domain. We detail these two steps in §4. Moreover, we illustrate how to transfer existing DIRL models to their WDIRL counterparts, taking the representative metric-based CMD model (Zellinger et al., 2017) and the adversarial-learning-based DANN model (Ganin et al., 2016) as examples.
In summary, the contributions of this paper include: (i) We theoretically and empirically analyse the problem of DIRL for domain adaptation when the marginal distribution P(Y) shifts across domains. (ii) We propose a novel method to address the problem and show how to incorporate it into existing DIRL models. (iii) Experimental studies on extensive cross-domain sentiment analysis tasks show that models of our WDIRL framework can greatly outperform their DIRL counterparts.

Domain Adaptation
For consistency of exposition, in this work we consider domain adaptation in the unsupervised setting (however, we argue that our analysis and solution also apply to the supervised and semi-supervised domain adaptation settings). In the unsupervised domain adaptation setting, there are two different distributions over X × Y: the source domain P_S(X, Y) and the target domain P_T(X, Y). There is a labeled data set D_S drawn i.i.d. from P_S(X, Y) and an unlabeled data set D_T drawn i.i.d. from the marginal distribution P_T(X):

$$D_S = \{(x_i, y_i)\}_{i=1}^{n_S} \sim P_S(X, Y), \qquad D_T = \{x_j\}_{j=1}^{n_T} \sim P_T(X),$$

where n_S and n_T denote the numbers of source and target examples. The goal of domain adaptation is to build a classifier f : X → Y that performs well in the target domain using D_S and D_T. For this purpose, many approaches have been proposed from different views, such as instance reweighting (Mansour et al., 2009), pivot-based information passing (Blitzer et al., 2007), spectral feature alignment (Pan et al., 2010), subsampling (Chen et al., 2011), and of course domain-invariant representation learning (Pan et al., 2011; Gopalan et al., 2011; Long et al., 2013; Muandet et al., 2013; Yosinski et al., 2014; Long et al., 2015; Aljundi et al., 2015; Wei et al., 2016; Bousmalis et al., 2016; Pinheiro and Element, 2018; Zhao et al., 2018).

Domain Invariant Representation Learning
Domain-invariant representation learning (DIRL) is a very popular framework for performing domain adaptation in the cross-domain sentiment analysis field (Ghifary et al., 2014; Li et al., 2017; Chen et al., 2018; Peng et al., 2018). It is heavily motivated by the following theorem (Ben-David et al., 2007).
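Theorem 1. For a hypothesis h,

$$L_T(h) \le L_S(h) + d_1(P_S(X), P_T(X)) + \min\{E_{P_S}[|f_S(X) - f_T(X)|],\ E_{P_T}[|f_S(X) - f_T(X)|]\}. \tag{1}$$

Here, L_S(h) denotes the expected loss with hypothesis h in the source domain, L_T(h) denotes the counterpart in the target domain, d_1 is a measure of divergence between two distributions, and f_S and f_T denote the labeling functions of the source and target domains.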
Based on Theorem 1 and assuming that performing a feature transform on X will not increase the values of the first and third terms on the right side of Ineq. (1), methods of the DIRL framework apply a feature map G to X, hoping to obtain a feature representation G(X) that has a lower value of d_1(P_S(G(X)), P_T(G(X))). To this end, different methods have been proposed. These methods can be roughly divided into two directions. The first direction is to design a differentiable metric to explicitly evaluate the discrepancy between two distributions. We call methods of this direction metric-based DIRL methods. A representative work of this direction is the central-moment-based model proposed by Zellinger et al. (2017). In that work, they proposed the central moment discrepancy (CMD) metric to evaluate the discrepancy between two distributions. Specifically, let X_S and X_T denote M-dimensional random vectors on the compact interval [a, b]^M with distributions P_S and P_T, respectively. The CMD loss between P_S and P_T is defined by:

$$CMD_K(P_S, P_T) = \frac{1}{|b - a|}\,\|E(X_S) - E(X_T)\|_2 + \sum_{k=2}^{K} \frac{1}{|b - a|^k}\,\|C_k(X_S) - C_k(X_T)\|_2. \tag{2}$$

Here, E(X_S) denotes the expectation of X over distribution P_S(X), and

$$C_k(X) = \Big( E\Big( \prod_{i=1}^{M} (X_i - E(X_i))^{r_i} \Big) \Big)_{r_i \ge 0,\ \sum_{i=1}^{M} r_i = k}$$

is the k-th central moment, where X_i denotes the i-th dimensional variable of X.
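As a rough illustration of the central moment discrepancy described above, the following NumPy sketch computes an empirical CMD between two samples. It is a simplified version under stated assumptions, not the authors' implementation: it uses only the per-dimension (marginal) central moments rather than all mixed moments, and the function names `central_moment` and `cmd` are hypothetical.

```python
import numpy as np

def central_moment(x, k):
    """k-th order marginal central moment of each feature (k >= 2)."""
    return np.mean((x - x.mean(axis=0)) ** k, axis=0)

def cmd(x_s, x_t, K=5, a=0.0, b=1.0):
    """Empirical CMD between samples of shape (n, M) with values in [a, b].

    Assumption: only marginal central moments are matched.
    """
    span = abs(b - a)
    # First-order term: distance between the sample means.
    loss = np.linalg.norm(x_s.mean(axis=0) - x_t.mean(axis=0)) / span
    # Higher-order terms: distances between central moments of order 2..K.
    for k in range(2, K + 1):
        diff = central_moment(x_s, k) - central_moment(x_t, k)
        loss += np.linalg.norm(diff) / span ** k
    return loss

x = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.9]])
same = cmd(x, x)            # identical samples give zero discrepancy
shifted = cmd(x, x + 0.05)  # a mean shift gives a positive discrepancy
```

In a training setup, `x_s` and `x_t` would be mini-batches of G(X) from the two domains, and the returned value would play the role of the domain-invariance loss.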
The second direction is to perform adversarial training between the feature generator G and a domain discriminator D. We call methods of this direction adversarial-learning-based methods. As a representative, Ganin et al. (2016) trained D to distinguish the domain of a given example x based on its representation G(x). At the same time, they encouraged G to deceive D, i.e., to make D unable to distinguish the domain of x. More specifically, D was trained to minimize the loss:

$$L_d = -E_{x \sim P_S(X)}[\log D(G(x))] - E_{x \sim P_T(X)}[\log(1 - D(G(x)))] \tag{3}$$

over its trainable parameters, while in contrast G was trained to maximize L_d. According to the work of Goodfellow et al. (2014), this is equivalent to minimizing the Jensen-Shannon divergence (Amari et al., 1987; Lin, 1991) JSD(P_S, P_T) between P_S(G(X)) and P_T(G(X)) over G. Here, for conciseness, we write P as the shorthand for P(G(X)).
The task loss is the combination of the supervised learning loss L_sup and the domain-invariant learning loss L_inv, which are defined on D_S only and on the combination of D_S and D_T, respectively:

$$L = L_{sup}(D_S) + \alpha\, L_{inv}(D_S, D_T). \tag{4}$$

Here, α is a hyper-parameter for loss balance, and the aforementioned domain adversarial loss JSD(P_S, P_T) and CMD_K are two concrete forms of L_inv.

Problem of Domain-Invariant Representation Learning
In this work, we find that applying DIRL may harm domain adaptation when P(Y) shifts across domains. Specifically, when P_S(Y) differs from P_T(Y), forcing the feature representations G(X) to be domain-invariant may increase the value of L_S(h) in Ineq. (1) and consequently increase the value of L_T(h), which means a decrease in target domain performance.
In the following, we start our analysis under the condition that P_S(X|Y) = P_T(X|Y). Then, we consider the more general condition that P_S(X|Y) also differs from P_T(X|Y).
When P_S(X|Y) = P_T(X|Y), we have the following theorem.
Theorem 2. Given P_S(X|Y) = P_T(X|Y), if P_S(Y = i) ≠ P_T(Y = i) and a feature map G makes P_S(G(X)) = P_T(G(X)), then P_S(Y = i|G(X)) = P_S(Y = i).
Proof. The proof appears in Appendix A.
Remark. According to Theorem 2, when P_S(X|Y) = P_T(X|Y) and P_S(Y = i) ≠ P_T(Y = i), forcing G(X) to be domain-invariant tends to mix data of class i with data of other classes in the space of G(X). This makes it difficult for the supervised classifier to distinguish inputs of class i from inputs of the other classes. Consider the extreme case in which every instance x is mapped to a single point g_0 in G(X). In this case, P_S(G(X) = g_0) = P_T(G(X) = g_0) = 1. Therefore, G(X) is domain-invariant. As a result, the supervised classifier will assign the label y* = arg max_y P_S(Y = y) to all input examples. This is clearly unacceptable. To give a more intuitive illustration of the above analysis, we offer several empirical studies on Theorem 2 in Appendix B.
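The statement of Theorem 2 can be checked by brute force on a tiny discrete problem. The sketch below is only an illustration under made-up assumptions: four input values, two classes, a shared P(X|Y), and hypothetical priors 0.6/0.4 (source) and 0.75/0.25 (target). It enumerates every feature map G into two feature values, keeps the maps whose feature marginals agree across domains, and verifies that for those maps the source posterior given G(X) equals the source prior.

```python
from itertools import product

# Hypothetical toy setup: class 0 emits X in {0, 1}, class 1 emits X in {2, 3}.
p_x_given_y = {0: {0: 0.5, 1: 0.5}, 1: {2: 0.5, 3: 0.5}}  # shared by both domains
p_y_source = {0: 0.6, 1: 0.4}
p_y_target = {0: 0.75, 1: 0.25}

def feature_marginal(g, p_y):
    """P(G(X) = z) for z in {0, 1} under the label prior p_y."""
    out = [0.0, 0.0]
    for y, py in p_y.items():
        for x, px in p_x_given_y[y].items():
            out[g[x]] += py * px
    return out

def source_posterior(g, z):
    """P_S(Y = y | G(X) = z) for y in {0, 1}."""
    joint = [0.0, 0.0]
    for y, py in p_y_source.items():
        for x, px in p_x_given_y[y].items():
            if g[x] == z:
                joint[y] += py * px
    total = sum(joint)
    return [j / total for j in joint]

# Enumerate all maps G: {0,1,2,3} -> {0,1}; keep the domain-invariant ones.
invariant_maps = []
for g in product([0, 1], repeat=4):
    m_s = feature_marginal(g, p_y_source)
    m_t = feature_marginal(g, p_y_target)
    if all(abs(s - t) < 1e-12 for s, t in zip(m_s, m_t)):
        invariant_maps.append(g)

# For every invariant map, the feature carries no information about Y.
uninformative = all(
    abs(source_posterior(g, z)[0] - p_y_source[0]) < 1e-12
    for g in invariant_maps
    for z in set(g)
)
```

In this toy problem the class-separating map `(0, 0, 1, 1)` is not among the invariant maps, which matches the conflict discussed in the remark.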
When P_S(Y) ≠ P_T(Y) and P_S(X|Y) ≠ P_T(X|Y), we did not obtain a conclusion as strong as Theorem 2. Instead, we deduced a conflict between the objective of achieving superior classification performance and that of making features domain-invariant.
Suppose that P_S(Y = i) ≠ P_T(Y = i) and instances of class i are completely distinguishable from instances of the remaining classes in G(X), i.e.:

$$P_S(G(X) \mid Y = i)\, P_S(G(X) \mid Y \ne i) = 0, \qquad P_T(G(X) \mid Y = i)\, P_T(G(X) \mid Y \ne i) = 0.$$

In DIRL, we hope that:

$$P_S(G(X)) = P_T(G(X)).$$

Consider the region x ∈ X_i, where P(G(X = x)|Y = i) > 0. According to the above assumption, we know that P(G(X = x ∈ X_i)|Y ≠ i) = 0. Therefore, applying DIRL will force

$$P_S(G(X = x) \mid Y = i)\, P_S(Y = i) = P_T(G(X = x) \mid Y = i)\, P_T(Y = i), \quad \forall x \in X_i.$$

Taking the integral of x over X_i on both sides of the equation, we have P_S(Y = i) = P_T(Y = i). This contradicts the setting that P_S(Y = i) ≠ P_T(Y = i). Therefore, G(X) cannot be fully class-separable when it is domain-invariant. Note that the objective of supervised learning is exactly to make G(X) class-separable. Thus, this indicates a conflict between supervised learning and domain-invariant representation learning. Based on the above analysis, we conclude that, when P(Y) shifts across domains, it is impossible to obtain with the DIRL framework a feature representation G(X) that is both class-separable and domain-invariant. However, the shift of P(Y) can exist in many cross-domain sentiment analysis tasks. Therefore, this problem of DIRL is worth studying.

Weighted Domain Invariant Representation Learning
According to the above analysis, we propose a weighted version of DIRL to address the problem that the shift of P(Y) causes for DIRL. The key idea of this framework is to first align P(Y) across domains before performing domain-invariant learning, and then take the shift of P(Y) into account in the label prediction procedure. Specifically, it introduces a class weight w to weigh source domain examples by class. Based on the weighted source domain, the domain shift problem is resolved in two steps. In the first step, it applies DIRL to the target domain and the weighted source domain, aiming to alleviate the influence of the shift of P(Y) during the alignment of P(X|Y). In the second step, it uses w to reweigh the supervised classifier P_S(Y|X) obtained in the first step for target domain label prediction.
We detail these two steps in §4.1 and §4.2, respectively.

Align P(X|Y) with Class Weight
The motivation behind this practice is to adjust the data distribution of the source domain or the target domain to alleviate the shift of P(Y) across domains before applying DIRL. Since we only have labels for the source domain data, we choose to adjust the data distribution of the source domain. To this end, we introduce a trainable class weight w to reweigh source domain examples by class when performing DIRL, with w_i > 0. Specifically, we hope that:

$$w_i\, P_S(Y = i) = P_T(Y = i),$$

and we denote by w* the value of w that makes this equation hold. We shall see that when w = w*, DIRL aligns P_S(G(X)|Y) with P_T(G(X)|Y) without the shift of P(Y). According to our analysis, due to the shift of P(Y), there is a conflict between the training objectives of supervised learning L_sup and domain-invariant learning L_inv, and the degree of conflict decreases as P_S(Y) gets closer to P_T(Y). Therefore, during model training, w is expected to be optimized toward w*, since this makes P(Y) of the weighted source domain close to P_T(Y) and thus resolves the conflict.
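To make the role of the class weight concrete, the following sketch computes the ideal weight w* for a two-class example and checks that reweighting the source prior by it reproduces the target prior. The priors (0.5/0.5 for the source, 0.75/0.25 for the target) mirror the binary-class setup described in this paper; the function name `ideal_class_weight` is hypothetical, and in the actual framework w is learned because P_T(Y) is unknown.

```python
import numpy as np

def ideal_class_weight(p_s, p_t):
    """w*_i = P_T(Y = i) / P_S(Y = i), the weight that aligns the label priors."""
    return np.asarray(p_t, dtype=float) / np.asarray(p_s, dtype=float)

p_s = np.array([0.5, 0.5])    # source label prior
p_t = np.array([0.75, 0.25])  # target label prior (unknown in practice)

w_star = ideal_class_weight(p_s, p_t)
weighted_prior = w_star * p_s  # label prior of the weighted source domain

# Per-example weights are obtained by indexing w with the source labels.
y_source = np.array([0, 1, 1, 0])
example_weights = w_star[y_source]
```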
We now show how to transfer existing DIRL models to their WDIRL counterparts with the above idea. Let S : P → R denote a statistic function defined over a distribution P. For example, the expectation function E(X) in E(X_S) ≡ E(X)(P_S(X)) is a concrete instantiation of S. In general, to transfer models from DIRL to WDIRL, we should replace

$$S(P_S(X)) \quad \text{with} \quad \sum_{i} w_i\, P_S(Y = i)\, S(P_S(X \mid Y = i)).$$

Take the CMD metric as an example. In WDIRL, the revised form of CMD_K is defined by:

$$\widehat{CMD}_K(P_S, P_T) = \frac{1}{|b - a|}\,\Big\|\sum_{i} w_i P_S(Y = i)\, E(X_S \mid Y_S = i) - E(X_T)\Big\|_2 + \sum_{k=2}^{K} \frac{1}{|b - a|^k}\,\Big\|\sum_{i} w_i P_S(Y = i)\, C_k(X_S \mid Y_S = i) - C_k(X_T)\Big\|_2. \tag{5}$$

Here, E(X_S|Y_S = i) ≡ E(X)(P_S(X|Y = i)) denotes the expectation of X over distribution P_S(X|Y = i). Note that both P_S(Y = i) and E(X_S|Y_S = i) can be estimated using source labeled data, and E(X_T) can be estimated using target unlabeled data.
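A weighted CMD can be sketched by replacing each source statistic with a class-weighted sum of per-class statistics, with coefficients w_i P_S(Y = i) estimated from the source labels. The NumPy code below is a minimal sketch under assumptions: only marginal central moments are used, the names `class_stats` and `weighted_cmd` are hypothetical, and w is passed in as a fixed array rather than trained.

```python
import numpy as np

def class_stats(x, y, num_classes, k):
    """Per-class mean (k = 1) or k-th marginal central moment (k >= 2)."""
    stats = []
    for i in range(num_classes):
        xi = x[y == i]
        mu = xi.mean(axis=0)
        stats.append(mu if k == 1 else np.mean((xi - mu) ** k, axis=0))
    return np.stack(stats)  # shape (num_classes, M)

def weighted_cmd(x_s, y_s, x_t, w, K=5, a=0.0, b=1.0):
    """Class-weighted CMD between labeled source and unlabeled target samples."""
    w = np.asarray(w, dtype=float)
    num_classes = len(w)
    prior = np.bincount(y_s, minlength=num_classes) / len(y_s)  # P_S(Y = i)
    coef = w * prior                                            # w_i * P_S(Y = i)
    span = abs(b - a)
    mu_t = x_t.mean(axis=0)
    loss = 0.0
    for k in range(1, K + 1):
        src = coef @ class_stats(x_s, y_s, num_classes, k)
        tgt = mu_t if k == 1 else np.mean((x_t - mu_t) ** k, axis=0)
        loss += np.linalg.norm(src - tgt) / span ** k
    return loss

# Toy check: the target reuses the source points with a 3:1 class ratio.
x_s = np.array([[0.1, 0.2], [0.3, 0.4], [0.7, 0.8], [0.9, 0.6]])
y_s = np.array([0, 0, 1, 1])
x_t = np.concatenate([x_s[:2]] * 3 + [x_s[2:]])
first_order_weighted = weighted_cmd(x_s, y_s, x_t, [1.5, 0.5], K=1)
first_order_plain = weighted_cmd(x_s, y_s, x_t, [1.0, 1.0], K=1)
```

With the ideal weight, the first-order (mean) term vanishes in this toy case, whereas the unweighted version reports a discrepancy that is caused purely by the label shift.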
As for adversarial-learning-based DIRL methods, e.g., DANN (Ganin et al., 2016), the revised domain-invariant loss can be defined by:

$$\hat{L}_d = -\sum_{i} w_i\, P_S(Y = i)\, E_{x \sim P_S(X \mid Y = i)}[\log D(G(x))] - E_{x \sim P_T(X)}[\log(1 - D(G(x)))]. \tag{6}$$

During model training, D is optimized to minimize $\hat{L}_d$, while G and w are optimized to maximize $\hat{L}_d$. In the following, we denote by $\widehat{JSD}(P_S, P_T)$ the equivalent loss defined over G for the revised version of domain adversarial learning.
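A class-weighted discriminator loss can be sketched as a binary cross-entropy in which each source example is scaled by the weight of its class. The sketch below rests on assumptions that are not confirmed by the text: the discriminator output is taken to be the probability that an example comes from the source domain, and the name `weighted_domain_loss` and the `eps` smoothing term are hypothetical.

```python
import numpy as np

def weighted_domain_loss(d_src, y_src, d_tgt, w, eps=1e-12):
    """Class-weighted domain classification loss.

    d_src, d_tgt: discriminator outputs D(G(x)) in (0, 1) for source and
    target examples, read as P(domain = source) (assumed convention).
    y_src: integer class labels of the source examples.
    w: class weights, one per class.
    """
    w = np.asarray(w, dtype=float)
    # Averaging w[y] * log D over source examples estimates
    # sum_i w_i P_S(Y = i) E[log D(G(x)) | Y = i].
    src_term = -np.mean(w[y_src] * np.log(d_src + eps))
    tgt_term = -np.mean(np.log(1.0 - d_tgt + eps))
    return src_term + tgt_term

# An undecided discriminator (output 0.5 everywhere) yields 2 * ln 2
# whenever the weighted source prior sums to one.
d_half = np.full(4, 0.5)
loss_undecided = weighted_domain_loss(d_half, np.array([0, 0, 1, 1]), d_half, [1.5, 0.5])
```

The discriminator would minimize this quantity, while the feature generator and the class weight would be updated in the opposite direction, for example through a gradient reversal layer.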
The general task loss in WDIRL is defined by:

$$\hat{L} = L_{sup} + \alpha\, \hat{L}_{inv}, \tag{7}$$

where $\hat{L}_{inv}$ is a unified representation of the domain-invariant loss in WDIRL, such as $\widehat{CMD}_K$ and $\widehat{JSD}(P_S, P_T)$.

Align P(Y|X) with Class Weight
In the above step, we align P(X|Y) across domains by performing domain-invariant learning on the class-weighted source domain and the original target domain. In this step, we deal with the shift of P(Y). Suppose that we have successfully resolved the shift of P(X|Y) with G, i.e., P_S(G(X)|Y) = P_T(G(X)|Y). Then, according to the work of Chan and Ng (2005), we have:

$$P_T(Y = i \mid G(X)) = \frac{\gamma(Y = i)\, P_S(Y = i \mid G(X))}{\sum_{j} \gamma(Y = j)\, P_S(Y = j \mid G(X))}, \tag{8}$$

where γ(Y = i) = P_T(Y = i)/P_S(Y = i). Of course, in most real-world tasks, we do not know the value of γ(Y = i). However, note that γ(Y = i) is exactly the expected class weight w*_i. Therefore, a natural practice in this step is to estimate γ(Y = i) with the w_i obtained in the first step and estimate P_T(Y|G(X)) with:

$$P_T(Y = i \mid G(X)) = \frac{w_i\, P_S(Y = i \mid G(X))}{\sum_{j} w_j\, P_S(Y = j \mid G(X))}. \tag{9}$$

In summary, to transfer methods of the DIRL paradigm to WDIRL, we should: first revise the definition of L_inv, obtaining its corresponding WDIRL form $\hat{L}_{inv}$; then perform supervised learning and domain-invariant representation learning on D_S and D_T according to Eq. (7), obtaining a supervised classifier P_S(Y|X; Φ) and a class weight vector w; and finally, adjust P_S(Y|X; Φ) using w according to Eq. (9) to obtain the target domain classifier P_T(Y|X; Φ).
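The posterior adjustment in the second step amounts to scaling each class probability by its class weight and renormalizing. A minimal NumPy sketch, with a hypothetical function name `adjust_posterior` and made-up example posteriors:

```python
import numpy as np

def adjust_posterior(p_source, w):
    """Rescale source posteriors by class weights and renormalize each row.

    p_source: array of shape (n, L) holding P_S(Y = i | G(x)) per example.
    w: array of shape (L,) holding the learned class weights.
    """
    scaled = p_source * np.asarray(w, dtype=float)
    return scaled / scaled.sum(axis=1, keepdims=True)

p_source = np.array([[0.5, 0.5],
                     [0.2, 0.8]])
w = np.array([1.5, 0.5])
p_target = adjust_posterior(p_source, w)
# The first row becomes [0.75, 0.25]: an example the source classifier
# could not decide on is pushed toward the class that is more frequent
# in the target domain.
```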

Experiment Design
Through the experiments, we empirically studied our analysis of DIRL and the effectiveness of our proposed solution in dealing with the problem it suffers from. In addition, we studied the impact of each step described in §4.1 and §4.2 on our proposed solution. To perform the study, we compared the following models:
• SO: the source-only model trained using source domain labeled data without any domain adaptation.
• CMD: the central-moment-based domain adaptation model (Zellinger et al., 2017) of the original DIRL framework that implements L_inv with CMD_K.
• DANN: the adversarial-learning-based domain adaptation model (Ganin et al., 2016) of the original DIRL framework that implements L_inv with JSD(P_S, P_T).
• CMD†: the weighted version of the CMD model that only applies the first step (described in §4.1) of our proposed method.
• DANN†: the weighted version of the DANN model that only applies the first step of our proposed method.
• CMD††: the weighted version of the CMD model that applies both the first and second (described in §4.2) steps of our proposed method.
• DANN††: the weighted version of the DANN model that applies both the first and second steps of our proposed method.
• CMD*: a variant of CMD†† that assigns w* (estimated from target labeled data) to w and fixes this value during model training.
• DANN*: a variant of DANN†† that assigns w* to w and fixes this value during model training.
Intrinsically, SO provides an empirical lower bound for the domain adaptation methods, while CMD* and DANN* provide empirical upper bounds for CMD†† and DANN††, respectively. In addition, by comparing the performance of CMD* and DANN* with that of SO, we can assess the effectiveness of the DIRL framework when P(Y) does not shift across domains. By comparing CMD† with CMD, or DANN† with DANN, we can assess the effectiveness of the first step of our proposed method. By comparing CMD†† with CMD†, or DANN†† with DANN†, we can assess the impact of the second step of our proposed method. Finally, by comparing CMD†† with CMD, or DANN†† with DANN, we can assess the overall effectiveness of our proposed solution.

Dataset and Task Design
We conducted experiments on the Amazon reviews dataset (Blitzer et al., 2007), which is a benchmark dataset in the cross-domain sentiment analysis field. This dataset contains Amazon product reviews of four different product domains: Books (B), DVD (D), Electronics (E), and Kitchen (K) appliances. Each review is originally associated with a rating of 1-5 stars and is encoded as a 5,000-dimensional feature vector of bag-of-words unigrams and bigrams.
Binary-Class. From this dataset, we constructed 12 binary-class cross-domain sentiment analysis tasks: B→D, B→E, B→K, D→B, D→E, D→K, E→B, E→D, E→K, K→B, K→D, K→E. Following the setting of previous works, we treated a review as class '1' if it was rated up to 3 stars, and as class '2' if it was rated 4 or 5 stars.

Implementation Detail
For all studied models, we implemented G and f using the same architectures as those in Zellinger et al. (2017). For the DANN-based methods (i.e., DANN, DANN†, DANN††, and DANN*), we implemented the discriminator D using a 50-dimensional hidden layer with ReLU activation functions and a linear classification layer. The hyper-parameter K of CMD_K and $\widehat{CMD}_K$ was set to 5, as suggested by Zellinger et al. (2017). Model optimization was performed using RMSProp (Tieleman and Hinton, 2012). The learning rate of w was set to 0.01, while that of the other parameters was set to 0.005 for all tasks. The hyper-parameter α was set to 1 for all of the tested models. We searched for this value in the range α ∈ {1, ..., 10} on task B→K. Within the search, the label distribution was set to be uniform, i.e., P(Y = i) = 1/L, for both domains B and K. We chose the value that maximized the performance of CMD on the testing data of domain K. Note that this practice conflicts with the setting of unsupervised domain adaptation, in which we do not have labeled target domain data for training or development. However, we argue that this practice does not make the model comparison unfair, since all of the tested models shared the same value of α and α was not directly fine-tuned on any tested task. With the same consideration, for every tested model, we reported its best performance achieved on the testing data of the target domain during training.¹
To initialize w, we used the label predictions of the source-only model. Specifically, let P_SO(Y|X; θ_SO) denote the trained source-only model. We initialized w_i by:

$$w_i^0 = \frac{1}{P_S(Y = i)} \cdot \frac{1}{|D_T|} \sum_{x \in D_T} \mathbb{I}\Big(\arg\max_{y} P_{SO}(Y = y \mid x; \theta_{SO}) = i\Big).$$

Here, I denotes the indicator function. To offer an intuitive understanding of this strategy, we report the performance of WCMD†† over different initializations of w on 2 within-group (B→D, E→K) and 2 cross-group (B→K, D→E) binary-class domain adaptation tasks in Figure 1. Here, we say that domains B and D form one group and domains E and K form another, since B and D are similar, as are E and K, but the two groups are different from one another (Blitzer et al., 2007). Note that P_S(Y = 1) = 0.5 is a constant, which is estimated using source labeled data. From the figure, we can make three main observations. First, WCMD†† generally outperformed its CMD counterpart across different initializations of w. Second, it was better to initialize w with a relatively balanced value, i.e., w_i P_S(Y = i) → 1/L (in this experiment, L = 2). Finally, w^0 was often a good initialization of w, indicating the effectiveness of the above strategy.

¹ Please refer to the attached source code in the appendix for more implementation detail of this work.
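One plausible reading of the initialization strategy is: predict labels for the unlabeled target data with the source-only model, use the predicted class frequencies as a rough estimate of P_T(Y), and divide by the source class frequencies. The sketch below follows that reading; it is an assumption rather than the authors' code, and the function name `init_class_weight` is hypothetical.

```python
import numpy as np

def init_class_weight(so_probs_target, y_source, num_classes):
    """Initial class weight from source-only predictions on target data.

    so_probs_target: array (n_T, L) of source-only posteriors on D_T.
    y_source: integer labels of the source examples.
    """
    pred = so_probs_target.argmax(axis=1)
    # Fraction of target examples predicted as each class.
    p_t_hat = np.bincount(pred, minlength=num_classes) / len(pred)
    # Empirical source label prior.
    p_s = np.bincount(y_source, minlength=num_classes) / len(y_source)
    return p_t_hat / p_s

# Made-up posteriors: three target examples predicted as class 0, one as class 1.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
w0 = init_class_weight(probs, np.array([0, 0, 1, 1]), num_classes=2)
```

If a class is never predicted on the target data, this estimate gives a zero weight, which violates w_i > 0; a small floor on the estimated frequencies would be one way to handle that case.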

Main Result
Table 1 shows model performance on the 12 binary-class cross-domain tasks. From this table, we can make the following observations. First, CMD and DANN underperformed the source-only model (SO) on all of the 12 tested tasks, indicating that DIRL in the studied situation degrades domain adaptation performance rather than improving it. This observation confirms our analysis.
Second, CMD†† consistently outperformed CMD and SO. This observation shows the effectiveness of our proposed method in addressing the problem of the DIRL framework in the studied situation. A similar conclusion can be drawn by comparing the performance of DANN†† with that of DANN and SO. Third, CMD† and DANN† consistently outperformed CMD and DANN, respectively, which shows the effectiveness of the first step of our proposed method. Finally, on most of the tested tasks, CMD†† and DANN†† outperformed CMD† and DANN†, respectively.
Figure 2 depicts the relative improvement, e.g., (Acc(CMD) − Acc(SO))/Acc(SO), of the domain adaptation methods over the SO baseline under different degrees of P(Y) shift, on two binary-class domain adaptation tasks (see Appendix C for results of the other models on other tasks). From the figure, we can see that the performance of CMD generally degraded as the P(Y) shift increased. In contrast, our proposed model CMD†† was robust to varying degrees of P(Y) shift. Moreover, it achieved performance close to the upper bound characterized by CMD*. This again verifies the effectiveness of our solution.
Table 2 reports model performance on the 2 within-group (B→D, E→K) and the 2 cross-group (B→K, D→E) multi-class domain adaptation tasks (see Appendix D for results on the other tasks). From this table, we observe that on some tested tasks, CMD†† and DANN†† did not greatly outperform, or even slightly underperformed, CMD† and DANN†, respectively. A possible explanation of this phenomenon is that the distribution of D_T also differs from that of the target domain testing dataset. Therefore, the value of w estimated or learned using D_T is not fully suitable for application to the testing dataset. This explanation is supported by the observation that CMD† and DANN† also slightly outperformed CMD* and DANN*, respectively, on these tasks.

Conclusion
In this paper, we studied the problem of the popular domain-invariant representation learning (DIRL) framework for domain adaptation when P(Y) changes across domains. To address the problem, we proposed a weighted version of DIRL (WDIRL). We showed that existing methods of the DIRL framework can be easily transferred to our WDIRL framework. Extensive experimental studies on benchmark cross-domain sentiment analysis datasets verified our analysis and showed the effectiveness of our proposed solution.
For each task, D_S consisted of 1,000 examples of each class, and D_T consisted of 1,500 examples of class '1' and 500 examples of class '2'. In addition, since it is reasonable to assume that D_T can reveal the distribution of target domain data, we controlled the target domain testing dataset to have the same class ratio as D_T. Using the same label assignment mechanism, we also studied model performance over different degrees of P(Y) shift, measured by the maximum value of P_S(Y = i)/P_T(Y = i) over i = 1, ..., L. Please refer to Appendix C for more detail about the task design for this study.
Multi-Class. We additionally constructed 12 multi-class cross-domain sentiment classification tasks. Tasks were designed to distinguish reviews of 1 or 2 stars (class 1) from those of 4 stars (class 2) and those of 5 stars (class 3). For each task, D_S contained 1,000 examples of each class, and D_T consisted of 500 examples of class 1, 1,500 examples of class 2, and 1,000 examples of class 3. Similarly, we also controlled the target domain testing dataset to have the same class ratio as D_T.
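The degree of P(Y) shift described above, the largest ratio P_S(Y = i)/P_T(Y = i) over classes, can be computed directly from the class counts. A small sketch using the class counts stated for the binary-class and multi-class tasks; the function name `label_shift_degree` is hypothetical.

```python
def label_shift_degree(p_s, p_t):
    """Largest ratio P_S(Y = i) / P_T(Y = i) over all classes."""
    return max(ps / pt for ps, pt in zip(p_s, p_t))

# Binary-class tasks: 1,000 + 1,000 source examples, 1,500 + 500 target examples.
binary_p_s = [1000 / 2000, 1000 / 2000]
binary_p_t = [1500 / 2000, 500 / 2000]
binary_degree = label_shift_degree(binary_p_s, binary_p_t)  # 2.0

# Multi-class tasks: 1,000 per class in the source; 500, 1,500, 1,000 in the target.
multi_p_s = [1000 / 3000] * 3
multi_p_t = [500 / 3000, 1500 / 3000, 1000 / 3000]
multi_degree = label_shift_degree(multi_p_s, multi_p_t)
```

A degree of 1 corresponds to identical label distributions, and larger values indicate that some class is increasingly under-represented in the target domain relative to the source.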

Figure 1: Mean accuracy of WCMD†† over different initializations of w. The empirical optimum value of w makes w_1 P_S(Y = 1) = 0.75. The dotted line in the same color denotes the performance of the CMD model, and 'w^0' marks the performance of WCMD†† when initializing w with w^0.

Figure 2: Relative improvement over the SO baseline under different degrees of P(Y) shift on the B→D and B→K binary-class domain adaptation tasks.