Bi-Transferring Deep Neural Networks for Domain Adaptation

Sentiment classification aims to automatically predict the sentiment polarity (e.g., positive or negative) of user-generated sentiment data (e.g., reviews, blogs). Due to the mismatch between different domains, a sentiment classifier trained in one domain may not work well when directly applied to other domains. Thus, domain adaptation algorithms for sentiment classification are highly desirable for reducing the domain discrepancy and manual labeling costs. To address this challenge, we propose a novel domain adaptation method called Bi-Transferring Deep Neural Networks (BTDNNs). The proposed BTDNNs attempt to transfer the source domain examples to the target domain, and also to transfer the target domain examples to the source domain. The linear transformation of BTDNNs ensures the feasibility of transferring between domains, and the distribution consistency between the transferred domain and the desirable domain is constrained with a linear data reconstruction manner. As a result, the transferred source domain retains its supervision and follows a similar distribution as the target domain. Therefore, any supervised method can be used on the transferred source domain to train a classifier for sentiment classification in the target domain. We conduct experiments on a benchmark composed of reviews of 4 types of Amazon products. Experimental results show that our proposed approach significantly outperforms several baseline methods, and achieves an accuracy competitive with the state-of-the-art method for cross-domain sentiment classification.


Introduction
With the rise of social media (e.g., blogs and social networks), more and more user-generated sentiment data have been shared on the Web (Pang et al., 2002; Pang and Lee, 2008; Liu, 2012; Zhou et al., 2011). They exist in the form of user reviews on shopping or opinion sites, in blog posts, questions, or customer feedback. This has created a surge of research in sentiment classification (or sentiment analysis), which aims to automatically determine the sentiment polarity (e.g., positive or negative) of user-generated sentiment data (e.g., reviews, blogs, questions).
Machine learning algorithms have proved promising and are widely used for sentiment classification (Pang et al., 2002; Pang and Lee, 2008; Liu, 2012). However, the performance of these models relies on manually labeled training data. In many practical cases, we may have plentiful labeled data in the source domain, but very few or no labeled data in the target domain, which has a different data distribution. For example, we may have many labeled book reviews, but we are interested in detecting the polarity of electronics reviews. Reviews for different products might have different vocabularies, so classifiers trained on one domain often fail to produce satisfactory results when applied to another domain. This has motivated much research on cross-domain (domain adaptation) sentiment classification, which transfers knowledge from the source domain to the target domain (Thomas et al., 2006; Snyder and Barzilay, 2007; Blitzer et al., 2007; Daume III, 2007; Li and Zong, 2008; Li et al., 2009; Pan et al., 2010; Kumar et al., 2010; Glorot et al., 2011; Chen et al., 2011a; Li et al., 2012; Xia et al., 2013a; Li et al., 2013; Zhou et al., 2015a; Zhuang et al., 2015).
Depending on whether labeled data are available for the target domain, cross-domain sentiment classification can be divided into two categories: supervised domain adaptation and unsupervised domain adaptation. In the supervised scenario, labeled data are available in the target domain but their number is usually too small to train a good sentiment classifier; in the unsupervised scenario, only unlabeled data are available in the target domain, which is more challenging. This work focuses on the unsupervised domain adaptation problem, the essence of which is how to employ the unlabeled data of the target domain to guide the model learned from the labeled source domain.
The fundamental challenge of cross-domain sentiment classification lies in the fact that the source domain and the target domain have different data distributions. Recent work has investigated several techniques for alleviating the domain discrepancy: instance-weight adaptation (Huang et al., 2007; Jiang and Zhai, 2007; Li and Zong, 2008; Mansour et al., 2009; Dredze et al., 2010; Chen et al., 2011b; Chen et al., 2011a; Li et al., 2013; Xia et al., 2013a) and feature representation adaptation (Thomas et al., 2006; Snyder and Barzilay, 2007; Blitzer et al., 2007; Li et al., 2009; Pan et al., 2010; Zhou et al., 2015a; Zhuang et al., 2015). Methods of the first kind assume that some training data in the source domain are very useful for the target domain and can be used to train models for the target domain after re-weighting. In contrast, feature representation approaches attempt to develop an adaptive feature representation that is effective in reducing the difference between domains.
Recently, some efforts have been initiated on learning robust feature representations with deep neural networks (DNNs) in the context of cross-domain sentiment classification (Glorot et al., 2011). Glorot et al. (2011) proposed to learn robust feature representations with stacked denoising auto-encoders (SDAs) (Vincent et al., 2008). Denoising auto-encoders are one-layer neural networks that are optimized to reconstruct input data from partial and random corruption. These denoisers can be stacked into deep learning architectures, and the outputs of their intermediate layers are then used as input features for SVMs (Fan et al., 2008). The marginalized SDA (mSDA) was later proposed to address two crucial limitations of SDAs: high computational cost and lack of scalability to high-dimensional features. In this paper, we propose Bi-Transferring Deep Neural Networks (BTDNNs), which transfer examples between the source and target domains, as illustrated in Figure 1. In BTDNNs, the linear transformation ensures the feasibility of transferring between domains, and the linear data reconstruction manner ensures the distribution consistency between the transferred domain and the desirable domain. Specifically, our BTDNNs have one common encoder f_c and two decoders g_s and g_t, which map an example to the source domain and the target domain respectively. As a result, the source domain can be transferred to the target domain along with its sentiment labels, and any supervised method can be used on the transferred source domain to train a classifier for sentiment classification in the target domain, as the transferred source domain data share a similar distribution with the target domain. Experimental results show that the proposed approach significantly outperforms several baselines, and achieves an accuracy competitive with the state-of-the-art method for cross-domain sentiment classification.
The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes our proposed bi-transferring deep neural networks (BTDNNs). Section 4 presents the experimental results. In Section 5, we conclude with ideas for future research.

Related Work
Domain adaptation aims to generalize a classifier that is trained on a source domain, for which plenty of training data is typically available, to a target domain, for which labeled data is scarce. Cross-domain generalization is important in many real applications; the key challenge is that data in the source and target domains are often distributed differently.
Recent work has investigated several techniques for alleviating this difference in the context of the cross-domain sentiment classification task. Blitzer et al. (2007) proposed a structural correspondence learning (SCL) algorithm to train a cross-domain sentiment classifier. SCL is motivated by a multi-task learning algorithm, alternating structural optimization (ASO), proposed by Ando and Zhang (2005). Given labeled data from a source domain and unlabeled data from both the source and target domains, SCL attempts to model the relationship between "pivot features" and "non-pivot features". Pan et al. (2010) proposed a spectral feature alignment (SFA) algorithm to align the domain-specific words from the source and target domains into meaningful clusters, with the help of domain-independent words as a bridge. In this way, the clusters can be used to reduce the gap between the domain-specific words of the two domains. Dredze et al. (2010) combined classifier weights using confidence-weighted learning, which represents the covariance of the weight vectors. Xia et al. (2013a) proposed an instance selection and instance weighting method for cross-domain sentiment classification. After that, Xia et al. (2013b) proposed a feature ensemble plus sample selection method to further improve sentiment classification adaptation. Zhou et al. (2015b) proposed to bridge the domain gap with the help of topical correspondence. Li et al. (2009) proposed to transfer common lexical knowledge across domains via matrix factorization techniques. Zhou et al. (2015a) further improved the matrix factorization techniques via a regularization term on the pivots and domain-specific words, ensuring that the pivots capture only correspondence aspects and the domain-specific words capture only individual aspects. Li and Zong (2008) proposed the multi-label consensus training (MCT) approach, which combined several base classifiers trained with SCL.
Other work has proposed a domain adaptation algorithm based on sample and feature selection. Li et al. (2013) proposed an active learning algorithm for cross-domain sentiment classification. The online active domain adaptation problem has also been investigated in a novel but practical setting where labels can be acquired at a lower cost in the source domain than in the target domain.
There has also been research exploring careful structuring of features or prior knowledge for domain adaptation. Daumé III (2007) proposed a kernel-mapping function which maps data from both the source and target domains to a high-dimensional feature space, so that data points from the same domain are twice as similar as those from different domains. Dai et al. (2008) also worked along this line, and Xiao and Guo (2015) proposed to learn distributed state representations for cross-domain sequence prediction.
Recently, some efforts have been initiated on learning robust feature representations with deep neural networks (DNNs) for cross-domain natural language processing. Glorot et al. (2011) and subsequent work proposed to use deep learning for cross-domain sentiment classification. Most recently, Yang and Eisenstein (2014) proposed an unsupervised domain adaptation method with marginalized structured dropout. Furthermore, Yang and Eisenstein (2015) proposed to use feature embeddings with metadata domain attributes for multi-domain adaptation. In this paper, our proposed approach BTDNNs tackles the domain discrepancy with a linear data reconstruction manner, which can effectively model the domain-specific features as well as the commonality of domains. Deep learning techniques have also been applied to heterogeneous transfer learning (Socher et al., 2013; Kan et al., 2015; Long et al., 2015), where knowledge is transferred from one modality to another based on the correspondences at hand. Our proposed framework can be considered a more general case, where the bias of the correspondences between the source and target domains is constrained with a linear data reconstruction manner.
Other researchers have also explored DNNs for sentiment analysis (Socher et al., 2011; Tang et al., 2014; Tang et al., 2015; Zhai and Zhang, 2016; Chandar et al., 2014). However, all these methods focus on sentiment analysis without considering the domain discrepancy. In this paper, we focus on domain adaptation for sentiment classification, with a different model formulation and task definition.

Problem Definition
Given two domains X_s and X_t, where X_s and X_t refer to a source domain and a target domain, respectively. Suppose we have a set of labeled sentiment examples as well as some unlabeled examples in the source domain X_s with size n_s, containing terms from a vocabulary V of size m. The examples in the source domain can be represented as a term-document matrix X_s = [x^s_1, · · · , x^s_{n_s}] ∈ R^{m×n_s}, with sentiment labels y_s = {y^s_1, · · · , y^s_{n_s}}, where x^s_i ∈ R^m is the feature representation of the i-th source domain example, each entry holding the tf-idf weight of the corresponding term, and y^s_i ∈ {+1, −1} is its sentiment label. (We use upper-case and lower-case characters to represent matrices and vectors, respectively, throughout the paper.) Similarly, suppose we have a set of unlabeled examples in the target domain X_t with size n_t, containing terms from the same vocabulary V of size m. The examples in the target domain can also be represented as a term-document matrix X_t = [x^t_1, · · · , x^t_{n_t}] ∈ R^{m×n_t}, where each entry is the tf-idf weight of the corresponding term. The task of cross-domain sentiment classification is to learn a robust classifier to predict the polarity
of unseen examples from X_t. Note that we only consider one source domain and one target domain in this paper. However, our proposed algorithm is a general framework and can easily be adapted to multi-domain problems.
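As an illustration of the term-document representation described above, the following minimal NumPy sketch builds a small tf-idf matrix. The vocabulary, documents, and the particular smoothed idf and L2 column normalization are illustrative assumptions, not necessarily the exact weighting used in the paper.

```python
import numpy as np

def tfidf_matrix(docs, vocab):
    """Return an m x n term-document matrix with tf-idf weights."""
    m, n = len(vocab), len(docs)
    index = {term: i for i, term in enumerate(vocab)}
    tf = np.zeros((m, n))
    for j, doc in enumerate(docs):
        for term in doc:
            if term in index:
                tf[index[term], j] += 1.0
    df = np.count_nonzero(tf, axis=1)            # document frequency per term
    idf = np.log((1.0 + n) / (1.0 + df)) + 1.0   # one common smoothed idf variant
    X = tf * idf[:, None]
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0
    return X / norms                             # L2-normalised columns

# Hypothetical toy documents from a "source" domain.
docs_s = [["good", "book"], ["boring", "book"]]
X_s = tfidf_matrix(docs_s, vocab=["good", "boring", "book"])
print(X_s.shape)  # (3, 2): m terms x n_s documents
```

Each column of the returned matrix corresponds to one example x^s_i, matching the notation above.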

Basic Auto-Encoder
An auto-encoder is an unsupervised neural network which is trained to reconstruct a given input vector from its latent representation (Bengio et al., 2007). It can be seen as a special neural network with three layers: the input layer, the latent layer, and the reconstruction layer. An auto-encoder contains two parts: an encoder and a decoder. The encoder, denoted as f, maps an input vector x ∈ R^{m×1} to the latent representation z ∈ R^{k×1}, where k is the number of neurons in the latent layer. Usually, f is a nonlinear function as follows:

z = f(x) = s_e(W x + b)        (1)

where s_e is the activation function of the encoder, which is usually non-linear (e.g., the sigmoid or tanh function), W ∈ R^{k×m} is a linear transformation parameter, and b ∈ R^{k×1} is the bias. The decoder, denoted as g, maps the latent representation z back to a reconstruction:

x̂ = g(z) = s_d(W′ z + b′)        (2)

Similarly, s_d is the activation function of the decoder with parameters {W′, b′}.
The training objective is to determine the parameters {W, b} and {W′, b′} that minimize the average reconstruction error:

min_{W, b, W′, b′}  (1/N) Σ_{i=1}^{N} ||x_i − g(f(x_i))||²_2        (3)

where x_i is the i-th of the N training examples. The parameters {W, b} and {W′, b′} can be optimized by stochastic or mini-batch gradient descent. By minimizing the reconstruction error, we require that the latent features be able to reconstruct the original input as much as possible.
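A minimal NumPy sketch of the encoder, decoder, and average reconstruction error follows. The dimensions, sigmoid activations, and random data are illustrative assumptions; gradient-based training of {W, b} and {W′, b′} is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, N = 8, 3, 5                      # input dim, latent dim, #examples

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Encoder parameters {W, b} and decoder parameters {W', b'}.
W,  b  = rng.normal(size=(k, m)) * 0.1, np.zeros((k, 1))
Wp, bp = rng.normal(size=(m, k)) * 0.1, np.zeros((m, 1))

def f(x):                              # encoder: z = s_e(W x + b)
    return sigmoid(W @ x + b)

def g(z):                              # decoder: x_hat = s_d(W' z + b')
    return sigmoid(Wp @ z + bp)

X = rng.random((m, N))                 # N training examples as columns
loss = np.mean(np.sum((X - g(f(X)))**2, axis=0))  # average reconstruction error
print(round(float(loss), 4))
```

In practice the loss would be driven down by stochastic or mini-batch gradient descent, as stated above.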

Bi-Transferring Deep Neural Networks
The traditional auto-encoder in Subsection 3.2 attempts to reconstruct the input itself, and is usually used for feature representation learning. In contrast, our proposed bi-transferring deep neural networks (BTDNNs) attempt to transfer examples between domains to deal with the domain discrepancy, inspired by DNNs in computer vision (Kan et al., 2015). Motivated by this successful application (Kan et al., 2015), we construct the architecture of BTDNNs with one common encoder f_c and two decoders g_s and g_t, shown in Figure 1, which transform an input example to the source domain and the target domain respectively.² Specifically, the encoder f_c maps an input example x into the latent feature representation z, which is common to both the source and target domains:

z = f_c(x) = s_e(W_c x + b_c)        (4)

The decoder g_s maps the latent representation to the source domain, and the decoder g_t maps the latent representation to the target domain:

x̂_s = g_s(z) = s_d(W_s z + b_s)        (5)

x̂_t = g_t(z) = s_d(W_t z + b_t)        (6)

where s_e(·) and s_d(·) are element-wise nonlinear activation functions, e.g., the sigmoid or tanh function; W_c and b_c are the parameters of the encoder f_c, W_s and b_s are the parameters of the decoder g_s, and W_t and b_t are the parameters of the decoder g_t. Following the literature (Kan et al., 2015), we map the source domain examples X_s to the source domain (i.e., X_s itself) with the encoder f_c and the decoder g_s. Similarly, given the encoder f_c and the decoder g_t, we aim to map the source domain examples X_s to the target domain. Although it is unknown what the mapped examples look like, they are expected to follow a similar distribution as the target domain. This kind of distribution consistency between two domains can be characterized from the perspective of a linear data reconstruction manner.
The two domains X_s and X_t can generally be reconstructed from each other, and their distances can be used to measure the domain discrepancy. Following the literature (He et al., 2012), BTDNNs represent a transferred source domain example g_t(f_c(x^s_i)) with a linear reconstruction function from the target domain:

g_t(f_c(x^s_i)) ≈ X_t β^t_i        (7)

² In the implementation, we use stacked denoising auto-encoders (SDA) (Vincent et al., 2008) to model the source and the target domain data.
where β^t_i ∈ R^{n_t} is the coefficient vector for the reconstruction of the transferred source domain example. Equation (7) enforces that each example of the transferred domain is consistent with the target domain, which ensures that the transferred source domain follows a similar distribution as the target domain. Collecting the coefficients as B_t = [β^t_1, β^t_2, · · · , β^t_{n_s}] ∈ R^{n_t×n_s} for the transferred source domain, and symmetrically B_s = [β^s_1, β^s_2, · · · , β^s_{n_t}] ∈ R^{n_s×n_t} for the transferred target domain, the overall objective of BTDNNs can be formulated as follows:

min_{f_c, g_s, g_t, B_s, B_t}  ||X_s − g_s(f_c(X_s))||²_2 + ||g_t(f_c(X_s)) − X_t B_t||²_2
                             + ||X_t − g_t(f_c(X_t))||²_2 + ||g_s(f_c(X_t)) − X_s B_s||²_2
                             + γ (||B_s||²_2 + ||B_t||²_2)        (8)

where γ is a regularization parameter controlling the amount of shrinkage. With the optimization of equation (8), our proposed approach BTDNNs can map any input example to the source and target domains respectively. In particular, the source domain examples X_s can be transferred to the target domain along with their sentiment labels. The transferred source domain data g_t(f_c(X_s)) share a similar distribution with the target domain, so any supervised method can be used to learn a classifier for sentiment classification in the target domain. In this paper, a linear support vector machine (SVM) (Fan et al., 2008) is employed for building the sentiment classification models.
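To make the objective concrete, the following NumPy sketch evaluates the four reconstruction terms of equation (8) together with the γ-regularization on B_s and B_t. This is our illustration, not the authors' implementation; all shapes, the sigmoid activations, and the random data are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, ns, nt = 6, 4, 10, 8             # vocab size, latent dim, domain sizes
gamma = 0.1                            # regularization parameter

sig = lambda a: 1.0 / (1.0 + np.exp(-a))
Wc, bc = rng.normal(size=(k, m)) * 0.1, np.zeros((k, 1))
Ws, bs = rng.normal(size=(m, k)) * 0.1, np.zeros((m, 1))
Wt, bt = rng.normal(size=(m, k)) * 0.1, np.zeros((m, 1))

f_c = lambda X: sig(Wc @ X + bc)       # shared encoder
g_s = lambda Z: sig(Ws @ Z + bs)       # decoder onto the source domain
g_t = lambda Z: sig(Wt @ Z + bt)       # decoder onto the target domain

Xs, Xt = rng.random((m, ns)), rng.random((m, nt))
Bt = rng.random((nt, ns))              # reconstructs g_t(f_c(Xs)) from Xt
Bs = rng.random((ns, nt))              # reconstructs g_s(f_c(Xt)) from Xs

sq = lambda A: float(np.sum(A**2))     # squared Frobenius norm
J = (sq(Xs - g_s(f_c(Xs))) + sq(g_t(f_c(Xs)) - Xt @ Bt)
     + sq(Xt - g_t(f_c(Xt))) + sq(g_s(f_c(Xt)) - Xs @ Bs)
     + gamma * (sq(Bs) + sq(Bt)))      # the objective of equation (8)
print(J > 0)
```

The first and third terms keep each domain's self-reconstruction faithful, while the second and fourth terms tie the transferred examples to linear reconstructions from the opposite domain.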

Learning Algorithm
Note that the optimization problem in equation (8) is not convex in the variables {f_c, g_s, g_t, B_s, B_t} jointly. However, when considering one variable at a time, the cost function turns out to be convex. For example, given {g_s, g_t, B_s, B_t}, the cost function is convex w.r.t. f_c. Therefore, although we cannot expect to reach a global minimum of the above problem, we can develop a simple and efficient optimization algorithm via alternating iterations.
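The alternating scheme described above can be illustrated on a toy bi-convex problem (our illustration, not the BTDNNs updates themselves): fitting X ≈ U V by fixing one factor and solving exactly for the other. Because each step minimizes a convex subproblem exactly, the loss is monotonically non-increasing.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((6, 5))                 # toy data matrix
U = rng.random((6, 2))                 # factor 1 (randomly initialized)
V = rng.random((2, 5))                 # factor 2 (randomly initialized)

losses = []
for _ in range(20):
    # Fix U, solve the least squares problem for V (convex in V).
    V = np.linalg.lstsq(U, X, rcond=None)[0]
    # Fix V, solve the least squares problem for U (convex in U).
    U = np.linalg.lstsq(V.T, X.T, rcond=None)[0].T
    losses.append(float(np.sum((X - U @ V) ** 2)))

# Each exact subproblem solve can only decrease the objective.
assert all(a >= b - 1e-9 for a, b in zip(losses, losses[1:]))
```

The BTDNNs algorithm follows the same pattern, cycling through {f_c, g_s, g_t} and {B_s, B_t} instead of U and V.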

Optimize B_s and B_t
When {f_c, g_s, g_t} are fixed, the objective function in equation (8) reduces to:

min_{B_s, B_t}  ||G_s − X_s B_s||²_2 + ||G_t − X_t B_t||²_2 + γ (||B_s||²_2 + ||B_t||²_2)        (9)

where g_s(f_c(X_t)) = G_s = [g^s_1, · · · , g^s_{n_t}] and g_t(f_c(X_s)) = G_t = [g^t_1, · · · , g^t_{n_s}]. Since the subproblems for B_s and B_t are independent of each other, they can be optimized independently. The optimization of B_s with the other variables fixed is a least squares problem with ℓ2-regularization. It can further be decomposed into n_t optimization problems, each corresponding to one β^s_j, which can be solved in parallel:

min_{β^s_j}  ||g^s_j − X_s β^s_j||²_2 + γ ||β^s_j||²_2        (10)

for j = 1, 2, · · · , n_t. Each is a standard ℓ2-regularized least squares problem with the closed-form solution:

β^s_j = (X_s^T X_s + γI)^{−1} X_s^T g^s_j        (11)

where I is the identity matrix. Similarly, the optimization of B_t can be decomposed into n_s ℓ2-regularized least squares problems, with each solution given by:

β^t_i = (X_t^T X_t + γI)^{−1} X_t^T g^t_i        (12)

for i = 1, 2, · · · , n_s. We repeat the above updates until f_c, g_s, g_t, B_s and B_t converge or a maximum number of iterations is exceeded.
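The closed-form ridge solution of equation (11) can be sketched as follows (an illustrative NumPy fragment with synthetic data, not the paper's code). As a sanity check, the computed β^s_j must zero the gradient of the regularized objective in equation (10).

```python
import numpy as np

rng = np.random.default_rng(3)
m, ns = 6, 10                          # vocab size, #source examples
gamma = 0.5                            # regularization parameter
Xs = rng.random((m, ns))               # source term-document matrix
g_j = rng.random((m, 1))               # one column g^s_j of G_s

# Equation (11): beta_j = (Xs^T Xs + gamma I)^(-1) Xs^T g_j
beta_j = np.linalg.solve(Xs.T @ Xs + gamma * np.eye(ns), Xs.T @ g_j)

# Gradient of ||g_j - Xs beta||^2 + gamma ||beta||^2 (up to a factor of 2);
# it should vanish at the closed-form solution.
grad = Xs.T @ (Xs @ beta_j - g_j) + gamma * beta_j
print(np.allclose(grad, 0.0))
```

In practice one would solve all n_t such systems with a single factorization of X_s^T X_s + γI, since the matrix is shared across the columns g^s_j.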

Algorithm Complexity
In this section, we analyze the computational complexity of the learning algorithm described in equations (9), (11) and (12). Besides expressing the complexity of the algorithm in big-O notation, we also count the arithmetic operations to give more detail about the running time. Computing the matrix G_t takes O(m × n_s × k) per iteration; similarly, computing the matrix G_s takes O(m × n_t × k) per iteration. Learning the matrices B_s and B_t takes O(m² × n_s) and O(m² × n_t) operations per iteration, respectively. In real applications, we have k ≪ m. Therefore, the overall complexity of the algorithm, dominated by the computation of the matrices B_s and B_t, is O(m² × n), where n = max(n_s, n_t).

Data Set
Domain adaptation for sentiment classification has been widely studied in the NLP community. A large majority of experiments are performed on the benchmark of Amazon product reviews gathered by Blitzer et al. (2007). This data set contains 4 different domains: Books (B), DVDs (D), Electronics (E) and Kitchen (K). For simplicity and comparability, we follow the convention of previous work (Pan et al., 2010; Glorot et al., 2011) and only consider the binary classification problem of whether a review is positive (higher than 3 stars) or negative (3 stars or lower). There are 1000 positive and 1000 negative reviews for each domain, as well as approximately 4,000 unlabeled reviews (varying slightly between domains). The positive and negative reviews are exactly balanced.
Following the literature (Pan et al., 2010), we construct 12 cross-domain sentiment classification tasks: B→D, B→E, B→K, D→B, D→E, D→K, E→B, E→D, E→K, K→B, K→D and K→E, where the letter before an arrow denotes the source domain and the letter after an arrow denotes the target domain. To be fair to the other algorithms that we compare with, we use the raw bag-of-words unigram/bigram features as their input and pre-process them with tf-idf. Table 1 presents the statistics of the data set.
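The 12 tasks are simply the ordered source→target pairs over the 4 domains; a short, purely illustrative Python enumeration:

```python
from itertools import permutations

domains = ["B", "D", "E", "K"]   # Books, DVDs, Electronics, Kitchen
tasks = [f"{s}->{t}" for s, t in permutations(domains, 2)]
print(len(tasks))    # 12 ordered source->target pairs
print(tasks[:3])     # ['B->D', 'B->E', 'B->K']
```

With 4 domains there are 4 × 3 = 12 ordered pairs, matching the task count above.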

Compared Methods
As a baseline method, we train a linear SVM (Fan et al., 2008) on the raw bag-of-words representation of the labeled source domain and test it on the target domain. In the original paper on the benchmark data set, Blitzer et al. (2007) adapted Structural Correspondence Learning (SCL) for sentiment analysis. Pan et al. (2010) proposed the spectral feature alignment (SFA) algorithm. Li and Zong (2008) proposed the multi-label consensus training (MCT) approach, which combines several base classifiers trained with SCL. Recently, some efforts have been initiated on learning robust feature representations with DNNs for cross-domain sentiment classification. Glorot et al. (2011) first employed stacked denoising auto-encoders (SDA) to extract meaningful representations for domain adaptation. The marginalized SDA (mSDA) was later proposed to address the high computational cost and lack of scalability to high-dimensional features. Zhuang et al. (2015) proposed a state-of-the-art method called transfer learning with deep autoencoders (TLDA).
For SCL, PJNMF, SDA, mSDA and TLDA, we use the source code provided by the authors. For SFA and MCT, we re-implement them based on the original papers. The above methods serve as comparisons in our empirical evaluation. For a fair comparison, all hyper-parameters are set by 5-fold cross-validation on the training set from the source domain. For our proposed BTDNNs, the number of hidden neurons is set to 1000, and the regularization parameter γ is tuned via 5-fold cross-validation.
For SDA, mSDA, TLDA and BTDNNs, we can construct the classifiers for the target domain in two ways. The first way is to use an SVM directly on top of the output of the hidden layer. The second way is to apply a standard SVM to train a classifier for the source domain in the embedding space; the classifier is then applied to predict sentiment labels for the target domain data. For a fair comparison with the shallow models, we choose the second way in this paper. Figure 2 shows the classification accuracy of all methods on all source-target domain pairs. All compared methods achieve performance similar to the results reported in the original papers. From Figure 2, we can see that our proposed approach BTDNNs outperforms all eight comparison methods in general. The baseline performs poorly on all 12 tasks, while the seven domain adaptation methods, SCL, MCT, SFA, PJNMF, SDA, mSDA and TLDA, consistently outperform the baseline across all 12 tasks, which demonstrates that the knowledge transferred from the source domain to the target domain is useful for sentiment classification. Nevertheless, the improvements achieved by these seven methods over the baseline are much smaller than those of the proposed BTDNNs.
Notably, the deep learning based methods (SDA, mSDA and TLDA) perform worse than our approach. The reason may be that SDA, mSDA and TLDA learn unified domain-invariant feature representations by combining the source domain data and the target domain data, which cannot well characterize the domain-specific features as well as the commonality of the domains. In contrast, our proposed BTDNNs ensures the feasibility of transferring between domains, and the distribution consistency between the transferred domain and the desirable domain is constrained with a linear data reconstruction manner.
We also conduct significance tests for our proposed approach BTDNNs and the state-of-the-art method (TLDA) using a McNemar paired test for labeling disagreements (Gillick and Cox, 1989). In general, the average result on the 12 source-target domain pairs indicates that the difference between BTDNNs and TLDA is mildly significant, with p < 0.08. Furthermore, we also conduct experiments on a much larger industrial-strength data set of 22 domains (Glorot et al., 2011). The preliminary results show that BTDNNs significantly outperforms TLDA (p < 0.05). We will report detailed results and discussion in future work.

Figure 3: Proxy A-distance between domains of the Amazon benchmark for the 6 different pairs.

Domain Divergence
In this subsection, we look into how similar the two domains are to each other. Ben-David et al. (2006) proposed the A-distance as a measure of the difference between two domains. They hypothesized that it should be difficult to discriminate between the source and target domains in order to have a good transfer between them. In practice, computing the exact A-distance is impossible and one has to compute a proxy. Similar to Glorot et al. (2011), the proxy for the A-distance is defined as 2(1 − 2ε), where ε is the generalization error of a linear SVM classifier trained on the binary classification problem of distinguishing inputs between the two domains. Figure 3 presents the results for each pair of domains. Surprisingly, the distance is increased with the new feature representations, i.e., distinguishing between domains becomes easier with the BTDNNs features. We explain this effect by the fact that BTDNNs ensure the feasibility of transferring between domains while constraining the distribution consistency between the transferred domain and the desirable domain with a linear data reconstruction manner, which learns generally better representations of the input data. This helps both tasks, distinguishing between domains and sentiment classification (e.g., in the Books domain BTDNNs might interpolate the feature "exciting" from "boring"; both are not particularly relevant for sentiment classification but might help distinguish a review from the Electronics domain).
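A hedged sketch of the proxy A-distance computation follows. This is our illustration with synthetic Gaussian "domains"; a simple perceptron stands in for the linear SVM used in the paper, and ε is estimated in-sample for brevity rather than as a held-out generalization error.

```python
import numpy as np

rng = np.random.default_rng(4)
ns, nt, m = 100, 100, 5
Xs = rng.normal(0.0, 1.0, (ns, m))     # synthetic "source" examples
Xt = rng.normal(2.0, 1.0, (nt, m))     # synthetic "target" examples, shifted
X = np.vstack([Xs, Xt])
y = np.concatenate([-np.ones(ns), np.ones(nt)])  # domain labels

# Train a linear separator (perceptron) to distinguish the two domains.
w, b = np.zeros(m), 0.0
for _ in range(50):                    # perceptron epochs
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:
            w += yi * xi
            b += yi

eps = float(np.mean(np.sign(X @ w + b) != y))  # domain-classification error
pad = 2.0 * (1.0 - 2.0 * eps)                  # proxy A-distance, 2(1 - 2*eps)
print(round(pad, 2))
```

A proxy A-distance near 2 means the domains are easy to tell apart, while a value near 0 means the representation largely mixes them.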

Conclusions and Future Work
In this paper, we propose novel Bi-Transferring Deep Neural Networks (BTDNNs) for cross-domain sentiment classification. The proposed BTDNNs attempt to transfer the source domain examples to the target domain, and also to transfer the target domain examples to the source domain. The linear transformation of BTDNNs ensures the feasibility of transferring between domains, and the distribution consistency between the transferred domain and the desirable domain is constrained with a linear data reconstruction manner. Experimental results show that BTDNNs significantly outperforms several baselines, and achieves an accuracy competitive with the state-of-the-art method for sentiment classification adaptation.
There are several ways in which this research could be continued. First, since deep learning may obtain better generalization on large-scale data sets (Bengio, 2009), a straightforward path for future research is to apply the proposed BTDNNs to domain adaptation on a much larger industrial-strength data set of 22 domains (Glorot et al., 2011). Second, we will investigate the use of the proposed approach on other kinds of data sets, such as 20 Newsgroups and Reuters-21578 (Li et al., 2012; Zhuang et al., 2013).