Identifying Transferable Information Across Domains for Cross-domain Sentiment Classification

Manually labeling data in every domain is expensive and time-consuming. Cross-domain sentiment analysis addresses this problem by using a labeled source domain to build a sentiment classifier for an unlabeled target domain. However, the polarity orientation (positive or negative) of a word and its significance for expressing an opinion often differ from one domain to another. Owing to these differences, cross-domain sentiment classification is still a challenging task. In this paper, we propose that words that do not change their polarity and significance represent the transferable (usable) information across domains for cross-domain sentiment classification. We present a novel approach based on the χ² test and cosine-similarity between the context vectors of words to identify polarity-preserving significant words across domains. Furthermore, we show that a weighted ensemble of the classifiers enhances the cross-domain classification performance.


Introduction
The choice of words used to express an opinion depends on the domain, as users often use domain-specific words (Qiu et al., 2009). For example, entertaining and boring are frequently used in the movie domain to express an opinion; however, finding these words in the electronics domain is rare. Moreover, there are words which are likely to be used across domains in the same proportion, but may change their polarity orientation from one domain to another (Choi et al., 2009). For example, a word like unpredictable is positive in the movie domain (unpredictable plot), but negative in the automobile domain (unpredictable steering). Such a polarity-changing word should be assigned a positive orientation in the movie domain and a negative orientation in the automobile domain. Due to these differences across domains, a supervised algorithm trained on a labeled source domain does not generalize well to an unlabeled target domain, and the cross-domain performance degrades.
Generally, supervised learning algorithms have to be re-trained from scratch on every new domain using the manually annotated review corpus (Pang et al., 2002;Kanayama and Nasukawa, 2006;Pang and Lee, 2008;Esuli and Sebastiani, 2005;Breck et al., 2007;Li et al., 2009;Prabowo and Thelwall, 2009;Taboada et al., 2011;Rosenthal et al., 2014). This is not practical as there are numerous domains and getting manually annotated data for every new domain is an expensive and time consuming task (Bhattacharyya, 2015). On the other hand, domain adaptation techniques work in contrast to traditional supervised techniques on the principle of transferring learned knowledge across domains (Blitzer et al., 2007;Pan et al., 2010;Bhatt et al., 2015). The existing transfer learning based domain adaptation algorithms for cross-domain classification have generally been proven useful in reducing the labeled data requirement, but they do not consider words like unpredictable that change polarity orientation across domains. Transfer (reuse) of changing polarity words affects the cross-domain performance negatively. Therefore, one cannot use transfer learning as the proverbial hammer, rather one needs to gauge what to transfer from the source domain to the target domain.
In this paper, we propose that the words which are equally significant with a consistent polarity across domains represent the usable information for cross-domain sentiment analysis. The χ² test is a popular and reliable statistical test for identifying the significance and polarity of a word in an annotated corpus (Oakes et al., 2001; Al-Harbi et al., 2008; Cheng and Zhulyn, 2012; Sharma and Bhattacharyya, 2013). However, for an unlabeled corpus no such statistical technique is applicable. Therefore, identifying words which are significant with a consistent polarity across domains is a non-trivial task. In this paper, we present a novel technique based on the χ² test and cosine-similarity between the context vectors of words to identify Significant Consistent Polarity (SCP) words across domains. The major contributions of this research are as follows.
1. Extracting significant consistent polarity words across domains: A technique which exploits cosine-similarity between the context vectors of words and the χ² test is used to identify SCP words across labeled source and unlabeled target domains.
2. An ensemble-based adaptation algorithm: A classifier (C_s) trained on SCP words in the labeled source domain acts as a seed to initiate a classifier (C_t) on the target-specific features. These classifiers are then combined in a weighted ensemble to further enhance the cross-domain classification performance.
Our results show that our approach gives a statistically significant improvement over Structural Correspondence Learning (SCL) (Bhatt et al., 2015) and common unigrams in the identification of transferable words, which eventually facilitates a more accurate sentiment classifier in the target domain. The road-map for the rest of the paper is as follows. Section 2 describes the related work. Section 3 describes the extraction of SCP words and the ensemble-based adaptation algorithm. Section 4 describes the dataset and the experimental protocol. Section 5 presents the results and Section 6 reports the error analysis. Section 7 concludes the paper.

Related Work
The most significant efforts in the learning of transferable knowledge for cross-domain text classification are Structural Correspondence Learning (SCL) (Blitzer et al., 2007) and Spectral Feature Alignment (SFA) (Pan et al., 2010). SCL aims to learn the co-occurrence between features from the two domains. It starts with learning pivot features that occur frequently in both domains. It models the correlation between pivots and all other features by training linear predictors to predict the presence of pivot features in the unlabeled target domain data. SCL has shown significant improvement over a baseline (shift-unaware) model. SFA uses some domain-independent words as a bridge to construct a bipartite graph that models the co-occurrence relationship between domain-specific words and domain-independent words. Our approach also exploits the concept of co-occurrence (Pan et al., 2010), but we measure co-occurrence in terms of similarity between the context vectors of words, unlike SCL and SFA, which literally look for the co-occurrence of words in the corpus. The use of context vectors in place of words helps to overcome the data sparsity problem.
Domain adaptation for sentiment classification has been explored by many researchers (Jiang and Zhai, 2007; Ji et al., 2011; Saha et al., 2011; Glorot et al., 2011; Zhou et al., 2014; Bhatt et al., 2015). Most of these works have focused on learning a shared low-dimensional representation of features that can be generalized across different domains. However, none of the approaches explicitly analyses the significance and polarity of words across domains. On the other hand, Glorot et al. (2011) proposed a deep learning approach which learns to extract a meaningful representation for each review in an unsupervised fashion. Zhou et al. (2014) also proposed a deep learning approach to learn a feature mapping between cross-domain heterogeneous features as well as a better feature representation for mapped data to reduce the bias issue caused by the cross-domain correspondences. Though deep learning based approaches perform reasonably well, they do not perform explicit identification and visualization of transferable features across domains, unlike SFA and SCL, which output a set of words as transferable (reusable) features. Our approach explicitly determines the words which are equally significant with a consistent polarity across source and target domains. Our results show that the use of SCP words identified by our approach as features leads to a more accurate cross-domain sentiment classifier in the unlabeled target domain.

Approach: Cross-domain Sentiment Classification
The proposed approach identifies words which are equally significant for sentiment classification with a consistent polarity across source and target domains. These Significant Consistent Polarity (SCP) words form a set of transferable knowledge from the labeled source domain to the unlabeled target domain for cross-domain sentiment analysis. The algorithm further adapts to the unlabeled target domain by learning target-domain-specific features. The following sections elaborate SCP feature extraction (3.1) and the ensemble-based cross-domain adaptation algorithm (3.2).

Extracting SCP Features
Words which are not significant for classification in the labeled source domain do not transfer useful knowledge to the target domain through a supervised classifier trained in the source domain. Moreover, words that are significant in both domains but have different polarity orientations transfer the wrong information to the target domain through a supervised classifier trained in the labeled source domain, which also degrades the cross-domain performance. Our algorithm identifies the significance and the polarity of all the words individually in their respective domains. Then the words which are significant in both domains with a consistent polarity orientation are used to initiate the cross-domain adaptation algorithm. The following sections elaborate how the significance and the polarity of the words are obtained in the labeled source and the unlabeled target domains.

Extracting Significant Words with the Polarity Orientation from the Labeled Source Domain
Since we have a polarity-annotated dataset in the source domain, a statistical test like the χ² test can be applied to find the significance of a word in the corpus for sentiment classification (Cheng and Zhulyn, 2012; Zheng et al., 2004). We have used the goodness-of-fit χ² test with an equal number of reviews in the positive and negative corpora. This test is generally used to determine whether sample data is consistent with a null hypothesis. Here, the null hypothesis is that the word is equally used in the positive and the negative corpora. The χ² test is formulated as follows:

$$\chi^2(w) = \frac{(c_p^w - \mu^w)^2 + (c_n^w - \mu^w)^2}{\mu^w} \qquad (1)$$

where c_p^w is the observed count of a word w in the positive documents and c_n^w is the observed count in the negative documents. µ^w represents the average of the word's counts in the positive and the negative documents; it is the expected count under the null hypothesis. There is an inverse relation between the χ² value and the p-value, which is the probability of the data given that the null hypothesis is true. When a word yields a p-value smaller than the critical p-value (0.05), we reject the null hypothesis. Consequently, we assume that the word w belongs to a particular class (positive or negative) in the data, and hence it is a significant word for classification (Sharma and Bhattacharyya, 2013).
Polarity of Words in the Labeled Source Domain: The χ² test substantiates the statistically significant association of a word with a class label. Based on this association we assign a polarity orientation to a word in the domain. In other words, if a word is found significant by the χ² test, then the exact class of the word is determined by comparing c_p^w and c_n^w. For instance, if c_p^w is higher than c_n^w, then the word is positive, else negative.
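The source-domain test described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the tokenized-document input format, and the cutoff 3.841 (the standard χ² critical value for p = 0.05 with one degree of freedom, which the two-class setting suggests but the text does not state) are assumptions made here.

```python
from collections import Counter

# Assumed cutoff: chi-square critical value for p = 0.05, 1 degree of freedom.
CHI2_CRITICAL_DF1 = 3.841


def chi_square(c_pos, c_neg):
    """Goodness-of-fit statistic against the null 'equal use in both classes'."""
    mu = (c_pos + c_neg) / 2.0  # expected count under the null hypothesis
    if mu == 0:
        return 0.0
    return ((c_pos - mu) ** 2 + (c_neg - mu) ** 2) / mu


def significant_words(pos_docs, neg_docs, critical=CHI2_CRITICAL_DF1):
    """Return {word: 'positive' | 'negative'} for words that reject the null.

    pos_docs and neg_docs are lists of tokenized documents (lists of strings).
    """
    pos_counts = Counter(tok for doc in pos_docs for tok in doc)
    neg_counts = Counter(tok for doc in neg_docs for tok in doc)
    polarity = {}
    for word in set(pos_counts) | set(neg_counts):
        c_pos, c_neg = pos_counts[word], neg_counts[word]
        if chi_square(c_pos, c_neg) > critical:
            # The class with the higher observed count gives the polarity.
            polarity[word] = "positive" if c_pos > c_neg else "negative"
    return polarity
```

A word used equally often in both corpora gets a statistic of zero and is never selected, which matches the null hypothesis stated in the text.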

Extracting Significant Words with the Polarity Orientation from the Unlabeled Target Domain
Target domain data is unlabeled and hence the χ² test cannot be used to find the significance of words. However, to obtain SCP words across domains, we take advantage of the fact that we have to identify the significance of only those words in the target domain which are already proven to be significant in the source domain. We presume that a word which is significant in the source domain as per the χ² test and occurs with a frequency greater than a certain threshold (θ) in the target domain is also significant in the target domain.
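The presumption above amounts to a simple filter. The sketch below is an illustration under stated assumptions: the function name and the dictionary of target counts are hypothetical, the default θ of 10 is the value the paper reports using, and the counts are assumed to be supplied on whatever scale θ is expressed in, since the normalization is not fully specified in the text.

```python
def significant_in_target(source_significant, target_counts, theta=10):
    """Keep source-significant words whose target-domain count exceeds theta.

    source_significant: iterable of words found significant in the source domain.
    target_counts: mapping word -> (normalized) count in the target domain.
    """
    return {w for w in source_significant if target_counts.get(w, 0) > theta}
```

Words that never occur in the target domain are dropped because their count defaults to zero.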
Equation (2) formulates the significance test in the unlabeled target (t) domain:

$$\mathrm{significant}_s(w) \wedge \mathrm{count}_t(w) > \theta \;\Rightarrow\; \mathrm{significant}_t(w) \qquad (2)$$

Here, the function significant_s assures the significance of the word w in the labeled source (s) domain and count_t gives the normalized count of w in t (footnote 5). The χ² test has one key assumption: the expected value of an observed variable should not be less than 5 for the result to be meaningful. Taking this assumption as a base, we fix the value of θ at 10 (footnote 6).

Polarity of Words in the Unlabeled Target Domain: Generally, in a polar corpus, a positive word occurs more frequently in the context of other positive words, while a negative word occurs in the context of other negative words (footnote 7). Based on this hypothesis, we explore the contextual information of a word, which is captured well by its context vector, to assign polarity to words in the target domain (Rill et al., 2012; Rong, 2014). Mikolov et al. (2013) showed that the similarity between the context vectors of words in the vicinity of each other, such as 'go' and 'to', is higher than that of distant words or words that are not in the neighborhood of each other. Here, the observed concept is that if a word is positive, then its context vector learned from the polar review corpus will give a higher cosine-similarity with a known positive polarity word than with a known negative polarity word, and vice versa. Therefore, based on the cosine-similarity scores we can assign the label of the known polarity word to the unknown polarity word. We term the known polarity words Positive-pivot and Negative-pivot.
Context Vector Generation: To compute the context vector (conVec) of a word (w), we have used the publicly available word2vec toolkit with the skip-gram model (Mikolov et al., 2013) (footnote 8). In this model, each word's Huffman code is used as an input to a log-linear classifier with a continuous projection layer, and words within a given window are predicted (Faruqui et al., 2014). We construct a 100-dimensional vector for each candidate word from the unlabeled target domain data. The decision method given in Equation 3 defines the polarity assignment to the unknown polarity words of the target domain. If a word w gives a higher cosine-similarity with the PosPivot (Positive-pivot) than with the NegPivot (Negative-pivot), the decision method assigns positive polarity to w; otherwise it assigns negative polarity.

Footnote 5: Normalized count of w in t shows the proportion of occurrences of w in t.
Footnote 6: We also tried smaller values of θ, but they were not found to be as effective as a θ value of 10 for identifying significant words.
Footnote 7: For example, 'excellent' will be used more often in positive reviews than in negative reviews; hence, it would have more positive words in its context. Likewise, 'terrible' will be used more frequently in negative reviews than in positive reviews; hence, it would have more negative words in its context.
Footnote 8: Available at: https://radimrehurek.com/gensim/models/word2vec.html
$$\mathrm{polarity}(w) = \begin{cases} \text{Positive} & \text{if } \mathrm{cosine}(\mathrm{conVec}(w), \mathrm{conVec}(\mathrm{PosPivot})) > \mathrm{cosine}(\mathrm{conVec}(w), \mathrm{conVec}(\mathrm{NegPivot})) \\ \text{Negative} & \text{if } \mathrm{cosine}(\mathrm{conVec}(w), \mathrm{conVec}(\mathrm{PosPivot})) < \mathrm{cosine}(\mathrm{conVec}(w), \mathrm{conVec}(\mathrm{NegPivot})) \end{cases} \qquad (3)$$

Pivot Selection Method: We empirically observed that a polar word which has the highest frequency in the corpus gives more coverage for estimating the polarity orientation of other words while using context vectors. Essentially, the frequent occurrence of the word in the corpus allows it to be in the context of other words frequently. Therefore, a polar word having the highest frequency in the target domain is observed to be more accurate as a pivot for identifying the polarity of input words. Table 1 shows examples of a few words in the electronics domain whose polarity orientation is derived from the similarity scores obtained with the PosPivot and NegPivot words in the electronics domain.

Transferable Knowledge: The proposed algorithm uses the above-mentioned techniques to identify the significance and the polarity of words in the labeled source data (cf. Section 3.1.1) and the unlabeled target data (cf. Section 3.1.2). The words which are found significant in both domains with the same polarity orientation form a set of SCP features for cross-domain sentiment classification. The weights learned for the SCP features in the labeled source domain by the classification algorithm can be reused for sentiment classification in the unlabeled target domain, as SCP features have a consistent impact in both domains.
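The decision rule of Equation (3) and the final intersection that yields the SCP set can be sketched as follows. This is an illustrative sketch only: the context vectors are assumed to be available as a plain dictionary (for instance, exported from a trained word2vec model), the function names are hypothetical, and returning no label on a tie is a choice made here because Equation (3) does not cover that case.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def target_polarity(word, vectors, pos_pivot, neg_pivot):
    """Label a word by the pivot whose context vector it is closer to."""
    sim_pos = cosine(vectors[word], vectors[pos_pivot])
    sim_neg = cosine(vectors[word], vectors[neg_pivot])
    if sim_pos > sim_neg:
        return "positive"
    if sim_pos < sim_neg:
        return "negative"
    return None  # tie: left unlabeled (assumption, not covered by Equation 3)


def scp_words(source_polarity, target_significant, vectors, pos_pivot, neg_pivot):
    """Words significant in both domains whose polarity agrees across domains."""
    scp = set()
    for word in target_significant:
        if word not in vectors:
            continue
        label = target_polarity(word, vectors, pos_pivot, neg_pivot)
        if label is not None and label == source_polarity.get(word):
            scp.add(word)
    return scp
```

In this sketch a word such as unpredictable, positive in the source but closer to the negative pivot in the target, is excluded from the transferable set.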

Ensemble-based Cross-domain Adaptation Algorithm
Apart from the transferable SCP words (obtained in Section 3.1), each domain has specific discriminating words which can be discovered only from the data of that domain. The proposed cross-domain adaptation approach (Algorithm 1) attempts to learn such domain-specific features from the target domain using a classifier trained on SCP words in the source domain. An ensemble of the classifiers trained on the SCP features (transferred from the source) and domain-specific features (learned within the target) further enhances the cross-domain performance. Table 2 lists the notations used in the algorithm. The working of the cross-domain adaptation algorithm is as follows:
1. Identify SCP features from the labeled source and the unlabeled target domain data.
2. An SVM-based classifier, named C_s, is trained on SCP words as features using labeled source domain data.
3. The classifier C_s is used to predict the labels for the unlabeled target domain instances D_t^u, and the confidently predicted instances of D_t^u form a set of pseudo-labeled instances R_t^n.
4. An SVM-based classifier, named C_t, is trained on the pseudo-labeled target domain instances R_t^n, using unigrams in R_t^n as features to include the target-specific words.
5. Finally, a Weighted Sum Model (WSM) of C_s and C_t gives a classifier in the target domain.
The confidence in the prediction of D_t^u is measured in terms of the classification score of the document, i.e., the distance of the input document from the separating hyperplane given by the SVM classifier (Hsu et al., 2003). The top n confidently predicted pseudo-labeled instances (R_t^n) are used to train the classifier C_t, where n depends on a threshold on the classification score that is empirically set to ±0.2 (footnote 10). The classifier C_s trained on the SCP features (transferred knowledge) from the source domain and the classifier C_t trained on self-discovered target-specific features from the pseudo-labeled target domain instances bring in complementary information from the two domains. Therefore, combining C_s and C_t in a weighted ensemble (WSM) further enhances the cross-domain performance. Algorithm 1 gives the pseudo code of the proposed adaptation approach.
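The selection of confidently predicted target documents can be illustrated with a small helper. This is a sketch under assumptions: the scores are taken to be signed distances from the hyperplane (for example, the output of a linear SVM's decision function), the function name is hypothetical, and treating the threshold as a strict inequality on the absolute score is a choice made here.

```python
def pseudo_label(scores, threshold=0.2):
    """Return (document index, label) pairs for confident predictions.

    scores: signed classification scores produced by the source classifier
            on the unlabeled target documents.
    A document is kept only if its absolute score exceeds the threshold;
    the sign of the score becomes its pseudo label (+1 or -1).
    """
    selected = []
    for idx, score in enumerate(scores):
        if abs(score) > threshold:
            selected.append((idx, 1 if score > 0 else -1))
    return selected
```

Documents near the boundary are left out, so the target classifier is trained only on instances the source classifier is reasonably sure about.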
Footnote 10: [...] have shown that 350 to 400 labeled documents are required to get a high-accuracy classifier in a domain using supervised classification techniques, but beyond 400 labeled documents there is not much improvement in the classification accuracy. Hence, the threshold on the classification score is set such that it gives a sufficient number of documents for supervised classification. A threshold of ±0.2 yields between 350 and 400 documents.

[...] errors produced by the individual classifier. The formulation of WSM is given in step 6 of Algorithm 1. If C_s has wrongly predicted a document at a boundary point and C_t has predicted the same document confidently, then the weighted sum of C_s and C_t predicts the document correctly, and vice versa. For example, if a document is classified by C_s as negative (wrong prediction) with a classification score of −0.07, while the same document is classified by C_t as positive (correct prediction) with a classification score of 0.33, the WSM of C_s and C_t will classify the document as positive with a classification score of 0.12 (Equation 4).

$$\mathrm{WSM} = \frac{(-0.07 \times 0.765) + (0.33 \times 0.712)}{0.765 + 0.712} = 0.12 \qquad (4)$$

Here 0.765 and 0.712 are the weights W_s and W_t of the classifiers C_s and C_t respectively.

Weights to the Classifiers in WSM: The weights W_s and W_t are the classification accuracies obtained by C_s and C_t respectively on the cross-validation data from the target domain. The weights W_s and W_t allow C_s and C_t to participate in the WSM in proportion to their accuracy on the cross-validation data. This weighting lets the more accurate classifier dominate.
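The worked example can be checked numerically. The sketch below assumes that the weighted sum is normalized by the sum of the two weights, which is the form that reproduces the score of 0.12 reported in the text from the stated scores and weights; the function names are hypothetical.

```python
def weighted_sum(score_s, score_t, w_s, w_t):
    """Accuracy-weighted average of the two classification scores."""
    return (w_s * score_s + w_t * score_t) / (w_s + w_t)


def wsm_label(score_s, score_t, w_s, w_t):
    """Final label: the sign of the weighted score."""
    return "positive" if weighted_sum(score_s, score_t, w_s, w_t) > 0 else "negative"


# Values from the example in the text: C_s says -0.07, C_t says 0.33.
score = weighted_sum(-0.07, 0.33, 0.765, 0.712)
print(round(score, 2))  # 0.12
print(wsm_label(-0.07, 0.33, 0.765, 0.712))  # positive
```

The confident prediction of C_t outweighs the borderline prediction of C_s, which is the behavior the ensemble is meant to provide.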

Dataset & Experimental Protocol
In this paper, we compare SCP-based domain adaptation (our approach) with the SCL-based domain adaptation approach proposed by Bhatt et al. (2015) using four domains, viz., Electronics (E), Kitchen (K), Books (B), and DVD (D). We use the SVM algorithm with a linear kernel (Tong and Koller, 2002) to train the classifier in all the classification systems mentioned in the paper. To implement the SVM algorithm, we have used the publicly available Python-based Scikit-learn package (Pedregosa et al., 2011). Data in each domain is divided into three parts, viz., train (60%), validation (20%) and test (20%). The SCP words are extracted from the training data. The weights W_s and W_t for the source and target classifiers are the accuracies obtained by C_s and C_t respectively on the validation dataset from the target domain. We report the accuracy for all the systems on the test data. Table 3 shows the statistics of the dataset.

[...] (2015) is state-of-the-art for cross-domain sentiment analysis. On the other hand, common unigrams of the source and target are the most visible transferable information.

Gold standard SCP words: The χ² test gives us the significance and polarity of a word in the corpus by taking into account the polarity labels of the reviews. Applying the χ² test in both domains, treating the target domain as if it were also labeled, gives us gold standard SCP words. There is no manual annotation involved.
F-score for SCP Words Identification Task: The set of SCP words represents the usable information across domains for cross-domain classification; hence we compare the F-score for the SCP words identification task obtained with our approach, SCL and common unigrams in Figure 1. It demonstrates that our approach gives a large improvement in the F-score over SCL and common unigrams for all 12 pairs of source and target domains. To measure the statistical significance of this improvement, we applied a t-test to the F-score distributions obtained with our approach, SCL and common unigrams. The t-test is a statistical significance test used to determine whether two sets of data are significantly different or not (footnote 14). As per the t-test, our approach performs significantly better than SCL and common unigrams, while SCL performs better than common unigrams.
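For concreteness, the F-score of an identified word set against the gold standard set can be computed as below. The text does not spell out the variant used, so this sketch assumes the balanced F1 score over sets of words; the function name and the example words are hypothetical.

```python
def f_score(predicted, gold):
    """Balanced F-score of a predicted word set against a gold word set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # share of predicted words that are gold
    recall = true_pos / len(gold)          # share of gold words that were found
    return 2 * precision * recall / (precision + recall)
```

With predicted words {good, bad, fast} and gold words {good, bad, slow, cheap}, precision is 2/3 and recall is 1/2, giving an F-score of 4/7.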
Comparison among C_s, C_t and WSM: Table 4 compares the classifiers obtained in the target domain using SCP words given by our approach, SCL, common unigrams, and gold standard SCP words, for electronics as the source domain and movie as the target domain. Since electronics and movie are two very dissimilar domains in terms of domain-specific words, unlike books and movie, getting a high-accuracy classifier in the movie domain from the electronics domain is a challenging task (Pang et al., 2002). Therefore, in Table 4 results are reported with electronics as the source domain and movie as the target domain (footnote 15). In all four cases, the information transferred from the source to the target differs, but the ensemble-based classification algorithm (Section 3.2) is the same. Table 4 reports the sentiment classification accuracy obtained with C_s, C_t and WSM. The weights W_s and W_t in WSM are the normalized accuracies of C_s and C_t respectively on the validation set from the target domain. The fourth column (size) represents the feature set size. We observed that WSM gives the highest accuracy, which validates our assumption that a weighted sum of two classifiers performs better than the individual classifiers. The WSM accuracy obtained with SCP words given by our approach is comparable to the accuracy obtained with gold standard SCP words.
The motivation of this research is to learn a shared representation cognizant of significant and polarity-changing words across domains. Hence, we report cross-domain classification accuracy obtained with three different types of shared representations (transferable knowledge), viz., common unigrams, SCL and our approach (footnote 16). System-1, system-2 and system-3 in Table 5 show the final cross-domain sentiment classification accuracy obtained with WSM in the target domain for 12 pairs of source and target using common unigrams, SCL and our approach respectively.
System-1: This system considers common unigrams of both domains as the shared representation.
System-2: It differs from system-1 in the shared representation, which is learned using Structural Correspondence Learning (SCL) (Bhatt et al., 2015) to initiate the process.
System-3: This system implements the proposed domain adaptation algorithm. Here, the shared representation is the SCP words and the ensemble-based domain adaptation algorithm (Section 3.2) gives the final classifier in the target domain.
Table 5 shows that system-3 is better than system-1 and system-2 for all pairs except K to B and B to D. For these two pairs, system-2 performs better than system-3, though the difference in accuracy is very low (below 1%).

Table 4: Classification accuracy in % given by C_s, C_t and WSM with different feature sets for electronics as source and movie as target.

Footnote 14: The detail about the test is available at: http://www.socialresearchmethods.net/kb/stat_t.php.
Footnote 15: The movie review dataset is a balanced corpus of 2000 reviews. Available at: http://www.cs.cornell.edu/people/pabo/movie-review-data/
Footnote 16: The reported accuracy is the ratio of correctly predicted documents to the total number of documents in the test dataset.
To enhance the final accuracy in the target domain, Bhatt et al. (2015) performed iterations over the pseudo-labeled target domain instances (R_t^n). In each iteration, they obtained a new C_t trained on an increased number of pseudo-labeled documents. This process is repeated until all the training instances of the target domain are considered. The C_t obtained in the last iteration forms the WSM with C_s, which is trained on the transferable features given by SCL. Bhatt et al. (2015) have shown that the iteration-based domain adaptation technique is more effective than one-shot adaptation approaches. System-4, system-5, and system-6 in Table 5 incorporate the iterative process into system-1, system-2, and system-3 respectively. We observed the same trend after the inclusion of the iterative process, as the SCP-based system-6 performed the best in all 12 cases. On the other hand, the SCL-based system-5 performs better than the common-unigrams-based system-4. Table 7 shows the results of the significance test (t-test) performed on the accuracy distributions produced by the six different systems. The noticeable point is that the iterations over SCL (system-5) and our approach (system-6) narrow the difference in accuracy between the two: system-2 and system-3 have a statistically significant difference in accuracy with a p-value of 0.039 (Row-4 of Table 7), but the difference between system-5 and system-6 is not statistically significant. Essentially, system-3 does not gain much from iterations, unlike system-2. In other words, adding the iterative process to the shared representation given by SCL overcomes the errors introduced by SCL. On the other hand, the SCP words given by our approach were able to produce a less erroneous system in one shot. Table 6 shows the in-domain sentiment classification accuracy obtained with unigrams and significant words as features, considering labeled data in the domain. System-6 comes close to matching the in-domain accuracy obtained with unigrams.

Figure 1: F-score for SCP words identification task (Source → Target) with respect to gold standard SCP words.

Table 6: In-domain sentiment classification accuracy using significant words and unigrams.

Table 7: t-test (α = 0.05) results on the difference in accuracy produced by various systems (cf. Table 5).
To validate our assertion that polarity-preserving significant words (SCP) across source and target domains form a less erroneous set of transferable knowledge from the source domain to the target domain, we computed the Pearson product-moment correlation between the F-score obtained for our approach (cf. Figure 1) and the cross-domain accuracy obtained with SCP (System-3, cf. Table 5). We observed a strong positive correlation (r) of 0.78 between F-score and cross-domain accuracy. Essentially, a more accurate set of SCP words leads to a better classifier in the unlabeled target domain.
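The correlation coefficient itself is straightforward to compute. The sketch below implements the standard Pearson product-moment formula; the function name is hypothetical, and the per-pair F-scores and accuracies are not reproduced here, so any inputs used with it are placeholders rather than the paper's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

In the setting described above, xs would hold the F-score for each of the 12 source and target pairs and ys the corresponding cross-domain accuracy.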

Error Analysis
Pairs of domains which share a greater number of domain-specific words result in a higher-accuracy cross-domain classifier. For example, the Electronics (E) and Kitchen (K) domains share many domain-specific words; hence pairing such similar domains as the source and the target results in a higher-accuracy classifier in the target domain. Table 5 shows that K→E outperforms B→E and D→E, and E→K outperforms B→K and D→K. On the other hand, DVD (D) and Electronics are two very different domains, unlike Electronics and Kitchen, or DVD and Books. The DVD dataset contains reviews of music albums. This difference in the types of reviews means that the two domains share fewer words. Table 8 shows the percentage (%) of common words among the four domains. The percentage of common unique words is the number of common unique words divided by the sum of the numbers of unique words in the individual domains.
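The overlap measure defined in the last sentence can be written out directly. This is a sketch of that definition as stated in the text, with a hypothetical function name and toy vocabularies; it divides the number of shared unique words by the sum of the two vocabulary sizes.

```python
def percent_common(vocab_a, vocab_b):
    """Percentage of common unique words between two domain vocabularies."""
    vocab_a, vocab_b = set(vocab_a), set(vocab_b)
    total = len(vocab_a) + len(vocab_b)  # sum of unique words per domain
    if total == 0:
        return 0.0
    return 100.0 * len(vocab_a & vocab_b) / total
```

Because the denominator counts shared words once per domain, two identical vocabularies score 50% under this definition rather than 100%.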

Conclusion
In this paper, we proposed that Significant Consistent Polarity (SCP) words represent the transferable information from the labeled source domain to the unlabeled target domain for cross-domain sentiment classification. We showed a strong positive correlation of 0.78 between the F-score of the SCP words identified by our approach and the sentiment classification accuracy achieved in the unlabeled target domain. Essentially, a set of less erroneous transferable features leads to a more accurate classification system in the unlabeled target domain. We have presented a technique based on the χ² test and cosine-similarity between the context vectors of words to identify SCP words. Results show that the SCP words given by our approach represent more accurate transferable information than the Structural Correspondence Learning (SCL) algorithm and common unigrams. Furthermore, we show that an ensemble of the classifiers trained on the SCP features and target-specific features overcomes the errors of the individual classifiers.