An Iterative Similarity based Adaptation Technique for Cross-domain Text Classification

Supervised machine learning classification algorithms assume that both training and test data are sampled from the same domain or distribution. However, the performance of these algorithms degrades on test data from a different domain. Such cross-domain classification is arduous because features in the test domain may differ, and the absence of labeled data further exacerbates the problem. This paper proposes an algorithm that adapts a classification model by iteratively learning domain specific features from the unlabeled test data. Moreover, this adaptation proceeds in a similarity-aware manner by integrating the similarity between domains into the adaptation setting. Cross-domain classification experiments on different datasets, including a real world dataset, demonstrate the efficacy of the proposed algorithm over the state of the art.


Introduction
A fundamental assumption in supervised statistical learning is that training and test data are independently and identically distributed (i.i.d.) samples drawn from a distribution. Otherwise, good performance on test data cannot be guaranteed even if the training error is low. In real life applications such as business process automation, this assumption is often violated. While researchers develop new techniques and models for machine learning based automation of one or a handful of business processes, large scale adoption is hindered by poor generalization performance. In our interactions with analytics software development teams, we noticed a pervasive diversity of learning tasks and associated inefficiency. Novel predictive analytics techniques developed on standard datasets (or limited client data) did not generalize across different domains (new products and services) and had limited applicability. Training models from scratch for every new domain requires human annotated labeled data, which is expensive and time consuming and hence not pragmatic.
On the other hand, transfer learning techniques allow domains, tasks, and distributions used in training and testing to be different, but related. It works in contrast to traditional supervised techniques on the principle of transferring learned knowledge across domains. While transfer learning has generally proved useful in reducing the labelled data requirement, brute force techniques suffer from the problem of negative transfer (Pan and Yang, 2010a). One cannot use transfer learning as the proverbial hammer, but needs to gauge when to transfer and also how much to transfer.
To address these issues, this paper proposes a domain adaptation technique for cross-domain text classification. In our setting for cross-domain classification, a classifier trained on one domain with sufficient labelled training data is applied to a different test domain with no labelled data. As shown in Figure 1, this paper proposes an iterative similarity based adaptation algorithm which starts with a shared feature representation of source and target domains. To adapt, it iteratively learns domain specific features from the unlabeled target domain data. In this process, similarity between two domains is incorporated in the adaptation setting for similarity-aware transfer. The major contributions of this research are:
• An iterative algorithm for learning domain specific discriminative features from unlabeled data in the target domain starting with an initial shared feature representation.
• Facilitating similarity-aware domain adaptation by seamlessly integrating similarity between two domains in the adaptation settings.
Figure 1: Outlines different stages of the proposed algorithm, i.e., shared feature representation, domain similarity, and the iterative learning process.
To the best of our knowledge, this is the first-of-its-kind approach in cross-domain text classification which integrates similarity between domains in the adaptation setting to learn domain specific features in an iterative manner. The rest of the paper is organized as follows: Section 2 summarizes the related work, and Section 3 presents details about the proposed algorithm. Section 4 presents the datasets, experimental protocol, and results. Finally, Section 5 concludes the paper.

Related Work
Transfer learning in text analysis (domain adaptation) has shown promising results in recent years (Pan and Yang, 2010a). Prior work on domain adaptation for text classification can be broadly classified into instance re-weighting and feature-representation based adaptation approaches.
Instance re-weighting approaches address the difference between the joint distribution of observed instances and class labels in the source domain and that in the target domain. In this direction, Liao et al. (2005) learned the mismatch between two domains and used active learning to select instances from the source domain to enhance adaptability of the classifier. Jiang and Zhai (2007) proposed an instance weighting scheme for domain adaptation in NLP tasks which exploits independence between feature mapping and instance weighting approaches. Saha et al. (2011) leveraged knowledge from the source domain to actively select the most informative samples from the target domain. Xia et al. (2013) proposed a hybrid method for the sentiment classification task that also addresses the challenge of mutually opposite orientation words.
A number of domain adaptation techniques are based on learning a common feature representation (Pan and Yang, 2010b; Blitzer et al., 2006; Ji et al., 2011; Daumé III, 2009) for text classification. The basic idea is to identify a suitable feature space where projected source and target domain data follow similar distributions, so that a standard supervised learning algorithm can be trained on the former to predict instances from the latter. Among them, Structural Correspondence Learning (SCL) (Blitzer et al., 2007) is the most representative one, explained later. Daumé (2009) proposed a heuristic based non-linear mapping of source and target data to a high dimensional space. Pan et al. (2008) proposed a dimensionality reduction method, Maximum Mean Discrepancy Embedding, to identify a latent space. Subsequently, Pan et al. (2010) proposed to map domain specific words into unified clusters using a spectral clustering algorithm. In another follow-up work, Pan et al. (2011) proposed a novel feature representation to perform domain adaptation via Reproducing Kernel Hilbert Space using Maximum Mean Discrepancy. A similar approach, based on co-clustering (Dhillon et al., 2003), was proposed in Dai et al. (2007) to leverage common words as a bridge between two domains. Bollegala et al. (2011) used a sentiment sensitive thesaurus to expand features for cross-domain sentiment classification. In a comprehensive evaluation study, it was observed that their approach tends to increase the adaptation performance when multiple source domains are used (Bollegala et al., 2013).
Domain adaptation based on iterative learning has been explored by  and Garcia-Fernandez et al. (2014); these approaches are similar to the philosophy of the proposed approach in appending pseudo-labeled test data to the training set. The first approach uses an expensive feature split to co-train two classifiers while the latter presents a single classifier self-training based setting. However, the proposed algorithm offers novel contributions in terms of 1) leveraging two independent feature representations capturing the shared and target specific representations, and 2) an ensemble of classifiers that uses labelled source domain and pseudo labelled target domain instances carefully moderated based on similarity between the two domains. Ensemble based domain adaptation for text classification was first proposed by Aue and Gammon (2005), though their approach could not achieve significant improvements over the baseline. Later, Zhao et al. (2010) proposed the online transfer learning (OTL) framework, which forms the basis of our ensemble based domain adaptation. However, the proposed algorithm differs in the following ways: 1) it is an unsupervised approach that transforms unlabeled data into pseudo labeled data, unlike OTL which is supervised, and 2) it incorporates similarity in the adaptation setting for gradual transfer.

Iterative Similarity based Adaptation
The philosophy of our algorithm is gradual transfer of knowledge from the source to the target domain while being cognizant of the similarity between the two domains. To accomplish this, we have developed a technique based on an ensemble of two classifiers. Transfer occurs within the ensemble, where a classifier learned on the shared representation transforms unlabeled test data into pseudo labeled data to learn a domain specific classifier. Before explaining the algorithm, we highlight its salient features:
Common Feature Space Representation: Our objective is to find a good feature representation which minimizes divergence between the source and target domains as well as the classification error. There have been several works on the feature-representation-transfer approach, such as (Blitzer et al., 2007; Ji et al., 2011), which derive a transformation matrix Q that gives a shared representation between the source and target domains. One of the widely used approaches is Structural Correspondence Learning (SCL) (Blitzer et al., 2006), which aims to learn the co-occurrence between features expressing similar meaning in different domains. The top k eigenvectors of the matrix W represent the principal predictors for the weight space, Q. Features from both domains are projected on this principal predictor space, Q, to obtain a shared representation. The source domain classifier in our approach is based on this SCL representation. In Section 4, we empirically show how our algorithm generalizes to different shared representations.
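The projection onto the principal predictor space can be sketched as follows. This is a minimal illustration, assuming a pivot-predictor weight matrix W (features × pivots) has already been estimated; the function names `principal_predictors` and `project`, the random toy data, and the use of an SVD to obtain the top-k directions are choices made here for illustration, not details given in the paper.

```python
import numpy as np

def principal_predictors(W, k):
    """Return Q (k x n_features): the top-k left singular vectors of W.

    W is assumed to be an (n_features x n_pivots) matrix of pivot-predictor
    weights; its left singular vectors are the eigenvectors of W W^T.
    """
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k].T

def project(Q, X):
    """Project instances X (n_samples x n_features) onto the shared space."""
    return X @ Q.T

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))        # hypothetical: 6 features, 4 pivots
Q = principal_predictors(W, k=2)
X_source = rng.random((3, 6))      # hypothetical source instances
X_target = rng.random((5, 6))      # hypothetical target instances
Z_source, Z_target = project(Q, X_source), project(Q, X_target)
```

Both domains are mapped with the same Q, so a classifier trained on `Z_source` can be applied directly to `Z_target`.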

Iterative Building of Target Domain Labeled Data: If we have enough labeled data from the target domain then a classifier can be trained without the need for adaptation. Hence, we wanted to explore if and how (pseudo) labeled data for the target domain can be created. Our hypothesis is that certain target domain instances are more similar to source domain instances than the rest. Hence a classifier trained on (a suitably chosen transformed representation of) source domain instances will be able to categorize similar target domain instances confidently. Such confidently predicted instances can be considered as pseudo labeled data which are then used to initialize a classifier in the target domain.
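The selection of confidently predicted instances can be sketched as below. This is only an illustration: `split_by_confidence` and the one-dimensional toy classifier are hypothetical names and stand-ins, and the threshold value is arbitrary.

```python
def split_by_confidence(pool, predict, theta):
    """Split a pool of unlabeled instances by prediction confidence.

    predict(x) is assumed to return (label, confidence). Instances with
    confidence above theta become pseudo labeled; the rest stay unlabeled.
    """
    pseudo_labeled, remaining = [], []
    for x in pool:
        label, confidence = predict(x)
        if confidence > theta:
            pseudo_labeled.append((x, label))
        else:
            remaining.append(x)
    return pseudo_labeled, remaining

# Toy stand-in for the source classifier: the sign of a 1-D score is the
# label and its magnitude is the confidence.
def toy_source_classifier(x):
    return (1 if x >= 0 else -1), abs(x)

P_u = [-2.0, -0.1, 0.3, 1.5]
P_s, P_u = split_by_confidence(P_u, toy_source_classifier, theta=1.0)
```

Only the two instances far from the decision boundary are moved to the pseudo labeled pool; the two near the boundary remain unlabeled for later iterations.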
Only a handful of instances in the target domain can be confidently predicted using the shared representation; therefore, we iterate further to create pseudo labeled instances in the target domain. In the next round of iterations, the remaining unlabeled target domain instances are passed through both classifiers and their outputs are suitably combined. Again, confidently labeled instances are added to the pool of pseudo labeled data and the classifier in the target domain is updated. This process is repeated until all unlabeled data is labeled or a certain maximum number of iterations is reached. This way we gradually adapt the target domain classifier on pseudo labeled data using the knowledge transferred from the source domain. In Section 4, we empirically demonstrate the effectiveness of this technique compared to one-shot adaptation approaches.

Domain Similarity-based Aggregation: Performance of domain adaptation is often constrained by the dissimilarity between the source and target domains (Luo et al., 2012; Rosenstein et al., 2005; Chin, 2013; Blitzer et al., 2007). If the two domains are largely similar, the knowledge learned in the source domain can be aggressively transferred to the target domain. On the other hand, if the two domains are less similar, knowledge learned in the source domain should be transferred in a conservative manner so as to mitigate the effects of negative transfer. Therefore, it is imperative for domain adaptation techniques to account for similarity between domains and transfer knowledge in a similarity-aware manner. While this may sound obvious, we do not see many works in the domain adaptation literature that leverage inter-domain similarity for transfer of knowledge. In this work, we use the cosine similarity measure to compute similarity between two domains and, based on that, gradually transfer knowledge from the source to the target domain. While it would be interesting to compare how well different similarity measures prevent negative transfer, that is not the focus of this work. In Section 4, we empirically show marginal gains of transferring knowledge in a similarity-aware manner.
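A cosine similarity between two domains can be computed as in the sketch below. How each domain is turned into a single vector is not specified in the text; summing term counts over all documents of a domain, as `domain_vector` does, is an assumption made here, and the toy documents are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def domain_vector(documents, vocabulary):
    """Assumed domain representation: summed term counts over documents."""
    counts = {term: 0 for term in vocabulary}
    for doc in documents:
        for token in doc.lower().split():
            if token in counts:
                counts[token] += 1
    return [counts[term] for term in vocabulary]

books = ["great plot great characters", "boring plot"]
kitchen = ["great blender", "boring design sharp blade"]
vocabulary = sorted({t for d in books + kitchen for t in d.split()})
sim = cosine_similarity(domain_vector(books, vocabulary),
                        domain_vector(kitchen, vocabulary))
```

Shared words such as "great" and "boring" contribute to the dot product, while domain specific words only enlarge the norms, so domains with little shared vocabulary receive a low score.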
Table 1: Notations.
Symbol        Description
ŷ             predicted labels by ensemble E
α             confidence of prediction
E             weighted ensemble of C_s and C_t
θ_1, θ_2      confidence thresholds for C_s and ensemble E
w_s, w_t      weights for C_s and C_t, respectively

Algorithm
Table 1 lists the notations used in this research. Inputs to the algorithm are labeled source domain instances {x^s_i, y^s_i}, i = 1:n_s, and a pool of unlabeled target domain instances {x^t_i}, i = 1:n_t, denoted by P_u. As shown in Figure 2, the steps of the algorithm are as follows:
1. Learn Q, a shared representation projection matrix from the source and target domains, using any of the existing techniques. SCL is used in this research.
2. Learn C_s on the SCL-based representation of labeled source domain instances {Qx^s_i, y^s_i}.
3. Use C_s to predict labels, ŷ_i, for instances in P_u using the SCL-based representation Qx^t_i. Instances which are predicted with confidence greater than a pre-defined threshold, θ_1, are moved from P_u to P_s with pseudo label ŷ.
4. Learn C_t from instances in P_s ∈ {x^t_i, ŷ^t_i} to incorporate target specific features. P_s only contains instances added in step 3 and grows iteratively (hence the training set here is small).
5. C_s and C_t are combined in an ensemble, E, as a weighted combination with weights w_s and w_t, which are both initialized to 0.5.
6. Ensemble E is applied to all remaining instances in P_u to obtain the label ŷ_i (Eq. 1):
(a) If the ensemble classifies an instance with confidence greater than the threshold θ_2, then it is moved from P_u to P_s along with pseudo label ŷ_i.
7. Update the weights w_s and w_t of the two classifiers (Eqs. 2 and 3), where l is the iteration and sim is the similarity score between domains computed using the cosine similarity metric (Eq. 4), in which a and b are normalized vector representations of the two domains. I(·) is the loss function that measures the errors of the individual classifiers in each iteration (Eq. 5), where η is the learning rate, set to 0.1, l(y, ŷ) = (y − ŷ)^2 is the square loss function, y is the label predicted by the classifier, and ŷ is the label predicted by the ensemble.
8. Re-train classifier C_t on P_s.
9. Repeat steps 6–8 until P_u is empty or the maximum number of iterations is reached.
In this iterative manner, the proposed algorithm transforms unlabeled data in the test domain into pseudo labeled data and progressively learns classifier C_t. Confidence of prediction, α_i for the i-th instance, is measured as the distance from the decision boundary (Hsu et al., 2003), which is computed as shown in Eq. 6,
where R is the un-normalized output from the support vector machine (SVM) classifier, v is the weight vector for support vectors and |v| = v^T v.
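For reference, one set of forms for Eqs. 1 to 6 that is consistent with the definitions in the text is sketched below. Eq. 4 is the standard cosine similarity and Eq. 6 follows from the stated definitions of R and |v|. The exponential form of I(·) in Eq. 5 and the normalized multiplicative update in Eqs. 2 and 3 are assumptions in the style of the OTL framework on which the ensemble is based; they are not expressions confirmed by the text.

```latex
\begin{align}
E(x_i^t) \rightarrow \hat{y}_i &= w^s\, C_s(Q x_i^t) + w^t\, C_t(x_i^t) \tag{1} \\
w^s_{l+1} &= \frac{sim \cdot w^s_l \cdot I(C_s)}
                  {sim \cdot w^s_l \cdot I(C_s) + (1 - sim) \cdot w^t_l \cdot I(C_t)} \tag{2} \\
w^t_{l+1} &= \frac{(1 - sim) \cdot w^t_l \cdot I(C_t)}
                  {sim \cdot w^s_l \cdot I(C_s) + (1 - sim) \cdot w^t_l \cdot I(C_t)} \tag{3} \\
sim &= \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \tag{4} \\
I(\cdot) &= \exp\{-\eta\, l(y, \hat{y})\} \tag{5} \\
\alpha_i &= \frac{R}{|v|} \tag{6}
\end{align}
```

Under these assumed forms the two weights always sum to one, and setting sim = 0.5 removes the influence of similarity from the update, which corresponds to the 'Proposed w/o sim' setting.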
Weights of the individual classifiers in the ensemble are updated with each iteration, which gradually shifts emphasis from the classifier learned on the shared representation to the classifier learned on the target domain. Algorithm 1 illustrates the proposed iterative learning algorithm.

Algorithm 1 Iterative Learning Algorithm
Input: C_s trained on the shared co-occurrence based representation Qx, C_t initiated on the TFIDF representation from P_s, P_u remaining unlabeled target domain instances.
Iterate: l = 0; until P_u = {φ} or l > iterMax
  Process: Construct ensemble E as a weighted combination of C_s and C_t with initial weights w^s_l and w^t_l as 0.5 and sim = similarity between domains.
  for i = 1 to n (size of P_u) do
    Predict labels: E(Qx_i, x_i) → ŷ_i; calculate α_i
    if α_i > θ_2 then
      Remove the i-th instance from P_u and add it to P_s with pseudo label ŷ_i.
    end if
  end for
  Retrain C_t on P_s and update w^s_l and w^t_l.
end iterate
Output: Updated C_t, w^s_l and w^t_l.
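The loop in Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under several assumptions rather than the implementation used in the experiments: classifiers are plain functions returning a signed score, the magnitude of the ensemble score stands in for the confidence α, the loss is averaged over the current pool, and the weight update uses an assumed OTL-style normalized form moderated by sim. All function names and the one-dimensional toy data are hypothetical.

```python
import math

def loss_factor(clf_labels, ens_labels, eta=0.1):
    """I(.) under an assumed form: exp(-eta * mean squared disagreement)."""
    if not clf_labels:
        return 1.0
    sq = [(y - y_hat) ** 2 for y, y_hat in zip(clf_labels, ens_labels)]
    return math.exp(-eta * sum(sq) / len(sq))

def update_weights(w_s, w_t, i_s, i_t, sim):
    """Assumed similarity-moderated, normalized multiplicative update."""
    s = sim * w_s * i_s
    t = (1.0 - sim) * w_t * i_t
    total = s + t
    return (s / total, t / total) if total > 0 else (w_s, w_t)

def sign(score):
    return 1 if score >= 0 else -1

def iterative_adaptation(p_u, p_s, c_s, train_c_t, sim, theta_2, iter_max=30):
    """Grow the pseudo labeled pool p_s and re-train C_t each iteration."""
    w_s, w_t = 0.5, 0.5
    c_t = train_c_t(p_s)
    for _ in range(iter_max):
        if not p_u:
            break
        remaining, y_s, y_t, y_e = [], [], [], []
        for x in p_u:
            score = w_s * c_s(x) + w_t * c_t(x)   # weighted ensemble output
            y_s.append(sign(c_s(x)))
            y_t.append(sign(c_t(x)))
            y_e.append(sign(score))
            if abs(score) > theta_2:              # confident: pseudo label it
                p_s.append((x, sign(score)))
            else:
                remaining.append(x)
        p_u = remaining
        c_t = train_c_t(p_s)                      # re-train on the larger pool
        w_s, w_t = update_weights(
            w_s, w_t, loss_factor(y_s, y_e), loss_factor(y_t, y_e), sim)
    return c_t, w_s, w_t, p_s, p_u

def train_centroid(p_s):
    """Toy target classifier: signed distance from the class-mean midpoint."""
    pos = [x for x, y in p_s if y == 1]
    neg = [x for x, y in p_s if y == -1]
    if not pos or not neg:
        return lambda x: 0.0
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: x - mid

c_t, w_s, w_t, p_s, p_u = iterative_adaptation(
    p_u=[-1.5, 0.4, 0.8, 2.0, 2.5],
    p_s=[(-3.0, -1), (4.0, 1)],
    c_s=lambda x: x,
    train_c_t=train_centroid,
    sim=0.6, theta_2=1.0)
```

In the toy run, three instances cross the threshold in the first iteration, and the two instances closest to the boundary are never pseudo labeled, so the loop ends on the iteration limit rather than on an empty pool.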

Experimental Results
The efficacy of the proposed algorithm is evaluated on different datasets for cross-domain text classification (Blitzer et al., 2007; Dai et al., 2007). In our experiments, performance is evaluated on a two-class classification task and reported in terms of classification accuracy.

Datasets & Experimental Protocol
The first dataset is the Amazon review dataset (Blitzer et al., 2007) which has four different domains, Books, DVDs, Kitchen appliances and Electronics. Each domain comprises 1000 positive and 1000 negative reviews. In all experiments, 1600 labeled reviews from the source and 1600 unlabeled reviews from the target domains are used in training and performance is reported on the non-overlapping 400 reviews from the target domain.
The second dataset is the 20 Newsgroups dataset (Lang, 1995), which is a text collection of approximately 20,000 documents evenly partitioned across 20 newsgroups. For cross-domain text classification on the 20 Newsgroups dataset, we followed the protocol of Dai et al. (2007), where it is divided into six different datasets and the top two categories in each are picked as the two classes. The data is further segregated based on sub-categories, where each sub-category is considered as a different domain. Table 2 lists how different sub-categories are combined to represent the source and target domains. In our experiments, 4/5th of the source and target data is used to learn the shared feature representation and results are reported on the remaining 1/5th of the target data.
The third dataset is a real world dataset comprising tweets about products and services in different domains. The dataset comprises tweets/posts from three collections: Coll1 about gaming, Coll2 about Microsoft products and Coll3 about mobile support. Each collection has 218 positive and negative tweets. These tweets are collected based on user-defined keywords captured in a listening engine, which then crawls social media and fetches comments matching the keywords. This dataset, being noisy and comprising short text, is more challenging than the previous two datasets.
All datasets are pre-processed by converting to lowercase followed by stemming. Feature selection based on document frequency (DF = 5) reduces the number of features and speeds up the classification task. For the Amazon review dataset, TF is used for feature weighting, whereas TFIDF is used for feature weighting in the other two datasets. In all our experiments, the constituent classifiers used in the ensemble are support vector machines (SVMs) with a radial basis function kernel. Performance of the proposed algorithm for the cross-domain classification task is compared with different techniques¹ including 1) an in-domain classifier trained and tested on the same domain data, 2) a baseline classifier which is trained on the source and directly tested on the target domain, 3) SCL², a widely used domain adaptation technique for cross-domain text classification, and 4) 'Proposed w/o sim', obtained by removing similarity from Eqs. 2 & 3.
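The feature selection and weighting steps can be sketched in plain Python as below. This is a simplified illustration: stemming is omitted (it would need a stemmer from an external library), tokenization is whitespace splitting, "DF = 5" is interpreted here as keeping terms that occur in at least `min_df` documents, and the IDF formula log(N / df) is one common variant; all of these are assumptions, and the toy corpus uses `min_df=2` only so that something survives the filter.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (stemming omitted in this sketch)."""
    return text.lower().split()

def document_frequency(docs):
    """Number of documents in which each term occurs."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df

def select_vocabulary(docs, min_df):
    """Keep terms that occur in at least min_df documents."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)

def tf_vector(doc, vocab):
    """Raw term-frequency vector over the selected vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

def tfidf_vectors(docs, vocab):
    """TF-IDF vectors using idf = log(N / df), an assumed variant."""
    df = document_frequency(docs)
    idf = [math.log(len(docs) / df[term]) for term in vocab]
    return [[tf * w for tf, w in zip(tf_vector(doc, vocab), idf)]
            for doc in docs]

docs = ["Great phone great battery", "battery died fast", "great screen"]
vocab = select_vocabulary(docs, min_df=2)
tf = tf_vector(docs[0], vocab)
tfidf = tfidf_vectors(docs, vocab)
```

Terms that appear in only one document are dropped, which shrinks the feature space before the SVMs are trained.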

Results and Analysis
For cross-domain classification, the performance degrades mainly due to 1) feature divergence and 2) negative transfer owing to largely dissimilar domains. Table 3 shows the accuracy of individual classifiers and the ensemble for cross-domain classification on the Amazon review dataset. The ensemble has better accuracy than the individual classifiers; therefore, in our experiments the final reported performance is the accuracy of the ensemble. The combination weights in the ensemble represent the contributions of individual classifiers toward classification accuracy. In our experiments, the maximum number of iterations (iterMax) is set to 30. It is observed that at the end of the iterative learning process, the target specific classifier is assigned more weight mass as compared to the classifier trained on the shared representation. On average, the weights for the two classifiers converge to w_s = 0.22 and w_t = 0.78 at the end of the iterative learning process. This further validates our assertion that the target specific features, which are efficiently captured by the proposed algorithm, are more discriminative than the shared features in classifying target domain instances. Key observations and analysis from the experiments on different datasets are summarized below.
¹ We also compared our performance with the sentiment sensitive thesaurus (SST) proposed by Bollegala et al. (2013), and our algorithm outperformed it on our protocol. However, we did not include comparative results because of the difference in experimental protocol, as SST is tailored for using multiple source domains and our protocol uses a single source domain.
² Our implementation of SCL is used in this paper.

Results on the Amazon Review dataset
To study the effects of different components of the proposed algorithm, comprehensive experiments are performed on the Amazon review dataset³.

1) Effect of learning target specific features:
Results in Figure 3 show that iteratively learning a target specific feature representation (slow transfer as opposed to one-shot transfer) yields better performance across different cross-domain classification tasks as compared to SCL, SFA (Pan et al., 2010)⁴ and the baseline. Unlike SCL and SFA, the proposed approach uses shared and target specific feature representations for the cross-domain classification task. Table 4 illustrates some examples of the target specific discriminative features learned by the proposed algorithm that lead to enhanced performance. At the 95% confidence level, a parametric t-test indicates that the difference between the proposed algorithm and SCL is statistically significant.

2) Effect of similarity on performance:
It is observed that existing domain adaptation techniques enhance the accuracy for cross-domain classification, though negative transfer exists in camouflage. Results in Figure 3(b) (for the case K → B) depict an evident scenario of negative transfer where the adaptation performance with SCL falls below the baseline. However, the proposed algorithm still sustains the performance by transferring knowledge in proportion to the similarity between the two domains. To further analyze the effect of similarity, we segregated the 12 cross-domain classification cases into two categories based on the similarity between the two participating domains, i.e., 1) > 0.5 and 2) < 0.5. Table 5 shows that for the 6 out of 12 cases that fall in the first category, the average accuracy gain is 10.8% as compared to the baseline, while for the remaining 6 cases that fall in the second category, the average accuracy gain is 15.4% as compared to the baseline. This shows that the proposed similarity-based iterative algorithm not only adapts well when the domain similarity is high but also yields gains in accuracy when the domains are largely dissimilar. Figure 4 also shows how the weight for the target domain classifier, w_t, varies with the number of iterations. It further strengthens our assertion that if domains are similar, the algorithm can readily adapt and converges in a few iterations. On the other hand, for dissimilar domains, slow iterative transfer, as opposed to one-shot transfer, can achieve similar performance; however, it may take more iterations to converge. While the effect of similarity on domain adaptation performance is evident, this work opens possibilities for further investigations.
Figure 3: Comparing the performance of the proposed approach with existing techniques for cross-domain classification on the Amazon review dataset.

3) Effect of varying threshold θ_1 & θ_2:
Figure 5(a) explains the effect of varying θ_1 on the final classification accuracy. If θ_1 is low, C_t may get trained on incorrectly predicted pseudo labeled instances; whereas, if θ_1 is high, C_t may be deficient of instances to learn a good decision boundary. On the other hand, θ_2 influences the number of iterations required by the algorithm to reach the stopping criteria. If this threshold is low, the algorithm converges aggressively (in a few iterations) and does not benefit from the iterative nature of learning the target specific features, whereas a high threshold tends to make the algorithm conservative. A high threshold hampers the accuracy because too few instances are available to update the classifier after each iteration, which also leads to a large number of iterations to converge (it may not even converge). θ_1 and θ_2 are set empirically on a held-out set, with values ranging from zero to the distance of the farthest classified instance from the SVM hyperplane (Hsu et al., 2003). The knee-shaped curve on the graphs in Figure 5 shows that there exists an optimal value for θ_1 and θ_2 which yields the best accuracy. We observed that the best accuracy is obtained when the thresholds are set to the distance between the hyperplane and the farthest support vector in each class.
Figure 5: Bar plot shows the % of data that crosses the confidence threshold; the lower and upper parts of each bar represent the % of correctly and wrongly predicted pseudo labels. The black line shows how the final classification accuracy is affected by the threshold.
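The trade-off behind the threshold can be illustrated with a small sketch that, for each candidate threshold, reports how many held-out instances cross it and how many of those are labeled correctly. The function names, the grid of candidates and the toy distances and labels are hypothetical; the signed distance from the hyperplane is assumed to be available from the classifier.

```python
def candidate_thresholds(distances, steps=10):
    """Evenly spaced values from 0 to the largest distance observed."""
    upper = max(abs(d) for d in distances)
    return [upper * i / steps for i in range(steps + 1)]

def pseudo_label_stats(distances, true_labels, theta):
    """Return (fraction of instances crossing theta, accuracy among them).

    The predicted label is the sign of the distance; accuracy is None when
    no instance crosses the threshold.
    """
    selected = [(d, y) for d, y in zip(distances, true_labels)
                if abs(d) > theta]
    if not selected:
        return 0.0, None
    correct = sum(1 for d, y in selected if (1 if d >= 0 else -1) == y)
    return len(selected) / len(distances), correct / len(selected)

# Hypothetical held-out signed distances and their true labels.
distances = [-2.0, -0.5, 0.2, 0.4, 1.0, 2.0]
labels = [-1, 1, -1, 1, 1, 1]

grid = candidate_thresholds(distances, steps=4)
low = pseudo_label_stats(distances, labels, theta=0.0)
mid = pseudo_label_stats(distances, labels, theta=0.6)
high = pseudo_label_stats(distances, labels, theta=2.0)
```

In this toy data a zero threshold admits every instance, including two wrong pseudo labels, a moderate threshold admits half of the instances with no errors, and the maximum threshold admits nothing.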

4) Effect of using different shared representations in ensemble:
To study the generalization ability of the proposed algorithm to different shared representations, experiments are performed using three different shared representations on the Amazon review dataset. Apart from using the SCL representation, the accuracy is compared with the proposed algorithm using two other representations, 1) common features between the two domains ("common") and 2) multiview principal component analysis based representation ("MVPCA") (Ji et al., 2011), as they have previously been used for cross-domain sentiment classification on the same dataset. Table 6 shows that the proposed algorithm yields significant gains in cross-domain classification accuracy with all three representations and is not restricted to any specific representation. The final accuracy depends on the initial classifier trained on the shared representation; therefore, if a shared representation sufficiently captures the characteristics of both source and target domains, the proposed algorithm can be built on any such representation for enhanced cross-domain classification accuracy.
Figure 6 compares the accuracy of the proposed algorithm with existing approaches on the 20 Newsgroups dataset. Since the different domains are crafted out of the sub-categories of the same dataset, the domains are exceedingly similar and therefore the baseline accuracy is relatively better than that on the other two datasets. The proposed algorithm still yields an improvement of at least 10.8% over the baseline accuracy. Compared to other existing domain adaptation approaches such as SCL (Blitzer et al., 2007) and CoCC (Dai et al., 2007), the proposed algorithm performs better by at least 4% and 1.9% respectively. This also validates our assertion that domain adaptation techniques generally perform well when the participating domains are largely similar; however, the similarity aggregation and the iterative learning give the proposed algorithm an edge over one-shot adaptation algorithms.

Results on real world data
Results in Figure 7 exhibit the challenges associated with the real world dataset. The baseline accuracy for the cross-domain classification task is severely affected for this dataset. SCL based domain adaptation does not yield substantial improvements, as selecting the pivot features and computing the co-occurrence statistics with noisy short text is arduous and ineffective. On the other hand, the proposed algorithm iteratively learns discriminative target specific features from such perplexing data and translates this into an improvement of at least 6.4% and 3.5% over the baseline and SCL respectively.
Figure 7: Results comparing the accuracy of the proposed approach with existing techniques for cross-domain categorization on the real world dataset.

Conclusion
The paper presents an iterative similarity-aware domain adaptation algorithm that progressively learns domain specific features from the unlabeled test domain data starting with a shared feature representation. In each iteration, the proposed algorithm assigns pseudo labels to the unlabeled data which are then used to update the constituent classifiers and their weights in the ensemble. Updating the target specific classifier in each iteration helps better learn the domain specific features and thus, results in enhanced cross-domain classification accuracy. Similarity between the two domains is aggregated while updating weights of the constituent classifiers which facilitates gradual shift of knowledge from the source to the target domain. Finally, experimental results for cross-domain classification on different datasets show the efficacy of the proposed algorithm as compared to other existing approaches.