Cross-domain Text Classification with Multiple Domains and Disparate Label Sets

Advances in transfer learning have relaxed the dependence of traditional supervised machine learning algorithms on annotated training data for every new domain. However, several applications encounter scenarios where models need to transfer or adapt across domains whose label sets vary both in the number of labels and in their connotations. This paper presents a first-of-its-kind transfer learning algorithm for cross-domain classification with multiple source domains and disparate label sets. The algorithm starts by identifying transferable knowledge from across multiple domains that can be useful for learning the target domain task. This knowledge, in the form of selected labeled instances from different domains, is congregated to form an auxiliary training set, which is used for learning the target domain task. Experimental results validate the efficacy of the proposed algorithm against strong baselines on a real-world social media dataset and the 20 Newsgroups dataset.


Introduction
A fundamental assumption in supervised statistical learning is that training and test data are independently and identically distributed (i.i.d.) samples drawn from a distribution. Otherwise, good performance on test data cannot be guaranteed even if the training error is low. Transfer learning techniques, on the other hand, allow the domains, tasks, and distributions used in training and testing to be different but related. In contrast to traditional supervised techniques, they work on the principle of transferring learned knowledge across domains. Pan and Yang, in their survey paper (2010), described different transfer learning settings depending on whether domains and tasks vary and whether labeled data is available in one, more, or none of the domains. In this paper, we propose a generic solution for multi-source transfer learning where domains and tasks are different and no labeled data is available in the target domain. This is relatively uncharted territory and arguably a more generic setting of transfer learning.
Motivating example: Consider a social media consulting company helping brands to monitor their social media channels. Two problems typically of interest are: (i) sentiment classification (is a post positive/negative/neutral?) and (ii) subject classification (what was the subject of a post?). While sentiment classification attempts to classify a post based on its polarity, subject classification aims at identifying the subject (or topic) of the post, as illustrated in Figure 1. The company has been using standard classification techniques from an off-the-shelf machine learning toolbox. While the toolbox helps them to create and apply statistical models efficiently, the same model cannot be applied to a new collection due to variations in data distributions across collections. Every task on every collection requires a few hundred manually labeled posts. As social media are extremely high-velocity and low-retention channels, human labeling effort becomes the proverbial bottleneck. What was needed was to reduce, if not eliminate, the human-intensive labeling stage while continuing to use machine learning models for new collections.
Several transfer learning techniques exist in the literature which can reduce the labeling effort required for performing tasks in new collections. Tasks such as sentiment classification, named entity recognition (NER), and part of speech (POS) tagging, which have invariant label sets across domains, have been shown to benefit greatly from these works. On the other hand, tasks like subject classification, which have disparate label sets across domains, have not been able to keep pace with the advances in transfer learning. To address this, we formulate the problem of cross-domain classification with disparate label sets as learning an accurate model for a new unlabeled target domain given labeled data from multiple source domains, where all domains have (possibly) different label sets.
Our contributions: To the best of our knowledge, this is the first work to explore the problem of cross-domain text classification with multiple source domains and disparate label sets. The other contributions of this work include a simple yet efficient algorithm which starts by identifying transferable knowledge from across multiple source domains that is useful for learning the target domain task. Specifically, it identifies relevant class-labels from the source domains such that the instances in those classes can induce class-separability in the target domain. This transferable knowledge is accumulated as an auxiliary training set which, after a suitable transformation of the auxiliary training instances, is used to learn the target domain classification task.
The paper is organized as follows: Section 2 presents the preliminaries and notation, and Section 3 summarizes the related work. Sections 4 and 5 present the proposed algorithm and experimental results respectively. Section 6 concludes the paper.

Preliminaries and Notations
A domain $D = \{\mathcal{X}, P(X)\}$ is characterized by two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, x_2, \ldots, x_n\} \in \mathcal{X}$. A task $T = \{\mathcal{Y}, f(\cdot)\}$ also consists of two components: a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$.
In our setting for cross-domain classification with disparate label sets, we assume $M$ source domains, denoted as $D_{S_i}$, where $i \in \{1, 2, \ldots, M\}$. Each source domain has a different marginal distribution, i.e. $P(X_{S_i}) \neq P(X_{S_j})$, and a different label space, i.e. $Y_{S_i} \neq Y_{S_j}$, for all $i \neq j$. The label spaces across domains vary both in the count of class-labels and in their connotations; however, a finite set of labeled instances is available from each source domain. The target domain ($D_T$) consists of a finite set of unlabeled instances, denoted as $t_i$ where $i \in \{1, \ldots, N\}$. Let $Y_T$ be the target domain label space with $K$ class-labels. We assume that the number of classes in the target domain, i.e. $K$, is known (analogous to clustering where the number of clusters is given). Table 1 summarizes different settings of transfer learning (Pan and Yang, 2010) and how this work differs from the existing literature. The first scenario represents the ideal setting of traditional machine learning (Mitchell, 1997), where a model is trained on a fraction of labeled data and performs well for the same task on future unseen instances from the same domain.

Related Work
The second scenario, where the domains vary while the tasks remain the same, is referred to as transductive transfer learning. This is the most extensively studied setting in the transfer learning literature and can be broadly categorized into single and multi-source adaptation. Single source adaptation (Chen et al., 2009; Ando and Zhang, 2005; Daumé III, 2009) primarily aims at minimizing the divergence between the source and target domains at either the instance or the feature level. The general idea is to identify a suitable low dimensional space where the transformed source and target domain data follow similar distributions, so that a standard supervised learning algorithm can be trained (Daumé III, 2009; Jiang and Zhai, 2007; Pan et al., 2011; Blitzer et al., 2007; Dai et al., 2007; Bhatt et al., 2015).
While several existing single source adaptation techniques can be extended to multi-source adaptation, the literature on multi-source adaptation can be broadly categorized as: 1) feature representation approaches (Chattopadhyay et al., 2012; Sun et al., 2011; Duan et al., 2009; Duan et al., 2012; Bollegala et al., 2013; Crammer et al., 2008; Mansour et al., 2009; Ben-David et al., 2010; Bhatt et al., 2016) and 2) approaches that combine pre-trained classifiers (Schweikert and Widmer, 2008; Sun and Shi, 2013; Xu and Sun, 2012). Our work differs in that it intelligently exploits selective transferable knowledge from multiple sources, unlike existing approaches where multiple sources contribute in a brute-force manner.
The third scenario, where the tasks differ irrespective of the relationship among domains, is referred to as inductive transfer learning. Self-taught learning (Raina et al., 2007) and multi-task learning (Jiang, 2009; Maurer et al., 2012; Xu et al., 2015; Kumar and Daume III, 2012) are the two main learning paradigms in this scenario, and Table 1 differentiates our work from these.

Table 1: Settings of transfer learning and how this work differs from the existing literature.

| Scenario | Data availability | Learning paradigm | Main idea of existing work | How this work differs |
|---|---|---|---|---|
| Second (transductive) | Unlabeled data from the target domain; $P(X_S) \neq P(X_T)$ | Multi-source adaptation | Classifier combination; efficient combination of information from multiple sources; feature representation | Intelligent selection of transferable knowledge from multiple sources for adaptation |
| Third (inductive) | Unlabeled data in source domain(s) and labeled data in target domain | Self-taught learning | Extracts higher level representations from unlabeled auxiliary data to learn instance-to-label mapping with labeled target instances | Learns instance-to-label mapping in the unlabeled target domain using multiple labeled source domains having different data distributions and label spaces |
| Third (inductive) | Labeled data is available in all domains | Multi-task learning | Simultaneously learns multiple tasks within (or across) domain(s) by exploiting the common feature subspace shared across the tasks | Learns the optimal class distribution in an unlabeled target domain by minimizing the differences with multiple labeled source domains |
| Fourth | Labeled data in source and target domains | Transfer learning with disparate label set | Disparate fine-grained label sets across domains; however, the same coarse-grained label set can be invoked across domains | No coarse-to-fine label mapping due to heterogeneity of label sets; assumes no labeled data in target domain |

This work closely relates to the fourth scenario, where we allow domains to vary in the marginal probability distributions and the tasks to vary due to different label spaces. The closest prior work, by Kim et al. (2015), addresses a sequential labeling problem in NLU where the fine-grained label sets across domains differ. However, they assume that there exists a bijective mapping between the coarse and fine-grained label sets across domains. They learn this mapping using labeled instances from the target domain to reduce the problem to a standard domain adaptation problem (Scenario 2). In contrast, this work assumes neither a coarse-to-fine label mapping nor labeled data in the target domain, as elaborated in the next sections.

Exploiting Multiple Domains
If we had the mappings between the source and target domain label sets, we could have leveraged existing transfer learning approaches. However, the heterogeneity of label sets across domains and the absence of labels in the target domain exacerbate the problem. Our objective is to leverage the knowledge from multiple source domains to induce class-separability in the target domain. Inducing class-separability refers to segregating the target domain into $K$ classes using labeled instances from $K$ selected source domain classes.
Towards this, the proposed algorithm divides each source domain into clusters (groups) based on the class-labels, such that instances with the same label are grouped in one cluster. All source domains together are divided into $Q$ clusters, where $Q = \sum_{m=1}^{M} \|Y_m\|$ is the count of class-labels across all sources and $\|Y_m\|$ is the count of class-labels in the $m$-th source domain. $C_q$ denotes the $q$-th cluster and $\mu_q$ denotes its centroid, computed as the average of all the members in the cluster. We assert that the target domain instances that have high similarity to a particular source domain cluster can be grouped together. Given $N$ target domain instances and $Q$ source domain clusters, a matrix $R$ (dimension $N \times Q$) is computed based on the similarity of the target domain instances with the source clusters. The $i$-th row of the matrix captures the similarity of the $i$-th target domain instance ($t_i$) with all the source domain clusters. It captures how different source domain class-labels are associated with the target domain instances and hence can induce class-separability in the target domain.
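The construction of the matrix can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes dense vector representations of the instances, uses cosine similarity (the measure named in the algorithm summary), and the function names `centroid`, `cosine` and `similarity_matrix` as well as the `(domain, label)` cluster keys are hypothetical.

```python
import math

def centroid(vectors):
    """Average of all member vectors of one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(targets, clusters):
    """Build the N x Q matrix R.

    targets  : list of N target instance vectors
    clusters : dict mapping a (domain, label) key to the list of labeled
               source vectors carrying that label (one cluster per label)
    Returns the cluster keys (column order) and R as a list of rows.
    """
    keys = sorted(clusters)
    centroids = [centroid(clusters[k]) for k in keys]
    R = [[cosine(t, mu) for mu in centroids] for t in targets]
    return keys, R

# Toy data: two source clusters, three target instances
clusters = {("A", "x"): [[1.0, 0.0], [1.0, 0.0]],
            ("B", "y"): [[0.0, 2.0], [0.0, 4.0]]}
keys, R = similarity_matrix([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]], clusters)
```

Each column of `R` corresponds to one source class-label, so Q grows with the total number of labels across all sources rather than with the number of source domains.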

Extracting Transferable Knowledge
The similarity matrix R associates target domain instances to the source domain clusters in proportion to their similarity. However, the objective is to select the optimal K source domain clusters that fit the maximum number of target domain instances. This problem is similar to the well-known combinatorial optimization problem of Maximum Coverage (Vazirani, 2003) where, given a collection of P sets, we need to select A sets (A < P) such that the size of the union of the selected sets is maximized. In this paper, we are given Q source domain clusters and need to select K clusters such that the corresponding number of associated target domain instances is maximized. As the Maximum Coverage problem is NP-hard, we implement a greedy algorithm for selecting the K source domain clusters, as illustrated in Algorithm 1.

Algorithm 1 Selecting K Source Clusters
Input: A matrix R, K = number of target domain classes, l = number of selected clusters.
Initialize: l = 0; normalize R such that each row sums to 1.
repeat:
1: Pick the column in R which has the maximum sum of similarity scores for uncovered target domain instances.
2: Mark elements in the chosen column as covered.
3: l = l + 1
until: l = K
Output: K source domain clusters.
A source domain contributes partially in terms of zero or more class-labels (clusters) identified using Algorithm 1. Therefore, we refer to the labeled instances from the selected clusters of a source domain as the partial transferable knowledge from that domain. This partial transferable knowledge from across multiple source domains is congregated to form an auxiliary training set, referred to as AUX.
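A greedy selection in the spirit of Algorithm 1 might look as follows. This is a sketch under stated assumptions rather than the authors' code: the text does not say when a target instance counts as "covered", so here an instance is treated as covered once the cluster holding its largest normalized similarity has been chosen. That coverage rule, the function name `select_source_clusters`, and the requirement that rows of R are non-negative with a positive sum are all assumptions of this example.

```python
def select_source_clusters(R, K):
    """Greedily pick K columns (source clusters) of the N x Q matrix R.

    Assumption: a target instance is 'covered' once the column holding
    its largest normalized similarity has been selected.
    """
    n, q = len(R), len(R[0])
    # Normalize each row to sum to 1 (assumes non-negative rows, sum > 0)
    norm = []
    for row in R:
        total = sum(row)
        norm.append([value / total for value in row])
    # Column with the highest similarity for each target instance
    best = [max(range(q), key=row.__getitem__) for row in norm]

    uncovered = set(range(n))
    selected = []
    for _ in range(K):
        candidates = [c for c in range(q) if c not in selected]
        # Column with the maximum similarity mass over uncovered instances
        col = max(candidates, key=lambda c: sum(norm[i][c] for i in uncovered))
        selected.append(col)
        uncovered -= {i for i in uncovered if best[i] == col}
    return selected

R = [[0.9, 0.1, 0.0],
     [0.8, 0.2, 0.0],
     [0.1, 0.1, 0.8],
     [0.0, 0.3, 0.7]]
chosen = select_source_clusters(R, K=2)  # columns 0 and 2 for this toy matrix
```

As with greedy Maximum Coverage in general, each step is locally optimal only; the selected set is an approximation of the best K clusters, not a guaranteed optimum.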

Adapting to the Target Domain
The auxiliary training set comprises labeled instances from the K selected source domain clusters. Since the auxiliary set is pulled out from multiple source domains, it follows a different data distribution from the target domain. For a classifier trained on the K-class auxiliary training set, the distributional variations have to be normalized so that it can generalize well on the target domain.
In this research, we propose to use an instance weighting technique (Jiang and Zhai, 2007) to minimize the distributional variations by differentially weighting instances in the auxiliary set. Intuitively, the auxiliary training instances similar to the target domain are assigned higher weights while training the classifier and vice versa. The weight for the $i$-th instance in the auxiliary set should be proportional to the ratio $\frac{P_t(x_i)}{P_a(x_i)}$. However, since the actual probability distributions ($P_a(x)$ and $P_t(x)$ for the auxiliary set and target domain respectively) are unknown, the instance difference is approximated as $\frac{P_t(x_i \mid d = \text{target})}{P_a(x_i \mid d = \text{auxiliary})}$, where $d$ is a random variable used to represent whether $x_i$ came from the auxiliary set or the target domain. To calculate this ratio, a binary classifier is trained using the auxiliary set and target domain data with labels {-1} and {+1} respectively. The predicted probabilities from the classifier are used to estimate the ratio as the weight for the $i$-th auxiliary instance $x_i$. Finally, a K-class classifier is trained on the weighted auxiliary training set to perform classification on the target domain data.
C: Adapting to target domain:
1: Minimize distributional variations using the instance weighting technique.
2: Train a K-class classifier using AUX.
Output: K-class target domain classifier.
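The weighting step can be illustrated with a small domain classifier. The sketch below is an assumption-laden stand-in: it swaps in a plain logistic regression trained by gradient descent (the paper reports an SVM with RBF kernel for this step), and the function names and hyperparameters are hypothetical. It relies on Bayes' rule: the density ratio equals the posterior odds P(d = target | x) / P(d = auxiliary | x) multiplied by the prior ratio P(d = auxiliary) / P(d = target), which is estimated here from the set sizes.

```python
import math

def train_domain_classifier(aux, target, epochs=200, lr=0.5):
    """Logistic regression separating auxiliary (0) from target (1) instances.

    Stand-in for the binary domain classifier; not the paper's RBF-kernel SVM.
    """
    data = [(x, 0.0) for x in aux] + [(x, 1.0) for x in target]
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # P(d = target | x)
            for j, xj in enumerate(x):
                grad_w[j] += (p - y) * xj
            grad_b += p - y
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b

def instance_weights(aux, target):
    """Weight each auxiliary instance by its estimated target/auxiliary ratio."""
    w, b = train_domain_classifier(aux, target)
    prior_ratio = len(aux) / len(target)  # P(d = auxiliary) / P(d = target)
    weights = []
    for x in aux:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        weights.append(prior_ratio * math.exp(z))  # odds p / (1 - p) = exp(z)
    return weights

aux = [[0.0], [0.2], [1.0]]
target = [[0.9], [1.1], [1.2]]
weights = instance_weights(aux, target)  # the instance at 1.0 gets the largest weight
```

The resulting weights would then be passed as per-instance weights to whichever K-class learner is trained on the auxiliary set.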

Algorithm
As shown in Figure 2, the step-by-step flow of the proposed algorithm is summarized below:
1. Divide the $M$ source domains into $Q$ clusters, each represented as $C_q$, $q \in \{1, 2, \ldots, Q\}$.
2. Compute the centroid of each cluster as the average of the cluster members, as shown in Eq. 1:
$$\mu_q = \frac{1}{\|C_q\|} \sum_{x_i \in C_q} x_i \qquad (1)$$
where $\mu_q$ is the centroid, $\|C_q\|$ is the membership count and $x_i$ is the $i$-th member of $C_q$.
3. For each target instance $t_i$, $i \in \{1, \ldots, N\}$, compute the cosine similarity with all the source domain cluster centroids to form the matrix $R$ (dimensions: $N \times Q$), as shown in Eq. 2:
$$R_{iq} = \frac{t_i \cdot \mu_q}{\lVert t_i \rVert_2 \, \lVert \mu_q \rVert_2} \qquad (2)$$
4. Run Algorithm 1 on $R$ to select $K$ optimal source clusters (i.e. columns of $R$).
5. Congregate labeled instances from the selected source domain clusters to form the K-class auxiliary training set.
6. Minimize the divergence between the auxiliary set and target domain using the instance weighting technique, described in Section 4.3.
7. Finally, train a K-class classifier on the differentially weighted auxiliary training instances to perform classification in the target domain.
The K-class classifier trained on the auxiliary training set is an SVM classifier (Chih-Wei Hsu and Lin, 2003) with L2-loss from the LIBLINEAR library (Fan et al., 2008). The classifier used in the instance weighting technique is again an SVM classifier, with an RBF kernel. The proposed algorithm uses a distributional embedding, i.e. Doc2Vec (Le and Mikolov, 2014), to represent instances from the multiple source and target domains. We used an open-source implementation of Doc2Vec (Le and Mikolov, 2014) for learning 400-dimensional vector representations using DBoW.

Experimental Evaluation
Comprehensive experiments are performed to evaluate the efficacy of the proposed algorithm for cross-domain classification with disparate label sets across domains on two datasets.

Datasets
The first dataset is a real-world Online Social Media (OSM) dataset which consists of 74 collections. Each collection comprises comments/tweets that are collected based on user-defined keywords. These keywords are fed to a listening engine which crawls social media (i.e. Twitter.com) and fetches comments matching the keywords. The task is to classify the comments in a collection into user-defined categories. These user-defined categories may vary across collections in terms of count as well as their connotations. Table 2 shows an example of the user-defined categories for a few collections related to "Apple" products. In the experiments, one collection is used as the unlabeled target collection and the remaining collections are used as the labeled source collections. We randomly selected 5 target collections to report the performance, as described in Table 3.
The second dataset is the 20 Newsgroups (NG) (Lang, 1995) dataset, which comprises 20,000 news articles organized into 6 groups with different sub-groups, both in terms of count as well as connotations, as shown in Figure 3(a). Two different experiments are performed on this dataset. In the first experiment ("Exp-1"), one group is considered as the target domain and the remaining 5 groups as the source domains. In the second experiment ("Exp-2"), one sub-group from each of the first five groups is randomly selected to synthesize a target domain, while all the groups (with the remaining sub-groups) are used as source domains. Figure 3(b) shows an example of how to synthesize target domains in "Exp-2". There are 720 possible target domains in this experiment and we report the average performance across all possible target domains, referred to as "Grp 7". The task in both experiments is to categorize the target domain into its K categories (sub-groups) using labeled data from multiple source domains.
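The count of 720 target domains in "Exp-2" can be checked with a short calculation. The sub-group counts below are an assumption: they follow the commonly used six-way grouping of 20 Newsgroups (the grouping itself is given in Figure 3(a), which is not reproduced here), and the group names are only illustrative labels.

```python
from math import prod

# Assumed sub-group counts of the five groups that contribute one
# sub-group each to a synthesized target domain (common six-way split)
subgroup_counts = {
    "comp": 5,
    "rec": 4,
    "sci": 4,
    "talk.politics": 3,
    "religion": 3,
}

# One sub-group is drawn independently from each group, so the number
# of distinct target domains is the product of the counts
n_target_domains = prod(subgroup_counts.values())
```

Under this assumed grouping the product is 5 × 4 × 4 × 3 × 3 = 720, which agrees with the number stated above.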

Evaluation Metric
The performance is reported in terms of classification accuracy on the target domain. There is no definite mapping between the actual class-labels in the target domain and the K categories (i.e. induced categories) in the auxiliary training set. Therefore, we sequentially evaluate all possible one-to-one mappings between the K categories in the auxiliary training set and target domain to report results for the best performing mapping.
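The mapping search can be sketched as an exhaustive loop over permutations. This is an illustrative sketch, with the function name `best_mapping_accuracy` and its argument names being hypothetical; it enumerates all K! one-to-one mappings, which is practical only for small K.

```python
from itertools import permutations

def best_mapping_accuracy(true_labels, predicted, target_classes, induced_classes):
    """Accuracy under the best one-to-one mapping of induced to target classes.

    true_labels     : ground-truth target class per instance
    predicted       : induced (auxiliary-set) class per instance
    target_classes  : the K target class-labels
    induced_classes : the K induced class-labels
    """
    best_acc, best_map = -1.0, None
    for perm in permutations(target_classes):
        mapping = dict(zip(induced_classes, perm))
        correct = sum(mapping[p] == t for p, t in zip(predicted, true_labels))
        acc = correct / len(true_labels)
        if acc > best_acc:
            best_acc, best_map = acc, mapping
    return best_acc, best_map

truth = ["a", "a", "b", "c"]
acc, mapping = best_mapping_accuracy(truth, [1, 1, 2, 0], ["a", "b", "c"], [0, 1, 2])
```

For larger K, the same optimum could be found without enumeration by solving an assignment problem on the K × K confusion matrix, for example with SciPy's `linear_sum_assignment`.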

Experimental Protocol
The in-domain performance (Gold), i.e. that of a classifier trained and tested on the labeled target domain data, serves as the upper bound (skyline) for the proposed algorithm. We also compared the performance with spherical K-means clustering (Dhillon and Modha, 2001), used to group the target domain data into K categories that are then evaluated against the ground truth, referred to as CL. Spherical K-means clustering is based on cosine similarity and performs better for high-dimensional sparse data such as text.
To compare with a baseline and an existing adaptation algorithm, we selected the most similar source domain with exactly K class-labels and report the performance on the best possible mapping, as described in Section 5.2. To compute the baseline (BL), a classifier trained on the source domain is used to categorize the target domain. A widely used domain adaptation algorithm, namely structural correspondence learning (SCL) (Blitzer et al., 2007), is also applied using the selected source domain.

Results and Analysis
Key observations and analysis from the experimental evaluations are summarized below:

Results on the OSM Dataset
Results in Figure 4 and Table 4 show the efficacy of the proposed algorithm for cross-domain classification with disparate label sets, as it outperforms other approaches by at least 15%. Coll ID(#) refers to the target collection and the corresponding count of class-labels. Results in Table 4 also compare the performance of the proposed technique without the distributional normalization of the auxiliary training set, referred to as "W/O". Results suggest that suitably weighting instances from the auxiliary training set mitigates the distributional variations and enhances the cross-domain performance by at least 3.3%.

Results on the 20Newsgroups Dataset
Results in Table 5 show that the proposed algorithm outperforms other techniques for both the experiments, by at least 15% and 18% respectively, on the 20 Newsgroups dataset. In Table 5, "-" refers to the cases where a single source domain with the same number of class-labels as in the target domain is not available. In "Exp-1", where the source and target categories vary in terms of counts as well as their connotations, the proposed algorithm efficiently induces the classes in the unlabeled target domain using the partial transferable knowledge from multiple sources. For "Exp-2", it is observed that the performance of the proposed algorithm is better than the performance in "Exp-1", as the target categories have closely related categories (from the same group) in the source domains. Table 5 reports the average performance across all the 720 possible combinations of target domains with a standard deviation of 2.6. Table 6 validates our assertion that multiple sources are necessary to induce class-separability in the target domain, as a single source is not sufficient to cater to the heterogeneity of class-labels across domains. It also suggests that the proposed algorithm can learn class-separability in the target domain by using arbitrary diverse class-labels from different sources and does not necessarily require class-labels to follow any sort of coarse-to-fine mapping across domains.
Figure 5: Effects of selected source collections on the OSM dataset.

Effect of Multiple Source Domains
To evaluate the effects of using multiple sources, further experiments were performed by varying the number of available source domains. For the OSM dataset, we varied the number of available source collections from 1 to 73 starting with the most similar source collection and repeatedly adding the next most similar collection in the pool of available collections. We observe that even the most similar collection was not independently sufficient to induce classes in the target collection and it was favorable to exploit multiple collections. Moreover, adding collections based on similarity to the target collection had a better likelihood of achieving higher performance as compared to adding random collections.
In another experiment, we first identified the source collections which contributed to learning the target task. We removed these collections and applied the proposed algorithm on the remaining source collections. Figure 5 shows the performance of the proposed algorithm on 5 such iterations of removing the contributing source collections from the previous iteration. We observed a significant drop in the performance with each iteration, which signifies the effectiveness of the proposed algorithm in extracting highly discriminating transferable knowledge from multiple sources.

Comparing with Domain Adaptation
We applied domain adaptation techniques considering the auxiliary training set to be a single source domain with the same number of classes as that in the target domain. We applied two widely used domain adaptation techniques, namely SCL (Blitzer et al., 2007) and SFA, referred to as "AuxSCL" and "AuxSFA" respectively. Results in Table 7 suggest that the proposed algorithm significantly outperforms "AuxSCL" and "AuxSFA" on the two datasets. Generally, existing domain adaptation techniques are built on the co-occurrences of the common features with the domain specific features and hence capture how domain specific features in one domain behave with respect to the domain specific features in the other domain. They assume homogeneous labels and expect the aligned features across domains to behave similarly for the prediction task. However, these features are misaligned when the label sets across domains vary in terms of their connotations.

Effect of Different Representations
The proposed algorithm uses Doc2Vec (Le and Mikolov, 2014) for representing instances from multiple domains. However, the proposed algorithm can build on different representations and hence we compare its performance with a traditional TF-IDF representation (including unigrams and bigrams) and a dense representation using TF-IDF+PCA (reduced to a dimension such that it covers 90% of the variance). We observe that the Doc2Vec representation clearly outperforms the other two representations, as it addresses the drawbacks of bag-of-n-grams models by implicitly inheriting the semantics of the words in a document and offering a more generalizable, concise vector representation.

Conclusions
This paper presented the first study on cross-domain text classification in the presence of multiple domains with disparate label sets and proposed a novel algorithm for the same. It proposed to extract partial transferable knowledge from across multiple source domains, which was beneficial for inducing class-separability in the target domain. The transferable knowledge was assimilated in terms of selected labeled instances from different source domains to form a K-class auxiliary training set. Finally, a classifier was trained using this auxiliary training set, following a distribution-normalizing instance weighting technique, to perform the classification task in the target domain. The efficacy of the proposed algorithm for cross-domain classification across disparate label sets will expand the horizon for ML-based algorithms to be more widely applicable in more general and practically observed scenarios.