A Hassle-Free Unsupervised Domain Adaptation Method Using Instance Similarity Features

We present a simple yet effective unsupervised domain adaptation method that can be generally applied to different NLP tasks. Our method uses unlabeled target domain instances to induce a set of instance similarity features. These features are then combined with the original features to represent labeled source domain instances. Using three NLP tasks, we show that our method consistently outperforms a few baselines, including SCL, an existing general unsupervised domain adaptation method widely used in NLP. More importantly, our method is very easy to implement and incurs much less computational cost than SCL.


Introduction
Domain adaptation aims to use labeled data from a source domain to help build a system for a target domain, possibly with a small amount of labeled data from the target domain. The problem arises when the target domain has a different data distribution from the source domain, which is often the case. In NLP, domain adaptation has been well studied in recent years. Existing work has proposed both techniques designed for specific NLP tasks (Chan and Ng, 2007; Daumé III and Jagarlamudi, 2011; Yang et al., 2012; Plank and Moschitti, 2013; Hu et al., 2014; Nguyen and Grishman, 2014) and general approaches applicable to different tasks (Blitzer et al., 2006; Daumé III, 2007; Jiang and Zhai, 2007; Dredze and Crammer, 2008; Titov, 2011). With the recent trend of applying deep learning in NLP, deep learning-based domain adaptation methods (Glorot et al., 2011; Chen et al., 2012; Yang and Eisenstein, 2014) have also been adopted for NLP tasks (Yang and Eisenstein, 2015).
There are generally two settings of domain adaptation. We use supervised domain adaptation to refer to the setting when a small amount of labeled target data is available, and when no such data is available during training we call it unsupervised domain adaptation.
Although many domain adaptation methods have been proposed, simple approaches are often more attractive to practitioners who, for lack of machine learning background or of resources, wish to avoid implementing or tuning sophisticated or computationally expensive methods. A notable example is the frustratingly easy domain adaptation method proposed by Daumé III (2007), which simply augments the feature space by duplicating features in a clever way. However, this method is only suitable for supervised domain adaptation. A later semi-supervised version of this easy adaptation method uses unlabeled data from the target domain (Daumé III et al., 2010), but it still requires some labeled data from the target domain. In this paper, we propose a general unsupervised domain adaptation method that is almost equally hassle-free but does not use any labeled target data.
Our method uses a set of unlabeled target instances to induce a new feature space, which is then combined with the original feature space. We explain analytically why the new feature space may help domain adaptation. Using a few different NLP tasks, we then empirically show that our method can indeed learn a better classifier for the target domain than a few baselines. In particular, our method performs consistently better than or competitively with Structural Correspondence Learning (SCL) (Blitzer et al., 2006), a well-known unsupervised domain adaptation method in NLP. Furthermore, compared with SCL and other advanced methods such as the marginalized structured dropout method (Yang and Eisenstein, 2014) and a recent feature embedding method (Yang and Eisenstein, 2015), our method is much easier to implement.
In summary, our main contribution is a simple, effective and theoretically justifiable unsupervised domain adaptation method for NLP problems.

Adaptation with Similarity Features
We first introduce the notation needed for presenting our method. Without loss of generality, we assume a binary classification problem where each input is represented as a feature vector $x$ from an input vector space $X$ and the output is a label $y \in \{0, 1\}$. This assumption is general because many NLP tasks such as text categorization, NER and relation extraction can be cast into classification problems and our discussion below can be easily extended to multi-class settings. We further assume that we have a set of labeled instances from a source domain, denoted by $D_s = \{(x^s_i, y^s_i)\}_{i=1}^{N}$. We also have a set of unlabeled instances from a target domain, denoted by $D_t = \{x^t_j\}_{j=1}^{M}$. We assume a general setting of learning a linear classifier, which is essentially a weight vector $w$ such that $x$ is labeled as 1 if $w^\top x \geq 0$. A naive method is to simply learn a classifier from $D_s$. The goal of unsupervised domain adaptation is to make use of both $D_s$ and $D_t$ to learn a good $w$ for the target domain. It has to be assumed that the source and the target domains are similar enough such that adaptation is possible.

The Method
Our method works as follows. We first randomly select a subset of target instances from $D_t$ and normalize them. We refer to the resulting vectors as exemplar vectors, denoted by $E = \{e^{(k)}\}_{k=1}^{K}$. Next, we transform each source instance $x$ into a new feature vector by computing its similarity with each $e^{(k)}$, as defined below:
$$g(x) = [s(x, e^{(1)}), \ldots, s(x, e^{(K)})]^\top \quad (1)$$
where $\top$ indicates transpose and $s(x, x')$ is a similarity function between $x$ and $x'$. In our work we use dot product as $s$. Once each labeled source domain instance is transformed into a $K$-dimensional vector by Equation 1, we can append this vector to the original feature vector of the source instance and use the combined feature vectors of all labeled source instances to train a classifier. To apply this classifier to the target domain, each target instance also needs to be augmented with this $K$-dimensional induced feature vector.
It is worth noting that the exemplar vectors are randomly chosen from the available target instances and no special trick is needed. Overall, the method is fairly easy to implement, and yet as we will see in Section 3, it performs surprisingly well. We also want to point out that our instance similarity features bear strong similarity to what was proposed by Sun and Lam (2013), but their work addresses a completely different problem and we developed our method independently of their work.
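The steps of the method can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes sparse feature vectors stored as `{feature: value}` dictionaries, takes "normalize" to mean scaling to unit L2 norm, and uses hypothetical names (`select_exemplars`, `augment`, the `__isf_k` feature keys) that do not appear in the paper.

```python
import math
import random

def dot(u, v):
    """Dot product of two sparse vectors stored as {feature: value} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(val * v.get(feat, 0.0) for feat, val in u.items())

def l2_normalize(x):
    """Scale x to unit L2 norm (assumed normalization; zero vectors are copied as-is)."""
    norm = math.sqrt(sum(val * val for val in x.values()))
    if norm == 0.0:
        return dict(x)
    return {feat: val / norm for feat, val in x.items()}

def select_exemplars(target_unlabeled, k, seed=0):
    """Randomly pick k unlabeled target instances and normalize them."""
    rng = random.Random(seed)
    return [l2_normalize(x) for x in rng.sample(target_unlabeled, k)]

def augment(x, exemplars):
    """Append the K similarity features of Equation 1 to the original features of x."""
    out = dict(x)
    for k, e in enumerate(exemplars):
        # Key name is hypothetical; it only has to avoid clashing with real features.
        out[f"__isf_{k}"] = dot(x, e)
    return out

# Toy usage with made-up bag-of-words instances.
target = [{"gene": 1.0, "protein": 1.0}, {"protein": 1.0, "yeast": 1.0}, {"yeast": 2.0}]
source = [{"gene": 1.0, "mouse": 1.0}]
exemplars = select_exemplars(target, k=2)
train_vectors = [augment(x, exemplars) for x in source]   # used to train the classifier
test_vectors = [augment(x, exemplars) for x in target]    # same transform at test time
```

The same exemplar set has to be used for source and target instances, since the classifier's weights for the induced features are tied to specific exemplars.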

Justification
In this section, we provide some intuitive justification for our method without any theoretical proof.

Learning in the Target Subspace
Blitzer et al. (2011) pointed out that the hope of unsupervised domain adaptation is to "couple" the learning of weights for target-specific features with that of common features. We show our induced feature representation is exactly doing this.
First, we review the claim by Blitzer et al. (2011). We note that although the input vector space $X$ is typically high-dimensional for NLP tasks, the actual space where input vectors lie can have a lower dimension because of the strong feature dependence we observe with NLP tasks. For example, binary features defined from the same feature template such as the previous word are mutually exclusive. Furthermore, the actual low-dimensional spaces for the source and the target domains are usually different because of domain-specific features and distributional difference between the domains. Borrowing the notation used by Blitzer et al. (2011), define subspace $X_s$ to be the (lowest dimensional) subspace of $X$ spanned by all source domain input vectors. Similarly, a subspace $X_t$ can be defined. Define $X_{s,t} = X_s \cap X_t$, the shared subspace between the two domains. Define $X_{s,\perp}$ to be the subspace that is orthogonal to $X_{s,t}$ but together with $X_{s,t}$ spans $X_s$, that is, $X_{s,\perp} + X_{s,t} = X_s$. Similarly we can define $X_{\perp,t}$. Essentially $X_{s,t}$, $X_{s,\perp}$ and $X_{\perp,t}$ are the shared subspace and the domain-specific subspaces, and they are mutually orthogonal.
We can project any input vector $x$ into the three subspaces defined above as follows:
$$x = x_{s,t} + x_{s,\perp} + x_{\perp,t}.$$
Similarly, any linear classifier $w$ can be decomposed into $w_{s,t}$, $w_{s,\perp}$ and $w_{\perp,t}$, and
$$w^\top x = w_{s,t}^\top x_{s,t} + w_{s,\perp}^\top x_{s,\perp} + w_{\perp,t}^\top x_{\perp,t}.$$
For a naive method that simply learns $w$ from $D_s$, the learned component $w_{\perp,t}$ will be 0, because the component $x_{\perp,t}$ of any source instance is 0, and therefore the training error would not be reduced by any non-zero $w_{\perp,t}$. Moreover, any non-zero $w_{s,\perp}$ learned from $D_s$ would not be useful for the target domain because for all target instances we have $x_{s,\perp} = 0$. So for a $w$ learned from $D_s$, only its component $w_{s,t}$ is useful for domain transfer. Blitzer et al. (2011) argue that with unlabeled target instances, we can hope to "couple" the learning of $w_{\perp,t}$ with that of $w_{s,t}$. We show that if we use only our induced feature representation without appending it to the original feature vector, we can achieve this. We first define a matrix $M_E$ whose column vectors are the exemplar vectors from $E$. Then $g(x)$ can be rewritten as $M_E^\top x$. Let $w'$ denote a linear classifier learned from the transformed labeled data. $w'$ makes predictions based on $w'^\top M_E^\top x$, which is the same as $(M_E w')^\top x$. This shows that the learned classifier $w'$ for the induced features is equivalent to a linear classifier $w = M_E w'$ for the original features.
It is not hard to see that $M_E w'$ is essentially $\sum_k w'_k e^{(k)}$, i.e. a linear combination of vectors in $E$. Because each $e^{(k)}$ comes from $X_t$, it decomposes as $e^{(k)} = e^{(k)}_{s,t} + e^{(k)}_{\perp,t}$, and we can write
$$w = \sum_k w'_k e^{(k)}_{s,t} + \sum_k w'_k e^{(k)}_{\perp,t} = w_{s,t} + w_{\perp,t}.$$
There are two things to note from the formula above. (1) The learned classifier $w$ does not have any component in the subspace $X_{s,\perp}$, which is good because such a component would not be useful for the target domain.
(2) The learned $w_{\perp,t}$ is unlikely to be zero because its learning is "coupled" with the learning of $w_{s,t}$ through $w'$. In effect, we pick up target-specific features that correlate with useful common features.
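The equivalence between a classifier on the induced features and a classifier on the original features can be checked numerically. The sketch below uses a made-up six-dimensional toy space in which, purely for illustration, the subspaces are axis-aligned: dimensions 0-1 are source-specific, 2-3 are shared and 4-5 are target-specific. The exemplars, weights and helper names are all hypothetical.

```python
def dot(u, v):
    """Dot product of two dense vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))

def induced_features(x, exemplars):
    """g(x): one dot product per exemplar, i.e. the matrix-vector product of M_E transposed with x."""
    return [dot(x, e) for e in exemplars]

def equivalent_weights(w_prime, exemplars):
    """w = M_E w' = sum_k w'_k e^(k), a weight vector in the original feature space."""
    dim = len(exemplars[0])
    return [sum(wk * e[d] for wk, e in zip(w_prime, exemplars)) for d in range(dim)]

# Target exemplars have no mass on the source-specific dimensions 0 and 1.
exemplars = [[0.0, 0.0, 1.0, 0.0, 2.0, 0.0],
             [0.0, 0.0, 0.0, 1.0, 0.0, 3.0]]
w_prime = [0.5, -1.0]                # weights learned on the induced features (made up)
x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0]   # a source instance: no target-specific mass

score_induced = dot(w_prime, induced_features(x, exemplars))
w = equivalent_weights(w_prime, exemplars)
score_original = dot(w, x)

print(score_induced, score_original)  # the two scores agree
print(w)                              # zero on dims 0-1, non-zero on dims 4-5
```

In this toy case the equivalent weight vector is zero on the source-specific dimensions and non-zero on the target-specific ones, even though the source instance itself carries no target-specific features, which mirrors the two observations above.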
In practice, however, we need to append the induced features to the original features to achieve good adaptation results. One may find this counter-intuitive because this results in an expanded instead of restricted hypothesis space. Our explanation is that because of the typical $L_2$ regularizer used during training, there is an incentive to shift the weight mass to the additional induced features. The need to combine the induced features with original features was also reported in previous domain adaptation work such as SCL (Blitzer et al., 2006) and marginalized denoising autoencoders (Chen et al., 2012).

Reduction of Domain Divergence
Another theory on domain adaptation developed by Ben-David et al. (2010) essentially states that we should use a hypothesis space that can achieve low error on the source domain while at the same time making it hard to separate source and target instances. If we use only our induced features, then $X_{s,\perp}$ is excluded from the hypothesis space. This is likely to make it harder to distinguish source and target instances. To verify this, in Table 1 we show the following errors based on three feature representations:
(1) The training error on the source domain ($\epsilon_s$).
(2) The classification error when we train a classifier to separate source and target instances.
(3) The error on the target domain using the classifier trained from the source domain ($\epsilon_t$).
ISF− means that only our induced instance similarity features are used, while ISF uses the combined feature vectors. The results show that ISF achieves relatively low $\epsilon_s$ and increases the domain separation error. These two factors lead to a reduction in $\epsilon_t$.
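For reference, the theory of Ben-David et al. (2010) is commonly summarized by a bound of the following form, written here in population terms with the finite-sample terms omitted. The symbols $d_{\mathcal{H}\Delta\mathcal{H}}$ and $\lambda$ follow the usual presentation of that bound and are not notation used elsewhere in this paper.

```latex
% For every hypothesis h in the hypothesis space H:
\epsilon_t(h) \;\le\; \epsilon_s(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(D_s, D_t)
  \;+\; \lambda,
\qquad
\lambda \;=\; \min_{h' \in \mathcal{H}} \bigl[\epsilon_s(h') + \epsilon_t(h')\bigr]
```

The divergence term is small when no classifier in the hypothesis space can reliably tell source instances from target instances, which is what the domain separation error in item (2) is meant to probe.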

Difference from EA++
The easy domain adaptation method EA proposed by Daumé III (2007) has later been extended to a semi-supervised version EA++ (Daumé III et al., 2010), where unlabeled data from the target domain is also used. Theoretical justifications for both EA and EA++ have been given in prior work. Here we briefly discuss how our method is different from EA++ in terms of using unlabeled data. In both EA and EA++, since labeled target data is available, the algorithms still learn two classifiers, one for each domain. In our algorithm, we only learn a single classifier using labeled data from the source domain. In EA++, unlabeled target data is used to construct a regularizer that brings the two classifiers of the two domains closer. Specifically, the regularizer defines a penalty if the source classifier and the target classifier make different predictions on an unlabeled target instance. However, with this regularizer, EA++ does not strictly restrict either the source classifier or the target classifier to lie in the target subspace $X_t$. In contrast, as we have pointed out above, when only the induced features are used, our method leverages the unlabeled target instances to force the learned classifier to lie in $X_t$.

Tasks and Data Sets
We consider the following NLP tasks.
Personalized Spam Filtering (Spam): The data set comes from the ECML/PKDD 2006 Discovery Challenge. The goal is to adapt a spam filter trained on a common pool of 4000 labeled emails to three individual users' personal inboxes, each containing 2500 emails. We use bag-of-words features for this task, and we report classification accuracy.
Gene Name Recognition (NER): The data set comes from BioCreAtIvE Task 1B (Hirschman et al., 2005). It contains three sets of Medline abstracts with labeled gene names. Each set corresponds to a single species (fly, mouse or yeast). We consider domain adaptation from one species to another. We use standard NER features including words, POS tags, prefixes/suffixes and contextual features. We report F1 scores for this task.
Relation Extraction (Relation): We use the ACE2005 data where the annotated documents are from several different sources such as broadcast news and conversational telephone speech. We report the F1 scores of identifying the 7 major relation types. We use standard features including entity types, entity head words, contextual words and other syntactic features derived from parse trees.

Methods for Comparison
Naive uses the original features.
Common uses only features commonly seen in both domains.
SCL is our implementation of Structural Correspondence Learning (Blitzer et al., 2006). We set the number of induced features to 50 based on preliminary experiments. For pivot features, we follow the setting used by Blitzer et al. (2006) and select the features with a term frequency more than 50 in both domains.
PCA uses principal component analysis on $D_t$ to obtain $K$-dimensional induced feature vectors and then appends them to the original feature vectors.
ISF is our method using instance similarity features. We first transform each training instance to a K-dimensional vector according to Equation 1 and then append the vector to the original vector.
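As an illustration of the simplest adaptation baseline in this list, the Common setting can be sketched as a feature filter. The paper does not state the exact criterion for "commonly seen"; the sketch below assumes a feature qualifies if it occurs at least once in each domain, and the function names are hypothetical.

```python
def common_features(source, target):
    """Features observed at least once in both domains (assumed criterion)."""
    seen_in_source = {feat for x in source for feat in x}
    seen_in_target = {feat for x in target for feat in x}
    return seen_in_source & seen_in_target

def restrict(x, keep):
    """Drop every feature of the sparse vector x that is not in `keep`."""
    return {feat: val for feat, val in x.items() if feat in keep}

# Toy usage with made-up sparse instances.
source = [{"gene": 1.0, "mouse": 1.0}]
target = [{"gene": 1.0, "yeast": 1.0}]
keep = common_features(source, target)
filtered_source = [restrict(x, keep) for x in source]
```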
For all three NLP tasks and all the methods above that we compare, we employ the logistic regression (a.k.a. maximum entropy) classification algorithm with $L_2$ regularization to train a classifier, which means the loss function is the cross entropy error. We use the L-BFGS optimization algorithm to optimize our objective function.
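For concreteness, the training objective described here can be written in its standard binary form as follows. The regularization strength $\lambda$ and the exact scaling of the penalty are assumptions, since the paper does not specify them; $x_i$ stands for whichever feature representation the method under comparison uses (for ISF, the original features with the induced features appended).

```latex
\min_{w}\;
  -\sum_{i=1}^{N} \Bigl[\, y_i \log \sigma(w^\top x_i)
      + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i)\bigr) \Bigr]
  \;+\; \frac{\lambda}{2}\, \lVert w \rVert_2^2,
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}
```

Predicting label 1 when $\sigma(w^\top x) \geq 0.5$ is the same as predicting 1 when $w^\top x \geq 0$, which matches the decision rule given earlier.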

Results
In Table 2, we show the comparison between our method and Naive, Common and SCL. For ISF, the parameter K is set to 100 for Spam, 50 for NER and 500 for Relation after tuning. As we can see from the table, Common, which removes source domain specific features during training, can sometimes improve the classification performance, but this is not consistent and the improvement is small. SCL can improve the performance in most settings for all three tasks, which confirms the general effectiveness of this method. For our method ISF, we can see that on average it outperforms both Naive and SCL significantly. When we zoom into the different source-target domain pairs of the three tasks, we can see that ISF outperforms SCL in most of the cases. This shows that our method is competitive despite its simplicity. It is also worth pointing out that SCL incurs much more computational cost than ISF.
We next compare ISF with PCA. Because PCA is also expensive, we only managed to run it on the Spam task. Table 3 shows that ISF also outperforms PCA significantly.

Table 2: Comparison of performance on three NLP tasks. For each source-target pair of each task, the performance shown is the average of 5-fold cross validation. We also report the overall average performance for each task. We tested statistical significance only for the overall average performance and found that ISF was significantly better than both Naive and SCL with p < 0.05 (indicated by **) based on the Wilcoxon signed-rank test.

Conclusions
We presented a hassle-free unsupervised domain adaptation method. The method is simple to implement, fast to run and yet effective for a few NLP tasks, outperforming SCL, a widely used unsupervised domain adaptation method. We believe the proposed method can benefit a large number of practitioners who prefer simple methods to sophisticated domain adaptation methods.