Unsupervised Domain Adaptation with Feature Embeddings

Representation learning is the dominant technique for unsupervised domain adaptation, but existing approaches often require the specification of "pivot features" that generalize across domains, which are selected by task-specific heuristics. We show that a novel but simple feature embedding approach provides better performance, by exploiting the feature template structure common in NLP problems.


INTRODUCTION
Domain adaptation is crucial if natural language processing is to be successfully employed in high-impact application areas such as social media, patient medical records, and historical texts. Unsupervised domain adaptation is particularly appealing, since it requires no labeled data in the target domain. Some of the most successful approaches to unsupervised domain adaptation are based on representation learning: transforming sparse high-dimensional surface features into dense vector representations, which are often more robust to domain shift (Blitzer et al., 2006; Glorot et al., 2011). However, these methods are computationally expensive (sometimes employing thousands of dense features), and often require special task-specific heuristics to select good "pivot features".
We present FEMA (Feature EMbeddings for domain Adaptation), a novel representation learning approach for domain adaptation in structured feature spaces. Like prior work in representation learning, FEMA learns dense features that are more robust to domain shift. However, FEMA diverges from previous approaches based on reconstructing pivot features; instead, it uses techniques from neural language models to directly obtain low-dimensional embeddings. FEMA outperforms prior work on adapting POS tagging from the Penn Treebank to web text.

LEARNING FEATURE EMBEDDINGS
Feature co-occurrence statistics are the primary source of information driving many unsupervised methods for domain adaptation. For example, both Structural Correspondence Learning (SCL; Blitzer et al., 2006) and Denoising Autoencoders (Chen et al., 2012) learn to reconstruct a subset of "pivot features", as shown in Figure 1(a). The reconstruction function is then employed to project each instance into a dense representation, which will hopefully be better suited to cross-domain generalization. The pivot features are chosen to be both predictive of the label and general across domains. Meeting these two criteria requires task-specific heuristics. Furthermore, the pivot features correspond to a small subspace of the feature co-occurrence matrix. We face a tradeoff between the amount of feature co-occurrence information that we can use and the computational complexity of representation learning and downstream training.
We avoid this tradeoff by inducing low-dimensional feature embeddings directly. We exploit the tendency of many NLP tasks to divide features into templates, with exactly one active feature per template (Smith, 2011); this is shown in the center of Figure 1. Rather than treating each instance as an undifferentiated bag-of-features, we exploit this template structure to induce feature embeddings: dense representations of individual features. Each embedding is selected to help predict the features that fill out the other templates; see Figure 1(b). The embeddings for each active feature are then concatenated together across templates, giving a dense representation for the entire instance. The embeddings are trained with the skip-gram model (Mikolov et al., 2013), a simple yet efficient method for learning word embeddings. The training objective is to find feature embeddings that are useful for predicting the other active features in the instance. For instance n ∈ {1 . . . N} and feature template t ∈ {1 . . . T}, we denote by f_n(t) the index of the active feature; for example, in the instance shown in Figure 1, f_n(t) = 'new' when t indicates the previous-word template. The skip-gram approach induces distinct "input" and "output" embeddings for each feature, written u_{f_n(t)} and v_{f_n(t)}, respectively. The role of these embeddings can be seen in the negative sampling objective,

ℓ_n = Σ_{t=1}^{T} Σ_{t′≠t} [ log σ(u_{f_n(t)} · v_{f_n(t′)}) + k · E_{f∼P_{t′}^{(n)}} log σ(−u_{f_n(t)} · v_f) ],

where t and t′ are feature templates, k is the number of negative samples, P_{t′}^{(n)} is a noise distribution for template t′, and σ is the sigmoid function.
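The negative-sampling training procedure described above can be sketched with a toy SGD loop. This is only an illustration, not the paper's implementation (which builds on gensim); the feature indices, dimensions, learning rate, and uniform noise distribution below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: 20 features total, 8-dimensional embeddings, k = 2 negative samples.
n_features, dim, k, lr = 20, 8, 2, 0.1
U = 0.01 * rng.standard_normal((n_features, dim))  # "input" embeddings u_f
V = 0.01 * rng.standard_normal((n_features, dim))  # "output" embeddings v_f

def sgd_step(active, noise_dist):
    """One negative-sampling update: each template's active feature is trained
    to predict the active features of the *other* templates."""
    for t, f in enumerate(active):
        for t2, f2 in enumerate(active):
            if t2 == t:
                continue
            # Positive pair (f, f2): gradient of -log sigma(u . v)
            g = sigmoid(U[f] @ V[f2]) - 1.0
            U[f], V[f2] = U[f] - lr * g * V[f2], V[f2] - lr * g * U[f]
            # k negative samples from the template's noise distribution:
            # gradient of -log sigma(-u . v)
            for neg in rng.choice(n_features, size=k, p=noise_dist):
                g = sigmoid(U[f] @ V[neg])
                U[f], V[neg] = U[f] - lr * g * V[neg], V[neg] - lr * g * U[f]

# One toy instance: features 3, 7, and 12 are active in templates 1..3.
noise = np.full(n_features, 1.0 / n_features)  # uniform noise, for simplicity
for _ in range(500):
    sgd_step(active=(3, 7, 12), noise_dist=noise)

# Features that co-occur across templates should now score higher together
# than features that never co-occur.
print(sigmoid(U[3] @ V[7]) > sigmoid(U[3] @ V[0]))
```

In a real setting the loop would iterate over all instances in the source and target corpora, and the noise distribution would differ per template.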
Feature embeddings can be applied to domain adaptation by learning embeddings of all features on the union of the source and target data sets. The dense feature vector for each instance is obtained by concatenating the feature embeddings for each template. Finally, since it has been shown that nonlinearity is important for generating robust representations (Bengio et al., 2013), we follow Chen et al. (2012) and apply the hyperbolic tangent function to the embeddings. The augmented representation x_n^{(aug)} of instance n is the concatenation of the original feature vector and the transformed embeddings of its active features,

x_n^{(aug)} = x_n ⊕ tanh(u_{f_n(1)}) ⊕ · · · ⊕ tanh(u_{f_n(T)}),

where ⊕ is vector concatenation.
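Building this augmented representation is a single concatenation. A minimal sketch, with a made-up embedding table and dimensions (T = 3 templates, d = 4 dimensions, 10 features):

```python
import numpy as np

T, d = 3, 4
U = np.arange(10 * d, dtype=float).reshape(10, d) / 10.0  # toy embedding table u_f
x_n = np.zeros(10)
x_n[[3, 7, 9]] = 1.0     # sparse surface features: one active feature per template
active = [3, 7, 9]       # f_n(t) for t = 1..T

# x_aug = x_n concatenated with tanh(u_{f_n(t)}) for each template t
x_aug = np.concatenate([x_n] + [np.tanh(U[f]) for f in active])
print(x_aug.shape)  # (10 + T*d,) = (22,)
```

The sparse surface features are kept alongside the dense part, so the downstream classifier can still exploit exact lexical matches.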

EXPERIMENTS
We evaluate FEMA on part-of-speech (POS) tagging: adaptation of English POS tagging from news text to web text, as in the SANCL shared task (Petrov & McDonald, 2012).

EXPERIMENT SETUP
Datasets. We use data from the SANCL shared task (Petrov & McDonald, 2012), which contains several web-related corpora (newsgroups, reviews, weblogs, answers, emails) as well as the WSJ portion of the OntoNotes corpus (Hovy et al., 2006). Following Schnabel & Schütze (2014), we use sections 02-21 of the WSJ for training and section 22 for development, and use 100,000 unlabeled WSJ sentences from 1988 for learning representations. On the web text side, each of the five target domains has an unlabeled training set of 100,000 sentences, along with development and test sets of about 1,000 labeled sentences each.

SVM tagger. While POS tagging is classically treated as a structured prediction problem, we follow Schnabel & Schütze (2014) in taking a classification-based approach. Specifically, we apply a support vector machine (SVM) classifier, adding dense features from FEMA (and from the alternative representation learning techniques) to the set of basic features. We use the sixteen basic feature templates introduced by Ratnaparkhi (1996). Feature embeddings are learned for all lexical and affix features, yielding a total of thirteen embeddings per instance.
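In the classification-based setting, each token becomes one instance with template-structured features. The extractor below is a simplified illustration; these six templates are stand-ins of our own, not the actual sixteen Ratnaparkhi (1996) templates.

```python
def token_features(sent, i):
    """Extract template-structured features for the token at position i.
    Exactly one feature is active per template."""
    w = sent[i]
    return {
        "cur-word": w.lower(),
        "prev-word": sent[i - 1].lower() if i > 0 else "<s>",
        "next-word": sent[i + 1].lower() if i + 1 < len(sent) else "</s>",
        "prefix-2": w[:2].lower(),
        "suffix-2": w[-2:].lower(),
        "contains-digit": any(c.isdigit() for c in w),
    }

sent = ["The", "new", "mayor", "spoke"]
print(token_features(sent, 2))
```

Lexical and affix templates like these are exactly the ones for which feature embeddings are learned; the resulting instances can then be fed to any classifier, such as a linear SVM.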
Competitive systems. We consider two competitive unsupervised domain adaptation methods that require pivot features: Structural Correspondence Learning (SCL; Blitzer et al., 2006) and marginalized Denoising Autoencoders (mDA; Chen et al., 2012). We use structured dropout noise for mDA (Yang & Eisenstein, 2014). We also directly compare with word2vec word embeddings, and with a baseline approach in which we simply train on the source domain data using the surface features and then test on the target domain. Aside from our own implemented methods, we compare against the published results of FLORS (Schnabel & Schütze, 2014), which uses distributional features for domain adaptation. We also report the results of Schnabel & Schütze for the Stanford POS Tagger, a maximum entropy Markov model (MEMM) tagger.
Parameter tuning. All hyper-parameters are tuned on development data. Following Blitzer et al. (2006), we consider 6918 pivot features that appear more than 50 times in all the domains for SCL and mDA. The best parameters for SCL are dimensionality K = 50 and rescale factor α = 5.
For both FEMA and word2vec, the best embedding size is 100 and the best number of negative samples is 5. The noise distribution P_t^{(n)} is simply the unigram probability of each feature in template t. Mikolov et al. (2013b) argue for exponentiating the unigram distribution, but we find it makes little difference here. The window size for word embeddings is set to 5.
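A minimal sketch of such a per-template unigram noise distribution; the `power` parameter is included only to show the exponentiation variant, and the function and variable names are our own.

```python
from collections import Counter

def noise_distribution(template_features, power=1.0):
    """Unigram noise distribution over the features of one template:
    P_t(f) proportional to count(f) ** power. power=1.0 gives the plain
    unigram distribution; power=0.75 is the exponentiated variant of
    Mikolov et al. (2013b)."""
    counts = Counter(template_features)
    weights = {f: c ** power for f, c in counts.items()}
    z = sum(weights.values())
    return {f: w / z for f, w in weights.items()}

# Toy corpus of fillers for the previous-word template:
prev_words = ["the", "the", "new", "a", "the"]
P = noise_distribution(prev_words)
print(P["the"])  # 3/5 = 0.6
```

One such distribution is built per template, so negative samples for (say) the suffix template are always drawn from other suffixes.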

RESULTS
As shown in Tables 1 and 2, FEMA outperforms competitive systems on all target domains except REVIEW, where FLORS performs slightly better. FLORS uses more basic features than FEMA; these features could in principle be combined with feature embeddings for better performance. Compared with the other representation learning approaches, FEMA is roughly 1% better on average, corresponding to an error reduction of 10%. Its training time is approximately 70 minutes on a 24-core machine, using an implementation based on gensim. This is slightly faster than SCL, although slower than mDA with structured dropout noise.

RELATED WORK
Representation learning approaches to domain adaptation seek cross-domain representations, which were first induced via auxiliary prediction problems (Ando & Zhang, 2005), such as the prediction of pivot features (Blitzer et al., 2006). In these approaches, as well as in later work on denoising autoencoders (Chen et al., 2012), the key mechanism is to learn a function that predicts a subset of features for each instance, based on the other features of the instance.
Word embeddings can be viewed as a special case of representation learning, where the goal is to learn representations for each word, and then to supply these representations in place of lexical features (Turian et al., 2010).

CONCLUSION
Feature embeddings can be used for domain adaptation in any problem involving feature templates. They offer strong performance, avoid practical drawbacks of alternative representation learning approaches, and are easy to learn using existing word embedding methods.

Figure 1: Representation learning techniques in structured feature spaces.

Table 1: Accuracy results for adaptation from WSJ to Web Text on the SANCL dev set.

Table 2: Accuracy results for adaptation from WSJ to Web Text on the SANCL test set.