Neural Structural Correspondence Learning for Domain Adaptation

We introduce a neural network model that marries together ideas from two prominent strands of research on domain adaptation through representation learning: structural correspondence learning (SCL, (Blitzer et al., 2006)) and autoencoder neural networks (NNs). Our model is a three-layer NN that learns to encode the non-pivot features of an input example into a low dimensional representation, so that the existence of pivot features (features that are prominent in both domains and convey useful information for the NLP task) in the example can be decoded from that representation. The low-dimensional representation is then employed in a learning algorithm for the task. Moreover, we show how to inject pre-trained word embeddings into our model in order to improve generalization across examples with similar pivot features. We experiment with the task of cross-domain sentiment classification on 16 domain pairs and show substantial improvements over strong baselines.


Introduction
Many state-of-the-art algorithms for Natural Language Processing (NLP) tasks require labeled data. Unfortunately, annotating sufficient amounts of such data is often costly and labor intensive. Consequently, for many NLP applications even resource-rich languages like English have labeled data in only a handful of domains.
Domain adaptation (Daumé III, 2007; Ben-David et al., 2010), training an algorithm on labeled data taken from one domain so that it can perform properly on data from other domains, is therefore recognized as a fundamental challenge in NLP. Indeed, over the last decade domain adaptation methods have been proposed for tasks such as sentiment classification (Bollegala et al., 2011b), POS tagging (Schnabel and Schütze, 2013), syntactic parsing (Reichart and Rappoport, 2007; McClosky et al., 2010; Rush et al., 2012) and relation extraction (Jiang and Zhai, 2007; Bollegala et al., 2011a), to name just a handful of applications and works.
Leading recent approaches to domain adaptation in NLP are based on Neural Networks (NNs), and particularly on autoencoders (Glorot et al., 2011; Chen et al., 2012). These models are believed to extract features that are robust to cross-domain variations. However, while they excel on benchmark domain adaptation tasks such as cross-domain product sentiment classification (Blitzer et al., 2007), the reasons for this success are not entirely understood.
In the pre-NN era, a prominent approach to domain adaptation in NLP, and particularly in sentiment classification, has been structural correspondence learning (SCL) (Blitzer et al., 2006, 2007). Following the auxiliary problems approach to semi-supervised learning (Ando and Zhang, 2005), this method identifies correspondences among features from different domains by modeling their correlations with pivot features: features that are frequent in both domains and are important for the NLP task. Non-pivot features from different domains which are correlated with many of the same pivot features are assumed to correspond, providing a bridge between the domains. Elegant and well motivated as it may be, SCL has not been state-of-the-art since the neural approaches took over.
In this paper we marry these approaches, proposing NN models inspired by ideas from both.
Particularly, our basic model receives the non-pivot features of an input example, encodes them into a hidden layer and then, instead of decoding the input layer as an autoencoder would do, it aims to decode the pivot features. Our more advanced model is identical to the basic one, except that the decoding matrix is not learned but is rather replaced with a fixed matrix consisting of pre-trained embeddings of the pivot features. Under this model the probability of the i-th pivot feature appearing in an example is a (non-linear) function of the dot product of the feature's embedding vector and the network's hidden layer vector. As explained in Section 3, this approach encourages the model to learn similar hidden layers for documents that have different pivot features, as long as these features have similar meaning. In sentiment classification, for example, although one positive review may use the unigram pivot feature excellent while another positive review uses the pivot great, as long as the embeddings of pivot features with similar meaning are similar (as expected from high quality embeddings), the hidden layers learned for both documents are biased to be similar. We experiment with the task of cross-domain product sentiment classification of Blitzer et al. (2007), consisting of 4 domains (12 domain pairs), and further add an additional target domain, consisting of sentences extracted from social media blogs (16 domain pairs in total). For pivot feature embedding in our advanced model, we employ the word2vec algorithm (Mikolov et al., 2013). Our models substantially outperform strong baselines: the SCL algorithm, the marginalized stacked denoising autoencoder (MSDA) model (Chen et al., 2012) and the MSDA-DAN model (Ganin et al., 2016), which combines the power of MSDA with a domain adversarial network (DAN).

Background and Contribution
Domain adaptation is a fundamental, long standing problem in NLP (e.g. (Roark and Bacchiani, 2003;Chelba and Acero, 2004;Daume III and Marcu, 2006)). The challenge stems from the fact that data in the source and the target domains are often distributed differently, making it hard for a model trained in the source domain to make valuable predictions in the target domain.
Domain adaptation has various setups, differing with respect to the amounts of labeled and unlabeled data available in the source and target domains. The setup we address, commonly referred to as unsupervised domain adaptation, is the one where both domains have ample unlabeled data, but only the source domain has labeled training data.
Here, we discuss works that, like ours, take the representation learning path. Most works under this approach follow a two-step protocol: First, the representation learning method (be it SCL, an autoencoder network, our proposed network model or any other model) is trained on unlabeled data from both the source and the target domains. Then, a classifier for the supervised task (e.g. sentiment classification) is trained in the source domain and this trained classifier is applied to test examples from the target domain. Each input example of the task classifier, at both training and test time, is first run through the representation model of the first step, and the induced representation is fed to the classifier. Recently, end-to-end models that jointly learn to represent the data and to perform the classification task have also been proposed. We compare our models to one such method (MSDA-DAN, (Ganin et al., 2016)).
Below, we first discuss two prominent ideas in feature representation learning: pivot features and autoencoder neural networks. We then summarize our contribution in light of these approaches.
Pivot and Non-Pivot Features The definitions of this approach are given in Blitzer et al. (2006, 2007), where SCL is presented in the context of POS tagging and sentiment classification, respectively. Fundamentally, the method divides the shared feature space of both the source and the target domains into the set of pivot features, which are frequent in both domains and are prominent in the NLP task, and a complementary set of non-pivot features. In this section we abstract away from the actual feature space and its division into pivot and non-pivot subsets. In Section 4 we discuss this issue in the context of sentiment classification. For representation learning, SCL employs the pivot features in order to learn mappings from the original feature space of both domains to a shared, low-dimensional, real-valued feature space. This is done by training classifiers whose input consists of the non-pivot features of an input example, and whose binary classification task (the auxiliary task) is to predict, one classifier per pivot feature, whether the pivot associated with the classifier appears in the input example or not. These classifiers are trained on unlabeled data from both the target and the source domains: the training supervision occurs naturally in the data, and no human annotation is required. The matrix consisting of the weight vectors of these classifiers is then post-processed with singular value decomposition (SVD), to facilitate final compact representations. The SVD-derived matrix serves as a transformation matrix which maps feature vectors in the original space into a low-dimensional real-valued feature space.
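The pipeline above (per-pivot auxiliary classifiers followed by SVD over their stacked weight matrix) can be sketched as follows. This is an illustrative numpy sketch under simplifying assumptions (batch logistic-regression updates, toy data, a hypothetical dimensionality k), not the original SCL implementation:

```python
import numpy as np

def train_scl_projection(X_np, X_pivot, k=50, lr=0.1, epochs=100):
    """Minimal SCL sketch: one linear predictor per pivot feature, trained to
    predict pivot presence from the non-pivot features; SVD over the stacked
    weight vectors then yields the low-dimensional projection matrix.
    X_np:    (n_docs, n_nonpivot) binary non-pivot indicators
    X_pivot: (n_docs, n_pivots)   binary pivot indicators
    Returns theta: (n_nonpivot, k) mapping into the shared space."""
    n, d = X_np.shape
    m = X_pivot.shape[1]
    W = np.zeros((m, d))
    for _ in range(epochs):
        # one batch gradient step of logistic regression for all pivots at once
        P = 1.0 / (1.0 + np.exp(-X_np @ W.T))        # (n, m) predicted pivot probs
        W += lr * (X_pivot - P).T @ X_np / n
    # top-k right singular vectors of the weight matrix span the shared space
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:k].T                                   # (d, k)

# toy usage: project documents into the shared low-dimensional space
rng = np.random.default_rng(0)
X_np = (rng.random((20, 30)) < 0.3).astype(float)
X_pivot = (rng.random((20, 5)) < 0.3).astype(float)
theta = train_scl_projection(X_np, X_pivot, k=4)
low_dim = X_np @ theta   # (20, 4) SCL-induced features
```

The low-dimensional vectors are then concatenated with the original features and fed to the task classifier, as described above.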
Numerous works have employed the SCL method in particular and the concept of pivot features for domain adaptation in general. A prominent method is spectral feature alignment (SFA, (Pan et al., 2010)). This method aims to align domain-specific (non-pivot) features from different domains into unified clusters, with the help of domain-independent (pivot) features as a bridge.
Recently, Gouws et al. (2012) and Bollegala et al. (2015) implemented ideas related to those described here within an NN for cross-domain sentiment classification. For example, the latter work trained a word embedding model so that for every document, regardless of its domain, pivots are good predictors of non-pivots, and the pivots' embeddings are similar across domains. Yu and Jiang (2016) presented a convolutional NN that learns sentence embeddings using two auxiliary tasks (whether the sentence contains a positive or a negative domain independent sentiment word), purposely avoiding prediction with respect to a large set of pivot features. In contrast to these works our model can learn useful cross-domain representations for any type of input example and in our cross-domain sentiment classification experiments it learns document level embeddings. That is, unlike Bollegala et al. (2015) we do not learn word embeddings and unlike Yu and Jiang (2016) we are not restricted to input sentences.
Autoencoder NNs An autoencoder is comprised of an encoder function h and a decoder function g, typically with the dimension of h smaller than that of its argument. The reconstruction of an input x is given by r(x) = g(h(x)). Autoencoders are typically trained to minimize a reconstruction error loss(x, r(x)). Example loss functions are the squared error, the Kullback-Leibler (KL) divergence and the cross entropy of elements of x and elements of r(x). The last two loss functions are appropriate options when the elements of x or r(x) can be interpreted as probabilities of a discrete event. In Section 3 we get back to this point when defining the cross-entropy loss function of our model. Once an autoencoder has been trained, one can stack another autoencoder on top of it, by training a second model which sees the output of the first as its training data (Bengio et al., 2007). The parameters of the stack of autoencoders describe multiple representation levels for x and can feed a classifier, to facilitate domain adaptation.
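A minimal autoencoder of this form, with a cross-entropy reconstruction loss over binary inputs, might look as follows. This is a sketch with hypothetical dimensions and random initialization, shown only to make the r(x) = g(h(x)) structure concrete:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyAutoencoder:
    """One-hidden-layer autoencoder with a cross-entropy reconstruction loss."""
    def __init__(self, d_in, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.We = rng.normal(0, 0.1, (d_hidden, d_in))  # encoder weights
        self.Wd = rng.normal(0, 0.1, (d_in, d_hidden))  # decoder weights

    def reconstruct(self, x):
        h = sigmoid(self.We @ x)        # encoder h(x)
        return sigmoid(self.Wd @ h)     # decoder g(h(x)) = r(x)

    def loss(self, x):
        # cross-entropy between the binary input x and its reconstruction r(x)
        r = np.clip(self.reconstruct(x), 1e-9, 1 - 1e-9)
        return -np.sum(x * np.log(r) + (1 - x) * np.log(1 - r))

ae = TinyAutoencoder(d_in=10, d_hidden=3)
x = np.array([1., 0., 1., 0., 1., 0., 1., 0., 1., 0.])
r = ae.reconstruct(x)
```

Stacking, as described above, would train a second such model on the hidden vectors h produced by the first.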
Recent prominent models for domain adaptation for sentiment classification are based on a variant of the autoencoder called Stacked Denoising Autoencoders (SDA, (Vincent et al., 2008)). In a denoising autoencoder (DAE) the input vector x is stochastically corrupted into a vector x̃, and the model is trained to minimize a denoising reconstruction error loss(x, r(x̃)). SDA for cross-domain sentiment classification was implemented by Glorot et al. (2011). Later, Chen et al. (2012) proposed the marginalized SDA (MSDA) model, which is more computationally efficient and scalable to high-dimensional feature spaces than SDA.
Marginalization of denoising autoencoders has gained interest since MSDA was presented. Yang and Eisenstein (2014) showed how to improve efficiency further by exploiting noising functions designed for structured feature spaces, which are common in NLP. More recently, Clinchant et al. (2016) proposed an unsupervised regularization method for MSDA based on the work of Ganin and Lempitsky (2015) and Ganin et al. (2016).
There is recent interest in models based on variational autoencoders (Kingma and Welling, 2014; Rezende et al., 2014) for domain adaptation, for example the variational fair autoencoder model (Louizos et al., 2016). However, these models are still not competitive with MSDA on the tasks we consider here.

Our Contribution
We propose an approach that marries the above lines of work. Our model is similar in structure to an autoencoder. However, instead of reconstructing the input x from the hidden layer h(x), its reconstruction function r receives a low dimensional representation of the non-pivot features of the input (h(x np ), where x np is the non-pivot representation of x (Section 3)) and predicts whether each of the pivot features appears in this example or not. As far as we know, we are the first to exploit the mutual strengths of pivot-based methods and autoencoders for domain adaptation.

Neural SCL Models
We propose two models: the basic Autoencoder SCL (AE-SCL, Section 3.2), which directly integrates ideas from autoencoders and SCL, and the elaborated Autoencoder SCL with Similarity Regularization (AE-SCL-SR, Section 3.3), in which pre-trained word embeddings are integrated into the basic model.

Definitions
We denote the feature set in our problem with f, the subset of pivot features with f p ⊆ {1, . . . , |f|} and the subset of non-pivot features with its complement, f np = {1, . . . , |f|} \ f p. We further denote the feature representation of an input example X with x. Following this notation, the vector of pivot features of X is denoted with x p while the vector of non-pivot features is denoted with x np.
In order to learn a robust and compact feature representation for X we will aim to learn a non-linear prediction function from x np to x p. As discussed in Section 4, the task we experiment with is cross-domain sentiment classification. Following previous work (e.g. Blitzer et al., 2006, 2007; Chen et al., 2012), our feature representation consists of binary indicators for the occurrence of word unigrams and bigrams in the represented document. In what follows we hence assume that the feature representation x of an example X is a binary vector, and hence so are x p and x np.
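Under these definitions, building the binary vectors x p and x np for a document from fixed, ordered pivot and non-pivot feature lists can be sketched as follows (the feature lists below are toy examples invented for illustration):

```python
import numpy as np

def ngram_features(doc_tokens):
    """Unigram and bigram string features of a tokenized document."""
    unis = list(doc_tokens)
    bis = [f"{a}_{b}" for a, b in zip(doc_tokens, doc_tokens[1:])]
    return set(unis + bis)

def binary_vectors(doc_tokens, pivots, nonpivots):
    """Binary indicator vectors x_p and x_np for one document."""
    feats = ngram_features(doc_tokens)
    x_p = np.array([1.0 if f in feats else 0.0 for f in pivots])
    x_np = np.array([1.0 if f in feats else 0.0 for f in nonpivots])
    return x_p, x_np

doc = "the book was excellent".split()
pivots = ["excellent", "great", "terrible"]            # toy pivot list
nonpivots = ["book", "the_book", "was_excellent", "plot"]  # toy non-pivot list
x_p, x_np = binary_vectors(doc, pivots, nonpivots)
# x_p  -> [1., 0., 0.]
# x_np -> [1., 1., 1., 0.]
```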

Autoencoder SCL (AE-SCL)
In order to solve the prediction problem, we present an NN architecture inspired by autoencoders (Figure 1). Given an input example X with a feature representation x, our fundamental idea is to start from the non-pivot feature representation x np, encode it into an intermediate representation h w h (x np), and, finally, predict with a function r w r (h w h (x np)) the occurrences of the pivot features, x p, in the example.
As is standard in NN modeling, we introduce non-linearity to the model through a non-linear activation function, denoted with σ (the sigmoid function in our models). Consequently we get: h w h (x np) = σ(w h · x np) and r w r (h) = σ(w r · h). In what follows we denote the output of the model with o = r w r (h w h (x np)).
Since the sigmoid function outputs values in the [0, 1] interval, o can be interpreted as a vector of probabilities, with the i-th coordinate reflecting the probability of the i-th pivot feature to appear in the input example. Cross-entropy is hence a natural loss function to jointly reason about all pivots: L(x p, o) = −Σ i [x p i · log(o i) + (1 − x p i) · log(1 − o i)]. As x p is a binary vector, for each pivot feature x p i only one of the two members of the sum that take this feature into account gets a non-zero value. The higher the probability of the correct event (whether or not x p i appears in the input example), the lower the loss.
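The forward pass and loss described above can be written compactly as follows. This is an illustrative numpy sketch of AE-SCL with arbitrary toy dimensions and random weights, not the trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_scl_forward(x_np, w_h, w_r):
    """AE-SCL sketch: encode the non-pivot features, decode pivot probabilities.
    x_np: (d_np,) binary non-pivot vector
    w_h:  (d_h, d_np) encoding matrix;  w_r: (d_p, d_h) decoding matrix."""
    h = sigmoid(w_h @ x_np)     # hidden representation h_{w_h}(x_np)
    o = sigmoid(w_r @ h)        # o_i = P(pivot i appears in the example)
    return h, o

def cross_entropy(x_p, o, eps=1e-9):
    """Joint cross-entropy over all pivots; since x_p is binary, only one of
    the two terms per pivot is non-zero."""
    o = np.clip(o, eps, 1 - eps)
    return -np.sum(x_p * np.log(o) + (1 - x_p) * np.log(1 - o))

rng = np.random.default_rng(1)
w_h = rng.normal(0, 0.1, (5, 12))   # toy: 12 non-pivots, hidden size 5
w_r = rng.normal(0, 0.1, (4, 5))    # toy: 4 pivots
x_np = (rng.random(12) < 0.3).astype(float)
x_p = np.array([1.0, 0.0, 0.0, 1.0])
h, o = ae_scl_forward(x_np, w_h, w_r)
loss = cross_entropy(x_p, o)
```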

Autoencoder SCL with Similarity Regularization (AE-SCL-SR)
An important observation of Blitzer et al. (2007), is that some pivot features are similar to each other to the level that they indicate the same information with respect to the classification task. For example, in sentiment classification with word unigram features, the words (unigrams) great and excellent are likely to serve as pivot features, as the meaning of each of them is preserved across domains. At the same time, both features convey very similar (positive) sentiment information to the level that a sentiment classifier should treat them as equals.
The AE-SCL-SR model is based on two crucial observations. First, in many NLP tasks the pivot features can be pre-embedded into a vector space where pivots with similar meaning have similar vectors. Second, the set of pivot features that appear in an example X i, denoted with f p(X i), is typically much smaller than its complement f̄ p(X i), the set of pivot features that do not appear in it. Hence, if the pivot features of X 1 and X 2 convey the same information about the NLP task (e.g. that the sentiment of both X 1 and X 2 is positive), then even if f p(X 1) and f p(X 2) are not identical, the intersection between the larger sets f̄ p(X 1) and f̄ p(X 2) is typically much larger than the symmetric difference between f p(X 1) and f p(X 2). For instance, consider two examples: X 1, with the single pivot feature f 1 = great, and X 2, with the single pivot feature f 2 = excellent. Crucially, even though X 1 and X 2 differ with respect to the existence of f 1 and f 2, due to the similar meaning of these pivot features we expect both X 1 and X 2 not to contain many other pivot features, such as terrible, awful and mediocre, whose meanings conflict with those of f 1 and f 2.
To exploit these observations, in AE-SCL-SR the reconstruction matrix w r is pre-trained with a word embedding model and is kept fixed during the training and prediction phases of the neural network. Particularly, the i-th row of w r is set to be the vector representation of the i-th pivot feature as learned by the word embedding model. Except for this change, the AE-SCL-SR model is identical to the AE-SCL model described above. Now, denoting the encoding layer for X 1 with h 1, the encoding layer for X 2 with h 2, and the row of w r that corresponds to pivot k i with w r(k i), we expect both σ(w r(k i) · h 1) and σ(w r(k i) · h 2) to get low values (i.e. values close to 0) for the conflicting pivot features k i: pivots whose meanings conflict with those of f p(X 1) and f p(X 2). By fixing the representations of similar conflicting features to similar vectors, AE-SCL-SR provides a strong bias for h 1 and h 2 to be similar, as its only way to bias the predictions with respect to these features to be low is by pushing h 1 and h 2 to be similar. Consequently, under AE-SCL-SR the vectors that encode the non-pivot features of documents with similar pivot features are biased to be similar to each other. As mentioned in Section 4, the vector h̃ = σ −1 (h) = w h x np forms the feature representation that is fed to the sentiment classifier to facilitate domain adaptation. By definition, when h 1 and h 2 are similar, so are their h̃ 1 and h̃ 2 counterparts.
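The AE-SCL-SR variant differs from AE-SCL only in that the decoding matrix is the fixed matrix of pivot embeddings. A sketch, with toy embeddings invented for illustration (rows for great, excellent and terrible, where the first two are deliberately near-identical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ae_scl_sr_forward(x_np, w_h, pivot_embeddings):
    """AE-SCL-SR sketch: identical to AE-SCL, except that the decoding matrix
    is fixed to the pre-trained pivot embeddings (row i = embedding of pivot i)
    and receives no gradient updates during training."""
    h = sigmoid(w_h @ x_np)
    # probability of pivot i ~ sigmoid of dot(embedding_i, h)
    o = sigmoid(pivot_embeddings @ h)
    return h, o

# toy embeddings: 'great' and 'excellent' are near-identical, so similar
# hidden layers are needed for the model to score them similarly
pivot_embeddings = np.array([
    [ 1.0,  0.9, 0.0],   # great
    [ 0.9,  1.0, 0.1],   # excellent
    [-1.0, -0.9, 0.2],   # terrible
])
rng = np.random.default_rng(2)
w_h = rng.normal(0, 0.1, (3, 8))    # toy: 8 non-pivots, hidden size 3
x_np = (rng.random(8) < 0.4).astype(float)
h, o = ae_scl_sr_forward(x_np, w_h, pivot_embeddings)
```

During training only w_h would be updated; the embedding matrix stays frozen, which is what produces the similarity bias described above.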

Experiments
In this section we describe our experiments. To facilitate clarity, some details are not given here and are instead provided in the appendices. Cross-domain Sentiment Classification To demonstrate the power of our models for domain adaptation we experiment with the task of cross-domain sentiment classification (Blitzer et al., 2007). The data for this task consist of Amazon product reviews from four product domains: Books (B), DVDs (D), Electronic items (E) and Kitchen appliances (K). For each domain 2000 labeled reviews are provided: 1000 are classified as positive and 1000 as negative, and these are augmented with unlabeled reviews: 6000 (B), 34741 (D), 13153 (E) and 16785 (K).
We also consider an additional target domain, denoted with Blog: the University of Michigan sentence level sentiment dataset, consisting of sentences taken from social media blogs. The dataset for the original task consists of a labeled training set (3995 positive and 3091 negative sentences) and a test set of 33052 sentences for which sentiment labels are not provided. We hence used the original test set as our target domain unlabeled set and the original training set as our target domain test set.
Baselines Cross-domain sentiment classification has been studied in a large number of papers. However, differences in preprocessing methods, dataset splits into train/dev/test subsets and sentiment classifiers make it hard to directly compare the numbers reported in past work.
We hence compare our models to three strong baselines, running all models under the same conditions. We aim to select baselines that represent the state-of-the-art in cross-domain sentiment classification in general, and in the two lines of work we focus on, pivot based and autoencoder based representation learning, in particular.
The first baseline is SCL with pivot features selected using the mutual information criterion (SCL-MI, (Blitzer et al., 2007)). In this variant of SCL, candidate pivot features are those that are frequent in the unlabeled data of both the source and the target domains, and among those candidates the pivots are the features with the highest mutual information with the task (sentiment) label in the source domain labeled data. We implemented this method. In our implementation, unigrams and bigrams should appear at least 10 times in both domains to be considered frequent. For non-pivot features we consider unigrams and bigrams that appear at least 10 times in their domain. The same pivot and non-pivot selection criteria are employed for our AE-SCL and AE-SCL-SR models.
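The SCL-MI selection criterion (frequency in both domains' unlabeled data, then ranking by mutual information with the source label) can be sketched as follows. The add-one smoothing and the toy thresholds are our own illustrative choices, not details from the paper:

```python
import numpy as np
from collections import Counter

def select_pivots(src_unlab, tgt_unlab, src_labeled, n_pivots=100, min_count=10):
    """SCL-MI sketch. A candidate pivot must occur >= min_count times in the
    unlabeled data of both domains; among the candidates, the features with
    the highest mutual information with the source label are selected.
    Docs are sets of features; src_labeled is a list of (doc, label),
    label in {0, 1}."""
    src_counts = Counter(f for d in src_unlab for f in d)
    tgt_counts = Counter(f for d in tgt_unlab for f in d)
    candidates = [f for f in src_counts
                  if src_counts[f] >= min_count and tgt_counts[f] >= min_count]

    n = len(src_labeled)
    def mi(f):
        # I(feature presence; label), with add-one smoothing on the joint
        score = 0.0
        for fv in (0, 1):
            for yv in (0, 1):
                joint = (sum(1 for d, y in src_labeled
                             if (f in d) == fv and y == yv) + 1) / (n + 4)
                p_f = (sum(1 for d, _ in src_labeled
                           if (f in d) == fv) + 2) / (n + 4)
                p_y = (sum(1 for _, y in src_labeled
                           if y == yv) + 2) / (n + 4)
                score += joint * np.log(joint / (p_f * p_y))
        return score
    return sorted(candidates, key=mi, reverse=True)[:n_pivots]
```

For example, a feature like great that is frequent in both domains and strongly predicts the positive label would rank highly, while a frequent but label-neutral feature would not.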
Among autoencoder models, SDA was shown by Glorot et al. (2011) to outperform SFA and SCL on cross-domain sentiment classification, and later Chen et al. (2012) demonstrated superior performance for MSDA over SDA and SCL on the same task. Our second baseline is hence the MSDA method (Chen et al., 2012), with code taken from the authors' web page. To consider a regularization scheme on top of MSDA representations we also experiment with the MSDA-DAN model (Ganin et al., 2016), which employs a domain adversarial network (DAN) with the MSDA vectors as input. In Ganin et al. (2016) MSDA-DAN was shown to substantially outperform the DAN model when DAN is randomly initialized. The DAN code is taken from the authors' repository. For reference we compare to the No-DA case, where the sentiment classifier is trained in the source domain and applied to the target domain without adaptation. The sentiment classifier we employ, in this case as well as with our methods and with the SCL-MI and MSDA baselines, is a standard logistic regression classifier.

Experimental Protocol Following the unsupervised domain adaptation setup (Section 2), we have access to unlabeled data from both the source and the target domains, which we use to train the representation learning models. However, only the source domain has labeled training data for sentiment classification. The original feature set we start from consists of word unigrams and bigrams.
All methods (baselines and ours), except for MSDA-DAN, follow a two-step protocol at both training and test time. In the first step, the input example is run through the representation model, which generates a new feature vector for this example. Then, in the second step, this vector is concatenated with the original feature vector of the example and the resulting vector is fed into the sentiment classifier (this concatenation is a standard convention in the baseline methods).
For MSDA-DAN all of the above holds, with one exception. MSDA-DAN gets an input representation that consists of a concatenation of the original and the MSDA-induced feature sets; as this is an end-to-end model that predicts the sentiment class jointly with the new feature representation, we do not employ any additional sentiment classifier. As in the other models, MSDA-DAN utilizes source domain labeled data as well as unlabeled data from both the source and the target domains at training time.
We experiment with a 5-fold cross-validation on the source domain (Blitzer et al., 2007): 1600 reviews for training and 400 reviews for development. The test set for each target domain of Blitzer et al. (2007) consists of all 2000 labeled reviews of that domain, and for the Blog domain it consists of the 7086 labeled sentences provided with the task dataset. In all five folds half of the training examples and half of the development examples are randomly selected from the positive reviews and the other halves from the negative reviews. We report average results across these five folds, employing the same folds for all models.
Hyper-parameter Tuning The details of the hyper-parameter tuning process for all models (including data splits into training, development and test sets) are described in the appendices. Here we provide a summary. AE-SCL and AE-SCL-SR: For the stochastic gradient descent (SGD) training algorithm we set the learning rate to 0.1, momentum to 0.9 and weight-decay regularization to 10−5. The number of pivots was chosen among {100, 200, . . . , 500} and the dimensionality of h among {100, 300, 500}. For the features induced by these models we take their w h x np vector. For AE-SCL-SR, embeddings for the unigram and bigram features were learned with word2vec (Mikolov et al., 2013). Details about the software and the way we learn bigram representations are in the appendices. Baselines: For SCL-MI, following Blitzer et al. (2007), we tuned the number of pivot features between 500 and 1000 and the SVD dimensions among 50, 100 and 150. For MSDA we tuned the number of reconstructed features among {500, 1000, 2000, 5000, 10000}, the number of model layers among {1, 3, 5} and the corruption probability among {0.1, 0.2, . . . , 0.5}.

Results AE-SCL-SR performs best in 3 of 4 setups, providing particularly large improvements when training is in the Kitchen (K) domain. The average improvement of AE-SCL-SR over MSDA is 5.2% and over a non-adapted classifier is 11.7%. As before, MSDA-DAN performs similarly to MSDA on the unified test set, although the differences in the individual setups are much higher. The differences between AE-SCL-SR and the other models are statistically significant in most cases (significance is tested as in (Gillick and Cox, 1989; Blitzer et al., 2006)). Notably, the unlabeled documents from all four product domains are strongly biased to convey positive opinions (Section 4). This is indicated, for example, by the average score given to these reviews by their authors: 4.29 (B), 4.33 (D), 3.96 (E) and 4.16 (K), on a scale of 1 to 5.
This analysis suggests that AE-SCL-SR makes better use of its unlabeled data.

Similar Pivots
Recall that AE-SCL-SR aims to learn more similar representations for documents with similar pivot features. Table 2 demonstrates this effect through pairs of test documents from 8 product review setups. For each setup we consider one example pair from one of the five folds, such that the dimensionality of the hidden layers in both models is identical. The documents in each pair contain pivot features with very similar meaning, and indeed they belong to the same sentiment class. Yet, in all cases AE-SCL-SR correctly classifies both documents, while AE-SCL misclassifies one. (The reported numbers are averaged over the 5 folds and rounded to the closest integer where necessary; the comparison between AE-SCL-SR and MSDA-DAN yields a very similar pattern and is hence excluded for space considerations.)
The rightmost column of the table presents the difference in the ranking of the cosine similarity between the representation vectors h̃ of the documents in each pair, according to each of the models. Results (in numerical values and percentage) are given with respect to all cosine similarity values between the h̃ vectors of any document pair in the test set. As the document pairs with the highest similarity are ranked 1, a positive difference between the rank of AE-SCL and that of AE-SCL-SR indicates that AE-SCL considers the pair less similar. That is, AE-SCL-SR learns more similar representations for documents with similar pivot features.

Conclusions and Future Work
We presented a new model for domain adaptation which combines ideas from pivot based and autoencoder based representation learning. We have demonstrated how to encode information from pre-trained word embeddings in order to improve the generalization of our model across examples with semantically similar pivot features. We demonstrated strong performance on cross-domain sentiment classification tasks with 16 domain pairs and provided an initial qualitative analysis that supports the intuition behind our model. Our approach is general and applicable to a large number of NLP tasks (for AE-SCL-SR this holds as long as the pivot features can be embedded in a vector space).
In future work we would like to adapt our model to more general domain adaptation setups, such as those where adaptation is performed between sets of source and target domains, and those where some labeled data from the target domain(s) is available.

A Hyperparameter Tuning
This appendix describes the hyper-parameter tuning process for the models compared in our paper. Some of these details appear in the full paper, but here we provide a detailed description.
AE-SCL and AE-SCL-SR We tuned the parameters of both our models in two steps. First, we randomly split the unlabeled data from both the source and the target domains in an 80/20 manner and combine the large subsets together and the small subsets together, so as to generate unlabeled training and validation sets. On these training/validation sets we tune the hyper-parameters of the stochastic gradient descent (SGD) algorithm we employ to train our networks: learning rate (0.1), momentum (0.9) and weight-decay regularization (10−5). Note that these values are tuned on the fully unsupervised task of predicting pivot feature occurrence from the non-pivot input representation, and are then employed in all the source-target domain combinations, across all folds. Both AE-SCL and AE-SCL-SR converged to the same values, probably because for each parameter we considered only a handful of values: learning rate (0.01, 0.1, 1), momentum (0.1, 0.5, 0.9) and weight-decay regularization (10−4, 10−5, 10−6). When tuning the SGD parameters we experimented with 100 and 500 pivots and with a dimensionality of 100 and 500 for h.

After tuning the SGD parameters, in the second step we tuned the model's hyper-parameters for each fold of each source-target setup. These hyper-parameters are the number of pivots (100 to 500 in steps of 100) and the dimensionality of h (100 to 500 in steps of 200). We select the values that yield the best performing model when training on the training set and evaluating on the training domain development set of each fold. We further explored the quality of the various intermediate representations generated by the models as sources of features for the sentiment classifier. The vectors we considered are: w h x np, h = σ(w h x np), w r h and r = σ(w r h). We chose the w h x np vector, denoted in the paper with h̃.

For AE-SCL-SR, embeddings for the unigram and bigram features were learned with word2vec (Mikolov et al., 2013). We employed the Gensim package (https://radimrehurek.com/gensim/) and trained the model on the unlabeled data from both the source and the target domains of each adaptation setup. To learn bigram representations, in cases where a bigram pivot (w1, w2) is included in a sentence, we generate the triplet w1, w1-w2, w2. For example, the sentence It was a very good book with the bigram pivot very good is re-written as: It was a very very-good good book. The revised corpus is then fed into word2vec. The dimension of the hidden layer h of AE-SCL-SR is the dimension of the induced embeddings.
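The bigram re-writing step can be sketched as follows:

```python
def rewrite_for_bigram_pivots(sentence, bigram_pivots):
    """Rewrite a tokenized sentence so that word2vec also learns vectors for
    bigram pivots: each occurrence of a bigram pivot (w1, w2) becomes the
    triplet w1, w1-w2, w2, as described in the text."""
    out = []
    for i, tok in enumerate(sentence):
        out.append(tok)
        if i + 1 < len(sentence) and (tok, sentence[i + 1]) in bigram_pivots:
            out.append(f"{tok}-{sentence[i + 1]}")
    return out

sent = "it was a very good book".split()
rewritten = rewrite_for_bigram_pivots(sent, {("very", "good")})
# -> ['it', 'was', 'a', 'very', 'very-good', 'good', 'book']
```

The rewritten corpus is then passed to word2vec as usual, so the token very-good receives its own embedding.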
In both parameter tuning steps we use the unlabeled validation data for early stopping: the SGD algorithm stops at the first iteration where the validation error increases, rather than when the training error or the loss function is minimized.
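The early stopping rule can be sketched as follows (the two callbacks are hypothetical placeholders for one training epoch and a validation-error evaluation):

```python
def train_with_early_stopping(step_fn, val_error_fn, max_epochs=100):
    """Early-stopping sketch: stop at the first epoch where the validation
    error increases, rather than when the training loss is minimized.
    step_fn() runs one training epoch; val_error_fn() returns the current
    error on the held-out unlabeled validation set."""
    best_err = float("inf")
    for epoch in range(max_epochs):
        step_fn()
        err = val_error_fn()
        if err > best_err:          # first increase -> stop
            return epoch
        best_err = err
    return max_epochs

# toy run: validation error dips and then rises at epoch 2
errors = iter([5.0, 4.0, 4.5, 3.0])
stopped_at = train_with_early_stopping(lambda: None, lambda: next(errors),
                                       max_epochs=10)
# stopped_at == 2
```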
SCL-MI Following Blitzer et al. (2007), we used 1000 pivot features; results with 500 pivots were very similar. The number of SVD dimensions was tuned on the labeled development data to the best value among 50, 100 and 150.
MSDA Using the labeled development data we tuned the number of reconstructed features (among 500, 1000, 2000, 5000 and 10000), the number of model layers (among {1, 3, 5}) and the corruption probability (among {0.1, 0.2, . . . , 0.5}). For details on these hyper-parameters see (Chen et al., 2012).

MSDA-DAN Following Ganin et al. (2016), we tuned the hyper-parameters on the labeled development data as follows. The λ adaptation parameter is chosen among 9 values between 10−2 and 1 on a logarithmic scale. The hidden layer size l is chosen among {50, 100, 200} and the learning rate µ is fixed to 10−3.

B Experimental Choices
Variants of the Product Review Data There are two releases of the datasets of the Blitzer et al. (2007) cross-domain product review task.
We use the one from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html where the unlabeled data is imbalanced, consisting of more positive than negative reviews. We believe that our setup is more realistic, as when collecting unlabeled data it is hard to get a balanced set. Note that Blitzer et al. (2007) used the other release, where the unlabeled data consists of the same number of positive and negative reviews.
Test Set Size While Blitzer et al. (2007) used only 400 target domain reviews for test, we use the entire set of 2000 reviews. We believe that this decision yields more robust and statistically significant results.