Transductive Adaptation of Black Box Predictions

Access to data is critical to any machine learning component aimed at training an accurate predictive model. In practice, data is often subject to technical and legal constraints. Data may contain sensitive topics, and data owners are often reluctant to share it. Instead of giving access to the data, they make decision-making procedures available so that predictions can be obtained on new data. Under this black box classifier constraint, we build an effective domain adaptation technique that adapts classifier predictions in a transductive setting. We run experiments on text categorization datasets and show that significant gains can be achieved, especially in the unsupervised case where no labels are available in the target domain.


Introduction
While huge volumes of unlabeled data are generated and made available in various domains, the cost of acquiring data labels remains high. Domain adaptation problems arise whenever one leverages labeled data from one or more related source domains to learn a classifier for unseen data in a target domain that is related but not identical. The majority of domain adaptation methods assume that source collections are largely available; this makes it possible to measure the discrepancy between distributions and either build representations common to both target and sources, or directly reuse source instances for better target classification (Xu and Sun, 2012).
Numerous approaches have been proposed to address domain adaptation for statistical machine translation (Koehn and Schroeder, 2007), opinion mining, part-of-speech tagging and document ranking (Daumé, 2009; Zhou and Chang, 2014). The most effective techniques include feature replication (Daumé, 2009), pivot features (Blitzer et al., 2006), and finding topic models shared by source and target collections (Chen and Liu, 2014). Domain adaptation has also received a lot of attention in computer vision (Gopalan et al., 2015), where domain shift is a consequence of changing conditions such as background, location and pose.
More recently, domain adaptation has been tackled with word embedding techniques or deep learning. Bollegala et al. (2015) proposed an unsupervised method for learning domain-specific word embeddings, while Yang and Eisenstein (2014) relied on word2vec models (Mikolov et al., 2013) to compute feature embeddings. Deep learning has been considered as a generic solution to domain adaptation (Vincent et al., 2008; Glorot et al., 2011; Chopra et al., 2013) and transfer learning problems (Long et al., 2015). For instance, denoising autoencoders are successful models that find common features between source and target collections. They are trained to reconstruct input data from partial random corruption and can be stacked into a multi-layered network where the weights are fine-tuned with backpropagation (Vincent et al., 2008) or the corruption is marginalized out (Chen et al., 2012).
Domain adaptation is also very attractive for service companies operating customer business processes as it can reduce annotation costs. For instance, opinion mining components deployed in a service solution can be customized to a new customer and adapted with few annotations in order to achieve a contractual performance.
In reality, however, the simplifying assumption of having access to source data rarely holds, which limits the application of existing domain adaptation methods. Source data are often subject to legal, technical and contractual constraints between data owners and data customers. Customers are often reluctant to share their data; instead, they put in place decision-making procedures, which makes it possible to obtain predictions for new data under a black box scenario. Note that this scenario differs from the differential privacy setting (Dwork and Roth, 2014): no queries to the raw source database are allowed, and only requests for predicting labels of target documents are permitted. This makes privacy-preserving machine learning methods (Chaudhuri and Monteleoni, 2008; Agrawal and Srikant, 2000) inapplicable here.
In addition, black box systems are frequent in natural language processing applications. For instance, Statistical Machine Translation (SMT) systems are often used as black boxes to extract features (Specia et al., 2009). Similarly, the problem of adapting SMT systems for cross-lingual retrieval has been addressed by Nikoulina et al. (2012), where target document collections cannot be accessed and the retrieval engine works as a black box.
In this paper we address the problem of adapting classifiers trained on the source data and available as black boxes. The case of available source classifiers has been studied by Duan et al. (2009) to regularize supervised target classifiers, but we consider here a transductive setting, where the source classifiers are used to predict class scores for a set of available target instances.
We then apply the denoising principle (Vincent et al., 2008) and consider these predictions on target instances as corrupted by the domain shift from source to target. More precisely, we use the stacked Marginalized Denoising Autoencoders (Chen et al., 2012) to reconstruct the predictions by exploiting the correlation between the target features and the predicted scores. This method has the advantage of coping with unsupervised cases where no labels are available in the target domain. We test the prediction denoising method on two benchmark text classification datasets and demonstrate its capacity to significantly improve the classification accuracy.

Transductive Prediction Adaptation
The domain adaptation problem consists of leveraging the source labeled and target unlabeled data to derive a hypothesis that performs well on the target domain. To achieve this goal, most domain adaptation (DA) methods compute correlations between features in the source and target domains. With no access to source data, we argue that this principle can be extended to the correlation between target features and the source class decisions. We adopt an adaptation trick that treats predicted class scores as augmented features for target data. In other words, we use the source classifiers as a pivot to transfer knowledge from source to target. In addition, one can exploit relations between the prediction scores and the target feature distribution to provide adapted predictions.

Marginalized Denoising Autoencoder
The stacked Marginalized Denoising Autoencoder (sMDA), proposed by Chen et al. (2012), is a variant of the multi-layer neural network trained to reconstruct input data from partial random corruption (Vincent et al., 2008), in which the random corruption is marginalized out, yielding the optimal reconstruction weights in closed form.
The basic building block of the method is a one-layer linear denoising autoencoder in which a set of $N$ input documents $x_n$ is corrupted $M$ times by random feature dropout with probability $p$. Each input is then reconstructed with a linear mapping $W : \mathbb{R}^d \to \mathbb{R}^d$ by minimizing the squared reconstruction loss:

$$\mathcal{L}(W) = \sum_{n=1}^{N} \sum_{m=1}^{M} \| x_n - W \tilde{x}_{nm} \|^2, \qquad (1)$$

where $\tilde{x}_{nm}$ denotes the $m$-th corrupted version of $x_n$. Let $\bar{X}$ be the concatenation of $M$ replicated versions of the original data and $\tilde{X}$ be the matrix representation of the $M$ corrupted versions.
Then, the solution of (1) can be expressed as the closed-form solution for ordinary least squares, $W = PQ^{-1}$ with $Q = \tilde{X}\tilde{X}^\top$ and $P = \bar{X}\tilde{X}^\top$, where the solution depends on the re-sampling of $x_1, \ldots, x_N$ and on which features are randomly corrupted.
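For completeness, the ordinary least squares solution can be recovered by setting the gradient of the loss to zero. The short derivation below is a sketch that writes the loss in matrix form with the Frobenius norm and assumes that $Q = \tilde{X}\tilde{X}^\top$ is invertible, with $P = \bar{X}\tilde{X}^\top$:

```latex
\begin{align*}
\mathcal{L}(W) &= \| \bar{X} - W\tilde{X} \|_F^2 \\
\nabla_W \mathcal{L}(W) &= -2\,(\bar{X} - W\tilde{X})\,\tilde{X}^\top = 0 \\
\Rightarrow\quad W\,\tilde{X}\tilde{X}^\top &= \bar{X}\tilde{X}^\top \\
\Rightarrow\quad W &= \bar{X}\tilde{X}^\top \left(\tilde{X}\tilde{X}^\top\right)^{-1} = P\,Q^{-1}
\end{align*}
```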
It is preferable to consider all possible corruptions of all possible inputs when the denoising transformation $W$ is computed, i.e. letting $M \to \infty$. By the weak law of large numbers, the matrices $P$ and $Q$ converge to their expected values $\mathbb{E}[P]$ and $\mathbb{E}[Q]$ as more copies of the corrupted data are created. In the limit, one can derive their expectations and express the corresponding mapping $W$ in closed form as $W = \mathbb{E}[P]\,\mathbb{E}[Q]^{-1}$, where:

$$\mathbb{E}[Q]_{ij} = \begin{cases} S_{ij}\, q_i q_j & \text{if } i \neq j, \\ S_{ij}\, q_i & \text{if } i = j, \end{cases} \qquad \text{and} \qquad \mathbb{E}[P]_{ij} = S_{ij}\, q_j,$$

where $q = [1 - p, \ldots, 1 - p, 1] \in \mathbb{R}^{d+1}$ and $S = XX^\top$ is the covariance matrix of the uncorrupted data. The last entry of $q$ corresponds to a constant feature appended to the input, which is never corrupted. This closed-form denoising layer with a single noise level $p$ is referred to in the following as the marginalized denoising autoencoder (MDA). It was shown by Chen et al. (2012) that MDA can be applied successfully to domain adaptation, where the source set $X^s$ and target set $X^t$ are concatenated to form $X$ and the mapping $W$ can exploit the correlation between source and target features. The case of fully available source and target data is referred to as the dream case in the evaluation section.
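The closed-form layer above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the implementation used in the experiments: the function name `mda`, the column-wise data layout, the appended constant feature and the small ridge term `reg` (added only to keep the expected matrix $\mathbb{E}[Q]$ invertible) are assumptions made for illustration.

```python
import numpy as np

def mda(X, p, reg=1e-5):
    """Closed-form marginalized denoising layer (illustrative sketch).

    X   : array of shape (d, N), one instance per column (assumed layout).
    p   : feature dropout probability.
    reg : small ridge term, an assumption made here for numerical stability.
    Returns W of shape (d, d + 1), acting on inputs extended with a constant 1.
    """
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])      # append the constant feature
    S = Xb @ Xb.T                             # S = X X^T, shape (d+1, d+1)
    q = np.full(d + 1, 1.0 - p)
    q[-1] = 1.0                               # the constant feature is never corrupted
    EQ = S * np.outer(q, q)                   # off-diagonal: S_ij q_i q_j
    np.fill_diagonal(EQ, q * np.diag(S))      # diagonal: S_ii q_i
    EP = S[:d, :] * q                         # E[P]_ij = S_ij q_j
    # W = E[P] E[Q]^{-1}; E[Q] is symmetric, so solve the transposed system.
    return np.linalg.solve(EQ + reg * np.eye(d + 1), EP.T).T

# Toy usage on random data (values are arbitrary, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 50))
W = mda(X, p=0.5)
X_denoised = W @ np.vstack([X, np.ones((1, 50))])
```

With `p = 0` no corruption is applied, and the mapping reduces (up to the ridge term) to the identity on the original features, which gives a quick sanity check of the formulas.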

Prediction Adaptation
Without access to $X^s$, MDA cannot be directly applied to $[X^s; X^t]$. Instead, we augment the feature set $X^t$ with the class predictions, represented as the vector $f_s(x^t_n)$ of class probabilities $P_s(Y = y \mid x^t_n)$, $n = 1, \ldots, N$. Let $u^t_n = [x^t_n; f_s(x^t_n)]$ be the target instance augmented with the source classifier predictions and $U = [u^t_1\, u^t_2 \ldots u^t_N]$ be the input to the MDA. Then we compute the optimal mapping $W^* = \arg\min_W \|U - W\tilde{U}\|^2$, which takes into account the correlation between the target features $x^t$ and the class predictions $f_s(x^t)$. The reconstructed class predictions can be obtained as $W^*_{[1:N,\,d+1:d+C]} \cdot f_s(x^t)$, where $C$ is the number of classes, and used to label the target data. Algorithm 1 summarizes all steps of the transductive prediction adaptation for a single source domain; the generalization to multiple sources is straightforward.
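The prediction adaptation steps can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' code: the helper names `mda` and `tpa`, the ridge term `reg`, and in particular the choice to read the denoised predictions off the last $C$ rows of the reconstruction $W^* U$ are assumptions. The toy scores in `Fs` stand in for the output of a black box source classifier.

```python
import numpy as np

def mda(X, p, reg=1e-5):
    """Closed-form MDA layer; X has shape (d, N). `reg` is an assumed ridge term."""
    d, n = X.shape
    Xb = np.vstack([X, np.ones((1, n))])      # constant feature, never corrupted
    S = Xb @ Xb.T
    q = np.full(d + 1, 1.0 - p)
    q[-1] = 1.0
    EQ = S * np.outer(q, q)
    np.fill_diagonal(EQ, q * np.diag(S))
    EP = S[:d, :] * q
    return np.linalg.solve(EQ + reg * np.eye(d + 1), EP.T).T

def tpa(Xt, Fs, p=0.9):
    """Transductive prediction adaptation (sketch under stated assumptions).

    Xt : (N, d) target features.
    Fs : (N, C) class scores returned by the black box source classifier.
    Returns the predicted labels (N,) and the denoised scores (N, C).
    """
    n, d = Xt.shape
    U = np.hstack([Xt, Fs]).T                 # augmented instances, shape (d+C, N)
    W = mda(U, p)                             # shape (d+C, d+C+1)
    R = W @ np.vstack([U, np.ones((1, n))])   # reconstruction of U
    Y = R[d:, :].T                            # assumed: last C rows are the denoised scores
    return Y.argmax(axis=1), Y

# Toy usage with made-up features and scores.
rng = np.random.default_rng(0)
Xt = rng.random((40, 3))
Fs = np.tile([[0.8, 0.2], [0.2, 0.8]], (20, 1))
labels, Y = tpa(Xt, Fs, p=0.9)
```

Only the target features and the black box scores enter the computation; no source instance is needed at any step.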

Experimental results
We test our approach on two standard domain adaptation datasets: the Amazon reviews (AMT) and the 20Newsgroups (NG). The AMT dataset consists of product reviews with 2 classes (positive and negative) represented by tf-idf normalized bag-of-words, used in previous studies on domain adaptation (Blitzer et al., 2011). We consider the 10,000 most frequent features and the four domains used in these studies: kitchen (k), dvd (d), books (b) and electronics (e), with roughly 5,000 documents per domain. We use the whole source dataset for training and test on the whole target dataset. We set the MDA noise level $p$ to high values (e.g. 0.9), as document representations are sparse and adding low noise has no effect on the features already equal to zero.

Algorithm 1 Transductive prediction adaptation.
Require: Unlabeled target dataset $X^t \in \mathbb{R}^{N \times d}$.
Require: Class predictions $f_s(x^t_n) = [P_s(Y = 1 \mid x^t_n), \ldots, P_s(Y = C \mid x^t_n)] \in \mathbb{R}^C$.
1: Compose $U \in \mathbb{R}^{N \times (d+C)}$ with $u^t_n = [x^t_n; f_s(x^t_n)]$.
2: Use MDA with noise level $p$ to estimate $W^* = \arg\min_W \|U - W\tilde{U}\|^2$.
3: Get the denoised class predictions for $x^t$ as $y^t = W^*_{[1:N,\,d+1:d+C]} \cdot f_s(x^t)$.
4: Label $x^t$ with $c^* = \arg\max_c \{ y^t_c \mid y^t \}$.
5: return Labels for $X^t$.
In Table 1, we show the performance of the Transductive Prediction Adaptation (TPA) on 12 adaptation tasks in the AMT dataset. The first column shows the accuracies for the dream case, where the standard MDA is applied to both source and target data. The second column shows the baseline results ($f_s(X^t)$) obtained directly as class predictions by the source classifier. The classification model is an $\ell_2$-regularized Logistic Regression, cross-validated with regularization parameter $C \in \{0.0001, 0.001, 0.1, 1, 10, 50, 100\}$.
The last two columns show the results obtained with two versions of TPA (results are underlined when improving over the baseline and in bold when yielding the highest values). In the first version, target instances $x^t_n$ contain only features (words and bigrams) appearing in the source documents and used to make the predictions $f_s(x^t_n)$. In the second version, denoted as TPAe, we extend TPA with words unseen in the source documents. If the extension part is denoted $v^t_n$, we obtain an augmented representation $u^t_n = [x^t_n; v^t_n; f_s(x^t_n)]$ as input to MDA. As we can see, both TPA and TPAe significantly outperform the baseline $f_s(X^t)$ obtained with no adaptation. Furthermore, extending TPA with words present only in target documents further improves the classification accuracy in most cases. Finally, TPAe often outperforms the dream case on individual tasks, and also does so on average (note, however, that the dream-case MDA uses the features common to source and target documents as input).
To understand the effect of prediction adaptation, we analyze the book → electronics adaptation task. In the mapping $W$, we sort the weights corresponding to the correlation between the positive class and the target features. Features with the highest weights (up-weighted by TPA) are great, my, sound, easy, excellent, good, easy to, best, yo, a great, when, well, the best. On the contrary, the words with the smallest weights (down-weighted by TPA) are no, was, number, don't, after, money, if, work, bad, get, buy.

As TPA is totally unsupervised, we run additional experiments to understand its practical usefulness. We compare TPA to the case of weakly annotated target data, where a few target examples are labeled and used for training a target classifier. Trained with 40, 100 and 200 target examples, a logistic regression yields an average accuracy of 64.63%, 68.01% and 75.13% over the 12 tasks, and a Multinomial Naive Bayes reports 65.82%, 71.49% and 76%, respectively. Even with 200 labeled target documents, the target versus target classification results are significantly below the 79.8% average accuracy of the baseline source classifier.
All these values are therefore significantly below the 83.73% obtained with TPAe. This strongly supports the domain adaptation scenario, in which a sentiment analysis classifier trained on a larger source set and adapted to target documents can do better than a classifier trained on a small set of labeled target documents. Furthermore, we have seen that the baseline can be significantly improved by TPA, and even more by TPAe, without the need for even a small amount of manual labeling of the target set.
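The weight inspection used above for the book → electronics task can be sketched as follows. The function name `rank_features`, the convention that the row of $W$ at index `d + class_index` reconstructs the score of the given class, and the toy vocabulary and weights are all assumptions made for illustration; they are not taken from the experiments.

```python
import numpy as np

def rank_features(W, vocab, class_index, k=3):
    """Return the k most up-weighted and k most down-weighted features.

    W           : learned mapping; row d + class_index is assumed to reconstruct
                  the score of the class, and its first d entries weight the
                  d target features.
    vocab       : list of the d feature names.
    class_index : index of the class of interest (e.g. the positive class).
    """
    d = len(vocab)
    w = W[d + class_index, :d]
    order = np.argsort(w)                      # ascending weights
    up = [vocab[i] for i in order[::-1][:k]]   # largest weights first
    down = [vocab[i] for i in order[:k]]       # smallest weights first
    return up, down

# Toy mapping with made-up weights: d = 4 features, C = 1 class score.
vocab = ["great", "bad", "the", "easy"]
W = np.zeros((5, 6))
W[4, :4] = [0.5, -0.4, 0.0, 0.3]
up, down = rank_features(W, vocab, class_index=0, k=2)
```

In this toy example, `up` contains "great" and "easy" while `down` contains "bad" and "the", mirroring the kind of lists reported for the real task.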
The second group of evaluation tests is on the 20Newsgroups dataset. It contains around 20,000 documents from 20 classes and represents a standard testbed for text categorization. For the domain adaptation, we follow the setting described in (Pan et al., 2012). We filter out rare words (appearing fewer than 3 times) and keep at most 10,000 features for each task with tf-idf term weighting. As all documents are organized in a hierarchy, the domain adaptation tasks are defined on category pairs, with sources and targets corresponding to subcategories. For example, for the 'comp vs sci' task, subcategories such as comp.sys.ibm.pc.hardware and sci.crypt are set as source domains, and comp.sys.ibm.mac.hardware and sci.med as targets, respectively.
In our experiments we consider 5 adaptation tasks on category pairs ('comp vs sci', 'rec vs talk', 'rec vs sci', 'sci vs talk' and 'comp vs rec', as in (Pan et al., 2012)), and run the baseline, TPA and TPAe methods. For each category pair, we additionally swap the source and target roles; this explains the two sets of experimental results for each pair. We show the evaluation results in Table 2. Again, one can observe a significant improvement over the baseline $f_s(x^t_n)$ and the positive effect of including the unseen words in TPA.

Conclusion
In this paper we address the domain adaptation scenario in which there is no access to source data and source classifiers are available only as black boxes. In the transductive setting, the source classifiers can predict class scores for target instances, and we consider these predictions as corrupted by domain shift. We use the Marginalized Denoising Autoencoders (Chen et al., 2012) to reconstruct the predictions by exploiting the "correlation" between the target features and the predicted scores. We test the transductive prediction adaptation on two known benchmarks and demonstrate that it can significantly improve the classification accuracy, compared to the baseline and to the case of full access to source data. This is an encouraging result because it demonstrates that domain adaptation can still be effective despite the absence of source data. In the future, we would like to explore the adaptation of other language processing components, such as named entity recognition, with our method.