Learning to select data for transfer learning with Bayesian Optimization

Domain similarity measures can be used to gauge adaptability and select suitable data for transfer learning, but existing approaches define ad hoc measures that are deemed suitable for respective tasks. Inspired by work on curriculum learning, we propose to learn data selection measures using Bayesian Optimization and evaluate them across models, domains and tasks. Our learned measures outperform existing domain similarity measures significantly on three tasks: sentiment analysis, part-of-speech tagging, and parsing. We show the importance of complementing similarity with diversity, and that learned measures are–to some degree–transferable across models, domains, and even tasks.


Introduction
Natural Language Processing (NLP) models suffer considerably when applied in the wild. The distribution of the test data is typically very different from the data used during training, causing a model's performance to deteriorate substantially. Domain adaptation is a prominent approach to transfer learning that can help to bridge this gap; many approaches have been suggested so far (Daumé III, 2007; Jiang and Zhai, 2007; Ma et al., 2014; Schnabel and Schütze, 2014). However, most work has focused on one-to-one scenarios. Only recently has research considered using multiple sources. Such studies are rare and typically rely on specific model transfer approaches (Mansour, 2009; Wu and Huang, 2016).
Inspired by work on curriculum learning (Bengio et al., 2009; Tsvetkov et al., 2016), we instead propose, to the best of our knowledge, the first model-agnostic data selection approach to transfer learning. Contrary to curriculum learning, which aims at speeding up learning (see §6), we aim at learning to select the most relevant data from multiple sources using data metrics. While several measures have been proposed in the past (Moore and Lewis, 2010; Axelrod et al., 2011; Van Asch and Daelemans, 2010; Plank and van Noord, 2011; Remus, 2012), prior work is limited in studying metrics mostly in isolation, using only the notion of similarity (Ben-David et al., 2007) and focusing on a single task (see §6). Our hypothesis is that different tasks or even different domains demand different notions of similarity. In this paper we go beyond prior work by i) studying a range of similarity metrics, including diversity; and ii) testing the robustness of the learned weights across models (e.g., whether a more complex model can be approximated with a simpler surrogate), domains and tasks (to delimit the transferability of the learned weights).
The contributions of this work are threefold. First, we present the first model-independent approach to learn a data selection measure for transfer learning. It outperforms baselines across three tasks and multiple domains and is competitive with state-of-the-art domain adaptation approaches. Second, prior work on transfer learning mostly focused on similarity. We demonstrate empirically that diversity is as important as, and complements, domain similarity for transfer learning. Finally, we show, for the first time, to what degree learned measures transfer across models, domains and tasks.

Background: Transfer learning
Transfer learning generally involves the concepts of a domain and a task (Pan and Yang, 2010). A domain $\mathcal{D}$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$ over $\mathcal{X}$, where $X = \{x_1, \dots, x_n\} \in \mathcal{X}$. For document classification with a bag-of-words, $\mathcal{X}$ is the space of all document vectors, $x_i$ is the $i$-th document vector, and $X$ is a sample of documents.
Given a domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, a task $\mathcal{T}$ consists of a label space $\mathcal{Y}$ and a conditional probability distribution $P(Y|X)$ that is typically learned from training data consisting of pairs $\{x_i, y_i\}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.
Finally, given a source domain $\mathcal{D}_S$, a corresponding source task $\mathcal{T}_S$, as well as a target domain $\mathcal{D}_T$ and a target task $\mathcal{T}_T$, transfer learning seeks to facilitate the learning of the target conditional probability distribution $P(Y_T|X_T)$ in $\mathcal{D}_T$ with the information gained from $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$. We will focus on the scenario where $\mathcal{D}_S \neq \mathcal{D}_T$, assuming that $\mathcal{T}_S = \mathcal{T}_T$, commonly referred to as domain adaptation. We investigate transfer across tasks in §5.3.
Existing research in domain adaptation has generally focused on the scenario of one-to-one adaptation: Given a set of source domains A and a set of target domains B, a model is evaluated based on its ability to adapt between all pairs (a, b) in the Cartesian product A × B where a ∈ A and b ∈ B (Remus, 2012). However, adaptation between two dissimilar domains is often undesirable, as it may lead to negative transfer (Rosenstein et al., 2005). Only recently, many-to-one adaptation (Mansour, 2009;Wu and Huang, 2016) has received some attention, as it replicates the realistic scenario of multiple source domains where performance on the target domain is the foremost objective.

Data selection model
In order to select training data for adaptation for a task $\mathcal{T}$, existing approaches rank the available $n$ training examples $X = \{x_1, x_2, \dots, x_n\}$ of $k$ source domains $\mathcal{D} = \{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_k\}$ according to a domain similarity measure $S$ and choose the top $m$ samples for training their algorithm. While this has been shown to work empirically (Moore and Lewis, 2010; Axelrod et al., 2011; Plank and van Noord, 2011; Van Asch and Daelemans, 2010; Remus, 2012), using a pre-existing metric leaves us unable to adapt to the characteristics of our task $\mathcal{T}$ and target domain $\mathcal{D}_T$ and forgoes additional knowledge that may be gleaned from the interaction of different metrics. For this reason, we propose to learn the following domain similarity measure $S$ as a linear combination of feature values:

$$S = \phi(X) \cdot w^{\top} \qquad (1)$$

where $\phi(X) \in \mathbb{R}^{n \times l}$ are the similarity and diversity features further described in §3.2 for each training example, with $l$ being the number of features, while $w \in \mathbb{R}^{l}$ are the weights learned by Bayesian Optimization.
We aim to learn weights $w$ in order to optimize the objective function $J$ of the respective task $\mathcal{T}$ on a small number of validation examples of the corresponding target domain $\mathcal{D}_T$.
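The linear measure described above can be illustrated with a minimal sketch: each example receives the dot product of its feature row with the weight vector, and the highest-scoring examples are kept. The function names, the toy feature values, and the weights below are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def score_examples(features: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Return one score per example: the dot product of its feature row with w."""
    return [sum(f * w for f, w in zip(row, weights)) for row in features]


def select_top(scores: Sequence[float], m: int) -> List[int]:
    """Return the indices of the m highest-scoring examples (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:m]


# Toy data (hypothetical): 4 examples with 2 features each,
# e.g. one similarity value and one diversity value per example.
phi = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.1, 0.1]]
w = [1.0, 0.5]

scores = score_examples(phi, w)
top2 = select_top(scores, 2)
print(top2)
```

Because divergence-style features are smaller for more similar examples, a learned weight for such a feature may well be negative; the sketch makes no assumption about the sign.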

Bayesian Optimization for data selection
As the learned measure S should be agnostic of the particular objective function J, we cannot use gradient-based methods for optimization. Similar to Tsvetkov et al. (2016), we use Bayesian Optimization (Brochu et al., 2010), which has emerged as an efficient framework to optimize any function. For instance, it has repeatedly found better settings of neural network hyperparameters than domain experts (Snoek et al., 2012).
Given a black-box function $f: \mathcal{X} \to \mathbb{R}$, Bayesian Optimization aims to find an input $\hat{x} \in \arg\min_{x \in \mathcal{X}} f(x)$ that globally minimizes $f$. For this, it requires a prior $p(f)$ over the function and an acquisition function $a_{p(f)}: \mathcal{X} \to \mathbb{R}$ that calculates the utility of any evaluation at any $x$.
Bayesian Optimization then proceeds iteratively. At iteration $t$, 1) it finds the most promising input $x_t \in \arg\max a_p(x)$ through numerical optimization; 2) it then evaluates the surrogate function $y_t \sim f(x_t) + \mathcal{N}(0, \sigma^2)$ on this input and adds the resulting data point $(x_t, y_t)$ to the set of observations $O_{t-1} = (x_j, y_j)_{j=1 \dots t-1}$; 3) finally, it updates the prior $p(f|O_t)$ and the acquisition function $a_{p(f|O_t)}$.
For data selection, the black-box function f looks as follows: 1) It takes as input a set of weights w that should be evaluated; 2) the training examples of all source domains are then scored and sorted according to Equation 1; 3) the model for the respective task T is trained on the top n samples; 4) the model is evaluated on the validation set according to the evaluation measure J and the value of J is returned.
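The four steps of the black-box function can be sketched as a closure that takes weights and returns the validation score. This is only an illustration under stated assumptions: the "model" is a hypothetical majority-class classifier, each pooled example is reduced to its label, and all names (`make_objective`, `train_majority`, `accuracy`) are invented for the example.

```python
from typing import Callable, List, Sequence


def make_objective(
    pool: Sequence[int],                  # toy pool: each example is just its label
    features: Sequence[Sequence[float]],  # one feature row per pooled example
    n: int,                               # number of examples to keep
    train: Callable[[List[int]], int],
    evaluate: Callable[[int], float],
) -> Callable[[Sequence[float]], float]:
    """Build f(w): score the pool, keep the top n, train, return validation J."""

    def objective(weights: Sequence[float]) -> float:
        # Steps 1-2: score every source example with the candidate weights and rank.
        scores = [sum(f * w for f, w in zip(row, weights)) for row in features]
        ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
        # Step 3: train the task model on the top n examples.
        model = train([pool[i] for i in ranked[:n]])
        # Step 4: return the evaluation measure on the validation set.
        return evaluate(model)

    return objective


# Hypothetical stand-in for a real model: predict the majority training label.
def train_majority(labels: List[int]) -> int:
    return max(sorted(set(labels)), key=labels.count)


validation_labels = [1, 1, 0]


def accuracy(predicted_label: int) -> float:
    hits = sum(1 for y in validation_labels if y == predicted_label)
    return hits / len(validation_labels)


objective = make_objective(
    pool=[1, 0, 1, 0],
    features=[[1.0], [0.0], [0.8], [0.1]],
    n=2,
    train=train_majority,
    evaluate=accuracy,
)
print(objective([1.0]), objective([-1.0]))
```

Since the optimizer is described as a minimizer while accuracy is to be maximized, a practical setup would presumably hand the optimizer the negated score; the sketch returns the raw value of J as in the description.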
Gaussian Processes (GP) are a popular choice for p(f ) due to their descriptive power (Rasmussen, 2006).
We use GPs with Monte Carlo acquisition and Expected Improvement (EI) (Močkus, 1974) as the acquisition function, as this combination has been shown to outperform comparable approaches (Snoek et al., 2012) (see footnote 1).
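For intuition, the closed-form Expected Improvement under a Gaussian posterior can be written in a few lines. This is a minimal sketch for the minimization setting, assuming the posterior mean `mu` and standard deviation `sigma` at a candidate point are already available from the GP; it does not reproduce the Monte Carlo treatment mentioned above, and the function name is an assumption.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for minimization under a Gaussian posterior N(mu, sigma^2).

    `best` is the lowest objective value observed so far.
    """
    if sigma <= 0.0:
        # No uncertainty left: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return (best - mu) * cdf + sigma * pdf


# A point predicted to be no better than the incumbent can still be worth
# evaluating if its predictive uncertainty is large.
print(expected_improvement(mu=0.0, sigma=1.0, best=0.0))
print(expected_improvement(mu=0.0, sigma=2.0, best=0.0))
```

The first term rewards points whose predicted value already beats the incumbent, while the second rewards uncertainty, which is what drives exploration.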

Features
Existing work on data selection for domain adaptation selects data based on its similarity to the target domain. Several measures have been proposed in the literature (Van Asch and Daelemans, 2010; Plank and van Noord, 2011; Remus, 2012), but have so far only been used in isolation.
Only selecting training instances with respect to the target domain also fails to account for instances that are richer and better suited for knowledge acquisition. For this reason, we consider, to our knowledge for the first time, whether intrinsic qualities of the training data accounting for diversity are of use for domain adaptation in NLP.

Similarity
We use a range of similarity metrics. Some metrics might be better suited for some tasks, while different measures might capture complementary information. We thus use the following measures as features for learning a more effective domain similarity metric.
We define similarity features over probability distributions in accordance with existing literature (Plank and van Noord, 2011). Let $P$ be the representation of a source training example, while $Q$ is the corresponding target domain representation. Let further $M = \frac{1}{2}(P + Q)$, i.e., the average distribution between $P$ and $Q$, and let $D_{KL}(P\|Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}$, i.e., the KL divergence between the two domains. We do not use $D_{KL}$ as a feature as it is undefined for distributions where some event $q_i \in Q$ has probability 0, which is common for term distributions. Our features are:
• Jensen-Shannon divergence (Lin, 1991): $\frac{1}{2}\left[D_{KL}(P\|M) + D_{KL}(Q\|M)\right]$. Jensen-Shannon divergence is a smoothed, symmetric variant of $D_{KL}$ that has been successfully used for domain adaptation (Plank and van Noord, 2011; Remus, 2012).
• Rényi divergence (Rényi, 1961): $\frac{1}{\alpha - 1} \log\left(\sum_{i=1}^{n} \frac{p_i^{\alpha}}{q_i^{\alpha - 1}}\right)$. Rényi divergence reduces to $D_{KL}$ as $\alpha \to 1$. We set $\alpha = 0.99$ following Van Asch and Daelemans (2010).
• Bhattacharyya distance (Bhattacharya, 1943): $-\ln\left(\sum_{i} \sqrt{p_i q_i}\right)$.
• Cosine similarity (Lee, 2001): $\frac{P \cdot Q}{\|P\| \, \|Q\|}$. We can treat the distributions alternatively as vectors and consider geometrically motivated distance functions such as cosine similarity as well as the following.
• Euclidean distance (Lee, 2001): $\sqrt{\sum_{i} (p_i - q_i)^2}$.
We consider three different representations for calculating the above domain similarity measures:
• Term distributions (Plank and van Noord, 2011): $t \in \mathbb{R}^{|V|}$ where $t_i$ is the probability of the $i$-th word in the vocabulary $V$.
• Topic distributions (Plank and van Noord, 2011): $t \in \mathbb{R}^{n}$ where $t_i$ is the probability of the $i$-th topic as determined by an LDA model (Blei et al., 2003) trained on the data and $n$ is the number of topics.
• Word embeddings (Mikolov et al., 2013): a weighted sum of the word embeddings in the document, where $n$ is the number of words with embeddings in the document, $v_{w_i}$ is the pre-trained embedding of the $i$-th word, $p(w_i)$ its probability, and $a$ is a smoothing factor used to discount frequent probabilities. A similar weighted sum has recently been shown to outperform supervised approaches for other tasks (Arora et al., 2017). As embeddings may be negative, we use them only with the latter three geometric features above.
Footnote 1: We also experimented with FABOLAS (Klein et al., 2017), but found its ability to adjust the training set size during optimization to be inconclusive for our relatively small training sets.
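The similarity features named above can be sketched in plain Python using their textbook definitions. This is an illustrative sketch rather than the authors' implementation: the function names are assumptions, inputs are assumed to be equal-length lists that each sum to 1 (for the divergences), and the Rényi helper assumes $0 < \alpha < 1$ and overlapping support.

```python
import math
from typing import Sequence

Vec = Sequence[float]


def kl(p: Vec, q: Vec) -> float:
    """KL divergence; terms with p_i = 0 contribute 0. Requires q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jensen_shannon(p: Vec, q: Vec) -> float:
    """Symmetric, always finite: both arguments are compared against their average."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def renyi(p: Vec, q: Vec, alpha: float = 0.99) -> float:
    """Renyi divergence for 0 < alpha < 1, written as p^alpha * q^(1 - alpha)."""
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)


def bhattacharyya(p: Vec, q: Vec) -> float:
    """Negative log of the Bhattacharyya coefficient."""
    return -math.log(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)))


def cosine(p: Vec, q: Vec) -> float:
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    return dot / (norm_p * norm_q)


def euclidean(p: Vec, q: Vec) -> float:
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


# Toy term distributions over a 3-word vocabulary (hypothetical values).
source = [0.5, 0.5, 0.0]
target = [0.0, 0.5, 0.5]
print(jensen_shannon(source, target), cosine(source, target))
```

Note that plain KL divergence would fail on this toy pair because the target assigns probability 0 to the first word, which is the situation that motivates excluding $D_{KL}$ as a feature.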

Diversity
For each training example, we calculate its diversity based on the words in the example. Let $p_i$ and $p_j$ be probabilities of the word types $t_i$ and $t_j$ in the training data and $\cos(v_{t_i}, v_{t_j})$ the cosine similarity between their word embeddings. We employ measures that have been used in the past for measuring diversity (Tsvetkov et al., 2016):
• Number of word types: #types.

Tasks, datasets, and models
We evaluate our approach on three tasks: sentiment analysis, part-of-speech (POS) tagging, and dependency parsing. We use the n examples with the highest score as determined by the learned data selection measure for training our models. We show statistics for all datasets in Table 1.
Sentiment analysis. For sentiment analysis, we evaluate on the Amazon reviews dataset (Blitzer et al., 2006). We use tf-idf-weighted unigram and bigram features and a linear SVM classifier. We set the vocabulary size to 10,000 and the number of training examples to n = 1600 to conform with existing approaches (Bollegala et al., 2011), and stratify the training set.
POS tagging. For POS tagging and parsing, we evaluate on the coarse-grained POS data (12 universal POS) of the SANCL 2012 shared task (Petrov and McDonald, 2012). Each domain, except for WSJ, contains around 2000-5000 labeled sentences and more than 100,000 unlabeled sentences. In the case of WSJ, we use its dev and test data as labeled samples and treat the remaining sections as unlabeled. We set n = 2000 for POS tagging and parsing to retain enough examples for the most-similar-domain baseline.
To evaluate the impact of model choice, we compare two models: a Structured Perceptron (in-house implementation with commonly used features pertaining to tags, words, case, prefixes, and suffixes) trained for 5 iterations with a learning rate of 0.2; and a state-of-the-art Bi-LSTM tagger (Plank et al., 2016) with word and character embeddings as input. We perform early stopping on the validation set with a patience of 2 and otherwise use the default hyperparameters provided by the authors.
Parsing. For parsing, we evaluate the state-of-the-art Bi-LSTM parser by Kiperwasser and Goldberg (2016) with default hyperparameters. We use the same domains as used for POS tagging, i.e., the dependency parsing data with gold POS as made available in the SANCL 2012 shared task.
Table 1: … (Petrov and McDonald, 2012) for POS tagging and parsing (below).

Training details
In practice, as feature values occupy different ranges, we have found it helpful to apply z-normalisation similar to Tsvetkov et al. (2016). We moreover constrain the weights w to [−1, 1].
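A minimal sketch of these two preprocessing steps is given below. It assumes column-wise normalisation with the population standard deviation and illustrates the weight constraint by clipping; in an actual Bayesian Optimization setup the constraint would more likely be expressed as search bounds. Function names and toy values are hypothetical.

```python
import statistics
from typing import List, Sequence


def z_normalise(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Shift each feature column to zero mean and scale it to unit variance."""
    columns = list(zip(*features))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column carries no ranking information, so map it to 0.
        [(v - m) / s if s > 0.0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in features
    ]


def clip_weights(weights: Sequence[float],
                 low: float = -1.0, high: float = 1.0) -> List[float]:
    """Force every weight into the interval [low, high]."""
    return [min(max(w, low), high) for w in weights]


# Toy features on very different scales (hypothetical values).
raw = [[1.0, 10.0], [3.0, 10.0]]
print(z_normalise(raw))
print(clip_weights([1.5, -2.0, 0.3]))
```

Normalising first keeps a feature with a large numeric range from dominating the linear score regardless of its learned weight.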
For each dataset, we treat each domain as target domain and all other domains as source domains. Similar to Bousmalis et al. (2016), we chose to use a small number (100) of target domain examples as validation set. We optimize each similarity measure using Bayesian Optimization with 300 iterations according to the objective measure J of each task (accuracy for sentiment analysis and POS tagging; LAS for parsing) with respect to the validation set of the corresponding target domain.
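The leave-one-domain-out protocol described above can be sketched as a small generator. The helper name is an assumption; the domain names are those of the Amazon reviews dataset used for sentiment analysis.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_domain_out(domains: Sequence[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (target, sources) pairs: each domain is the target exactly once."""
    for target in domains:
        sources = [d for d in domains if d != target]
        yield target, sources


amazon_domains = ["Books", "DVD", "Electronics", "Kitchen"]
splits = list(leave_one_domain_out(amazon_domains))
for target, sources in splits:
    print(target, "<-", sources)
```

Each split would then get its own optimization run, with the validation examples drawn from the target domain of that split.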
Unlabeled data is additionally used to calculate the representation of the target domain and to calculate the source domain representation for the most-similar-domain baseline. We train an LDA model (Blei et al., 2003) with 50 topics and 10 iterations for topic distribution-based representations and use GloVe embeddings (Pennington et al., 2014) trained on 42B tokens of Common Crawl data for word embedding-based representations.
For sentiment analysis, we conduct 10 runs of each feature set for every domain and report mean and variance. For POS tagging and parsing, we observe that variance is low and perform one run while retaining random seeds for reproducibility.
Table 2: Accuracy scores for data selection for sentiment analysis domain adaptation on the Amazon reviews dataset (Blitzer et al., 2006). Best: bold; second-best: underlined.

Baselines and features
We compare the learned measures to three baselines: i) a random baseline that randomly selects n training samples from all source domains; ii) a baseline that selects the top n individual examples according to Jensen-Shannon divergence (JS - examples); and iii) a most-similar-domain baseline that selects n examples from the source domain most similar to the target domain.
We optimize data selection using Bayesian Optimization with every feature set: similarity features respectively based on i) word embeddings, ii) term distributions, and iii) topic distributions; and iv) diversity features. In addition, we investigate how well different representations help each other by using similarity features with the two best-performing representations, term distributions and topic distributions. Finally, we explore whether diversity and similarity-based features complement each other by in turn using each similarity-based feature set together with diversity features.

Results
Sentiment analysis. We show results for sentiment analysis in Table 2. First of all, the baselines show that the sentiment review domains are clearly delimited. Adapting between two similar domains such as Books and DVD is more productive than adaptation between dissimilar domains, e.g., Books and Electronics, as shown in previous work. This explains the strong performance of the most-similar-domain baseline. In contrast, selecting individual examples based on a domain similarity measure performs only as well as chance. Thus, when domains are more clear-cut, selecting from the closest domain is a stronger baseline than selecting from the entire pool of source data.
If we learn a data selection measure using Bayesian Optimization, we are able to outperform the baselines with almost all feature sets. Performance gains are considerable for all domains with individual feature sets (term similarity, word embeddings similarity, diversity and topic similarity), except for Books, where improvements for some single feature sets are smaller. Term distributions and topic distributions are the best-performing representations for calculating similarity, with term distributions performing slightly better across all domains. Combining term distribution-based and topic distribution-based features only provides marginal gains over the individual feature sets, demonstrating that most of the information is contained in the similarity features rather than the representations. Diversity features perform comparably to the best similarity features and outperform them on two domains. Furthermore, the combination of diversity and similarity features yields another sizable gain of around 1 percentage point for almost all domains over the best similarity features, which shows that diversity and similarity features capture complementary information. Term distribution and topic distribution-based similarity features in conjunction with diversity features finally yield the best performance, outperforming the baselines by 2-6 points in absolute accuracy.
Finally, we compare data selection to training on all available source data (in this setup, 6,000 instances). The result complements the findings of the most-similar baseline: as domains are dissimilar, training on all available sources is detrimental.
Table 3: Results for data selection for part-of-speech tagging and parsing domain adaptation on the SANCL 2012 shared task dataset (Petrov and McDonald, 2012). POS: Part-of-speech tagging. Pars: Parsing. POS tagging models: Structured Perceptron (P); Bi-LSTM tagger (B) (Plank et al., 2016). Parsing model: Bi-LSTM parser (BIST) (Kiperwasser and Goldberg, 2016). Evaluation metrics: Accuracy (POS tagging); Labeled Attachment Score (parsing). Best: bold; second-best: underlined.
POS tagging. Results for POS tagging are given in Table 3. Using Bayesian Optimization, we are able to outperform the baselines with almost all feature sets, except for a few cases (e.g., diversity and word embeddings similarity, topic and term distributions). Overall, term distribution-based similarity emerges as the most powerful individual feature. Combining it with diversity does not prove as beneficial as in the sentiment analysis case, but it often yields the second-best results.
Notice that for POS tagging/parsing, in contrast to sentiment analysis, the most-similar domain baseline is not effective: it often performs only as well as chance, or even hurts. In contrast, the baseline that selects instances (JS - examples) rather than a domain performs better. This makes sense, as in SA topically closer domains express sentiment in more similar ways, while for POS tagging having more varied training instances is intuitively more beneficial. In fact, when inspecting the domain distribution of our approach, we find that the best SA model chooses more instances from the closest domain, while for POS tagging instances are more balanced across domains. This suggests that the Web treebank domains are less clear-cut. In fact, training a model on all sources, which provides considerably more and more varied data (in this setup, 14-17.5k training instances), is beneficial. This is in line with findings in machine translation (Mirkin and Besacier, 2014), which show that similarity-based selection works best if domains are very different.
Results are thus less pronounced for POS tagging, and we leave experimenting with larger n for future work.
To gain some insight into the optimization procedure, Figure 1 shows the development accuracy for the Structured Perceptron for an example domain. The top-right and bottom graphs show the hypothesis space exploration of Bayesian Optimization for different single feature sets, while the top-left graph displays the overall best dev accuracy for different features. We observe again that term similarity is among the best feature sets and results in a larger explored space (more variance), in contrast to the diversity features, whose development accuracy increases less and results in an overall less explored space. Exploration plots for other features/models look similar.
Table 4: Accuracy scores for cross-model transfer of learned data selection weights for part-of-speech tagging from Structured Perceptron ($P_{proxy}$) to Bi-LSTM tagger (B) (Plank et al., 2016) on the SANCL 2012 shared task dataset (Petrov and McDonald, 2012). Data selection weights are learned using model $M_S$; Bi-LSTM tagger (B) is then trained using the learned weights. Better than baselines: underlined.
Parsing. The results for parsing are given in Table 3. Diversity features are stronger than for POS tagging and outperform the baselines for all except the Reviews domain. Similarly to POS tagging, term distribution-based similarity as well as its combination with diversity features yield the best results across most domains.

Transfer across models
In addition, we are interested in how well the metric learned for one target domain transfers to other settings. We first investigate its ability to transfer to another model. In practice, a metric can be learned using a model that is cheap to evaluate and serves as proxy for a state-of-the-art model, in a way similar to uptraining (Petrov et al., 2010). For this, we employ the data selection features learned using the Structured Perceptron model for POS tagging and use them to select data for the Bi-LSTM tagger. The results in Table 4 indicate that cross-model transfer is indeed possible, with most transferred feature sets achieving similar results or even outperforming features learned with the Bi-LSTM. In particular, transferred diversity significantly outperforms its in-model equivalent. This is encouraging, as it allows a data selection metric to be learned using less complex models.

Transfer across domains
We explore whether data selection parameters learned for one target domain transfer to other target domains. For each domain, we use the weights with the highest performance on the validation set and use them for data selection with the remaining domains as target domains. We conduct 10 runs for the best-performing feature sets for sentiment analysis and report the average accuracy scores in Table 5 (for POS tagging, see Table 6).
Table 5: Accuracy scores for cross-domain transfer of learned data selection weights on Amazon reviews (Blitzer et al., 2006). $\mathcal{D}_S$: target domain used for learning metric $S$. B: Book. D: DVD. E: Electronics. K: Kitchen. Sim: term distribution-based similarity. Div: diversity. Best per feature set: bold. In-domain results: gray. SDAMS (Wu and Huang, 2016) listed as comparison.
The transfer of the weights learned with Bayesian Optimization is quite robust in most cases. Feature sets like Similarity or Diversity trained on Books outperform the strong JS - D baseline in all 6 cases, for Electronics and Kitchen in 4/6 cases (off-diagonals for box 2 and 3 in Table 5). In some cases, the transferred weights even outperform the data selection metric learned for the respective domain, such as on D->E with sim and sim+div features and by almost 2 pp on E->D.
Table 6: Accuracy scores for cross-domain transfer of learned data selection weights for part-of-speech tagging with the Structured Perceptron model on the SANCL 2012 shared task dataset (Petrov and McDonald, 2012). $\mathcal{D}_S$: target domain used for learning metric $S$. Best: bold. In-domain results: gray.
Transferred similarity+diversity features mostly achieve higher performance than other feature sets, but the higher number of parameters runs the risk of overfitting to the domain as can be observed with two instances of negative transfer with sim+div features.
As a reference, we also list the performance of the state-of-the-art multi-domain adaptation approach (Wu and Huang, 2016), which shows that task-independent data selection is in fact competitive with a task-specific, heuristic state-of-the-art domain adaptation approach. In fact our transferred similarity+diversity feature (E->D) outperforms the state-of-the-art (Wu and Huang, 2016) on DVD. This is encouraging as previous work (Remus, 2012) has shown that data selection and domain adaptation can be complementary.

Transfer across tasks
We finally investigate whether data selection is task-specific or whether a metric learned on one task can be transferred to another one. For each feature set, we use the learned weights for each domain in the source task (for sentiment analysis, we use the best weights on the validation set; for POS tagging, we use the Structured Perceptron model) and run experiments with them for all domains in the target task (see footnote 7). We report the averaged accuracy scores for transfer across all tasks in Table 7. Transfer is productive between related tasks, i.e., POS tagging and parsing results are similar to those obtained with data selection learned for the particular task. We observe large drops in performance for transfer between unrelated tasks, such as sentiment analysis and POS tagging, which is expected since these are very different tasks. Between related tasks, the combination of similarity and diversity features achieves the most robust transfer and outperforms the baselines in both cases. This suggests that even in the absence of target task data, we only require data of a related task to learn a successful data selection measure.
Footnote 7: E.g., for SA->POS, for each feature set, we obtain one set of weights for each of 4 SA domains, which we use to select data for the 6 POS domains, yielding 4 · 6 = 24 results.
Table 7: … Table 5. In-task results: gray. Better than base: underlined.

Related work
Most prior work on data selection for transfer learning focuses on phrase-based machine translation. Typically language models are leveraged via perplexity or cross-entropy scoring to select target data (Moore and Lewis, 2010; Axelrod et al., 2011;Duh et al., 2013;Mirkin and Besacier, 2014). A recent study investigates data selection for neural machine translation (van der Wees et al., 2017). Perplexity was also used to select training data for dependency parsing (Søgaard, 2011), but has been found to be less suitable for tasks such as sentiment analysis (Ruder et al., 2017). In general, there are fewer studies on data selection for other tasks, e.g., constituent parsing (McClosky et al., 2010), dependency parsing (Plank and van Noord, 2011;Søgaard, 2011) and sentiment analysis (Remus, 2012). Work on predicting task accuracy is related, but can be seen as complementary (Ravi et al., 2008;Van Asch and Daelemans, 2010).
Many domain similarity metrics have been proposed. Proxy A distance has been shown to measure the adaptability between two domains in order to determine examples for annotation. Van Asch and Daelemans (2010) find that Rényi divergence outperforms other metrics in predicting POS tagging accuracy, while Plank and van Noord (2011) observe that topic distribution-based representations with Jensen-Shannon divergence perform best for data selection for parsing. Remus (2012) applies Jensen-Shannon divergence to select training examples for sentiment analysis. Finally, Wu and Huang (2016) propose a similarity metric based on a sentiment graph. We test previously explored similarity metrics and complement them with diversity.
Interest has very recently emerged in curriculum learning (Bengio et al., 2009). It is inspired by human active learning and provides easier examples at initial learning stages (e.g., through curriculum strategies such as growing vocabulary size). Curriculum learning employs a range of data metrics, but aims at altering the order in which the entire training data is selected, rather than selecting data. In contrast to our work, curriculum learning is mostly aimed at speeding up learning, while we focus on learning metrics for transfer learning. Other related work in this direction includes using Reinforcement Learning to learn what data to select during neural network training (Fan et al., 2017).
There is a long history of research in adaptive data selection, with early approaches grounded in information theory using a Bayesian learning framework (MacKay, 1992). It has also been studied extensively as active learning (El-Gamal, 1991). Curriculum learning is related to active learning (Settles, 2012), whose view is different: active learning aims at finding the most difficult instances to label, examples typically close to the decision boundary. Confidence-based measures are prominent, but as such are less widely applicable than our model-agnostic approach.
The approach most similar to ours is by Tsvetkov et al. (2016) who use Bayesian Optimization to learn a curriculum for training word embeddings. Rather than ordering data (in their case, paragraphs), we use Bayesian Optimization for learning to select relevant training instances that are useful for transfer learning in order to prevent negative transfer (Rosenstein et al., 2005). To the best of our knowledge there is no prior work that uses this strategy for transfer learning.

Conclusion
We propose to use Bayesian Optimization to learn data selection measures for transfer learning. Our results outperform existing domain similarity metrics on three tasks (sentiment analysis, POS tagging and parsing), and are competitive with a state-of-the-art domain adaptation approach. More importantly, we present the first study on the transferability of such measures, showing promising results for porting them across models, domains and related tasks.