Evaluating the Values of Sources in Transfer Learning

Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To transfer a model effectively, it is essential to understand the values of the sources. In this paper, we develop SEAL-Shap, an efficient source valuation framework for quantifying the usefulness of the sources (e.g., domains or languages) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources but also that the source values match the intuitive source-target similarity.


Introduction
Transfer learning has been widely used in learning models for low-resource scenarios by leveraging the supervision provided in data-rich source corpora. It has been applied to NLP tasks in various settings including domain adaptation (Blitzer et al., 2007;Ruder and Plank, 2017), cross-lingual transfer (Täckström et al., 2013;Wu and Dredze, 2019), and task transfer (Liu et al., 2019b;Vu et al., 2020).
A common transfer learning setting is to train a model on a set of sources and then evaluate it on the corresponding target (Yao and Doretto, 2010; Yang et al., 2020).1 However, not every source corpus contributes equally to the transfer model. Some of them may even cause a performance drop (Ghorbani and Zou, 2019; Lin et al., 2019). Therefore, it is essential to understand the value of each source in transfer learning, not only to achieve good transfer performance but also to analyze the source-target relationships.

1 In this paper, we focus on two transfer learning scenarios: 1) cross-lingual and 2) cross-domain. We train a model on a set of source corpora and evaluate on a target corpus, where each "corpus" refers to the corresponding domain or language.

Figure 1: SEAL-Shap estimates the value of each source corpus by the average marginal contribution of that particular source corpus to every possible subset of the source corpora. Each block inside SEAL-Shap denotes a possible subset, and the marginal contribution is derived from the difference of transfer results when trained with and without the corresponding source. Based on the source values, we select a subset of source corpora that achieves high transfer accuracy.
Nonetheless, determining the value of a source corpus is challenging, as it is affected by many factors, including the quality of the source data, the amount of the source data, and the difference between source and target at the lexical, syntactic, and semantic levels (Ahmad et al., 2019; Lin et al., 2019). The current source valuation or ranking methods are often based on single-source transfer performance (McDonald et al., 2011; Lin et al., 2019; Vu et al., 2020) or leave-one-out approaches (Tommasi and Caputo, 2009; Li et al., 2016; Feng et al., 2018; Rahimi et al., 2019). They do not consider combinations of the sources. Consequently, they may identify the best single source corpus effectively, but their top-k ranked source corpora may achieve limited gains in transfer results.
In this paper, we introduce SEAL-Shap (Source sElection for trAnsfer Learning via Shapley value), a source valuation framework 2 (see Fig 1) based on the Shapley value (Shapley, 1952; Roth, 1988) in cooperative game theory. SEAL-Shap adopts the notion of the Shapley value to understand the contribution of each source by computing the approximate average marginal contribution of that particular source to every possible subset of the sources.
The Shapley value is the unique contribution distribution scheme that satisfies necessary conditions for data valuation such as fairness and additivity (Dubey, 1975; Jia et al., 2019a,b). Since many model explanation methods, including the Shapley value, are computationally costly (Van den Broeck et al., 2021), Ghorbani and Zou (2019) propose, in the different context of feature and data valuation in machine learning, to use an approximate Shapley value to estimate the feature or data values.
However, the existing approximation methods for estimating Shapley values are not scalable for NLP applications. NLP models are often large (e.g., BERT (Devlin et al., 2019)) and NLP transfer learning usually assumes a large amount of source data. To deal with the scalability issue, we propose a new sampling scheme, a truncation method, and a caching mechanism to efficiently approximate the source Shapley values.
We evaluate the effectiveness of SEAL-Shap under various applications in quantifying the usefulness of the source corpora and in selecting potential transfer sources. We consider two settings of source valuation or selection: (1) where a small target corpus is available; and (2) where we only have access to the linguistic or statistical features of the target, such as language distance to the sources, typological properties, and lexical overlap. For the first setting, we use the small target data as the validation set to measure the values of the sources w.r.t. the target. For the second setting, we follow Lin et al. (2019) to train a source ranker based on SEAL-Shap and the available features.
We conduct extensive experiments in both (zero-shot) cross-lingual and cross-domain transfer settings on three NLP tasks, including POS tagging, sentiment analysis, and natural language inference (NLI), with different model architectures (BERT and BiLSTM). In a case study on cross-lingual transfer learning, we show that the source language values are correlated with language family and language distance, indicating that our source values are meaningful and follow the intuitive source-target relationships. Lastly, we analyze the approximation correctness and the run-time improvement of our source valuation framework SEAL-Shap.

2 Our source code is available at https://github.com/rizwan09/NLPDV/

Source Valuation Framework
We propose SEAL-Shap, a source valuation framework. We start with the setting where we have only one target and multiple sources. We denote the target corpus by V and the corresponding set of source corpora by D = {D_1, ..., D_m}. Our goal is to quantify the value Φ_j of each source corpus D_j to the transfer performance on V and to explain model behaviors. Once the source values are measured, we can then develop a method to select either all the sources or a subset of sources (i.e., a subset of D) that realizes good transfer accuracy on V. Below, we first review the data Shapley value and its adaptation for transfer learning. Then, we describe how SEAL-Shap efficiently quantifies Φ_j and how to use it to select a subset of sources for model transfer.

Background: Data Shapley Value
The Shapley value is designed to measure individual contributions in cooperative game theory and has been adapted for data valuation in machine learning (Ghorbani and Zou, 2019; Jia et al., 2019a,b). In the transfer learning setting, on a target corpus V, let Score(C_Ω, V) represent the transfer performance of a model C trained on a set of source corpora Ω.3 The Shapley value Φ_j is defined as the average marginal contribution of a source corpus D_j to every possible subset of the corpora D:

\Phi_j = \frac{1}{m} \sum_{\Omega \subseteq D \setminus \{D_j\}} \binom{m-1}{|\Omega|}^{-1} \left( \mathrm{Score}(C_{\Omega \cup \{D_j\}}, V) - \mathrm{Score}(C_{\Omega}, V) \right)

TMC-Shap for Transfer Learning: Computing the exact source-corpus Shapley value described above is computationally difficult, as it involves evaluating the performance of transfer models trained on all possible combinations of the source corpora. Hence, Ghorbani and Zou (2019) propose to approximate the evaluation by a truncated Monte Carlo method. Given the target corpus V and a set of source corpora D, for each epoch, a source training data set Ω ⊆ D is maintained and a random permutation π of D is performed (line 6 in Algorithm 1, discussed in Sec 2.2). The method then loops over every source corpus π_j in the ordered list π and computes its marginal contribution by evaluating how much the performance improves by adding π_j to Ω: Score(C_{Ω∪π_j}, V) − Score(C_Ω, V). This process is repeated for multiple rounds, and the average of all marginal contributions associated with a particular source corpus is taken as its approximate Shapley value (line 18 in Algorithm 1). As the size of Ω increases, the marginal contribution of adding a new source corpus becomes smaller. Therefore, to reduce the computation, Ghorbani and Zou (2019) propose to truncate the computations at each epoch once the marginal contribution of adding a new source π_j is smaller than a user-defined threshold Tolerance (lines 10-11, 18 in Algorithm 1).4
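The truncated Monte Carlo procedure can be sketched in a few lines of Python (a minimal illustration, not the paper's implementation; `score` stands in for training a transfer model on a subset of sources and evaluating it on V, and all names are ours):

```python
import random

def tmc_shapley(sources, score, n_epochs=100, tolerance=0.01, v0=0.0):
    """Approximate source-corpus Shapley values by truncated Monte Carlo.

    sources:  list of source-corpus identifiers.
    score(s): transfer performance of a model trained on subset `s`.
    v0:       score of the empty set (the paper down-weights this via rho).
    """
    phi = {s: 0.0 for s in sources}     # running sums of marginal gains
    counts = {s: 0 for s in sources}
    full_score = score(tuple(sources))  # performance with all sources
    for _ in range(n_epochs):
        pi = random.sample(sources, len(sources))  # random permutation of D
        prev, omega = v0, []
        for s in pi:
            if abs(full_score - prev) < tolerance:
                gain = 0.0  # truncate: remaining marginals are ~0
            else:
                omega.append(s)
                cur = score(tuple(omega))
                gain = cur - prev
                prev = cur
            phi[s] += gain
            counts[s] += 1
    return {s: phi[s] / counts[s] for s in sources}
```

For an additive game (each source contributes a fixed amount), the estimate recovers the exact per-source contributions regardless of the sampled permutations.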

SEAL-Shap
Sampling: Instead of training on the full set Ω at each step, we construct a smaller training set T from Ω by sampling training instances from each source corpus Ω_x with a user-defined sample rate η. Then, we train the model on T (lines 14-15 in Algorithm 1). The quantitative effectiveness of this technique is discussed in Sec 4.4, and the impact of different sampling rates is presented in Fig 5.

Truncation: As discussed in Sec 2.1, at each epoch, Ghorbani and Zou (2019) truncate the computations once a marginal contribution becomes small while looping over the ordered list π of that epoch, typically for the last few sources in π. On the other hand, at the beginning of each epoch, when computing the marginal contribution of adding the first source corpus π_1 to an empty Ω, the contribution is computed as the performance gap between a model trained on π_1 and a random baseline model without any training. The performance of a random model (v_0) is usually low, and hence the marginal contribution at the first step is generally high. As the scale of the marginal contributions at the first step is drastically different from later steps, it leads TMC-Shap to converge slowly. Hence, to restrict the variance of the marginal contributions, we down-weight the marginal contribution of the first step by setting v_0 = ρ, where ρ is a hyper-parameter6 indicating the baseline performance of a model (lines 7, 18 in Algorithm 1).
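The per-source sampling step can be sketched as follows (a hypothetical helper, not the paper's code; `omega` maps source names to lists of training instances, and `eta` plays the role of the sample rate η):

```python
import random

def sample_training_set(omega, eta=0.3, seed=None):
    """Build a smaller training set T by sampling a fraction `eta`
    of the instances from every source corpus in `omega`."""
    rng = random.Random(seed)
    T = {}
    for name, instances in omega.items():
        k = max(1, int(len(instances) * eta))  # keep at least one instance
        T[name] = rng.sample(instances, k)
    return T
```

Because a fresh sample is drawn on every cache miss, many epochs collectively cover a wide range of instances from each source, as noted below.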
Caching: When computing the source Shapley values, we have to repeatedly evaluate the performance of the model on different subsets of source corpora. Sometimes, we may encounter subsets that we have evaluated before. For example, consider a set of source corpora D = {D_1, D_2, D_3} whose Shapley values we evaluate through two permutations: π^1 = [D_3, D_1, D_2] and π^2 = [D_1, D_3, D_2]. When we compute the marginal contribution of the last source corpus D_2, in both cases the training set is Ω = {D_1, D_3}. That is, if we cache the result of Score(C_{D_1∪D_3}), then we can reuse the score. We implement this cache mechanism in lines 1, 13, 16, and 17 in Algorithm 1. With these optimization techniques, we improve the computation time by about 2x (see Sec 4.4), which enables us to apply the framework to NLP transfer learning. Note that whenever an Ω causes a cache miss, we sample a new set of instances from each source Ω_x, as discussed above in this section (lines 13-14 in Algorithm 1). Thus, given a reasonably large number of epochs, our approach performs sampling many times and, in aggregate, evaluates a wide number of samples from each source.
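The cache can be implemented by memoizing the score function on the *set* of sources, since the two permutations above yield the same Ω (a minimal sketch under our own naming, not the paper's code):

```python
def make_cached_score(score):
    """Wrap a costly score function so that repeated evaluations of the
    same subset of source corpora hit a cache. Order within the subset
    does not matter, so a frozenset is the natural cache key."""
    cache = {}
    def cached(subset):
        key = frozenset(subset)
        if key not in cache:
            cache[key] = score(subset)  # cache miss: evaluate once
        return cache[key]
    cached.cache = cache  # exposed for inspection
    return cached
```

With this wrapper, evaluating [D_3, D_1, D_2] and then [D_1, D_3, D_2] trains at most one model per distinct subset.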

SEAL-Shap for Multiple Targets
Many applications require evaluating the values of a set of sources with respect to a set of targets. For example, under the zero-shot transfer learning setting, we assume a model is trained purely on the source corpora without using any target data. Consequently, the same trained model can be evaluated on multiple target corpora. With this intuition, whenever the model is trained on a new training set Ω, SEAL-Shap evaluates it on all the target corpora and caches all the results accordingly.

Source Values without Evaluation Corpus
In the discussion above, we assume a small annotated target corpus is available and can be used to evaluate the transfer performance. However, in some scenarios, only some linguistic or statistical features of the sources and targets, such as language distance and word overlap, are available. Lin et al. (2019) show that by using these features, we can train a ranker that sorts the sources for unseen targets by predicting their values. In the following, we extend their ranker by incorporating SEAL-Shap.
Given the set of training corpora D and the actual target corpus V, we iteratively consider each training corpus D_j as the target and the remaining m−1 corpora as the sources. We compute the corresponding source values Y_D^{D_j} with SEAL-Shap. Now, w.r.t. the target D_j, the linguistic or statistical features of the source corpora (e.g., language distance from the target, lexical overlap between the corresponding source and the target) are computed as X_D^{D_j} = F_j(D ∖ {D_j}), where F_j denotes the source feature generator function for the corresponding target D_j. This feature vector of the source corpora (X_D^{D_j}) is a training input and their value vector (Y_D^{D_j}) is the corresponding training output for the ranker. We repeat this for each training corpus and generate the respective training inputs and outputs for the ranker. Once trained, for the actual target V and the source corpora D, the ranker can predict the values of the source corpora Y_D^V based only on the linguistic source features X_D^V.
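The core idea of this step can be sketched as a regression from source features X to SEAL-Shap values Y, followed by prediction for the unseen target. Lin et al. (2019) use a gradient-boosted ranker; we substitute ordinary least squares here purely for illustration, and all names are ours:

```python
import numpy as np

def train_ranker(X_train, Y_train):
    """Fit a linear ranker mapping source features to source values.
    X_train: rows of per-source feature vectors; Y_train: their values."""
    X = np.asarray(X_train, dtype=float)
    Y = np.asarray(Y_train, dtype=float)
    # append a bias column and solve the least-squares problem
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], Y, rcond=None)
    return w

def predict_values(w, X):
    """Predict source values for a new target from its source features."""
    X = np.asarray(X, dtype=float)
    return np.c_[X, np.ones(len(X))] @ w
```

In use, each held-out training corpus D_j contributes one (X, Y) pair; ranking the predicted values for the actual target V then orders its candidate sources.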

Source Corpora Selection by SEAL-Shap
The source values computed in Sec 2.2-2.4 estimate the usefulness of the corresponding transfer sources and can be used to identify the potential sources that lead to good transfer performance. We select the potential source corpora in two ways. (i) Top-k: We simply sort the sources based on their values and select the user-defined top-k sources. (ii) Threshold: When an annotated evaluation dataset in the target corpus V is available, after computing the source values, we empirically set a threshold θ and select every source whose value is higher than θ. On that evaluation target corpus, we tune θ and set it to the value for which the corresponding transfer model achieves the best performance.
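The threshold variant can be sketched as a simple search over candidate values of θ on the evaluation target corpus (a hypothetical helper; `evaluate` stands in for training on the selected subset and scoring on V):

```python
def select_by_threshold(values, evaluate, thetas):
    """values:   dict mapping each source to its estimated value.
    evaluate(s): dev-set transfer performance when training on subset s.
    thetas:      candidate thresholds to try.
    Returns the (subset, performance, theta) with the best dev score."""
    best = (None, float("-inf"), None)
    for theta in thetas:
        subset = tuple(s for s, v in values.items() if v > theta)
        if not subset:
            continue  # skip thresholds that discard every source
        perf = evaluate(subset)
        if perf > best[1]:
            best = (subset, perf, theta)
    return best
```

Including a very low θ among the candidates recovers the all-sources baseline, so the search never does worse than using all of D on the dev set.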

Experimental Settings
We conduct experiments on zero-shot cross-lingual and cross-domain transfer settings. Models are trained only on the source languages/domains and directly applied to target languages/domains. For cross-lingual NLI, we use the XNLI dataset (Conneau et al., 2018), which covers 15 different languages. XNLI is a 3-way classification task (entailment, neutral, and contradiction).

Hyper-parameter Tuning: For all BERT models, we tune the learning rate, batch size, and number of epochs. We also tune the number of epochs nepoch in Algorithm 1, the threshold SEAL-Shap value θ, and the initial score ρ. Details are in Appendix K.

Results and Discussion
In the following, we first verify SEAL-Shap is an effective tool for source valuation. Then, we evaluate the source values when an evaluation target corpus is unavailable. In Sec 4.3, we interpret the relations between sources and targets based on the SEAL-Shap values. Finally, we analyze our method with comprehensive ablation studies.

Evaluating Source Valuation
We assess our source valuation approach in comparison to the following baselines: (i) Baseline-s: source values are based on single-source transfer performance. (ii) Leave-one-out (LOO): source values are based on how much transfer performance we lose if we train the model on all the sources except the corresponding one. (iii) Baseline-r: a random baseline that assigns random values to sources.8 (iv) Greedy DFS: the top-1 ranked source is the same as that of Baseline-s; next, it selects as top-2 the remaining source that gives the best transfer result along with the top-1, and so on. (v) Lang-Dist: (if available) sources in reverse order of target-source language distance (Ahmad et al., 2019).9

Balancing Source Corpora: In the experiments, our focus is to understand the values of the sources. For some datasets, the sizes of the source corpora are very different. For example, in UD Treebank, the numbers of instances in Czech and Turkish are 69k and 3.5k, respectively. Since data size is an obvious factor, we conduct experiments on balanced data to reduce the influence of data size in the analysis. We sub-sample the source corpora to ensure their sizes are similar. Specifically, for the cross-domain NLI task, we sample 20k instances for each source.

Table 1: Performance on universal POS tagging when using each language as the target language and the rest as source languages. '*', '$', and '†' denote that the SEAL-Shap model statistically significantly outperforms All Sources, Baseline-r, and Baseline-s, respectively, using a paired bootstrap test with p ≤ 0.05. "en" refers to the only-source ("en") results in Wu and Dredze (2019).
For the others, we sub-sample each source such that the size of the corpus is the same as the smallest one in the dataset. However, our approach can handle both balanced and unbalanced data, and the source values lead to similar conclusions (e.g., see Fig 5).
Results: We first compare these methods by selecting the top-k sources ranked by each approach and reporting the corresponding transfer performance. With k = 3, we plot the corresponding transfer results and the running time for valuation in Fig 2. As mentioned in Sec 1, the relatively strong Baseline-s can select the best-performing top-1 source, but with top-2 and top-3 sources, the performance drops on cross-domain sentiment analysis and cross-lingual POS tagging (see Fig 2(c) and 2(a)), while our approach shows a consistent gain on all of these tasks and achieves the best performance with top-3 sources. Appendix I plots the results with higher k. Next, as in Sec 2.5, we tune a threshold θ and either select all the sources as useful or a smaller subset of m sources (i.e., m < |D|) whose SEAL-Shap values are higher than θ. In what follows, we compare the model performance of these m sources selected by SEAL-Shap with the same top-m sources ranked by the aforementioned baseline methods. Being relatively weak or slow, LOO, Lang-Dist, and Greedy DFS are not reported further. Instead, we consider another strong baseline, All Sources, which uses all the source corpora D. This is a strong baseline as it is generally trained on more source-corpus instances.

Cross-Lingual POS Tagging
We evaluate the source selection results on zero-shot cross-lingual POS tagging in Table 1. Among the 31 target languages, in 21 of them, SEAL-Shap selects a small subset of source corpora. From the table, overall, SEAL-Shap selects source corpora with high usefulness for training the model, and except for a few cases, the model consistently outperforms all the baselines by more than 0.5% in average token accuracy. In 13 of them, the improvement is statistically significant by a paired bootstrap test. The gap is especially high for English, Czech, and Hindi. These results demonstrate that SEAL-Shap is capable of both quantifying source values and selecting sources. We report the full results on the dev and test sets of all target languages in Appendix M and N, respectively. For each row in Table 1, the number of selected sources is reported in Appendix S.

Cross-Domain POS Tagging
In general, Baseline-s source values match ours. However, SEAL-Shap significantly outperforms Baseline-r in all 5 cases and All Sources twice. It even outperforms MMD and RENYI (Liu et al., 2019a) on Newsgroups (N), Reviews (R), and Weblogs (WB), although they select source data at the instance level and use additional resources.

Cross-Lingual NLI: In Table 3, we show the XNLI results on 8 target languages where SEAL-Shap selects a small subset of source corpora. Among them, in 3 languages, Baseline-r marginally surpasses ours. However, in the 5 other languages, SEAL-Shap outperforms all the baselines with a clear margin, especially on Bulgarian and Vietnamese, with about 1% better accuracy (full results in Appendix E).

Cross-Domain NLI: Next, we evaluate SEAL-Shap on the modified GLUE dataset in Table 5. SEAL-Shap outperforms Baseline-s once and the other baselines in all cases. Its highest performance improvement is on QNLI, where it outperforms the others by 4%.

Cross-Domain Sentiment Analysis: Among the 13 target domains in the multi-domain sentiment analysis dataset, SEAL-Shap selects a small subset in 5 domains (full results in Appendix O). As shown in Table 4, SEAL-Shap achieves higher accuracy than all other baselines by a large margin and, in 4 cases, it is even better than Cai and Wan (2019), which uses unlabeled target data.
Our experimental evidence shows that SEAL-Shap is an effective tool for choosing useful transfer sources and can achieve higher transfer performance than other source valuation approaches.

Results without an Evaluation Corpus
We evaluate the effectiveness of SEAL-Shap in building a straightforward ranker that directly computes the source values without any evaluation target corpus (see Sec 2.4). We use the ranker in Lin et al. (2019) as the underlying ranking model. First, we show that the source values produced by the ranker are as good as those of SEAL-Shap computed with an annotated target dataset. We compare the transfer performances of the top-k sources based on the source values computed with and without the evaluation corpus. Then, we show that the ranker trained with SEAL-Shap is more effective than one trained with the existing single-source-based Baseline-s.
In cross-lingual POS tagging on UD Treebank, for each of the 31 target languages, we set aside that language and consider the remaining 30 languages as the training corpora. We then train the ranker as described in Sec 2.4 and compute the source values using it. For reference, we pass the evaluation target dataset and the 30 source languages to SEAL-Shap to compute their values on the evaluation dataset. With k = 3, we compare the transfer results of the top-k sources of these two methods in Fig 3. We also plot the results of the baseline ranker (Lin et al., 2019) that is trained with Baseline-s. Results show that the ranker's source values are similar to the source values estimated by SEAL-Shap with an annotated evaluation dataset, and the ranker also outperforms the baseline.

Figure 5: Source values by TMC-Shap and ours, on (a) XNLI, target: 'es', R < 10%; (b) mGLUE, target: MNLI-mm, R = 10-20%; (c) SANCL'12, target: wsj, R ∼50%. TMC-Shap uses the unbalanced full source corpora, whereas SEAL-Shap, which achieves similar source values, uses balanced and sampled source corpora. Even with a small sample rate (R), the source order is almost the same. A higher sampling rate typically gives a better approximation but leads to expensive runtime. In general, for a reasonably large corpus, 20-30% samples (more than a few thousand) are found sufficient to achieve a reasonable approximation.

Interpret Source Value by SEAL-Shap
In this section, we show that SEAL-Shap values provide a means to understand the usefulness of the transfer sources in cross-lingual and cross-domain transfer. We first analyze cross-lingual POS tagging. Following Ahmad et al. (2019), we use language family and word-order distance as reference distance metrics. We anticipate that languages in the same language family as the target and with smaller word-order distance from it are more valuable in multi-lingual transfer. We plot the SEAL-Shap values of source languages evaluated on two target languages, English ("en") and Hindi ("hi"), in Fig 4. On the x-axis, a common set of twenty different source languages are grouped into ten different language families and sorted by word-order distance from English. As the figure illustrates, Germanic and Romance languages have higher Shapley values when using English as the target language. The value gradually decreases for languages of other families as the word-order distance increases. For the target language Hindi, the trend is, in general, the opposite.

Figure 7: Similar SEAL-Shap value curves for two closely related XNLI targets, "en" and "fr". In XNLI, the source corpora are prepared by machine translation from "en". This data processing may affect the source values. The translation into "zh" being relatively better, its source values are higher than the others, although "zh" is different from both targets.
Analogously, in cross-domain NLI, we find that the correlation between QNLI and QQP is high, whereas between MNLI-mm and QQP it is lower (see Appendix Q).

SEAL-Shap on Similar Targets: Intuitively, if two target corpora are similar, the corresponding Shapley values of the source corpora when transferring to these two targets should be similar as well. To verify this, in Fig 6, we plot the Shapley values of twenty-nine source languages for the targets Russian and Serbian on cross-lingual POS tagging. We also plot the source values when transferring an NLI model to English and French in Fig 7. We observe that the corresponding curves are almost identical, and SEAL-Shap in fact selects the same set of source corpora as potential. These results suggest that if there is not sufficient data in the target corpus, it is also possible to use a neighboring corpus as a proxy to compute SEAL-Shap values.

Source Values Influenced by Data Processing: Typically, the sources with the lowest or negative values are from domains/languages that are different from the targets (e.g., Fig 4). However, in some cases, source usefulness (i.e., value) is affected by the data preparation process. For example, in XNLI, the source corpora are prepared by machine translation from "en" (Conneau et al., 2018), and the quality of the translation into "zh" is generally better than for other languages. Consequently, in Fig 7, "zh" has a higher source value for both targets "en" and "fr".

Analysis and Ablation Study
Finally, we analyze the proposed Algorithm 1 for computing the Shapley value approximately.

How good is the approximation? In Fig 5, we compare SEAL-Shap with TMC-Shap (Ghorbani and Zou, 2019) on three datasets (details in Appendix F). Overall, the Shapley values obtained by SEAL-Shap and TMC-Shap are highly correlated and their relative orders match, while SEAL-Shap is much more efficient. Note that, since the rankings themselves are the same or similar, the model performances using the same or similar top-k sources are also the same or similar; therefore, we do not further list their transfer performances.

Ablation Study: We examine the effectiveness of each proposed component in SEAL-Shap. Results are shown in Table 6 and details are in Appendix F-H. The results show that without the proposed approximations, TMC-Shap is computationally costly and impractical for analyzing the values of source corpora in the NLP transfer setting. All the proposed components contribute to significantly speeding up the computations.

Is the approximation sensitive to the order of permutations? As SEAL-Shap is a Monte Carlo approximation, we study whether SEAL-Shap is sensitive to the random seed using the cross-lingual POS tagging task. To analyze this, we first compute reference Shapley values by running SEAL-Shap until empirical convergence (blue line). Then, we report the Shapley values produced by another random seed. Fig 8 shows that with enough epochs, the values computed by different random seeds are highly correlated (more in Appendix H).

Related Work
As discussed in Section 1, transfer learning has been extensively studied in NLP to improve model performance in low-resource domains and languages. In the literature, various approaches have been proposed for various tasks, including text classification.

Conclusion
We propose SEAL-Shap to quantify the value of the source corpora in transfer learning for NLP by computing an approximate Shapley value for each corpus. We show that SEAL-Shap can be used to select source corpora for transfer and provide insight on understanding the value of source corpora. In the future, we plan to further improve the runtime of our source valuation approach by limiting the repetition of model training.

Acknowledgments
We thank the anonymous reviewers for their insightful feedback. We also thank UCLA-NLP for discussion and feedback. This work was supported in part by NSF 1927554 and the DARPA MCS program under Cooperative Agreement N66001-19-2-4032. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

Table 9: Data statistics of the SANCL 2012 shared task dataset (Petrov and McDonald, 2012)

C Modified GLUE NLI Task for Domain Transfer
For NLI, we consider the 2-class classification (i.e., entailed or not) corpora used in Ma et al. (2019), constructed by modifying four GLUE benchmark (Wang et al., 2018) problems: SNLI, MNLI, QNLI, and QQP. We split the MNLI training set into corpora of "fiction", "slate", "govt.", "travel", and "telephone" as in Williams et al. (2018) and always include them in the source corpora for all target domains. As the annotations for the GLUE test sets are not publicly available, for each target domain, we consider the original dev set as a pseudo test set and randomly select 2k instances from the training set for parameter tuning (i.e., a pseudo dev set). For MNLI as target, we have two original dev sets; hence, we take 2k instances from the matched dev set as the pseudo dev set and consider the mismatched corpus as the pseudo test set. Therefore, in the zero-shot setting, the number of source corpora for target MNLI is 8, and for the others it is 7.

All model performances are the same when selecting all source corpora as potential.

F How good is the approximation?
We consider three different datasets: XNLI (target: 'es'), the modified GLUE NLI dataset (target: 'MNLI-mm'), and the SANCL 2012 shared task for POS tagging (target: 'WSJ'). We use the corresponding full-size source tasks, except for the extremely large XNLI, for which we randomly sample half of each source task (180k instances), and compute the Shapley values by adopting the source code released by Ghorbani and Zou (2019). Then, we use sample sizes of 50k on the XNLI dataset, 20k on GLUE, and 2k on the SANCL 2012 shared task for each source task. We then use Algorithm 1 (in the main paper) with a tuned initial score to compute the approximate data Shapley values. Instead of running to full convergence, we stop early by setting the Shapley value nepoch to 10, 50, and 30 on the XNLI, GLUE, and SANCL datasets, respectively.

H Approximate Shapley Value with Different Seeds:
In Figure 9, we plot the SEAL-Shap values w.r.t. the same threshold (θ_k = 0) for different seeds.

Number of model parameters: BERT model, 10 million parameters. For each task, no preprocessing is performed other than the tokenization of words into subwords with WordPiece, except for cross-lingual POS, for which we use an open-sourced multilingual preprocessing toolkit10 to remove "strange control character" tokens. Following Wu and Dredze (2019), we also limit the subword sequence length to 128 to fit in a single GPU for all tasks. For all tasks, we use the accuracy metric.

K Hyper-parameters Tuning: For the small multi-domain sentiment analysis dataset, we do a full search over the combinations of learning rate, batch size, and number of epochs up to 5. For all other large-scale datasets, we perform a greedy search: we first find the best combination of learning rate and batch size, then tune the number of epochs. For the extremely large XNLI, in which for any target task the multi-source training data size is ∼5.5M, we tune only when our framework selects a smaller subset of the source corpora, with learning rate in {3×10−5, 5×10−5}, batch size {32}, and epochs within 50k steps (i.e., no more than 3 epochs). On XNLI, when our framework selects all source corpora as potential, we do not further tune the hyper-parameters, as both baselines have the same training set as SEAL-Shap; hence, we report the result using a default learning rate of 5×10−5 and batch size 32. All test results reported in this paper are obtained on the corresponding test set11 using a single GPU. nepoch is within 10 or 20. For any target task V_k, the corresponding threshold Shapley value θ_k is chosen in {1×10−2, 1×10−3, 5×10−3}, and the initial score ρ in {R, N, 0.5, All Sources/2, All Sources, µ}, where R is a random baseline model performance (i.e., the performance of a randomly initialized model), N = n−1 given the total number of sources n, and D is the set of all source tasks. This means we also tune the SEAL-Shap value as the mean of a combination of SEAL-Shap values, leave-one-out values, and single-source transfer values. For the Shapley value computation, for multiple-seed runs such as cross-lingual POS tagging and XNLI, seeds 42 and 43 are used. All the hyper-parameter tuning is done with the default seed in the open-sourced Transformers implementation13, which is 42. All the SEAL-Shap values are calculated using a single seed. Only for plotting Figure 6 in the main paper, on cross-lingual POS tagging for target English, two different seeds are used.

The blue curve in Figure 6 and the results in Table 2 use the same seed; all other plots use the other seed. All the parameter configurations and the dev set performance will be reported upon acceptance. All computations are performed on GPUs, in general using 4, 8, or 1 GPUs. Note that while tuning, if there is no θ for which the corresponding subset of sources (i.e., ⊂ D) achieves a better result than using all of D, then we select the set of all sources D, assuming each source contributes positively.

11 For GLUE NLI, the pseudo test set.
12 github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md
13 github.com/huggingface/transformers/

L Training a Direct Source Selection Ranker using SEAL-Shap

M Dev set Results
Table 14: Performance on universal POS tagging (test set) when using each language as the target language and the rest as source languages. '*' and '$' denote that the SEAL-Shap model statistically significantly outperforms All Sources and Baseline-s, respectively, using a paired bootstrap test with p ≤ 0.05. 'en' refers to the best single-source ('en') results reported in Wu and Dredze (2019).
All model performances are the same when selecting all source corpora as potential (see line 2 in Table 13 and Table 14).

O Full Cross-domain Sentiment Analysis Results
Model books kitchen dvd electronics apparel camera baby health magazines MR software video toys sports Avg
Cai and Wan (2019)

P SEAL-Shap values for two similar targets

Figure 14: Similar SEAL-Shap value curves for the two close languages English ("en") and French ("fr") on cross-lingual NLI.