Domain Divergences: a Survey and Empirical Analysis

Domain divergence plays a significant role in estimating the performance of a model when applied to new domains. While there is a significant literature on divergence measures, choosing an appropriate one remains difficult for researchers. We address this shortcoming by both surveying the literature and through an empirical study. We contribute a taxonomy of divergence measures consisting of three groups -- Information-theoretic, Geometric, and Higher-order measures -- and identify the relationships between them. We then ground the use of divergence measures in three application groups -- 1) Data Selection, 2) Learning Representations, and 3) Decisions in the Wild. From this, we identify that Information-theoretic measures are prevalent for 1) and 3), while Higher-order measures are common for 2). To further help researchers, we validate these uses empirically through a correlation analysis of performance drops, contrasting current contextual word representations (CWR) with older word distribution based representations. We find that traditional measures over word distributions still serve as strong baselines, while higher-order measures with CWR are effective.


Introduction
Machine learning models perform poorly when tested on data that comes from a different target domain. Target domain performance largely depends on the divergence between the domains (Ben-David et al., 2010). Measuring domain divergence efficiently is therefore important for adapting models to a new domain, the topic of domain adaptation. Divergence measures also apply to predicting a model's performance drop in real-world settings (Van Asch and Daelemans, 2010) and to choosing among alternate models (Xia et al., 2020).
Research has invested much effort in defining and measuring domain divergence. Linguists use register variation to capture varieties in text: the difference in the distribution of prevalent features between two registers (Biber and Conrad, 2009). Ben-David et al. (2010) introduce a probabilistic measure, H-divergence, to measure the difference between feature distributions in the source and target domains. Information-theoretic measures like Kullback-Leibler (KL) and Jensen-Shannon (JS) divergence, computed over surface features of text, are also used for different applications (Plank and van Noord, 2011; Van Asch and Daelemans, 2010). More recently, an emerging class of measures like Maximum Mean Discrepancy (MMD) and Central Moment Discrepancy (CMD) (Gretton et al., 2007; Zellinger et al., 2017) consider higher-order moments of random variables. These measures are utilised for different applications in NLP, albeit in silos.
Given the plethora of divergence measures in the NLP literature, researchers are often unclear on which measure is suitable for a given NLP task. To aid them, we first comprehensively review the NLP literature on domain divergences. Unlike prior surveys in NLP, which focus on domain adaptation for particular tasks like machine translation (Chu and Wang, 2018) or on statistical (non-neural network) models (Jiang, 2007; Margolis, 2011), our work takes a different tack: we study domain adaptation through the vehicle of domain divergence measures. We group divergence measures into a taxonomy of three classes -- Information-theoretic, Geometric, and Higher-order measures -- and identify relationships between measures within and across classes. To identify the class of measures common in different NLP applications, we recognise three application groups for divergences -- Data Selection, Learning Representations, and Decisions in the Wild -- and organise their literature. We find that Information-theoretic measures over word distributions are common for Data Selection and Decisions in the Wild, while Higher-order measures over continuous features are common for Learning Representations.
As divergence between domains is a major determiner of target domain performance, a good domain divergence measure should ideally predict the corresponding performance drop of a model when applied to a new domain. We further help researchers identify appropriate measures by performing a correlation analysis over 130 domain adaptation scenarios and three standard and varied NLP tasks: Part of Speech (POS) tagging, Named Entity Recognition (NER), and Sentiment Analysis (SA). While information-theoretic measures over word distributions are popular in the literature, are higher-order measures calculated over contextual word representations better indicators of performance drop? We find that while higher-order measures are better, traditional measures are still reliable indicators of performance drop.
We limit our survey to works that focus on domain divergence measures and that consider unsupervised domain adaptation (UDA), i.e., where no annotated data is available in the target domain -- a setting that is more practical yet more challenging. For a complete treatment of neural networks and UDA in NLP, we refer the reader to Ramponi and Plank (2020). We also do not treat multilingual work: although cross-lingual transfer may be considered an extreme form of domain adaptation, measuring the distance between languages requires different divergence measures, outside our purview.

A Taxonomy of Domain Divergence Measures
We devise a taxonomy for domain divergence measures, shown in Figure 1. Our taxonomy contains three main classes of measures. Each individual measure belongs to a single class, while relationships can exist between measures from different classes. We provide detailed descriptions of individual measures in Appendix A.
Geometric measures calculate the distance between two vectors in a metric space. As domain divergence measures, they are used to calculate the distance between features of instances (tf.idf, continuous representations, etc.) from different domains. The p-norm is a generic form of the distance between two vectors, with Manhattan distance (p=1) and Euclidean distance (p=2) as common settings. Cosine (Cos) uses the cosine of the angle between two vectors to measure similarity, with 1 − Cos measuring the distance. Geometric measures are easy to calculate, but are ineffective for high-dimensional vectors.
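As a minimal sketch (not tied to any particular implementation in the surveyed works), the following Python computes these three distances between, say, domain centroid vectors; the 768-dimensional toy vectors stand in for pooled representations:

```python
import numpy as np

def geometric_distances(u: np.ndarray, v: np.ndarray) -> dict:
    """Distances between two feature vectors, e.g. mean tf.idf vectors
    or mean contextual representations of two domains."""
    manhattan = float(np.abs(u - v).sum())    # p-norm, p = 1
    euclidean = float(np.linalg.norm(u - v))  # p-norm, p = 2
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return {"manhattan": manhattan, "euclidean": euclidean, "cosine_dist": 1.0 - cos}

rng = np.random.default_rng(0)
src_centroid, tgt_centroid = rng.random(768), rng.random(768)  # toy domain centroids
print(geometric_distances(src_centroid, tgt_centroid))
```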
Information-theoretic measures capture the distance between probability distributions. For example, cross entropy over n-gram word distributions is extensively used to rank sentences in a domain for further selection. f-divergences (Csiszár, 1972) are a general family of divergences where f is a convex function; different choices of f lead to KL and JS divergence. Chen and Cardie (2018) show that reducing an f-divergence measure is equivalent to reducing the PAD measure (see next section). Another special case of f-divergence is the family of α-divergences, themselves generalisations of KL-Div; Renyi Divergence is a member of the α-divergences and tends towards KL-Div as α → 1 (Edge A). Often applied to optimal transport problems, Wasserstein distance measures the amount of work needed to convert one probability distribution to the other, and finds applications in machine learning generally and in domain adaptation specifically. These information-theoretic measures are linked to each other; KL-Div is also related to Cross Entropy (CE). In this paper, CE refers to cross entropy and related entropy-based measures.
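A sketch of the three discrete divergences over word distributions (smoothed so they stay finite when the vocabularies do not fully overlap):

```python
import numpy as np

def _normalise(p, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps  # smoothing avoids log(0) / division by 0
    return p / p.sum()

def kl_div(p, q):
    """KL(P || Q); Q is the reference distribution."""
    p, q = _normalise(p), _normalise(q)
    return float(np.sum(p * np.log(p / q)))

def js_div(p, q):
    """Jensen-Shannon divergence: a symmetrised KL via the mixture M."""
    p, q = _normalise(p), _normalise(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

def renyi_div(p, q, alpha=0.99):
    """Renyi divergence; tends towards KL(P || Q) as alpha -> 1."""
    p, q = _normalise(p), _normalise(q)
    return float(np.log(np.sum(p ** alpha / q ** (alpha - 1))) / (alpha - 1))
```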
Higher-order measures match higher-order moments of random variables. Their properties are amenable to end-to-end learning-based domain adaptation, and they have been adopted extensively in recent work. Maximum Mean Discrepancy (MMD) is one such measure, which matches first-order moments of variables in a Reproducing Kernel Hilbert Space. CORAL (Sun et al., 2017) considers second-order moments, and CMD (Zellinger et al., 2017) considers higher-order moments. CORAL and CMD are desirable because they avoid computationally expensive kernel matrix computations. KL-Div can also be considered as matching the first-order moment (Zellinger et al., 2017) (Edge B).
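To make the contrast concrete, here is a minimal CMD-style sketch in the spirit of Zellinger et al. (2017): no kernel matrix is needed, only per-dimension moment matching (the scaling factors of the original definition are simplified away, as noted in the comments):

```python
import torch

def cmd(x: torch.Tensor, y: torch.Tensor, n_moments: int = 5) -> torch.Tensor:
    """Central Moment Discrepancy between two batches of representations
    (rows = samples). Matches the means, then the next higher central
    moments. The 1/|b-a|^k scaling of the original definition is omitted,
    which amounts to assuming features scaled to a unit interval."""
    mx, my = x.mean(dim=0), y.mean(dim=0)
    dist = torch.norm(mx - my)           # first-order (mean) term
    cx, cy = x - mx, y - my
    for k in range(2, n_moments + 1):    # higher-order central moments
        dist = dist + torch.norm((cx ** k).mean(dim=0) - (cy ** k).mean(dim=0))
    return dist
```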
A few other measures do not have ample support in the literature. These include information-theoretic measures such as the Bhattacharya coefficient, and higher-order measures like PAD* (Elsahar and Gallé, 2019), Word Vector Variance (WVV), and Term Vocabulary Overlap (TVO) (Dai et al., 2019).

Figure 1: Taxonomy for divergence measures. i) Geometric measures capture the distance between vectors in a metric space, ii) Information-theoretic measures capture the distance between probability distributions, and iii) Higher-order measures capture the distance between distributions considering higher moments, or the distance between representations or their projections in a nonlinear space.

Applications of Domain Divergences
Our key observation of the literature is that there are three primary families of applications for divergence measures in NLP (cf. Table 2): (i) Data Selection: selecting a subset of text from a source domain that shares similar characteristics with the target domain; the selected subset is then used to learn a target domain model. (ii) Learning Representations: aligning source and target domain feature distributions to ensure domain invariance. (iii) Decisions in the Wild: helping practitioners predict performance, or drops in performance, on new data, which can subsequently drive decisions about annotating more data, choosing alternate models, etc. Our taxonomy synthesises the diversity and prevalence of divergence measures in NLP.

Data Selection
Divergence measures are used to select a subset of text from the source domain that shares similar characteristics to the target domain. The selected data serves as supervised data for training models in the target domain. They are also used for learning self-supervised language modeling representations.
Simple word-level and surface-level text features, like word frequency distributions and tf.idf-weighted distributions, have sufficient power to distinguish between text varieties and help in data selection. Geometric measures like cosine, used with word frequency distributions, are effective for selecting data in parsing and POS tagging (Plank and van Noord, 2011). Remus (2012) shows that JS-Div, an information-theoretic measure, is effective for sentiment analysis. While these features are useful to select supervised data for an end task, they can also be used to select data to pre-train language models subsequently used for NER: Dai et al. (2019) use Term Vocabulary Overlap to select data for pretraining language models. Geometric and Information-theoretic measures with word-level distributions are inexpensive to calculate; however, estimating the distributions reliably needs large-scale data.
Continuous or distributed representations of words alleviate the shortcomings of frequency-based probability distributions of words and are useful for data selection. Representations like Continuous Bag of Words (CBOW) and Skip-gram vectors (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) project words with similar contexts closer together in space. However, they are static and do not change according to the context in which a word is used. Contextual word representations produce different embeddings depending on the context. Such representations, mostly from neural networks (Devlin et al., 2019; Peters et al., 2018), help capture contextual similarities between words in two different domains. A Geometric measure, Word Vector Variance, along with continuous word representations is useful for selecting data similar in tenor to target data for pretraining neural networks (Dai et al., 2019). Further, the p-norm over representations from pretrained neural machine translation models has been found effective for machine translation (Wang et al., 2017). Recently, Aharoni and Goldberg (2020) showed that contextual representations from the top layers of BERT cluster according to domain and can be used to perform data selection for neural machine translation.
Language models determine the probability of a sentence. If a language model trained on the target domain assigns high probability to a sentence from the source domain, then the sentence should have similar characteristics to the target domain. Cross Entropy and its variants capture this notion of similarity between two domains, and have been extensively used for data selection in statistical machine translation (Yasuda et al., 2008; Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Liu et al., 2014). However, cross-entropy-based methods for data selection do not work effectively for neural machine translation (NMT) (van der Wees et al., 2017; Silva et al., 2018). Instead, van der Wees et al. (2017) propose dynamic subset selection, where a new subset is chosen every epoch during NMT training. Similar to language models, probabilistic scores from supervised classifiers that distinguish between samples from two domains can help in data selection: the probability such a discriminator assigns to construing source domain text as target domain text serves as the divergence measure for data selection in machine translation (Chen and Huang, 2016). However, this requires a considerable amount of target domain data, which is not always available. Alternatively, instead of training a domain discriminator and then using it for data selection, Chen et al. (2017) train a discriminator and a selector in an alternating optimisation manner.
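A sketch of the classic cross-entropy difference criterion; the `cross_entropy` method on the two language model objects is a hypothetical interface standing in for any language-model toolkit, not a specific API:

```python
def moore_lewis_rank(sentences, in_domain_lm, general_lm):
    """Cross-entropy difference selection in the spirit of Moore and Lewis
    (2010): sentences that the in-domain (target-like) LM finds easy but
    the general-domain LM finds hard rank first. Both LM objects are
    assumed to expose a per-token cross_entropy(sentence) method
    (hypothetical interface)."""
    scored = sorted(
        sentences,
        key=lambda s: in_domain_lm.cross_entropy(s) - general_lm.cross_entropy(s),
    )
    return scored  # callers keep the lowest-scoring prefix as pseudo in-domain data
```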
From the literature review we find that different measures are effective for different NLP tasks. Ruder and Plank (2017) argue that, owing to the differing characteristics of tasks, different methods can be useful; hence, instead of using measures individually, they show that learning a linear combination of different measures is useful for NER, parsing, and sentiment analysis. However, this is not always possible, especially in unsupervised domain adaptation where there is no supervised data in the target domain. From Table 2, we note that information-theoretic and geometric measures based on frequency-based distributions and continuous representations are common for text prediction and structured prediction tasks. The effectiveness of higher-order measures is still not ascertained for these tasks.
Further, we find that for SMT data selection, variants of Cross Entropy find extensive use. However, the conclusions of van der Wees et al. (2017) are more measured regarding the benefits of CE measures and the like for NMT. Contextual word representations with cosine similarity have seen some initial exploration for neural machine translation, while higher-order measures are yet to be explored for data selection in NMT. We note that the literature pays closer attention to data selection for machine translation than for other tasks, owing to its popularity and practical applications: given thousands of languages, obtaining parallel sentences between every combination of languages is impractical.

Learning Representations
Domain adaptation aims to learn a model that works across different domains. One way to achieve this is to learn representations that are domain-invariant while remaining discriminative enough to perform well on a task (Ganin et al., 2015; Ganin and Lempitsky, 2015). Here, we limit our review to works utilising divergence measures. We exclude feature-based UDA methods like Structural Correspondence Learning (SCL) (Blitzer et al., 2006), Autoencoder-SCL, and pivot-based language models (Ziser and Reichart, 2017; Ben-David et al., 2020).
The theory of domain divergence (Ben-David et al., 2010) shows that the target domain error is bounded by the source domain error and the domain divergence (H-divergence). PAD, an approximation of H-divergence, is large when a domain discriminator's error is small; here, the discriminator is a supervised model that distinguishes samples between the source and target domains. For domain invariance, learned representations should not allow the discriminator to distinguish source from target domain samples.
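Concretely, PAD is usually estimated from the held-out error of that discriminator; a minimal sketch of the standard estimate:

```python
def proxy_a_distance(discriminator_error: float) -> float:
    """Standard PAD estimate from a domain discriminator's held-out error.
    error = 0.5 (chance level) gives PAD = 0: indistinguishable domains;
    error = 0.0 gives the maximum PAD of 2: perfectly separable domains."""
    return 2.0 * (1.0 - 2.0 * discriminator_error)

print(proxy_a_distance(0.5), proxy_a_distance(0.05))  # 0.0, 1.8
```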
Motivated by H-divergence, Domain Adversarial Neural Networks (DANN) (Ganin et al., 2015) learn domain-invariant representations using a domain discriminator. The network employs a min-max game between the representation learner and the domain discriminator, inspired by Generative Adversarial Networks (Goodfellow et al., 2014): the encoder is trained by reversing the gradients calculated for the discriminator. In later work, Bousmalis et al. (2016) argue that domain-specific peculiarities are lost in a DANN and propose Domain Separation Networks (DSN), where both domain-specific and domain-invariant representations are captured in a shared-private network. DSN is flexible in its choice of divergence measures and finds PAD to perform better than MMD.
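The gradient reversal trick at the heart of DANN is only a few lines in PyTorch; a sketch (the `encoder` and `domain_discriminator` modules in the usage comment are hypothetical):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass, multiplies the
    gradient by -lambda on the backward pass, so minimising the domain
    discriminator's loss simultaneously trains the encoder to fool it."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# Usage sketch inside a DANN-style model:
#   features = encoder(inputs)
#   domain_logits = domain_discriminator(grad_reverse(features))
```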
Obtaining domain-invariant representations is desirable for many NLP tasks, especially tasks like sequence labelling where annotating large amounts of data is hard. These methods are typically used when there is a single source domain and a single target domain. Ghosal et al. (2020) condition DANN on an external common-sense knowledge graph using graph convolutional neural networks for sentiment analysis. In contrast to the above works, Wang et al. (2018) use MMD outside the adversarial learning framework: they use MMD to reduce the discrepancy between neural network representations belonging to two different domains, a concept explored in computer vision by Tzeng et al. (2014). Multi-task learning helps improve generalisation by modeling two complementary tasks; a key to obtaining its benefits is learning a shared representation that captures the common features of the two tasks. However, such representations might still contain task-specific peculiarities. The shared-private model of DSN can help disentangle such representations and has been used for sentiment analysis (Liu et al., 2017) and for Chinese NER and word segmentation (Cao et al., 2018).

Complementary information can also be available in other languages. Although we do not deal with multi-lingual learning in this work, we note that DANN and DSN can be extended to learn language-agnostic representations useful for text classification and structured prediction tasks (Chen et al., 2018; Zou et al., 2018; Yasunaga et al., 2018).
Most works that adopt the DANN and DSN frameworks reduce either the PAD or the MMD distance between distributions. However, reducing these divergences, combined with other auxiliary task-specific loss functions, can result in training instabilities and vanishing gradients as the domain discriminator becomes increasingly accurate. Using other higher-order measures can result in better, more stable learning: CMD has been used for sentiment analysis (Zellinger et al., 2017; Peng et al., 2018), and Wasserstein distance has been used for duplicate question detection (Shah et al., 2018) and to learn domain-invariant attention distributions for emotion regression.
We can see from Table 2 that most works extend the popular DSN framework to learn domain-invariant representations in different scenarios across NLP tasks. The original work of Bousmalis et al. (2016) includes the MMD divergence besides PAD, which is not adopted in subsequent works, possibly due to its reported poor performance. Most works require careful balancing between multiple objective functions (Han and Eisenstein, 2019), which can affect the stability of training. Stability can be improved by selecting appropriate divergence measures like CMD (Zellinger et al., 2017) and Wasserstein distance (Arjovsky et al., 2017), and we envision more work using such measures owing to their advantages.

Decisions in the Wild
Models often perform poorly when deployed in the real world, with performance degrading due to the difference in distribution between training and test data. Such degradation can be alleviated by large-scale annotation in the new domain; however, annotation is expensive and, given thousands of domains, infeasible. Thus, predicting performance in a new domain where there is no labelled data is important. In recent times this problem has received theoretical consideration (Rosenfeld et al., 2020; Chuang et al., 2020; Steinhardt, 2016), and as many researchers and engineers deploy models in the real world, it is important from a practical perspective. Empirically, works in NLP measure the divergence between data from an unknown domain and a known dataset to predict drops in performance.
Simple measures based on word-level features have been used to predict the performance of a machine learning model in a new domain. Information-theoretic measures like Renyi-Div and KL-Div have been used to predict performance drops in POS tagging (Van Asch and Daelemans, 2010), and a Cross-Entropy-based measure has been used for dependency parsing (Ravi et al., 2008). Performance prediction is also useful for machine translation, where obtaining parallel data is hard: based on distance features between languages and dataset features, Xia et al. (2020) predict model performance on new languages for a variety of NLP tasks, including machine translation, POS tagging, and parsing. Such performance prediction models have also been built in the past for statistical machine translation (Birch et al., 2008; Specia et al., 2013).
However, Ponomareva and Thelwall (2012) argue that predicting the drop in performance is more appropriate than predicting absolute performance, since the drop isolates the effect of domain shift from the inherent difficulty of the task. They find that JS-Div is effective for predicting the drop in performance of sentiment analysis systems. Only recently has predicting model failures in practical deployments regained empirical attention: Elsahar and Gallé (2019) demonstrate the efficacy of higher-order measures for predicting the drop in performance for POS tagging and SA. However, an analysis of performance drops using contextual word representations is still lacking; we tackle this in the next section.

Empirical Analysis
How relevant are traditional measures over word distributions for measuring domain divergence? We examine this question given that contextual word representations such as BERT, ELMo, and DistilBERT (Devlin et al., 2019; Peters et al., 2018; Sanh et al., 2019) are widespread, and that higher-order measures are increasingly used to learn representations.
We perform an empirical study to assess their suitability for three important NLP tasks: POS tagging, NER, and SA. We leave natural language generation and MT for future work.
The performance difference between the source and the target domain depends on the divergence between the feature distributions of the domains (Ben-David et al., 2010). Like many other works (Ganin et al., 2016), we assume covariate shift: the marginal distribution over features changes, but the conditional label distribution does not. Although a difference in the conditional label distribution can increase the H-divergence measure (Wisniewski and Yvon, 2019), assessing it requires labels in the target domain. In this work we assume no labelled data in the target domain, as in a realistic setting.
For all our experiments, unless otherwise mentioned, we use the DistilBERT (Sanh et al., 2019) pre-trained transformer model. It has competitive performance to BERT, but faster inference and lower resource requirements. We leave experimentation with other BERT variants, such as RoBERTa (Liu et al., 2019), for future work. For every text segment, we obtain the activations from the final layer and average-pool the representations. For domain adaptation scenarios, we train the models on the source domain training split. We test the best model from the grid search on the test dataset of the same domain and also on the test datasets of the other domains (cf. Appendix C).
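A sketch of this pooling with HuggingFace Transformers; the checkpoint name is our assumption, since the paper specifies DistilBERT but not the exact variant:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# We assume the distilbert-base-uncased checkpoint (hypothetical choice).
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased").eval()

@torch.no_grad()
def pooled_representations(texts):
    """Average-pool the final-layer activations over non-padding tokens,
    yielding one vector per text segment."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state     # (batch, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(pooled_representations(["great movie", "the parser crashed"]).shape)  # (2, 768)
```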
Datasets: For POS tagging, we select 5 corpora from the English Web Treebank of the Universal Dependencies corpus (Nivre et al., 2016) and also include the GUM, LinES, and ParTUT datasets; following Elsahar and Gallé (2019), we consider these as 8 domains. For NER, we consider CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003).

Divergence Measures: We consider 12 divergence measures. For Cos, we follow the instance-based calculation (Ruder et al., 2017). For MMD, Wasserstein, and CORAL, we randomly sample 1000 sentences and average the results over 3 runs. For MMD, we experiment with different kernels (cf. Appendix A) and use default values of σ from the GeomLoss package (Feydy et al., 2019). For TVO, KL-Div, JS-Div, and Renyi-Div, which are based on word frequency distributions, we filter out stop words and consider the top 10k frequent words across domains to build our vocabulary (Ruder et al., 2017; Gururangan et al., 2020). We use α = 0.99 for Renyi-Div, as found effective by Plank and van Noord (2011). We do not choose CE as it is mainly used in MT and has not been found effective for text classification and structured prediction (Ruder et al., 2017).
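A sketch of the vocabulary construction described above (stop-word filtering and a shared top-10k vocabulary); the smoothing in the earlier divergence sketch handles words absent in one domain:

```python
from collections import Counter

def word_distributions(src_tokens, tgt_tokens, stopwords, top_k=10_000):
    """Aligned word-frequency distributions over a shared vocabulary of the
    top_k most frequent non-stopword types across both domains, ready to be
    fed to KL-Div / JS-Div / Renyi-Div as defined earlier."""
    src = Counter(t for t in src_tokens if t not in stopwords)
    tgt = Counter(t for t in tgt_tokens if t not in stopwords)
    vocab = [w for w, _ in (src + tgt).most_common(top_k)]

    def normalise(counts):
        total = sum(counts[w] for w in vocab) or 1
        return [counts[w] / total for w in vocab]

    return normalise(src), normalise(tgt)
```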
Model Architecture: For POS tagging and NER, we follow the original BERT setup: a linear layer is added on top of the base DistilBERT model, and a prediction is made for every token. If a word is split into multiple tokens by Byte Pair Encoding, the label is predicted from its first token. For sentiment analysis and the domain discriminators, we pool the representation from the last layer of DistilBERT and add a linear layer for prediction. Grid search ranges and hyper-parameters are given in Appendix B.

Performance Drop Correlation: Do traditional measures over distributions remain relevant?
For POS tagging, the PAD measure has the best correlation with performance drop (cf. Table 1). Information-theoretic measures over word frequency distributions, such as JS-Div, KL-Div, and TVO, which have been prevalent for data selection and performance-drop prediction, remain strong baselines. In Figure 2c, we can see a clearer notion of distinct clusters. The Silhouette scores, in tandem with the t-SNE plots, indicate that the datasets are, in fact, not distinct domains, and that data-driven methods for defining domains are needed (Aharoni and Goldberg, 2020).
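The diagnostics used here are standard scikit-learn calls; a toy sketch in which random vectors stand in for the pooled representations and `dataset_ids` for each sample's dataset of origin:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# Toy stand-ins: `reps` would be pooled DistilBERT representations and
# `dataset_ids` the dataset each sample came from.
rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 768))
dataset_ids = rng.integers(0, 3, size=200)

score = silhouette_score(reps, dataset_ids)  # near 0 / negative => overlapping "domains"
coords = TSNE(n_components=2, random_state=0).fit_transform(reps)  # 2-D view for plots
print(round(float(score), 3), coords.shape)
```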

Discussion
One of the premises for a drop in performance is that different datasets come from different domains. In Section 5.2, we showed that the notion of dataset-as-domain must be treated carefully. Although we observe a drop in performance across different NLP tasks, this raises the question of the relationship between the underlying domains and performance. We see better notions of clusters for NER and sentiment analysis (cf. Figures 2b and 2c), and we can expect the drop in performance to be indicative of these domain separations. Comparing the best correlations from Table 1, we see higher correlations for NER and sentiment analysis than for POS tagging. For POS tagging, there are no indicative domain clusters, and the effect of domain divergence on the drop in performance may be smaller; for SA, both the t-SNE plot and the Silhouette scores for JS-Div (cf. Figure 2c) corroborate comparatively better separation, and the correlation is higher as well. To evaluate whether domain adaptation techniques are truly bridging differences in data distributions, and that performance improvements are not model artifacts, one must be more careful in selecting datasets.
Overlapping datasets also have consequences for data selection strategies. For example, Moore and Lewis (2010) select pseudo in-domain data from the source corpora. The Silhouette coefficients being negative and close to 0 shows that on average, many data points in a dataset belong to nearby domains. Data selection strategies thus may be effective. If the Silhouette coefficients are more negative and if more points in the source aptly belong to the target domain, we should expect increased sampling from such source domains to yield additional performance benefits in the target domain.

Conclusion
We survey domain adaptation work, focusing on domain divergence measures and their usage for data selection, learning domain-invariant representations, and making decisions in the wild. We synthesised the divergence measures into a taxonomy of Information-theoretic, Geometric, and Higher-order measures. While traditional measures are common for data selection and making decisions in the wild, higher-order measures are prevalent in learning representations. Based on our correlation experiments, silhouette scores, and t-SNE plots, we make the following recommendations:
• PAD is a reliable indicator of performance drop. Use it when there are sufficient examples to train a domain discriminator.
• JS-Div is symmetric and a formal metric. It is related to PAD, easy to compute, and serves as a strong baseline.
• While Cosine is popular, it is an unreliable indicator of performance drop.
• Do not consider datasets as domains for domain adaptation experiments. Instead, cluster the representations and define appropriate domains.

A Domain Divergence Measures
This section provides the necessary background on the different kinds of divergence measures used in the literature. They can be information-theoretic, which measure the distance between two probability distributions; geometric, which measure the distance between two vectors in a space; or higher-order, which capture similarity in a projected space and consider higher-order moments of random variables.

A.1 Information-Theoretic Measures
Let P and Q be two probability distributions. The following information-theoretic measures capture differences between P and Q.

Kullback-Leibler Divergence (KL-Div):
$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Here Q is called the reference probability distribution. More precisely, KL-Div is defined only if, for all x, Q(x) = 0 implies P(x) = 0.

Renyi Divergence (Renyi-Div): Renyi Divergence is a generalisation of KL-Div and is also called the α-power divergence:
$$D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \sum_{x} \frac{P(x)^{\alpha}}{Q(x)^{\alpha - 1}}$$
It tends towards KL-Div as α → 1.

Proxy A-Distance (PAD): PAD approximates the H-divergence using the error $\epsilon$ of a domain discriminator $h$ trained to distinguish source from target samples:
$$PAD = 2(1 - 2\epsilon), \qquad \epsilon = \frac{1}{|D|} \sum_{i} \mathbb{1}\left[ h(x_i) \neq y_i \right]$$
where |D| is the size of the training data and $\mathbb{1}$ is an indicator function.

Wasserstein Distance: The Wasserstein distance (also called the Earth Mover's distance) is another metric between two probability distributions. Intuitively, it measures the least amount of work needed to transport probability mass from one distribution to the other to make them equal, where work is measured as the mass transported multiplied by the distance of travel. It is known to behave better than KL-Div and JS-Div when the random variables are high-dimensional. The Wasserstein metric is defined as
$$W(P, Q) = \inf_{\gamma \in \pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma} \left[ \| x - y \| \right]$$
where $\pi(P, Q)$ is the set of all joint distributions $\gamma$ whose marginals are P and Q.

Maximum Mean Discrepancy (MMD): MMD is a non-parametric method to estimate the distance between distributions based on Reproducing Kernel Hilbert Spaces (RKHS). Given samples $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$ drawn from distributions P and Q, the empirical estimate of the distance between P and Q is
$$MMD(X, Y) = \left\| \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \right\|_{\mathcal{H}}$$
Here $\phi : X \rightarrow \mathcal{H}$ is a nonlinear mapping of the samples to a feature representation in an RKHS. In this work, we map the contextual word representations of the text to the RKHS. The kernels used in this work include the Gaussian kernel $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$ and the Rational Quadratic kernel $k(x, y) = \left(1 + \frac{\|x - y\|^2}{2\alpha\sigma^2}\right)^{-\alpha}$. We use the default value of σ = 0.05 from the GeomLoss package (Feydy et al., 2019).
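A sketch of how MMD and Wasserstein can be estimated between sampled representations with the GeomLoss package mentioned above (our reading of its API; `blur` plays the role of σ for kernel losses):

```python
import torch
from geomloss import SamplesLoss

x = torch.randn(1000, 768)  # sampled source-domain representations (toy data)
y = torch.randn(1000, 768)  # sampled target-domain representations (toy data)

mmd = SamplesLoss("gaussian", blur=0.05)               # Gaussian-kernel MMD
wasserstein = SamplesLoss("sinkhorn", p=2, blur=0.05)  # entropy-regularised Wasserstein
print(mmd(x, y).item(), wasserstein(x, y).item())
```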
Correlation Alignment (CORAL): Correlation alignment is the distance between the second-order moments of the source and target samples. If d is the representation dimension, $\| \cdot \|_F$ the Frobenius norm, and $Cov_S$, $Cov_T$ the covariance matrices of the source and target samples, then CORAL is defined as
$$CORAL = \frac{1}{4d^2} \left\| Cov_S - Cov_T \right\|_F^2$$

Central Moment Discrepancy (CMD): CMD is another metric that measures the distance between source and target distributions. It considers not only the first and second moments but also higher-order moments. While MMD operates in a projected space, CMD operates in the representation space. If P and Q are two probability distributions and $X = (X_1, \dots, X_N)$ and $Y = (Y_1, \dots, Y_N)$ are independent and identically distributed random vectors from P and Q with every component bounded by $[a, b]$, CMD is defined as
$$CMD(P, Q) = \frac{1}{|b - a|} \left\| \mathbb{E}(X) - \mathbb{E}(Y) \right\|_2 + \sum_{k=2}^{\infty} \frac{1}{|b - a|^k} \left\| c_k(X) - c_k(Y) \right\|_2$$
where $\mathbb{E}(X)$ is the expectation of X and $c_k$ is the k-th order central moment:
$$c_k(X) = \mathbb{E}\left( \prod_{i=1}^{N} \left( X_i - \mathbb{E}(X_i) \right)^{r_i} \right), \quad r_1 + r_2 + \dots + r_N = k, \quad r_1, \dots, r_N \geq 0$$
In practice, the sum is truncated at a small number of moments.
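The CORAL definition above translates directly into a few lines of PyTorch; a sketch (torch.cov requires a recent PyTorch release):

```python
import torch

def coral(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """CORAL distance: squared Frobenius norm between the source and target
    feature covariance matrices, scaled by 1 / (4 d^2)."""
    d = source.size(1)
    cov_s = torch.cov(source.T)  # torch.cov expects (variables, observations)
    cov_t = torch.cov(target.T)
    return ((cov_s - cov_t) ** 2).sum() / (4.0 * d * d)

print(coral(torch.randn(1000, 768), torch.randn(1000, 768)).item())
```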

A.4 Other Measures
Bhattacharya Coefficient: If P and Q are probability distributions, then the Bhattacharya coefficient and Bhattacharya distance are defined as
$$Bhattacharya(P, Q) = \sum_{x} \sqrt{P(x) Q(x)}$$
$$D_{Bhattacharya} = -\log\left( Bhattacharya(P, Q) \right)$$

Term Vocabulary Overlap (TVO): This measures the proportion of the target vocabulary that is also present in the source vocabulary. If $V_S$ is the source domain vocabulary and $V_T$ the target domain vocabulary, then the Term Vocabulary Overlap between the source domain $D_S$ and the target domain $D_T$ is given by
$$TVO(D_S, D_T) = \frac{|V_S \cap V_T|}{|V_T|}$$

Word Vector Variance (WVV): The different contexts in which a word is used in two different datasets can indicate the divergence between the datasets. Let $w^{src}_i$ denote the embedding of word i in the source domain, $w^{trg}_i$ the embedding of the same word in the target domain, and d the dimension of the word embeddings; WVV aggregates the differences between the two embeddings over the shared vocabulary and the d dimensions.

B Training Details

For POS tagging and NER we monitor the macro F-score, and for domain discrimination we monitor accuracy, choosing the best model after the grid search for all subsequent calculations. For training the models we use the Adam optimiser (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.99$, and $\epsilon$ = 1e-8. We use HuggingFace Transformers (Wolf et al., 2019) for all our experiments.
C Cross-Domain Performances

C.1 Part-of-Speech Tagging

Table 3 shows the hyper-parameters of the best model for POS tagging and Table 4 shows the cross-domain performances.

C.2 Named Entity Recognition

Table 5 shows the hyper-parameters of the best model for NER and Table 6 shows the cross-domain performances.

C.3 Sentiment Analysis

Table 7 shows the hyper-parameters of the best model for sentiment analysis and Table 8 shows the cross-domain performances.