To Annotate or Not? Predicting Performance Drop under Domain Shift

Performance drop due to domain-shift is an endemic problem for NLP models in production. This problem creates an urge to continuously annotate evaluation datasets to measure the expected drop in model performance, which can be prohibitively expensive and slow. In this paper, we study the problem of predicting the performance drop of modern NLP models under domain-shift, in the absence of any target domain labels. We investigate three families of methods (H-divergence, reverse classification accuracy and confidence measures), show how they can be used to predict the performance drop and study their robustness to adversarial domain-shifts. Our results on sentiment classification and sequence labelling show that our method is able to predict performance drops with an error rate as low as 2.15% and 0.89% for sentiment analysis and POS tagging respectively.


Introduction
Building Natural Language Processing models that perform well in the wild is still an open and challenging problem. It is well known that modern machine-learning models can be brittle, meaning that, even when achieving impressive performance on the evaluation set, their performance can degrade significantly when exposed to new examples with differences in vocabulary and writing style (Jia and Liang, 2017; Brun and Nikoulina, 2018). This drop in performance when changing from domain $D_s$ to domain $D_t$ can be due to a variety of causes. It could be because of Co-variate Shift (Shimodaira, 2000; Storkey, 2009), where the input distribution changes but the conditional distribution does not, i.e. $P_{D_s}(y|x) = P_{D_t}(y|x)$ but $P_{D_s}(x) \neq P_{D_t}(x)$; or due to Concept Shift, when $P_{D_s}(y|x) \neq P_{D_t}(y|x)$ and $P_{D_s}(x) = P_{D_t}(x)$; or Label Shift, when $P_{D_s}(y) \neq P_{D_t}(y)$ and $P_{D_s}(x|y) = P_{D_t}(x|y)$ (Zhang et al., 2013; Lipton et al., 2018); or a mix between them (Moreno-Torres et al., 2012; Quiñonero-Candela et al., 2009). Such changes are often the norm in many real-world applications, creating an urge in the industry to continuously check that models in production are performing well in the wild or when there is a new client or a new type of data. However, continuously sampling and annotating data points can be prohibitively slow and costly, particularly when a large annotated sample is needed to correctly represent the target joint distribution or when the change can be gradual (Kifer et al., 2004). In this work, we investigate how we can estimate the performance drop of a model when evaluated on a new target domain, without the need for any labeled examples from this target domain. Performing this estimation accurately has an important impact on the decision process of debugging and maintaining machine learning models in production in real time. For instance, such insights can drive the decision to annotate more data for retraining or even to adjust the model accordingly (e.g. by performing unsupervised domain adaptation).

Figure 1: In this paper we introduce several domain-shift detection metrics (x-axis: value of the domain-shift detection metric on $D_t$) and employ them to estimate the performance drop on a new target domain $D_t$ (y-axis: expected accuracy drop on $D_t$) by regressing on those metrics and their associated real performance drop (green dots).
We propose a method that takes advantage of several domain-shift detection metrics and employs them to estimate, through regression, the performance drop on a target domain for which no annotated data is available. The overall approach is schematically depicted in Fig. 1. The relationship between the domain-shift metrics and the real performance drop is precomputed over different existing domains (green dots in Fig. 1), and for a new target domain the performance drop can be estimated simply by evaluating the function learned through simple regression. This process directly yields the value of the performance drop the model will suffer when exposed to examples from this target domain. We believe this is more interpretable and useful to ML practitioners than other intermediate signals such as those of out-of-distribution detection methods.
This paper introduces the following contributions:
• We introduce a new task and methodology for directly predicting the performance drop of a model under domain-shift, without the need for labeled examples from the target domain.
• We survey, formalize and evaluate domain-shift detection metrics from 3 different families (§2) and propose new adaptations.
• We benchmark each proposed metric on two tasks of different natures, document classification and sequence labeling (§3 and §4), and show their robustness under adversarial domain-shift scenarios.

Measuring Performance Drop
We are interested in the problem of measuring the performance drop of a classifier $C$, trained on samples from the source domain $D_s$, when applied to samples from the target domain $D_t$. In the presence of labeled samples from $D_t$, this could be measured empirically by the difference in test errors between the source and target domains. Without such labels, we rely on domain-shift detection metrics from three families:
• H-divergence based metrics, which rely on the capacity of another classification model to distinguish between samples from $D_s$ and $D_t$.
• Confidence-based, using the certainty of the model over its predictions.
• Reverse classification accuracy, where predicted values are used as pseudo-labels over D t .

H-divergence based metrics
Building upon previous work from Kifer et al. (2004) where I is an indicator function. The PAD measure is task-agnostic and therefore measures only the co-variate shift. It has been used before to measure domain discrepancy between datasets (Rai et al., 2010) for NLP applications. However, unlike at the time when PAD was introduced, modern NLP models not only compute a mapping between input and labels, but also infer an intermediate representation. For a given task, this representation is supposed to provide a view of the input that highlights the parts that are helpful for a correct classification in this task. In particular, those intermediate representations should not be sensitive to task-irrelevant features that nevertheless provide strong signals to distinguish between the source and the target domains (yielding high PAD values). As an example, consider the task of named entity extraction on top of the 20 newsgroup dataset: the To: field is a highly discriminating feature for domain classification but arguably irrelevant to the task of extracting named entities.
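As a minimal sketch, a PAD-style score can be computed from the held-out predictions of a domain classifier. The sketch assumes the form PAD = 1 − 2ε, where ε is the domain classification error; this is consistent with the maximal value of 1.00 reported later for perfectly separable domains, but the exact normalization is an assumption. The function name and the label convention (1 for source, 0 for target) are illustrative.

```python
import numpy as np

def proxy_a_distance(domain_pred, domain_true):
    """PAD-style score from held-out domain predictions.

    Assumption: PAD = 1 - 2 * error, so 1.0 means the two domains are
    perfectly separable and 0.0 means chance-level discrimination.
    Labels use 1 for source examples and 0 for target examples.
    """
    domain_pred = np.asarray(domain_pred)
    domain_true = np.asarray(domain_true)
    # Indicator of a misclassified domain, averaged over all examples.
    error = np.mean(domain_pred != domain_true)
    return 1.0 - 2.0 * error

# Toy usage: 8 held-out examples, 4 from each domain.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
perfect = proxy_a_distance(truth, truth)
chance = proxy_a_distance([1, 0, 1, 0, 1, 0, 1, 0], truth)
```

A classifier that separates the domains perfectly gives the highest score, while one that is right only half of the time gives a score of zero.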
Following this intuition, we propose a modification to the PAD measure. It is based on the classification accuracy of discriminating between the intermediate representations coming from the source and the target domain, respectively (we use the last layer of the neural network). We assume that the task classifier $C$ consists of two functions $G_f$ and $G_y$. The first projects the input to a hidden representation of size $m$, $G_f : X \to \mathbb{R}^m$, while the second is a linear layer that uses this representation to predict the class labels, $G_y : \mathbb{R}^m \to [0, 1]^{|Y|}$. Differently from PAD, the domain classifier $G^*_d : \mathbb{R}^m \to [0, 1]$ takes the hidden representations as input instead of the original input. The learnable parameters of $G_f$, $G_y$ and $G^*_d$ are $\theta_f$, $\theta_y$ and $\theta^*_d$ respectively. Our proposed metric PAD* is then computed in the same way as PAD, but from the predictions of $G^*_d$ on the hidden representations $G_f(x)$. $\theta_f$ and $\theta_y$ are learned by minimizing the loss function of the task. Afterwards $\theta_f$ is frozen and $\theta^*_d$ is learned by minimizing the negative log-likelihood loss for the domain discrimination task between $D_s$ and $D_t$.
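The two-stage procedure (freeze the task representation, then train a domain classifier on top of it) can be sketched as follows. This is an illustration under stated assumptions and not the implementation used in the experiments: the hidden representations are random stand-ins for the output of a frozen $G_f$, the domain classifier is a plain logistic regression trained by gradient descent, and the score uses the assumed form 1 − 2ε. All function names are hypothetical.

```python
import numpy as np

def train_domain_classifier(h, d, lr=0.1, epochs=500):
    """Logistic regression on hidden representations h (n x m), fitted by
    gradient descent on the negative log-likelihood of domain labels d."""
    w, b = np.zeros(h.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(h @ w + b)))  # P(source | representation)
        residual = p - d                         # NLL gradient w.r.t. the logit
        w -= lr * (h.T @ residual) / len(d)
        b -= lr * residual.mean()
    return w, b

def pad_star(h_train, d_train, h_test, d_test):
    """Assumed form 1 - 2 * error, measured on held-out representations."""
    w, b = train_domain_classifier(h_train, d_train)
    pred = (h_test @ w + b > 0).astype(int)
    return 1.0 - 2.0 * np.mean(pred != d_test)

def toy_split(rng, source_mean, target_mean, n=200, m=4):
    """Random stand-ins for G_f outputs; returns train and test halves."""
    src = rng.normal(source_mean, 1.0, size=(n, m))
    tgt = rng.normal(target_mean, 1.0, size=(n, m))
    half = n // 2
    h_train = np.vstack([src[:half], tgt[:half]])
    h_test = np.vstack([src[half:], tgt[half:]])
    d = np.concatenate([np.ones(half), np.zeros(half)])  # 1 = source, 0 = target
    return h_train, d, h_test, d

rng = np.random.default_rng(0)
shifted = pad_star(*toy_split(rng, 2.0, -2.0))   # representations differ by domain
unshifted = pad_star(*toy_split(rng, 0.0, 0.0))  # same distribution in both domains
```

When the representations carry a domain signal the score approaches 1, and when both domains share one distribution it stays close to 0. The intended effect of PAD* is that a good task representation removes task-irrelevant domain signals before the domain classifier sees them.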

Confidence Based Metrics
While the final decision of a classifier is discrete, the weight given to that decision can be interpreted as the confidence the model has in that decision. This has been the basis for overcoming domain-shift in many self-training techniques, which select the most confident examples as new training examples, together with the predicted class as pseudo-label (McClosky et al., 2006). However, modern neural networks are known to give poorly calibrated confidence scores (Guo et al., 2017), meaning that the probability score associated with the predicted class label does not reflect the likelihood of it being correct. A few calibration techniques have been proposed to overcome this problem. For its simplicity and effectiveness we follow the temperature scaling method (Guo et al., 2017). It is a post-training method that rescales the logits of any neural network model to soften the softmax by raising the entropy of the output probability scores. Given a model trained on the source domain dataset $D_s$, let $z$ be the logits vector produced by the very last layer for a given input, yielding the non-calibrated confidence score $q = \max_k \mathrm{softmax}(z)_k$ for the predicted class label.
The calibrated confidence score $\hat{q}$ is then calculated as follows:
$$\hat{q} = \max_k \, \mathrm{softmax}(z / T)_k$$
where $T$ is a learnable scalar "temperature" parameter. $T$ can be learned by minimizing the negative log-likelihood loss over the validation set $D^{val}_s$:
$$\mathcal{L} = -\sum_{i} \sum_{k} 1_{[k = y_i]} \log \mathrm{softmax}(z_i / T)_k$$
where $1_{[k = y_i]}$ is a one-hot vector containing one in front of the true class label $y_i$ and zero elsewhere. Accordingly, we introduce two confidence-based metrics to measure the domain-shift between the source and the target datasets $D_s$ and $D_t$.
1) CONF: the drop in average probability scores of the predicted class between $D_s$ and $D_t$.
2) CONF CALIB: the drop in average calibrated probability scores of the predicted class between $D_s$ and $D_t$.
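The two confidence-based metrics can be sketched in a few lines. Several choices here are assumptions: the drop is taken as the source average minus the target average, the temperature is found by a grid search over the negative log-likelihood rather than by gradient-based learning, and the logits are toy values. Function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def confidence(logits, T=1.0):
    """Probability of the predicted class after dividing the logits by T."""
    return softmax(np.asarray(logits, dtype=float) / T).max(axis=1)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    probs = softmax(np.asarray(logits, dtype=float) / T)
    labels = np.asarray(labels)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 46)):
    # Assumption: a grid search stands in for learning T by gradient descent.
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

def conf_drop(source_logits, target_logits, T=1.0):
    """CONF when T == 1; CONF CALIB when T is the fitted temperature."""
    return confidence(source_logits, T).mean() - confidence(target_logits, T).mean()

# Toy validation set: the model is very confident but only 75% accurate.
val_logits = np.array([[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
val_labels = np.array([0, 0, 1, 1])
T = fit_temperature(val_logits, val_labels)

source_logits = np.array([[4.0, 0.0], [0.0, 4.0]])
target_logits = np.array([[1.0, 0.0], [0.0, 1.0]])
conf = conf_drop(source_logits, target_logits)           # uncalibrated
conf_calib = conf_drop(source_logits, target_logits, T)  # calibrated
```

In this toy case the fitted temperature is larger than 1, which softens the overconfident scores before the source and target averages are compared.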

Reverse Classification Accuracy
The idea of reverse classification accuracy is to use a classifier trained on the source domain to pseudo-label the target domain. That new dataset is then used to train a new classifier whose accuracy is measured on held-out data from the source domain. This was used in the past to select, among existing models or datasets, the best-performing one for a given target domain (Fan and Davidson, 2006; Zhong et al., 2010). In order to use this to create a proxy for domain-shift, we proceed as follows: a task classifier $C$ is trained on the annotated source domain dataset $D_s$ and is then run to create pseudo-labels for the unlabeled target data $D_t$. Those pseudo-labels are then used as training data for a reverse classifier $\hat{C}$, using the same architecture and training algorithm used to obtain $C$. The performance of both classifiers is compared on a held-out subset $D'_s \subset D_s$ of the source domain dataset, and this comparison defines the RCA measure.
The RCA measure could be low for two reasons. It could be due to the domain-shift we try to capture, as a very different distribution would have a major impact on the training data generated on top of $D_t$. Or it could be due to the accumulation of errors created by the "back-and-forth" training. If $D_t$ follows the same distribution as $D_s$, then the measure would only capture the impact of that accumulation of errors. To remove this source of errors, we also propose another measure, RCA*, which is the performance difference between $\hat{C}$ and a classifier $C'$ trained in the same way but using held-out data from the source domain as target domain. $C$ is used again to pseudo-label a dataset, which is this time taken from the same distribution as $D_s$, and this new dataset is then used as training data for $C'$. RCA* is then calculated as the performance difference between $\hat{C}$ and $C'$ on the held-out source data. A schematic view of these two measures is depicted in Fig. 2.
Figure 2: The classifier $C$ is trained on source domain data (blue) with true labels (green) and is applied on target domain data (red) to create pseudo-labels (grey). In the case of RCA*, $C$ is also applied on held-out source domain data (blue). Pseudo-labels (grey) are then used to train new classifiers ($\hat{C}$ and, for RCA*, also $C'$). They are later applied on a test set of the source domain to calculate RCA and RCA*.

Experiments

Regression of Performance Drop
We present a regression-based method that can directly estimate the performance drop of a model trained on $D_s$ and tested on $D_t$. This method does not require any labeling in $D_t$; however, it assumes the availability of a small fixed number of labeled evaluation datasets $D_o \in \mathcal{D} \setminus \{D_s, D_t\}$. Using simple linear regression, we fit a regression line that relates, across these fixed evaluation datasets, the drop in model accuracy to a domain-shift detection metric of choice (§2). Afterwards, by calculating the value of this domain-shift detection metric on $D_t$, we can use this regression line to predict the performance drop when evaluating the model on $D_t$.
In our experiments we report the Mean Absolute Error (MAE) and the Max error between the performance drop predicted by our method and the actual performance drop on $D_t$.
To put in perspective the predictive power of this regression method on each proposed metric, we implement a baseline (Mean) that, instead of learning a regression, always predicts the average classification drop over all evaluation datasets in $D_o$. We also experiment with a linear regression that uses all metrics proposed in §2 at once (Ensemble).
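The regression step, the Mean baseline and the Ensemble variant can be sketched with ordinary least squares. The numbers below are made up for illustration and the helper names are hypothetical. Each point pairs the value of one domain-shift metric on an evaluation dataset with the accuracy drop measured on that dataset.

```python
import numpy as np

def fit_line(metric_values, drops):
    """Least-squares line: drop ~ slope * metric + intercept."""
    slope, intercept = np.polyfit(metric_values, drops, deg=1)
    return slope, intercept

def predict_drop(line, metric_value):
    slope, intercept = line
    return slope * metric_value + intercept

def fit_ensemble(metric_matrix, drops):
    """Linear regression on several metrics at once (one column per metric)."""
    A = np.column_stack([metric_matrix, np.ones(len(drops))])
    coef, *_ = np.linalg.lstsq(A, drops, rcond=None)
    return coef  # one weight per metric, followed by the intercept

def mae_and_max(predicted, actual):
    err = np.abs(np.asarray(predicted) - np.asarray(actual))
    return err.mean(), err.max()

# Hypothetical evaluation datasets: metric value and measured drop (in points).
metric_o = np.array([0.10, 0.30, 0.50, 0.70, 0.90])
drop_o = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

line = fit_line(metric_o, drop_o)
pred = predict_drop(line, 0.60)  # metric value computed on the unlabeled target
mean_baseline = drop_o.mean()    # the Mean baseline ignores the metric entirely

# Ensemble over two hypothetical metrics; the second one is uninformative here.
metrics_2d = np.array([[0.1, 0.2], [0.3, 0.1], [0.5, 0.4], [0.7, 0.3], [0.9, 0.6]])
coef = fit_ensemble(metrics_2d, drop_o)
```

Only the metric value has to be computed on the target domain, so prediction requires no target labels once the line has been fitted on the labeled evaluation datasets.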

Datasets
We evaluate our metrics across two tasks of different nature, namely sentiment analysis and part-of-speech (POS) tagging.

Sentiment Analysis
For sentiment analysis we follow Ruder and Plank (2018) by using the Amazon multi-domain reviews dataset in English (http://jmcauley.ucsd.edu/data/amazon/index.html). Although this dataset contains several domains, they all come from the same platform, which can limit diversity. To alleviate this we combine it with two other datasets, namely Yelp (https://www.yelp.com/dataset) and the IMDB movie reviews dataset (https://ai.stanford.edu/~amaas/data/sentiment/). Although the existing preprocessed version of the Amazon dataset is widely used, it only consists of 4 domains and the input documents are reduced to their TF-IDF weights, which is not a suitable input for modern NLP architectures. Thus we perform a different preprocessing across a wider range of domains, as follows. After removing redundant reviews, we preprocess the dataset to obtain binary labels such that reviews with 1 to 3 stars are labeled as negative while reviews with 4 or 5 stars are labeled as positive. Finally we randomly sample 21K reviews (10K train, 10K valid and 1K test) from each of 21 domains. The Yelp and IMDB datasets follow the same preprocessing steps and are added as 2 extra domains. This yields a new dataset with 23 domains for sentiment analysis and 506 domain-shift scenarios.
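The label mapping and the count of domain-shift scenarios can be checked with a short sketch. The domain names are placeholders, and a scenario is assumed to be an ordered (source, target) pair of distinct domains, which matches the reported count of 506 for 23 domains.

```python
from itertools import permutations

def star_to_label(stars):
    """Map a star rating to a binary sentiment label:
    1 to 3 stars -> 'negative', 4 or 5 stars -> 'positive'."""
    if not 1 <= stars <= 5:
        raise ValueError(f"unexpected star rating: {stars}")
    return "positive" if stars >= 4 else "negative"

def shift_scenarios(domains):
    """All ordered (source, target) pairs with source != target."""
    return list(permutations(domains, 2))

# Placeholder names for the 21 Amazon domains plus Yelp and IMDB.
domains = [f"amazon_{i}" for i in range(21)] + ["yelp", "imdb"]
scenarios = shift_scenarios(domains)
labels = [star_to_label(s) for s in (1, 3, 4, 5)]
```

Ordered pairs are used because training on one domain and testing on another is not symmetric: swapping source and target gives a different scenario.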

Part of Speech Tagging
For part-of-speech tagging we select 4 publicly available Universal Dependencies datasets for English (Nivre et al., 2016). We split the EWT dataset (the UD English Web Treebank) according to sub-category, while keeping the rest of the smaller datasets as is. This yields 8 domains of roughly comparable size (∼4K sentences each) and 56 domain-shift scenarios in total.

Experiment Setup and Training Details
For each domain-shift scenario, the task model is trained on the source domain training split. Testing is performed on both the source and the target domain test sets. Simultaneously, we calculate each of the metrics proposed in §2. Note that some of those metrics, such as PAD, PAD*, RCA and RCA*, require the text of the target domain test set. None of the proposed metrics require any labels from the target domain, in line with the unsupervised scenario we are considering. The initial word embeddings are a hyperparameter, and we consider random initialization, pretrained GloVe (Pennington et al., 2014) with several dimensions and contextualized word embeddings using ELMo. As model architecture we use a multi-layer Bi-LSTM (Graves and Schmidhuber, 2005) followed by a multi-layer feed-forward NN and a softmax. In sentiment analysis the feed-forward network is applied on the last output of the Bi-LSTM to produce one label prediction for the whole sentence, while for POS tagging it is applied to each output to produce a label prediction for each corresponding token. For training the domain classifiers used to calculate the PAD and PAD* metrics, we use a model architecture similar to the one used for sentiment analysis, initialized from scratch in the case of PAD or initialized with the weights of the best task model in the case of PAD*. Afterwards, they are trained to discriminate between inputs of the source and target domain datasets.
To calculate CONF CALIB, the best performing model is selected and its confidence weights are calibrated using temperature scaling on the source domain validation set. Each model is trained using Adam optimization (Kingma and Ba, 2015) and early stopping with patience 5 over the source domain validation set. All models and training code have been implemented using AllenNLP and made publicly available, together with the datasets. The detailed hyper-parameters and the test results for each source domain dataset are shown in the appendix.
This is unsurprising as all measures but PAD are model-specific. Instead of computing overall correlation trends, we decide to evaluate the capacity of each measure to serve as a predictor of the classification drop, as detailed in the next section.
Table 1 shows the mean absolute and maximum error values of the regression process that predicts the performance drop of a model trained on $D_s$ and evaluated on $D_t$. The baseline that always predicts the mean performance drop achieves a mean absolute error of 5.2% and a max error of 12.77% for sentiment analysis, while for POS tagging these numbers drop to a mean of 1.06% and a max of 1.67% in the worst case. All our proposed metrics improve significantly over that, with PAD* clearly improving over all others on both datasets. Overall, the best performing method is PAD*, with 2.15% and 0.88% mean absolute error in the prediction of performance drop for sentiment analysis and POS tagging respectively. Learning an ensemble over all metrics is not guaranteed to provide the best predictions, which could be due to the small number of points used for regression.

Impact of Number of Domains
Our proposed method for detecting performance drop assumes the existence of several evaluation datasets from different domains. This is a non-negligible cost, as having so many evaluation datasets from different domains might not be realistic in many scenarios. In this section we evaluate the impact of having a lower number of source domains from which to learn the classification drop. We randomly sample a smaller number of datasets and repeat the experiment. In Fig. 4 and Fig. 5 we report the results over 5 different runs of sampling. As expected, the error decreases as the number of out-of-domain test sets increases. However, the prediction error already reaches an acceptable level with only 3 annotated source domains.

Adversarial Shift
PAD* and PAD are calculated solely from learning to classify between the source and target domains and could therefore be particularly sensitive to task-irrelevant domain-shifts, i.e. input signals that help to tell domains apart but have no impact on the task itself. To evaluate the robustness of each of our proposed metrics in this specific scenario, we perform an experiment by applying an adversarial domain-shift. For each domain-shift scenario in the sentiment analysis task we add a unique tag, <SOURCE> or <TARGET>, at the beginning of each example in the source and the target domain respectively: this has no impact on the final task classification, but makes it trivial to discriminate between the two domains. The results of re-running the same experiment on this modified dataset are in Table 2. As expected, PAD is greatly affected, not performing better than the baseline which just predicts the mean classification drop. The other two families are less affected or not affected at all. Surprisingly, PAD* also degrades significantly, despite using a task representation which should learn to discard the newly introduced token, as it is useless for the task prediction. To understand this better, we analyze the behaviour of the models with different depths. In Fig. 7 the best performing model for a given depth is used: the results indicate that the capacity for predicting the classification drop using PAD* is greatly affected only if the model is very shallow. This becomes clearer in Fig. 6, where we repeat the plots from Fig. 3 for this adversarial dataset, restricting the hyper-parameter search of the task model to models of a fixed depth. At any model depth the PAD measure is always maximal (1.00), as it learns to differentiate the two domains perfectly. While this is also the case for PAD* with models of depth 2, deeper models are less sensitive to that. This might indicate that the higher layers (which are used as input representation for the domain classification models used to calculate PAD*) are learning to ignore the domain token, as it is irrelevant for the task at hand. The rest of the metrics are more robust with respect to adversarial examples and model depth, which might make them good candidates for cases of severe co-variate shift.
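The adversarial modification itself is simple to reproduce. The sketch below assumes raw text examples and a whitespace separator after the tag; both the helper name and the example sentences are illustrative.

```python
SOURCE_TAG = "<SOURCE>"
TARGET_TAG = "<TARGET>"

def add_domain_tag(examples, tag):
    """Prepend a domain tag to every example. The tag carries no sentiment
    information but makes the two domains trivially separable."""
    return [f"{tag} {text}" for text in examples]

source_examples = ["great phone, battery lasts for days", "stopped working after a week"]
target_examples = ["the plot was dull and predictable"]

tagged_source = add_domain_tag(source_examples, SOURCE_TAG)
tagged_target = add_domain_tag(target_examples, TARGET_TAG)
```

A domain classifier that reads the raw input only needs to look at the first token to separate the two sets, which is why PAD saturates in this setting.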

Related Work
A large body of work has tackled the problem of defining, measuring and adapting to domain-shift in machine learning and NLP (Quiñonero-Candela et al., 2009). Blitzer et al. (2008) introduce the Proxy A-distance as a proxy to H-divergence, which was used by Rai et al. (2010) to gain insights about the adaptability of representations for domain adaptation and active learning. More recently, the Proxy A-distance was shown to be effective as a similarity metric for dataset selection. H-divergence has inspired a large body of work on domain adaptation (Ganin et al., 2016; Tzeng et al., 2017) that minimizes the H-divergence between domains using a domain adversarial loss.
There exists a large line of work on developing domain similarity metrics for data selection under transfer learning (Moore and Lewis, 2010; Plank and van Noord, 2011; Axelrod et al., 2011; Wu and Huang, 2016); however, there is no consensus on which similarity measure is suitable for which NLP task. Ruder and Plank (2017) show that a combination of several of those similarity features, re-weighted using Bayesian Optimization, performs best across several tasks. This method, however, is not fully unsupervised, as it requires an annotated validation set from each target domain.
The closest works to ours are Ravi et al. (2008) and Van Asch and Daelemans (2010), as they also tackle the problem of predicting the performance of models at test time. They introduce a set of domain similarity metrics that correlate with the performance of a model on a test set for both part-of-speech tagging and parsing. However, many of those measures are built on top of heavily handcrafted features and evaluated through shallow models, which differ from the way modern NLP models are built now.
Confidence scores can be a good estimate of domain similarity; however, it is well known that the confidence scores of modern neural networks are usually mis-calibrated. Because of that, many self-training techniques rely on an ensemble of models to select in-domain training examples for bootstrapping, such as co-training and tri-training (Zhou and Li, 2005), tri-training with disagreement (Søgaard, 2010) and multi-task tri-training (Ruder and Plank, 2018). In our paper we aim to measure the potential classification drop of a specific model due to domain-shift. Training an ensemble of models of the same architecture might be expensive and inefficient for just measuring domain-shift. We try to overcome this problem by calibrating confidence scores (Zadrozny and Elkan, 2002; Guo et al., 2017; DeVries and Taylor, 2018). Using confidence scores of calibrated models has shown great success in out-of-distribution detection (Hendrycks and Gimpel, 2017; Liang et al., 2018).
The idea of reverse classification accuracy (Fan and Davidson, 2006; Zhong et al., 2010) was first introduced as a part of "Transfer Cross Validation" to select both models and data in a cross-validation framework, optimized for transfer learning. More recently, this was adapted as a confidence estimator for predicting segmentation performance in the clinical domain (Valindria et al., 2017). The same idea has recently been used for evaluating GANs for image generation (Shmelkov et al., 2018).

Conclusion
In this paper we studied the problem of predicting the performance drop due to domain-shift for modern NLP models, having no labeled target domain data but at least a few fixed evaluation datasets from other domains. We investigated three families of metrics for measuring domain similarity. In each of them we introduced a novel adaptation of existing metrics to account for the different nature of modern NLP models and possibly obtain better predictions of the performance drop. Our evaluation over two NLP tasks shows that this drop can be estimated very accurately even when only a few evaluation datasets from other source domains are available. In general, the family of H-divergence based measures performs the best. However, they are prone to fail when there is a severe change in the marginal distribution that is task-irrelevant. In particular, the well-established PAD measure is rendered useless in a setting where we artificially exaggerate that phenomenon. Using a task-specific representation is slightly more robust to that problem, although only if a deeper model is used. A strong family of measures that does not have that drawback is that of confidence-based measures, but they require access to the confidence weights and not only to the labels predicted by the model. Unfortunately, our adaptation of the reverse classification accuracy measure, RCA*, does not obtain higher performance than the simple RCA. This could be due to the fact that datasets sampled from the same domain can still be subject to sample selection bias. In future work, we plan to investigate ways to solve that. These conclusions are summarized in Table 3, which we recommend as a guideline when deciding which measure to use.