Practical Obstacles to Deploying Active Learning

Active learning (AL) is a widely-used training strategy for maximizing predictive performance subject to a fixed annotation budget. In AL, one iteratively selects training examples for annotation, often those for which the current model is most uncertain (by some measure). The hope is that active sampling leads to better performance than would be achieved under independent and identically distributed (i.i.d.) random samples. While AL has shown promise in retrospective evaluations, these studies often ignore practical obstacles to its use. In this paper, we show that while AL may provide benefits when used with specific models and for particular domains, the benefits of current approaches do not generalize reliably across models and tasks. This is problematic because in practice, one does not have the opportunity to explore and compare alternative AL strategies. Moreover, AL couples the training dataset with the model used to guide its acquisition. We find that subsequently training a successor model with an actively-acquired dataset does not consistently outperform training on i.i.d. sampled data. Our findings raise the question of whether the downsides inherent to AL are worth the modest and inconsistent performance gains it tends to afford.


Introduction
Although deep learning now achieves state-of-the-art results on a number of supervised learning tasks (Johnson and Zhang, 2016; Ghaddar and Langlais, 2018), realizing these gains requires large annotated datasets (Shen et al., 2018). This data dependence is problematic because labels are expensive. Several lines of research seek to reduce the amount of supervision required to achieve acceptable predictive performance, including semi-supervised (Chapelle et al., 2009), transfer (Pan and Yang, 2010), and active learning (AL) (Cohn et al., 1996; Settles, 2012).
In AL, rather than training on a set of labeled data sampled i.i.d. at random from some larger population, the learner engages the annotator in a cycle of learning, iteratively selecting training data for annotation and updating its model. Pool-based AL (the variant we consider) proceeds in rounds. In each, the learner applies a heuristic to score unlabeled instances, selecting the highest scoring instances for annotation. Intuitively, by selecting training data cleverly, an active learner might achieve greater predictive performance than it would by choosing examples at random.
The more informative samples come at the cost of violating the standard i.i.d. assumption upon which supervised machine learning typically relies. In other words, the training and test data no longer reflect the same underlying data distribution. Empirically, AL has been found to work well with a variety of tasks and models (Settles, 2012; Ramirez-Loaiza et al., 2017; Gal et al., 2017a; Zhang et al., 2017; Shen et al., 2018). However, academic investigations of AL typically omit key real-world considerations and may therefore overestimate its utility. For example, once a dataset is actively acquired with one model, it is seldom investigated whether this training sample will confer benefits if used to train a second model (versus i.i.d. data). Given that datasets often outlive learning algorithms, this is an important practical consideration. In contrast to experimental (retrospective) studies, in a real-world setting, an AL practitioner is not afforded the opportunity to retrospectively analyze or alter their scoring function. One would instead need to expend significant resources to validate that a given scoring function performs as intended for a particular model and task. This would require i.i.d. sampled data to evaluate the comparative effectiveness of different AL strategies. However, collection of such additional data would defeat the purpose of AL, i.e., obviating the need for a large amount of supervision. To confidently use AL in practice, one must have a reasonable belief that a given AL scoring (or acquisition) function will produce the desired results before deploying it (Attenberg and Provost, 2011).
Most AL research does not explicitly characterize the circumstances under which AL may be expected to perform well. Practitioners must therefore make the implicit assumption that a given active acquisition strategy is likely to perform well under any circumstances. Our empirical findings suggest that this assumption is not well founded and, in fact, common AL algorithms behave inconsistently across model types and datasets, often performing no better than random (i.i.d.) sampling (Figure 1a). Further, while there is typically some AL strategy that outperforms i.i.d. random samples for a given dataset, which heuristic does so varies from dataset to dataset.
Contributions. We highlight important but often overlooked issues in the use of AL in practice. We report an extensive set of experimental results on classification and sequence tagging tasks that suggest AL typically affords only marginal performance gains at the somewhat high cost of non-i.i.d. training samples, which do not consistently transfer well to subsequent models.

The (Potential) Trouble with AL
We illustrate inconsistent comparative performance using AL. Consider Figure 1a, in which we plot the relative gains (∆) achieved by a BiLSTM model using a maximum-entropy active sampling strategy, as compared to the same model trained with randomly sampled data. Positive values on the y-axis correspond to cases in which AL achieves better performance than random sampling, 0 (dotted line) indicates no difference between the two, and negative values correspond to cases in which random sampling performs better than AL. Across the four datasets shown, results are decidedly mixed.
And yet realizing these equivocal gains using AL brings inherent drawbacks. For example, acquisition functions generally depend on the underlying model being trained (Settles, 2009, 2012), which we will refer to as the acquisition model. Consequently, the collected training data and the acquisition model are coupled. This coupling is problematic because manually labeled data tends to have a longer shelf life than models, largely because it is expensive to acquire. However, progress in machine learning is fast. Consequently, in many settings, an actively acquired dataset may remain in use (much) longer than the source model used to acquire it. In these cases, a few natural questions arise: How does a successor model S fare, when trained on data collected via an acquisition model A? How does this compare to training S on natively acquired data? How does it compare to training S on i.i.d. data?
For example, if we use uncertainty sampling under a support vector machine (SVM) to acquire a training set D, and subsequently train a Convolutional Neural Network (CNN) using D, will the CNN perform better than it would have if trained on a dataset acquired via i.i.d. random sampling? And how does it perform compared to using a training corpus actively acquired using the CNN?
Figure 1b shows results for a text classification example using the Subjectivity corpus (Pang and Lee, 2004). We consider three models: a Bidirectional Long Short-Term Memory Network (BiLSTM) (Hochreiter and Schmidhuber, 1997), a Convolutional Neural Network (CNN) (Kim, 2014; Zhang and Wallace, 2015), and a Support Vector Machine (SVM) (Joachims, 1998). Training the BiLSTM with a dataset actively acquired using either of the other models yields predictive performance that is worse than that achieved under i.i.d. sampling. Given that datasets tend to outlast models, these results raise questions regarding the benefits of using AL in practice.
We note that in prior work, Tomanek and Morik (2011) also explored the transferability of actively acquired datasets, although their work did not consider modern deep learning models or share our broader focus on practical issues in AL.

Experimental Questions and Setup
We seek to answer two questions empirically: (1) How reliably does AL yield gains over sampling i.i.d.? And, (2) What happens when we use a dataset actively acquired using one model to train a different (successor) model? To answer these questions, we consider two tasks for which AL has previously been shown to confer considerable benefits: text classification and sequence tagging (specifically NER). Recent works have shown that AL is effective for these tasks even when using modern, neural architectures (Zhang et al., 2017; Shen et al., 2018), but they do not address our primary concerns regarding replicability and transferability. To build intuition, our experiments address both linear models and deep networks more representative of the current state-of-the-art for these tasks. We investigate the standard strategy of acquiring data and training using a single model, and also the case of acquiring data using one model and subsequently using it to train a second model. Our experiments consider all possible (acquisition, successor) pairs among the considered models, such that the standard AL scheme corresponds to the setting in which the acquisition and successor models are the same. For each pair $(A, S)$, we first simulate iterative active data acquisition with model $A$ to label a training dataset $D_A$. We then train the successor model $S$ using $D_A$.
In our evaluation, we compare the relative performance (accuracy or F1, as appropriate for the task) of the successor model trained with corpus $D_A$ to the scores achieved by training on comparable amounts of native and i.i.d. sampled data. We simulate pool-based AL using labeled benchmark datasets by withholding document labels from the models. This induces a pool of unlabeled data $U$. In AL, it is common to warm-start the acquisition model, training on some modest amount of i.i.d. labeled data $D_w$ before using the model to score candidates in $U$ (Settles, 2009) and commencing the AL process. We follow this convention throughout.
Once we have trained the acquisition model on the warm-start data, we begin the simulated AL loop, iteratively selecting instances for labeling and adding them to the dataset. We denote the dataset acquired by model $A$ at iteration $t$ by $D_A^t$; $D_A^0$ is initialized to $D_w$ for all models (i.e., all values of $A$). At each iteration, the acquisition model is trained with $D_A^t$. It then scores the remaining unlabeled documents in $U \setminus D_A^t$ according to a standard uncertainty AL heuristic. The top $n$ candidates $C_A^t$ are selected for (simulated) annotation. Their labels are revealed and they are added to the training set: $D_A^{t+1} \leftarrow D_A^t \cup C_A^t$. At the experiment's conclusion (time step $T$), each acquisition model $A$ will have selected a (typically distinct) subset of $U$ for training.
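The simulated acquisition loop described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: `fit` and `score` are hypothetical callables standing in for training the acquisition model and for the uncertainty heuristic, and instances are identified by integer pool indices.

```python
from typing import Callable, List, Sequence, Set


def simulate_active_learning(
    pool_size: int,
    warm_start: Sequence[int],
    fit: Callable[[List[int]], object],
    score: Callable[[object, int], float],
    batch_size: int,
    rounds: int,
) -> List[List[int]]:
    """Return the labeled index sets D^0, D^1, ..., D^T for one acquisition model.

    `fit` trains a model on the labeled indices; `score` returns an
    uncertainty score for one pool index (higher means more uncertain).
    Both are hypothetical stand-ins for a real model and heuristic.
    """
    labeled: List[int] = list(warm_start)  # D^0 is the warm-start set D_w
    history = [list(labeled)]
    for _ in range(rounds):
        model = fit(labeled)  # train the acquisition model on D^t
        labeled_set: Set[int] = set(labeled)
        candidates = [i for i in range(pool_size) if i not in labeled_set]
        if not candidates:
            break  # the pool is exhausted
        # The n highest-scoring unlabeled instances form the batch C^t.
        candidates.sort(key=lambda i: score(model, i), reverse=True)
        chosen = candidates[:batch_size]
        labeled = labeled + chosen  # D^{t+1} is the union of D^t and C^t
        history.append(list(labeled))
    return history
```

Returning every intermediate set, rather than only the final one, makes it possible to train a successor model on each $D_A^t$ afterwards.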
Once we have acquired datasets from each acquisition model $D_A$, we evaluate the performance of each possible successor model when trained on $D_A$. Specifically, we train each successor model $S$ on the acquired data $D_A^t$ for all $t$ in the range $[0, T]$, evaluating its performance on a held-out test set (distinct from $U$). We compare the performance achieved in this case to that obtained using an i.i.d. training set of the same size.
We run this experiment ten times, averaging results to create summary learning curves, as shown in Figure 1. All reported results, including i.i.d. baselines, are averages of ten experiments, each conducted with a distinct $D_w$. These learning curves quantify the comparative performance of a particular model achieved using the same amount of supervision, but elicited under different acquisition models. For each model, we compare the learning curves of each acquisition strategy, including active acquisition using a foreign model and subsequent transfer, active acquisition without changing models (i.e., typical AL), and the baseline strategy of i.i.d. sampling.
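As a rough sketch of how such summary curves might be assembled, the snippet below enumerates all (acquisition, successor) pairs, averages repeated runs point-wise, and reports the gain of each pair over the successor's i.i.d. baseline. The function names and the dictionary layout are assumptions made for illustration; a learning curve is represented simply as a list of scores, one per training-set size.

```python
from itertools import product
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Curve = List[float]  # one score per training-set size


def all_pairs(models: Sequence[str]) -> List[Tuple[str, str]]:
    """All (acquisition, successor) pairs, including acquisition == successor."""
    return list(product(models, repeat=2))


def average_curves(runs: List[Curve]) -> Curve:
    """Average several runs point-wise into one summary learning curve."""
    return [mean(points) for points in zip(*runs)]


def gains_over_iid(
    active_runs: Dict[Tuple[str, str], List[Curve]],
    iid_runs: Dict[str, List[Curve]],
) -> Dict[Tuple[str, str], Curve]:
    """For each pair, mean score on actively acquired data minus mean i.i.d. score."""
    deltas = {}
    for (acq, succ), runs in active_runs.items():
        active = average_curves(runs)
        baseline = average_curves(iid_runs[succ])
        deltas[(acq, succ)] = [a - b for a, b in zip(active, baseline)]
    return deltas
```

Positive entries in the returned curves correspond to training-set sizes at which the actively acquired data outperformed i.i.d. sampling for that pair.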

Tasks
We now briefly describe the models, datasets, acquisition functions, and implementation details for the experiments we conduct with active learners for text classification (4.1) and NER (4.2).

Text Classification
Models We consider three standard models for text classification: Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs) (Kim, 2014; Zhang and Wallace, 2015), and Bidirectional Long Short-Term Memory (BiLSTM) networks (Hochreiter and Schmidhuber, 1997). For SVM, we represent texts via sparse, TF-IDF bag-of-words (BoW) vectors. For neural models (CNN and BiLSTM), we represent each document as a sequence of word embeddings, stacked into an $l \times d$ matrix where $l$ is the length of the sentence and $d$ is the dimensionality of the word embeddings. We initialize all word embeddings with pretrained GloVe vectors (Pennington et al., 2014).
We initialize vector representations for all words for which we do not have pre-trained embeddings uniformly at random. For the CNN, we impose a maximum sentence length of 120 words, truncating sentences exceeding this length and padding shorter sentences. We used filter sizes of 3, 4, and 5, with 128 filters per size. For BiLSTMs, we selected the maximum sentence length such that 90% of sentences in $D^t$ would be of equal or lesser length. We trained all neural models using the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.
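The two length conventions above (a fixed cap with truncation and padding, and a cap chosen to cover 90% of sentences) might be implemented along these lines. This is a sketch under assumptions: sentences are token lists, padding is applied on the right, and the `PAD` token name is hypothetical.

```python
import math
from typing import List, Sequence

PAD = "<pad>"  # hypothetical padding token


def percentile_length(sentences: Sequence[Sequence[str]], coverage: float = 0.9) -> int:
    """Smallest length L such that at least `coverage` of sentences have length <= L."""
    lengths = sorted(len(s) for s in sentences)
    rank = math.ceil(coverage * len(lengths))  # number of sentences to cover
    return lengths[max(rank, 1) - 1]


def pad_or_truncate(sentence: Sequence[str], max_len: int) -> List[str]:
    """Truncate to max_len tokens, or right-pad with PAD up to max_len."""
    clipped = list(sentence[:max_len])
    return clipped + [PAD] * (max_len - len(clipped))
```

For the CNN setting one would call `pad_or_truncate(sentence, 120)`; for the BiLSTM setting the cap would come from `percentile_length` applied to the current training set.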

Datasets
We perform text classification experiments using four benchmark datasets. We reserve 20% of each dataset (sampled i.i.d. at random) as test data, and use the remaining 80% as the pool of unlabeled data $U$. We sample 2.5% of the remaining documents randomly from $U$ for each $D_w$. All models receive the same $D_w$ for any given experiment.
• Movie Reviews: This corpus consists of sentences drawn from movie reviews. The task is to classify sentences as expressing positive or negative sentiment (Pang and Lee, 2005).
• Subjectivity: This dataset consists of statements labeled as either objective or subjective (Pang and Lee, 2004).
• TREC: This task entails categorizing questions into 1 of 6 categories based on the subject of the question (e.g., questions about people, locations, and so on) (Li and Roth, 2002). The TREC dataset defines standard train/test splits, but we generate our own for consistency in train/validation/test proportions across corpora.
• Customer Reviews: This dataset is composed of product reviews. The task is to categorize them as positive or negative (Hu and Liu, 2004).
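The split described for the classification corpora (20% test, 80% pool, and a warm-start set of 2.5% of the pool) could be reproduced roughly as follows. The function name, the rounding behavior, and the use of a seed to obtain a distinct $D_w$ per experiment are assumptions for illustration.

```python
import random
from typing import List, Tuple


def split_pool(
    n_documents: int,
    test_fraction: float = 0.2,
    warm_fraction: float = 0.025,
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (test, pool, warm_start) lists of document indices.

    The test set is drawn at random; the remaining documents form the
    pool U, and the warm-start set D_w is a random subset of U.
    """
    rng = random.Random(seed)
    indices = list(range(n_documents))
    rng.shuffle(indices)
    n_test = int(round(test_fraction * n_documents))
    test, pool = indices[:n_test], indices[n_test:]
    n_warm = max(1, int(round(warm_fraction * len(pool))))
    warm_start = rng.sample(pool, n_warm)
    return test, pool, warm_start
```

Calling the function with a different `seed` for each repetition yields a different warm-start set, while all models within one repetition can share the same returned `warm_start`.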

Named Entity Recognition
Models For the CRF model we use a set of features including word-level and character-based embeddings, word suffix, capitalization, digit contents, and part-of-speech tags. The BiLSTM-CNN model initializes word vectors to pretrained GloVe vector embeddings (Pennington et al., 2014). We learn all word and character level features from scratch, initializing with random embeddings.
Datasets We perform NER experiments on the CoNLL-2003 and OntoNotes-5.0 English datasets. We used the standard test sets for both corpora, but merged training and validation sets to form $U$. We initialize each $D_w$ to 2.5% of $U$.
• CoNLL-2003: Sentences from Reuters news with words tagged as person, location, organization, or miscellaneous entities using an IOB scheme (Tjong Kim Sang and De Meulder, 2003). The corpus contains 301,418 words.
• OntoNotes-5.0: A corpus of sentences drawn from a variety of sources including newswire, broadcast news, broadcast conversation, and web data. Words are categorized using eighteen entity categories annotated using the IOB scheme (Weischedel et al., 2013). The corpus contains 2,053,446 words.

Acquisition Functions
We evaluate these models using three common active learning acquisition functions: classical uncertainty sampling, query by committee (QBC), and Bayesian active learning by disagreement (BALD).
Uncertainty Sampling For text classification we use the entropy variant of uncertainty sampling, which is perhaps the most widely used AL heuristic (Settles, 2009). Documents are selected for annotation according to the function

$$\operatorname*{argmax}_{x} \; -\sum_{j} P(y_j \mid x) \log P(y_j \mid x),$$

where $x$ are instances in the pool $U$, $j$ indexes potential labels of these (we have elided the instance index here) and $P(y_j \mid x)$ is the predicted probability that $x$ belongs to class $y_j$ (this estimate is implicitly conditioned on a model that can provide such estimates). For SVM, the equivalent form of this is to choose documents closest to the decision boundary.
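A minimal sketch of entropy-based selection follows, assuming the model's predictions are available as one probability vector per pool instance. The function names are illustrative and not taken from the experimental code.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy -sum_j p_j log p_j, treating 0 log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_by_entropy(pool_probs: Sequence[Sequence[float]], n: int) -> List[int]:
    """Indices of the n pool instances with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:n]
```

For a binary problem, the highest-entropy instances are those whose predicted probability is closest to 0.5, which mirrors the decision-boundary criterion used for the SVM.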
For the NER task we use maximum normalized log-probability (MNLP) (Shen et al., 2018) as our AL heuristic, which adapts the least confidence heuristic to sequences by normalizing the log probability of the predicted tag sequence by the sequence length. This avoids favoring longer sentences, which would otherwise be selected more often simply because the probability of getting the entire tag sequence right is lower.
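The length normalization can be illustrated with a short sketch. It assumes that, for each pool sentence, the per-token log-probabilities of the (greedily decoded) predicted tags are already available as a list; the function names are hypothetical.

```python
from typing import List, Sequence


def mnlp_score(token_log_probs: Sequence[float]) -> float:
    """Length-normalized log-probability of a predicted tag sequence."""
    return sum(token_log_probs) / len(token_log_probs)


def select_by_mnlp(pool_log_probs: Sequence[Sequence[float]], n: int) -> List[int]:
    """Indices of the n sentences with the lowest normalized log-probability."""
    ranked = sorted(
        range(len(pool_log_probs)),
        key=lambda i: mnlp_score(pool_log_probs[i]),
    )
    return ranked[:n]
```

A long sentence with confident per-token predictions can have a lower total log-probability than a short, uncertain one; dividing by the length removes that bias.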
Documents are sorted in ascending order according to the function

$$\max_{y_1, \ldots, y_n} \frac{1}{n} \sum_{i=1}^{n} \log P(y_i \mid y_1, \ldots, y_{i-1}, x),$$

where the max over $y$ assignments denotes the most likely set of tags for instance $x$ and $n$ is the sequence length. Because explicitly calculating the most likely tag sequence is computationally expensive, we follow Shen et al. (2018) in using greedy decoding (i.e., beam search with width 1) to determine the model's prediction.

Query by Committee For our QBC experiments, we use the bagging variant of QBC (Mamitsuka et al., 1998), in which a committee of $n$ models is assembled by sampling with replacement $n$ sets of $m$ documents from the training data ($D^t$ at each $t$). Each model is then trained using a distinct resulting set, and the pool documents that maximize their disagreement are selected. We use 10 as our committee size, and set $m$ equal to the number of documents in $D^t$.
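Assembling a bagged committee amounts to training one model per bootstrap resample of the current labeled set. The sketch below assumes a hypothetical `fit` function that maps a list of labeled examples to a trained model; the seed handling is likewise an illustrative choice.

```python
import random
from typing import Callable, List, Sequence, TypeVar

Example = TypeVar("Example")
Model = TypeVar("Model")


def train_bagged_committee(
    labeled: Sequence[Example],
    fit: Callable[[List[Example]], Model],
    committee_size: int = 10,
    seed: int = 0,
) -> List[Model]:
    """Train one model per bootstrap resample of the labeled set.

    Each resample has the same size as `labeled` and is drawn with
    replacement; `fit` is a hypothetical training function.
    """
    rng = random.Random(seed)
    committee = []
    for _ in range(committee_size):
        resample = rng.choices(labeled, k=len(labeled))  # sampling with replacement
        committee.append(fit(resample))
    return committee
```

Because each resample repeats some examples and omits others, the committee members differ from one another even though they share an architecture.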
For the text classification task, we compute disagreement using Kullback-Leibler divergence (McCallum and Nigam, 1998), selecting documents for annotation according to the function

$$\operatorname*{argmax}_{x} \; \frac{1}{C} \sum_{c=1}^{C} \sum_{j} P_c(y_j \mid x) \log \frac{P_c(y_j \mid x)}{P_C(y_j \mid x)},$$

where $x$ are instances in the pool $U$, $j$ indexes potential labels of these instances, and $C$ is the committee size. $P_c(y_j \mid x)$ is the probability that $x$ belongs to class $y_j$ as predicted by committee member $c$. $P_C(y_j \mid x)$ represents the consensus probability that $x$ belongs to class $y_j$,

$$P_C(y_j \mid x) = \frac{1}{C} \sum_{c=1}^{C} P_c(y_j \mid x).$$

For NER, we compute disagreement using the average per word vote-entropy (Dagan and Engelson, 1995), selecting sequences for annotation which maximize the function

$$-\frac{1}{n} \sum_{i=1}^{n} \sum_{m} \frac{V(y_i, m)}{C} \log \frac{V(y_i, m)}{C},$$

where $n$ is the sequence length, $C$ is the committee size, and $V(y_i, m)$ is the number of committee members who assign tag $m$ to word $i$ in their most likely tag sequence. We do not apply the QBC acquisition function to the OntoNotes dataset, as training the committee for this larger dataset becomes impractical.
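Both disagreement measures can be written compactly. The sketch below assumes that, for a single pool instance, the committee's outputs are given either as one probability vector per member (classification) or as one predicted tag sequence per member (NER); the function names are illustrative.

```python
import math
from collections import Counter
from typing import Sequence


def kl_disagreement(member_probs: Sequence[Sequence[float]]) -> float:
    """Mean KL divergence of each member's distribution from the consensus."""
    size = len(member_probs)
    n_classes = len(member_probs[0])
    consensus = [sum(p[j] for p in member_probs) / size for j in range(n_classes)]
    total = 0.0
    for probs in member_probs:
        # Terms with p == 0 contribute nothing; p > 0 implies consensus > 0.
        total += sum(p * math.log(p / q) for p, q in zip(probs, consensus) if p > 0.0)
    return total / size


def vote_entropy(member_tags: Sequence[Sequence[str]]) -> float:
    """Average per-word entropy of the committee's tag votes for one sentence."""
    size = len(member_tags)
    length = len(member_tags[0])
    total = 0.0
    for i in range(length):
        votes = Counter(tags[i] for tags in member_tags)
        total -= sum((v / size) * math.log(v / size) for v in votes.values())
    return total / length
```

Both scores are zero when the committee is unanimous and grow as members diverge, so the instances with the largest scores are the ones sent for annotation.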
Bayesian AL by Disagreement We use the Monte Carlo variant of BALD, which exploits an interpretation of dropout regularization as a Bayesian approximation to a Gaussian process (Gal et al., 2017b; Siddhant and Lipton, 2018). This technique entails applying dropout at test time, and then estimating uncertainty as the disagreement between outputs realized via multiple passes through the model. We use the acquisition function proposed by Siddhant and Lipton (2018), which selects for annotation those instances that maximize the number of passes through the model that disagree with the most popular choice:

$$\operatorname*{argmax}_{x} \left( 1 - \frac{\operatorname{count}\left(\operatorname{mode}\left(y_x^1, \ldots, y_x^T\right)\right)}{T} \right),$$

where $x$ are instances in the pool $U$, $y_x^i$ is the class prediction of the $i$th model pass on instance $x$, and $T$ is the number of passes taken through the model. Any ties are resolved using uncertainty sampling over the mean predicted probabilities of all $T$ passes.
In the NER task, agreement is measured across the entire sequence. Because this acquisition function relies on dropout, we do not consider it for non-neural models (SVM and CRF).
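The disagreement count used for BALD can be sketched as below. The sketch assumes the predictions from the stochastic (dropout) forward passes have already been collected for each pool instance, and that a secondary uncertainty score for tie-breaking is supplied separately; both inputs and the function names are assumptions for illustration. For NER, each pass prediction would be the whole tag sequence stored as a tuple, so that agreement is judged over the entire sequence.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def disagreement_fraction(pass_predictions: Sequence[Hashable]) -> float:
    """One minus the share of passes that agree with the most common prediction."""
    counts = Counter(pass_predictions)
    mode_count = counts.most_common(1)[0][1]
    return 1.0 - mode_count / len(pass_predictions)


def select_by_bald(
    pool_predictions: Sequence[Sequence[Hashable]],
    tie_break_scores: Sequence[float],
    n: int,
) -> List[int]:
    """Rank by disagreement, breaking ties with a secondary uncertainty score."""
    ranked = sorted(
        range(len(pool_predictions)),
        key=lambda i: (disagreement_fraction(pool_predictions[i]), tie_break_scores[i]),
        reverse=True,
    )
    return ranked[:n]
```

Sorting on a (disagreement, tie-break) tuple means the secondary score only matters among instances with identical disagreement.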

Results
We compare transfer between all possible (acquisition, successor) model pairs for each task. We report the performance of each model under all acquisition functions both in tables compiling results (Table 1 and Table 2 for classification and NER, respectively) and graphically via learning curves that plot predictive performance as a function of train set size (Figure 2).
We report additional results, including all learning curves (for all model pairs and for all tasks), and tabular results (for all acquisition functions) in the Appendix. We also provide in the Appendix plots resembling Figure 1a for all (model, acquisition function) pairs that report the difference between performance under standard AL (in which acquisition and successor model are the same) and that under commensurate i.i.d. data, which affords further analysis of the gains offered by standard AL. For text classification tasks, we report accuracies; for NER tasks, we report F1.
To compare the learning curves, we select incremental points along the x-axis and report the performance at these points. Specifically, we report results with training sets containing 10% and 20% of the training pool.

Discussion
Results in Tables 1 and 2 demonstrate that standard AL, where the acquisition and successor models are one and the same, performs inconsistently. AL thus seems to yield modest (though inconsistent) improvements over i.i.d. random sampling, but our results further suggest that this comes at an additional cost: the acquired dataset may not generalize well to new learners. Specifically, models trained on foreign actively acquired datasets tend to underperform those trained on i.i.d. datasets. We observe this most clearly in the classification task, where only a handful of (acquisition, successor, acquisition function) combinations lead to performance greater than that achieved using i.i.d. data. In particular, only 37.5% of the tabulated data points representing dataset transfer (in which acquisition and successor models differ) outperform the i.i.d. baseline.
Results for NER are more favorable for AL. For this task we observe consistently improved performance versus the i.i.d. baseline in both standard AL data points and transfer data points. These results are consistent with previous findings on transferring actively acquired datasets for NER (Tomanek and Morik, 2011).
In standard AL for text classification, the only (model, acquisition function) pairs that we observe to produce better than i.i.d. results with any regularity are uncertainty with SVM or CNN, and BALD with CNN. When transferring actively acquired datasets, we do not observe consistently better than i.i.d. results with any combination of acquisition model, successor model, and acquisition function. The success of AL appears to depend very much on the dataset. For example, AL methods (both in the standard and acquisition/successor settings) perform much more reliably on the Subjectivity dataset than on any other. In contrast, AL performs consistently poorly on the TREC dataset.
Our findings suggest that AL is brittle. During experimentation, we also found that performance often depends on factors that one may think are minor design decisions. For example, our setup largely resembles that of Siddhant and Lipton (2018), yet initially we observed large discrepancies in results. Digging into this revealed that much of the difference was due to our use of word2vec (Mikolov et al., 2013) rather than GloVe (Pennington et al., 2014) for word embedding initializations. That small decisions like this can result in relatively pronounced performance differences for AL strategies is disconcerting.
A key advantage afforded by neural models is representation learning. A natural question here is therefore whether the representations induced by the neural models differ as a function of the acquisition strategy. To investigate this, we measure pairwise distances between instances in the learned feature space after training. Specifically, for each test instance we calculate its cosine similarity to all other test instances, inducing a ranking. We do this in the three different feature spaces learned by the CNN and BiLSTM models, respectively, after sampling under the three acquisition models.
We quantify dissimilarities between the rankings induced under different representations via Spearman's rank correlation coefficients. We repeat this for all instances in the test set, and average over these coefficients to derive an overall similarity measure, which may be viewed as quantifying the similarity between learned feature spaces via average pairwise similarities within them. As reported in Table 4, despite the aforementioned differences in predictive performance, the learned representations seem to be similar. In other words, sampling under foreign acquisition models does not lead to notably different representations.
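The comparison of feature spaces described above can be sketched in plain Python. The sketch assumes each feature space is given as a list of nonzero vectors (one per test instance, in the same order in both spaces), that there are at least three instances, and that the similarities for a given instance are not all identical; the function names are illustrative.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            result[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def space_similarity(space_a: Sequence[Vector], space_b: Sequence[Vector]) -> float:
    """Mean, over instances, of the Spearman correlation between the similarity
    rankings that the two feature spaces induce over the other instances."""
    n = len(space_a)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        sims_a = [cosine(space_a[i], space_a[j]) for j in others]
        sims_b = [cosine(space_b[i], space_b[j]) for j in others]
        total += spearman(sims_a, sims_b)
    return total / n
```

A value near 1 indicates that the two spaces order each instance's neighbors almost identically, even if the raw coordinates differ.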

Conclusions
We extensively evaluated standard AL methods under varying model, domain, and acquisition function combinations for two standard NLP tasks (text classification and sequence tagging). We also assessed performance achieved when transferring an actively sampled training dataset from an acquisition model to a distinct successor model. Given the longevity and value of training sets and the frequency at which new ML models advance the state-of-the-art, this should be an anticipated scenario: Annotated data often outlives models.
Our findings indicate that AL performs unreliably. While a specific acquisition function and model applied to a particular task and domain may be quite effective, it is not clear that this can be predicted ahead of time. Indeed, there is no way to retrospectively determine the relative success of AL without collecting a relatively large quantity of i.i.d. sampled data, and this would undermine the purpose of AL in the first place. Further, even if such an i.i.d. sample were taken as a diagnostic tool early in the active learning cycle, relative success early in the AL cycle is not necessarily indicative of relative success later in the cycle, as illustrated by Figure 1a.
Problematically, even in successful cases, an actively sampled training set is linked to the model used to acquire it. We have found that training successor models with this set will often result in performance worse than that attained using an equivalently sized i.i.d. sample. Results are more favorable to AL for NER, as compared to text classification, which is consistent with prior work (Tomanek and Morik, 2011).
In short, the relative performance of individual active acquisition functions varies considerably over datasets and domains. While AL often does yield gains over i.i.d. sampling, these tend to be marginal and inconsistent. Moreover, this comes at a relatively steep cost: The acquired dataset may be disadvantageous for training subsequent models. Together these findings raise serious concerns regarding the efficacy of active learning in practice.

Appendices A Experimental Results
Below, we present full results for all our experiments in the form of tabular results and learning curves. Tables 5 and 6 enumerate performance metrics for all (source, successor, acquisition function) combinations after acquiring 10% and 20% of the pool. Figure 3 shows the learning curves for all combinations. We report all average Spearman's rank correlation coefficients in Table 7.
[Figure 3 panels: one learning-curve plot per (model, dataset, acquisition function) combination, e.g., SVM, CNN, and BiLSTM on the Movie Reviews and TREC datasets using max entropy, QBC, and BALD.]

Figure 1: We highlight practical issues in the use of AL. (a) Performance of AL relative to i.i.d. across corpora: AL yields inconsistent gains, relative to a baseline of i.i.d. sampling. (b) Transferring actively acquired training sets: Training a BiLSTM with training sets actively acquired based on the uncertainty of other models tends to result in worse performance than training on i.i.d. samples.
Figure 2: Sample learning curves for the text classification task on the Movie Reviews dataset and the NER task on the OntoNotes dataset using the maximum entropy acquisition function (we report learning curves for all models and datasets in the Appendix). Individual plots correspond to successor models. Each line corresponds to an acquisition model, with the blue line representing an i.i.d. baseline.

Figure 3: This appendix contains the full set of collected learning curves for the text classification and NER tasks. Error bars represent one standard deviation.

Table 1: Text classification accuracy, evaluated for each combination of acquisition and successor models using uncertainty sampling. Accuracies are reported for training sets composed of 10% and 20% of the document pool. Colors indicate performance relative to i.i.d. baselines: Blue indicates that a model fared better, red that it performed worse, and black that it performed the same.

Table 2: F1 measurements for the NER task, with training sets comprising 10% and 20% of the training pool.

Table 3: Text classification dataset statistics.

Table 4: Average Spearman's rank correlation coefficients (over five runs) of cosine distances between test set representations learned with native active learning and distances between those learned with transferred actively acquired datasets, at the end of the AL process. Uncertainty is used as the acquisition function in all cases.

Table 5: Text classification accuracy, evaluated for each combination of acquisition and successor models using uncertainty sampling, QBC, and BALD. Accuracies are reported for training sets composed of 10% and 20% of the document pool. Colors indicate performance relative to i.i.d. baselines: Blue implies that a model fared better, red that it performed worse, and black that it performed the same.

Table 6: F1 measurements for the NER task, with training sets comprising 10% and 20% of the training pool.

Table 7: Average Spearman's rank correlation coefficients of cosine distances between test set representations learned with native active learning and distances between those learned with transferred actively acquired datasets.