Active Sentence Learning by Adversarial Uncertainty Sampling in Discrete Space

Active learning for sentence understanding aims to discover informative unlabeled data for annotation and thereby reduce the demand for labeled data. We argue that the typical uncertainty sampling method for active learning is time-consuming and can hardly work in real-time, which may lead to ineffective sample selection. We propose adversarial uncertainty sampling in discrete space (AUSDS) to retrieve informative unlabeled samples more efficiently. AUSDS maps sentences into a latent space generated by popular pre-trained language models, and discovers informative unlabeled text samples for annotation via adversarial attack. The proposed approach is extremely efficient compared with traditional uncertainty sampling, with more than a 10x speedup. Experimental results on five datasets show that AUSDS outperforms strong baselines in effectiveness.


Introduction
Deep neural models have become popular in natural language processing (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). Neural models usually consume massive labeled data, which requires a huge amount of human labor. But data are not born equal: informative data with high uncertainty are decisive for the decision boundary and are worth labeling. Selecting such worth-labeling data from an unlabeled text corpus for annotation is thus an effective way to reduce human labor and obtain informative data.
Active learning approaches are a straightforward choice for reducing such human labor. Previous works, such as uncertainty sampling (Lewis and Gale, 1994), need to traverse all unlabeled data to find informative unlabeled samples, which are usually near the decision boundary with large entropy. However, the traversal is very time-consuming and thus cannot be executed frequently (Settles and Craven, 2008). A common choice is to perform the sampling process only after every fixed period: the learner samples and labels informative unlabeled data, then trains the model until convergence (Deng et al., 2018).
We argue that performing uncertainty sampling infrequently may lead to the "ineffective sampling" problem: in the early phase of training, the decision boundary changes quickly, which makes previously collected samples less effective after several updates of the model. Ideally, uncertainty sampling should be performed frequently in the early phase of model training.
In this paper, we propose adversarial uncertainty sampling in discrete space (AUSDS) to address the ineffective sampling problem for active sentence learning by introducing more frequent sampling with significantly lower costs. Specifically, we propose to leverage adversarial attacks (Goodfellow et al., 2014; Kurakin et al., 2016) for the selection of informative samples with high uncertainty, which significantly narrows down the search space. Fig. 1 shows the difference between uncertainty sampling and AUSDS. Typical uncertainty sampling (Fig. 1.a) traverses all the unlabeled samples to obtain samples of high uncertainty for each sampling run, which is costly, with time complexity O(Unlabeled Data Size). AUSDS (Fig. 1.b) first projects a labeled text to the decision boundary, yielding an adversarial data point, and then searches the nearest neighbors of this point. The computational cost of AUSDS is significantly smaller than that of typical uncertainty sampling, with time complexity O(Batch Size). However, it is non-trivial for AUSDS to perform adversarial attacks, which require adversarial gradients on sentences, since texts live in a discrete space. We propose to include a pre-trained neural encoder, such as BERT (Devlin et al., 2018), to map unlabeled sentences into a continuous space, over which the adversarial attack is performed. Since not every adversarial data point in the encoding space can be mapped back to one of the unlabeled sentences, we use the k-nearest neighbor (KNN) algorithm (Altman, 1992) to find the unlabeled sentences most similar to the adversarial data points (the adversarial samples). Besides, empirically, we mix some random samples into the uncertainty samples to alleviate the sampling bias issue mentioned by Huang et al. (2010). Finally, the mixed samples are sent to an oracle annotator to obtain their labels and are appended to the labeled data set.
We deploy AUSDS for active sentence learning and conduct experiments on five datasets across two NLP tasks, namely sequence classification and sequence labeling. Experimental results show that AUSDS outperforms random sampling and uncertainty sampling strategies.
Our contributions are summarized as follows: • We propose AUSDS for active sentence learning, which is the first to introduce adversarial attacks for sentence uncertainty sampling, alleviating the ineffective sampling problem.
• We propose to map sentences into the pretrained LM encoding space, which makes adversarial uncertainty sampling available in the discrete sentence space.
• Experimental results demonstrate that our active sentence learning framework by AUSDS, which we call AUSDS learning framework, outperforms strong baselines in sampling effectiveness with acceptable running time.

Related Work
This work focuses on reducing the labeled data size with the help of pre-trained LM in solving sentence learning tasks. The proposed AUSDS approach is related to two different research topics, active learning and adversarial attack.

Active Learning
Active learning algorithms can be categorized into three scenarios, namely membership query synthesis, stream-based selective sampling, and pool-based active learning (Settles, 2009). Our work is most related to pool-based active learning, which assumes that a small set of labeled data and a large pool of unlabeled data are available (Lewis and Gale, 1994). To reduce the demand for more annotations, the learner starts from the labeled data, selects one or more queries from the unlabeled data pool for annotation, then learns from the newly labeled data and repeats. The pool-based active learning scenario has been studied in many real-world applications, such as text classification (Lewis and Gale, 1994; Hoi et al., 2006), information extraction (Settles and Craven, 2008) and image classification (Joshi et al., 2009). Among the query strategies of existing active learning approaches, the uncertainty sampling strategy (Joshi et al., 2009; Lewis and Gale, 1994) is the most popular and widely used. The basic idea of uncertainty sampling is to enumerate the unlabeled samples and compute an uncertainty measurement, such as information entropy, for each sample. The enumeration and uncertainty computation make the sampling process costly, so it cannot be performed frequently, which induces the ineffective sampling problem.
There are some works that focus on accelerating the costly uncertainty sampling process. Jain et al. (2010) propose a hashing method to accelerate the sampling process in sub-linear time. Deng et al. (2018) propose to train an adversarial discriminator to select informative samples directly and avoid computing the rather costly sequence entropy. Nevertheless, the above works are still computationally expensive and cannot be performed frequently, which means the ineffective sampling problem still exists.

Adversarial Attack
Adversarial attacks were originally designed to approximate the smallest perturbation for a given latent state to cross the decision boundary (Goodfellow et al., 2014; Kurakin et al., 2016). As machine learning models are often vulnerable to adversarial samples, adversarial attacks have served as an important surrogate to evaluate the robustness of deep learning models before they are deployed (Biggio et al., 2013; Szegedy et al., 2013). Existing adversarial attack approaches can be categorized into three groups: one-step gradient-based approaches (Goodfellow et al., 2014; Rozsa et al., 2016), iterative methods (Kurakin et al., 2016) and optimization-based methods (Szegedy et al., 2013). Inspired by the similar goals of adversarial attacks and uncertainty sampling, in this paper, instead of considering adversarial attacks as a threat, we propose to combine these two approaches to achieve real-time uncertainty sampling. Some works share a related but distinct idea. Li et al. (2018) introduce active learning strategies into black-box attacks to enhance query efficiency. Pal et al. (2020) also use active learning strategies to reduce the number of queries for model extraction attacks. Zhu and Bento (2017) propose to train Generative Adversarial Networks to generate samples by minimizing the distance to the decision boundary directly, which is in the query synthesis scenario, different from ours. Ducoffe and Precioso (2018) also introduce adversarial attacks into active learning by augmenting the training set with adversarial samples of unlabeled data, which is infeasible in discrete space. Note that none of the works above share the same scenario as our problem setting.

Active Sentence Learning with AUSDS
We propose the AUSDS learning framework, an efficient and effective computational framework for active sentence learning. An overview of the learning framework is shown in Fig. 2. The framework consists of two blocks: a training block and a sampling block (AUSDS). The training block learns from the labeled data, whereas the sampling block retrieves valuable unlabeled samples, whose latent states are close to the decision boundary over the latent space, from the unlabeled text corpus. Note that the definition of the latent space can differ across encoders and tasks. The samples retrieved by the sampling block are further sent to an oracle annotator to obtain their labels, and the newly labeled samples are appended to the labeled data.
In this section, we first introduce the AUSDS method by showing how AUSDS selects samples that are critical to the decision boundary over the latent space. Then we present the computational procedure of the full-fledged framework in detail.
Algorithm 1 Active Sentence Learning with Adversarial Uncertainty Sampling in Discrete Space
Input: an unlabeled text corpus T_0, an oracle O, labeled data D_0 = {(s, O(s)) | s ∈ S_0}, where S_0 is a small initial text corpus, a pre-trained LM f_e, a fine-tuning interval j, and a fine-tuning step count k.
1: Initialize the encoder f_e with the pre-trained LM and randomly initialize the decoder f_d;
2: Train f_d on D_0 to compute an initial decision boundary;
3: Construct the sample mapper M between texts and their latent states;
4: Sample a training batch B_0 from D_0; i ← 0;
5: while T_i is not empty do
6:     Train f_d on B_i with the cross-entropy loss, keeping f_e frozen;
7:     Generate adversarial data points A ⊂ H using the adversarial attack algorithm;
8:     Retrieve adversarial samples S_a via KNN search between A and M;
9:     Inject S_a with random samples S_r, where |S_a| : |S_r| = p : 1 − p;
10:    Select top-k ranked samples S_add from the mixture w.r.t. the information entropy;
11:    Query the oracle to obtain Q = {(s, O(s)) | s ∈ S_add};
12:    T_{i+1} ← T_i \ S_add;
13:    D_{i+1} ← D_i ∪ Q;
14:    Sample a training batch B_{i+1} from Q and D_{i+1} by the ratio of q : 1 − q;
15:    if i mod j = 0 then
16:        Fine-tune f with D_{i+1} for k steps;
17:        Reconstruct the sample mapper M;
18:    end if
19:    i ← i + 1;
20: end while

AUSDS

AUSDS first defines a latent space, over which sentences are distinguishable according to the model's decision boundary. The latent space is usually determined by the encoder architecture and the downstream task. We detail the latent space definitions for specific encoders and tasks in Sec. 4.1.
First, we sample a batch of labeled texts and compute their representations as well as their gradients in the latent space. Using the latent states and their gradients, we perform adversarial attacks to generate adversarial data points A near the decision boundary in the latent space. Adversarial attacks are performed using the following existing approaches: • Fast Gradient Value (FGV) (Rozsa et al., 2016): a one-step gradient-based approach with high efficiency. The adversarial data points are generated by:

x' = x + λ ∇_x F_d(x),    (1)

where λ is a hyper-parameter, and F_d is the cross-entropy loss on x.
• DeepFool (Moosavi-Dezfooli et al., 2016): an iterative approach that finds the minimal perturbation sufficient to change the estimated label.
• C&W (Carlini and Wagner, 2017): an optimization-based approach, with the optimization problem defined as:

min_{x'} D(x, x')  s.t.  g(x') ≤ 0,

where g is a manually designed function satisfying g(x') ≤ 0 if and only if the label of x' is a specific target label, and D is a distance measurement such as the Minkowski distance.
FGV is computationally efficient, whereas the other two methods typically find more precise adversarial data points at a larger computational cost. We use all of them in our experiments to show the effectiveness of AUSDS.
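As an illustration of the one-step attack, the FGV update in Eq. (1) can be sketched for a linear softmax decoder over the latent space. This is a minimal sketch under our own assumptions (a toy decoder, an arbitrary λ), not the paper's implementation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def fgv_attack(x, y, W, b, lam=0.5):
    """One-step FGV attack on a softmax decoder over the latent space.

    x: latent state (d,); y: gold label index; W: decoder weights (c, d);
    b: decoder bias (c,). Returns x' = x + lam * grad_x CE(x, y).
    """
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W.T @ (p - onehot)  # closed-form gradient of cross-entropy w.r.t. x
    return x + lam * grad_x
```

Because the cross-entropy of a linear-softmax decoder is convex in x, moving along the raw gradient strictly increases the loss, pushing the point toward (or across) the decision boundary.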
In our sentence learning scenario, the adversarial data points A cannot be grounded in real natural language text samples. Thus we perform a k-nearest neighbor (KNN) search (Altman, 1992) to find unlabeled text samples whose latent states are the k nearest to the adversarial data points A.
We implement the KNN search using Faiss (Johnson et al., 2017), an efficient similarity search library with GPU support. The computational cost of the KNN search comes from two processes: constructing a sample mapper M between the text and latent spaces, and searching for latent states similar to the adversarial data points. The sample mapper M is constructed as a hash map, which is computationally efficient, to memorize the mapping between an unlabeled text s and its latent representation x. The sample mapper is only reconstructed when the encoder is updated, and infrequent encoder updates contribute to efficiency. Besides, the search process is also fast (100x faster than generating A) thanks to Faiss. Thus it is possible to perform AUSDS frequently at batch level without harming computation.
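A brute-force stand-in for the mapper and KNN search might look as follows. The class and method names are our own illustrative assumptions; a real implementation would back `knn` with a Faiss index rather than exhaustive distance computation:

```python
import numpy as np

class SampleMapper:
    """Mapper between unlabeled texts and their latent states.

    Rebuilt only when the encoder is updated. This brute-force version
    stands in for the Faiss index used in the paper.
    """
    def __init__(self, texts, latents):
        self.texts = list(texts)            # index -> text
        self.latents = np.asarray(latents)  # (n, d) latent states

    def knn(self, queries, k=2):
        """Return the k nearest unlabeled texts for each adversarial point."""
        q = np.asarray(queries)  # (m, d) adversarial data points
        # Squared Euclidean distances between every query and every latent state
        d2 = ((q[:, None, :] - self.latents[None]) ** 2).sum(-1)  # (m, n)
        idx = np.argsort(d2, axis=1)[:, :k]
        return [[self.texts[j] for j in row] for row in idx]
```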
After acquiring adversarial samples S_a via the KNN search, we mix S_a with random samples S_r drawn from the unlabeled text corpus T_i at the ratio p : 1 − p, where p is a hyper-parameter determined on the development set. The motivation for appending random samples is to balance exploration and exploitation, thus preventing the model from continuously retrieving samples in a small neighborhood.
We perform top-k ranking over the information entropy of the mixed samples to further retrieve the samples with higher uncertainty. Since the size of the mixed set is comparable to the batch size, the computational cost is acceptable. The selected samples are then sent to an oracle annotator O to obtain their labels.
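The mixing and top-k entropy ranking can be sketched as below. The pool size, the concrete ratio p, and the `model_probs` lookup are illustrative assumptions rather than the paper's code:

```python
import numpy as np

def entropy(probs):
    """Information entropy of a class-probability vector."""
    p = np.clip(np.asarray(probs), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def mix_and_select(s_a, s_r, model_probs, p=0.75, pool=4, k=2, seed=0):
    """Mix adversarial samples s_a with random samples s_r at ratio p : 1-p,
    then keep the top-k most uncertain samples by entropy.

    model_probs: dict mapping a sample to its predicted class probabilities.
    """
    rng = np.random.default_rng(seed)
    n_a = min(int(round(p * pool)), len(s_a))
    n_r = min(pool - n_a, len(s_r))
    mixed = list(rng.choice(s_a, n_a, replace=False)) + \
            list(rng.choice(s_r, n_r, replace=False))
    # Rank the mixed pool by descending entropy and keep the top-k
    ranked = sorted(mixed, key=lambda s: -entropy(model_probs[str(s)]))
    return [str(s) for s in ranked[:k]]
```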

Active Learning Framework
The overall procedure of the proposed framework equipped with AUSDS is outlined in Algorithm 1.

Initialization The initialization stage is shown in Algorithm 1, lines 1-4. We first initialize our encoder f_e with the pre-trained LM, which can be BERT_BASE (Devlin et al., 2018) or ELMo (Peters et al., 2018). The decoder is built upon the latent space and is randomly initialized. After building up the neural model architecture, we train only the decoder on the existing labeled data D_0 to compute an initial decision boundary in the latent space. Meanwhile, we construct an initial discrete sample mapper M used by the sampling block. Finally, we sample a training batch B_0 from the labeled data D_0 and set the current training step i to 0.
Training The training stage is shown in Algorithm 1, line 6. With the decoder f_d and a training batch B_i, we train the decoder with a cross-entropy loss (Fig. 2.b). Note that during the training process, we freeze the encoder as well as the latent space; a frozen latent space contributes to computational efficiency, since the mapper M need not be reconstructed.

Sampling
The sampling stage is shown in Algorithm 1, lines 7-14. As described in Sec. 3.1, given the gradients w.r.t. the latent states of the current batch B_i during training, the sampling process generates the adversarial samples S_a and labels the high-uncertainty samples from a mixture of S_a and randomly injected unlabeled data S_r. The labeled samples Q are removed from the unlabeled text corpus and inserted into the labeled data, resulting in T_{i+1} and D_{i+1} respectively. Then we create a new training batch consisting of samples from Q and D_{i+1} at a ratio of q : 1 − q, which favors the newly selected data Q, because the newly selected samples are considered more critical to the current decision boundary.

Fine-Tuning
The fine-tuning stage is shown in Algorithm 1, lines 15-18. We fine-tune the encoder for k steps after every j batches are trained. During the fine-tuning process, both the encoder and the decoder are trained on the accumulated labeled data set D_{i+1}. The encoder is fine-tuned to enhance overall performance; experiments show that the final performance degrades substantially without updating the encoder. We then update the mapper M for future KNN searches, because fine-tuning the encoder invalidates the projection from texts to latent states, which requires renewal of the sample mapper M. The algorithm terminates when the unlabeled text corpus T_i is used up.

Experiments
We evaluate the AUSDS learning framework on sequence classification and sequence labeling tasks. For the oracle labeler O, we directly use the labels provided by the datasets. In all the experiments, we report the average results of 5 runs with different random seeds to alleviate the influence of randomness.
The train/development/test splits follow the original settings in those papers. We use accuracy for sequence classification and F1 score for sequence labeling as the evaluation metrics.
Baseline Approaches. We use two common baseline approaches in NLP active learning to compare with our framework, namely random sampling (RM) and entropy-based uncertainty sampling (US). For sequence classification tasks, we adopt the widely used Max Entropy (ME) (Berger et al., 1996) as the uncertainty measurement:

ME(x) = − Σ_{i=1}^{c} P(y_i | x) log P(y_i | x),

where c is the number of classes. For sequence labeling tasks, we use the total token entropy (TTE) (Settles and Craven, 2008) as the uncertainty measurement:

TTE(x) = − Σ_{t=1}^{N} Σ_{i=1}^{l} P(y_t = i | x) log P(y_t = i | x),

where N is the sequence length and l is the number of labels.
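Both uncertainty measurements can be computed directly from the model's output distributions; a minimal sketch:

```python
import numpy as np

def max_entropy(probs):
    """ME for sequence classification: the entropy of the class distribution."""
    p = np.clip(np.asarray(probs), 1e-12, 1.0)  # (c,) class probabilities
    return float(-(p * np.log(p)).sum())

def total_token_entropy(token_probs):
    """TTE for sequence labeling: per-token entropies summed over N tokens."""
    p = np.clip(np.asarray(token_probs), 1e-12, 1.0)  # (N, l) label probabilities
    return float(-(p * np.log(p)).sum())
```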

Latent Space Definition
We use adversarial attacks in our AUSDS learning framework to find informative samples, which relies on a well-defined latent space. Two types of latent spaces are defined here, based on the encoder architectures and tasks: 1. For pre-trained LMs like BERT (Devlin et al., 2018), which have an extra token [CLS] for sequence classification, we directly use its latent state x as the representation of the whole sentence in the latent space H.
2. For other circumstances where no such special token is available, a mean-pooling operation is applied to the encoder output, i.e., x = (1/n) Σ_{t=1}^{n} h_t, where h_t denotes the contextual word representation of the t-th token produced by the encoder. The latent space H is spanned by all the latent states.

Implementation Details. We implement our framework based on the BERT_BASE model and ELMo. The configurations of the two models are the same as reported in (Devlin et al., 2018) and (Peters et al., 2018), respectively. The implementation of the KNN search is introduced in section 3.3. For the remaining hyper-parameters in our framework, 1) the batch size and the size of Q are set to 32 (16 on the MRPC dataset); 2) the fine-tuning interval j and the fine-tuning step count k are set to 50 steps; 3) the ratio q is set to 0.3. All the tuning experiments are performed on the dev sets of the five datasets. The accumulated labeled data set D is initialized identically for the different approaches, taking 0.1% of the whole unlabeled data (0.5% for MRPC because the dataset is relatively small).
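The two latent space definitions reduce to a pooling choice over the encoder's token states; a minimal sketch, assuming the [CLS] state sits at position 0 of the encoder output:

```python
import numpy as np

def sentence_latent(hidden_states, use_cls=False):
    """Map encoder outputs (n_tokens, d) to a single latent vector.

    use_cls=True mimics taking BERT's [CLS] state (assumed at position 0);
    otherwise mean-pool: x = (1/n) * sum_t h_t.
    """
    h = np.asarray(hidden_states)
    return h[0] if use_cls else h.mean(axis=0)
```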

Sampling Effectiveness
AUSDS achieves higher sampling effectiveness than uncertainty sampling, which suffers from the sampling bias problem. The main criterion for evaluating an active learning approach is sampling effectiveness, namely the model performance with a limited amount of unlabeled data sampled and labeled. Our AUSDS learning framework is compared with the two baselines using the same amount of labeled data. The limits are set to 2%, 4%, 6%, 8%, and 10% of all labeled data in each dataset. We label at most 10% of the whole training data, because active learning focuses on training with a quite limited amount of labeled data by selecting more valuable examples to label; with enough labeled data available, it makes little difference whether active learning is performed or not. We believe that with less labeled data, the performance gap, namely the difference in sampling effectiveness, is more pronounced.
We propose a training-from-scratch setting to better evaluate sampling effectiveness, in which models are trained from scratch using the labeled data sampled by the different approaches at various labeled data sizes. We argue that simply training the model until convergence after each sampling step, which we call the continuous training setting, can easily induce the problem of sampling bias (Huang et al., 2010). Biased models in the early training phase lead to worse performance even after more informative samples are given. Thus the performance of models during sampling cannot reflect the real informativeness of the selected samples. The from-scratch training results are shown in Table 3. Our framework consistently outperforms the random baseline because it selects more informative samples for identifying the shape of the decision boundary. It also outperforms common uncertainty sampling in most cases under the same labeled data size limits, because the frequent sampling in our approach alleviates the sampling bias issue. Uncertainty sampling suffers from the sampling bias problem because of the frequent variation of the decision boundary in the early phase of training, which results in ineffective sampling: the decision boundary is determined by merely a small number of labeled examples in the early phase, and such an easily biased decision boundary may lead to samples that have high uncertainty under the current model state but are not representative of the whole unlabeled data.

Figure 3: The margin of outputs on samples selected by different sampling strategies on SST-5. The margin denotes the difference between the largest and the second-largest output probabilities over the classes. The lower the margin, the closer the sample is to the decision boundary. Fig. (a) shows the average margin at each sampling step during training; the margins of samples selected by RM and US on the whole unlabeled data are also plotted as references. Fig. (b) shows the margin distribution of samples selected from sampling step 800 to 1000, where the average uncertainty becomes steady. US is omitted in Fig. (b) for better visualization.

From the overall results on the five standard benchmarks across two NLP tasks, we observe that our AUSDS achieves better sampling effectiveness with DeepFool for sequence classification and FGV for sequence labeling. The results of C&W are also included for completeness and comparison.
To show that our AUSDS framework does not depend heavily on BERT, we conduct experiments on SST-2 with ELMo as the encoder, which has a different network structure. The results in Table 4 show that in this setting, our AUSDS framework still achieves higher sampling effectiveness, while the original uncertainty sampling gets stuck in a more severe sampling bias problem. These results also serve as evidence of the generalization of our framework to other pre-trained LM encoding spaces.

Computational Efficiency
AUSDS is computationally more efficient than uncertainty sampling, and efficient enough to be performed at batch level, thus achieving real-time effective sampling. The average sampling speeds of the different approaches are compared relative to US (Table 2).
We observe that uncertainty sampling can hardly work in a real-time sampling setting because of its costly sampling process. Our AUSDS is more than 10x faster than common uncertainty sampling, and the larger the unlabeled data pool, the more significant the acceleration. Our framework spends more computation time than the random sampling baseline, but is still fast enough for real-time batch-level sampling. Moreover, the experimental results on sampling effectiveness in Sec. 4.2 show that the extra computation for adversarial samples is worthwhile, given the obvious performance improvement with the same amount of labeled data.

Samples Uncertainty
AUSDS indeed selects examples with higher uncertainty. We plot the margins of the outputs on samples selected by different sampling strategies on SST-5 in Fig. 3. We use the margin as a measurement of the distance to the decision boundary: lower margins indicate positions closer to the boundary. As shown in Fig. 3(a), the samples selected by our AUSDS with different attack approaches achieve lower average margins during sampling. Samples from steps 800 to 1000 are collected to estimate the margin distribution, as shown in Fig. 3(b). Our AUSDS better captures samples with high uncertainty, as their margin distributions are shifted to the left. Uncertainty sampling performed on the whole unlabeled data obtains the most uncertain samples, but it is very time-consuming and cannot be applied frequently.
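The margin used in Fig. 3 is simply the gap between the two largest class probabilities; a minimal sketch:

```python
import numpy as np

def margin(probs):
    """Difference between the largest and second-largest class probabilities.

    A margin near 0 means the sample lies close to the decision boundary.
    """
    top2 = np.sort(np.asarray(probs))[-2:]  # two largest probabilities
    return float(top2[1] - top2[0])
```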
In short, AUSDS achieves better sampling effectiveness in comparison with US because the more efficient batch-level sampling alleviates the problem of sampling bias. Adversarial attacks can be an effective way to find critical data points near the decision boundary.

Conclusion
Uncertainty sampling is an effective way to reduce the labeled data size in sentence learning, but uncertainty sampling with high latency may lead to the ineffective sampling problem. In this study, we propose adversarial uncertainty sampling in discrete space (AUSDS) for active sentence learning to address the ineffective sampling problem. The proposed AUSDS is more efficient than traditional uncertainty sampling, leveraging adversarial attacks and projecting discrete sentences into a pre-trained LM space. Experimental results on five datasets show that the proposed approach outperforms strong baselines in most cases and achieves better sampling effectiveness.