SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup

Active learning is an important technique for low-resource sequence labeling tasks. However, current active sequence labeling methods use the queried samples alone in each iteration, which is an inefficient way of leveraging human annotations. We propose a simple but effective data augmentation method to improve the label efficiency of active sequence labeling. Our method, SeqMix, simply augments the queried samples by generating extra labeled sequences in each iteration. The key difficulty is to generate plausible sequences along with token-level labels. In SeqMix, we address this challenge by performing mixup for both sequences and token-level labels of the queried samples. Furthermore, we design a discriminator during sequence mixup, which judges whether the generated sequences are plausible or not. Our experiments on Named Entity Recognition and Event Detection tasks show that SeqMix can improve the standard active sequence labeling method by $2.27\%$--$3.75\%$ in terms of $F_1$ scores. The code and data for SeqMix can be found at https://github.com/rz-zhang/SeqMix


Introduction
Many NLP tasks can be formulated as sequence labeling problems, such as part-of-speech (POS) tagging (Zheng et al., 2013), named entity recognition (NER) (Lample et al., 2016), and event extraction (Yang et al., 2019). Recently, neural sequential models (Lample et al., 2016; Akbik et al., 2018; Vaswani et al., 2017) have shown strong performance on various sequence labeling tasks. However, these deep neural models are label hungry: they require large amounts of annotated sequences to achieve strong performance. Obtaining large amounts of annotated data can be too expensive for practical sequence labeling tasks because of the token-level annotation effort.
Active learning is an important technique for sequence labeling in low-resource settings. Active sequence labeling is an iterative process. In each iteration, a fixed number of unlabeled sequences are selected by a query policy for annotation and then model updating, in hope of maximally improving model performance. For example, Tomanek et al. (2007); Shen et al. (2017) select query samples based on data uncertainties; Hazra et al. (2019) compute model-aware similarity to eliminate redundant examples and improve the diversity of query samples; and Fang et al. (2017); Liu et al. (2018) use reinforcement learning to learn query policies. However, existing methods for active sequence labeling all use the queried samples alone in each iteration. We argue that the queried samples provide limited data diversity, and using them alone for model updating is inefficient in terms of leveraging human annotation efforts.
We study the problem of enhancing active sequence labeling via data augmentation. We aim to generate augmented labeled sequences for the queried samples in each iteration, thereby introducing more data diversity and improving model generalization. However, data augmentation for active sequence labeling is challenging, because we need to generate sentences and token-level labels jointly. Prevailing generative models (Zhang et al., 2016; Bowman et al., 2016) are inapplicable because they can only generate word sequences without labels. It is also infeasible to apply heuristic data augmentation methods such as context-based word substitution (Kobayashi, 2018), synonym replacement, random insertion, swap, and deletion (Wei and Zou, 2019), paraphrasing (Cho et al., 2019), or back translation, because label composition is complex for sequence labeling. Directly using these techniques to manipulate tokens may inject incorrectly labeled sequences into the training data and harm model performance.
We propose SeqMix, a data augmentation method for generating sub-sequences along with their labels based on mixup (Zhang et al., 2018). Under the active sequence labeling framework, SeqMix is capable of generating plausible pseudo-labeled sequences for the queried samples in each iteration. This is enabled by two key techniques in SeqMix: (1) First, in each iteration, it searches for pairs of eligible sequences and mixes them both in the feature space and the label space.
(2) Second, it has a discriminator to judge if the generated sequence is plausible or not. The discriminator is designed to compute the perplexity scores for all the generated candidate sequences and select the low-perplexity sequences as plausible ones.
We show that SeqMix consistently outperforms standard active sequence labeling baselines under different data usage percentiles with experiments on Named Entity Recognition and Event Detection tasks. On average, it achieves 2.95%, 2.27%, and 3.75% $F_1$ improvements on the CoNLL-2003, ACE05, and WebPage datasets, respectively. The advantage of SeqMix is especially prominent in low-resource scenarios, where it achieves 12.06%, 8.86%, and 16.49% $F_1$ improvements over the original active learning approach on the above three datasets. Our results also verify that the proposed mixup strategies and the discriminator are vital to the performance of SeqMix.

Problem Definition
Many NLP problems can be formulated as sequence labeling problems. Given an input sequence, the task is to annotate it with token-level labels. The labels often consist of a position prefix provided by a labeling schema and a type indicator provided by the specific task. For example, in the named entity recognition task, we can adopt the BIO (Beginning, Inside, Outside) tagging scheme (Màrquez et al., 2005) to assign labels for each token: the first token of an entity mention with type X is labeled as B-X, the tokens inside that mention are labeled as I-X and the non-entity tokens are labeled as O.
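As a small illustration of the BIO scheme described above, the following sketch converts entity spans into token-level tags. The helper name `spans_to_bio`, the span format, and the example sentence are assumptions made for illustration and do not come from the paper.

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end_exclusive, type) entity spans into BIO tags."""
    tags = ["O"] * num_tokens                  # non-entity tokens default to O
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"          # first token of the mention
        for pos in range(start + 1, end):
            tags[pos] = f"I-{ent_type}"        # tokens inside the mention
    return tags


# Hypothetical example sentence with one person and one location mention.
tokens = ["John", "Smith", "visited", "New", "York", "."]
tags = spans_to_bio(len(tokens), [(0, 2, "PER"), (3, 5, "LOC")])
print(list(zip(tokens, tags)))
```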
Given a large unlabeled corpus $\mathcal{U}$, traditional active learning starts from a small annotated seed set $\mathcal{L}$ and uses a query function $\psi(\mathcal{U}, K, \gamma(\cdot))$ to obtain the $K$ most informative unlabeled samples $\mathcal{X} = \{x_1, \ldots, x_K\}$ along with their labels $\mathcal{Y} = \{y_1, \ldots, y_K\}$, where $\gamma(\cdot)$ is the query policy. We then remove $\mathcal{X}$ from the unlabeled data $\mathcal{U}$ and repeat the above procedure until satisfactory performance is achieved or the annotation capacity is reached.
In SeqMix, we aim to further exploit the annotated set $\langle \mathcal{X}, \mathcal{Y} \rangle$ to generate augmented data $\langle \mathcal{X}^*, \mathcal{Y}^* \rangle$. The labeled dataset is then expanded as $\mathcal{L} = \mathcal{L} \cup \langle \mathcal{X}, \mathcal{Y} \rangle \cup \langle \mathcal{X}^*, \mathcal{Y}^* \rangle$. Formally, we define our task as: (1) construct a generator $\phi(\cdot)$ to implement sequence and label generation based on the actively sampled data $\mathcal{X}$ and its labels $\mathcal{Y}$, (2) set a discriminator $d(\cdot)$ to yield the filtered generation, and then (3) augment the labeled set as $\mathcal{L} = \mathcal{L} \cup \langle \mathcal{X}, \mathcal{Y} \rangle \cup d(\phi(\mathcal{X}, \mathcal{Y}))$.

Active Learning for Sequence Labeling
Active sequence labeling selects the $K$ most informative instances $\psi(\cdot, K, \gamma(\cdot))$ in each iteration, with the hope of maximally improving model performance under a fixed labeling budget. For an input sequence $x$ of length $T$, we denote the model output as $f(\cdot \mid x; \theta)$. Our method is generic to any query policy $\gamma(\cdot)$. Below, we introduce several representative policies.
Least Confidence (LC) Culotta and McCallum (2005) measure the uncertainty of sequence models by the most likely predicted sequence. For a CRF model (Lafferty et al., 2001), we calculate $\gamma$ with the predicted sequential label $y^*$ as

$$\gamma^{LC}(x) = 1 - \max_{y^*} P(y^* \mid x; \theta), \tag{1}$$

where $y^*$ is the Viterbi parse. For BERT (Devlin et al., 2019) with a token classification head, we adopt a variant of the least confidence measure:

$$\gamma^{LC'}(x) = \sum_{t=1}^{T} \left(1 - \max_{y_t} P(y_t \mid x; \theta)\right), \tag{2}$$

where $P(y_t \mid x; \theta) = \mathrm{softmax}(f(y_t \mid x; \theta))$.
Normalized Token Entropy (NTE) Another uncertainty measure for the query policy is normalized entropy (Settles and Craven, 2008), defined as:

$$\gamma^{TE}(x) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} P_m(y_t \mid x, \theta) \log P_m(y_t \mid x, \theta), \tag{3}$$

where $P_m(y_t \mid x, \theta) = [\mathrm{softmax}(f(y_t \mid x; \theta))]_m$ and $M$ is the number of labels.

Disagreement Sampling Query-by-committee (QBC) (Seung et al., 1992) is another approach for specifying the policy, where the unlabeled data can be sampled by the disagreement of the base models. The disagreement can be defined in several ways; here we take the vote entropy proposed by Dagan and Engelson (1995). Given a committee consisting of $C$ models, the vote entropy for input $x$ is:

$$\gamma^{VE}(x) = -\sum_{t=1}^{T} \sum_{m=1}^{M} \frac{V_m(y_t)}{C} \log \frac{V_m(y_t)}{C}, \tag{4}$$

where $V_m(y_t)$ is the number of models that predict the $t$-th token $x_t$ as the label $m$.
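The three query scores above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the token-level least confidence sums $1 - \max$ probability over tokens, that the token entropy is averaged over the sequence length, and that vote entropy is summed over tokens without length normalization. The function names and the list-of-lists input format are hypothetical.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def least_confidence(token_logits):
    """Token-level least confidence: sum over tokens of (1 - max probability)."""
    return sum(1.0 - max(softmax(row)) for row in token_logits)


def normalized_token_entropy(token_logits):
    """Entropy of each token's label distribution, averaged over the tokens."""
    total = 0.0
    for row in token_logits:
        total -= sum(q * math.log(q) for q in softmax(row) if q > 0.0)
    return total / len(token_logits)


def vote_entropy(committee_predictions):
    """committee_predictions[c][t] is the label model c predicts for token t."""
    num_models = len(committee_predictions)
    num_tokens = len(committee_predictions[0])
    total = 0.0
    for t in range(num_tokens):
        votes = Counter(model[t] for model in committee_predictions)
        for count in votes.values():
            frac = count / num_models
            total -= frac * math.log(frac)
    return total


# Two tokens, two labels, completely uncertain model (assumed toy logits).
uncertain = [[0.0, 0.0], [0.0, 0.0]]
print(least_confidence(uncertain), normalized_token_entropy(uncertain))
# Committee of C = 3 models disagreeing on a single token.
print(vote_entropy([["B-PER"], ["O"], ["O"]]))
```

In all three cases a higher score marks a sequence as more informative, so a query function would rank the unlabeled pool by the score and take the top $K$.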

Overview
Given a corpus for sequence labeling, we assume the dataset initially contains a small labeled set $\mathcal{L}$ and a large unlabeled set $\mathcal{U}$. We start by augmenting the seed set $\mathcal{L}$ with SeqMix. First, we adopt a pairing function $\zeta(\cdot)$ to find paired samples by traversing $\mathcal{L}$. Next, we generate mixed-labeled sequences via latent space linear interpolation with one of the approaches mentioned in Section 3.2. To ensure the semantic quality of the generated sequences, we use a discriminator $d(\cdot)$ to measure their perplexity and filter out low-quality sequences. We thus generate the extra labeled sequences $\mathcal{L}^* = \mathrm{SeqMix}(\mathcal{L}, \alpha, \zeta(\cdot), d(\cdot))$ and obtain the augmented training set $\mathcal{L} = \mathcal{L} \cup \mathcal{L}^*$. The sequence labeling model $\theta$ is initialized on this augmented training set $\mathcal{L}$.
After that, the iterative active learning procedure begins. In each iteration, we actively select instances from $\mathcal{U}$ with a query policy $\gamma(\cdot)$ (Section 2.2) to obtain the top $K$ samples $\mathcal{X} = \psi(\mathcal{U}, K, \gamma(\cdot))$. The newly selected samples are labeled with $\mathcal{Y}$, and the batch of samples $\langle \mathcal{X}, \mathcal{Y} \rangle$ is used for SeqMix. Again, we generate $\mathcal{L}^* = \mathrm{SeqMix}(\langle \mathcal{X}, \mathcal{Y} \rangle, \alpha, \zeta(\cdot), d(\cdot))$ and expand the training set as $\mathcal{L} = \mathcal{L} \cup \mathcal{L}^*$. Then we train the model $\theta$ on the newly augmented set $\mathcal{L}$. The iterative active learning procedure terminates when a fixed number of iterations is reached. We summarize the above procedure in Algorithm 1.
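The overall loop described above can be sketched as follows. This is a schematic under stated assumptions, not the released code: the callables `score_fn`, `annotate_fn`, `seqmix_fn`, and `train_fn` are hypothetical stand-ins for the query policy, the human annotator, SeqMix generation, and model training, and the sketch adds both the newly annotated batch and its augmentation to the labeled set, following the task definition.

```python
def run_active_seqmix(labeled, unlabeled, score_fn, annotate_fn, seqmix_fn,
                      train_fn, k, num_rounds):
    """Active learning where every newly labeled batch is augmented before training."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)

    # Seed set augmentation: the model is initialized on the seed set plus its mixups.
    labeled += seqmix_fn(labeled)
    model = train_fn(labeled)

    for _ in range(num_rounds):
        if not unlabeled:
            break
        # Query the K samples with the highest policy score and remove them from the pool.
        ranked = sorted(unlabeled, key=lambda x: score_fn(model, x), reverse=True)
        queried, unlabeled = ranked[:k], ranked[k:]
        batch = [(x, annotate_fn(x)) for x in queried]
        # Add the annotated batch and its generated sequences, then retrain.
        labeled += batch + seqmix_fn(batch)
        model = train_fn(labeled)
    return model, labeled


# Toy stand-ins, purely for illustration.
seed = [("a b", ["O", "O"])]
pool = ["c d e", "f", "g h", "i j k l"]
model, data = run_active_seqmix(
    seed, pool,
    score_fn=lambda model, x: len(x.split()),       # stand-in for an uncertainty score
    annotate_fn=lambda x: ["O"] * len(x.split()),   # stand-in for a human annotator
    seqmix_fn=lambda batch: batch[:1],              # stand-in: "generates" one sequence
    train_fn=lambda examples: len(examples),        # stand-in model: training-set size
    k=2, num_rounds=2)
print(model, len(data))
```

Passing the components as callables keeps the loop independent of the query policy, which mirrors the claim that the augmentation is generic to any $\gamma(\cdot)$.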

Sequence Mixup in the Embedding Space
Mixup (Zhang et al., 2018) is a data augmentation method that implements linear interpolation in the input space. Given two input samples $x_i$, $x_j$ along with the labels $y_i$, $y_j$, the mixing process is:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \tag{5}$$

$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{6}$$

where $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ is the mixing coefficient. Through linear combinations on the input level of paired examples and their labels, Mixup regularizes the model to present linear behavior among the training data.

Algorithm 1: The procedure of active sequence labeling augmentation via SeqMix
Input: Labeled seed set L; Unlabeled set U; Query function ψ(·, K, γ(·)); The sequence labeling model θ; Beta distribution parameter α; Pairing function ζ(·); Discriminator function d(·).
    // seed set augmentation
    L* = SeqMix(L, α, ζ(·), d(·))
    L = L ∪ L*
    Initialize θ on L
    // active learning iterations with augmentation
    for round in active learning rounds do
        X = ψ(U, K, γ(·)); remove X from U
        Annotate X to obtain ⟨X, Y⟩
        L* = SeqMix(⟨X, Y⟩, α, ζ(·), d(·))
        L = L ∪ ⟨X, Y⟩ ∪ L*
        Train θ on L
Output: The sequence model trained with active data augmentation: θ

Mixup is not directly applicable to generate interpolated samples for text data, because the input space is discrete. To overcome this, SeqMix performs token-level interpolation in the embedding space and selects a token closest to the interpolated embedding. Specifically, SeqMix constructs a table of tokens $W$ and their corresponding contextual embeddings $E$ (the construction of $\{W, E\}$ is discussed in the Appendix). Given two sequences $x_i = \{w_i^1, \cdots, w_i^T\}$ and $x_j = \{w_j^1, \cdots, w_j^T\}$ with embeddings $e_{x_i} = \{e_i^1, \cdots, e_i^T\}$ and $e_{x_j} = \{e_j^1, \cdots, e_j^T\}$, the $t$-th mixed token is the token whose embedding $e^t$ is closest to the mixed embedding:

$$e^t = \arg\min_{e \in E} \left\| e - \left(\lambda e_i^t + (1 - \lambda) e_j^t\right) \right\|_2. \tag{7}$$
To get the corresponding token $w^t$, we can query the table $\{W, E\}$ using $e^t$. The label generation is straightforward. For two label sequences $y_i = \{y_i^1, \cdots, y_i^T\}$ and $y_j = \{y_j^1, \cdots, y_j^T\}$, we get the $t$-th mixed label as:

$$y^t = \lambda y_i^t + (1 - \lambda) y_j^t, \tag{8}$$

where $y_i^t$ and $y_j^t$ are one-hot encoded labels. Along with the above sequence mixup procedures, we also introduce a pairing strategy that selects sequences for mixup. The reason is that, in many sequence labeling tasks, the labels of interest are scarce. For example, in the NER and event detection tasks, the "O" label, which does not refer to any entity or event of interest, is dominant in the corpus. We thus define the labels of interest as valid labels, e.g., the non-"O" labels in NER and event detection, and design a sequence pairing function to select more informative parent sequences for mixup. Specifically, the sequence pairing function $\zeta(\cdot)$ is designed according to valid label density. For a sequence, its valid label density is defined as $\eta = n / s$, where $n$ is the number of valid labels and $s$ is the length of the sub-sequence. We set a threshold $\eta_0$ for $\zeta(\cdot)$, and a sequence is considered an eligible candidate for mixup only when $\eta \ge \eta_0$.
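The token-level mixup, the nearest-embedding lookup, and the valid label density can be sketched as below. This is a toy illustration under assumptions: the embedding table is a small hand-written dictionary rather than BERT embeddings, the function names are hypothetical, and the sketch does not exclude special tokens or the parent tokens from the lookup. In practice $\lambda$ would be drawn with `random.betavariate(alpha, alpha)`; it is fixed here so the result is deterministic.

```python
import math


def nearest_token(vec, table):
    """Return the token in `table` whose embedding is closest (L2) to `vec`."""
    return min(table, key=lambda tok: math.dist(table[tok], vec))


def mix_tokens(emb_i, emb_j, lab_i, lab_j, table, lam):
    """Mix two aligned sequences position by position.

    emb_*: per-token embedding vectors, lab_*: per-token one-hot label vectors.
    Returns the generated tokens and their soft labels.
    """
    tokens, labels = [], []
    for e_i, e_j, y_i, y_j in zip(emb_i, emb_j, lab_i, lab_j):
        mixed = [lam * a + (1 - lam) * b for a, b in zip(e_i, e_j)]
        tokens.append(nearest_token(mixed, table))                  # embedding space
        labels.append([lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)])  # label space
    return tokens, labels


def valid_label_density(tags, outside="O"):
    """eta = n / s: fraction of tags in the window that are not the outside tag."""
    return sum(tag != outside for tag in tags) / len(tags)


# Hypothetical 2-d embedding table and one-token parent sequences.
table = {"red": [1.0, 0.0], "blue": [0.0, 1.0], "purple": [0.5, 0.5]}
tokens, labels = mix_tokens(
    emb_i=[[1.0, 0.0]], emb_j=[[0.0, 1.0]],
    lab_i=[[1.0, 0.0, 0.0]], lab_j=[[0.0, 1.0, 0.0]],
    table=table, lam=0.5)
print(tokens, labels)
print(valid_label_density(["B-PER", "I-PER", "O", "B-LOC", "O"]))
```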
Based on the above token-level mixup procedure and the sequence pairing function, we propose three different strategies for generating interpolated labeled sequences. These strategies are shown in Figure 1 and described below:
Whole-sequence mixup As the name suggests, whole-sequence mixup (Figure 1(a)) performs sequence mixing at the whole-sequence level. Given two sequences $\langle x_i, y_i \rangle, \langle x_j, y_j \rangle \in \mathcal{L}$, they must share the same length without counting padding words. In addition, the pairing function $\zeta(\cdot)$ requires that both sequences satisfy $\eta \ge \eta_0$. Then we perform mixup at all token positions, employing Equation 7 to generate mixed tokens and Equation 8 to generate mixed labels (note that the mixed labels are soft labels).
Sub-sequence mixup One drawback of whole-sequence mixup is that it indiscriminately mixes over all tokens, which may include incompatible sub-sequences and generate implausible sequences. To tackle this, we consider sub-sequence mixup (Figure 1(b)), which mixes sub-sequences of the parent sequences. It scans the original samples with a window of fixed length $s$ to look for paired sub-sequences. Denote the eligible sub-sequences of the two parent sequences as $x_{i_{sub}}$ and $x_{j_{sub}}$. Then the sub-sequences $x_{i_{sub}}$ and $x_{j_{sub}}$ are mixed as in Figure 1(b). The mixed sub-sequence and labels replace the original parts of the parent samples, and the other parts of the parent samples remain unchanged. In this way, sub-sequence mixup is expected to keep the syntactic structure of the original sequence while providing data diversity.

Algorithm 2: The generation procedure of SeqMix
Input: Labeled set L = ⟨X, Y⟩; Beta distribution parameter α; Pairing function ζ(·); Discriminator function d(·); Number of expected generation N.
    // mixup the target sub-sequences
    for t = 1, · · · , T do
        Calculate e^t by Eq. (7);
        Get the corresponding token w^t for e^t;
        Calculate y^t by Eq. (8).
Output: Generated sequences and labels L*
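The window scan and the replacement step of sub-sequence mixup can be sketched as follows. The helper names `eligible_windows` and `replace_window` are hypothetical, and the sketch only covers finding windows whose valid label density meets the threshold and splicing a generated sub-sequence back into a parent; the mixing itself and the discriminator are left out.

```python
def eligible_windows(tags, s, eta0, outside="O"):
    """Start indices of length-s windows whose valid label density is >= eta0."""
    starts = []
    for start in range(len(tags) - s + 1):
        window = tags[start:start + s]
        density = sum(tag != outside for tag in window) / s
        if density >= eta0:
            starts.append(start)
    return starts


def replace_window(sequence, start, new_sub):
    """Return a copy of `sequence` with the window at `start` replaced by `new_sub`."""
    return sequence[:start] + list(new_sub) + sequence[start + len(new_sub):]


# Hypothetical parent sequence: only the window starting at index 1 has density 2/3.
tags = ["O", "B-ORG", "O", "B-ORG", "O", "O"]
starts = eligible_windows(tags, s=3, eta0=2 / 3)
print(starts)

parent = ["a", "b", "c", "d", "e", "f"]
child = replace_window(parent, starts[0], ["X", "Y", "Z"])
print(child)
```

Because only the window is rewritten, the tokens and labels outside it are carried over unchanged from the parent, which is what lets the generated sequence keep most of the original context.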

Figure 1: Illustration of the three variants of SeqMix: (a) whole-sequence mixup, (b) sub-sequence mixup, and (c) label-constrained sub-sequence mixup. Each panel shows mixup in the embedding space and mixup in the label space. We use $s = 5$, $\eta_0 = 3/5$ for whole-sequence mixup and $s = 3$, $\eta_0 = 2/3$ for sub-sequence mixup and label-constrained sub-sequence mixup. The solid red frames indicate paired sequences or sub-sequences, and the red dotted frames indicate the generated sequence or sub-sequence. In the original sequences, the parts not included in the solid red frames are unchanged in the generated sequences. For the mixup in the embedding space, we take the embedding in $E$ that is closest to the raw mixed embedding as the generated embedding. For the mixup in the label space, the mixed label can be used as the pseudo label.
Label-constrained sub-sequence mixup The third variant (Figure 1(c)) adds a constraint on the labels of the paired sub-sequences; this version is called label-constrained sub-sequence mixup.
Comparing the three variants, label-constrained sub-sequence mixup imposes the most restrictions on pairing parent samples, sub-sequence mixup sets the sub-sequence-level pattern, while whole-sequence mixup just requires $\eta \ge \eta_0$ for sequences of the same length.

Scoring and Selecting Plausible Sequences
During sequence mixup, the mixing coefficient $\lambda$ determines the strength of interpolation. When $\lambda$ approaches 0 or 1, the generated sequence will be similar to one of the parent sequences, while $\lambda$ around 0.5 produces relatively diverse generations. However, generating diverse sequences means that low-quality sequences can also be generated, which can provide noisy contextual information and hurt model performance.
To maintain the quality of mixed sequences, we use a discriminator to score the perplexity of the sequences. The final generated sequences consist only of the sequences that pass this quality screening. For screening, we use the language model GPT-2 (Radford et al., 2019) to score a sequence $x$ by computing its perplexity:

$$\mathrm{Perplexity}(x) = 2^{-\frac{1}{T} \sum_{i=1}^{T} \log_2 p(w_i)}, \tag{9}$$

where $T$ is the number of tokens before padding, $w_i$ is the $i$-th token of sequence $x$, and $p(w_i)$ is the probability the language model assigns to $w_i$. Based on the perplexity and a score range $[s_1, s_2]$, the discriminator gives a judgment for sequence $x$:

$$d(x) = \mathbb{1}\{s_1 \le \mathrm{Perplexity}(x) \le s_2\}. \tag{10}$$

The lower the perplexity score, the more natural the sequence. However, the discriminator should also take into account the regularization effect and the generation capacity, so blindly requiring a very low perplexity is undesirable. The overall sequence mixup and selection procedure is illustrated in Algorithm 2.
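The perplexity screening can be sketched without loading a language model by assuming the per-token log-probabilities are already available. In the sketch below, `token_log2_probs` is a hypothetical list of base-2 log-probabilities that a model such as GPT-2 would assign to the tokens, and whether the score range is open or closed at its ends is an assumption.

```python
def perplexity(token_log2_probs):
    """Perplexity of a sequence from its per-token base-2 log-probabilities."""
    avg = sum(token_log2_probs) / len(token_log2_probs)
    return 2.0 ** (-avg)


def discriminator(ppl, s1, s2):
    """Accept a generated sequence only if its perplexity lies in [s1, s2]."""
    return s1 <= ppl <= s2


# Hypothetical log-probabilities for two generated sequences.
fluent = [-8.0, -8.0, -8.0]      # every token has probability 2^-8
odd = [-10.0, -10.0]             # every token has probability 2^-10
for name, log_probs in (("fluent", fluent), ("odd", odd)):
    ppl = perplexity(log_probs)
    print(name, ppl, discriminator(ppl, 0, 500))
```

With the score range (0, 500) used in the experiments, the first sequence would be kept and the second discarded.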

Experiment Setup
Datasets. We conduct experiments on three sequence labeling datasets for the named entity recognition (NER) and event detection tasks.
(1) CoNLL-03 (Tjong Kim Sang and De Meulder, 2003) is a corpus for the NER task. It provides four named entity types: persons, locations, organizations, and miscellaneous.
(2) ACE05 is a corpus for event detection. It provides 8 event types and 33 subtypes. We study the event trigger detection problem, which aims to identify trigger tokens in a sentence.
(3) Webpage (Ratinov and Roth, 2009) is an NER corpus with 20 webpages related to computer science conferences and academic websites. It inherits the entity types from CoNLL-03.
Data Split. To investigate low-resource sequence labeling, we randomly take 700 labeled sentences from the original CoNLL-03 dataset as the training set. For the ACE05 and WebPage datasets, the annotation is sparse, so we conduct experiments on the original datasets without further slicing.
We set 6 data usage percentiles for the training set in each corpus. The sequence model is initialized on a small seed set and then performs five iterations of active learning. For the query policy, we use random sampling and the three active learning policies mentioned in Section 2.2. Model performance is evaluated by the $F_1$ score at each data usage percentile.
Parameters. We use BERT-base-cased for the NER task as the underlying model, and BERT-base-multilingual-cased for the event trigger detection task. We set the max length to 128 to pad the varying-length sequences. The learning rate of the underlying model is 5e-5, and the batch size is 32. We train the models for 10 epochs at each data usage percentile. For the parameters of SeqMix, we set $\alpha = 8$ to sample $\lambda$ from $\mathrm{Beta}(\alpha, \alpha)$. We use the sub-sequence window length $s = \{5, 5, 4\}$ and the valid label density $\eta_0 = \{0.6, 0.2, 0.5\}$ for CoNLL-03, ACE05, and Webpage, respectively. The augment rate is set to 0.2, and the discriminator score range is set to (0, 500). We also perform a detailed parameter study in Section 4.4.

Results
The main results are presented in Figure 2, where we use NTE sampling as the default active learning policy. The results show that our method consistently achieves the best performance at each data usage percentile on all three datasets. The best SeqMix method (sub-sequence mixup with NTE sampling) outperforms the strongest active learning baselines by 2.95% on CoNLL-03, 2.27% on ACE05, and 3.75% on WebPage in terms of $F_1$ score on average. Moreover, the augmentation advantage is especially prominent at the seed set initialization stage, where only a very limited amount of labeled data is available. Through the augmentation, we improve the model performance from 68.65% to 80.71% on CoNLL-03, where the seed set contains 200 labeled sequences and the augmentation provides 40 extra data points. The improvement is also significant on ACE05 (40.65% to 49.51%) and WebPage (55.18% to 71.67%), which indicates that SeqMix can largely alleviate the label scarcity issue in low-resource scenarios.
We also perform statistical significance tests for the above results. We use the Wilcoxon signed-rank test (Wilcoxon, 1992), a non-parametric alternative to the paired t-test. This test fits our task because the F-score is generally assumed not to be normally distributed (Dror et al., 2018), and non-parametric significance tests should be used in such a case. The results show that sub-sequence mixup and label-constrained sub-sequence mixup yield statistically significant improvements (significance level α = 0.05, number of data points N = 6) in all comparisons with the active learning baselines on the datasets used. Whole-sequence mixup passes the significance test with α = 0.1 and N = 6 on CoNLL-03 and WebPage, but fails on ACE05.
Among the three SeqMix variants, sub-sequence mixup gives the best overall performance (label-constrained sub-sequence mixup achieves very close performance to sub-sequence mixup on the ACE05 dataset), whereas whole-sequence mixup does not yield a consistent improvement over the original active learning method. This is because whole-sequence mixup may generate semantically poor new sequences. In contrast, the sub-sequence-level process preserves the original context information between the sub-sequence and the other parts of the whole sequence. Meanwhile, the updated sub-sequences inherit the original local informativeness and introduce linguistic diversity that enhances the model's generalization ability.
To verify that SeqMix improves the active learning framework under various query policies, we employ different query policies with SeqMix augmentation under the same experimental setting as Figure 2(a). Figure 3 shows a consistent performance improvement when employing SeqMix with different query policies: SeqMix achieves performance gains of 2.46%, 2.85%, and 2.94% for random sampling, LC sampling, and NTE sampling, respectively.
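To make the Wilcoxon signed-rank test with $N = 6$ paired data points concrete, the sketch below computes an exact two-sided p-value by enumerating all $2^N$ sign patterns. The function name `wilcoxon_exact_p` and the paired differences are hypothetical and are not results from the paper; the sketch assumes there are no zero differences and no ties in absolute value, and whether the reported tests were one- or two-sided is not stated in the text.

```python
from itertools import product


def wilcoxon_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value (no zeros, no tied |diffs|)."""
    ranked = sorted(diffs, key=abs)            # rank differences by absolute value
    ranks = range(1, len(ranked) + 1)
    w_plus = sum(r for r, d in zip(ranks, ranked) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, ranked) if d < 0)
    observed = min(w_plus, w_minus)
    # Under the null hypothesis every assignment of signs to ranks is equally likely.
    count = 0
    for signs in product((0, 1), repeat=len(ranked)):
        stat = sum(r for r, keep in zip(ranks, signs) if keep)
        if stat <= observed:
            count += 1
    return min(1.0, 2.0 * count / 2 ** len(ranked))


# Hypothetical F1 differences (augmented minus baseline) at six data usage points.
all_better = [1.2, 2.5, 0.8, 3.1, 1.9, 2.2]
one_worse = [-0.5, 1.2, 2.5, 0.8, 3.1, 1.9]
print(wilcoxon_exact_p(all_better))
print(wilcoxon_exact_p(one_worse))
```

With only six pairs there are 64 sign patterns, so the smallest two-sided p-value the test can produce is 2/64, which is reached only when all six differences point in the same direction.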

Effect of Discriminator
To verify the effectiveness of the discriminator, we conduct an ablation study on a subset of CoNLL-03 with 700 labeled sequences. We use sub-sequence mixup with NTE sampling as the backbone and change the perplexity score range of the discriminator. We start from a seed set of 200 labeled samples, then actively query 100 samples in each learning round and repeat for 5 rounds in total. The results in Table 1 demonstrate that the discriminator provides a stable improvement for the last four data usage percentiles, and the discriminator with score range (0, 500) boosts the model by 1.07% $F_1$, averaged over all data usage percentiles. The comparison between the 3 different score thresholds demonstrates that the lower the perplexity, the better the generation quality. As a result, the final $F_1$ score becomes higher with better generated tokens. We could narrow the score range further to obtain additional performance improvement, but overly strict constraints slow down the generation in practice and reduce the number of generated samples.

Parameter Study
In this subsection, we study the effect of several key parameters.
Augment rate r. We vary the augment rate $r = |\mathcal{L}^*| / |\psi(\mathcal{U}, K, \gamma(\cdot))|$ in {0.2, 0.4, 0.6, 0.8, 1.0} while keeping the initial data usage fixed, to investigate the effect of the augment rate on data augmentation. Table 2 shows that $r \le 0.6$ provides better $F_1$ improvement. The model with $r = 0.2$ surpasses the model with $r = 1.0$ by 0.73%, evaluated by the average $F_1$ score over all data usage percentiles. This result indicates that the model benefits more from moderate augmentation. However, the performance variance due to the augment rate is small compared to the improvement SeqMix provides to the active learning framework.
Valid tag density η 0 . We search the valid tag density $\eta_0$ defined in Section 3.2 by varying the sub-sequence window length $s$ and the required number of valid tags $n$ within the window. The results in Figure 4(a) show that the combination ($s = 5$, $n = 3$) outperforms the other settings. When $s$ is too small, the window usually truncates a continuous clause, cutting off local syntactic or semantic information. When $s$ is too large, sub-sequence mixup tends to behave like whole-sequence mixup, where an overly long generated sub-sequence can hardly maintain the original syntactic and semantic plausibility. A high $\eta_0$ with a long window length may result in an insufficient number of eligible parent sequences. In fact, even with a moderate augment rate $r = 0.2$, the combination ($s = 6$, $n = 5$) is unable to provide enough generated sequences.
Mixing parameter α. We show the performance with different $\alpha$ in Figure 4(b). The parameter $\alpha$ determines the distribution $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, and the coefficient $\lambda$ is directly involved in the mixing of tokens and labels. Among the values {0.5, 1, 2, 4, 8, 16}, we observe that $\alpha = 8$ gives the best performance. It outperforms the second-best parameter setting by 0.49% on average. From the perspective of the Beta distribution, a larger $\alpha$ makes the sampled $\lambda$ more concentrated around 0.5, which assigns more balanced weights to the parent samples to be mixed. In this way, the interpolation produces encoded tokens that are farther from both parent samples and thus introduces more diverse generations.
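The concentration of $\lambda$ around 0.5 can be quantified with the closed-form variance of a symmetric Beta distribution, $\mathrm{Var}[\lambda] = 1 / (4(2\alpha + 1))$, which follows from the general Beta variance $ab / ((a+b)^2 (a+b+1))$ with $a = b = \alpha$. The short sketch below evaluates the standard deviation for the values of $\alpha$ studied above; the function name is an illustrative choice.

```python
import math


def beta_symmetric_std(alpha):
    """Standard deviation of Beta(alpha, alpha); its mean is always 0.5."""
    return math.sqrt(1.0 / (4.0 * (2.0 * alpha + 1.0)))


for alpha in (0.5, 1, 2, 4, 8, 16):
    print(f"alpha={alpha}: std of lambda = {beta_symmetric_std(alpha):.3f}")
```

For $\alpha = 8$ the standard deviation is $\sqrt{1/68} \approx 0.12$, so most sampled coefficients fall roughly between 0.38 and 0.62, whereas for $\alpha = 0.5$ it is $\sqrt{1/8} \approx 0.35$ and the samples pile up near 0 and 1.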

Case Study
Figure 5 presents a generation example via sub-sequence mixup. For convenience of presentation, we set the sub-sequence length $s = 3$ and the valid label density threshold $\eta_0 = 2/3$. The two input sequences are paired because of their eligible sub-sequences "COLORADO 10 St" and "Slovenia , Kwasniewski". The sub-sequences are mixed with $\lambda = 0.39$ in this case, which is sampled from $\mathrm{Beta}(\alpha, \alpha)$. The generated sub-sequence "Ohio ( novelist" then replaces the original parts in the two input sequences. Among the generated tokens, "Ohio" inherits the label B-ORG from "COLORADO" and the label B-LOC from "Slovenia", and the two labels are assigned the weights $\lambda = 0.39$ and $(1 - \lambda) = 0.61$, respectively. The open parenthesis is produced by mixing a digit and a punctuation mark, and keeps the label O shared by its parents. Similarly, the token "novelist", generated from "St" and "Kwasniewski", gets a mixed label from B-ORG and B-PER.
The discriminator then evaluates the two generated sequences. The generated sequence i is intuitively not reasonable enough, and its perplexity score of 877 exceeds the threshold, so it is not added to the training set. The generated sequence j retains the original syntactic and semantic structure much better. Although the open parenthesis seems strange, it plays the same role as the comma in the original sequence, separating two clauses. This generation behaves much like a normal sequence and obtains a perplexity score of 332, which permits its incorporation into the training set.
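The numbers in this case study can be replayed in a few lines. The sketch below mixes the one-hot labels of the three token pairs with $\lambda = 0.39$ and applies the score range (0, 500) to the two reported perplexities. The tag list and its order, and the treatment of the range as an open interval, are assumptions for illustration.

```python
lam = 0.39
label_set = ["O", "B-ORG", "B-LOC", "B-PER"]   # subset of the tag set; order is arbitrary


def one_hot(tag):
    return [1.0 if tag == t else 0.0 for t in label_set]


def mix(tag_i, tag_j):
    """Soft label lam * y_i + (1 - lam) * y_j, returned as a tag -> weight mapping."""
    mixed = [lam * a + (1 - lam) * b for a, b in zip(one_hot(tag_i), one_hot(tag_j))]
    return dict(zip(label_set, mixed))


ohio = mix("B-ORG", "B-LOC")        # "COLORADO" mixed with "Slovenia"
paren = mix("O", "O")               # "10" mixed with ","
novelist = mix("B-ORG", "B-PER")    # "St" mixed with "Kwasniewski"
print(ohio, paren, novelist, sep="\n")

# Perplexity screening with the score range (0, 500).
scores = {"generated sequence i": 877, "generated sequence j": 332}
kept = {name: 0 < ppl < 500 for name, ppl in scores.items()}
print(kept)
```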

Related Work
Active Sequence Labeling Sequence labeling has been studied extensively for different NLP problems. Different neural architectures have been proposed in recent years (Huang et al., 2015; Lample et al., 2016; Peters et al., 2018; Akbik et al., 2018), achieving state-of-the-art performance in a number of sequence labeling tasks. However, these neural models usually require exhaustive human effort to generate labels for each token, and may not perform well in low-resource settings. To improve the performance of low-resource sequence labeling, several approaches have been applied, including semi-supervised methods (Chen et al., 2020b), external weak supervision (Lison et al., 2020; Liang et al., 2020; Ren et al., 2020; Zhang et al., 2019; Yu et al., 2020), and active learning (Shen et al., 2017; Hazra et al., 2019; Liu et al., 2018; Fang et al., 2017; Gao et al., 2019). In this study, we mainly focus on active learning approaches, which select samples based on the query policy design. So far, various uncertainty-based (Scheffer et al., 2001; Culotta and McCallum, 2005; Kim et al., 2006) and committee-based approaches (Dagan and Engelson, 1995) have been proposed for improving sample efficiency. More recently, Shen et al. (2017), Hazra et al. (2019), Liu et al. (2018), and Fang et al. (2017) further extend the aforementioned active learning approaches to improve the sampling diversity as well as the model's generalization ability in low-resource scenarios. These works mainly focus on the sample efficiency provided by the active learning approach but do not study data augmentation for active sequence labeling.
Interpolation-based Regularizations Mixup implements interpolation in the input space to regularize models (Zhang et al., 2018). The Mixup variants (Verma et al., 2019; Summers and Dinneen, 2019; Guo et al., 2019b) instead perform interpolation in the hidden space to capture higher-level information. Guo et al. (2019a) and Chen et al. (2020a) apply hidden-space Mixup to text classification. These works, however, have not explored how to perform mixup for sequences with token-level labels, nor do they consider the quality of the mixed-up samples.
Text Augmentation Our work is also related to text data augmentation. Zhang et al. (2015) and Wei and Zou (2019) use heuristic approaches including synonym replacement, random insertion, swap, and deletion for text augmentation; Kafle et al. (2017) and Silfverberg et al. (2017) employ heuristic rules based on the specific task; and Hu et al. (2017) propose to augment text data in an encoder-decoder manner. Very recently, Anaby-Tavor et al. (2020) and Kobayashi (2018) harness the power of pre-trained language models and augment the text data based on contextual patterns. Although these methods can augment the training set and improve the performance of text classification models, they fail to generate sequences and labels simultaneously, and thus cannot be adapted to our problem, where token-level labels are required during training. In contrast, we propose a new framework, SeqMix, for data augmentation to facilitate sequence labeling tasks. Our method can generate token-level labels and preserve the semantic information in the augmented sentences. Moreover, it can be naturally combined with existing active learning approaches to further improve performance.

Conclusion
We proposed a simple data augmentation method, SeqMix, to enhance active sequence labeling. By performing sequence mixup in the latent space, SeqMix improves data diversity during active learning while still generating plausible augmented sequences. The method is generic to different active learning policies and various sequence labeling tasks. Our experiments demonstrate that SeqMix consistently improves active learning baselines on NER and event detection tasks, and that its benefits are especially prominent in low-data regimes. For future research, it would be interesting to enhance SeqMix with language models during the mixup process and to harness external knowledge for further improving diversity and plausibility.

A.1 Dataset Collection
Here we list the links to the datasets used in our experiments.
• ACE05: We are unable to provide a downloadable version because the corpus is not public. It can be requested through the LDC website: https://www.ldc.upenn.edu/collaborations/past-projects/ace.
• Webpage: Please refer to the link in the paper (Ratinov and Roth, 2009).

A.2 Dataset Split
All the mentioned datasets have been split into train/validation/test sets in the released versions. We keep the validation set and the test set unchanged in our experiments. For the active learning paradigm, we split the training set as shown in Table 3.
The active learners are initialized on the seed set, then they implement 5 active learning rounds.

B Baseline Settings
For the baselines, we take random sampling and 3 active learning approaches (LC sampling, NTE sampling, and QBC sampling) as described in Section 2.2.

C Implementation Details of SeqMix
We use bert-base-cased as the underlying model for the NER task and bert-base-multilingual-cased as the underlying model for the event detection task. We use the model from Huggingface Transformer codebase 3 , and the repository 4 to fine-tune our model for the sequence labeling task.

C.1 Number of Parameters
In our model, we use bert-base-cased and bert-base-multilingual-cased. Both have 12 layers, a hidden size of 768, and 12 attention heads, with 110M parameters.

C.3 SeqMix Details
In Section 3.2, we construct a table of tokens W and their corresponding contextual embedding E.
For our underlying BERT model, we use the vocabulary provided by the tokenizer to build up W, and the embedding initialized on the training set as E.
We also need to construct a special token collection to exclude some tokens from generation during sequence mixing. For example, BERT places the tokens [CLS] and [SEP] at the start and end of a sentence, and pads the inputs with [PAD]. We exclude these disturbing tokens as well as the parent tokens.

C.4 Parameter Settings
The key parameter settings in our framework are as follows: (1) The number of active learning rounds is 5 for all three datasets, but the size of the seed set and the number of samples in each round differ across datasets. We list the specific numbers in Table 3. (2) The sub-sequence window length $s$ and the valid label density threshold $\eta_0$ vary across datasets. For CoNLL-03, $s = 5$, $\eta_0 = 0.6$; for ACE05, $s = 5$, $\eta_0 = 0.2$; for WebPage, $s = 4$, $\eta_0 = 0.5$. (3) We set $\alpha = 8$ for the Beta distribution. (4) The discriminator score range is set to (0, 500) for all the datasets. (5) For the BERT configuration, we choose 5e-5 for the learning rate, 128 for the padding length, 32 for the batch size, 0.1 for the dropout rate, and 1e-8 for $\epsilon$ in Adam. At each data usage point, we train the model for 10 epochs. (6) We set $C = 3$ for the QBC query policy.

D Details of Experiments
We use the following criteria to evaluate the sequence labeling task. A named entity is correct only if it exactly matches the corresponding entity in the data file. An event trigger is correct only if its span and type match the gold labels. Based on these criteria, we report the $F_1$ score in our experiments. Tables 4 to 6 show the model performance on the validation set. The data usage in these tables refers to the number of labeled data, excluding the augmented data. Sub-sequence mixup is trained with $(1 + r)$ times the data, where $r$ denotes the augment rate. Note that WebPage is a very small dataset, so there is a big difference between the performance on the validation set and the test set. Each reported result is averaged over 5 runs.
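The exact-match criterion can be sketched as a span-level $F_1$ computed from BIO tags. This is a minimal single-sentence illustration with hypothetical function names; a corpus-level score would accumulate the span counts over all sentences before computing precision and recall, and the actual evaluation script used in the experiments is not specified here.

```python
def extract_spans(tags):
    """Return the set of (start, end_exclusive, type) spans encoded by BIO tags."""
    spans, start, current = set(), None, None
    for pos, tag in enumerate(list(tags) + ["O"]):    # sentinel closes a trailing span
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.add((start, pos, current))           # close the open span
            start, current = None, None
        if tag.startswith("B-"):
            start, current = pos, tag[2:]              # open a new span
    return spans


def span_f1(gold_tags, pred_tags):
    """F1 where a predicted span counts only if boundaries and type both match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the second entity has the right boundaries but the wrong type.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-ORG"]
print(extract_spans(gold))
print(span_f1(gold, pred))
```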

D.2 Computing Infrastructure
We implement our system on Ubuntu 18.04.3 LTS system. We run our experiments on an Intel(R) Xeon(R) CPU @ 2.30GHz and NVIDIA Tesla P100-PCIe with 16 GB HBM2 memory. The NVIDIA-SMI version is 418.67 and the CUDA version is 10.1.

D.3 Average Runtime
For 5-round active learning with SeqMix augmentation, our program runs for about 500 seconds on the WebPage dataset, 1700 seconds on the CoNLL slicing dataset, and 3.5 hours on ACE 2005. If the QBC query policy is used, all runtimes increase by a factor of about 3.

D.4 Hyper parameter Search
For the discriminator score range, we first examine the perplexity score distribution of the CoNLL training set and determine an approximate score range (0, 2000). We then linearly split the score ranges below 2000 to conduct the parameter study and report the representative ranges in Section 4.3. Considering the generation speed and the augment rate setting, we finally choose 500 as the upper limit rather than a narrower score range.

Table 6: Validation $F_1$ of WebPage

For the mixing coefficient $\lambda$, we follow Zhang et al. (2018) to sample it from $\mathrm{Beta}(\alpha, \alpha)$ and explore $\alpha$ in the range [0.5, 16]. We present this parameter study in Section 4.4. The result shows that different values of $\alpha$ did not influence the augmentation performance much.
For the augment rate and the valid tag density, the parameter study is likewise presented in Section 4.4.