Local Additivity Based Data Augmentation for Semi-supervised NER

Named Entity Recognition (NER) is one of the first stages in deep language understanding yet current NER models heavily rely on human-annotated data. In this work, to alleviate the dependence on labeled data, we propose a Local Additivity based Data Augmentation (LADA) method for semi-supervised NER, in which we create virtual samples by interpolating sequences close to each other. Our approach has two variations: Intra-LADA and Inter-LADA, where Intra-LADA performs interpolations among tokens within one sentence, and Inter-LADA samples different sentences to interpolate. Through linear additions between sampled training data, LADA creates an infinite amount of labeled data and improves both entity and context learning. We further extend LADA to the semi-supervised setting by designing a novel consistency loss for unlabeled data. Experiments conducted on two NER benchmarks demonstrate the effectiveness of our methods over several strong baselines. We have publicly released our code at https://github.com/GT-SALT/LADA.


Introduction
Named Entity Recognition (NER), which aims to detect the semantic category of entities (e.g., persons, locations, organizations) in unstructured text (Nadeau and Sekine, 2007), is an essential prerequisite for many NLP applications. As one of the most fundamental and classic sequence labeling tasks in NLP, it has seen extensive research, from traditional statistical models like Hidden Markov Models (Zhou and Su, 2002) and Conditional Random Fields (Lafferty et al., 2001a), to neural network based models such as LSTM-CRF (Lample et al., 2016a) and BLSTM-CNN-CRF (Ma and Hovy, 2016), and to recent pre-training and fine-tuning methods like ELMO (Peters et al., 2018a), Flair (Akbik et al., 2018) and BERT (Devlin et al., 2019). However, most of these models still rely heavily on abundant annotated data to yield state-of-the-art results (Lin et al., 2020), making them hard to apply to new domains (e.g., social media, medical text, or low-resource languages) that lack labeled data.
Various data augmentation approaches have been designed to alleviate the dependency on labeled data for many NLP tasks. They fall into two broad classes: (1) adversarial attacks at the token level, such as word substitutions (Kobayashi, 2018; Wei and Zou, 2019) or adding noise (Lakshmi Narayan et al., 2019), and (2) paraphrasing at the sentence level, such as back translation (Xie et al., 2019) or submodular optimized models (Kumar et al., 2019). The former has already been used for NER but struggles to create diverse augmented samples with very few word replacements. The latter, despite being widely utilized in many NLP tasks like text classification, often fails to maintain token-level labels in the paraphrased sentences, making it difficult to apply to NER.
We focus on another type of data augmentation called mixup (Zhang et al., 2018), which was originally proposed in computer vision and performs linear interpolations between randomly sampled image pairs to create virtual training data. Miao et al. (2020) and Chen et al. (2020b) adapted the idea to the textual domain and applied it to the preliminary task of text classification. However, unlike classification, where each sentence has only one label, sequence labeling tasks such as NER usually involve multiple interrelated labels in a single sentence. As we found in empirical experiments, it is challenging to directly apply the mixup technique to sequence labeling, and improper interpolations may mislead the model. For instance, random sampling in mixup may inject too much noise by interpolating data points far away from each other, making it fail on sequence labeling.
To fill this gap, we propose a novel method called Local Additivity based Data Augmentation (LADA), in which we constrain the samples to mix up to be close to each other. Our method has two variations: Intra-LADA and Inter-LADA. Intra-LADA interpolates each token's hidden representation with those of other tokens from the same sentence, which could increase robustness to word orderings. Inter-LADA interpolates each token's hidden representation with each token from other sentences sampled through a weighted combination of k-nearest-neighbor sampling and random sampling, where the weight controls the delicate trade-off between noise and regularization. To further enhance learning with limited labeled data, we extend LADA to the semi-supervised setting, i.e., Semi-LADA, by designing a novel consistency loss between unlabeled data and its local augmentations. We conduct experiments on two NER datasets to demonstrate the effectiveness of our LADA based models over state-of-the-art baselines.


Background
Zhang et al. (2018) proposed a data augmentation technique called mixup, which trains an image classifier on linear interpolations of randomly sampled image data. Given a pair of data points (x, y) and (x′, y′), where x denotes an image in raw pixel space and y is its label in a one-hot representation, mixup creates a new sample by interpolating the images and their corresponding labels:

x̃ = λx + (1 − λ)x′, ỹ = λy + (1 − λ)y′,

where λ is drawn from a Beta distribution. mixup trains the neural network for image classification by minimizing the loss on the virtual examples. In experiments, the pairs of data points (x, y) and (x′, y′) are randomly sampled. Assuming all the images are mapped to a low-dimension manifold through a neural network, linearly interpolating them creates a virtual vicinity distribution around the original data space, thus improving the generalization performance of the classifier trained on the interpolated samples.
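As a concrete illustration, the interpolation above can be sketched in a few lines of NumPy. This is a hypothetical helper (not the paper's code); `alpha` and the toy inputs are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=8.0):
    """Create one virtual training sample by linearly interpolating
    a pair of samples and their one-hot labels (mixup)."""
    lam = rng.beta(alpha, alpha)          # mixing ratio λ ~ Beta(α, α)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Because the labels are interpolated with the same λ, the virtual label ỹ remains a valid probability vector whenever y and y′ are one-hot.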
Prior work such as Snippext (Miao et al., 2020), MixText (Chen et al., 2020b) and AdvAug (Cheng et al., 2020) generalized the idea to the textual domain, interpolating in the output space (Miao et al., 2020), the embedding space (Cheng et al., 2020), or a general hidden space (Chen et al., 2020b). Applied to NLP tasks such as text classification and machine translation, these techniques achieved significant improvements.

Method
Building on the interpolation based data augmentation techniques above, Section 3.1 introduces Local Additivity based Data Augmentation (LADA) for sequence labeling, where creating augmented samples is much more challenging. Section 3.4 then describes how to utilize unlabeled data with LADA for semi-supervised NER.

LADA
For a given sentence with n tokens x = {x_1, ..., x_n}, denote the corresponding sequence label as y = {y_1, ..., y_n}. In this paper, we use NER as the working example to introduce our model, in which the labels are entity types. We randomly sample a pair of sentences from the corpus, (x, y) and (x′, y′), and then compute interpolations in the hidden space using an L-layer encoder F(·; θ). The hidden representations of x and x′ up to the m-th layer are given by:

h^l = F_l(h^{l−1}; θ), h′^l = F_l(h′^{l−1}; θ), for l = 1, ..., m.

Here h^l = {h^l_1, ..., h^l_n} refers to the hidden representations at the l-th layer, i.e., the concatenation of token representations at all positions. We use h^0 and h′^0 to denote the word embeddings of x and x′, respectively. At the m-th layer, the hidden representation of each token in x is linearly interpolated with that of the corresponding token in x′ by a ratio λ:

h̃^m_i = λ h^m_i + (1 − λ) h′^m_i,

where the mixing parameter λ is sampled from a Beta distribution, i.e., λ ∼ Beta(α, α). Then h̃^m is fed through the upper layers:

h̃^l = F_l(h̃^{l−1}; θ), for l = m + 1, ..., L.

h̃^L can be treated as the hidden representation of a virtual sample x̃, i.e., h̃^L = F(x̃; θ).
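The two-stage forward pass (encode both inputs up to layer m, mix, then continue on the mixture) can be sketched as follows. This is a minimal NumPy illustration in which plain Python callables stand in for the encoder layers F_l; the real model uses BERT or Flair layers:

```python
import numpy as np

def lada_forward(h0, h0_prime, layers, m, lam):
    """LADA forward pass: run two sentences through the first m layers,
    linearly interpolate their hidden states token-wise, then feed the
    mixed representation through the remaining layers."""
    h, hp = h0, h0_prime
    for layer in layers[:m]:              # lower layers, run separately
        h, hp = layer(h), layer(hp)
    h_mix = lam * h + (1.0 - lam) * hp    # h̃^m = λ h^m + (1 − λ) h'^m
    for layer in layers[m:]:              # upper layers, run on the mix
        h_mix = layer(h_mix)
    return h_mix
```

For example, with two toy layers `h ↦ 2h` and `h ↦ h + 1` and m = 1, the mixture happens after doubling and before the additive layer.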
Meanwhile, the corresponding labels are linearly added with the same ratio:

ỹ_i = λ y_i + (1 − λ) y′_i, ỹ = {ỹ_1, ..., ỹ_n}.

The hidden representations h̃^L are then fed into a classifier p(·; φ), and the loss over all positions is minimized to train the model. Let S = {(x, y)} be the corpus of data samples; then, following Chen et al. (2020b), the training objective is

L_sup = E_{(x,y)∈S} E_{x′∼P_mix(x′|x)} [ Σ_{i=1}^{n} CE(ỹ_i, p(h̃^L_i; φ)) ],   (1)

where P_mix(x′|x) defines the probability of sampling (x′, y′) to mix with (x, y), and CE denotes the cross-entropy against the interpolated labels. The overall diagram is shown in Figure 1.

Note that in mixup, P_mix(x′|x) is a uniform distribution over S that is independent of x. Even though x′ can be far away from x in Euclidean space, they are mapped into a low-dimension manifold through a neural network. Interpolating them in the hidden space regularizes the model to behave linearly in the low-dimensional manifold, hence greatly improving tasks such as classification.

However, we found empirically that the above random sampling strategy failed on sequence labeling like NER, leading to worse results than purely supervised learning. Intuitively, sequence labeling is more complicated than sentence classification as it requires learning much more fine-grained information: labeling a token depends not only on the token itself but also on its context. We hypothesize that mixing the sequence x with x′ changes the context for all tokens and injects too much noise, making it challenging to learn the token labels. In other words, the relative distance between x and x′ in the manifold mapped by neural networks is larger in sequence labeling than in sentence classification (illustrated in Figure 2), which is intuitively understandable: every data point in sentence classification is a pooling over all the tokens in one sentence, while in sequence labeling every single token is a data point. Randomly mixing data points far away from each other therefore introduces more noise for sequence labeling.
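Minimizing the loss over interpolated labels amounts to a token-level cross-entropy against soft targets, which can be sketched as below (a hypothetical helper, not the paper's implementation):

```python
import numpy as np

def soft_cross_entropy(logits, soft_labels):
    """Token-level cross-entropy against interpolated (soft) labels ỹ_i.
    logits, soft_labels: arrays of shape (tokens, num_classes)."""
    # log-softmax, computed in a numerically explicit form
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-(soft_labels * logp).sum(axis=-1).mean())
```

With λ-mixed labels, the loss is simply the λ-weighted combination of the cross-entropies against the two original hard labels, which is what makes the interpolated objective well defined.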
To overcome this problem, we introduce a local additivity based data augmentation approach with two variations, in which we constrain x′ to be close to x.

[Figure 2: The data manifold for sequence labeling has a higher dimension than for sentence classification, hence the distance between data samples is larger. LADA constrains x′ to be close to x when creating interpolated data.]

Intra-LADA
As stated above, mixing two sequences not only changes the local token representations but also affects the context required to label tokens. To reduce the noise from unrelated sentences, the most direct way is to construct x′ using the same tokens as x but in a different order, and perform interpolations between them.
Let Q = Permutations((x, y)) be the set of all possible permutations of (x, y); then

P_Intra(x′|x) = 1/|Q| for (x′, y′) ∈ Q, and 0 otherwise.

In this case, each token x_i in x is actually interpolated with another token x_j from the same sentence, while the context is unaltered. By sampling from P_Intra, we essentially turn sequence-level interpolation into token-level interpolation, greatly reducing the complexity of the problem. From another perspective, Intra-LADA generates augmentations with different sentence structures over the same word set, which could increase the model's robustness to word orderings. However, Intra-LADA prevents the context from changing, which limits the diversity of the generated data. To overcome that, we propose Inter-LADA, where we sample a different sentence from the training set to perform interpolations.
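The token-level mixing within one sentence can be sketched as below (a hypothetical helper; `h` holds one sentence's token hidden states and `y` its one-hot labels):

```python
import numpy as np

def intra_lada(h, y, lam, rng):
    """Intra-LADA: interpolate each token's hidden state and label with
    those of another token from the same sentence, chosen by a random
    permutation of token positions."""
    perm = rng.permutation(len(h))        # pair token i with token perm[i]
    h_mix = lam * h + (1.0 - lam) * h[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return h_mix, y_mix
```

Note that the sequence length and the surrounding context are unchanged; only each token's representation and label are softened toward another token of the same sentence.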

Inter-LADA
Instead of interpolating within one sentence, Inter-LADA samples a different sentence x′ from the training set to interpolate with x. To achieve a trade-off between noise and regularization, we sample x′ through a weighted combination of two strategies, k-nearest-neighbor (kNN) sampling and random sampling:

P_Inter(x′|x) = µ P_kNN(x′|x) + (1 − µ) P_random(x′|x),

where µ is the weight combining the two distributions. To get the kNNs, we use Sentence-BERT (Reimers and Gurevych, 2019) to map each sentence x into a hidden space, then collect each sentence's kNNs using the l2 distance. For each sentence x, we sample x′ from its kNNs with probability µ and from the whole training corpus with probability 1 − µ. When x′ is sampled from the whole training corpus, it may be unrelated to x, introducing large noise but also strong regularization on the model. When x′ is sampled from the kNNs, it shares similar, albeit different, context with x, thus achieving a good signal-to-noise ratio. By treating µ as a hyper-parameter, we can control the delicate trade-off between noise and diversity in regularizing the model.
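The partner-sampling scheme can be sketched as below. This is a hypothetical helper; `knn` maps a sentence index to its precomputed k nearest neighbours (e.g., obtained from Sentence-BERT embeddings):

```python
import numpy as np

def sample_partner(idx, knn, n_corpus, mu, rng):
    """Inter-LADA sampling: with probability mu draw a partner from the
    k nearest neighbours of sentence idx, otherwise draw uniformly from
    the whole training corpus."""
    if rng.random() < mu:
        return int(rng.choice(knn[idx]))
    return int(rng.integers(n_corpus))
```

Setting `mu = 1` yields pure kNN sampling (low noise), `mu = 0` pure random sampling (high diversity), matching the trade-off described above.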
To examine why sampling sentences from kNNs decreases noise and provides meaningful training signals, we analyze an example and its kNNs in Table 1. (1) As shown, kNNs may contain the same entity words as the original sentence but in different contexts, and the entity types in the neighbor sentences change correspondingly. For example, the entity Israel in the third neighbor becomes an organization when surrounded by Radio, while it is a location in the original sentence.
(2) Contexts from neighbor sentences can help detect entities of the same type in a given sentence. For example, Lebanon in the second neighbor shares the same type as Israel in the original sentence, so Lebanon can resort to the context of the original sentence to detect its entity type. (3) Neighbor sentences may contain the same words in different forms. For example, Israeli in the first neighbor is a different form of Israel; it is labeled miscellaneous while Israel is a location in the example sentence. Interpolation with such an example can improve the model's ability to recognize words of different forms and their corresponding types.
In summary, Inter-LADA can improve both entity learning and context learning by interpolating more diverse data. Note that although we use NER as a working example, LADA can be applied to any sequence labeling model.

Semi-supervised LADA
To further improve performance when learning with less labeled data, we propose a novel LADA-based approach specifically for unlabeled data. Instead of looking for nearest neighbors, we use back-translation to generate paraphrases of an unlabeled sentence x_u when constructing x′_u. The paraphrase x′_u, generated by translating x_u to an intermediate language and then translating it back, describes the same content as x_u and should be close to x_u semantically.

[Table 1: An example sentence and its nearest neighbours.]
Sentence: Israel plays down fears of war with Syria .
Neighbours:
(1) Fears of an Israeli operation causes the redistribution of Syrian troops locations in Lebanon .
(2) Parliament Speaker Berri : Israel is preparing for war against Syria and Lebanon .
(3) Itamar Rabinovich , who as Israel 's ambassador to Washington conducted unfruitful negotiations with Syria , told Israel Radio looked like Damascus wanted to talk rather than fight .

However, there is no guarantee that the same entity will appear in the same position in x_u and x′_u. In fact, the numbers of tokens in x_u and x′_u may not even be the same. For instance, for the sentence "Rare Hendrix song draft sells for almost $17,000" and its paraphrase "A rare Hendrix song design is selling for just under $17,000", although some words differ, the entity Hendrix is unchanged and no extra entities are added; both contain one and only one entity (Hendrix) of the same type (Person). Indeed, we empirically found that most paraphrases contain the same number of entities (for any specific type) as the original sentence. Inspired by this observation, we propose a new consistency loss to leverage unlabeled data: x_u and x′_u should have the same number of entities for any given entity type.

Specifically, for an unlabeled sentence x_u and its paraphrase x′_u, we first guess their token labels with the current model:

y_u,i = p(F(x_u; θ)_i; φ).

To avoid predictions being too uniform at an early stage, we sharpen every token prediction y_u,i ∈ y_u with a temperature T:

ŷ_u,i = y_u,i^{1/T} / ||y_u,i^{1/T}||_1,

where ||·||_1 denotes the l1-norm. We then sum the predictions ŷ_u,i over all tokens in the sentence to obtain its total number of entities of each type:

ŷ_u,num = Σ_{i=1}^{n} ŷ_u,i.

Note that ŷ_u,num is a guessed label vector with C dimensions, where C is the total number of entity types; its i-th element denotes the total number of type-i entities in the sentence. During training, we use the same procedure to get the number of entities for the original and each paraphrased sentence (without sharpening). Assuming there are K paraphrases, denote the entity number vector for the k-th paraphrase as ŷ^k_u,num.
The consistency objective for an unlabeled sentence x_u and its paraphrases is:

L_u = E_{x_u} [ (1/K) Σ_{k=1}^{K} ||ŷ_u,num − ŷ^k_u,num||^2 ].   (5)

Here we treat ŷ_u,num as fixed and back-propagate only through ŷ^k_u,num to train the model. Taking into account the loss objectives for both labeled and unlabeled data (Equation 1 and Equation 5), our Semi-LADA training objective is:

L = L_sup + γ L_u,

where γ controls the trade-off between the supervised and unsupervised loss terms.
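The sharpening and entity-count consistency steps can be sketched as below. This is a simplified NumPy illustration: `p_orig` and each element of `p_paras` are per-token probability matrices of shape (tokens, C), and the mean-squared penalty is our assumption about the exact form of the consistency term:

```python
import numpy as np

def sharpen(p, T):
    """Sharpen predicted token distributions with temperature T,
    renormalising by the l1-norm."""
    q = p ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def entity_count_loss(p_orig, p_paras, T=0.5):
    """Consistency loss: the per-type entity counts of each paraphrase
    should match the (sharpened, fixed) counts guessed for the original
    sentence; paraphrases may have a different number of tokens."""
    target = sharpen(p_orig, T).sum(axis=0)        # C-dim count vector
    return float(np.mean([((p.sum(axis=0) - target) ** 2).mean()
                          for p in p_paras]))
```

Because only the summed counts are compared, the loss is well defined even when a paraphrase has a different length than the original sentence.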

Datasets and Pre-processing
We performed experiments on two datasets in different languages: CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) in English and GermEval 2014 (Benikova et al., 2014) in German. The data statistics are shown in Table 2. We used the BIO labeling scheme and reported the F1 score. To make LADA possible in recent transformer-based models like BERT, we assigned labels to sub-tokens. Since BERT tokenizes a word into one or multiple sub-tokens, we assigned labels not only to the first sub-token but also to the remaining sub-tokens, following the rules: (1) O word: Oxx→OOO, (2) I word: Ixx→III, (3) B word: Bxx→BII. This kind of assignment does not harm performance (an ablation study is provided in Section 4.4). During evaluation, we ignored special tokens and non-first sub-tokens for fair comparison.
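The sub-token labeling rules can be sketched as below (a hypothetical helper; `n_subtokens[i]` is the number of word-pieces the tokenizer produced for word i):

```python
def expand_labels(word_labels, n_subtokens):
    """Expand word-level BIO labels to sub-tokens:
    Oxx -> OOO, Ixx -> III, Bxx -> BII."""
    out = []
    for lab, n in zip(word_labels, n_subtokens):
        if lab.startswith("B"):
            # first sub-token keeps B-, the remaining ones become I-
            out += [lab] + [lab.replace("B", "I", 1)] * (n - 1)
        else:
            out += [lab] * n
    return out
```

For example, a word labeled B-PER that is split into three word-pieces receives B-PER, I-PER, I-PER.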
In the fully supervised setting, we followed the standard data splits shown in Table 2. In the semi-supervised setting, we sampled 10,000 sentences from the training set as the unlabeled training data. We adopted FairSeq 1 to implement back translation. For the CoNLL dataset, we used German as the intermediate language; for GermEval, we used English.

Baselines & Model Settings
Our LADA can be applied to any model in a standard sequence labeling framework. In this work, we applied LADA to two state-of-the-art pre-trained models to show its effectiveness: • Flair (Akbik et al., 2019): We used the pre-trained Flair embeddings 2 and a multi-layer BiLSTM-CRF (Ma and Hovy, 2016) as the encoder to detect entities.
• BERT (Devlin et al., 2019): We loaded the BERT-base-multilingual-cased 3 as the encoder and a linear layer to predict token labels.
To demonstrate whether our Semi-LADA works with unlabeled data, we compared it with two recent state-of-the-art semi-supervised NER models: • VSL-GG-Hier (Chen et al., 2018) introduced hierarchical latent variable models into semi-supervised NER learning.
• MT + Noise (Lakshmi Narayan et al., 2019) explored different noise strategies including word-dropout, synonym-replace, Gaussian noise and network-dropout in a mean-teacher framework.
We also compared our models with two other recent state-of-the-art NER models trained on the whole training set: • CVT (Clark et al., 2018) performed multi-task learning and made use of the 1 Billion Word Language Model Benchmark as the source of unlabeled data.
• BERT-MRC formulated NER as a machine reading comprehension task instead of a sequence labeling problem.
Since Intra-LADA breaks sentence structure, it cannot be applied to Flair, which is based on an LSTM-CRF. Thus we only combined it with BERT and only used the labeled data; the mix layer set was {12}. For Inter-LADA, we applied it to Flair and BERT trained with only the labeled data; the mix layer set was {8, 9, 10}, k in kNN was 3, and 0.5 was a good starting point for tuning µ. Semi-LADA utilized unlabeled data as well; the model was built on BERT, and the weight γ balancing the supervised and unsupervised losses was 1.

Main Results
We evaluated the baselines and our methods using F1-scores on the test set.

Utilizing Limited Labeled Data
We varied the amount of labeled data (using 5%, 10%, and 30% of labeled sentences in each dataset, i.e., 700, 1,400, 4,200 in CoNLL and 1,200, 2,400, 7,200 in GermEval); the results are shown in Table 3. Compared to plain Flair and BERT, applying Intra-LADA and Inter-LADA consistently and significantly boosted performance, indicating the effectiveness of creating augmented training data through local linear interpolations. When unlabeled data was introduced, VSL-GG-Hier and MT + Noise performed slightly better than Flair and BERT with 5% labeled data on CoNLL, but the pre-trained models (Flair, BERT) still achieved higher F1 scores with more labeled data. Both variants of BERT + Semi-LADA significantly boosted the F1 scores on CoNLL and GermEval compared to the baselines, as Semi-LADA not only utilized LADA on labeled data to avoid overfitting but also combined back-translation-based data augmentation on unlabeled data for consistency training, making full use of both labeled and unlabeled data.
Utilizing All the Labeled Data Table 4 summarizes the experimental results on the full training sets (14,987 sentences for CoNLL 2003 and 24,000 for GermEval 2014). Compared to pre-trained Flair and BERT 4 , there were still significant performance gains from utilizing our LADA, indicating that our proposed data augmentation methods work well even with a large amount of labeled training data (full datasets). We also show the results of two state-of-the-art NER models with different settings; they had better performance mainly due to multi-task learning with more unlabeled data (CVT) or formulating NER as a reading comprehension problem (BERT + MRC). Note that our LADA is orthogonal to these two models.
Loss on the Development Set To illustrate that LADA also helps with overfitting, we plotted the development-set loss of BERT, BERT + Inter-LADA, and BERT + Semi-Inter-LADA trained with 5% labeled data on CoNLL and GermEval in Figure 3. After applying LADA, the loss curve remained stable as training epochs increased, while the loss curve of BERT started increasing after about 10 epochs, indicating that the model might overfit the training data. This property makes LADA especially suitable for semi-supervised learning.
Combining Intra&Inter-LADA We further combined Intra-LADA and Inter-LADA with a ratio π, i.e., a data point is augmented through Intra-LADA with probability π and through Inter-LADA with probability 1 − π. In practice, we set π to 0.3 and kept the settings for each kind of LADA the same. The results are shown in Table 3. By combining the two variations, BERT + Intra&Inter-LADA further boosted model performance on both datasets, with increases of 0.25, 0.04 and 0.19 on CoNLL over BERT + Inter-LADA trained with 5%, 10% and 30% labeled data. We obtained consistent improvements in the semi-supervised setting: BERT + Semi-Intra&Inter-LADA improved over BERT + Semi-Inter-LADA trained with 5%, 10% and 30% labeled data on GermEval by +0.05, +0.07 and +0.10. This shows that Intra-LADA and Inter-LADA can be easily combined in future work to create diverse augmented data for sequence labeling tasks.

Ablation Study
Different Sub-token Labeling Strategies To verify that our pre-processing of labeling sub-tokens for training is reasonable, we compared BERT trained with different sub-token labeling strategies in Table 5. The "None" strategy, used in the original BERT-Tagger, ignores sub-tokens during learning. The "Real" strategy, used in our Inter-LADA, assigns O to the sub-tokens of O words (Oxx→OOO) and I to the sub-tokens of I and B words (Ixx→III, Bxx→BII). "Repeat" assigns the original label to each sub-token (Oxx→OOO, Ixx→III, Bxx→BBB). "O" assigns O to each non-first sub-token (Oxx→OOO, Ixx→IOO, Bxx→BOO). The "Real" strategy achieved performance comparable to the original BERT model, while the other two strategies decreased F1 scores, indicating that our strategy mitigates the sub-token labeling issue.
Influence of µ in Inter-LADA We varied µ in BERT + Inter-LADA from 0 to 1 to validate that combining kNN sampling and random sampling in Inter-LADA achieves the best performance; the results are plotted in Figure 4. When µ = 0, Inter-LADA only performs random sampling and barely improves over BERT, largely due to too much noise from interpolations between unrelated sentences. When µ = 1, Inter-LADA only performs kNN sampling and achieves a better F1 score than BERT by providing meaningful signals for training. BERT + Inter-LADA obtained the best F1 score with µ = 0.7 on CoNLL and µ = 0.5 on GermEval, indicating that the trade-off between noise and diversity (kNN sampling with lower noise, random sampling with higher diversity) is necessary for Inter-LADA.
Related Work

Named Entity Recognition
Conditional random fields (CRFs) (Lafferty et al., 2001b; Sutton et al., 2004) were widely used for NER until they were recently outperformed by neural networks. Hammerton (2003) and Collobert et al. (2011) were among the first studies to model sequence labeling with neural networks: Hammerton (2003) encoded the input sequence using a unidirectional LSTM (Hochreiter and Schmidhuber, 1997), while Collobert et al. (2011) instead used a CNN with character-level embeddings to encode sentences. Ma and Hovy (2016) and Lample et al. (2016b) proposed LSTM-CRFs, combining neural networks with CRFs to leverage both the representation learning capability of neural networks and the structured loss of CRFs. Instead of modeling NER as a sequence labeling problem, BERT-MRC converted NER into a reading comprehension task with an input sentence and a query sentence based on the entity types, achieving competitive performance.

Semi-supervised Learning for NER
There has been extensive previous work (Altun et al., 2005; Søgaard, 2011; Mann and McCallum, 2010) utilizing semi-supervised learning for NER. For instance, Zhang et al. (2017) and Chen et al. (2018) applied variational autoencoders (VAEs) to semi-supervised sequence labeling: Zhang et al. (2017) proposed to use discrete labeling sequences as latent variables, while Chen et al. (2018) used continuous latent variables. Recently, contextual representations such as ELMO (Peters et al., 2018a) have also been leveraged. Lakshmi Narayan et al. (2019) applied noise injection and word dropout and obtained a performance boost, Bodapati et al. (2019) varied the capitalization of words to increase robustness to capitalization errors, and Liu et al. (2019) augmented traditional models by pre-training on external knowledge bases. In contrast, our work can be viewed as data augmentation in the continuous hidden space without external resources.

Mixup-based Data Augmentation
Mixup (Zhang et al., 2018) was originally proposed for image classification as a data augmentation and regularization method, and was later extended to interpolations in hidden space (Verma et al., 2018). Building on these ideas, Miao et al. (2020) proposed to interpolate sentences' encoded representations with augmented sentences created by token substitution for text classification. Similarly, Chen et al. (2020a) designed a linguistically informed interpolation in hidden space and demonstrated significant performance increases on several text classification benchmarks. Cheng et al. (2020) performed interpolations in the embedding space in sequence-to-sequence learning for machine translation. Different from these previous studies, we sample sentences based on local additivity and utilize mixup for the task of sequence labeling.

Conclusion
This paper introduced a Local Additivity based Data Augmentation (LADA) method for Named Entity Recognition (NER) with two different interpolation strategies. To utilize unlabeled data, we introduced a novel consistency training objective combined with LADA. Experiments comparing with several state-of-the-art models on two NER benchmarks demonstrated the effectiveness of our proposed methods.