NAT: Noise-Aware Training for Robust Neural Sequence Labeling

Sequence labeling systems should perform reliably not only under ideal conditions but also with corrupted inputs, as these systems often process user-generated text or follow an error-prone upstream component. To this end, we formulate the noisy sequence labeling problem, where the input may undergo an unknown noising process, and propose two Noise-Aware Training (NAT) objectives that improve the robustness of sequence labeling performed on perturbed input: our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation. We employ a vanilla noise model at training time. For evaluation, we use both the original data and its variants perturbed with real OCR errors and misspellings. Extensive experiments on English and German named entity recognition benchmarks confirmed that NAT consistently improved the robustness of popular sequence labeling models while preserving accuracy on the original input. We make our code and data publicly available for the research community.


Introduction
Sequence labeling systems are generally trained on clean text, although in real-world scenarios, they often follow an error-prone upstream component, such as Optical Character Recognition (OCR; Neudecker, 2016) or Automatic Speech Recognition (ASR; Parada et al., 2011). Sequence labeling is also often performed on user-generated text, which may contain spelling mistakes or typos (Derczynski et al., 2013). Errors introduced in an upstream task are propagated downstream, diminishing the performance of the end-to-end system (Alex and Burns, 2014). While humans can easily cope with typos, misspellings, and the complete omission of letters when reading (Rawlinson, 2007), most Natural Language Processing (NLP) systems fail when processing corrupted or noisy text (Belinkov and Bisk, 2018). Although this problem is not new to NLP, only a few works addressed it explicitly (Piktus et al., 2019; Karpukhin et al., 2019). Other methods must rely on the noise that occurs naturally in the training data.
In this work, we are concerned with the performance difference of sequence labeling performed on clean and noisy input. Is it possible to narrow the gap between these two domains and design an approach that is transferable to different noise distributions at test time? Inspired by recent research in computer vision (Zheng et al., 2016), Neural Machine Translation (NMT; Cheng et al., 2018), and ASR (Sperber et al., 2017), we propose two Noise-Aware Training (NAT) objectives that improve the accuracy of sequence labeling performed on noisy input without reducing efficiency on the original data. Figure 1 illustrates the problem and our approach.
Our contributions are as follows:
• We formulate a noisy sequence labeling problem, where the input undergoes an unknown noising process (§2.2), and we introduce a model to estimate the real error distribution (§3.1). Moreover, we simulate real noisy input with a novel noise induction procedure (§3.2).
• We propose a data augmentation algorithm (§3.3) that directly induces noise in the input data to perform training of the neural model using a mixture of noisy and clean samples.
• We implement a stability training method (Zheng et al., 2016), adapted to the sequence labeling scenario, which explicitly addresses the noisy input data problem by encouraging the model to produce a noise-invariant latent representation (§3.4).
• We evaluate our methods on real OCR errors and misspellings against state-of-the-art baseline models (Peters et al., 2018; Akbik et al., 2018; Devlin et al., 2019) and demonstrate the effectiveness of our approach (§4).
• To support future research in this area and to make our experiments reproducible, we make our code and data publicly available.
Problem Definition

Neural Sequence Labeling
Figure 2 presents a typical architecture for the neural sequence labeling problem. We will refer to the sequence labeling system as F(x; θ), abbreviated as F(x), where x = (x_1, ..., x_N) is a tokenized input sentence of length N, and θ represents all learnable parameters of the system. F(x) takes x as input and outputs the probability distribution over the class labels y(x), as well as the final sequence of labels ŷ = (ŷ_1, ..., ŷ_N).
Either a softmax model (Chiu and Nichols, 2016) or a Conditional Random Field (CRF; Lample et al., 2016) can be used to model the output distribution over the class labels y(x) from the logits l(x), i.e., non-normalized predictions, and to output the final sequence of labels ŷ. As a labeled entity can span several consecutive tokens within a sentence, special tagging schemes are often employed for decoding, e.g., BIOES, where the Beginning, Inside, Outside, End-of-entity, and Single-tag-entity subtags are distinguished (Ratinov and Roth, 2009). This scheme introduces strong dependencies between subsequent labels, which are modeled explicitly by a CRF (Lafferty et al., 2001) that produces the most likely sequence of labels.

Noisy Neural Sequence Labeling
Similar to human readers, sequence labeling should perform reliably both in ideal and sub-optimal conditions. Unfortunately, this is rarely the case. User-generated text is a rich source of informal language containing misspellings, typos, or scrambled words (Derczynski et al., 2013). Noise can also be introduced in an upstream task, like OCR (Alex and Burns, 2014) or ASR (Chen et al., 2017), causing the errors to be propagated downstream.
To include the noise present on the source side of F(x), we can modify its definition accordingly (Figure 2). Let us assume that the input sentence x is additionally subjected to some unknown noising process Γ = P(x̃_i | x_i), where x_i is the original i-th token and x̃_i is its distorted equivalent. Let V be the vocabulary of tokens and Ṽ be a set of all finite character sequences over an alphabet Σ. Γ is known as the noisy channel matrix (Brill and Moore, 2000) and can be constructed by estimating the probability P(x̃_i | x_i) of each distorted token x̃_i given the intended token x_i, for every x_i ∈ V and x̃_i ∈ Ṽ.
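As an illustration, a noisy channel matrix of this kind can be stored as a sparse mapping from clean symbols to distributions over their distorted variants. The sketch below is ours, not the authors' implementation, and the toy probabilities are made up:

```python
def noisy_channel_prob(confusion, noisy, clean):
    """Return P(noisy | clean) from a confusion matrix stored as a
    dict mapping each clean symbol to a dict of noisy variants."""
    return confusion.get(clean, {}).get(noisy, 0.0)

# Toy matrix (made-up numbers): 'l' is kept 90% of the time and
# misread as the digit '1' 10% of the time.
confusion = {"l": {"l": 0.9, "1": 0.1}}

p = noisy_channel_prob(confusion, "1", "l")  # probability of the misread
```

Each row of the matrix is a probability distribution over possible distortions of one clean symbol, so its entries should sum to one.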

Named Entity Recognition
We study the effectiveness of state-of-the-art Named Entity Recognition (NER) systems in handling imperfect input data. NER can be considered a special case of the sequence labeling problem, where the goal is to locate all named entity mentions in unstructured text and to classify them into pre-defined categories, e.g., person names, organizations, and locations (Tjong Kim Sang and De Meulder, 2003). NER systems are often trained on clean text. Consequently, they exhibit degraded performance in real-world scenarios where the transcriptions are produced by an upstream component, such as OCR or ASR (§2.2), which results in a detrimental mismatch between the training and the test conditions. Our goal is to improve the robustness of sequence labeling performed on data from noisy sources, without deteriorating performance on the original data. We assume that the source sequence of tokens x may contain errors. However, the noising process is generally label-preserving, i.e., the level of noise is not significant enough to affect the corresponding labels. It follows that the noisy token x̃_i inherits the ground-truth label y_i from the underlying original token x_i.

Noise Model
To model the noise, we use the character-level noisy channel matrix Γ, which we will refer to as the character confusion matrix (§2.2).

Natural noise
We can estimate the natural error distribution by calculating the alignments between the pairs (x, x̃) ∈ P of clean and noisy sentences using the Levenshtein distance metric (Levenshtein, 1966), where P is a corpus of paired noisy and manually corrected sentences (§2.2). The allowed edit operations include insertions, deletions, and substitutions of characters. We can model insertions and deletions by introducing an additional symbol ε into the character confusion matrix. The probability of insertion and deletion can then be formulated as P_ins(c̃|ε) and P_del(ε|c), where c̃ is a character to be inserted and c is a character to be deleted, respectively.

Synthetic noise
P is usually laborious to obtain. Moreover, the exact modeling of noise might be impractical, and it is often difficult to accurately estimate the exact noise distribution to be encountered at test time. Such distributions may depend on, e.g., the OCR engine used to digitize the documents. Therefore, we keep the estimated natural error distribution for evaluation and use a simplified synthetic error model for training. We assume that all types of edit operations are equally likely:

Σ_{c̃ ∈ Σ\{ε}} P_ins(c̃|ε) = P_del(ε|c) = Σ_{c̃ ∈ Σ\{c, ε}} P_subst(c̃|c) = η/3,

where c and c̃ are the original and the perturbed characters, respectively. Moreover, P_ins and P_subst are uniform over the set of allowed insertion and substitution candidates, respectively. We use the hyper-parameter η to control the amount of noise to be induced with this method (see Appendix A for the extended description).
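The alignment step described above can be sketched with a standard Levenshtein dynamic program whose backtrace emits (clean, noisy) character pairs, using ε for insertions and deletions. This is a minimal illustration in our own notation, not the authors' code:

```python
from collections import Counter

EPS = "ε"  # the empty symbol used for insertions and deletions

def align(clean, noisy):
    """Backtrace one Levenshtein alignment into (clean_char, noisy_char)
    pairs, with EPS marking insertions and deletions."""
    n, m = len(clean), len(noisy)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if clean[i - 1] == noisy[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and clean[i - 1] == noisy[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            pairs.append((clean[i - 1], noisy[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((clean[i - 1], EPS)); i -= 1   # deletion
        else:
            pairs.append((EPS, noisy[j - 1])); j -= 1   # insertion
    return pairs[::-1]

def confusion_counts(sentence_pairs):
    """Accumulate character-level confusion counts over (clean, noisy) pairs;
    normalizing each row would yield the estimated P(c_noisy | c_clean)."""
    counts = Counter()
    for clean, noisy in sentence_pairs:
        counts.update(align(clean, noisy))
    return counts
```

Normalizing the counts row-wise (per clean character) gives an estimate of the natural confusion matrix, including the P_ins(c̃|ε) and P_del(ε|c) entries.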

Noise Induction
Ideally, we would use noisy sentences annotated with named entity labels for training our sequence labeling models. Unfortunately, such data is scarce. On the other hand, labeled clean text corpora are widely available (Tjong Kim Sang and De Meulder, 2003; Benikova et al., 2014). Hence, we propose to use the standard NER corpora and to synthetically induce noise into the input tokens during training.
In contrast to the image domain, which is continuous, the text domain is discrete, and we cannot directly apply continuous perturbations to written language. Although some works applied distortions at the level of embeddings (Miyato et al., 2017; Yasunaga et al., 2018; Bekoulis et al., 2018), we do not have a good intuition of how such distortions change the meaning of the underlying textual input. Instead, we apply our noise induction procedure to generate distorted copies of the input. For every input sentence x, we independently perturb each token x_i = (c_1, ..., c_K), where K is the length of x_i, with the following procedure (Figure 3):
(1) We insert the ε symbol before the first and after every character of x_i to get an extended token x'_i = (ε, c_1, ε, ..., ε, c_K, ε).
(2) For every character c_k of x'_i, we sample the replacement character c̃_k from the corresponding probability distribution P(c̃_k | c_k), which can be obtained by taking the row of the character confusion matrix that corresponds to c_k. As a result, we get a noisy version x̃'_i of the extended input token x'_i.
(3) We remove all ε symbols from x̃'_i and collapse the remaining characters to obtain a noisy token x̃_i.
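The three steps above can be sketched as follows; the `sample_replacement` callback stands in for drawing from a row of the character confusion matrix and is an assumption of this sketch:

```python
import random

EPS = "ε"  # the empty symbol from the confusion matrix

def induce_noise(token, sample_replacement, rng=random):
    """Perturb one token with the three-step procedure:
    extend with EPS slots, sample per-position replacements, collapse."""
    # (1) Insert EPS before the first and after every character.
    extended = [EPS]
    for ch in token:
        extended += [ch, EPS]
    # (2) Sample a replacement for every position from P(. | c).
    noisy_extended = [sample_replacement(c, rng) for c in extended]
    # (3) Remove EPS symbols and collapse the rest into the noisy token.
    return "".join(c for c in noisy_extended if c != EPS)
```

With an identity sampler the token is returned unchanged; a sampler that maps a character to EPS models a deletion, and one that replaces an EPS slot with a character models an insertion.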

Data Augmentation Method
We can improve robustness to noise at test time by introducing various forms of artificial noise during training. We distinguish between general regularization methods like dropout (Srivastava et al., 2014) and task-specific data augmentation that transforms the data to resemble noisy input. The latter technique was successfully applied in other domains, including computer vision (Krizhevsky et al., 2012) and speech recognition (Sperber et al., 2017).
During training, we artificially induce noise into the original sentences using the algorithm described in §3.2 and train our models using a mixture of clean and noisy sentences. Let L_0(x, y; θ) be the standard training objective for the sequence labeling problem, where x is the input sentence, y is the corresponding ground-truth sequence of labels, and θ represents the parameters of F(x). We define our composite loss function as follows:

L_augm(x, x̃, y; θ) = L_0(x, y; θ) + α L_0(x̃, y; θ),

where x̃ is the perturbed sentence, and α is a weight of the noisy loss component. L_augm is a weighted sum of standard losses calculated using clean and noisy sentences. Intuitively, a model that optimizes L_augm should be more robust to imperfect input data while retaining the ability to perform well on clean input. Figure 4a presents a schematic visualization of our data augmentation approach.
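A minimal sketch of the composite objective, with `loss_fn` as a placeholder for the model's standard training loss L_0:

```python
def augmentation_loss(loss_fn, x, x_noisy, y, alpha=1.0):
    """L_augm = L_0(x, y) + alpha * L_0(x_noisy, y): a weighted sum of
    the standard loss on the clean and the perturbed input."""
    return loss_fn(x, y) + alpha * loss_fn(x_noisy, y)
```

In practice `loss_fn` would be, e.g., the negative log-likelihood of a CRF layer; here it is only a stand-in to show how the two loss components are combined.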

Stability Training Method
Zheng et al. (2016) pointed out the output instability issues of deep neural networks. They proposed a training method to stabilize deep networks against small input perturbations and applied it to the tasks of near-duplicate image detection, similar-image ranking, and image classification. Inspired by their idea, we adapt the stability training method to the natural language scenario.
Our goal is to stabilize the outputs y(x) of a sequence labeling system against small input perturbations, which can be thought of as flattening y(x) in a close neighborhood of any input sentence x. When a perturbed copy x̃ is close to x, then y(x̃) should also be close to y(x). Given the standard training objective L_0(x, y; θ), the original input sentence x, its perturbed copy x̃, and the sequence of ground-truth labels y, we can define the stability training objective L_stabil as follows:

L_stabil(x, x̃, y; θ) = L_0(x, y; θ) + α L_sim(x, x̃; θ),
L_sim(x, x̃; θ) = D(y(x), y(x̃)),

where L_sim encourages the similarity of the model outputs for both x and x̃, D is a task-specific feature distance measure, and α balances the strength of the similarity objective. Let R(x) and Q(x̃) be the discrete probability distributions obtained by calculating the softmax function over the logits l(x) and l(x̃) for x and x̃, respectively:

R(x) = softmax(l(x)),
Q(x̃) = softmax(l(x̃)).

We model D as Kullback-Leibler divergence (D_KL), which measures the correspondence between the likelihood of the original and the perturbed input:

L_sim(x, x̃; θ) = D_KL(R(x) ∥ Q(x̃)) = Σ_i Σ_j R_ij(x) log ( R_ij(x) / Q_ij(x̃) ),

where i and j are the token and the class label indices, respectively. Figure 4b summarizes the main idea of our stability training method.

Figure 4: Schema of our auxiliary training objectives. x and x̃ are the original and the perturbed inputs, respectively, that are fed to the sequence labeling system F(x). Γ represents a noising process. y(x) and y(x̃) are the output distributions over the entity classes for x and x̃, respectively. L_0 is the standard training objective. L_augm combines L_0 computed on both outputs from F(x). L_stabil fuses L_0 calculated on the original input with the similarity objective L_sim.
A critical difference between the data augmentation and the stability training method is that the latter does not use noisy samples for the original task, but only for the stability objective. Furthermore, both methods need perturbed copies of the input samples, which results in longer training time but could be ameliorated by fine-tuning the existing model for a few epochs.
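The similarity term can be sketched as a token- and class-wise KL divergence between the two output distributions, here represented as plain nested lists rather than model tensors; `loss_fn` is again a placeholder for L_0:

```python
import math

def kl_divergence(r, q):
    """D_KL(R || Q) summed over token indices i and class labels j,
    for per-token probability distributions given as nested lists."""
    return sum(r_ij * math.log(r_ij / q_ij)
               for r_i, q_i in zip(r, q)
               for r_ij, q_ij in zip(r_i, q_i)
               if r_ij > 0)  # 0 * log(0/q) contributes nothing

def stability_loss(loss_fn, x, x_noisy, y, y_dist, y_noisy_dist, alpha=1.0):
    """L_stabil: standard loss on the clean input plus the weighted
    similarity term between the clean and noisy output distributions."""
    return loss_fn(x, y) + alpha * kl_divergence(y_dist, y_noisy_dist)
```

Note that, unlike the augmentation objective, the ground-truth labels y enter only through the clean-input loss; the noisy input contributes solely via the divergence term.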

Experiment Setup
Model architecture We used a BiLSTM-CRF architecture (Huang et al., 2015) with a single Bidirectional Long Short-Term Memory (BiLSTM) layer and 256 hidden units in both directions for f(x) in all experiments. We considered four different text representations e(x), which were used to achieve state-of-the-art results on the studied data sets and should also be able to handle misspelled text and out-of-vocabulary (OOV) tokens:
• FLAIR (Akbik et al., 2018) learns a Bidirectional Language Model (BiLM) using an LSTM network to represent any sequence of characters. We used the settings recommended by the authors and combined FLAIR with GloVe embeddings (Pennington et al., 2014; FLAIR + GloVe) for English and Wikipedia FastText embeddings (Bojanowski et al., 2017; FLAIR + Wiki) for German.
• BERT (Devlin et al., 2019) employs a Transformer encoder to learn a BiLM from large unlabeled text corpora and sub-word units to represent textual tokens.We use the BERT BASE model in our experiments.
• ELMo (Peters et al., 2018) utilizes a linear combination of hidden state vectors derived from a BiLSTM word language model trained on a large text corpus.
• GloVe/Wiki + Char is a combination of pre-trained word embeddings (GloVe for English and Wikipedia FastText for German) and randomly initialized character embeddings (Lample et al., 2016).
Training We trained the sequence labeling model f(x) and the final CRF decoding layer on top of the pre-trained embedding vectors e(x), which were fixed during training, except for the character embeddings (Figure 2). We used a mixture of the original data and its perturbed copies generated from the synthetic noise distribution (§3.1) with our noise induction procedure (§3.2). We kept most of the hyper-parameters consistent with Akbik et al. (2018). We trained our models for at most 100 epochs and used early stopping based on the development set performance, measured as an average F1 score of clean and noisy samples. Furthermore, we used the development sets of each benchmark data set for validation only and not for training.
Performance measures We measured the entity-level micro average F1 score on the test set to compare the results of different models. We evaluated on both the original and the perturbed data using various natural error distributions. We induced OCR errors based on the character confusion matrix Γ (§3.2) that was gathered on a large document corpus (Namysl and Konya, 2019) using the Tesseract OCR engine (Smith, 2007). Moreover, we employed two sets of misspellings released by Belinkov and Bisk (2018) and Piktus et al. (2019). Following the authors, we replaced every original token with the corresponding misspelled variant, sampling uniformly among the available replacement candidates. We present the estimated error rates of text produced with these noise induction procedures in Table 5 in the appendix. As the evaluation with noisy data leads to some variance in the final scores, we repeated all experiments five times and report the mean and standard deviation.
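The misspelling-based noising can be sketched as a per-token uniform draw from a replacement lookup; the lookup contents below are illustrative, not taken from the released misspelling sets:

```python
import random

def induce_misspellings(tokens, misspellings, rng=random):
    """Replace every token that has known misspelled variants with one
    of them, sampled uniformly; other tokens are left unchanged."""
    return [rng.choice(misspellings[t]) if t in misspellings else t
            for t in tokens]

# Toy lookup: only "the" has a recorded misspelling.
lookup = {"the": ["teh"]}
```

Tokens absent from the lookup stay intact, which is precisely the coverage limitation of lookup-based noising that the confusion-matrix approach avoids.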
Implementation We implemented our models using the FLAIR framework (Akbik et al., 2019). We extended their sequence labeling model by integrating our auxiliary training objectives (§3.3, §3.4). Nonetheless, our approach is universal and can be implemented in any other sequence labeling framework.
Table 1 presents the results of this experiment. We found that our auxiliary training objectives boosted accuracy on noisy input data for all baseline models and both languages. At the same time, they preserved accuracy on the original input. The data augmentation objective seemed to perform slightly better than the stability objective. However, the chosen hyper-parameter values were rather arbitrary, as our goal was to prove the utility and the flexibility of both objectives.

Sensitivity Analysis
We evaluated the impact of our hyper-parameters on the sequence labeling accuracy using the English CoNLL 2003 data set. We trained multiple models with different amounts of noise η_train and different weighting factors α. We chose the FLAIR + GloVe model as our baseline because it achieved the best results in the preliminary analysis (§4.2) and showed good performance, which enabled us to perform extensive experiments.
Figure 5 summarizes the results of the sensitivity experiment. The models trained with our auxiliary objectives mostly preserved or even improved accuracy on the original data compared to the baseline model (α = 0). Moreover, they significantly outperformed the baseline on data perturbed with natural noise. The best accuracy was achieved for η_train from 10 to 30%, which roughly corresponds to the label-preserving noise range. Similar to Heigold et al. (2018) and Cheng et al. (2019), we conclude that a non-zero noise level induced during training (η_train > 0) always yields improvements on noisy input data when compared with models trained exclusively on clean data. The best choice of α was in the range from 0.5 to 2.0, while α = 5.0 exhibited lower performance on the original data. Moreover, the models trained on the real error distribution demonstrated at most slightly better performance, which indicates that the exact noise distribution does not necessarily have to be known at training time.

Error Analysis
To quantify improvements provided by our approach, we measured sequence labeling accuracy on the subsets of data with different levels of perturbation, i.e., we divided input tokens based on edit distance to their clean counterparts.Moreover, we partitioned the data by named entity class to assess the impact of noise on recognition of different entity types.For this experiment, we used both the test and the development parts of the English CoNLL 2003 data set and induced OCR errors with our noising procedure.
Figure 6 presents the results for the baseline and the proposed methods. It can be seen that our approach achieved significant error reduction across all perturbation levels and all entity types. Moreover, by narrowing down the analysis to perturbed tokens, we discovered that the baseline model was particularly sensitive to noisy tokens from the LOC and the MISC categories. Our approach considerably reduced this negative effect. Furthermore, as the stability training worked slightly better on the LOC class and the data augmentation was more accurate on the ORG type, we argue that both methods could be combined to enhance overall sequence labeling accuracy further. Note that even if a particular token was not perturbed, its context could be noisy, which explains the fact that our approach provided improvements even for tokens without perturbations.

Related Work
Improving robustness has been receiving increasing attention in the NLP community. The most relevant research was conducted in the NMT domain.
Noise-additive data augmentation A natural strategy to improve robustness to noise is to augment the training data with samples perturbed using a similar noise model. Heigold et al. (2018) and related works followed this strategy using look-up tables of noisy token variants; even at the maximum noise rate, only a subset of tokens is perturbed (20-50%, depending on the language). In contrast, we used a confusion matrix, which is better suited to model a statistical error distribution and can be applied to all tokens, not only those present in the corresponding look-up tables.
Robust representations Another method to improve robustness is to design a representation that is less sensitive to noisy input. Zheng et al. (2016) presented a general method to stabilize model predictions against small input distortions. Cheng et al. (2018) continued their work and developed an adversarial stability training method for NMT by adding a discriminator term to the objective function. They combined data augmentation and stability objectives, while we evaluated both methods separately and provided evaluation results on natural noise distributions. Piktus et al. (2019) learned a representation that embeds misspelled words close to their correct variants. Their Misspelling Oblivious Embeddings (MOE) model jointly optimizes two loss functions, each of which iterates over a separate data set (a corpus of text and a set of misspelling/correction pairs) during training. In contrast, our method does not depend on any additional resources and uses a simplified error distribution during training.
Adversarial learning Adversarial attacks seek to mislead neural models by feeding them adversarial examples (Szegedy et al., 2014). In a white-box attack scenario (Goodfellow et al., 2015; Ebrahimi et al., 2018), we assume that the attacker has access to the model parameters, in contrast to the black-box scenario (Alzantot et al., 2018; Gao et al., 2018), where the attacker can only sample model predictions on given examples. Adversarial training (Miyato et al., 2017; Yasunaga et al., 2018), on the other hand, aims to improve the robustness of neural models by utilizing adversarial examples during training.
The impact of noisy input data In the context of ASR, Parada et al. (2011) observed that named entities are often OOV tokens, and therefore they cause more recognition errors. In the document processing field, Alex and Burns (2014) studied NER performed on several digitized historical text collections and showed that OCR errors have a significant impact on the accuracy of the downstream task. Namysl and Konya (2019) examined the efficiency of modern OCR engines and showed that, although the OCR technology is more advanced than several years ago when many historical archives were digitized (Kim and Cassidy, 2015; Neudecker, 2016), the most widely used engines still have difficulties with non-standard or lower-quality input.
Spelling and post-OCR correction A natural method of handling erroneous text is to correct it before feeding it to the downstream task. The most popular post-correction techniques include correction candidate ranking (Fivez et al., 2017; Flor et al., 2019), noisy channel modeling (Brill and Moore, 2000; Duan and Hsu, 2011), voting (Wemhoener et al., 2013), sequence-to-sequence models (Afli et al., 2016; Schmaltz et al., 2017), and hybrid systems (Schulz and Kuhn, 2017).
In this paper, we have taken a different approach and attempted to make our models robust without relying on prior error correction, which, in the case of OCR errors, is still far from being solved (Chiron et al., 2017; Rigaud et al., 2019).

Conclusions
In this paper, we investigated the difference in accuracy between sequence labeling performed on clean and noisy text (§2.3). We formulated the noisy sequence labeling problem (§2.2) and introduced a model that can be used to estimate the real noise distribution (§3.1). We developed a noise induction procedure that simulates real noisy input (§3.2). We proposed two noise-aware training methods that boost sequence labeling accuracy on perturbed text: (i) Our data augmentation approach uses a mixture of clean and noisy examples during training to make the model resistant to erroneous input (§3.3). (ii) Our stability training algorithm encourages output similarity for the original and the perturbed input, which helps the model to build a noise-invariant latent representation (§3.4). Our experiments confirmed that NAT consistently improved the performance of popular sequence labeling models on data perturbed with different error distributions, preserving accuracy on the original input (§4). Moreover, we avoided expensive re-training of embeddings on noisy data sources by employing existing text representations. We conclude that NAT makes existing models applicable beyond the idealized scenarios. It may support an automatic correction method that uses recognized entity types to narrow the list of feasible correction candidates. Another application is data anonymization (Mamede et al., 2016).
Future work will involve improvements in the proposed noise model to study the importance of fidelity to real-world error patterns.Moreover, we plan to evaluate NAT on other real noise distributions (e.g., from ASR) and other sequence labeling tasks to support our claims further.

A Noise Model -Supplementary Materials
In this section, we present the extended description of our vanilla noise model introduced in §3.1. Let P_edit = η/3 be the probability of performing a single character edit operation (insertion, deletion, or substitution) that replaces the source character c with a noisy character c̃, where c̃ ≠ c. Equation (1) defines the vanilla error distribution, which we use at training time:

P(c̃|c) =
  P_edit / |Σ\{ε}|,     if c = ε and c̃ ≠ ε,        (1a)
  1 − P_edit,           if c = ε and c̃ = ε,        (1b)
  P_edit / |Σ\{c, ε}|,  if c ≠ ε and c̃ ∉ {c, ε},   (1c)
  P_edit,               if c ≠ ε and c̃ = ε,        (1d)
  1 − 2·P_edit,         if c̃ = c ≠ ε.              (1e)

It consists of the following components:
(a) The insertion probability P_ins(c̃|ε) in eq. (1a). It describes how likely it is to insert a non-empty character c̃ ≠ ε, and it is uniform over the set of all characters from the alphabet Σ, except the ε symbol.
(b) The probability of not inserting any character in eq. (1b).
(c) The substitution probability P_subst(c̃|c) in eq. (1c). It is uniform over the set of all characters from the alphabet Σ, except the source character c and the ε symbol.
(d) The deletion probability P_del(ε|c) in eq. (1d).
(e) The probability of leaving the character unchanged in eq. (1e).

Equations (1a) and (1b) correspond to the row in the character confusion matrix Γ where c = ε and form a valid probability distribution:

Σ_{c̃∈Σ} P(c̃|ε) = |Σ\{ε}| · P_edit / |Σ\{ε}| + (1 − P_edit) = 1.

Similarly, eqs. (1c) to (1e) correspond to the rows in the character confusion matrix Γ where c ∈ Σ\{ε}, and are also valid probability distributions:

Σ_{c̃∈Σ} P(c̃|c) = |Σ\{c, ε}| · P_edit / |Σ\{c, ε}| + P_edit + (1 − 2·P_edit) = 1.

Finally, for comparison, we present visualizations of the confusion matrices used in our vanilla (Figure 7a) and OCR error models (Figure 7b).
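The case structure of eq. (1) can be sketched for a toy alphabet as follows; this is our reconstruction of the distribution described above, not the authors' code:

```python
def vanilla_row(c, alphabet, eta, eps="ε"):
    """Return one row P(c_tilde | c) of the vanilla confusion matrix
    as a dict, following the case structure of eq. (1)."""
    p_edit = eta / 3.0                       # probability of one edit type
    chars = [a for a in alphabet if a != eps]
    if c == eps:
        # Insertion row: insert any character (uniformly), or nothing.
        row = {a: p_edit / len(chars) for a in chars}
        row[eps] = 1.0 - p_edit
    else:
        # Substitution (uniform over the other characters), deletion,
        # or keeping the character unchanged.
        others = [a for a in chars if a != c]
        row = {a: p_edit / len(others) for a in others}
        row[eps] = p_edit                    # deletion
        row[c] = 1.0 - 2.0 * p_edit          # no error
    return row
```

The row-sum checks below mirror the normalization argument above; the deletion entry equals the total substitution mass, as noted in the Figure 7 caption.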

A.1 Sensitivity Analysis
In this section, we present the extended version of our sensitivity study (§4.3). Figure 8 summarizes the results on the synthetic data distribution with various test- and training-time noise levels (η_test and η_train, respectively) and weighting factors α. We noticed a similar trend as in our initial analysis. As the level of noise η_test increases, the overall accuracy decreases, but this trend is less pronounced for α = 0. At the same time, the gap between the models trained with and without our auxiliary objectives becomes larger.

A.2 Qualitative Analysis
In this section, we compare the outputs generated by the baseline models trained with and without our auxiliary training objectives (Table 2). We found that the NAT method improved robustness to capitalization errors (the first and the fourth row in Table 2a), as well as to substitutions (the second, the third, and the fifth row in Table 2a and the first, the second, the fourth, and the fifth row in Table 2b), deletions (the fifth row in Table 2a), and insertions of characters (the third and the fifth row in Table 2b). Moreover, it better recognized the semantics of the sentence in the third row of Table 2a, where the location name was creatively rewritten (Brazland instead of Brazil).

B Hyper-parameters
We present the detailed hyper-parameters of the sequence labeling model f(x) used in our experiments (§4). Note that dropout was applied both before and after the LSTM layer (Table 3).

C Data Set Statistics and Estimated Error Rates
In this section, we present the detailed statistics of the data sets used in our NER experiments, along with the estimated error rates of the text produced by our noise induction procedures.
Figure 1: An example of a labeling error on a slightly perturbed sentence. Our noise-aware methods correctly predicted the location (LOC) label for the first word, as opposed to the standard approach, which misclassified it as an organization (ORG). We complement the example with a high-level idea of our noise-aware training, where the original sentence and its noisy variant are passed together through the system. The final loss is computed based on both sets of features, which improves robustness to the input perturbations.

Figure 2: Neural sequence labeling architecture. In the standard scenario (§2.1), the original sentence x is fed as input to the sequence labeling system F(x). Token embeddings e(x) are retrieved from the corresponding look-up table and fed to the sequence labeling model f(x), which outputs latent feature vectors h(x). The latent vectors are then projected to the class logits l(x), which are used as input to the decoding model (softmax or CRF) that outputs the distribution over the class labels y(x) and the final sequence of labels ŷ. In a real-world scenario (§2.2), the input sentence undergoes an unknown noising process Γ, and the perturbed sentence x̃ is fed to F(x).

Figure 3: Illustration of our noise induction procedure. The three examples correspond to insertion, deletion, and substitution errors. x_i, x'_i, x̃'_i, and x̃_i are the original, extended, extended noisy, and noisy tokens, respectively.
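A minimal sketch of such character-level noise induction, assuming the total edit probability η is split evenly between the three error types (the extended-token formulation with ε slots from Figure 3 is simplified away here):

```python
import random

def induce_noise(token, eta=0.2,
                 alphabet="abcdefghijklmnopqrstuvwxyz", seed=None):
    """Apply vanilla character noise to a token: each position may
    undergo a substitution, a deletion, or an insertion, with the total
    edit probability eta split evenly between the three error types
    (an illustrative assumption, not the paper's exact parametrization)."""
    rng = random.Random(seed)
    out = []
    for ch in token:
        r = rng.random()
        if r < eta / 3:                 # substitution: replace ch
            out.append(rng.choice(alphabet))
        elif r < 2 * eta / 3:           # deletion: drop ch entirely
            pass
        elif r < eta:                   # insertion: keep ch, add a random char
            out.append(ch)
            out.append(rng.choice(alphabet))
        else:                           # no edit at this position
            out.append(ch)
    return "".join(out)

print(induce_noise("Brazil", eta=0.5, seed=1))
```

Setting eta=0 reproduces the clean token unchanged, and a fixed seed makes the perturbation reproducible across runs.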
Figure 4: a) Data augmentation training objective L_augm. b) Stability training objective L_stabil.
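The two objectives can be sketched as follows. The exact weighting and distance measure are assumptions here: a mixture weight α combining the clean and noisy losses for L_augm, and a squared-distance penalty on the latent vectors h(x) and h(x̃) for L_stabil:

```python
import numpy as np

def cross_entropy(p, y):
    """Mean token-level negative log-likelihood of the gold labels y
    under the predicted label distributions p."""
    return -np.mean(np.log(p[np.arange(len(y)), y]))

def l_augm(p_clean, p_noisy, y, alpha=0.5):
    """Data augmentation objective: a weighted mixture of the losses on
    the clean and the noisy sample (the weighting is an assumption)."""
    return (1 - alpha) * cross_entropy(p_clean, y) \
        + alpha * cross_entropy(p_noisy, y)

def l_stabil(p_clean, h_clean, h_noisy, y, alpha=0.5):
    """Stability objective: the standard loss on the clean input plus a
    penalty on the distance between the latent representations, which
    encourages a noise-invariant h(x)."""
    stability = np.mean((h_clean - h_noisy) ** 2)
    return cross_entropy(p_clean, y) + alpha * stability

# Toy example: 2 tokens, 3 labels, identical clean/noisy latent vectors.
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
y = np.array([0, 1])
h = np.ones((2, 4))
print(l_augm(p, p, y), l_stabil(p, h, h, y))
```

When the noisy sample is identical to the clean one, both objectives reduce to the standard cross-entropy loss, which is the sanity check the toy example exercises.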

Figure 5: Sensitivity analysis performed on the English CoNLL 2003 test set (§4.3). Each figure presents the results of models trained using one of our auxiliary training objectives on either the original data or its variant perturbed with OCR errors. The bar marked as "OCR" represents a model trained using the OCR noise distribution. The other bars correspond to models trained using a synthetic noise distribution and different hyper-parameters (α, η_train).
a) Divided by the entity class (clean tokens). b) Divided by the entity class (perturbed tokens).

Figure 6: Error analysis results on the English CoNLL 2003 data set with OCR noise. We present the results of the FLAIR + GloVe model trained with the standard and the proposed objectives. The data was divided into subsets based on the edit distance of a token to its original counterpart and its named entity class. Each group was further partitioned into clean and perturbed tokens. The error rate is the percentage of tokens with misrecognized entity class labels.
" # $ % & ' ( ) * + , -. / 0 1 2 3 4 5 6 7 8 9 : ;= ?@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ]a b c d e f g h i j k l m n o p q r s t u v w x y Vanilla error distribution used at training time (η = 20%)." # $ % & ' ( ) * + , -. / 0 1 2 3 4 5 6 7 8 9 : ;= ?@ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ ]a b c d e f g h i j k l m n o p q r s t u v w x Real error distribution estimated from a large document corpus using the Tesseract OCR engine.

Figure 7: Confusion matrices for the vanilla and the OCR error distributions. Each cell represents P(c̃|c). The rows correspond to the original characters c and the columns represent the perturbed characters c̃. In this example, we include all symbols from the alphabet of the English CoNLL 2003 data set. The vanilla noise model assigns equal probability to all substitution errors, while the OCR error model is biased towards substitutions of characters with similar shapes, like "I"→"l", "$"→"5", "O"→"0", or ","→".". Moreover, the vanilla model assumes that the deletion of a character c is as likely as the sum of the substitution probabilities with all non-empty symbols: P_del(ε|c) = Σ_{c̃ ∈ Σ\{ε}} P_subst(c̃|c).
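One row of the vanilla confusion matrix can be sketched directly from the caption's formula; how the total error rate η is split between substitution and deletion mass is an assumption here:

```python
def vanilla_row(c, alphabet, eta=0.2):
    """One row P(.|c) of the vanilla confusion matrix (a sketch).
    Substitution mass is spread uniformly over all other characters;
    per the caption's formula, the deletion probability P_del(eps|c)
    equals the summed substitution mass, and the remaining probability
    keeps c unchanged. Splitting eta in half is an assumption."""
    others = [a for a in alphabet if a != c]
    p_subst_total = eta / 2      # assumed: half of eta for substitutions
    p_del = p_subst_total        # P_del = sum of substitution probabilities
    row = {a: p_subst_total / len(others) for a in others}
    row["ε"] = p_del             # the epsilon column encodes deletion
    row[c] = 1.0 - p_subst_total - p_del
    return row

row = vanilla_row("a", "abc", eta=0.2)
print(round(sum(row.values()), 10))  # each row is a valid distribution
```

An OCR-aware row would instead concentrate the substitution mass on visually similar characters (e.g. "I"→"l"), which is exactly the bias the right panel of Figure 7 visualizes.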

Table 1: Evaluation results on the CoNLL 2003 and the GermEval 2014 test sets. We report results on the original data, as well as on its noisy copies with OCR errors and two types of misspellings released by Belinkov and Bisk (2018)† and Piktus et al. (2019)‡. L_0 is the standard training objective. L_augm and L_stabil are the data augmentation and the stability objectives, respectively. We report mean F1 scores with standard deviations from five experiments and mean differences against the standard objective (in parentheses).

Table 3: Hyper-parameters of the sequence labeling model f(x) used in our experiments.

Table 2: Outputs produced by the models trained with and without our auxiliary NAT objectives (NAT output and Baseline output, respectively). We demonstrate examples that contain misspellings and OCR errors, where the models trained with the auxiliary NAT objectives correctly recognized all tags, while the baseline models either misclassified or completely missed some entities.

Table 4: Statistics of the data sets used in our NER experiments (§4). We present statistics of the training (Train), development (Dev), and test (Test) sets, including the number of sentences, tokens, and entities: person names (PER), locations (LOC), organizations (ORG), and miscellaneous (MISC). The GermEval 2014 data set defines two additional fine-grained sub-labels, "-part" and "-deriv", which mark compound and derivation words, respectively, that stand in direct relation to Named Entities.

Table 5: Error rate estimation for different noise distributions. OCR noise is modeled with the character confusion matrix, whereas misspellings are induced using look-up tables released by Belinkov and Bisk (2018)† and Piktus et al. (2019)‡.