Alleviating Digitization Errors in Named Entity Recognition for Historical Documents

This paper tackles the task of named entity recognition (NER) applied to digitized historical texts obtained from processing digital images of newspapers using optical character recognition (OCR) techniques. We argue that the main challenge for this task is that the OCR process leads to misspellings and linguistic errors in the output text. Moreover, historical variations can be present in aged documents, which can impact the performance of the NER process. We conduct a comparative evaluation on two historical datasets in German and French against previous state-of-the-art models, and we propose a model based on a hierarchical stack of Transformers to approach the NER task for historical data. Our findings show that the proposed model clearly improves the results on both historical datasets, and does not degrade the results for modern datasets.


Introduction
With the emergence of large-scale archives of digitized content, the need for efficient preservation of and access to historical documents through appropriate technologies has increased considerably. At the same time, there is a growing interest in extracting relevant information from historical sources. In this paper, we address the named entity recognition (NER) task, which aims at identifying real-world entities, such as names of people, organizations, and locations, within historical documents.
Since most of the state-of-the-art research focuses on NER for modern, readily available datasets, the performance of NER systems grew at a fast pace, enabled by the representational capacity of neural networks and off-the-shelf pre-trained word embeddings (Ma and Hovy, 2016; Lample et al., 2016; Yadav and Bethard, 2018). More recently, NER models based on contextual word and sub-word representations provided by ELMo (Peters et al., 2018), Flair (Akbik et al., 2018), or BERT (Devlin et al., 2019) achieved impressive improvements. Transformer-based (Vaswani et al., 2017) architectures for NER became popular since the release of the BERT (Bidirectional Encoder Representations from Transformers) model.
However, while most NER systems have been developed to address contemporary data, NER systems for processing historical documents are less common. To extract entities from historical documents, NER tools face additional challenges. As the majority of these documents exist only as hard copies, they are scanned and processed by an OCR tool to transcribe the text. However, an OCR tool can occasionally misrecognize letters and improperly identify textual content. This can be due to the level of degradation of the actual document being scanned, to digitization artifacts, and also to the quality of the OCR tool itself. This leads to digitization errors in the transcribed text, such as misspelled locations or person names.
Languages evolve through time and certain words can have a different meaning depending on the period of time analyzed (Hamilton et al., 2016). The spelling of words can also change due to new orthographic conventions or cultural tendencies (Scheible et al., 2011). This high level of spelling differences can be incompatible with modern orthography and the produced noise can severely affect modern NLP systems (Lopresti, 2009).
To address these challenges of NER on historical documents, we propose a robust NER model based on a stack of Transformers that includes fine-tuned BERT encoders. We study the impact of such a model, and we conclude that this type of model is suited for the extraction of entities from historical documents.
The remainder of this paper is organized as follows. In Section 2, we present and discuss a selection of works concerning NER in modern and historical documents. Then, in Section 3, the datasets explored in this work are presented. The proposed model is detailed in Section 4. The experiments are described in Section 5. We present and discuss the obtained results in Section 6. Finally, Section 7 concludes this paper and hints at future work.
Related Work

NER for modern documents
The first end-to-end systems for sequence labeling tasks are based on pre-trained word and character embeddings encoded either by a bidirectional Long Short-Term Memory (BiLSTM) network or a Convolutional Neural Network (CNN) (Collobert et al., 2011; Lample et al., 2016; Ma and Hovy, 2016; Aguilar et al., 2017; Chiu and Nichols, 2016), along with a Conditional Random Fields (CRF) decoder. One shortcoming of this type of model is that it relies on a single context-independent representation for each word. This problem has been attenuated by methods based on language model pre-training that produce context-dependent word representations. These recent large-scale language models, such as BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), further enhanced NER, yielding state-of-the-art performances (Peters et al., 2017, 2018; Baevski et al., 2019).
NER for historical documents
Historical documents pose multiple challenges that depend either on the quality of digitization or on the historical variations of a language. Studies on how NER models can be impacted by the digitization process (Miller et al., 2000; Rodriquez et al., 2012; Hamdi et al., 2019; van Strien et al., 2020) have clearly shown that the performance scores of a NER model can significantly decrease when applied to historical documents.
The increased interest in contributing to historical language resources is driven forward by the creation of new gold standards for historical document processing. For example, Hubková (2019) created and annotated a corpus using scanned Czech historical newspapers, and Ahmed et al. (2019) proposed a German gold standard for NER in historical biodiversity literature.
A recent competition, organized by the Identifying Historical People, Places, and other Entities (HIPE) lab at CLEF 2020, not only created a gold standard for German and French historical texts, but also encouraged researchers to participate in two sub-tasks: named entity recognition and classification, and entity linking.
Considering the high level of spelling differences between modern and historical documents, along with the variance (inconsistency) and uncertainty (digitization errors) found in historical documents, recent methods address these shortcomings differently. Erdmann et al. (2016) presented a CRF-based model with handcrafted features for Latin historical texts and motivated the choice of a Part-of-Speech (POS) tagger by the fact that this NLP tool leverages the highly informative morphological complexity of Latin. The BiLSTM-based model proposed by Hubková (2019) applied a character-based CNN to encode the different spellings of words.
Similar to the latter approach, we also consider that the NER model itself can help alleviate the issues of historical documents, without resorting to language-specific engineered features. In contrast to previous work, we bring to historical-document NER the language-model methods based on the Transformer architecture (Vaswani et al., 2017) and on BERT (Devlin et al., 2019), which, to our knowledge, have not been applied to historical documents in previous research.
Given the new needs and resources in the context of historical NER, we evaluate our model on the dataset released by the HIPE competition, and we also introduce a new gold standard for German and French to assess our assumptions.

Datasets
We conduct experiments on two datasets of digitized historical newspapers, HIPE and NEWSEYE, in French and German. Additionally, we study how the proposed methods behave on contemporary data by experimenting on the English CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003).
The HIPE dataset was created for the CLEF 2020 Evaluation Lab HIPE challenge (Ehrmann et al., 2020a). It is composed of articles from several Swiss, Luxembourgish, and American historical newspapers from 1790 to 2010 (Ehrmann et al., 2020b). More precisely, the German articles were collected from 1790 to 1940, and the French articles from 1790 to 2010. The corpus was manually annotated by native speakers following annotation guidelines derived from the Quaero annotation guide.
We also present the NEWSEYE dataset, composed of historical newspapers in French and German. The documents were collected through the national libraries of France (BnF) and Austria (ONB), respectively. This dataset was annotated following guidelines derived from the Quaero annotation guide. The annotation was carried out by native speakers of each language using the Transkribus tool. In order to compute the inter-annotator agreement (IAA), we used the Kappa coefficient introduced by Cohen (1960). Several pages from each corpus (German and French) were annotated twice by two groups of annotators. Satisfactory IAA scores were reached for the two corpora (0.90 for French and 0.91 for German). The NEWSEYE corpus is split into 80% for training and 20% for validation and testing combined.
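The Kappa coefficient can be computed directly from the two annotators' label sequences. The sketch below is illustrative; the function name and the toy labels are ours, not part of the annotation pipeline:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences of equal length."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Perfect agreement yields kappa = 1; disagreements are discounted by chance.
k = cohens_kappa(["PER", "LOC", "O", "O"], ["PER", "LOC", "O", "LOC"])
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for NER annotation where the "O" (outside) label dominates.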
The CoNLL 2003 dataset consists of newswire from the Reuters RCV1 corpus and it includes standard train, development, and test sets. Table 1 presents the statistics regarding the number and type of entities in the aforementioned datasets. The statistics are divided according to the training, development, and test sets.

Model
We base our NER model on the pre-trained BERT model proposed by Devlin et al. (2019). Although the original recommendations suggest that unsupervised pre-training of BERT encoders is expected to be sufficiently powerful on modern datasets, we consider that adding extra Transformer layers could contribute to alleviating word errors and misspellings.
First, we use a pre-trained BERT model, and second, we stack n Transformer blocks on top, finalized with a CRF prediction layer. We refer to this model as BERT+n×Transf, where n is a hyperparameter referring to the number of Transformer layers. The global architecture of our model is depicted in Figure 1. We used Transformer blocks with parameters chosen empirically, similar to the configuration of the blocks in the fine-tuned model.
The reasons for using BERT models are that they can easily be fine-tuned for a wide range of tasks, but also that they produce high-performing systems (Devlin et al., 2019; Conneau and Lample, 2019; Radford et al., 2018). Nonetheless, despite the major impact of BERT in the NLP community, researchers question the ability of this model to deal with noisy text (Sun et al., 2020) unless complementary techniques are used (Muller et al., 2019; Pruthi et al., 2019).
More specifically, the built-in tokenizer of BERT first performs simple white-space tokenization, then applies a Byte Pair Encoding (BPE) based tokenization, WordPiece (Wu et al., 2016). For example, a word can be split into sub-word units (e.g. compatibility → 'com', '##pa', '##ti', '##bility'), where ## is a special symbol marking a sub-word that continues the preceding one.
Among the types of OCR errors that can be encountered in historical documents, character insertions have the least influence (Sun et al., 2020), because the sub-word tokenization of BERT does not change much in some cases, such as 'practically' → 'practicaally'. Meanwhile, substitution and deletion errors can hurt the performance of the tokenizer the most due to the generation of uncommon samples, such as 'professionalism' → 'pr9fessi9nalism', which is tokenized as 'pr', '##9', '##fes', '##si', '##9', '##nal', '##sm'. BERT has been shown to be sensitive to its sub-word segmentation for such words, as the resulting sub-words can dilute the meaning of the correctly spelled word (Sun et al., 2020). Thus, these new noisy tokens can affect the performance of BERT-based models.
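This fragmentation effect can be reproduced with WordPiece's greedy longest-match-first strategy. The sketch below uses a toy vocabulary of our own choosing (BERT's real vocabulary has around 30k entries), so the exact splits are illustrative:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first sub-word tokenization in the style of
    WordPiece. `vocab` is a toy stand-in for BERT's real vocabulary."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no vocabulary entry covers this character
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary: the clean word stays nearly whole, the corrupted one shatters.
vocab = {"practical", "practic", "pr", "##actic", "##al", "##ly", "##a"}
clean = wordpiece_tokenize("practically", vocab)   # ['practical', '##ly']
noisy = wordpiece_tokenize("practicaally", vocab)  # ['practic', '##a', '##al', '##ly']
```

The corrupted word yields twice as many, less meaningful pieces, which is exactly the segmentation sensitivity discussed above.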
On top of BERT, we add a stack of Transformer blocks (encoders). A Transformer block, as proposed by Vaswani et al. (2017), is a deep learning architecture based on multi-head attention mechanisms with sinusoidal position embeddings. It is composed of a stack of identical layers, each with two sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. A residual connection is applied around each of the two sub-layers, followed by layer normalization. In the original formulation, all sub-layers, as well as the embedding layers, produce outputs of dimension 512. In our implementation, we instead used learned absolute positional embeddings (Gehring et al., 2017), as is common practice; Vaswani et al. (2017) found that the two versions produce nearly identical results.
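As a minimal sketch of one such encoder layer (omitting dropout and the positional embeddings, which are added to the input before the first layer), the two sub-layers with their residual connections and layer normalization can be written as follows; the parameter names and dimensions are ours, chosen for illustration:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, params, n_heads):
    """One Transformer encoder layer: multi-head self-attention followed by a
    position-wise feed-forward network, each wrapped in a residual connection
    and layer normalization (post-norm, as in the original paper)."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project to queries/keys/values and split the hidden size into heads.
    q = (x @ params["Wq"]).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ params["Wk"]).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ params["Wv"]).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, seq, seq)
    heads = softmax(scores) @ v                             # (heads, seq, d_head)
    attn_out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ params["Wo"]
    x = layer_norm(x + attn_out)                            # residual + norm
    ffn = np.maximum(0.0, x @ params["W1"]) @ params["W2"]  # ReLU feed-forward
    return layer_norm(x + ffn)                              # residual + norm
```

Stacking n such blocks on top of the BERT output, before the CRF layer, is what the n×Transf notation denotes.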
We assume that the additional Transformer layers can alleviate the sensitivity of the built-in tokenizer of BERT to out-of-vocabulary words, OCR errors, and misspellings, and contribute to learning to find the informative words around entities.

Baseline
We chose as a baseline the model proposed by Ma and Hovy (2016), an end-to-end model combining a BiLSTM and a CNN character encoder, in order to take advantage of both word and character features. Character-level features are known to capture morphological and shape information (Kanaris et al., 2007; Santos and Zadrozny, 2014; dos Santos and Guimarães, 2015) and can also provide a representation for misspelled, custom, or abnormal words. For the baseline, we used the FastText pre-trained word embedding models (Grave et al., 2018).
Additionally, we analyze the benefit that can be brought by a larger available dataset by training the baseline model in two stages, in a transfer learning setting similar to the one in which the BERT encoder is used in our model:
1. pre-training, where the network is trained on a larger-scale available contemporary dataset;
2. fine-tuning, where the pre-trained network is further trained on the historical datasets.
The modern datasets are the following:
• For French, we use the fr-WikiNER dataset, extracted from Wikipedia articles. It contains about 500k tokens, of which around 31k are named entities.
• For German, we use the de-GermEval dataset, generated from German Wikipedia and news corpora as a collection of citations. The dataset covers over 31k sentences, corresponding to over 590k tokens, of which around 33k are named entities.
The FastText embeddings are available at https://fasttext.cc/docs/en/crawl-vectors.html. A more detailed description of the model and of its hyperparameters can be found in Ma and Hovy (2016).

Metrics
The evaluation of the NER task is done in a coarse-grained manner, with the entity (not the token) as the unit of reference (Makhoul et al., 1999). We compute precision (P), recall (R), and F1 measure (F1) at the micro level, i.e. error types are considered over all documents. Two evaluation scenarios were considered: micro-strict, which requires exact boundary matching, and micro-fuzzy, where a prediction is correct when there is at least one token of overlap (Ehrmann et al., 2020a). Further, statistical significance is measured through a two-tailed t-test, with an estimated p-value between 0.01 and 0.05.
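The two matching regimes can be sketched as follows. This is a simplified version of our own; the official HIPE scorer is more involved (it handles, among other things, multiple entity columns):

```python
def micro_prf(gold, pred, fuzzy=False):
    """Entity-level micro precision/recall/F1.

    `gold` and `pred` are lists of (start, end, type) token spans, flattened
    over all documents. Strict: exact boundaries and type must match.
    Fuzzy: at least one token of overlap, with a matching type.
    """
    def match(g, p):
        if g[2] != p[2]:
            return False
        if fuzzy:
            return max(g[0], p[0]) < min(g[1], p[1])  # spans share a token
        return g[0] == p[0] and g[1] == p[1]

    tp, unmatched_gold = 0, list(gold)
    for p in pred:
        for g in unmatched_gold:
            if match(g, p):
                tp += 1
                unmatched_gold.remove(g)  # each gold entity matches once
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A boundary error (e.g. predicting one token too few) counts as both a false positive and a false negative under the strict regime, but as a true positive under the fuzzy one.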

Data Pre-processing
The HIPE dataset was initially segmented at the article level. Since BERT can only consume a limited number of tokens as input (512), we segment the articles at the sentence level. We also reconstruct the original text, rejoining hyphenated words. The reconstructed text was passed through Freeling 4.1 (Padró and Stanilovsky, 2012) to obtain a sentence-based segmentation. We used the same segmentation for the baseline model. Moreover, for BERT+n×Transf, we feed the model with batches of same-sized inputs.
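The rejoining of hyphenated words can be sketched as below. This is a deliberately simplistic version of our own; a production pipeline would typically also validate the joined form, for instance against a lexicon:

```python
def rejoin_hyphenated(lines):
    """Rejoin words split across line breaks by end-of-line hyphenation,
    e.g. 'histori-' + 'cal documents' -> 'historical documents'."""
    merged = []
    for line in lines:
        line = line.strip()
        if merged and merged[-1].endswith("-"):
            # Drop the trailing hyphen and glue the next line directly on.
            merged[-1] = merged[-1][:-1] + line
        else:
            merged.append(line)
    return " ".join(merged)
```

Note that this naive rule would also join genuinely hyphenated compounds that happen to break at a line end, which is one reason a lexicon check helps.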

Hyperparameters
The hyperparameters used for both models are described as follows.
For the German NER, we chose bert-base-german-europeana as the pre-trained encoder. This BERT model has been used in other NER tasks for processing contemporary and historical German documents (Schweter and Baiter, 2019; Riedl and Padó, 2018). It was trained on a large collection of newspapers provided by the Europeana Library (http://www.europeana-newspapers.eu/). For the French NER, we rely on the large version of the pre-trained CamemBERT (Martin et al., 2020). For the English CoNLL dataset, we experimented with both the bert-base-cased and bert-large-cased pre-trained models presented in (Devlin et al., 2019).
We denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. bert-base-cased has L=12, H=768, A=12; bert-large-cased and camembert-large have L=24, H=1024, A=16. In all cases, the stacked Transformer blocks have L=1 for 1×Transf or L=2 for 2×Transf, with H=128 and A=12, chosen empirically. The BERT-based encoders are fine-tuned on the task during training.
For training, we followed the selection of parameters presented in (Devlin et al., 2019). We found that a learning rate of 2 × 10^-5 and a mini-batch size of 4 for German and English, and of 2 for French, provide the most stable and consistent convergence across all experiments, as evaluated on the development set.

Results
In this section, we provide experimental results for the baseline model and the proposed method. In order to assess the ability of both models to cope with the errors introduced by an OCR, we present several experiments:
• In Table 2, the first two experiments are performed with the baseline model, with and without the pre-training proposed by the transfer learning method on larger contemporary datasets.
• It is necessary to analyze how sensitive the proposed model is to the number of Transformer layers, the hyperparameter n. Therefore, in the same Table 2, we also report an ablation study with n ∈ {0, 1, 2}; values of n > 2 obtained lower performance and had a tendency to overfit.
• In Table 3, the results for the baseline model without any transfer learning (as it was unnecessary) are presented, along with the same ablation study for BERT+n×Transf.
From the results in Table 2, we can see that the BERT+n×Transf models achieve, for both datasets and languages, higher micro-fuzzy and micro-strict performance values than the stand-alone BERT model and the baseline. Moreover, they generally manage to maintain a balance between recall and precision, while the baseline models vary depending on the language. We also notice that, while in general both models obtain a more or less balanced precision-recall trade-off, there are two cases of large imbalance, specifically on the NEWSEYE German dataset: BERT+n×Transf shows only a 20 percentage point difference between precision and recall, while the baseline suffers from a 40 point difference.
In the context of transfer learning applied to the baseline models, two performance results, for NEWSEYE in German and for HIPE in French, are higher due to the fine-tuning on these datasets, while the others are not degraded by the pre-training on larger contemporary datasets. This observation confirms previous studies on this type of model regarding their robustness to misspellings (Sun et al., 2020; Pruthi et al., 2019). We also notice that, for both German datasets, the results for transfer learning from contemporary German datasets are statistically significant (p < 0.01), while the performance difference for both French datasets was minimal (< 0.5 F1 points for French NEWSEYE and < 0.9 for French HIPE).
(Figure 2, upper part, shows the following French OCR excerpt: "Allemagne que pas iJKeaz Schwietz, le bour,-ijeau de Br eslasi. _Piappi les pliante fameux de Reindel _figurjal la femme _JVie & e, l'horrible mégère de Hambour'g, qui assassina une vingtaine d'enfents confiés à _jses soins mercenaires.")
In the context of modern data, in Table 3, the F1 values of the stand-alone BERT model applied on the CoNLL 2003 dataset closely correspond to the ones reported in (Devlin et al., 2019) (the authors report an F1 of 92.4% for bert-base-cased and 92.8% for bert-large-cased). While the stand-alone F1 differs only by a very small margin from (Devlin et al., 2019), the performance results for BERT+n×Transf slightly increased for both proposed models. We assume that one reason is that the representational capacity of the extra Transformer layers can contribute to a modest improvement, even in a context where no misspelling errors are present. While this improvement is more visible for bert-base-cased+1×Transf (a difference of half a percentage point) and bert-base-cased+2×Transf (0.3 percentage points), for bert-large-cased the values remain essentially unchanged (a difference of 0.2 percentage points from BERT).

Discussion
For a more qualitative analysis, we examine the number of words unrecognized by the pre-trained BERT-based models that had to be added to the specific tokenizers (WordPiece for BERT and SentencePiece for CamemBERT). For NEWSEYE German, 8.84% of the total number of words in the vocabulary needed to be fully trained, while only 0.14% were unknown in the HIPE dataset. Following this observation, we notice that there is a large F1 margin between BERT+CRF and BERT+n×Transf (63.4% in comparison with 73.5% and 72.6%, respectively), a fact that could be explained by the large percentage of unknown words.
Moreover, for German, even though the BERT encoder was pre-trained on a digitized historical dataset (bert-base-german-europeana), the proposed model contributed greatly to the coverage of the misspelled or abnormal words present in NEWSEYE. For French, the results vary by around 1-2 F1 percentage points between the stand-alone BERT and the BERT+n×Transf models.
Of the two datasets, only HIPE was also annotated with the Levenshtein ratio between the gold standard entities and the transcribed ones. In Figure 3, we compare BERT and BERT+n×Transf by analyzing the number of correctly predicted entities with respect to the Levenshtein distance. For the French predictions, for 56.25% of the different values of the distance, the stacked models had relatively more correct predictions. A French example of a misspelled entity that is recognized by both BERT+n×Transf models but not by BERT is presented in the upper part of Figure 2. For German, the stacked models have more correctly identified misrecognized entities in only 18.75% of the cases.
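For reference, the Levenshtein distance and one common normalization of it can be computed as below; the exact ratio definition used in the HIPE annotations may differ, so this sketch is ours:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning
    string a into string b, via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_ratio(a, b):
    """Similarity in [0, 1] between a gold entity and its OCR transcription."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))
```

For instance, the OCR form "Hambour'g" is one insertion away from "Hambourg", so its ratio stays high, whereas heavily garbled entities fall toward 0.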
We also presume that the additional parameters introduced by the stacked Transformers can increase the ability of the architecture to model long-range contexts. Thus, we analyzed the correctly predicted German and French HIPE entities by their length. We noticed that BERT+n×Transf is better than BERT at predicting entities composed of multiple tokens (long entities). For example, for French HIPE, out of 170 entities with a length of five tokens or more, the stand-alone BERT managed to correctly detect 70% of them, while both BERT+n×Transf models correctly identified 72.94%. German HIPE has fewer entities longer than five tokens, more exactly 97, and while the stand-alone BERT detected 50.51% of them, the BERT+n×Transf models correctly detected and classified 55.67% for n = 1 and 54.63% for n = 2. In the examples from Table 4, our method frequently predicted the full entity, while the stand-alone BERT only predicted a part of it.
Analyzing the French predictions for BERT and BERT+n×Transf, we observed that BERT detects on average 75.04% of the entities of size 1 to 10, with the other models performing slightly better. However, for entities with more than 10 tokens, there is a clear difference: BERT detects 55.54% of the entities, while BERT+1×Transf detects 57.13% and BERT+2×Transf reaches 82.52%. Examples are given in Table 4. (The length of German HIPE entities ranges from one to 16 tokens.) In the lower part of Figure 2, we present a German example where BERT becomes confused and predicts multiple partial spurious entities in a sentence. One can also observe that these entities are of two of the most common types in the dataset, persons (PERS) and locations (LOC). In this case, there is an overprediction of these types, which leads us to the interpretation that BERT is sensitive to misspellings and might overfit on OCR-related patterns. This observation supports the claim that BERT pays unbalanced attention to misspelled or corrupted words when the most informative words contain such errors (Sun et al., 2020). To assess these assumptions, in Figure 4, we compare, per model and language, the values of micro-fuzzy F1 and macro-fuzzy F1 on the HIPE corpus. We include, as well, the number of spurious cases, i.e. tokens that were considered as an entity despite not belonging to one, such as 'Zusammenziehung' in Figure 2. From the difference between micro and macro metrics, we can ascertain that the three presented models focused on predicting the most frequent entity types, i.e. PERS and LOC. Moreover, we can see that BERT achieved its result by creating more spurious cases in comparison to BERT+n×Transf. This could mean that BERT learned that overprediction was a straightforward way to achieve better results. In the case of BERT+n×Transf, we can see that the Transformer layers made the models more conservative and, at the same time, more accurate in their predictions.

Conclusions and Future Work
We presented a deep learning architecture for NER based on stacked Transformer layers, comprising a fine-tuned BERT encoder and several additional Transformer blocks. Results on two historical datasets in French and German showed the suitability of the proposed model for processing noisy digitized text corpora in distinct languages. At the same time, the approach did not degrade performance on modern data. Thus, this type of model appears to be well suited for NER on historical document collections.
While the improvements brought by the proposed NER model are clear, our analysis highlighted several factors that could influence the results, and further analysis remains to be done. Hereafter, we will investigate detailed variations of our architecture. In addition, we intend to explore data augmentation techniques, simulating digitized data by adding noise to digitally-born documents. This could be a solution to increase the size and expand the diversity of training datasets for performing NLP tasks over historical documents.