Multi-modular domain-tailored OCR post-correction

One of the main obstacles for many Digital Humanities projects is low data availability. Texts have to be digitized in an expensive and time-consuming process, in which Optical Character Recognition (OCR) post-correction is one of the time-critical factors. Using OCR post-correction as an example, we show how a generic system can be adapted to solve a specific problem with little data. The system accounts for the diversity of errors encountered in OCRed literary texts from different time periods. We show that the combination of different approaches, such as Statistical Machine Translation and spell checking, by means of a ranking mechanism improves tremendously over individual approaches. Since we consider the accessibility of the resulting tool a crucial part of Digital Humanities collaborations, we also describe the workflow we suggest for efficient text recognition and subsequent automatic and manual post-correction.


Introduction
Humanities are no longer just the realm of scholars turning the pages of thick books. As the worlds of humanists and computer scientists begin to intertwine, new methods to revisit known ground emerge, and options to widen the scope of research questions become available. Moreover, the nature of language encountered in such research attracts the attention of the NLP community (Kao and Jurafsky (2015), Milli and Bamman (2016)). Yet, the basic requirement for the successful implementation of such projects often poses a stumbling block: large digital corpora comprising the textual material of interest are rare. Archives and individual scholars are in the process of improving this situation by applying Optical Character Recognition (OCR) to the physical resources. In the Google Books project, books are being digitized on a large scale. But even though collections of literary texts like Project Gutenberg exist, these collections often lack the texts of interest to a specific question. As an example, we describe the compilation of a corpus of adaptations of Goethe's Sorrows of the Young Werther, which allows for the analysis of character networks throughout the publishing history of this work.

The success of OCR is highly dependent on the quality of the printed source text. Recognition errors, in turn, impact the results of computer-aided research (Strange et al., 2014). Especially for older books set in hard-to-read fonts and printed on stained paper, the output of OCR systems is not good enough to serve as a basis for Digital Humanities (DH) research. It needs to be post-corrected in a time-consuming and cost-intensive process.

We describe how we support and facilitate the manual post-correction process with the help of informed automatic post-correction. To account for the problem of relative data sparsity, we illustrate how a generic architecture agnostic to a specific domain can be adjusted to text specificities such as genre and font characteristics by including just small amounts of domain-specific data. We suggest a system architecture (cf. Figure 1) with trainable modules which joins general and specific problem solving as required in many applications. We show that the combination of modules via a ranking algorithm yields results far above the performance of single approaches.

Figure 1: Multi-modular OCR post-correction system.
We discuss the point of departure for our research in Section 2 and introduce the data we base our system on in Section 4. In Section 5, we illustrate the most common errors and motivate our multi-modular, partly customized architecture. Section 6 gives an overview of the techniques included in our system and the ranking algorithm. In Section 7, we discuss results, the limitations of automatic post-correction and the influence that the amount of training data has on the performance of such a system. Finally, Section 8 describes a way to efficiently integrate the results of our research into a digitization workflow, as we see easy accessibility of computer aid as a central point in Digital Humanities collaborations.

Related work
There are two obvious ways to automatically improve the quality of digitized text: optimization of OCR systems or automatic post-correction. Commonly, OCR utilizes only basic linguistic knowledge such as the character set of a language or the reading direction. The focus lies on the image recognition aspect, which is often addressed with artificial neural networks (cf. Graves et al. (2009), Desai (2010)). Post-correction focuses on the correction of errors in their linguistic context. It thus allows for the purposeful inclusion of knowledge about the text at hand, e.g. genre-specific vocabulary. Nevertheless, post-correction has predominantly been tackled in an OCR-system-agnostic way, as outlined below. As an advantage, post-correction can also be applied when no scan or physical resource is available.

There have been attempts towards shared datasets for evaluation. Mihov et al. (2005) released a corpus covering four different kinds of OCRed text comprising German and Bulgarian. However, in 2017 the corpus was untraceable and no recent research relating to the data could be found. OCR post-correction is applied in a diversity of fields in order to compile high-quality datasets. This is reflected not merely in the heterogeneity of techniques but in the evaluation metrics as well. While accuracy has been widely used as an evaluation measure in OCR post-correction research, Reynaert (2008a) advocates the use of precision and recall in order to improve the transparency of evaluations. Depending on the paradigm of the applied technique, even evaluation measures like the BLEU score can be found (cf. Afli et al. (2016)).
Since shared tasks are a good opportunity to establish certain standards and facilitate the comparability of techniques, the Competition on Post-OCR Text Correction organized in the context of ICDAR 2017 could mark a milestone for more unified OCR post-correction research efforts.

Regarding the techniques used for OCR post-correction, there are two main trends: statistical approaches utilizing error distributions inferred from training data and lexical approaches oriented towards the comparison of source words to a canonical form. Combinations of the two approaches are also available. Techniques residing in the statistical domain have the advantage that they can model specific distributions of the target domain if training data is available. Tong and Evans (1996) approach post-correction as a statistical language modeling problem, taking context into account. Pérez-Cortes et al. (2000) employ a stochastic finite-state automaton along with a modified version of the Viterbi algorithm to perform stochastic error-correcting parsing. Extending the simpler stochastic context-sensitive models, Kolak and Resnik (2002) apply the first noisy channel model, using edit distance from noisy to corrected text on character level. In order to train such a model, manually generated training data is required. Reynaert (2008b) suggests a corpus-based correction method, taking spelling variation (especially in historical text) into account. Abdulkader and Casey (2009) introduce an error estimator neural network that learns to assess error probabilities from ground-truth data; words with high estimated error probability are then suggested for manual correction. This decreases the time needed for manual post-correction since correct words do not have to be considered as candidates for correction by the human corrector. Llobet et al. (2010) combine information from the OCR system output, the error distribution and the language as weighted finite-state transducers. Reffle and Ringlstetter (2013) use global as well as local error information to be able to fine-tune post-correction systems to historical documents. Related to the approach introduced by Pérez-Cortes et al. (2000), Afli et al. (2016) use statistical machine translation for error correction, applying the Moses toolkit on character level. Volk et al. (2010) merge the output of two OCR systems with the help of a language model to increase the quality of OCR text. The corpus of yearbooks of the Swiss Alpine Club, which has been manually corrected via crowdsourcing (cf. Clematide et al. (2016)), is available from their website.

Lexical approaches often use rather generic distance measures between an erroneous word and a potential canonical lexical item. Strohmaier et al. (2003) investigate the influence of the coverage of a lexicon on the post-correction task. Considering the fact that writing in historical documents is often not standardized, the success of such approaches is limited. Moreover, systems based on lexicons rely on the availability of such resources. Historical stages of a language, which constitute the majority of texts in need of OCR post-correction, often lack such resources or provide incomplete lexicons, which drastically decreases the performance of spell-checking-based systems. Ringlstetter et al. (2007) address this problem by suggesting a way to dynamically collect specialized lexicons for this task. Takahashi et al. (1990) apply spelling correction with preceding candidate word detection.
Bassil and Alwani (2012) use Google's online spelling suggestions for post-correction, as these draw on a huge lexicon based on content gathered from all over the web. The human component as final authority has been mentioned in some of these projects. Visual support of the post-correction process has been emphasized by e.g. Vobl et al. (2014), who describe a system for iterative post-correction of OCRed historical text which is evaluated in an application-oriented way. They present the human corrector with an alignment of image and OCRed text and make batch correction of the same error in the entire document possible. They show that the time needed by human correctors decreases considerably.

Evaluation metrics
We describe and evaluate our data by means of word error rate (WER) and character error rate (CER). These error rates are commonly used metrics in speech recognition and machine translation evaluation and can also be viewed as length-normalized edit distances. They quantify the number of operations, namely insertions, deletions and substitutions, needed to transform the suggested string into the manually corrected string, and are computed as follows:

\[ \mathrm{WER} = \frac{\#\,\text{word insertions} + \#\,\text{word substitutions} + \#\,\text{word deletions}}{\#\,\text{words in the reference}} \]

\[ \mathrm{CER} = \frac{\#\,\text{character insertions} + \#\,\text{character substitutions} + \#\,\text{character deletions}}{\#\,\text{characters in the reference}} \]
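As an illustration, the following Python sketch computes both metrics as length-normalized edit distances. It is a minimal reference implementation of the formulas above, not the evaluation code used in our experiments.

```python
# Minimal sketch: WER/CER as length-normalized Levenshtein distance
# between a hypothesis (OCR or corrected output) and a reference.
def edit_distance(hyp, ref):
    """Number of insertions, deletions and substitutions turning hyp into ref."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(hyp)][len(ref)]

def wer(hypothesis, reference):
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)

def cer(hypothesis, reference):
    return edit_distance(list(hypothesis), list(reference)) / len(reference)

print(wer("Die Leidcn des jungen Werthers", "Die Leiden des jungen Werthers"))
```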

Data
As mentioned in the introduction, errors found in OCRed texts are specific to the time of origin, the quality of the scan and even the characteristics of a specific text. Our multi-modular architecture paves the way for a solution that takes this into account by including general as well as specific modules. Thus, we suggest including domain-specific data as well as larger, more generic data sets in order to enhance the coverage of vocabulary and possible error classes. The data described hereafter constitutes parallel corpora of OCR output and manually corrected text which we utilize for training statistical models.

The Werther corpus
Since our system is developed to help in the process of compiling a corpus comprising adaptations of Goethe's The Sorrows of Young Werther across different text types and centuries, we collected texts from this target domain. To be able to train a specialized system, we manually corrected a small corpus of relevant texts (cf. Table 2). We use the output of Abbyy FineReader 7 for several Werther adaptations (Table 1), all based on scans of books set in German Gothic lettering.

The Deutsches Textarchiv (DTA) corpus
Even though manual OCR post-correction is a vital part of many projects, only very little detailed documentation of this process exists. Das Deutsche Textarchiv (the German Text Archive, DTA) is one of the few projects providing detailed correction guidelines along with the scans and the text corrected within the project (Geyken et al., 2012). This allows the compilation of a comprehensive parallel corpus of OCR output and corrected text spanning a period of four centuries (17th to 20th) in German Gothic lettering. For OCR, we use the open-source software tesseract (Smith and Inc, 2007), which comes with recognition models for Gothic fonts.

Gutenberg data for language modeling
Since the output of our system is supposed to consist of well-formed German sentences, we need a method to assess the quality of the output language. This task is generally tackled by language modeling. We compiled a collection of 500 randomly chosen texts from Project Gutenberg comprising 28,528,078 tokens. With its relative closeness to our target domain, it constitutes the best available approximation of the target language. The language model is trained with the KenLM toolkit (Heafield, 2011) with order 5 on token level and order 10 on character level, following De Clercq et al. (2013).
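For illustration, a minimal sketch of how such models could be queried with the kenlm Python bindings is given below. The file names and the underscore convention for word boundaries in the character-level model are assumptions made for the example, not details taken from our setup; the models themselves are assumed to have been trained beforehand (order 5 on tokens, order 10 on characters).

```python
# Illustrative only: querying token- and character-level KenLM models.
import kenlm

token_lm = kenlm.Model("gutenberg.tok.5gram.arpa")    # placeholder file name
char_lm = kenlm.Model("gutenberg.char.10gram.arpa")   # placeholder file name

def token_score(sentence):
    # log10 probability of the token sequence
    return token_lm.score(sentence, bos=True, eos=True)

def char_score(sentence):
    # the character-level model is queried on space-separated characters;
    # "_" marks original word boundaries (a common convention, assumed here)
    chars = " ".join("_".join(sentence.split()))
    return char_lm.score(chars, bos=True, eos=True)

print(token_score("Die Leiden des jungen Werthers"))
print(char_score("Die Leiden des jungen Werthers"))
```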

Why OCR post-correction is hard
In tasks like the normalization of historical text (Bollmann et al., 2012) or of social media, one can take advantage of regularities in the deviations from the standard form that appear throughout an entire genre or, in the case of social media, e.g. a dialect region (Eisenstein, 2013). Errors in OCR, however, depend on the font and the quality of the scan as well as the time of origin, which makes each text unique in its composition of features and errors.
In order to exemplify this claim, we analyzed three different samples: Lorenz Konau (1776), Werther der Jude (1910) and a sample from the DTA data. Figures 2a-c illustrate the point that the quality of the scan is crucial for OCR success. Figure 2a shows a text from the 20th century where the typesetting is rather regular and the distances between letters are uniform, as opposed to Figure 2b. Figure 2c shows how the writing from the back of the page shines through and makes the script less readable. Thus, we observe a divergence in the frequency of certain character operations between these texts: the percentage of substitutions ranges from 74% for Lorenz Konau to 60% for Werther der Jude, with 18% and 30% insertions, respectively. The varying percentage of insertions might be due to the fact that some scans are more "washed out" than others. Successful insertion of missing characters, however, requires that a system knows a lot of actual words and sentences in the respective language; it cannot be resolved via character similarity, as in the substitution of l for t. Another factor that complicates the correction of a specific text is the number of errors per word. Words with an edit distance of one to the correct version are easier to correct than those requiring more than one operation. With respect to errors per word, our corpus shows significant differences in error distributions. Especially in our DTA corpus, the number of words with two or more character-level errors is considerably higher than the number of words with one error. For Werther der Jude (WER: 10.0, CER: 2.4), the number of errors in general is much lower than for Lorenz Konau (WER: 34.7, CER: 10.9). These characteristics indicate that subcorpus-specific training of a system is promising.

Specialized multi-modular post-correction
In order to account for the nature of errors that can occur in OCR text, we apply a variety of modules for post-correction. The system proceeds in two stages and is largely based on an architecture suggested by Schulz et al. (2016) for the normalization of user-generated content. In the first stage, a set of specialized modules (Section 6.1) suggests corrected versions for the OCR text lines, tokenized with the TreeTagger tokenizer (Schmid, 1997). Those modules can be context-independent (working on just one word at a time) or context-dependent (processing an entire text line at a time). The second stage is the decision phase. After collecting various suggestions per input token, these have to be ranked to enable a decision for the most probable output token given the context. We achieve this by assigning weights to the different modules with the help of Minimum Error Rate Training (MERT) (Och, 2003).

Suggestion modules
In the following, we give an outline of the techniques included in our system.

Word level suggestion modules
• Original: the majority of words do not contain any kind of error, thus we want to have the initial token available in our suggestion pool
• Spell checker: spelling correction suggestions for misspelled words with hunspell
• Compounder: merges two tokens into one token if the result is evaluated as an existing word by hunspell
• Word splitter: splits one token into two words using the compound-splitter module from the Moses toolkit (Koehn et al., 2007)
• Text-Internal Vocabulary: extracts high-frequency words from the input texts and suggests them as corrections for words within a small adjusted Levenshtein distance

The compounding and word-splitting techniques react to the variance in manual typesetting, where the distances between letters vary. This makes word boundary recognition difficult (cf. Figure 3). A problem related to the spell-checking approach is the limited coverage of the dictionary, since it uses a modern German lexicon. Related to this is the above-average number of out-of-vocabulary words in literary text. Archaic words, e.g. from the 17th century, or named entities cannot be found in a dictionary and can therefore not be covered by any of the approaches mentioned above.
However, named entities especially are crucial for the automatic or semi-automatic analysis of narratives, e.g. with the help of network analysis. Our Text-Internal Vocabulary technique is designed to find frequent words in the input text, following the assumption that errors are not regular enough to distort those frequencies. We compile a list of these high-frequency words. Erroneous words can subsequently be corrected by calculating an OCR-adjusted Levenshtein distance. In this way, misspelled words like Loveuzo can be resolved to Lorenzo if this name appears frequently. Since the ranking algorithm relies on a language model which will most probably not contain those suggestions, we insert the high-frequency words into the language modeling step.
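The following sketch illustrates the general idea under our own assumptions: the confusion pairs, the frequency threshold and the distance threshold are illustrative placeholders rather than the values used in the system.

```python
# Sketch of the text-internal vocabulary idea: frequent in-text words serve
# as correction targets, and an OCR-adjusted Levenshtein distance makes
# typical confusions (e.g. long s vs. f) cheaper than arbitrary substitutions.
from collections import Counter

CHEAP_SUBS = {("f", "s"), ("s", "f"), ("l", "t"), ("t", "l"), ("u", "n"), ("n", "u")}

def ocr_adjusted_distance(a, b, cheap_cost=0.5):
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in CHEAP_SUBS:
                sub = cheap_cost       # frequent OCR confusion, reduced cost
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(a)][len(b)]

def build_vocabulary(tokens, min_freq=5):
    # collect frequent alphabetic words from the input text itself
    counts = Counter(t for t in tokens if t.isalpha())
    return {w for w, c in counts.items() if c >= min_freq}

def suggest(word, vocabulary, max_dist=1.5):
    candidates = [(ocr_adjusted_distance(word.lower(), v.lower()), v) for v in vocabulary]
    best = min(candidates, default=(None, None))
    return best[1] if best[0] is not None and best[0] <= max_dist else None

vocab = build_vocabulary(["Lorenzo"] * 6 + ["und", "der"] * 10, min_freq=5)
print(suggest("Loveuzo", vocab))  # -> "Lorenzo" if within the distance threshold
```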

Sentence level suggestion modules
As suggested by Afli et al. (2016), we include phrase-based Statistical Machine Translation (SMT) in our system. We treat post-correction as a translation problem, translating from erroneous to correct text. As in standard SMT, we train our models on a parallel corpus, the source language being the OCRed text and the target language being the manually corrected text. We train models on token level as well as on character level (unigram). This way, we aim at correcting frequently misrecognized words along with frequent character-level errors. We train four different systems:

• token level
  - domain-specific data (cf. Section 4.1)
  - general data (cf. Section 4.2)
• character level
  - domain-specific data (cf. Section 4.1)
  - general data (cf. Section 4.2)

The models are trained with the Moses toolkit (Koehn et al., 2007). Moreover, we use a cascaded approach by forwarding the output of the character-based SMT model to the token-based SMT model.
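As an illustration of the character-level setup, the sketch below shows one common way to prepare a parallel corpus for character-level SMT training: spaces are replaced by an underscore and characters are separated by blanks, so standard token-level tooling can be applied unchanged. The file names are placeholders, and the exact preprocessing used in our system may differ.

```python
# Hedged sketch of data preparation for character-level ("unigram") SMT.
def to_char_level(line):
    # "dcr alte Mann" -> "d c r _ a l t e _ M a n n"
    return " ".join(line.strip().replace(" ", "_"))

def from_char_level(line):
    # inverse transformation applied to the decoder output
    return line.strip().replace(" ", "").replace("_", " ")

with open("ocr.train.txt", encoding="utf-8") as src, \
     open("gold.train.txt", encoding="utf-8") as tgt, \
     open("ocr.train.char", "w", encoding="utf-8") as src_out, \
     open("gold.train.char", "w", encoding="utf-8") as tgt_out:
    for ocr_line, gold_line in zip(src, tgt):
        src_out.write(to_char_level(ocr_line) + "\n")
        tgt_out.write(to_char_level(gold_line) + "\n")
```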

Additional feature
The information whether a word contains an error can help to avoid the incorrect alteration of an initially correct word (overcorrection). In order to deliver this information to the decision module without making a hard choice for each word, we include the information whether a word has been found, in combination with either the word before or after it, in a corpus (cf. Section 4.3) as a feature that is weighted along with the other modules. This naive language modeling approach allows for a context-sensitive decision on the correctness of a word.
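A toy version of this lookup feature is sketched below; the corpus handling is deliberately simplified and the example sentences are invented for illustration.

```python
# Toy version of the naive language-modelling feature: a word counts as
# "seen" if it occurs next to its left or right neighbour anywhere in the
# background corpus (Section 4.3).
def build_bigrams(corpus_tokens):
    return set(zip(corpus_tokens, corpus_tokens[1:]))

def seen_in_context(tokens, i, bigrams):
    left = (tokens[i - 1], tokens[i]) in bigrams if i > 0 else False
    right = (tokens[i], tokens[i + 1]) in bigrams if i + 1 < len(tokens) else False
    return left or right

corpus = "es war einmal ein armer Mann".split()
bigrams = build_bigrams(corpus)
# "armer" is attested next to "ein" in the corpus, so the feature fires
print(seen_in_context("ein armer Maun".split(), 1, bigrams))  # True
```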

Decision modules: the ranking mechanism
Since the recognition errors appearing in a text are hard to pre-classify by nature, we run all modules on each sentence of the input, returning suggestions for each word. Since some of our modules output entire sentences, input sentence and output sentence have to be word-aligned in order to obtain suggestions on word level. The word alignment between input and output sentence is done with the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), an algorithm originally developed in bioinformatics (a sketch follows below). From all corrected suggestions, the most probable well-formed combination has to be chosen.
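A compact word-level Needleman-Wunsch alignment might look as follows; the match, mismatch and gap scores are generic choices for the sketch, not necessarily the ones used in our system.

```python
# Illustrative Needleman-Wunsch alignment of an input line with a module's
# output line on word level; None marks a gap (insertion or deletion).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # trace back to recover the aligned word pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + \
                (match if a[i - 1] == b[j - 1] else mismatch):
            pairs.append((a[i - 1], b[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return list(reversed(pairs))

print(needleman_wunsch("dcr alte Mann".split(), "der alte Mann".split()))
# [('dcr', 'der'), ('alte', 'alte'), ('Mann', 'Mann')]
```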
To solve the combinatorial problem of deciding which suggestion is the most probable candidate for a word, the decision module makes use of the Moses decoder. As in general SMT, the decoder uses a language model (cf. Section 4.3) and a phrase table. The phrase table is compiled from all input words along with all possible correction suggestions. In order to assign weights to the single modules and the language model, we tune on the phrase tables collected from a run on our development set, following the assumption that suggestions of certain modules are more reliable than others, and expect their feature weights to be higher after tuning.
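As a rough sketch of how such a phrase table could be assembled, the snippet below emits one entry per (input token, suggestion) pair with a simple per-module indicator feature. The module list and the feature scheme are our own assumptions for illustration, and the Moses phrase-table format is abbreviated.

```python
# Rough sketch: build per-sentence phrase-table entries from the suggestion pool.
MODULES = ["original", "spell", "compound", "split", "text_vocab",
           "smt_tok_specific", "smt_tok_general", "smt_char_specific", "smt_char_general"]

def phrase_table_entries(suggestions):
    """suggestions: dict mapping an input token to {module_name: suggested token}."""
    entries = []
    for source, per_module in suggestions.items():
        for target in set(per_module.values()):
            # one feature per module: 1 if the module proposed this target,
            # a small positive constant otherwise (Moses log-transforms scores)
            feats = ["1" if per_module.get(m) == target else "0.1" for m in MODULES]
            entries.append(f"{source} ||| {target} ||| {' '.join(feats)}")
    return entries

pool = {"Loveuzo": {"original": "Loveuzo", "spell": "Lorenz", "text_vocab": "Lorenzo"}}
print("\n".join(phrase_table_entries(pool)))
```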

Experimental Setup
To guarantee diversity, we split each of texts 1-4 for training and testing. The first test set consists of a text that is not part of the training data in any form. It constitutes an evaluation in which no initial manual correction as support for the automatic correction is included in the workflow. We henceforth call this unknown set test_unk (text 6). In contrast, the second set contains parts of the same texts as the training data, thus specific vocabulary might already have been introduced. The results for this test set give a first indication of the extent to which pre-informing the system by manually correcting parts of a text could assist the automatic correction process. Since this scenario can be described as text-specifically initiated post-correction, we henceforth refer to this test set as test_init.
We further experiment with an extended training set train_ext (train plus texts 7 and 8) to assess the influence of the size of the specific training set on the overall performance. The sizes of the datasets before and after correction, along with WER and CER, are summarized in Table 2. The corresponding figures for the general dataset are summarized in Table 3.

Evaluation
In the following, we concentrate on the comparison of WER and CER before and after automatic post-correction. As a baseline for our system, we chose the strongest individual module (SMT on character level trained on Werther data).

Table 4: WER and CER for both test sets before and after automatic post-correction for the system trained with the small training set (train) and the larger training set (train_ext). Baselines: the original text coming from the OCR system and the character-level SMT system trained on the Werther data.
Overall performance. As indicated previously, our test sets differ with respect to their similarity to the training set. The results for both test scenarios for systems trained on our two training sets are summarized in Table 4. The results for test_init and test_unk show that our system performs considerably better than the baseline and clearly improves the quality of the OCR output. For test_unk, the system improves the quality by almost 20 points of WER, from 36.7 to 15.4, and over 10 points of CER, from 30.0 to 19.6. For test_init, our system improves the quality of the text with a reduction of approximately 20 points of WER, from 23.5 to 4.7, and 7 points of CER, from 15.1 to 8.0. It is not surprising that the decrease in WER is stronger than the decrease in CER: many words contain more than one error and require more than one character-level operation to get from the incorrect to the correct string.
Only a slight improvement can be achieved by adding training material to the Werther-specific parts of the system (cf. the train_ext rows of Table 4). Merely the CER improves, whereas the WER stays about the same. The improvement is larger for test_unk than for test_init.
Module-specific analysis. Since a WER and CER evaluation is not meaningful for all modules, as they were designed to correct specific problems rather than all of them, we examine the specialized modules in terms of correct suggestions contributed to the suggestion pool and correct suggestions made by only one module (unique suggestions). As the system including the extended training set train_ext delivered slightly better results, in the following we describe the contribution of the single modules to the overall performance of this system (cf. Table 5). For test_unk, the number of corrected tokens along with the number of overcorrections is higher than for test_init throughout all modules. Clearly, for test_init the Werther-specific modules are strongest. The more general modules prove useful for test_unk: the number of corrected words increases for the SMT module trained on DTA data on character level. The usefulness of the module extracting specific words (text-internal vocabulary), as well as the general SMT model and the spell checker, becomes evident in terms of the unique suggestions contributed by those modules. The analysis of the output of the individual modules and their contribution to the overall system uncovers an issue: the modules that produce a high number of incorrect suggestions, thus overcorrecting actually correct input tokens, are at the same time the only modules producing correct suggestions for some of the incorrect input words. Consequently, these uniquely suggested corrections are often not chosen by the decision module due to the overall weak performance of the module that proposed them. Yet such suggestions are often crucial to the texts, like the suggestions by the text-internal vocabulary module, which contain named entities or words specific to the time period. For our test_unk set, the text-internal vocabulary module yields around 60 unique suggestions, out of which 15 are names (Friedrich, Amalia) or words truly specific to the text (Auftrit, spelled with one t instead of two).
Challenges. In the context of literature, OCR post-correction is a challenging problem since the texts themselves can be considered non-standard text. The aim is not to bring the text at hand into an agreed-upon standard form but to digitize exactly what is contained in the print version. This can be far from the standard form of a language. In one of our texts, we find a character speaking German with a strong dialect. Her speech contains many words that are incorrect in standard German; however, the goal is to preserve these "errors" in the digital version. Thus, correction merely on the basis of the OCR text, without consulting the printed version or a facsimile image, can essentially never be perfect. It follows that the integration of automatic post-correction techniques into the character recognition process could lead to further improvements.

Adaptability
Reusability as a key concept in NLP for DH originates in the time limitations of such projects. Since DH projects do not revolve around the development of tools but around the analysis performed with the help of these tools in order to answer a specific question, the tools are expected to be delivered in an early phase of collaborative projects. From-scratch development easily exceeds these time limits. We show that our OCR post-correction system is modular enough to be adjusted to correct texts from other languages by training it for two further languages, English and French, with data released in the OCR post-correction competition organized in the context of ICDAR 2017 (https://sites.google.com/view/icdar2017-postcorrectionocr/home, accessed 3.07.2017). The texts originate from the last four centuries, come from different collections, and have therefore been digitized using different OCR systems. The data is summarized in Table 6. Note that our test set does not comply with the official shared-task test set, since the manually corrected data was not yet available for the latter; we test on a combination of periodicals and monographs. We adjust our system to each language by retraining the SMT models and including spell checkers for the respective languages. Due to the modular architecture, these adjustments can be made easily.

The strongest individual module for these two languages is the cascaded combination of the character-level SMT and the token-level SMT models (Cascaded). For English, it performs just slightly worse in terms of WER and even outperforms the overall system in terms of CER. For French, the overall system is clearly stronger than the cascaded SMT system, with more than 1 point of improvement in WER, but performs worse in terms of CER by 1.5 points. Overall, the OCR post-correction system achieves about a 25% reduction in WER for English and over a 30% reduction in WER for French.

Digitization workflow
We integrate the automatic OCR process with tesseract and our automatic post-correction system into a workflow which results in an hOCR file, an XML format readable by PoCoTo (Vobl et al., 2014), a tool for supporting manual post-correction of OCRed text through the alignment of image and digitized text. The upload of scans or images is provided online via a web application (http://clarin05.ims.uni-stuttgart.de/ocr/; for access please contact the author). This shields the user from the technicalities of the correction process and provides them with the input for the PoCoTo tool. The implementation of an easy-to-handle workflow is an often underemphasized aspect of DH. It needs to be intuitive enough not to absorb the time that has been saved via automation. Since the final post-correction step requires that the human corrector compares the digitized version with the scan, presenting both next to each other is an ideal scenario. This functionality is one of the main strengths of PoCoTo, a visual correction tool supporting manually initiated correction operations and batch correction of the same error.

Conclusion
We have shown that enhancing a general, adaptable architecture with small but specific data sets can improve results within a specific domain. Moreover, the combination of different techniques for OCR post-correction is significantly superior to single techniques. Especially the integration of SMT models on token level and character level contributes to the overall success of the system. Due to the complexity of OCR post-correction, there cannot be one general solution. Even though the ranking algorithm achieves large improvements, further potential lies in the inclusion of fine-tuned language models, since the decision process highly depends upon them. The intrinsic characteristic of literature as non-standard text complicates the task. However, techniques that focus on these features, like our module specialized in extracting text-specific vocabulary, show promising results, e.g. for named entity correction.