Multi-Input Attention for Unsupervised OCR Correction

We propose a novel approach to OCR post-correction that exploits repeated texts in large corpora both as a source of noisy target outputs for unsupervised training and as a source of evidence when decoding. A sequence-to-sequence model with attention is applied for single-input correction, and a new decoder with multi-input attention averaging is developed to search for consensus among multiple sequences. We design two ways of training the correction model without human annotation, either training to match noisily observed textual variants or bootstrapping from a uniform error model. On two corpora of historical newspapers and books, we show that these unsupervised techniques cut the character and word error rates nearly in half on single inputs and, with the addition of multi-input decoding, can rival supervised methods.


Introduction
Optical character recognition (OCR) software has made vast quantities of printed material available for retrieval and analysis, but severe recognition errors often hamper accessibility in corpora with low-quality printing and scanning or physical deterioration (Chiron et al., 2017). Many digitization projects have employed manual proofreading to further correct OCR output (Holley, 2009), but this is time-consuming and depends on fostering a community of volunteer workers. These problems with OCR are exacerbated in library-scale digitization by commercial (e.g., Google Books, Newspapers.com), government (e.g., Library of Congress, Bibliothèque nationale de France), and nonprofit (e.g., Internet Archive) organizations.
The scale of these projects not only makes it difficult to adapt OCR models to their diverse layouts and typefaces but also makes it impractical to present any OCR output other than a single-best transcript.
Existing methods for automatic OCR post-correction are mostly supervised methods that correct recognition errors in a single OCR output (Kolak and Resnik, 2002; Kolak et al., 2003; Yamazoe et al., 2011). Those systems are not scalable, since human annotations are expensive to acquire, and they cannot exploit complementary sources of information. Another line of work is ensemble methods (Lund et al., 2013, 2014) that combine OCR results from multiple scans of the same document. Most of these ensemble methods, however, require aligning multiple OCR outputs (Lund and Ringger, 2009; Lund et al., 2011), which is intractable in general and may introduce noise into the later correction stage. Furthermore, voting-based ensemble methods (Lund and Ringger, 2009; Wemhoener et al., 2013; Xu and Smith, 2017) work only when the correct output exists in one of the inputs, while classification methods (Boschetti et al., 2009; Lund et al., 2011; Al Azawi et al., 2015) are likewise trained on human annotations.
To address these challenges, we propose an unsupervised OCR post-correction framework that both corrects single input text sequences and exploits multiple candidate texts by simultaneously aligning, correcting, and voting among input sequences. Our proposed method is based on the observation that a significant number of duplicate and near-duplicate documents exist in many corpora (Xu and Smith, 2017), resulting in OCR output containing repeated texts of varying quality. As shown by the example in Table 1, different errors (characters in red) are introduced when the OCR system scans the same text in multiple editions, each with its own layout, fonts, etc. For example, in is recognized as m in the first output and a is recognized as u in the third output, while the second output is correctly recognized. Therefore, duplicated texts with diverse errors can serve as complementary information sources for each other.
Table 1: Example of duplicated OCR output.

    OCR Output:     eor**y that I have been slam in battle, for 1
                    sorry that I have been slain in battle, for I
                    sorry tha' I have been s uin in battle, f r I
    Original Text:  sorry that I have been slain in battle, for I

In this paper, we aim to train an unsupervised correction model by exploiting the duplication in OCR output. We propose to map each erroneous OCR'd text unit to either a high-quality duplicate or a consensus correction among its duplicates via bootstrapping from a uniform error model. The baseline correction system is a sequence-to-sequence model with attention (Bahdanau et al., 2015), which has been shown to be effective in text correction tasks (Chollampatt et al., 2016; Xie et al., 2016).
We also seek to improve the correction performance for duplicated texts by integrating multiple inputs. Previous work on combining multiple inputs in neural translation deal with data from different domains, e.g., multilingual (Zoph and Knight, 2016) or multimodal (Libovický and Helcl, 2017) data. Therefore, their models need to be trained on multiple inputs to learn parameters to combine inputs from each domain. Given that the inputs of our task are all from the same domain, our model is trained on a single input and introduces multi-input attention to generate a consensus result merely for decoding. It does not require learning extra parameters for attention combination and thus is more efficient to train. Furthermore, average attention combination, a simple multi-input attention mechanism, is proposed to improve both the effectiveness and efficiency of multi-input combination on the OCR post-correction task.
We experiment with both supervised and unsupervised training and with single- and multi-input decoding on data from two manually transcribed collections in English with diverse typefaces, genres, and time periods: newspaper articles from the Richmond (Virginia) Daily Dispatch (RDD) from 1860-1865 and books from 1500-1800 from the Text Creation Partnership (TCP). For both collections, which were manually transcribed by other researchers and are in the public domain, we aligned the one-best output of an OCR system to the manual transcripts. We also aligned the OCR in the training and evaluation sets to other public-domain newspaper issues (from the Library of Congress) and books (from the Internet Archive) to find multiple duplicates as "witnesses", where available, for each line. Experimental results on both datasets show that our proposed average attention combination mechanism is more effective than existing methods at integrating multiple inputs. Moreover, our noisy error correction model achieves performance comparable with the supervised model via multiple-input decoding on duplicated texts.
In summary, our contributions are: (1) a scalable framework needing no supervision from human annotations to train the correction model; (2) a multi-input attention mechanism incorporating aligning, correcting, and voting on multiple sequences simultaneously for consensus decoding, which is more efficient and effective than existing ensemble methods; and (3) a method that corrects text either with or without duplicated versions, while most existing methods can only deal with one of these cases.

Data Collection
We perform experiments on one-best OCR output from two sources: two million issues from the Chronicling America collection of historic U.S. newspapers, which is the largest public-domain full-text collection in the Library of Congress, 1 and three million public-domain books in the Internet Archive. 2 For supervised training and for evaluation, we aligned manually transcribed texts to these one-best OCR transcripts: 1384 issues of the Richmond (Virginia) Daily Dispatch from 1860-1865 (RDD) 3 and 934 books from 1500-1800 from the Text Creation Partnership (TCP). 4 Both of these manually transcribed collections, which were produced independently of the current authors, are in the public domain and in English, although both Chronicling America and the Internet Archive also contain much non-English text.
To get more evidence for the correct reading of an OCR'd line, we aligned each OCR'd RDD issue to other issues of the RDD and other newspapers from Chronicling America and aligned each OCR'd TCP page to other pre-1800 books in the Internet Archive. To perform these alignments between noisy OCR transcripts efficiently, we used methods from our earlier work on text-reuse analysis (Smith et al., 2014;Wilkerson et al., 2015). An inverted index of hashes of word 5-grams was produced, and then all pairs from different pages in the same posting list were extracted. Pairs of pages with more than five shared hashed 5-grams were aligned with the Smith-Waterman algorithm with equal costs for insertion, deletion, and substitution, which returns a maximally aligned subsequence in each pair of pages (Smith and Waterman, 1981). Aligned passages that were at least five lines long in the target RDD or TCP text were output. For each target OCR line-i.e., each line in the training or test set-there are thus, in addition to the ground-truth manual transcript, zero or more witnesses from similar texts, to use the term from textual criticism.
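The candidate-pairing step described above can be sketched as follows (a minimal illustration of the hashed word 5-gram inverted index; the page representation, hash function, and the threshold of five shared shingles are simplifications of the actual pipeline):

```python
from collections import defaultdict

def shingle_hashes(words, n=5):
    """Hash every word n-gram (here 5-grams, as in the text)."""
    return {hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}

def candidate_pairs(pages, min_shared=5):
    """Return page-id pairs sharing more than `min_shared` hashed 5-grams,
    via an inverted index from hash -> posting list of page ids.
    Surviving pairs would then be aligned with Smith-Waterman."""
    index = defaultdict(set)
    for pid, words in pages.items():
        for h in shingle_hashes(words):
            index[h].add(pid)
    shared = defaultdict(int)
    for pids in index.values():          # each posting list
        pids = sorted(pids)
        for i in range(len(pids)):
            for j in range(i + 1, len(pids)):
                shared[(pids[i], pids[j])] += 1
    return [pair for pair, count in shared.items() if count > min_shared]
```

Only pairs passing this cheap filter are handed to the quadratic Smith-Waterman alignment, which keeps the all-pairs comparison tractable at corpus scale.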
In our experiments on OCR correction, each training and test example is a line of text following the layout of the scanned image documents. 5 The average number of characters per line is 42.4 for the RDD newspapers and 53.2 for the TCP books. Table 2 lists statistics for the number of OCR'd text lines with manual transcriptions and additional witnesses. 43% of the manually transcribed lines have witnesses in the RDD newspapers, and 64% of them have witnesses in the TCP books. In the full Chronicling America data, 44% of lines align to at least one other witness. Although not all OCR collections will have this level of repetition, it is notable that these collections, which are some of the largest public-domain digital libraries, do exhibit this kind of reprinting. Similarly, at least 25% of the pages in Google's web crawls are duplicates (Henzinger, 2006). Although we exploit text reuse, where available, to improve decoding and unsupervised training, we also show that our methods improve accuracy even on lines without additional witnesses.

Methods
In this section, we first define our problem in §3.1, followed by a description of our model. In general, we train an OCR error correction model via an attention-based RNN encoder-decoder, which takes a single erroneous OCR'd line as input and outputs the corrected text ( §3.2). At decoding time, multi-input attention combination strategies allow the decoder to integrate information from multiple inputs ( §3.3). Finally, we discuss several unsupervised settings for training the correction model in §3.4.

Problem Definition
Given a line of OCR'd text x comprising the sequence of characters [x_1, ..., x_{T_S}], our goal is to map it to an error-free text y = [y_1, ..., y_{T_T}] by modeling p(y|x). Given p(y|x), we also seek to model p(y|X) to search for consensus among duplicated texts X, where X = [x_1, ..., x_N] are duplicated lines of OCR'd text.

Attention-based Seq2Seq Model
Similar to previous work (Bahdanau et al., 2015), the encoder is a bidirectional RNN (Schuster and Paliwal, 1997) that converts the source sequence x = [x_1, ..., x_{T_S}] into a sequence of hidden states, where each hidden state h_i is a concatenation of the forward and backward hidden states:

    h_i = [f(x_i, h_{i-1}^{fwd}); f(x_i, h_{i+1}^{bwd})],    (1)

and f is the dynamic function, e.g., an LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Cho et al., 2014). The decoder RNN predicts the output sequence y = [y_1, ..., y_{T_T}] through the following dynamics and prediction model:

    s_t = f(s_{t-1}, y_{t-1}, c_t),    y_t = g(s_t, y_{t-1}, c_t),    (2)

where s_t is the RNN state and c_t is the context vector at time t, and y_t is the symbol predicted from the target vocabulary at time t via the prediction function g(·). The context vector is a linear combination of the encoder hidden states:

    c_t = Σ_i α_{t,i} h_i,    α_{t,i} = softmax_i(η(s_{t-1}, h_i)),    (3)

where α_{t,i} is the weight for each hidden state h_i and η is the function that computes the strength of each encoder hidden state given the current decoder hidden state. The loss function is the cross-entropy loss per time step, summed over the output sequence y:

    L(x, y) = -Σ_t log p(y_t | y_{<t}, x).    (4)
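A single attention step can be sketched numerically as follows (a dot product stands in for the alignment function η, which in the actual model is a learned function of the decoder state):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_t, H):
    """One attention step: score each encoder hidden state h_i against the
    decoder state s_t (dot product as a stand-in for eta), normalize the
    scores into weights alpha, and return c_t = sum_i alpha_i * h_i."""
    scores = H @ s_t        # eta(s_{t-1}, h_i) for all i
    alpha = softmax(scores) # attention weights, sum to 1
    c_t = alpha @ H         # linear combination of encoder states
    return c_t, alpha
```

Encoder states that align with the current decoder state receive higher weight, so the context vector concentrates on the relevant part of the input line.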

Multi-input Attention
Given a trained Seq2Seq model p(y|x), our goal is to combine multiple input sequences X to generate the target sequence y, i.e., to utilize information from multiple sources at decoding time.
Suppose each target line has N input sequences X = [x_1, ..., x_N], where x_l = [x_{l,1}, ..., x_{l,T_l}] and T_l is the length of the l-th sequence. Then a sequence of hidden states h_l = [h_{l,1}, ..., h_{l,T_l}] is generated by the encoder network for each input sequence x_l. At each decoding time step t, the decoder searches through the encoder hidden states H = [h_1, ..., h_N] to compute a global context vector c_t. Different strategies to combine attention from multiple encoders are described below.
Flat Attention Combination. Flat attention combination assigns a weight α_{t,l,i} to each encoder hidden state h_{l,i} of each input sequence x_l:

    α_{t,l,i} = exp(η(s_{t-1}, h_{l,i})) / Σ_{l'} Σ_{i'} exp(η(s_{t-1}, h_{l',i'})),    (5)

so the global context vector is given by

    c_t = Σ_l Σ_i α_{t,l,i} h_{l,i}.    (6)

Flat attention combination is similar to single-input decoding over the concatenation of all inputs into one long sequence, except that the encoder hidden states are computed independently for each input.
Hierarchical Attention Combination. The structure of hierarchical attention combination is presented in Figure 1. We first compute a context vector for each input:

    α_{t,l,i} = exp(η(s_{t-1}, h_{l,i})) / Σ_{i'} exp(η(s_{t-1}, h_{l,i'})),    (7)

    c_{t,l} = Σ_i α_{t,l,i} h_{l,i}.    (8)

Then a global context vector c_t is computed as a weighted sum of these per-input context vectors:

    c_t = Σ_l β_{t,l} c_{t,l},    (9)

where β_{t,l} is the weight assigned to each context vector c_{t,l}, computed in one of the following ways:

(a) Weighted Attention Combination. In weighted attention combination, the weight for each context vector is given by its dot product with the decoder state in a transformed common space, normalized over inputs:

    β_{t,l} = softmax_l(s_{t-1} · c_{t,l}).    (10)

(b) Average Attention Combination. In average attention combination, each input sequence is weighted equally, i.e., β_{t,l} = 1/N for each input sequence x_l. It is more efficient than weighted attention combination in that it does not need to compute a weight for each input.

These attention-combination methods have no parameters trained on multiple inputs and are introduced only at decoding time. In contrast, Libovický and Helcl (2017) and Zoph and Knight (2016) introduce parameters for each type of input and require training and decoding with the same number of inputs.
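The hierarchical variant with average combination can be sketched as follows (again using a dot product as a stand-in for η; one encoder hidden-state matrix per input sequence):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def average_attention_context(s_t, encoder_states):
    """Hierarchical attention with average combination: compute a
    per-input context vector c_{t,l} via attention within each input,
    then combine them with equal weights beta_{t,l} = 1/N. No extra
    parameters are trained for the combination step."""
    contexts = []
    for H_l in encoder_states:       # one (T_l x d) matrix per input
        alpha = softmax(H_l @ s_t)   # attention within input l
        contexts.append(alpha @ H_l) # c_{t,l}
    return np.mean(contexts, axis=0) # c_t with beta = 1/N
```

Because the combination is a plain average, the same single-input model can be reused at decoding time with any number of witnesses.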

Training Settings
In this section, we introduce different settings for training our correction model, a single-input attention-based Seq2Seq model ( §3.2), which transforms each OCR'd text line into a corrected version generated via different mechanisms.

Supervised Training. In this setting, the correction model is trained to map each OCR'd line to the corresponding manual transcription, i.e., the human annotation. We call the correction model trained in this setting Seq2Seq-Super.

Unsupervised Training. In the absence of ground-truth transcriptions, we can use different methods to generate a noisy corrected version of each OCR'd line.
(a) Noisy Training. In this setting, the correction model is trained to transform each OCR'd text line into a selected high-quality witness. The quality of the witnesses is measured by a 5-gram character language model built on the New York Times corpus (Sandhaus, 2008) with the KenLM toolkit (Heafield, 2011). For each OCR'd line with multiple witnesses, the language model assigns a score to each witness, which is divided by the number of characters in the witness to reduce the effect of its length. The witness with the highest score is then chosen as the noisy ground truth for that line. Lines with low scores for all witnesses are removed. We call the correction model trained in this setting Seq2Seq-Noisy.
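The witness-selection step might look like the following sketch (the `lm_score` function stands in for the KenLM character language model, and the rejection threshold is illustrative, not the paper's actual value):

```python
def pick_witness(witnesses, lm_score, min_score=-3.0):
    """Select the highest-quality witness as the noisy training target.
    `lm_score` returns a log-probability-style score for a line (e.g.
    from a 5-gram character KenLM model); we divide by line length so
    longer witnesses are not penalized. If even the best witness falls
    below `min_score` per character, the line is dropped (return None).
    The threshold value here is a hypothetical placeholder."""
    scored = [(lm_score(w) / max(len(w), 1), w) for w in witnesses]
    best_score, best = max(scored)
    return best if best_score >= min_score else None
```

In the real pipeline the selected witness becomes the target side of a training pair, so the model learns to map noisy OCR lines toward cleaner duplicates.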
(b) Synthetic Training. In this setting, the error correction model is trained to recover a manually corrupted out-of-domain corpus. We construct the synthetic dataset by injecting uniformly distributed insertion, deletion, and substitution errors into the New York Times corpus. First, the news articles are split into lines whose lengths are drawn from a Gaussian distribution N(45, 5), truncated to the range [1, 70], which is similar to the length distribution of the real-world datasets. Then a certain number of lines are randomly selected and injected with equal numbers of insertion, deletion, and substitution errors. The correction model is then trained to recover the original line from each corrupted line. We call this model Seq2Seq-Syn.
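The corruption procedure can be sketched as follows (the default error rate, alphabet, and per-line error budget are illustrative; the paper injects equal numbers of the three error types over randomly selected lines):

```python
import random

def corrupt(line, error_rate=0.1, alphabet="abcdefghijklmnopqrstuvwxyz ", rng=None):
    """Inject uniformly distributed insertion, deletion, and substitution
    errors into `line`, choosing among the three operations at random so
    they occur in roughly equal numbers. A sketch of the synthetic
    training-data generation; parameters are not the paper's exact values."""
    rng = rng or random.Random(0)
    chars = list(line)
    n_errors = max(1, int(len(chars) * error_rate))
    for _ in range(n_errors):
        op = rng.choice(["insert", "delete", "substitute"])
        i = rng.randrange(len(chars)) if chars else 0
        if op == "insert":
            chars.insert(i, rng.choice(alphabet))
        elif op == "delete" and chars:
            del chars[i]
        elif chars:
            chars[i] = rng.choice(alphabet)
    return "".join(chars)
```

Training pairs are then (corrupt(line), line), so the model learns to invert the uniform error channel before bootstrapping adapts it to the real error distribution.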
(c) Synthetic Training with Bootstrapping. In this setting, we propose to further improve the performance of synthetic training via bootstrapping. The correction model trained on the synthetic dataset does not perform well when correcting single inputs from the real-world dataset, due to the difference in error distributions. But it achieves performance comparable to the supervised model when decoding lines with multiple witnesses, since the model can further benefit from jointly aligning and voting among multiple inputs. Thus, with the multi-input attention mechanism introduced in §3.3, we first generate a high-quality consensus correction for each OCR'd line with witnesses via the correction model trained on synthetic data. A bootstrapped model is then trained to transform those lines into their consensus correction results. We call the correction model trained in this setting Seq2Seq-Bootstrap.

Experiments
In this section, we first introduce the details of our experimental setup ( §4.1). Then, the results of preliminary experiments comparing the performance of different options for the single-input Seq2Seq model and the multi-input attention combination strategies are presented in §4.2. The main experimental results for evaluating the correction model trained in different training settings and decoded with/without multi-input attention are reported and explained in §4.3. Further discussions of our model are described in §4.4.

Experimental Setup
We begin by describing the data split, training details, baseline systems, and evaluation metrics.

Training Details
For both the RDD newspapers and TCP books, we randomly split the OCR'd lines into 80% training and 20% test, either by the date of the newspaper issue or by the name of the book. For the RDD newspapers, we have 1.7M training lines and 0.44M test lines. For the TCP books, 2.8M lines are randomly sampled from the whole training set for the different training settings, to conduct a fair comparison with noisy training, and about 1.6M lines are used for testing.
Both the encoder and decoder of our model have 3 layers with 400 hidden units per layer, with GRU as the dynamic function. The Adam optimizer with a learning rate of 0.0003 and default decay rates is used to train the correction model. We train for up to 40 epochs with a minibatch size of 128 and select the model with the lowest perplexity on the development set. The decoder uses beam search with a beam width of 100.

Baselines and Comparisons
In preliminary experiments, we first compare the neural translation model ( §3.2) with a commonly used Seq2Seq model, pruned conditional random fields (PCRF) (Schnober et al., 2016), on the single-input correction task. CRF models have been shown to be very competitive on tasks such as OCR post-correction, spelling correction, and lemmatization. After that, we compare the different multi-input attention strategies introduced in §3.3 on the multi-input correction task to choose the best strategy for the main experiments.
In the main experiment, we compare the performance of correction models trained in the different training settings and decoded with and without multiple witnesses. Two ensemble methods, language model ranking (LMR) and majority vote (Xu and Smith, 2017), are also considered as unsupervised baselines. LMR chooses a single high-quality witness for each OCR'd line by a language model as the correction for that line. Majority vote first aligns multiple input sequences using a greedy pairwise algorithm (since multiple sequence alignment is intractable) and then votes on each position in the alignment, with a slight advantage given to the original OCR output in case of ties. We also tried an exact unsupervised method for consensus decoding based on dual decomposition (Paul and Eisner, 2012). That implementation, unfortunately, turned out not to return a certificate of completion on most lines in our data even after thousands of iterations.
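The position-wise voting step of the majority vote baseline can be sketched as follows (the greedy pairwise alignment is assumed to have been computed already; the gap symbol and the tie-break toward the OCR output at index 0 are illustrative):

```python
from collections import Counter

def majority_vote(aligned, ocr_index=0):
    """Vote position by position over pre-aligned sequences of equal
    length, with '-' as the gap symbol. Ties are broken in favor of
    the original OCR output (at `ocr_index`), mirroring the slight
    advantage described in the text; gaps are dropped from the result."""
    out = []
    for pos in zip(*aligned):
        counts = Counter(pos)
        top = counts.most_common(1)[0][1]
        winners = {c for c, n in counts.items() if n == top}
        ch = pos[ocr_index] if pos[ocr_index] in winners else sorted(winners)[0]
        if ch != "-":
            out.append(ch)
    return "".join(out)
```

This illustrates why the baseline can only succeed when the correct character already appears in a plurality of the inputs, which the neural multi-input decoder is not restricted to.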

Evaluation Metrics
Word error rate (WER) and character error rate (CER) are used to compare the performance of each method. Case is ignored. Lattice word error rate (LWER) and lattice character error rate (LCER) are also computed as the oracle performance for each method, which could reveal the capability of each model to be applied to downstream tasks taking lattices as input, e.g., reranking or retrieval of the correction hypotheses (Taghva et al., 1996;Lam-Adesina and Jones, 2006). We compute the macro average for each type of error rate, which allows us to use a paired permutation significance test.
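The character-level metrics can be sketched as follows (CER from unit-cost edit distance, case-ignored, macro-averaged over lines as described above; WER is analogous over word tokens):

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion/deletion/substitution costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(hyp, ref):
    """Character error rate for one line, case-ignored as in the paper."""
    hyp, ref = hyp.lower(), ref.lower()
    return edit_distance(hyp, ref) / max(len(ref), 1)

def macro_cer(pairs):
    """Macro average over (hypothesis, reference) lines, which is what
    makes a paired permutation significance test applicable."""
    return sum(cer(h, r) for h, r in pairs) / len(pairs)
```

The macro average weights every line equally regardless of length, so per-line score differences can be permuted directly in the significance test.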

Preliminary Experiments
In this section, we conduct two preliminary experiments to study different options for both the single-input correction models and the multi-input attention combination strategies. We first compare the attention-based Seq2Seq (Attn-Seq2Seq) model with a traditional Seq2Seq model, the PCRF, on the single-input correction task. As the PCRF implementation of Schnober et al. (2016) is highly memory- and time-consuming to train on long sequences, we compare it with the Attn-Seq2Seq model on a smaller dataset of 100K lines randomly sampled from the RDD newspapers training set. The trained correction model is then applied to correct the full test set. The CER and WER of the correction results from both models are listed in Table 3. We find that the Attn-Seq2Seq neural translation model works significantly better than the PCRF when trained on a dataset of the same size. The performance of the Attn-Seq2Seq model can be further improved by including more training data or by multi-input decoding for duplicated texts, while the PCRF can only be trained on limited data and cannot work on multiple inputs. Thus, we choose Attn-Seq2Seq as our error correction model.

Multi-input Attention Combination
We also compare the different attention combination strategies on a multi-input decoding task. The results in Table 4 reveal that average attention combination performs best among all the decoding strategies on both the RDD newspapers and TCP books datasets. It reduces the CER of single-input decoding by 41.5% for OCR'd lines in the RDD newspapers and 9.76% for the TCP books.
The comparison between the two hierarchical attention combination strategies shows that averaging evidence from each input works better than a weighted summation mechanism. Flat attention combination, which merges all the inputs into one long sequence when computing the strength of each encoder hidden state, obtains the worst performance in terms of both CER and WER.

Table 4: Results of correcting lines in the RDD newspapers and TCP books with multiple witnesses when decoding with different strategies using the same supervised model. Attention combination strategies that statistically significantly outperform single-input decoding are highlighted with * (p < 0.05, paired permutation test). The best result for each column is in bold.

Main Results
We now present results on the full training and test sets for the Richmond Daily Dispatch newspapers and Text Creation Partnership books. All results are on the same test set. The multi-input decoding experiments have access to additional witnesses for each line, where available, but fall back to single-input decoding when no additional witnesses are present for a given line. Table 5 presents the results for our model trained in the different training settings as well as the baseline language model reranking (LMR) and majority vote methods. Multi-input decoding performs better than single-input decoding in every training setting, and the model trained in supervised mode with multi-input decoding achieves the best performance. The majority vote baseline, which applies only when there are more than two inputs, performs worst on both the TCP books and RDD newspapers. Our proposed unsupervised frameworks Seq2Seq-Noisy and Seq2Seq-Boots achieve performance comparable with the supervised model via multi-input decoding on the RDD newspaper dataset. The performance of Seq2Seq-Noisy is worse on the TCP books than the RDD newspapers, since those old books contain the long s character, 6 formerly used where s occurred in the middle or at the beginning of a word. These characters are recognized as f in all the witnesses because of their similar shape. Thus, the model trained on noisy data is unable to correct them to s. Nonetheless, removing the factor of the long s, i.e., replacing the long s in the ground truth with f, Seq2Seq-Noisy achieves a CER of 0.062 for single-input decoding and 0.058 for multi-input decoding on the TCP books. Both Seq2Seq-Syn and Seq2Seq-Boots work better on the RDD newspapers than the TCP books dataset. We conjecture that this is because the synthetic dataset is built from (modern) newspapers, which are more similar to the nineteenth-century RDD newspapers.
The long s problem also makes it more difficult for the model trained on synthetic data to work on the TCP books.

Discussion
In this section, we provide further analysis of different aspects of our method.

Does Corruption Rate Affect Synthetic Training? We first examine how the corruption rate of the synthetic dataset affects the performance of the correction model. Figure 2 presents the results of the single-input and multi-input correction tasks on the RDD newspapers and TCP books when trained on synthetic data corrupted with different error rates: 0.09, 0.12, and 0.15. For both tasks, the character error rate increases slightly when the correction model is trained to recover synthetic data with a higher corruption rate. However, the performance is more stable on the RDD newspapers than the TCP books when more errors are introduced.

Does the Number of Witnesses Affect Multi-Input Decoding? We then examine how the number of witnesses affects the performance of multiple-input decoding. The test set is divided into subgroups of varying size according to the number of witnesses. Figure 3 presents the performance of multi-input correction on subgroups with different numbers of witnesses. We can see that supervised training achieves the best performance on each subgroup for both datasets. On the RDD newspapers, the performance of each training setting improves significantly when the number of witnesses increases from 0 to 2, and the error rate then tends to flatten as more witnesses are observed. For the TCP books, the character error rate for both Seq2Seq-Syn and Seq2Seq-Boots decreases with small fluctuations as the number of witnesses increases. Seq2Seq-Noisy performs the worst on almost all subgroups of the TCP books, since all the witnesses suffer from the long s problem.
Can More Training Data Benefit Learning? Figure 4 shows the test results for our correction model trained on datasets of different sizes. As the size of the training set increases, the CER of our model decreases consistently for both single- and multi-input correction on the RDD newspapers. However, the performance curve on the TCP books dataset is flatter, since that dataset is larger overall than the RDD newspapers.

Related Work
Multi-Input OCR Correction. Ensemble methods have been shown to be effective in OCR post-correction by combining OCR output from multiple scans of the same document (Lopresti and Zhou, 1997; Klein and Kopel, 2002; Cecotti and Belaïd, 2005; Lund et al., 2013). Existing methods aim at generating consensus results by aligning multiple inputs, followed by supervised methods such as classification (Boschetti et al., 2009; Lund et al., 2011; Al Azawi et al., 2015) or unsupervised methods such as dictionary-based selection (Lund and Ringger, 2009) and voting (Wemhoener et al., 2013; Xu and Smith, 2017). While supervised ensemble methods require human annotation for training, unsupervised selection methods work only when the correct word or character exists in one of the inputs. Furthermore, these methods cannot correct single inputs.
Multi-Input Attention. Multi-input attention has been explored in tasks such as machine translation (Zoph and Knight, 2016; Libovický and Helcl, 2017) and summarization (Wang and Ling, 2016). Wang and Ling (2016) propose to concatenate multiple inputs to generate a summary; this flat attention combination model may be affected by the order of the input sequences. Zoph and Knight (2016) develop a multi-source translation model on a trilingual corpus, in which the encoder outputs for each language are combined and passed to the decoder; however, it requires the same number of inputs at training and decoding time, since the parameters depend on the number of inputs. Libovický and Helcl (2017) explore different attention combination strategies for multiple information sources such as image and text. In contrast, our method does not require multiple inputs for training, and the attention combination strategies are used only to integrate multiple inputs when decoding.

Conclusions
We have proposed an unsupervised framework for OCR error correction, which can handle both single-input and multi-input correction tasks. An attention-based sequence-to-sequence model is applied for single-input correction, based on which a strategy of multi-input attention combination is designed to correct multiple input sequences simultaneously. The proposed strategy naturally incorporates aligning, correcting, and voting among multiple sequences, and is thus effective in improving the correction performance for corpora containing duplicated text. We propose two ways of training the correction model without human annotation by exploiting the duplication in the corpus. Experimental results on historical books and newspapers show that these unsupervised approaches significantly improve OCR accuracy and, when multiple inputs are available, achieve performance comparable to supervised methods.