Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in Romanised Sanskrit. Owing to the lack of resources, our approach reuses OCR models trained for other languages written in the Roman script. Since no dataset is currently available for Romanised Sanskrit OCR, we bootstrap a dataset of 430 images, scanned in two different settings, along with their corresponding ground truth. For training, we synthetically generate images for both settings. We find that the use of a copying mechanism (Gu et al., 2016) yields a 7.69% increase in Character Recognition Rate (CRR) over the current state-of-the-art model for monotone sequence-to-sequence tasks (Schnober et al., 2016). Our system is robust in combating OCR-prone errors: for one of the dataset settings, it obtains a CRR of 87.01% from an OCR output with a CRR of 35.76%. A human judgement survey performed on the models shows that predictions from our proposed model are faster for a human to comprehend and faster to correct than those of the other systems.


Introduction
Sanskrit used to be the 'lingua franca' for scientific and philosophical discourse in ancient India, with a literature that spans more than three millennia. Sanskrit primarily had an oral tradition, and the script used for writing it varied widely across time spans and regions. With the advent of the printing press, Devanagari emerged as the prominent script for representing Sanskrit. With the standardisation of Romanisation using IAST in 1894 (Monier-Williams, 1899), printing in Sanskrit was extended to the Roman script as well. There has been a surge in digitising printed Sanskrit manuscripts written in Roman, such as those currently digitised by the 'Krishna Path' project. The data and the code for our system are available at https://github.com/majumderb/sanskrit-ocr.
In this work, we propose a model for post-OCR text correction for Sanskrit written in Roman. Post-OCR text correction, which can be seen as a special case of spelling correction (Schnober et al., 2016), is the task of correcting errors that tend to appear in the output of an OCR in the process of converting an image to text. The error rate of an OCR can be quite high due to numerous factors, including typefaces, paper quality and scan quality. The text can often be eroded or noisy, and the paper can be bleached or tainted as well (Schnober et al., 2016). Figure 1 shows sample images we have collected for the task. Hence it is beneficial to post-process the OCR output to obtain an improved text. In the case of Indic OCRs, there have been considerable efforts in the collection and annotation of data pertaining to Indic scripts (Kumar and Jawahar, 2007; Bhaskarabhatla et al., 2004; Govindaraju and Setlur, 2009; Krishnan et al., 2014). Earlier attempts on Indian scripts were primarily based on handcrafted templates (Govindan and Shivaprasad, 1990; Chaudhuri and Pal, 1997) or features (Arora et al., 2010; Pal et al., 2009) which extensively used script- and language-specific information (Krishnan et al., 2014). Sequential labelling approaches were later proposed that take word-level inputs and make character-level predictions (Shaw et al., 2008; Hellwig, 2015). The word-based sequence labelling approaches were further extended to use neural architectures, especially RNNs and their variants such as LSTMs and GRUs (Sankaran and Jawahar, 2012; Krishnan et al., 2014; Saluja et al.; Adiga et al., 2018; Mathew et al., 2016). OCR errors, however, are known to exhibit few long-range dependencies (Schnober et al., 2016). Singh and Jawahar (2015) find that extending the neural models to process text at the sentence level (or a textline) improves the performance of OCR systems. This was further corroborated by Saluja et al.,
where the authors found that using words within a context window of 5 around a given input word worked particularly well for post-OCR text correction in Sanskrit. By providing a text line as input, we give the model more context than word-level models do, and RNN (or LSTM) cells are powerful enough to capture the resulting long-term dependencies. For Indian languages in particular, this decision goes beyond a question of performance. In Sanskrit, word boundaries are often obscured by phonetic transformations at the boundaries, known as Sandhi. Word segmentation of Sanskrit constructions is a research problem in its own right (Krishna et al., 2016a; Reddy et al., 2018). However, none of the existing segmentation systems is equipped to handle incorrect spellings, and hence these systems may be brittle (Belinkov and Bisk, 2018) when faced with spelling variations in the input. In our case, therefore, we take an unsegmented sequence as input and perform post-OCR text correction directly on it. We hypothesise that this will improve the segmentation process and other downstream tasks for Sanskrit in a typical NLP pipeline.
Our major contributions are: 1. Contrary to what is observed by Schnober et al. (2016), an encoder-decoder model, when equipped with a copying mechanism (Gu et al., 2016), can outperform a traditional sequence labelling model on a monotone sequence labelling task. Our model outperforms Schnober et al. (2016) on post-OCR text correction for Romanised Sanskrit by 7.69% in terms of CRR.
2. By making use of digitised Sanskrit texts, we generate images as synthetic training data for our models. We systematically incorporate various distortions to those images so as to emulate the settings of the original images.
3. Through a human judgement experiment, we asked participants to correct the mistakes in predicted outputs from the competing systems. We find that participants were able to correct predictions from our system more frequently, and did so much faster, than for the CRF model of Schnober et al. (2016). We also observe that predictions from our model score higher on acceptability (Lau et al., 2015) than those of the other methods.

Model Architecture
In principle, the output of any OCR which recognises Romanised Sanskrit can be used as input to our model. Currently, there exist limited options for recognising Romanised Sanskrit in scanned documents. The commercial OCR offered by Google as part of its proprietary Cloud Vision API and SanskritOCR might be the only two viable options. SanskritOCR provides an online interface to Tesseract, an open-source multilingual OCR (Smith, 2007; Smith et al., 2009; Smith, 1987), trained specifically for recognising Romanised Sanskrit. Additionally, we trained an offline version of Tesseract to recognise the graphemes of the Romanised Sanskrit alphabet. With both models we find that many scanned images, especially ones similar to that shown in Figure 1b, were not recognised by the system. We hypothesise that this is due to the lack of sufficient font styles in our collection, in spite of using the site with the richest collection of Sanskrit fonts. This leaves Google OCR as the only option. Considering that working with a commercial offering such as Google OCR may not be affordable for various digitisation projects, we chose to use Tesseract with models trained for other languages written in the Roman script. All the Latin-script pre-trained models of Tesseract are trained on 400,000 text-lines spanning about 4,500 fonts (https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00).

Use of OCR with pre-trained models for other languages: The French alphabet has the highest grapheme overlap with the Sanskrit alphabet (37 of 50), while all other languages have one fewer grapheme in common with Sanskrit. Hence, we arbitrarily take 5 languages in addition to French and perform our analysis. Table 1 shows the character recognition rate (CRR) for OCR using the alphabets of different languages, when performed on a dataset of 430 scanned images (§3.1).
The table also shows the counts of the error types made by the OCR after alignment (Jiampojamarn et al., 2007; D'hondt et al., 2016). All the languages have a near-similar CRR, with English and French leading the list. Based on these observations of OCR performance, we select English for our further experiments.
Upcycling such a pre-trained model brings its own challenges. For instance, the 14 Sanskrit graphemes missing in English (detailed in §2 of the Supplementary Material) are naturally mispredicted as other graphemes. This leads to ambiguity, as the correct and the mispredicted characters now share the same target. Figure 2 shows the heat-map of such mis-predictions when we used the OCR on the set of 430 scanned images. Here, we zoom into the relevant cases and show the row-normalised proportion of predictions.

System Descriptions
We formalise the task as monotone seq2seq. We use an encoder-decoder framework that takes a character sequence as input, and the model learns embeddings at the sub-word level on both the encoder and the decoder side. The OCR output forms the input to the model. Keeping the task in mind, we make two design decisions for the model: one is the use of a copying mechanism (Gu et al., 2016), and the other is the use of Byte Pair Encoding (BPE) (Sennrich et al., 2016) to learn a new vocabulary for the model.
CopyNet (Gu et al., 2016): Since there is likely to be considerable overlap between the input and output strings, we use the copying mechanism of CopyNet (Gu et al., 2016). The model essentially learns two probability distributions, one for generating an entry at the decoder and one for copying an entry from the encoder. The final prediction is based on the sum of both probabilities for the class. Given an input sequence $X = (x_1, \ldots, x_N)$, we define $\mathcal{X}$ as the set of unique entries in the input sequence. We also define the vocabulary $\mathcal{V}$, and let out-of-vocabulary (OOV) words be represented by UNK. The probabilities of the generate mode $g$ and the copy mode $c$ are given by

$$p_g(y_t) = \frac{1}{Z} e^{\psi_g(y_t)}, \; y_t \in \mathcal{V} \cup \{\text{UNK}\}, \qquad p_c(y_t) = \frac{1}{Z} \sum_{j: x_j = y_t} e^{\psi_c(x_j)}, \; y_t \in \mathcal{X},$$

where $\psi_g(\cdot)$ and $\psi_c(\cdot)$ are score functions for the generate mode ($g$) and the copy mode ($c$), respectively, and $Z$ is the normalisation term shared by the two modes, $Z = \sum_{v \in \mathcal{V} \cup \{\text{UNK}\}} e^{\psi_g(v)} + \sum_{x \in \mathcal{X}} e^{\psi_c(x)}$. The scoring functions for the two modes, respectively, are

$$\psi_g(y_t = v_i) = v_i^{\top} W_o\, s_t, \qquad \psi_c(y_t = x_j) = \sigma\!\left(h_j^{\top} W_c\right) s_t,$$

where $W_c \in \mathbb{R}^{d_h \times d_s}$ and $W_o$ are learnt parameters, $v_i$ is the one-hot representation of the $i$-th vocabulary entry, $s_t$ is the decoder state, $h_j$ is the encoder hidden state at position $j$, and $\sigma$ is a non-linear activation function (Gu et al., 2016).

BPE (Sennrich et al., 2016): Sanskrit is a morphologically rich language. A noun in Sanskrit can have 72 different inflections, and a verb may have more than 90. Additionally, Sanskrit corpora generally exhibit a compound-rich vocabulary (Krishna et al., 2016b). Hence, in a typical Sanskrit corpus, the majority of the tokens appear fewer than 5 times (§3.1); such tokens are generally considered rare words (Sennrich et al., 2016). However, corpora dominated by rare words are difficult to handle for a statistical model like ours. To combat this data sparsity, we convert the tokens into sub-word n-grams using Byte Pair Encoding (BPE) (Sennrich et al., 2016). Methods such as wordpiece (Schuster and Nakajima, 2012), as well as that of Sennrich et al. (2016), are means of obtaining a new vocabulary for a given corpus.
Every sequence in the corpus is then rewritten as a sequence of sub-word units, which form the types in the new vocabulary so obtained. These methods essentially use a data-driven approach to maximise the language-model likelihood of the training data, given an evolving word definition (Wu et al., 2016). We explicitly set the minimum count for a token in the new vocabulary to 30. We learn a new vocabulary of size 82, with 22 entries of length 1 and the rest of length 2. The IAST standardisation of Romanised Sanskrit contains 50 graphemes in the Sanskrit alphabet; about 12 of these are represented using two-character Roman combinations. Of these graphemes, 7 were not present in the vocabulary learnt using BPE, so we add them to the 82 learnt entries, bringing the total vocabulary size to 89. With the new vocabulary, it is guaranteed that there will be no Out-Of-Vocabulary (OOV) words in our model.
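The BPE vocabulary learning step can be sketched as follows. This is a minimal re-implementation of the merge loop of Sennrich et al. (2016), wired up with the minimum-count threshold of 30 mentioned above; the actual system uses the authors' released implementation, so treat this purely as an illustration of the algorithm.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges, min_count=30):
    """Learn BPE merge operations from a list of word tokens.

    Each word starts as a tuple of characters. At every step, the most
    frequent adjacent symbol pair is merged into a new sub-word unit,
    provided its corpus frequency meets `min_count` (the paper's threshold).
    """
    vocab = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < min_count:
            break  # remaining pairs are too rare to become vocabulary types
        merges.append((a, b))
        new_vocab = Counter()
        for word, f in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                    out.append(a + b)  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += f
        vocab = new_vocab
    return merges
```

Running this over the training corpus and then adding the 7 missing IAST graphemes yields the final 89-entry vocabulary described above.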
We use 3 stacked layers of LSTMs at the encoder and the decoder, with the same settings as in Bahdanau et al. (2015). To enable copying, we share the embeddings of the source and the target vocabulary. By eliminating OOVs, we ensure that copying always happens by virtue of the evidence in the training data and not due to the presence of an OOV word.
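To make the two-mode formulation concrete, the following sketch shows how the generate and copy scores combine into a single output distribution, as in the CopyNet equations above. The score values are assumed to come from the trained encoder and decoder states; here they are plain numbers.

```python
import math

def copynet_distribution(psi_g, psi_c, source_tokens):
    """Combine generate-mode and copy-mode scores into one distribution.

    `psi_g` maps every vocabulary entry (including 'UNK') to its
    generate-mode score; `psi_c[j]` is the copy-mode score of source
    position j, whose token is `source_tokens[j]`. Both modes share a
    single normaliser Z, and a token's final probability is its generate
    probability plus the copy probability of every source position
    holding that token.
    """
    Z = sum(math.exp(s) for s in psi_g.values()) \
        + sum(math.exp(s) for s in psi_c)
    prob = {v: math.exp(s) / Z for v, s in psi_g.items()}
    for tok, s in zip(source_tokens, psi_c):
        prob[tok] = prob.get(tok, 0.0) + math.exp(s) / Z
    return prob
```

A sub-word that appears both in the vocabulary and in the OCR output thus receives probability mass from both modes, which is exactly the behaviour exploited for correctly-OCRed spans.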

Dataset
Sanskrit is a low-resource language, and datasets of scanned images with corresponding aligned text for Romanised Sanskrit are extremely scarce. We obtained 430 scanned images, as shown in Figure 1, and manually annotated the corresponding text. We use this as our test dataset, henceforth referred to as OCRTest. For training and development, we synthetically generate images from digitised Sanskrit texts. The training images, OCRTrain, were generated by synthetically adding distortions so as to match the settings of the real scanned documents.
OCRTest contains 430 images from 1) a scanned copy of the Vishnu Sahaśranāma and 2) a scanned copy of the Bhagavad Gītā; a sample of each is shown in Figures 1a and 1b. 140 of the 430 images are from the Sahaśranāma and the remaining are from the Bhagavad Gītā.
OCRTrain: Similar to Ul-Hasan and Breuel (2013), we synthetically generate images, which are then fed to the OCR to obtain our training data. We use the digitised text of the Śrīmad Bhāgavatam for generating the synthetic images. The text contains about 14,094 verses in total, divided into 50,971 text-lines. The dataset is split 80-20 into a training set and a development set, respectively. The corpus contains a vocabulary of 52,882 word types; 48,249 of these appear 5 times or fewer, of which 32,411 appear exactly once. This is primarily due to the inflectional nature of Sanskrit. We find similar trends in the vocabularies of the Rāmāyaṇa and the Digital Corpus of Sanskrit (Hellwig, 2010-2016) as well.
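The vocabulary statistics quoted above come from a straightforward type-frequency count, which can be sketched as below; the function and its return convention are illustrative, not the paper's code.

```python
from collections import Counter

def vocabulary_sparsity(tokens):
    """Return (total word types, types with frequency <= 5, hapax legomena).

    Types with frequency <= 5 are the 'rare words' in the sense of
    Sennrich et al. (2016); hapax legomena appear exactly once.
    """
    freq = Counter(tokens)
    n_types = len(freq)
    n_rare = sum(1 for c in freq.values() if c <= 5)
    n_hapax = sum(1 for c in freq.values() if c == 1)
    return n_types, n_rare, n_hapax
```

Applied to the Bhāgavatam tokens, a count of this kind gives the 52,882 / 48,249 / 32,411 figures above, which motivate the switch to a BPE vocabulary.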

Synthetic Generation of training set
Using the text-lines from the Bhāgavatam, we generate synthetic images using ImageMagick. The images were generated at a quality of 60 Dots Per Inch (DPI), and the height of each textline image was kept constant at 65 pixels. We add several distortions to the synthetically generated images so as to visually match the settings of OCRTest. Previously, Ul-Hasan and Breuel (2013) used the approach of synthetically generating training data for their multilingual OCR solution. Table 2 shows the different parameters, namely gamma correction, noise addition, use of a structural kernel for erosion, and perspective distortion, which we apply sequentially so as to distort and degrade the images (Chen et al., 2014). We use grid search for parameter estimation, with the parameters and the ranges of values experimented with provided in Table 2. Finally, we filter 7 (out of 38,400 combinations) different configurations by comparing the distribution of Character Recognition Rate (CRR) across the images with that of OCRTest using KL-divergence. Among these seven configurations, four are closer to the settings of the Bhagavad Gītā and the remaining three to those of the Sahaśranāma. Figure 3 shows the two different settings (one closer to each source textbook) for the string "ajo durmarṣaṇaḥ śāstā viśrutātmā surārihā", along with their corresponding parameter settings and KL-divergence. Our training set contains images in all 7 settings for each of the textlines in OCRTrain.

Evaluation metrics: We use the standard metrics, Character Recognition Rate (CRR) and Word Recognition Rate (WRR) (Sankaran and Jawahar, 2012). CRR is the fraction of characters recognised correctly against the total number of characters in a line, whereas WRR is the fraction of words correctly recognised against the total number of words in a line. Additionally, we use a sentence-level metric, called the acceptability score.
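One common way to operationalise these two rates is via edit-distance alignment between the prediction and the ground truth; the sketch below follows that convention and is not necessarily the paper's exact alignment procedure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def crr(gt, pred):
    """Character Recognition Rate: characters correct / total characters."""
    return max(0.0, (len(gt) - edit_distance(gt, pred)) / len(gt))

def wrr(gt, pred):
    """Word Recognition Rate over whitespace-separated words."""
    g, p = gt.split(), pred.split()
    return max(0.0, (len(g) - edit_distance(g, p)) / len(g))
```

For example, a one-character substitution in a four-character line yields a CRR of 0.75 while leaving the WRR of the affected word at 0.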
This measure indicates the extent to which a sentence is permissible or acceptable to speakers of the language (Lau et al., 2015). From Lau et al. (2015), we use the NormLP formulation for the task, as it is found to correlate highly with human judgements of acceptability. NormLP is calculated by taking the log-likelihood of a predicted sentence under the model, normalising it by the log-likelihood of the string under a unigram language model trained on a corpus of gold-standard sentences, and then negating the result. The higher the score, the more acceptable the sentence.
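Under that description, NormLP reduces to a ratio of two log-probabilities; a minimal sketch, with the smoothing of unseen unigrams left out for brevity:

```python
import math

def norm_lp(logprob_model, logprob_unigram):
    """NormLP acceptability score: -(log P_model(s) / log P_unigram(s)).

    Both arguments are sentence log-probabilities (negative numbers).
    Dividing by the unigram score normalises away sentence length and
    word-frequency effects; after negation, higher (less negative)
    scores indicate more acceptable sentences.
    """
    return -logprob_model / logprob_unigram

def unigram_logprob(tokens, counts, total):
    """Sentence log-probability under a unigram LM built from raw counts."""
    return sum(math.log(counts[t] / total) for t in tokens)
```

A sentence the model finds more probable (a less negative model log-probability) thus receives a NormLP closer to zero than a less probable one of the same length.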

Baselines
Character Tagger - Sequence Labelling using BiLSTMs: This is a sequence labelling model which uses BiLSTM cells and takes a character sequence as input (Saluja et al.). We use categorical cross-entropy as the loss function and softmax as the activation function, and we employ spatial dropout in the architecture. The model consists of 3 layers, each with 128 cells. Embeddings of size 100 are randomly initialised, and the learnt representations are stored in a character look-up table, similar to Lample et al. (2016). In addition to a class for every phoneme in Sanskrit, we add an additional class 'no change', which signifies that the character remains as is. We also experimented with a variant where the final layer is a CRF layer (Lafferty et al., 2001). We henceforth refer to the two systems as BiLSTM and BiLSTM-CRF, respectively.
Table 2: Image pre-processing steps and parameters.

Pruned CRFs (Schnober et al., 2016): These are higher-order CRF models (Ishikawa, 2011) that approximate the CRF objective function using coarse-to-fine decoding. Schnober et al. (2016) adapt the sequence labelling model into a seq2seq model that can handle variable-length input-output pairs, and show that none of the neural seq2seq models considered in their work were able to outperform the Pruned CRF of order 5. The features to the model are the consecutive characters within a window of size w in either direction of the current position at which a prediction is made. The model is designed to handle 1-to-zero and 1-to-many matches, facilitated by performing alignment prior to training. We consider all three settings reported in Schnober et al. (2016) and report the results for the best one: the order-5 model using 6-grams within a window of 6. Henceforth, this model is referred to as PCRF-seq2seq (also referred to as PCRF interchangeably).

Encoder-Decoder Models: For the seq2seq model (Sutskever et al., 2014), we use 3 stacked layers of LSTMs each at the encoder and the decoder. Each layer has 128 dimensions, and weighted cross-entropy is used as the loss. We also add residual connections between the layers in a stack (Wu et al., 2016). To further capture the entire input context when making each prediction at the output, we make use of attention (Bahdanau et al., 2015), specifically Luong's attention mechanism (Luong et al., 2015). We experiment with two variants: EncDec+Char uses character-level embeddings and EncDec+BPE uses embeddings with BPE.

CopyNet+BPE: The model discussed in §2. We use CopyNet+BPE and CopyNet interchangeably throughout the paper.

Table 3 shows the results for all the competing systems based on the predictions on OCRTest. CopyNet performs the best among the competing systems across all three metrics and on both source texts.
For the Gītā dataset, CopyNet and PCRF-seq2seq report similar performance. However, the Sahaśranāma is the noisier dataset, and there we find that CopyNet outperforms all other models by a huge margin: its WRR is double that of the next best system (EncDec) on this dataset.

System performances for various input lengths:
From Figure 4a, it can be observed that the CRR of CopyNet and PCRF is robust across all input lengths on strings from the Gītā and never drops below 90%. For the Sahaśranāma, as shown in Figure 4b, CopyNet outperforms PCRF on inputs of all lengths except one setting. In terms of WRR, CopyNet is the best-performing model across all lengths, as shown in Figure 4d.

Error type analysis: In Table 5, we analyse the reduction in specific error types for PCRF and CopyNet after aligning the predicted string with the ground truth, in terms of insertions, deletions and substitutions. We also report system-induced errors, where a correct component in the input (the OCR output) is mispredicted to a wrong output by the model. CopyNet outperforms PCRF in correcting the errors, and it also introduces fewer errors of its own. Both CopyNet and PCRF (Schnober et al., 2016) are seq2seq models and can handle varying-length input and output. Both systems perform well in handling substitution errors, the type which dominates the strings in OCRTest, though neither system was able to correct the insertion errors. Insertion can be seen as a special case of 1-to-many matches, which both systems are in principle capable of handling. For the Sahaśranāma, CopyNet corrects about 17.24% of the deletion errors, as against less than 5% corrected by PCRF.
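The per-type error counts come from a character-level alignment of prediction and ground truth; a sketch of such a count via a Levenshtein backtrace (one standard way to do this, not necessarily the paper's exact alignment tool):

```python
def error_types(gt, pred):
    """Return (insertions, deletions, substitutions) needed to turn
    `pred` into `gt`, read off a Levenshtein alignment backtrace."""
    n, m = len(gt), len(pred)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,       # deletion from gt
                          d[i][j - 1] + 1,       # insertion into pred
                          d[i - 1][j - 1] + (gt[i - 1] != pred[j - 1]))
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                d[i][j] == d[i - 1][j - 1] + (gt[i - 1] != pred[j - 1]):
            if gt[i - 1] != pred[j - 1]:
                sub += 1
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ins += 1
            j -= 1
        else:
            dele += 1
            i -= 1
    return ins, dele, sub
```

Running this before and after post-OCR correction gives the per-type reductions reported in Table 5.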
Since 14 graphemes of the Sanskrit alphabet are not present in the English alphabet, all 14 of them get substituted with a different grapheme by the OCR. While most of them get substituted with an orthographically similar character, such as ā → a and ḥ → h, we find that ñ → i does not fit this scheme, as shown in Figure 2. In the majority of cases, CopyNet predicts the correct grapheme, but PCRF still fails to correct the OCR-induced confusion ñ → i in the majority of instances. Additionally, we find that PCRF introduces errors of its own; for instance, it often mispredicts p → s. Figure 5 shows the overall variations in both systems as compared to Figure 2 for OCR-induced errors.
Copy or generate? For the 14 graphemes missing at the encoder (input) side but present at the decoder side during training, predictions have to happen with high generate probability. We find that for such instances, not only is the average generate probability high, but the copy probability is also extremely low. For the remaining cases, we find that both the generate and the copy probability are high. Note that a prediction is generally made by summing both distributions, and the two distributions are not complementary to each other. A similar trend can be observed in Figure 6. For example, in the case of a → ā, only the generate probability is high, but for a → a, both the copy and the generate probability scores are high.

Effect of BPE and alphabet in the vocabulary
We further investigate the effect of our vocabulary, which is the union of the alphabet of Romanised Sanskrit and the vocabulary learnt using BPE. We train the model with only the alphabet as vocabulary and find the CRR and WRR on the combined test sentences to be 86.1% and 66.09%, respectively. When using only the original BPE vocabulary, we obtain a CRR and WRR of 89.53% and 68.11%, respectively, a slight difference from the current vocabulary. Overall, we find that the current combined setting performs best (please refer to §5 of the Supplementary Material for the performance table).

Performance comparison to Google OCR: Google OCR is probably the only other available OCR that can handle Romanised Sanskrit. We could not find the architecture of the OCR, or whether the service employs post-OCR text correction. We empirically compare the performance of Google OCR on OCRTest with our model; Table 4 shows the results for Google OCR. Overall, we find that CopyNet outperforms Google OCR across all metrics. Google OCR reports a CRR on the Gītā similar to ours, but still a lower WRR. Google OCR performs better than PCRF on all metrics other than CRR on the Gītā.
Image quality: Our training set was generated at an image quality of 60 DPI. For a sample of 500 images, we generated images corresponding to strings in OCRTrain with DPI from 50 to 300 in steps of 50, using the noise settings shown in Figure 3. The OCR output for these strings remained identical to that of the images generated at 60 DPI. This experiment can be seen as a proxy for evaluating the robustness of the OCR to various scan qualities of the input. Our choice of 60 DPI was based on the lowest setting we observed in digitisation attempts on Sanskrit texts.
Effect of adding distortions to the synthetically generated images: Table 3 shows the system performance after training our model on data generated following the procedure in Section 3.2. This makes the implicit assumption that we have access to a sample of textline images annotated with the corresponding text from the manuscript for which post-OCR text correction needs to be performed, and it mandates retraining the model for every new manuscript. We therefore attempted a more generalised version of our model, using training data whose image-generation settings are not inspired by the target manuscript. Using the settings from Chen et al. (2014) for inducing noise, we generated 10 random noise configurations. Here the step sizes were fixed so that each parameter, except erosion (E), can assume 5 values uniformly spread across the corresponding ranges considered. From a total of 2,500 (5×5×5×5×4) configurations, 10 random settings were chosen, and every textline was generated in each of the 10 settings. We also experimented with a setting where no noise was added to the synthetically generated images before they were fed to the OCR. There we obtained a CRR of 80.12% from the OCR, where the errors arose mostly from the missing graphemes being mispredicted to different graphemes. After training on the text so generated, CopyNet reported a CRR of 86.81% (96.01% for the Gītā, 75.78% for the Sahaśranāma) on the test data.
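The random configuration sampling described above can be sketched as follows. The parameter names and value grids here are hypothetical placeholders; the real names and ranges are those of Table 2 (gamma correction, noise, erosion kernel, perspective distortion).

```python
import itertools
import random

# Hypothetical value grids: four parameters with 5 uniformly spread values
# each, plus erosion with 4, giving 5*5*5*5*4 = 2500 configurations.
grids = {
    "gamma": [0.6, 0.8, 1.0, 1.2, 1.4],
    "gaussian_noise": [0.00, 0.05, 0.10, 0.15, 0.20],
    "salt_pepper_noise": [0.00, 0.05, 0.10, 0.15, 0.20],
    "perspective": [0.00, 0.01, 0.02, 0.03, 0.04],
    "erosion_kernel": [1, 2, 3, 4],
}

def sample_configs(grids, k, seed=0):
    """Enumerate the full parameter grid and draw k random configurations."""
    keys = list(grids)
    all_configs = [dict(zip(keys, values))
                   for values in itertools.product(*(grids[key] for key in keys))]
    return random.Random(seed).sample(all_configs, k)
```

Each sampled configuration is then applied to every textline, so the generalised training set covers 10 distinct distortion settings per line.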
Human judgement survey: In this survey, we evaluate how often a human can recognise the correct construction by viewing only the prediction from one of the systems, and how fast a human can correct it (more details in §6 of the Supplementary Material). We selected 15 constructions from the Sahaśranāma and obtained the outputs from the OCR, CopyNet and PCRF for each of these. The average length of a sentence is 41.73 characters, all ranging between 23 and 47 characters. A respondent is shown a system prediction (system identity anonymised) and is asked to type the corrected string without referring to any sources. Each respondent receives 15 different strings, 5 from each of the three systems. We consider responses from 9 participants, all of whom have at least an undergraduate degree in Sanskrit linguistics. Altogether, across 3 sets of questionnaires, we have 45 strings (3 outputs for each given string), and every string obtained 3 impressions. We find that a participant could, on average, identify 4.44 sentences out of 5 from CopyNet, while it was only 3.56 for PCRF and 3.11 for the OCR output. The average time taken to complete the correction of a string was 81.4 seconds for CopyNet, 106.6 seconds for PCRF and 127.6 seconds for the OCR.

Conclusion
In this work, we proposed an OCR-based solution for digitising Romanised Sanskrit. Our work acts as a post-OCR text correction approach and is devoid of any OCR-specific feature engineering. We find that the use of a copying mechanism in an encoder-decoder model performs significantly better than other seq2seq models for the task. Our model outperforms the commercially available Google OCR on the Sahaśranāma texts, and from our experiments we find that CopyNet performs stably even on OCR outputs with a CRR as low as 36%. Our immediate research direction is to rectify insertion errors, which are currently not properly handled. Moreover, there are 135 languages which directly share the Roman alphabet, but only 35 of them have an OCR system available; our approach can easily be extended to provide post-processed OCR for those languages.