Word-Level Alignment of Paper Documents with their Electronic Full-Text Counterparts

We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.


Introduction
Even though most research literature in the life sciences is born-digital nowadays, manual data curation (International Society for Biocuration, 2018) from these documents still often involves paper. For curation steps that require close reading and markup of relevant sections, curators frequently rely on paper printouts and highlighter pens (Venkatesan et al., 2019). Figure 1a shows a page of a typical document used for manual curation. The potential reasons for this can be as varied as merely sticking to a habit, ergonomic issues related to reading from and interacting with a device, and functional limitations of that device (Buchanan and Loizides, 2007; Köpper et al., 2016; Clinton, 2019). Whatever the reason, the consequence is a two-fold media break in many manual curation workflows: first from electronic format (either PDF or full-text XML) to paper, and then back from paper to the electronic format of the curation database. Given the above arguments in favor of paper-based curation, removing the first media break from the curation workflow does not seem feasible. Instead, we propose to bridge the gap between paper and electronic media by automatically creating an alignment between the words on the printed document pages and their counterparts in an electronic full-text version of the same document.
Our approach works as follows: We automatically create machine-readable versions of printed paper documents (which might or might not contain markup) by scanning them, applying optical character recognition (OCR), and converting the resulting semi-structured OCR output text into a flexible XML format for further processing. For this, we use the multilevel XML format of the annotation tool MMAX2 (Müller and Strube, 2006). We retrieve electronic full-text counterparts of the scanned paper documents from PubMedCentral® in .nxml format, and also convert them into MMAX2 format. By using a shared XML format for the two heterogeneous text sources, we can capture their content and structural information in a way that provides a compatible, though often not identical, word-level tokenization. Finally, using a sequence alignment algorithm from bioinformatics and some pre- and post-processing, we create a word-level alignment of both documents.

Aligning words from OCR and full-text documents is challenging for several reasons. The OCR output contains various types of recognition errors, many of which involve special symbols, Greek letters like µ, or sub- and superscript characters and numbers, which are particularly frequent in chemical names, formulae, and measurement units, and which are notoriously difficult for OCR (Ohyama et al., 2019). If the printed paper document is based on PDF, it usually has an explicit page layout, which is different from the way the corresponding full-text XML document is displayed in a web browser. Differences include double- vs. single-column layout, but also the way in which tables and figures are rendered and positioned. Finally, printed papers might contain additional content in headers or footers (e.g. download timestamps). Also, while the references/bibliography section is an integral part of a printed paper and will be covered by OCR, in XML documents it is often structurally kept apart from the actual document text.

Given these challenges, attempting data extraction from document images when the documents are already available in PDF or even full-text format may seem unreasonable. We see, however, the following useful applications:

1. Manual Database Curation: As mentioned above, manual database curation requires the extraction, normalization, and database insertion of scientific content, often from paper documents. Given a paper document in which a human expert curator has manually marked a word or sequence of words for insertion into the database, having a link from these words to their electronic counterparts can eliminate or at least reduce error-prone and time-consuming steps like manual re-keying. In addition, already existing annotations of the electronic full-text would be accessible and could be used to inform the curation decision or to supplement the database entry.

2. Automatic PDF Highlighting for Manual Triage: Database curation candidate papers are identified by a process called document triage (Buchanan and Loizides, 2007; Hirschman et al., 2012) which, despite some attempts towards automation (e.g. Wang et al. (2020)), remains a mostly manual process. In a nutshell, triage normally involves querying a literature database (like PubMed) for specific terms, skimming the list of search results, selecting and skim-reading some papers, and finally downloading and printing the PDF versions of the most promising ones for curation (Venkatesan et al., 2019). Here, the switch from searching the electronic full-text (or abstract) to printing the PDF brings about a loss of information, because the terms that caused the paper to be retrieved have to be located again in the print-out. A word-level alignment between the full-text and the PDF version would make it possible to create an enhanced version of the PDF with highlighted search term occurrences before printing.

3. Biomedical Expression OCR: Current state-of-the-art OCR systems are very accurate at recognizing standard text using Latin script and baseline typography, but, as already mentioned, they are less reliable for more typographically complex expressions like chemical formulae. In order to develop specialized OCR systems for these types of expressions, ground-truth data is required in which image regions containing these expressions are labelled with the correct characters and their positional information (see also Section 5). If aligned documents are available, this type of data can easily be created at a large scale.
The remainder of this paper is structured as follows. In Section 2, we describe our data set and how it was converted into the shared XML format. Section 3 deals with the actual alignment procedure, including a description of the optional pre- and post-processing measures. In Section 4, we present experiments in which we evaluate the performance of the implemented procedure, including an ablation of the effects of the individual pre- and post-processing measures. Quantitative evaluation alone, however, does not convey a realistic idea of the actual usefulness of the procedure, which ultimately needs to be evaluated in the context of real applications including, but not limited to, database curation. Section 4.2, therefore, briefly presents examples of the alignment and highlighting detection functionality and the biomedical expression OCR use case mentioned above. Section 5 discusses relevant related work, and Section 6 summarizes and concludes the paper with some future work. All the tools and libraries we use are freely available.

Data
For the alignment of a paper document with its electronic full-text counterpart, what is minimally required is an image of every page of the document, and a full-text XML file of the same document. The document images can either be created by scanning or by directly converting the corresponding PDF into an image. The latter method will probably yield images of a better quality, because it completely avoids the physical printing and subsequent scanning step, while the output of the former method will be more realistic. We experiment with both types of images (see Section 2.1). We identify a document by its DOI, and refer to the different versions as DOI xml (from the full-text XML), DOI conv , and DOI scan . Whenever a distinction between DOI conv and DOI scan is not required, we refer to these versions collectively as DOI ocr . Printable PDF documents and their associated .nxml files are readily available at PMC-OAI. In our case, however, printed paper versions were already available, as we have access to a collection of more than 6,000 printed scientific papers (approx. 30,000 pages in total) that were created in the SABIO-RK Biochemical Reaction Kinetics Database project (Wittig et al., 2017, 2018). These papers contain manual highlighter markup at different levels of granularity, including the word, line, and section level. Transferring this type of markup from printed paper to the electronic medium is one of the key applications of our alignment procedure. Our paper collection spans many publication years and venues. For our experiments, however, it was required that each document was freely available both as PubMedCentral® full-text XML and as PDF. While this requirement leaves only a fraction of the collection (currently 68 papers), this is still sufficient to demonstrate the feasibility of our procedure. Even more importantly, the procedure is unsupervised, i.e. it does not involve learning and does not require any training data.

Document Image to Multilevel XML
Since we want to compare downstream effects of input images of different quality, we created both a converted and a scanned image version for every document in our data set. For the DOI conv version, we used pdftocairo to create a high-resolution (600 DPI) PNG file for every PDF page. Figure 1c shows an example. The DOI scan versions, on the other hand, were extracted from 'sandwich' PDFs which had been created earlier by a professional scanning service provider. The choice of a service provider for this task was only motivated by the large number of pages to process, and not by expected quality or other considerations. A sandwich PDF contains, among other data, the document plain text (as recognized by the provider's OCR software) and a background image for each page. This background image is a by-product of the OCR process in which pixels that were recognized as parts of a character are inpainted, i.e. removed by being overwritten with colors of neighbouring regions. Figure 1b shows the background image corresponding to the page in Figure 1a. Note how the image retains the highlighting. We used pdfimages to extract the background images (72 DPI) from the sandwich PDFs for use in highlighting extraction (see Section 2.1.1 below). We refer to these versions as DOI scan_bg . For the actual DOI scan versions, we again used pdftocairo to create a high-resolution (600 DPI) PNG file for every scanned page. OCR was then performed on the DOI conv and the DOI scan versions with tesseract 4.1.1 (https://github.com/tesseract-ocr/tesseract), using default recognition settings (-oem 3 -psm 3) and specifying hOCR (http://kba.cloud/hocr-spec/1.2/) with character-level bounding boxes as the output format. In order to maximize recognition accuracy (at the expense of processing speed), the default language models for English were replaced with optimized LSTM models (https://github.com/tesseract-ocr/tessdata_best). No other modification or re-training of tesseract was performed. In a final step, the hOCR output from both image versions was converted into the MMAX2 (Müller and Strube, 2006) multilevel XML annotation format, using words as tokenization granularity, and storing word- and character-level confidence scores and bounding boxes as MMAX2 attributes.
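To illustrate the conversion step, the following Python sketch parses a tesseract hOCR file into word-level tokens with bounding boxes and confidence scores. It is not the converter used in this work, but relies only on standard hOCR markup (ocrx_word spans with bbox and x_wconf fields); the file name and the example values in the final comment are placeholders.

import re
from lxml import html

def hocr_words(hocr_path):
    """Parse an hOCR file into (token, bbox, confidence) tuples."""
    tree = html.parse(hocr_path)
    words = []
    for span in tree.xpath('//span[@class="ocrx_word"]'):
        token = span.text_content().strip()
        title = span.get("title", "")
        bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
        conf = re.search(r"x_wconf (\d+)", title)
        words.append((
            token,
            tuple(map(int, bbox.groups())) if bbox else None,
            int(conf.group(1)) if conf else None,
        ))
    return words

# e.g. hocr_words("page_01.hocr")[:1]
# -> [('Abstract', (210, 180, 340, 210), 96)]   (illustrative values only)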

Highlighting Detection
Highlighting detection and subsequent extraction can be performed if the scanned paper documents contain manual markup. In its current state, the detection procedure described in the following requires inpainted OCR background images which, in our case, were produced by the third-party OCR software used by the scanning service provider. tesseract, on the other hand, does not produce these images. While it would be desirable to employ free software only, this fact does not severely limit the usefulness of our procedure, because 1) other software (either free or commercial) with the same functionality might exist, and 2) even for document collections of medium size, employing an external service provider might be the most economical solution, even in academic/research settings. What is more, inpainted backgrounds are only required if highlighting detection is desired: For text-only alignment, plain scans are sufficient.

The actual highlighting extraction works as follows (see Müller et al. (2020) for details): Since document highlighting comes mostly in strong colors, which are characterized by large differences among their three component values in the RGB color model, we create a binarized version of each page by going over all pixels in the background image and setting each pixel to 1 if the pairwise differences between the R, G, and B components are above a certain threshold (50), and to 0 otherwise. This yields an image with regions of higher and lower density of black pixels. In the final step, we iterate over the word-level tokens created from the hOCR output and converted into MMAX2 format earlier, compute for each word its degree of highlighting as the percentage of black pixels in the word's bounding box, and store that percentage value as another MMAX2 attribute if it is at least 50%. An example result will be presented in Section 4.2.
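The core of the detection can be sketched in a few lines of Python. The sketch below reads the threshold as requiring the largest pairwise R/G/B difference of a pixel to exceed 50 (one plausible reading of the criterion above); the function names are ours, and word bounding boxes from the 600 DPI hOCR output would first have to be scaled down to the 72 DPI background image.

import numpy as np
from PIL import Image

DIFF_THRESHOLD = 50   # minimum pairwise R/G/B difference for a 'strong colour' pixel
MIN_COVERAGE = 0.5    # minimum fraction of marked pixels inside a word's bounding box

def highlight_mask(background_png):
    """Binarize an inpainted background image: 1 = strongly coloured pixel."""
    rgb = np.asarray(Image.open(background_png).convert("RGB")).astype(int)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    diff = np.maximum.reduce([abs(r - g), abs(r - b), abs(g - b)])
    return (diff > DIFF_THRESHOLD).astype(np.uint8)

def word_highlighting(mask, bbox):
    """Fraction of marked pixels inside a word's (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = bbox
    region = mask[y1:y2, x1:x2]
    return float(region.mean()) if region.size else 0.0

A word token is then stored with a highlighting attribute if word_highlighting(...) reaches MIN_COVERAGE.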

PMC® .nxml to Multilevel XML
The .nxml format employed for PubMedCentral® full-text documents uses the JATS scheme (https://jats.nlm.nih.gov/archiving/), which supports a rich metadata model, only a fraction of which is of interest for the current task. In principle, however, all information contained in JATS-conformant documents can also be represented in the multilevel XML format of MMAX2. The .nxml data provides precise information about both the textual content (including correctly encoded special characters) and its word- and section-level layout. At present, we only extract content from the <article-meta> section (<article-title>, <surname>, <given-names>, <xref>, <email>, <aff>, and <abstract>), and from the <body> (<sec>, <p>, <tr>, <td>, <label>, <caption>, and <title>). These sections cover the entire textual content of the document. We also extract the formatting tags <italic>, <bold>, <underline>, and in particular <sup> and <sub>. The latter two play a crucial role in chemical formulae and other domain-specific expressions. Converting the .nxml data to our MMAX2 format is straightforward. In some cases, the .nxml files contain embedded LaTeX code in <tex-math> tags. If this tag is encountered, its content is processed as follows: The LaTeX encodings for sub- and superscript, _{...} and ^{...}, are removed, and their content is extracted and re-inserted wrapped in JATS-conformant <sub>...</sub> and <sup>...</sup> elements. Then, the resulting LaTeX-like string is sent through the detex tool to remove any other markup. While this obviously cannot handle layouts like e.g. fractions, it still preserves many simpler expressions that would otherwise be lost in the conversion.
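A minimal version of this <tex-math> handling might look as follows; the regular expressions only cover simple, non-nested sub- and superscripts, and the call to the external detex tool assumes it is available on the PATH.

import re
import subprocess

def normalize_tex_math(tex):
    """Rewrite LaTeX sub-/superscripts as JATS tags, then strip other markup."""
    tex = re.sub(r"_\{([^}]*)\}", r"<sub>\1</sub>", tex)    # _{...}  -> <sub>...</sub>
    tex = re.sub(r"\^\{([^}]*)\}", r"<sup>\1</sup>", tex)   # ^{...}  -> <sup>...</sup>
    # Remove any remaining LaTeX markup with the external detex tool.
    return subprocess.run(
        ["detex"], input=tex, capture_output=True, text=True, check=True
    ).stdout.strip()

# e.g. normalize_tex_math(r"k_{obs} \approx 10^{3}") keeps the sub/sup structure
# and drops the remaining LaTeX commands.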

Outline of the Alignment Procedure
The actual word-level alignment of the DOI xml version with the DOI ocr version of a document operates on lists of <token, id> tuples which are created from each version's MMAX2 annotation. These lists are characterized by longer and shorter stretches of tuples with matching tokens, which just happen to start and end at different list indices. These stretches are interrupted at times by (usually shorter) sequences of tuples with non-matching tokens, which mostly exist as the result of OCR errors (see below). Larger distances between stretches of tuples with matching tokens, on the other hand, can be caused by structural differences between the DOI xml and the DOI ocr version, which can reflect actual layout differences, but which can also result from OCR errors like incorrectly joining two adjacent lines from two columns. The task of the alignment is to find the correct mapping on the token level for as many tuples as possible. We use the align.globalxx method from the Bio.pairwise2 module of Biopython (Cock et al., 2009), which provides pairwise sequence alignment using a dynamic programming algorithm (Needleman and Wunsch, 1970). While this library supports the definition of custom similarity functions for minimizing the alignment cost, we use the simplest version, which just applies a binary (=identity) matching scheme, i.e. full matches are scored as 1, all others as 0. This way, we keep full control of the alignment, and can identify and locally fix non-matching sequences during post-processing (cf. Section 3.2 below). The result of the alignment (after optional pre- and post-processing) is an n-to-m mapping between <token, id> tuples from the DOI xml and the DOI ocr version of the same document (see also the central part of Figure A.1 in the Appendix).
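A minimal sketch of the alignment call, with toy token lists standing in for the actual <token, id> tuple lists (the gap symbol and the token values are purely illustrative):

from Bio import pairwise2

xml_tokens = ["ZnCl2", "was", "added", "to", "the", "buffer"]
ocr_tokens = ["ZnC12", "was", "added", "to", "the", "buffer"]

# globalxx: identity scoring (match = 1, everything else = 0), no gap penalties.
# Sequences given as token lists require gap_char to be given as a list as well.
alignments = pairwise2.align.globalxx(
    xml_tokens, ocr_tokens, gap_char=["«GAP»"], one_alignment_only=True
)
aligned_xml, aligned_ocr, score, begin, end = alignments[0]

for x, o in zip(aligned_xml, aligned_ocr):
    marker = "" if x == o else "   <-- mismatch or gap"
    print(f"{x:>10}  |  {o:<10}{marker}")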

Pre-Processing
The main difference between pre- and post-processing is that the former operates on two still unrelated tuple lists of different lengths, while for the latter the tuple lists have the same length due to padding entries («GAP») that were inserted by the alignment algorithm in order to bridge sequences of non-alignable tokens. Pre-processing aims to smooth out trivial mismatches and thus to help the alignment. Both pre- and post-processing, however, only modify the tokens in DOI ocr , but never those in DOI xml , which are considered the gold standard.
Pre-compress matching sequences [pre_compress=p] The space complexity of the Needleman-Wunsch algorithm is O(mn), where m and n are the numbers of tuples in the two documents. Given the length of some documents, the memory consumption of the alignment can quickly become critical. In order to reduce the number of tuples to be compared, we apply a simple pre-compression step which first identifies sequences of p tuples (we use p = 20 in all experiments) with perfectly identical tokens in both documents, and then replaces them with single tuples whose token and id parts consist of concatenations of the individual tokens and ids. After the alignment, these compressed tuples are expanded again.
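One possible implementation of this step, sketched below, indexes all p-grams of OCR tokens and compresses an XML-side run only if the identical run occurs exactly once on the OCR side; the matching OCR run has to be compressed in the same way so that the two compressed tuples still match, and the actual implementation may resolve repeated or overlapping runs differently.

P = 20  # run length used for pre-compression

def compress_common_runs(xml_tuples, ocr_tuples, p=P):
    """Replace runs of p identical consecutive tokens occurring in both tuple
    lists with a single concatenated tuple (XML side shown; OCR side analogous)."""
    ocr_ngrams = {}
    for i in range(len(ocr_tuples) - p + 1):
        key = tuple(tok for tok, _ in ocr_tuples[i:i + p])
        ocr_ngrams.setdefault(key, []).append(i)

    out, i = [], 0
    while i < len(xml_tuples):
        key = tuple(tok for tok, _ in xml_tuples[i:i + p])
        if len(key) == p and len(ocr_ngrams.get(key, [])) == 1:
            toks, ids = zip(*xml_tuples[i:i + p])
            out.append((" ".join(toks), "+".join(ids)))   # compressed tuple
            i += p
        else:
            out.append(xml_tuples[i])
            i += 1
    return out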
While pre-compression was always performed, the pre- and post-processing measures described in the following are optional, and their individual effects on the alignment will be evaluated in Section 4.1.
De-hyphenate DOI ocr tokens [dehyp] Sometimes, words in the DOI ocr versions are hyphenated due to layout requirements which, in principle, do not exist in the DOI xml versions. These words appear as three consecutive tuples with either the '-' or '¬' token in the center tuple. For de-hyphenation, we search the tokens in the tuple list for DOI ocr for single hyphen characters and reconstruct the potential un-hyphenated word by concatenating the tokens immediately before and after the hyphen. If this word exists anywhere in the list of DOI xml tokens, we simply substitute the three original < token, id > DOI ocr tuples with one merged tuple. De-hyphenation (like all other pre-and post-processing measures) is completely lexicon-free, because the decision whether the unhyphenated word exists is only based on the content of the DOI xml document.
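A compact sketch of this step, operating on (token, id) tuples for DOI ocr and a plain token list for DOI xml (the id concatenation scheme is only illustrative):

def dehyphenate(ocr_tuples, xml_tokens):
    """Merge <w1> <-> <w2> into <w1w2> if the unhyphenated word occurs in DOI_xml."""
    xml_vocab = set(xml_tokens)            # the only evidence used: lexicon-free
    out, i = [], 0
    while i < len(ocr_tuples):
        if (
            i + 2 < len(ocr_tuples)
            and ocr_tuples[i + 1][0] in ("-", "¬")
            and ocr_tuples[i][0] + ocr_tuples[i + 2][0] in xml_vocab
        ):
            merged_token = ocr_tuples[i][0] + ocr_tuples[i + 2][0]
            merged_id = ocr_tuples[i][1] + "+" + ocr_tuples[i + 2][1]
            out.append((merged_token, merged_id))   # one tuple replaces three
            i += 3
        else:
            out.append(ocr_tuples[i])
            i += 1
    return out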
Diverging tokenizations in the DOI xml and DOI ocr document versions are a common cause of local mismatches. Assuming the tokenization in DOI xml to be correct, such mismatches can be fixed by either joining or splitting tokens in DOI ocr .

Join incorrectly split DOI ocr tokens [pre_join] We apply a simple rule to detect and join tokens that were incorrectly split in DOI ocr . We move a window of size 2 over the list of DOI ocr tuples and concatenate the two tokens. We then iterate over all tokens in the DOI xml version. If we find the reconstructed word in a matching context (one token immediately preceding and one immediately following), we replace, in the DOI ocr version, the first original tuple with the concatenated one, assigning the concatenated ID as the new ID, and remove the second tuple from the list. Consider the following example.
< phen, word_3084 > at position n and < yl, word_3085 > at position n+1 are replaced by < phenyl, word_3084+word_3085 > at position n. This process (and the following one) is repeated until no more modifications can be performed.

Split incorrectly joined DOI ocr tokens [pre_split] In a similar fashion, we identify and split incorrectly joined tokens. We move a window of size 2 over the list of DOI xml tuples, concatenate the two tokens, and try to locate a corresponding single token, in a matching context, in the list of DOI ocr tuples. If found, we replace the respective tuple in that list with two new tuples, one with the first token from the DOI xml tuple and one with the second. Both new tuples retain the ID from the original DOI ocr tuple. A typical example is an expression whose correct tokenization separates a trailing number 3 from the rest of the expression, because the 3 needs to be typeset in subscript in order for the formula to be rendered correctly.
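The join case can be sketched as follows (pre_split is the mirror image, with the window moving over the DOI xml tuples instead); the context check uses exactly one token on each side, as described above, and the ID concatenation is again only illustrative.

def pre_join(ocr_tuples, xml_tuples):
    """Join two adjacent OCR tokens if their concatenation occurs in DOI_xml
    with the same immediate left and right neighbours."""
    xml_tokens = [tok for tok, _ in xml_tuples]
    changed = True
    while changed:                                   # repeat until nothing changes
        changed = False
        for i in range(1, len(ocr_tuples) - 2):
            joined = ocr_tuples[i][0] + ocr_tuples[i + 1][0]
            left, right = ocr_tuples[i - 1][0], ocr_tuples[i + 2][0]
            for j in range(1, len(xml_tokens) - 1):
                if (
                    xml_tokens[j] == joined
                    and xml_tokens[j - 1] == left
                    and xml_tokens[j + 1] == right
                ):
                    merged_id = ocr_tuples[i][1] + "+" + ocr_tuples[i + 1][1]
                    ocr_tuples[i:i + 2] = [(joined, merged_id)]
                    changed = True
                    break
            if changed:
                break
    return ocr_tuples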

Alignment Post-Processing
Force-align [post_force_align] The most frequent post-processing step involves cases where single tokens of the same length and occurring in the same context are not aligned automatically. In the following example, the left column contains the DOI ocr and the right column the DOI xml tuples. Here, the β was not recognized correctly and was substituted with a B. We identify force-align candidates like these by looking for sequences of s consecutive tuples with a «GAP» token in one list, followed by a similar sequence of the same length in the other. Then, if both the context and the number of characters match, we force-align the two sequences.

< acid, word_1643 >      < acid, word_997 >
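Sketched below for already aligned tuple sequences, assuming gap positions have been normalized to («GAP», None) tuples; for brevity, the sketch only checks run length and character counts, whereas the full procedure also compares the surrounding context.

GAP = "«GAP»"

def force_align(ocr_seq, xml_seq):
    """Pair up a run of gaps on the OCR side with an equally long run of gaps
    on the XML side that immediately follows it, if the displaced tokens have
    the same total number of characters."""
    out, i, n = [], 0, len(ocr_seq)
    while i < n:
        s = 0                                    # length of OCR-side gap run at i
        while i + s < n and ocr_seq[i + s][0] == GAP and xml_seq[i + s][0] != GAP:
            s += 1
        j = i + s
        if s and all(
            j + k < n and xml_seq[j + k][0] == GAP and ocr_seq[j + k][0] != GAP
            for k in range(s)
        ):
            ocr_chunk, xml_chunk = ocr_seq[j:j + s], xml_seq[i:i + s]
            if sum(len(t) for t, _ in ocr_chunk) == sum(len(t) for t, _ in xml_chunk):
                out.extend(zip(ocr_chunk, xml_chunk))   # force-aligned pairs
                i = j + s
                continue
        out.append((ocr_seq[i], xml_seq[i]))
        i += 1
    return out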

Quantitative Evaluation
We evaluate the system on our 68 DOI xml -DOI ocr document pair data set by computing P, R, and F for the task of aligning tokens from DOI xml (the gold standard) to tokens in DOI ocr . By defining the evaluation task in this manner, we take into account that the DOI ocr version usually contains more tokens, mostly because it includes the bibliography, which is generally not included in the DOI xml version. Thus, an alignment is perfect if every token in DOI xml is correctly aligned to a token in DOI ocr , regardless of there being additional tokens in DOI ocr . In order to compute P and R, the number of correct alignments (=TP) among all alignments needs to be determined. Rather than inspecting and checking all alignments manually, we employ a simple heuristic: Given a pair of automatically aligned tokens, we create two KWIC string representations, KWIC xml and KWIC ocr , with a left and right context of 10 tokens each. Then, we compute the normalized Levenshtein similarity lsim between each pair ct1 and ct2 of left and right contexts, respectively, as lsim(ct1, ct2) = 1 − levdist(ct1, ct2) / max(len(ct1), len(ct2)). We count the alignment as correct (=TP) if lsim of both the two left and the two right contexts is >= .50, and as incorrect (=FP) otherwise. The number of missed alignments (=FN) can be computed by subtracting the number of TP from the number of all tokens in DOI xml . Based on these counts, we compute precision (P), recall (R), and F-score (F) in the standard way. A short code sketch of this heuristic is given at the end of this section.

Results are provided in Table 1. For each parameter setting (first column), there are two groups of result columns with P, R, and F each. The column group DOI xml -DOI conv contains alignment results for which OCR was performed on the converted PDF pages, while results in column group DOI xml -DOI scan are based on scanned print-outs. Differences between these two sets of results are due to the inferior quality of the images used in the latter. The top row in Table 1 contains the result of using only the alignment without any pre- or post-processing. Subsequent rows show results for all possible combinations of pre- and post-processing measures (cf. Section 3.1). Note that pre_split and pre_join are not evaluated separately and appear combined as pre. The first observation is that, for DOI xml -DOI conv and DOI xml -DOI scan , precision is very high, with maximum values of 95.04 and 93.59, respectively. This is a result of the rather strict alignment method which will align two tokens only if they are identical (rather than merely similar). At the same time, precision is very stable across experiments, i.e. indifferent to changes in pre- and post-processing. This is because, as described in Section 3.1, pre- and post-processing exclusively aim to improve recall by either smoothing out trivial mismatches before alignment, or adding missing alignments afterwards. In fact, pre- and post-processing actually introduce precision errors, since they relax this alignment condition somewhat: the two top precision scores result from the setup with no pre- or post-processing at all, and even though the differences across experiments are extremely small, the pattern is still clear. Table 1 also shows the intended positive effect of the different pre- and post-processing measures on recall. Without going into much detail, we can state the following: For DOI xml -DOI conv and DOI xml -DOI scan , the lowest recall results from the setup without pre- or post-processing.
When pre- and post-processing measures are added, recall increases consistently, at the expense of small drops in precision. However, the positive effect consistently outweighs the negative, causing the F-score to increase to maximum scores of 86.63 and 85.20, respectively, when all pre- and post-processing measures are used. Finally, as expected, the inferior quality of the data in DOI scan as compared to DOI conv is reflected in consistently lower scores across all measurements. The absolute differences, however, are very small, amounting to only about 1.5 points. This might be taken to indicate that converted (rather than printed and scanned) PDF documents can be functionally equivalent as input for tasks like OCR ground-truth data generation.
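The evaluation heuristic referred to above can be reproduced in a few lines of Python; levdist below is a plain dynamic-programming Levenshtein distance, and the contexts are assumed to be passed in as already-joined KWIC strings of 10 tokens each.

def levdist(a, b):
    """Standard Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lsim(ct1, ct2):
    """Normalized Levenshtein similarity between two context strings."""
    if not ct1 and not ct2:
        return 1.0
    return 1.0 - levdist(ct1, ct2) / max(len(ct1), len(ct2))

def is_true_positive(left_xml, right_xml, left_ocr, right_ocr, threshold=0.5):
    """An aligned token pair counts as correct if both context similarities
    reach the threshold."""
    return lsim(left_xml, left_ocr) >= threshold and lsim(right_xml, right_ocr) >= threshold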

Qualitative Evaluation and Examples
This section complements the quantitative evaluation with some illustrative examples. Figure 2 shows two screenshots in which DOI scan (left) and DOI xml (right) are displayed in the MMAX2 annotation tool. The left image shows that the off-the-shelf text recognition accuracy of tesseract is very good for standard text, but lacking, as expected, when it comes to recognizing special characters and subscripts (like µ, ZnCl₂, or k_obs in the example). For the highlighting detection, the yellow text background was chosen as visualization in MMAX2 in order to mimic the physical highlighting of the printed paper. Note that since the highlighting detection is based on layout position only (and not anchored to text), manually highlighted text is recognized as highlighted regardless of whether the actual underlying text is recognized correctly. The right image shows the rendering of the correct text extracted from the original PMC® full-text XML. The rendering of the title as bold and underlined is based on typographic information that was extracted at conversion time (cf. Section 2.2). The same is true for the subscripts, which are correctly rendered both in terms of content and position. Table 2 displays a different type of result, i.e. a small selection of a much larger set of OCR errors with their respective images and the correct recognition result. This data, automatically identified by the alignment post-processing, is a valuable resource for the development of biomedical expression OCR systems.

Related Work
The work in this paper is obviously related to automatic text alignment, with the difference that text alignment mostly deals with texts in different languages (i.e. bilingual alignment). Gale and Church (1993) align not words but entire sentences from two languages based on statistical properties. Even where words are aligned, alignment candidates in bilingual corpora cannot be identified on the basis of simple matching, with the exception of language-independent tokens such as proper names.
Scanning and OCR are also often applied to historical documents, which are only available on paper (Hill and Hengchen, 2019; van Strien et al., 2020; Schaefer and Neudecker, 2020). Here, OCR post-correction attempts to map words with word- and character-level OCR errors (similar to those found in our DOI ocr data) to their correct variants, but it does so by using general language models and dictionaries, and not an aligned correct version. Many of the above approaches have in common that they employ specialized OCR models and often ML/DL models of considerable complexity.
The idea of using an electronic and a paper version of the same document for creating a character-level alignment dates back at least to Kanungo and Haralick (1999), who worked on OCR ground-truth data generation. Like most later methods, the procedure of Kanungo and Haralick (1999) works on the graphical level, as opposed to the textual level. Kanungo and Haralick (1999) use LaTeX to create what they call 'ideal document images' with controlled content. Print-outs of these images are created, which are then photocopied and scanned, yielding slightly noisy and skewed variants of the 'ideal' images. Then, corresponding feature points in both images are identified, and a projective transformation between them is computed. Finally, the actual ground-truth data is generated by applying this transformation to align the bounding boxes in the ideal images to their correspondences in the scanned images. Since Kanungo and Haralick (1999) have full control over the content of their 'ideal document images', extracting the ground-truth character data is trivial. The approach of van Beusekom et al. (2008) is similar to that of Kanungo and Haralick (1999), but the former use more sophisticated methods, including Canny edge detection (Canny, 1986) for finding corresponding sections in images of the original and the scanned document, and RAST (Breuel, 2001) for doing the actual alignment. Another difference is that van Beusekom et al. (2008) use pre-existing PDF documents as the source documents from which ground-truth data is to be extracted. Interestingly, however, their experiments only use synthetic ground-truth data from the UW3 data set (http://tc11.cvc.uab.es/datasets/DFKI-TGT-2010_1), in which bounding boxes and the contained characters are explicitly encoded. In their conclusion, van Beusekom et al. (2008) concede that extracting ground-truth data from PDF is a non-trivial task in itself. Ahmed et al. (2016) work on automatic ground-truth data generation for camera-captured document images, which they claim pose different problems than document images created by scanning, e.g. blur, perspective distortion, and varying lighting. Their procedure, however, is similar to that of van Beusekom et al. (2008). They also use pre-existing PDF documents and automatically rendered 300 DPI images of these documents.

Conclusions
In this paper, we described a completely unsupervised procedure for automatically aligning printed paper documents with their electronic full-text counterparts. Our point of departure and main motivation was the idea to alleviate the effect of the paper-to-electronic media break in manual biocuration, where printed paper is still very popular when it comes to close reading and manual 15 http://tc11.cvc.uab.es/datasets/ DFKI-TGT-2010_1 markup. We also argued that the related task of document triage can benefit from the availability of alignments between electronic full-text documents (as retrieved from a literature database) and the corresponding PDF documents. Apart from this, we identified yet another field of application, biomedical expression OCR, which can benefit from ground-truth data which can automatically be generated with our procedure. Improvements in biomedical expression OCR, then, can feed back into the other use cases, by improving the OCR step and thus the alignment, thus potentially establishing a kind of bootstrapping development. Our implementation relies on tried and tested technology, including tesseract as off-the-shelf OCR component, Biopython for the alignment, and MMAX2 as visualization and data processing platform. The most computationally complex part is the actual sequence alignment with a dynamic programming algorithm from the Biopython library, which we keep tractable even for longer documents by using a simple pre-compression method. The main experimental finding of this paper is that our approach, although very simple, yields a level of performance that we consider suitable for practical applications. In quantitative terms, the procedure reaches a very good F-score of 86.63 on converted and 85.20 on printed and scanned PDF documents, with corresponding precision scores of 94.90 and 93.47, respectively. The negligible difference in results between the two types of images is interesting, as it seems to indicate that converted PDF documents, which are very easy to generate in large amounts, are almost equivalent to the more labour-intensive scans. In future work, we plan to implement solutions for the identified use cases, and to test them in actual biocuration settings. Also, we will start creating OCR ground-truth data at a larger scale, and apply that for the development of specialised tools for biomedical OCR. In the long run, procedures like the one presented in this paper might contribute to the development of systems that support curators to work in a more natural, practical, convenient, and efficient way. is extracted from .nxml documents. Each content token is associated with an alignment token (solid blue boxes). Bottom: Text and meta-data is extracted from the OCR result of scanned document pages. Meta-data includes bounding boxes, which link the recognized text to image regions, and numerical recognition scores, which reflect the confidence with which the OCR system recognized the respective token. (Not all meta-data is given in the