(apologies for cross-postings)
====
HIPE-OCRepair 2026 - Historical OCR Post-Correction Shared Task
Website: https://hipe-eval.github.io/HIPE-OCRepair-2026/
Task: LLM-Assisted OCR Post-Correction for Multilingual Historical Documents
Venue: ICDAR 2026 (31 Aug - 4 Sep 2026)
====
Data: https://github.com/hipe-eval/HIPE-OCRepair-2026-data
How-to: https://github.com/hipe-eval/HIPE-OCRepair-2026-data/blob/main/README-Pa...
Scorer: https://github.com/hipe-eval/HIPE-OCRepair-scorer/
====
We invite participation in HIPE-OCRepair 2026, the ICDAR 2026 Competition on LLM-Assisted OCR Post-Correction for Historical Documents.
Large-scale digitized historical collections still contain substantial OCR errors. Re-processing millions of pages with improved engines is rarely feasible, making post-correction the most viable strategy for addressing the OCR debt accumulated in digital heritage collections. Recent progress in large language models opens promising new directions, but their effectiveness varies across languages and error types, and they may introduce hallucinations or unintended alterations.
To what extent can modern large language models address the OCR debt accumulated in large-scale digitized historical collections?
HIPE-OCRepair 2026 addresses this question through HIPE-OCRepair-Bench, a unified multilingual benchmark comprising curated datasets, a standardised evaluation protocol, baseline systems, and an open leaderboard.
Task
Participants correct noisy OCR transcripts of historical documents without access to the original images. For each text chunk (typically a paragraph or article), the dataset provides:
- one OCR hypothesis
- document metadata (language, date, publication title)
- OCR quality indicators (CER, WER, lexicon-based quality score)
Systems must produce a corrected version of each text chunk. Both generative (LLM-based) and discriminative or hybrid approaches are welcome.
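For readers unfamiliar with the CER and WER indicators mentioned above, here is a minimal Python sketch of how character and word error rates are commonly computed from an edit distance. It is only an illustration, not the official scorer: the function names are hypothetical, and the real evaluation may apply text normalisation or tokenisation that differs from the plain whitespace split assumed here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate, assuming simple whitespace tokenisation."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

With the ground truth "the quick" and the OCR hypothesis "tbe quick", this sketch gives a CER of 1/9 (one substituted character out of nine) and a WER of 1/2 (one wrong word out of two). A correction system improves on the OCR when its output scores lower than the original hypothesis against the same ground truth.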
Data
The benchmark consists of parallel OCR and ground truth data drawn from multiple curated historical collections, covering English, French, and German materials, primarily from the 17th to the 20th century, including newspapers and printed works. It consolidates existing resources alongside newly curated materials.
Important dates
- 10 Dec 2025: Sample data release
- 02 Mar 2026: Training and development data release; scorer release
- 23 Mar 2026: Hugging Face leaderboard release
- 06-08 Apr 2026: Evaluation phase (test release & submission)
- 10 Apr 2026: Results publication
- 31 Aug-4 Sep 2026: Presentation at ICDAR 2026
HIPE-OCRepair addresses a central challenge for the document analysis, NLP, and digital humanities communities: improving the usability of large historical text collections at scale. It offers a reproducible evaluation framework, openly available data and tools, and a persistent leaderboard for ongoing benchmarking beyond the competition itself.
We look forward to your participation!
Best regards,
HIPE-OCRepair 2026 Organizers