Christos Papadopoulos
The IMPACT dataset of historical document images
Papadopoulos, Christos; Pletschacher, Stefan; Clausner, Christian; Antonacopoulos, Apostolos
Authors
Dr Stefan Pletschacher S.Pletschacher@salford.ac.uk
Lecturer
Dr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow
Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor
Abstract
Representative and comprehensive datasets are a prerequisite for any research activity, from studying specific types of problems, through training algorithms, to evaluating the results of actual implementations. This paper describes an invaluable resource that is the result of a large-scale effort undertaken in the EU-funded project IMPACT. It describes a number of challenges faced during the creation phase, as well as the significant benefits and potential of this collection of printed historical documents. The dataset contains over 600,000 document images that originate from major European libraries and are representative of both their respective holdings and their digitisation plans for the near to medium term. It is unique in the very substantial amount of high-quality ground truth available for approximately 45,000 pages, capturing detailed layout, reading order and text content. The dataset is publicly available through the IMPACT Centre of Competence (www.digitisation.eu).
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | HIP '13: 2nd International Workshop on Historical Document Imaging and Processing |
Start Date | Aug 24, 2013 |
Online Publication Date | Aug 24, 2013 |
Publication Date | Aug 24, 2013 |
Deposit Date | Jun 26, 2025 |
Peer Reviewed | Peer Reviewed |
Pages | 123-130 |
Series Title | HIP: Historical Document Imaging and Processing conference proceedings |
Book Title | HIP '13: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing |
ISBN | 978-1-4503-2115-0 |
DOI | https://doi.org/10.1145/2501115.2501130 |