The ENP image and ground truth dataset of historical newspapers
Clausner, C; Papadopoulos, C; Pletschacher, S; Antonacopoulos, Apostolos
Authors
Mr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow
C Papadopoulos
Mr Stefan Pletschacher S.Pletschacher@salford.ac.uk
Lecturer
Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor
Abstract
This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage from the digitisation projects of 12 national and major European libraries, created within the scope of the large-scale digitisation Europeana Newspapers Project (ENP). Every image is accompanied by comprehensive ground truth (Unicode encoded full-text, layout information with precise region outlines, type labels, and reading order) in PAGE format and searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. In the second part, a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) is given with regard to both text recognition and segmentation/layout analysis performance.
Citation
Clausner, C., Papadopoulos, C., Pletschacher, S., & Antonacopoulos, A. (2015). The ENP image and ground truth dataset of historical newspapers. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (931-935). IEEE-CPS. https://doi.org/10.1109/ICDAR.2015.7333898
Deposit Date | Mar 22, 2016 |
---|---|
Pages | 931-935 |
Book Title | 2015 13th International Conference on Document Analysis and Recognition (ICDAR) |
ISBN | 9781479918058 |
DOI | https://doi.org/10.1109/ICDAR.2015.7333898 |
Publisher URL | http://dx.doi.org/10.1109/ICDAR.2015.7333898 |
Additional Information | Funders : European Commission |