The ENP image and ground truth dataset of historical newspapers

Clausner, C; Papadopoulos, C; Pletschacher, S; Antonacopoulos, Apostolos

Abstract

This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage, drawn from the digitisation projects of 12 national and major European libraries and created within the scope of the large-scale digitisation Europeana Newspapers Project (ENP). Every image is accompanied by comprehensive ground truth (Unicode-encoded full text, layout information with precise region outlines, type labels, and reading order) in PAGE format, together with searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. The second part gives a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) with regard to both text recognition and segmentation/layout analysis performance.

Citation

Clausner, C., Papadopoulos, C., Pletschacher, S., & Antonacopoulos, A. (2015). The ENP image and ground truth dataset of historical newspapers. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (pp. 931-935). IEEE-CPS. https://doi.org/10.1109/ICDAR.2015.7333898

Deposit Date: Mar 22, 2016
Pages: 931-935
Book Title: 2015 13th International Conference on Document Analysis and Recognition (ICDAR)
ISBN: 9781479918058
DOI: https://doi.org/10.1109/ICDAR.2015.7333898
Publisher URL: http://dx.doi.org/10.1109/ICDAR.2015.7333898
Additional Information: Funders: European Commission