Dr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow
S Pletschacher
Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor
The feasibility of large-scale OCR projects can so far only be assessed by running pilot studies on subsets of the target document collections and measuring the success of different workflows based on precise ground truth, which can be very costly to produce in the required volume. The premise of this paper is that, as an alternative, quality prediction may be used to approximate the success of a given OCR workflow. A new system is thus presented where a classifier is trained using metadata, image and layout features in combination with measured success rates (based on minimal ground truth). Subsequently, only document images are required as input for the numeric prediction of the quality score (no ground truth required). This way, the system can be applied to any number of similar (unseen) documents in order to assess their suitability for being processed using the particular workflow. The usefulness of the system has been validated using a realistic dataset of historical newspaper pages.
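The train-then-predict idea in the abstract can be illustrated with a minimal sketch. It is not the system described in the paper: the abstract does not name the classifier or the exact features, so a plain k-nearest-neighbour regressor stands in for the model, and the three features (image contrast, skew angle, text-region fraction), the function names `train` and `predict_quality`, and all numbers are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def train(pages):
    """Store (feature_vector, measured_score) pairs; k-NN needs no fitting step."""
    return [(tuple(features), float(score)) for features, score in pages]

def predict_quality(model, features, k=3):
    """Average the measured scores of the k training pages nearest to `features`."""
    nearest = sorted(model, key=lambda item: dist(item[0], features))[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical training set: each entry pairs page features with an OCR
# success rate that was measured against a small amount of ground truth.
# Assumed features: (image contrast, skew in degrees, text-region fraction).
# Real features would normally be scaled to comparable ranges first.
training_pages = [
    ((0.90, 0.2, 0.80), 0.95),
    ((0.85, 0.5, 0.75), 0.91),
    ((0.40, 3.0, 0.60), 0.55),
    ((0.35, 2.5, 0.50), 0.48),
]

model = train(training_pages)

# An unseen page: only features derived from the image, no ground truth.
unseen_page = (0.88, 0.3, 0.78)
estimate = predict_quality(model, unseen_page, k=2)
```

In this sketch, ground truth is needed only for the few training pages; any number of similar unseen pages can then be scored from their features alone, which mirrors the workflow the abstract outlines.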
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | 2016 12th IAPR Workshop on Document Analysis Systems (DAS) |
Start Date | Aug 11, 2016 |
End Date | Aug 14, 2016 |
Acceptance Date | Dec 14, 2015 |
Publication Date | Jun 13, 2016 |
Deposit Date | Mar 22, 2016 |
Volume | 2016 |
Pages | 138-143 |
DOI | https://doi.org/10.1109/das.2016.82 |
Publisher URL | http://dx.doi.org/10.1109/das.2016.82 |
Related Public URLs | http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7485953 |
Additional Information | This work has been funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380). |