Quality prediction system for large-scale digitisation workflows

Authors

Clausner, C; Pletschacher, S; Antonacopoulos, Apostolos

Abstract

The feasibility of large-scale OCR projects can so far only be assessed by running pilot studies on subsets of the target document collections and measuring the success of different workflows based on precise ground truth, which can be very costly to produce in the required volume. The premise of this paper is that, as an alternative, quality prediction may be used to approximate the success of a given OCR workflow. A new system is thus presented where a classifier is trained using metadata, image and layout features in combination with measured success rates (based on minimal ground truth). Subsequently, only document images are required as input for the numeric prediction of the quality score (no ground truth required). This way, the system can be applied to any number of similar (unseen) documents in order to assess their suitability for being processed using the particular workflow. The usefulness of the system has been validated using a realistic dataset of historical newspaper pages.
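The train-then-predict flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the classifier or the exact features, so a simple k-nearest-neighbour average stands in for the trained model, and the three features (image contrast, skew, text-region fraction), the function names, and all numbers are hypothetical.

```python
from math import dist

def train(samples):
    """Store (feature_vector, measured_success_rate) pairs.

    k-NN is a lazy learner, so "training" only keeps the labelled pages.
    The success rates would come from the minimal ground truth.
    """
    return list(samples)

def predict_quality(model, features, k=3):
    """Predict a quality score for an unseen page from its features alone."""
    # Rank the labelled pages by Euclidean distance to the new page.
    nearest = sorted(model, key=lambda s: dist(s[0], features))[:k]
    # Average the measured success rates of the k closest pages.
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical features: (image contrast, skew in degrees, text-region fraction)
training = [
    ((0.9, 0.1, 0.8), 0.95),
    ((0.8, 0.3, 0.7), 0.90),
    ((0.4, 2.0, 0.5), 0.60),
    ((0.3, 2.5, 0.4), 0.50),
]

model = train(training)
# No ground truth is needed at prediction time, only the page's features.
clean_page_score = predict_quality(model, (0.85, 0.2, 0.75), k=2)
poor_page_score = predict_quality(model, (0.35, 2.2, 0.45), k=2)
```

In this toy data the page resembling the clean training pages receives a higher predicted score than the page resembling the degraded ones, which is the kind of signal that could be used to judge whether a collection suits a given workflow.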

Citation

Clausner, C., Pletschacher, S., & Antonacopoulos, A. (2016). Quality prediction system for large-scale digitisation workflows. In 2016 12th IAPR Workshop on Document Analysis Systems (DAS) (pp. 138-143). https://doi.org/10.1109/das.2016.82

Conference Name 2016 12th IAPR Workshop on Document Analysis Systems (DAS)
Conference Location Santorini, Greece
Start Date Apr 11, 2016
End Date Apr 14, 2016
Acceptance Date Dec 14, 2015
Publication Date Jun 13, 2016
Deposit Date Mar 22, 2016
Volume 2016
Pages 138-143
DOI https://doi.org/10.1109/das.2016.82
Publisher URL http://dx.doi.org/10.1109/das.2016.82
Related Public URLs http://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=7485953
Additional Information This work has been funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380)