Efficient and effective OCR engine training

Clausner, C; Antonacopoulos, A; Pletschacher, S

Abstract

We present an efficient and effective approach to training OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. This comprehensive training and evaluation system, operated through a GUI, supports iterative, incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results validate the training approach on two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.

Citation

Clausner, C., Antonacopoulos, A., & Pletschacher, S. (2020). Efficient and effective OCR engine training. International Journal on Document Analysis and Recognition, 23(1), 73-78. https://doi.org/10.1007/s10032-019-00347-8

Journal Article Type: Article
Acceptance Date: Oct 10, 2019
Online Publication Date: Oct 30, 2019
Publication Date: Mar 1, 2020
Deposit Date: Oct 14, 2019
Publicly Available Date: Nov 1, 2019
Journal: International Journal on Document Analysis and Recognition
Print ISSN: 1433-2833
Electronic ISSN: 1433-2825
Publisher: Springer Verlag
Volume: 23
Issue: 1
Pages: 73-78
DOI: https://doi.org/10.1007/s10032-019-00347-8
Publisher URL: https://doi.org/10.1007/s10032-019-00347-8
Related Public URLs: https://link.springer.com/journal/10032
