Efficient and effective OCR engine training

Clausner, C; Antonacopoulos, A; Pletschacher, S

doi:10.1007/s10032-019-00347-8

Authors

Dr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow

Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor

S Pletschacher

Abstract

We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.

Citation

Clausner, C., Antonacopoulos, A., & Pletschacher, S. (2020). Efficient and effective OCR engine training. International Journal on Document Analysis and Recognition, 23(1), 73-78. https://doi.org/10.1007/s10032-019-00347-8

Journal Article Type	Article
Acceptance Date	Oct 10, 2019
Online Publication Date	Oct 30, 2019
Publication Date	Mar 1, 2020
Deposit Date	Oct 14, 2019
Publicly Available Date	Nov 1, 2019
Journal	International Journal on Document Analysis and Recognition
Print ISSN	1433-2833
Electronic ISSN	1433-2825
Publisher	Springer Verlag
Volume	23
Issue	1
Pages	73-78
DOI	https://doi.org/10.1007/s10032-019-00347-8
Publisher URL	https://doi.org/10.1007/s10032-019-00347-8
Related Public URLs	https://link.springer.com/journal/10032