Efficient and effective OCR engine training
Clausner, C; Antonacopoulos, A; Pletschacher, S
Authors
Mr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow
Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor
Mr Stefan Pletschacher S.Pletschacher@salford.ac.uk
Lecturer
Abstract
We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve best results. The widely used Tesseract OCR engine is used as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
Citation
Clausner, C., Antonacopoulos, A., & Pletschacher, S. (2020). Efficient and effective OCR engine training. International Journal on Document Analysis and Recognition, 23(1), 73-78. https://doi.org/10.1007/s10032-019-00347-8
Journal Article Type | Article |
---|---|
Acceptance Date | Oct 10, 2019 |
Online Publication Date | Oct 30, 2019 |
Publication Date | Mar 1, 2020 |
Deposit Date | Oct 14, 2019 |
Publicly Available Date | Nov 1, 2019 |
Journal | International Journal on Document Analysis and Recognition |
Print ISSN | 1433-2833 |
Electronic ISSN | 1433-2825 |
Publisher | Springer Verlag |
Volume | 23 |
Issue | 1 |
Pages | 73-78 |
DOI | https://doi.org/10.1007/s10032-019-00347-8 |
Publisher URL | https://doi.org/10.1007/s10032-019-00347-8 |
Related Public URLs | https://link.springer.com/journal/10032 |
Files
Clausner2019_Article_EfficientAndEffectiveOCREngine.pdf
(2 Mb)
PDF
Licence
http://creativecommons.org/licenses/by/4.0/
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/