Dr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow
Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor
S Pletschacher
We present an efficient and effective approach to training OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative, incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results validate the training approach on two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
Clausner, C., Antonacopoulos, A., & Pletschacher, S. (2020). Efficient and effective OCR engine training. International Journal on Document Analysis and Recognition, 23(1), 73-78. https://doi.org/10.1007/s10032-019-00347-8
Journal Article Type | Article |
---|---|
Acceptance Date | Oct 10, 2019 |
Online Publication Date | Oct 30, 2019 |
Publication Date | Mar 1, 2020 |
Deposit Date | Oct 14, 2019 |
Publicly Available Date | Nov 1, 2019 |
Journal | International Journal on Document Analysis and Recognition |
Print ISSN | 1433-2833 |
Electronic ISSN | 1433-2825 |
Publisher | Springer Verlag |
Volume | 23 |
Issue | 1 |
Pages | 73-78 |
DOI | https://doi.org/10.1007/s10032-019-00347-8 |
Publisher URL | https://doi.org/10.1007/s10032-019-00347-8 |
Related Public URLs | https://link.springer.com/journal/10032 |
Clausner2019_Article_EfficientAndEffectiveOCREngine.pdf
(2 Mb)
PDF
Licence
http://creativecommons.org/licenses/by/4.0/
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/
Highlights of the novel dewaterability estimation test (DET) device
(2019)
Journal Article
The ENP image and ground truth dataset of historical newspapers
Book Chapter
A survey of OCR evaluation tools and metrics
(2021)
Presentation / Conference Contribution