Dr Christian Clausner C.Clausner@salford.ac.uk
Senior Research Fellow
Prof Apostolos Antonacopoulos A.Antonacopoulos@salford.ac.uk
Professor
Mr Stefan Pletschacher S.Pletschacher@salford.ac.uk
Lecturer
We present an efficient and effective approach to training OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results validate the training approach on two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
Clausner, C., Antonacopoulos, A., & Pletschacher, S. (2020). Efficient and effective OCR engine training. International Journal on Document Analysis and Recognition, 23(1), 73-78. https://doi.org/10.1007/s10032-019-00347-8
Journal Article Type | Article |
---|---|
Acceptance Date | Oct 10, 2019 |
Online Publication Date | Oct 30, 2019 |
Publication Date | Mar 1, 2020 |
Deposit Date | Oct 14, 2019 |
Publicly Available Date | Nov 1, 2019 |
Journal | International Journal on Document Analysis and Recognition |
Print ISSN | 1433-2833 |
Electronic ISSN | 1433-2825 |
Publisher | Springer Verlag |
Volume | 23 |
Issue | 1 |
Pages | 73-78 |
DOI | https://doi.org/10.1007/s10032-019-00347-8 |
Publisher URL | https://doi.org/10.1007/s10032-019-00347-8 |
Related Public URLs | https://link.springer.com/journal/10032 |
Clausner2019_Article_EfficientAndEffectiveOCREngine.pdf
(2 Mb)
PDF
Licence
http://creativecommons.org/licenses/by/4.0/
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/