Research Repository

All Outputs (21)

Efficient and effective OCR engine training (2019)
Journal Article
Clausner, C., Antonacopoulos, A., & Pletschacher, S. (2020). Efficient and effective OCR engine training. International Journal on Document Analysis and Recognition, 23(1), 73-78. https://doi.org/10.1007/s10032-019-00347-8

We present an efficient and effective approach to train OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training proces...

Creating a complete workflow for digitising historical census documents : considerations and evaluation (2017)
Presentation / Conference Contribution

The 1961 Census of England and Wales was the first UK census to make use of computers. However, only bound volumes and microfilm copies of printouts remain, locking a wealth of information in a form that is practically unusable for research. In this...

Unearthing the recent past : digitising and understanding statistical information from census tables (2017)
Presentation / Conference Contribution

Censuses comprise a wealth of information at a large (national) scale that allow governments (who commission them) and the public to have a detailed snapshot of how people live (geographical distribution and characteristics). In addition to underpinn...

Effective geometric restoration of distorted historical documents for large-scale digitization (2017)
Journal Article
Yang, P., Antonacopoulos, A., Clausner, C., Pletschacher, S., & Qi, J. (2017). Effective geometric restoration of distorted historical documents for large-scale digitization. IET Image Processing, 11(10), 841-853. https://doi.org/10.1049/iet-ipr.2016.0973

Due to storage conditions and material’s non-planar shape, geometric distortion of the 2-D content is widely present in scanned document images. Effective geometric restoration of these distorted document images considerably increases character recog...

ICDAR2015 competition on recognition of documents with complex layouts - RDCL2015 (2015)
Book Chapter
Antonacopoulos, A., Clausner, C., Papadopoulos, C., & Pletschacher, S. (2015). ICDAR2015 competition on recognition of documents with complex layouts - RDCL2015. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR) (1151-1155). IEEE. https://doi.org/10.1109/ICDAR.2015.7333941

This paper presents an objective comparative evaluation of page segmentation and region classification methods for documents with complex layouts. It describes the competition (modus operandi, dataset and evaluation methodology) held in the context o...

Aletheia - An advanced document layout and text ground-truthing system for production environments (2011)
Presentation / Conference Contribution

Large-scale digitisation has led to a number of new possibilities with regard to adaptive and learning based methods in the field of Document Image Analysis and OCR. For ground truth production of large corpora, however, there is still a gap in terms...

A new framework for recognition of heavily degraded characters in historical typewritten documents based on semi-supervised clustering (2009)
Presentation / Conference Contribution

This paper presents a new semi-supervised clustering framework for the recognition of heavily degraded characters in historical typewritten documents, where off-the-shelf OCR typically fails. The constraints are generated using typographical (colle...