Encoding of Digitised Documents

Pletschacher, Stefan

Authors

Stefan Pletschacher



Abstract

This work presents solutions to the multifaceted problem of document encoding in the context of digitisation processes. It addresses aspects of structure and layout at page level as well as mechanisms for representing the underlying textual information.
The main contribution of this work is a format specification for page content, embedded in the PAGE (Page Analysis and Ground truth Elements) format framework. This page content format specification has seen wide uptake in the community and reached such popularity that it is now commonly referred to by the name of the whole framework – as PAGE or PAGE XML. The specification is described in detail from its inception through multiple iterations of improvements to the very mature state it has evolved into. A discussion of success factors documents the development of an entire infrastructure surrounding PAGE as well as its influence on other formats and the impact it has had on the research landscape.
The second, smaller contribution concerns the problem of encoding unstandardised characters that are practically unrecognisable by current off-the-shelf OCR (Optical Character Recognition) systems. A novel approach based on Adaptive Glyph Clustering is described; its distinguishing characteristic is that it can run without user intervention by estimating the required parameters automatically from the input. The usefulness of the approach is demonstrated through practical applications related to training data generation/semi-automatic OCR and faithful reproduction using document-specific alphabets and fonts.

Thesis Type: Thesis
Online Publication Date: Mar 27, 2025
Deposit Date: Feb 17, 2025
Publicly Available Date: Apr 28, 2025
Award Date: Mar 27, 2025
