A novel domain independent scene text localizer

Roy, Ayush; Palaiahnakote, Shivakumara; Pal, Umapada; Liu, Cheng-Lin

doi:10.1016/j.patcog.2024.111015

Authors

Ayush Roy

Dr Shivakumara Palaiahnakote S.Palaiahnakote@salford.ac.uk
Lecturer in Computer Vision

Umapada Pal

Cheng-Lin Liu

Abstract

Text localization across multiple domains is crucial for applications such as autonomous driving and tracking marathon runners. This work introduces DIPCYT, a novel model that combines Domain Independent Partial Convolution with a Yolov5-based Transformer to localize text in scene images from several domains, including natural scenes, underwater images, and drone images. Each domain presents its own challenges: underwater images are degraded and of poor quality, drone images contain tiny text whose shapes are lost, and natural scene images contain arbitrarily oriented and arbitrarily shaped text. In addition, license plates in drone images may carry less semantic information than other text types because contextual information between characters is lost. To tackle these challenges, DIPCYT employs new partial convolution layers within Yolov5 and integrates Transformer detection heads with a novel Fourier Positional Convolutional Block Attention Module (FPCBAM). This approach exploits text properties that are common across domains, namely contextual (global) and spatial (local) relationships. Experimental results show that DIPCYT outperforms existing methods, achieving F-scores of 0.90, 0.90, 0.77, 0.85, 0.85, and 0.88 on the Total-Text, ICDAR 2015, ICDAR 2019 MLT, CTW1500, Drone, and Underwater datasets, respectively.
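
For readers unfamiliar with the metric, the F-scores quoted in the abstract are presumably the usual harmonic mean of detection precision and recall; the abstract does not define the metric, so this is an assumption based on common practice in scene text detection. A minimal Python sketch follows. The function name `f_score` and the precision/recall values are hypothetical illustrations and are not taken from the paper.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (the F1 score).

    Assumes the standard F1 definition; the paper's exact evaluation
    protocol may differ per dataset.
    """
    if precision + recall == 0:
        # Avoid division by zero when nothing was detected correctly.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical precision/recall pair, chosen only for illustration.
example = f_score(0.92, 0.88)
print(f"{example:.2f}")
```

Because the harmonic mean is pulled towards the lower of the two values, a detector cannot reach a high F-score by maximizing precision or recall alone.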

Journal Article Type	Article
Acceptance Date	Sep 10, 2024
Online Publication Date	Sep 15, 2024
Publication Date	Sep 18, 2024
Deposit Date	Sep 12, 2024
Publicly Available Date	Sep 23, 2024
Journal	Pattern Recognition
Print ISSN	0031-3203
Publisher	Elsevier
Peer Reviewed	Peer Reviewed
Volume	158
Article Number	111015
DOI	https://doi.org/10.1016/j.patcog.2024.111015
Keywords	Scene text detection; Transformer; Attention module; Drone images; Underwater images