Learning Chomsky-like grammars for biological sequence families

Muggleton, SH; Bryant, CH; Srinivasan, A

Authors

SH Muggleton

Dr Chris Bryant C.H.Bryant@salford.ac.uk
Lecturer

A Srinivasan

Contributors

P Langley
Editor

Abstract

This paper presents a new method of measuring performance when positives are rare and investigates whether Chomsky-like grammar representations are useful for learning accurate, comprehensible predictors of members of biological sequence families. The positive-only learning framework of the Inductive Logic Programming (ILP) system CProgol is used to generate a grammar for recognising a class of proteins known as human neuropeptide precursors (NPPs). As far as these authors are aware, this is both the first biological grammar learnt using ILP and the first real-world scientific application of the positive-only learning framework of CProgol. Performance is measured using both predictive accuracy and a new cost function, Relative Advantage (RA). The RA results show that searching for NPPs by using our best NPP predictor as a filter is more than 100 times more efficient than randomly selecting proteins for synthesis and testing them for biological activity. The highest RA was achieved by a model which includes grammar-derived features. This RA is significantly higher than the best RA achieved without the use of the grammar-derived features.

Presentation Conference Type	Conference Paper (published)
Start Date	Jun 29, 2000
End Date	Jul 2, 2000
Publication Date	Jul 2, 2000
Deposit Date	Feb 16, 2009
Publicly Available Date	Feb 16, 2009
Pages	631-638
Book Title	Proceedings of the 17th International Conference on Machine Learning
ISBN	1-55860-707-2
Publisher URL	https://dl.acm.org/doi/10.5555/645529.658131
Additional Information	Event Type: Conference