A new unsupervised feature selection method for text clustering based on genetic algorithms

Shamsinejadbabki, P; Saraee, MH

doi:10.1007/s10844-011-0172-5

Authors

P Shamsinejadbabki

Prof Mo Saraee M.Saraee@salford.ac.uk
Interim Director of Computer Science

Abstract

Nowadays a vast amount of textual information is collected and stored in various databases around the world, including the Internet as the largest database of all. This rapid growth of published text means that even the most avid reader cannot hope to keep up with all the reading in a field, and consequently nuggets of insight or new knowledge are at risk of languishing undiscovered in the literature. Text mining offers a solution to this problem by replacing or supplementing the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. Text clustering is one of the most important areas in text mining; it includes text preprocessing, dimension reduction by selecting some terms (features), and finally clustering using the selected terms. Feature selection appears to be the most important step in the process. Conventional unsupervised feature selection methods define a measure of the discriminating power of terms to select proper terms from the corpus. However, the evaluation of terms in groups has not yet been investigated in reported work. In this paper a new and robust unsupervised feature selection approach is proposed that evaluates terms in groups. In addition, a new Modified Term Variance measure is proposed for evaluating groups of terms. Furthermore, a genetic-based algorithm is designed and implemented for finding the most valuable groups of terms based on the new measure. These terms are then used to generate the final feature vector for the clustering process. To evaluate and justify our approach, the proposed method and a conventional term variance method are implemented and tested using the Reuters-21578 corpus collection. For a more accurate comparison, both methods were tested on three corpora; for each corpus the clustering task was run ten times and the results were averaged. The results are promising and show that our method produces better average accuracy and F1-measure than the conventional term variance method.
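
For orientation, the conventional term variance baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration only, assuming the commonly used definition in which a term's score is the sum over documents of the squared deviation of its frequency from its mean frequency. It is not the Modified Term Variance measure or the genetic algorithm proposed in the paper, neither of which is specified in this abstract. The function names `term_variance_scores` and `select_top_terms` and the toy documents are hypothetical.

```python
from collections import Counter


def term_variance_scores(docs):
    """Score each term by the sum of squared deviations of its
    per-document frequency from its mean frequency (assumed definition
    of conventional term variance; not the paper's modified measure)."""
    counts = [Counter(doc) for doc in docs]
    vocab = set().union(*counts)
    n = len(docs)
    scores = {}
    for term in vocab:
        freqs = [c[term] for c in counts]  # Counter returns 0 if absent
        mean = sum(freqs) / n
        scores[term] = sum((f - mean) ** 2 for f in freqs)
    return scores


def select_top_terms(docs, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = term_variance_scores(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Toy tokenized documents (illustrative only, not from Reuters-21578).
docs = [
    ["oil", "oil", "price"],
    ["oil", "bank"],
    ["bank", "bank", "bank"],
]
top_terms = select_top_terms(docs, 2)
```

Note that this baseline scores each term independently; the approach described in the abstract differs in that it evaluates terms in groups.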

Citation

Shamsinejadbabki, P., & Saraee, M. (2012). A new unsupervised feature selection method for text clustering based on genetic algorithms. Journal of Intelligent Information Systems, 38(3), 669-684. https://doi.org/10.1007/s10844-011-0172-5

Journal Article Type	Article
Acceptance Date	Jul 11, 2011
Online Publication Date	Jul 28, 2011
Publication Date	Jun 1, 2012
Deposit Date	Oct 19, 2011
Publicly Available Date	Apr 5, 2016
Journal	Journal of Intelligent Information Systems
Print ISSN	0925-9902
Electronic ISSN	1573-7675
Publisher	Springer Verlag
Peer Reviewed	Peer Reviewed
Volume	38
Issue	3
Pages	669-684
DOI	https://doi.org/10.1007/s10844-011-0172-5
Publisher URL	http://dx.doi.org/10.1007/s10844-011-0172-5
Related Public URLs	http://www.springerlink.com/content/0925-9902/#AboutSection