Research Repository

Semantic based indexing technique for optimisation and intelligent document representation : application to structured and unstructured document clustering

Barresi, S

Authors

S Barresi



Contributors

S Nefti-Meziani S.Nefti-Meziani@salford.ac.uk
Supervisor

Abstract

The advances in data collection and the increasing amount of unstructured and unlabeled
text documents have led to the need for better disambiguation and indexing techniques,
which allow for the effective and intelligent organisation of large amounts of documents
into a small number of significant clusters, facilitating the analysis, browsing, and
searching of document collections. Traditionally, document clustering systems have
relied on bag-of-words and term frequency approaches to represent and subsequently
classify documents, taking into account only document syntax, with no
consideration for semantic aspects. To address this issue, more complex indexing and
clustering techniques, which consider the semantic associations between the words
contained in a document and differentiate the degree of semantic importance of terms
during the classification process, need to be further investigated in order to enable
appropriate and automatic contextualisation of text documents and information.

This research proposes a new indexing technique, which can be used to effectively
represent, and subsequently cluster, collections of unstructured or structured documents.
The presented technique aims at overcoming some of the major problems related to the
bag-of-words approach, such as its lack of consideration for synonyms and its usual
failure to differentiate the degree of semantic importance of terms. The main idea
behind the proposed technique is to map each document into a lower-dimensional space
by considering the semantic associations between the words contained in the document.
To address the semantic problems posed by traditional indexing, the investigated method
focuses on word sense disambiguation and document concepts. The proposed technique
extracts concepts from documents and uses a set of these concepts as indexing units,
achieving vector dimensionality reduction as well as more cohesive and separated
clusters. Good results are also achieved in terms of purity and entropy, and in
comparison with similar studies in the field of semantic-based concept indexing.
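
The contrast between term-based and concept-based indexing described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the lexicon CONCEPT_OF, the helper names, and the two sample documents are hypothetical, and the hand-written word-to-concept table stands in for the word sense disambiguation step that a real system would need.

```python
from collections import Counter
import math

# Hypothetical word-to-concept lexicon, written by hand for illustration only.
# A real system would derive these mappings through word sense disambiguation.
CONCEPT_OF = {
    "car": "vehicle",
    "automobile": "vehicle",
    "doctor": "physician",
    "physician": "physician",
}

def bag_of_words(text):
    """Index a document by raw term counts (lowercased, whitespace-tokenised)."""
    return Counter(text.lower().split())

def concept_index(text):
    """Index a document by concept counts; words with no mapping are dropped."""
    return Counter(CONCEPT_OF[w] for w in text.lower().split() if w in CONCEPT_OF)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two assumed sample documents that express the same idea with different words.
doc1 = "doctor drives car"
doc2 = "physician buys automobile"

term_similarity = cosine(bag_of_words(doc1), bag_of_words(doc2))
concept_similarity = cosine(concept_index(doc1), concept_index(doc2))

# Number of distinct indexing units needed to represent both documents.
term_dimensions = len(set(bag_of_words(doc1)) | set(bag_of_words(doc2)))
concept_dimensions = len(set(concept_index(doc1)) | set(concept_index(doc2)))
```

In this toy case the two documents share no terms, so their bag-of-words similarity is zero, while their concept vectors coincide. The number of indexing units also drops from six terms to two concepts, which is the kind of dimensionality reduction the abstract refers to.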

Citation

Barresi, S. Semantic based indexing technique for optimisation and intelligent document representation : application to structured and unstructured document clustering. (Thesis). Salford : University of Salford

Thesis Type: Thesis
Deposit Date: Oct 3, 2012
Award Date: Jan 1, 2010