Datasets Size: Effect on Clustering Results

Bibliographic Details
Main Authors: Raheem, Ajiboye Adeleke, Ruzaini, Abdullah Arshah, Hongwu, Qin
Format: Conference or Workshop Item
Language: English
Published: 2013
Subjects:
Online Access: http://umpir.ump.edu.my/id/eprint/5007/
http://umpir.ump.edu.my/id/eprint/5007/1/22-UMP.pdf
Description
Summary: Recent advances in the way we capture and store data pose a serious challenge for data analysis. This has led to wider acceptance of data mining, an interdisciplinary field that applies algorithms to stored data with a view to discovering hidden knowledge. Most people who keep records, however, have yet to reap the benefits of this tool, owing to the general notion that a large dataset is required to guarantee reliable results. This may not hold in all cases. In this paper, we propose a research technique that applies descriptive algorithms to numeric datasets of varied sizes. We modeled each subset of our data using the EM clustering algorithm; two different numbers of partitions (k) were estimated and used for each experiment. The clustering results were validated using an external evaluation measure to determine their level of correctness. The approach reveals the implication of dataset size for the clusters formed and the impact of the estimated number of partitions.
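
The procedure outlined in the summary (EM clustering on subsets of different sizes, two values of k per experiment, validation against known labels) might look roughly like the sketch below. It is illustrative only and not taken from the paper: the synthetic one-dimensional data, the subset sizes (50, 200, 1000), the values k = 2 and k = 3, the helper names `em_gmm_1d` and `purity`, and the choice of purity as the external measure are all assumptions.

```python
import math
import random


def em_gmm_1d(xs, k, iters=50):
    """Fit a 1-D Gaussian mixture with EM and return a hard label per point."""
    srt = sorted(xs)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [srt[int((j + 0.5) * len(srt) / k)] for j in range(k)]
    mean_all = sum(xs) / len(xs)
    var_all = sum((x - mean_all) ** 2 for x in xs) / len(xs)
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid collapse
            weights[j] = nj / len(xs)
    return [max(range(k), key=lambda j: r[j]) for r in resp]


def purity(labels_true, labels_pred):
    """External measure: share of points in the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        counts = clusters.setdefault(p, {})
        counts[t] = counts.get(t, 0) + 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


# Assumed synthetic data: two well-separated classes, interleaved so that
# every prefix of the list is class-balanced.
rng = random.Random(0)
points = [(rng.gauss(0.0 if i % 2 == 0 else 10.0, 1.0), i % 2) for i in range(1000)]

results = {}
for n in (50, 200, 1000):      # dataset sizes (assumed)
    subset = points[:n]
    xs = [x for x, _ in subset]
    truth = [label for _, label in subset]
    for k in (2, 3):           # two estimated partition counts (assumed)
        results[(n, k)] = purity(truth, em_gmm_1d(xs, k))

for (n, k), score in sorted(results.items()):
    print(f"n={n:5d}  k={k}  purity={score:.3f}")
```

Purity is only one possible external measure; it tends to rise as k grows, so comparing results across different k values with it alone can be misleading.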