Application of clustering in managing unstructured textual data in relational database / Wael Mohamed Shaher Yafooz



Bibliographic Details
Main Author: Yafooz, Wael Mohamed Shaher
Format: Thesis
Language: English
Published: 2014
Online Access: http://ir.uitm.edu.my/id/eprint/28040/
http://ir.uitm.edu.my/id/eprint/28040/1/TP_WAEL%20MOHAMED%20SHAHER%20YAFOOZ%20CS%2014_5.pdf
Description
Summary: Heavy reliance on computers in everyday life has led to a continuous increase in applications that hold large amounts of textual data. The data are kept in secondary storage for future use. A relational database (RDB) is therefore commonly used as the backbone of application software to organise such data into structured form. The RDB has robust and powerful structures for managing, organising, and retrieving data. However, the database can still contain large amounts of unstructured textual data. Dealing with unstructured textual data raises three basic issues: users have difficulty finding useful information, information retrieval is inaccurate, and query processing performance is insufficient. Attempts have been made to resolve these issues using several methods, such as full-text searching, text indexing, database schema management, database data models, and query-based techniques. However, a front-end approach, in the form of software applications, is still needed to organise the unstructured textual information in the RDB. This study proposes a Textual Virtual Schema Model (TVSM) as a back-end approach to reorganising textual data inside relational databases while performing automatic semantic linking and cluster assignment. When new unstructured textual data are stored in a database, all words are extracted to uncover the underlying meaning of the data. The named entities and most frequent terms are selected as the factors used in cluster assignment. The model is tested and evaluated by embedding it as a component-based package in the internal structure of a relational database. Three experiments were conducted on the Reuters textual corpus and the Classic and WAP datasets.
The clustering results were validated using the F-measure, Entropy and Purity measures and compared with two common families of methods, information extraction and textual document clustering, for example K-means, Frequent Item-Set, Hierarchical Clustering Algorithms and Oracle Text. The results show linkages between structured textual data and unstructured information, improved quality in textual document clustering with accurate clusters, and high query processing performance. Thus, the proposed technique can increase retrieval performance and produce highly accurate clusters of textual data. The model offers a useful approach for various domains that involve big textual data, such as document clustering, topic detection and tracking, information integration, personal data management and information retrieval.
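
For readers unfamiliar with the validation measures named in the summary, the following is a minimal Python sketch of how Purity and Entropy are commonly computed for a clustering against known class labels. It is not taken from the thesis: the function names, the sample data, and the choice of unnormalised entropy in bits are assumptions made for illustration, and the F-measure is omitted for brevity.

```python
from collections import Counter
from math import log2


def _group_labels(clusters, labels):
    """Collect the true class labels of the documents in each cluster."""
    grouped = {}
    for cluster_id, label in zip(clusters, labels):
        grouped.setdefault(cluster_id, []).append(label)
    return grouped


def purity(clusters, labels):
    """Fraction of documents that fall in the majority class of their cluster.

    Higher is better; 1.0 means every cluster holds a single class.
    """
    grouped = _group_labels(clusters, labels)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in grouped.values()
    )
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of per-cluster class entropy, in bits.

    Lower is better. This is the unnormalised form (an assumption);
    some papers divide by the log of the number of classes.
    """
    grouped = _group_labels(clusters, labels)
    total = len(labels)
    result = 0.0
    for members in grouped.values():
        size = len(members)
        cluster_entropy = -sum(
            (count / size) * log2(count / size)
            for count in Counter(members).values()
        )
        result += (size / total) * cluster_entropy
    return result


# Hypothetical example: six documents, two clusters, two true classes.
example_clusters = [0, 0, 0, 1, 1, 1]
example_labels = ["a", "a", "b", "b", "b", "b"]
example_purity = purity(example_clusters, example_labels)
example_entropy = entropy(example_clusters, example_labels)
```

In this hypothetical example, five of the six documents sit in the majority class of their cluster, so purity is 5/6, and only the mixed first cluster contributes to the entropy.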