Discovering optimal features using static analysis and a genetic search based method for Android malware detection


Bibliographic Details
Main Authors: Ahmad Firdaus Zainal Abidin; Nor Badrul Anuar; Ahmad Karim; Mohd Faizal Ab Razak
Format: Article
Language: English
Published: Springer 2018
Online Access: http://umpir.ump.edu.my/id/eprint/19177/
http://umpir.ump.edu.my/id/eprint/19177/1/Discovering%20optimal%20features%20using%20static.pdf
http://umpir.ump.edu.my/id/eprint/19177/2/Discovering%20optimal%20features%20using%20static1.pdf
Description
Summary: Mobile device manufacturers are rapidly releasing a wide variety of Android versions worldwide. Simultaneously, cyber criminals are executing malicious actions such as tracking user activities, stealing personal data, and committing bank fraud. These criminals profit because many people rely on Android for their daily routines, including important communications. With this in mind, security practitioners have conducted static and dynamic analyses to identify malware. In this study, we used static analysis because of its overall code coverage, low resource consumption, and rapid processing. However, static analysis requires a minimal number of features to classify malware efficiently. Therefore, we used genetic search (GS), a search method based on a genetic algorithm (GA), to select features from among 106 strings. To evaluate the best features determined by GS, we used five machine learning classifiers: Naïve Bayes (NB), Functional Trees (FT), J48, Random Forest (RF), and Multilayer Perceptron (MLP). Among these classifiers, FT gave the highest accuracy (95%) and true positive rate (TPR) (96.7%) using only six features.
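
The summary describes genetic search over 106 candidate features. The idea can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy data, the majority-vote scoring rule (a stand-in for the classifiers named above), the size penalty, and all parameter values are assumptions chosen only to show how a genetic algorithm can evolve a small feature subset.

```python
import random


def make_toy_data(n_samples=200, n_features=106, informative=(3, 17, 42), seed=0):
    """Build synthetic binary feature vectors (hypothetical data, not the paper's).

    Only the columns listed in `informative` carry signal about the label.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        row = [rng.randint(0, 1) for _ in range(n_features)]
        for j in informative:
            # An informative feature agrees with the label 90% of the time.
            row[j] = label if rng.random() < 0.9 else 1 - label
        X.append(row)
        y.append(label)
    return X, y


def fitness(mask, X, y, penalty=0.01):
    """Accuracy of a majority vote over the selected features, minus a size penalty.

    The vote is an assumed stand-in for a real classifier; the penalty
    rewards small subsets.
    """
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0
    correct = 0
    for row, label in zip(X, y):
        votes = sum(row[j] for j in selected)
        pred = 1 if 2 * votes > len(selected) else 0
        correct += pred == label
    return correct / len(y) - penalty * len(selected)


def genetic_search(X, y, n_features, pop_size=30, generations=40, p_mut=0.02, seed=1):
    """Evolve bit masks over features; return (best_mask, best_fitness_per_generation)."""
    rng = random.Random(seed)

    def score(mask):
        return fitness(mask, X, y)

    # Start from sparse random masks (about 5% of features switched on).
    pop = [[1 if rng.random() < 0.05 else 0 for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=score)
    history = [score(best)]
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(pop, 3), key=score)
            b = max(rng.sample(pop, 3), key=score)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop, key=score)
        history.append(score(best))
    return best, history


if __name__ == "__main__":
    X, y = make_toy_data()
    mask, history = genetic_search(X, y, n_features=106)
    print("selected feature indices:", [j for j, bit in enumerate(mask) if bit])
    print("best fitness:", history[-1])
```

Because the best mask is copied unchanged into each new generation, the best fitness never decreases between generations. In a real study the fitness function would instead train and validate an actual classifier on the selected features.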