System Properties Based Malware Detection using Machine Learning

Abstract

Problem: The Financial Technology (FinTech) sector is one of the sectors most targeted by malicious software (malware) attacks. The COVID-19 pandemic has exacerbated this issue, with the United Nations noting a 600% increase in cybercrime. Traditional malware detection methods have become obsolete due to the rapid development of polymorphic and metamorphic malware, which automatically changes its form and produces several signatures for the same malware.

Objectives: Against this background, the objectives of this research were to detect malware based on system properties using machine learning algorithms, namely Extreme Gradient Boosting (XGBoost), Multi-layer Perceptron (MLP), Decision Tree and Logistic Regression; to ascertain the best classifier; to tune the model hyperparameters; and to determine the most important features for detecting malware based on system properties.

Methodology: The methodology was a modified version of the Cross-Industry Standard Process for Data Mining (CRISP-DM). The data was real-world malware data curated by Microsoft and hosted on Kaggle. It was summarized, explored and analysed using univariate, bivariate and multivariate analysis. The data was then cleaned by removing features that had a sufficiently high number of missing values or imbalanced classes. Missing values below the threshold were replaced, and 9 new features were created. A random 10% sample of the entire data was selected, encoded using label and frequency encoding, and modelled. A 70:30 train-test split was used alongside grid search with 5-fold cross-validation to find the optimal hyperparameters.

Achievements: The results showed that machine learning algorithms were effective in detecting malware based on system properties. After hyperparameter tuning, XGBoost performed best based on AUC-ROC value, followed by MLP, the decision tree and, last, logistic regression. In addition, the most improved model after tuning, based on the difference in AUC-ROC, was the decision tree, followed by XGBoost, logistic regression and MLP. Furthermore, the most important features were “SmartScreen” using gain, “AvProductsInstalled” using over, and “AvSigVersion” using cover.
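Two of the steps named in the abstract, the categorical encodings and the AUC-ROC metric used to rank the classifiers, can be illustrated with a minimal Python sketch. It is not the thesis code: the function names, the toy column values, the labels and the scores are all hypothetical, and the AUC is computed by direct pairwise comparison, which is only practical for small examples.

```python
from collections import Counter


def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return [mapping[v] for v in values], mapping


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive example is scored
    above a random negative one, counting ties as one half.
    Pairwise O(n^2) comparison: fine for a demo, too slow for real data."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy column; values are illustrative, not drawn from the dataset.
smart_screen = ["RequireAdmin", "Off", "RequireAdmin", "Warn", "RequireAdmin", "Off"]
codes, mapping = label_encode(smart_screen)
freqs = frequency_encode(smart_screen)

# Hypothetical labels (1 = malware detected) and classifier scores.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]
auc = auc_roc(y_true, y_score)

print(codes)
print(freqs)
print(round(auc, 3))
```

Frequency encoding keeps a high-cardinality column as a single numeric feature, whereas label encoding assigns arbitrary integers whose order carries no meaning. Which encoding was applied to which feature is not stated in the abstract.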

Type
Publication
An unpublished MSc thesis submitted to the Division of Computing Science and Mathematics, University of Stirling, United Kingdom, for the degree of Master of Science in Financial Technology (FinTech)
Faithful Chiagoziem OWNUEGBUCHE
PhD Candidate in Machine Learning and Blockchain Technology

My research focuses on the intersection of machine learning and blockchain technology, particularly their applications in the fields of cybersecurity and finance.