Prediction of Malwares in Microsoft Windows Operating Systems

Olukayode Aiyeniko
Aishat Oladayo Jimoh-Mahmud
Temitope Ayanladun Oyelakun
Oluwaseyi Inubiwon Oluwabukola
Stella Kehinde Ogunkan
202-213
Aug 7, 2024
Computer Science

Prediction of Malwares in Microsoft Windows Operating Systems

Olukayode Aiyeniko¹, Aishat Oladayo Jimoh-Mahmud², Temitope Ayanladun Oyelakun³, Oluwaseyi Inubiwon Oluwabukola⁴, Stella Kehinde Ogunkan⁵

¹Department of Computer Science, Lagos State University, Lagos State, Nigeria

²Department of Computer Science, Alhikmah University, Ilorin, Kwara State State, Nigeria

^3,5Department of Information System, Ladoke Akintola University of Technology, Ogbomoso, Oyo State, Nigeria

⁴Department of Computer Science, Kogi State Polytechnic, Kogi State, Nigeria

⁵Department of Computer Science, Ladoke Akintola University of Technology, Ogbomoso, Oyo State, Nigeria

DOI: https://doi.org/10.51584/IJRIAS.2024.907020

Received: 18 June 2024; Accepted: 29 June 2024; Published: 07 August 2024

ABSTRACT

Malware has been identified as one of the predominant cyber threats with the fast growth of the internet. Anti-malware vendors nowadays receive a huge amount of speculated malware files daily. To keep up with the flow of these malware-ridden files, machine-learning techniques are used to abstract similar malwares. This paper presents a prediction model for the prediction of malware attacks based on certain features of the machine and the Light Gradient Boost Algorithm was employed for this purpose. A dataset that came in two splits (the train and test splits) with entries of Windows machines with different specifications was acquired and preprocessed to remove irrelevant features. The train split of the dataset was then used to train the Light Gradient Boost Algorithm to derive a model which was then used for the prediction on the test split. The accuracy of the model was found to be 98% while the precision and recall of the model were also found to be 98%. This study would help Windows users know what kind of specifications of machines are more prone to malware attacks.

Keywords: Malware, Prediction, Microsoft Window Operating System

INTRODUCTION

Electronic products have become a necessity for most people nowadays [1]. Because these electronic products have taken over almost every side of human life and endeavour, the attention of hackers has also been drawn to them. This attention has yielded a considerable spike in the number of malwares that is currently in distribution [2]. The development of current technologies has produced computers one of the most commonly used electronic products [3]. However, there are always several people who want to take advantage of others by attacking others’ computers. To evade property damage as much as possible, precise and efficient detection is essential. As reported by Kaspersky Labs in 2017, there was an increasing trend in the types of malwares as at least 360,000 new malicious files were detected every day for the whole of 2017.

Nowadays, the generation of technological innovation and the enlargement and extension of data in the database is very fast [4]. All things associated with technology are completely complementing data growth, as well as scientific data, social media, and financial technology [5]. Data mining tools and techniques have been employed extensively by researchers to predict future trends by making businesses more proactive and better knowledge-driven decisions [6]. With the fast advancement in data mining techniques and methods came the advent of machine learning which represents a subset of Artificial Intelligence[7]. Machine learning can also be viewed as a subdivision of the computing field where the main idea is the ability of a digital computer to learn without being trained [7]. A computer program is claimed to learn from experience A of some classes of tasks B and performance measure C, if its performance at tasks within B, as estimated by C, boosts with experience A.

Fundamentally, the idea behind every machine learning task is to train a model based on some algorithm to perform certain tasks like clusterization, classification, regression and so on. An input dataset is used to train the model and the subsequent model that is built is used to make predictions. Implementation and the initial task performed are determining factors of the output of such a model. Machine learning has been applied extensively in this study to make predictions based on the dataset if a particular machine is likely to be infected with malware [8]. Under this, it is incredibly significant to find out which factors are related to the infection of computers. This study focused on the different machines involved and their probability of getting attacked, which supply necessary instructions to protect computers.

LITERATURE REVIEW

2.1 Malware

Malware is referred to as programs that have malicious code which serves as a major threat to any internet user [9]. It has a severe effect on the technological world. It has been classified to be one of the most dangerous threats on the internet [10]. Malware analysis is the process of finding the functionality, origin and potential impact of a given malware sample such as a virus, worm, Trojan horse, rootkit, or backdoor [11]. Malware or malicious software is any software that is intended to harm the user’s operating system or steal sensitive data from users, organizations, or companies. Malware is immensely powerful to the extent that it can take full control of infected host and network connection deactivating all the firewalls and installed antiviruses. The problem is increasing with internet use as most web pages have been infected with diverse types of malware downloads which can be gotten by the opening of web pages [12]. According to statistics made by Google, 70% of malware is gotten from popular sites.

2.2 Related Work

Maniriho et al. [13] conducted a comprehensive overview of the current state of Windows malware detection techniques, research issues, and future directions. The systematic literature review was done by examining the scientific literature on Windows malware detection based on executable files (.EXE file format) published between 2009 and 2022. The study presented new insights into the categorization of malware detection techniques based on datasets, features, machine learning and deep learning algorithms. It identified ten experimental biases that could influence the performance of malware detection techniques. It provides insights on performance evaluation metrics and discuss several research issues that impede the effectiveness of existing techniques. The study also provides recommendations for future research directions and is a valuable resource for researchers and practitioners working in the field of Windows malware detection.

Avci et al. [14] evaluated and benchmarked the performance of LSTM-based malware detection approaches on specific LSTM architectures to provide insight into malware detection. The method builds LSTM-based malware prediction models and performs experiments using different LSTM architectures including Vanilla LSTM, stacked LSTM, bi-directional LSTM, and CNN-LSTM. The study evaluated the performance of each of these architectures and different configurations. The study, as a contribution, shows that Bidirectional LSTM with hyperparameter optimization is found to be overperforming other selected LSTM architectures and alao tthat different LSTM approaches and architectures apply to the malware detection problem. Quality attributes such as efficiency and accuracy, and the software system architecture adopted for the implementation impact the selection of the LSTM approach.

A malware identification and classification scheme based on bicubic interpolation to improve the security of a plant protection information terminal system Chen et al. [16]. The study used a bicubic interpolation algorithm to restructure the generated malware images to reduce the problem of image size imbalance. The cycle-GAN model was used for data augmentation to balance the number of samples among malware families and build an efficient malware classification model based on CNNs to improve the malware identification and classification performance of the system. The accuracy of RGB and grey images generated by the Microsoft Malware Classification Challenge Dataset (BIG2015) can reach 99.76% and 99.62%, respectively.

Performance evaluation of machine learning algorithms for malware detection [9]. Classifiers such as kNN, DT, RF, AdaBoost, SGD, extra trees and the Gaussian NB classifier were used to synthesize the study. After reviewing the test and experimental data for all five classifiers, it was discovered that the RF, SGD, extra trees and Gaussian NB Classifier all achieved 100% accuracy in the test, as well as a perfect precision (1.00), a good recall (1.00), and a good f1-score (1.00). Therefore, it is reasonable to assume that the proof-of-concept employing autonomous behaviour-based malware analysis and machine learning methodologies might identify malware effectively and rapidly.

Developed ransomware classification model in Microsoft Windows using hardware profile [16]. shows that obtaining a hardware execution profile is beneficial to identify the obfuscated ransomware too. The features obtained from hardware performance counters were evaluated to classify malicious applications into ransomware and non-ransomware categories using several machine learning algorithms such as Random Forest, Decision Tree, Gradient Boosting, and Extreme Gradient Boosting. The employed data set comprises 80 ransomware and 80 non-ransomware applications, which are collected using the VirusShare platform. The results revealed that extracted hardware features play a substantial part in the identification and detection of ransomware with an F-measure score of 0.97 achieved by Random Forest and Extreme Gradient Boosting.

METHODOLOGY

3.1 Approach to Prediction of Malware Attacks in Windows Operating Systems

Various steps were involved in the development of a predictive model for malware in the Windows Operating System. The dataset that was used in conducting this experiment was acquired from Kaggle and it came in two splits, the training and testing set. The dataset was first visualized to get the percentage of columns with high cardinality, missing values or null values, and irrelevant features. Then cleansing was performed on the dataset to remove columns with high cardinality, missing values or null values and irrelevant features. After data cleansing, the training set was split into two. The first part was used to train the light gradient model explained in the section. The second part was used to obtain the performance metrics. Then the model was used to predict a machine’s likelihood of getting attacked by malware as shown in Figure 1.

Figure 1: Framework of Prediction Model

3.2 Dataset Acquisition and Preprocessing

The dataset is divided into two sets, the training set and the testing set, and the dataset holds many properties of dissimilar computer hardware and software-related information. It accommodates a total of over eight million rows and eighty-three columns. Each row is set aside for a unique machine identifier. The other columns supply different software and hardware specifications of the machine.

Table 1:Features of the Dataset

The dataset holds many features and some of features are shown in Table 1.

S/N	Feature	Description	Used in Training/Testing
1	HasDetections	It tells whether the machine is affected by malware or not	Yes
2	MachineIdentifer	Individual machine ID	Yes
3	TPM (Trusted Protection Module)	Shows whether the machine has TPM capabilities	Yes
4	Platform Identifier	OS properties and microprocessor info	No
5	City	City identifier where the machine is accessed	No
6	Country	Country identifier from where the machine is accessed	No
7	OS Version	Version of the Operating System	Yes
8	Is protected	Shows if the system is protected or not. The field holds if at least one of the installed antiviruses is up to date.	Yes
9	Census TotalPhysicalRAM	Details of physical RAM	No
10	Firewall Details	It shows the firewall information and holds if the firewall is active	Yes
11	isVirtualDevice	Virtual devices hosting on the specific machine	Yes
12	Secure Boot	Shows if secure boot is enabled or not	Yes
13	Primary Disk Type	Shows Information about the primary disks, e.g., HDD, SSD.	Yes

The features contained in the dataset are not limited to Table 1.

3.2.1 Analysis of Data

The dataset has three types of columns contained in it. The Numerical, Binary and Categorical Types as described in Figure 3. The numerical column deals with all entries that have a numerical value. The Binary column deals with all entries in the column that either have a ‘0’ or a ‘1’ as an entry. An entry with ‘0’ means ‘no’ and an entry with ‘1’ means ‘yes’. The categorical column type holds every other kind of entry. The column type distribution in the dataset was gotten using the following Python script and is shown in Figure 2

Figure 2: Code Snippet for Showing Column Distribution

Figure 3: Column Type Distribution

As seen in Figure 3, the graph gives more categorical columns than there are binary and numerical.

3.2.2 Null Values

In the dataset, there are some columns with null values. These were visualized using the code snippet as discussed in Figure 4. The graphical representation is shown in Figure 5.

Figure 4: Null Values Count Snippet

Figure 5: Null Values Count Graph

From Figure 4, it shows that there are seven features with more than 50% missing values.

3.2.3 Column Cardinality

The cardinality of a column refers to the number of times an entry appears in that column. The cardinality of each column is shown using the code snippet as shown in Figure 6. The result is shown in Figure 7

Figure 6: Column Cardinality Snippet

Figure 7: Column Cardinality Graph

Some of the features have too many categorical values (high cardinality), which can be a problem for some models. As seen from the graph above, it can be deduced that the column with the highest cardinality is Census_System Volume Capacity, followed by Census_OEM Model Identifier and so on.

RESULTS AND DISCUSSION

4.1 Result for Implementation of the Model on the Test Split

The test set was loaded into Jupiter after which the model was applied to it for the prediction of the likelihood of a computer being infected by malware. The process is illustrated using Figure 8 and the outputs are shown in Figure 9

Figure 8: Prediction on the Test set

Figure 9: Likelihood of each machine identifier being attacked by Malware

4.2 Result of Data Cleansing

The first step conducted in the pre-processing of the dataset involved the removal of the columns with high cardinality. This was achieved using the code snippet as shown in Figure 10. The columns that were dropped are represented in Figure 11.

Figure 10: Drop Columns with High Cardinality Snippet

Figure 11: Columns with High Cardinality

Unwanted columns and columns with more than 40% missing values were also removed. This was achieved using the following code snippet in Figure 12.

Figure 12: Processing the Dataset for Light Gradient Boost Model

4.3 Train/validation of Random Split

The training dataset was divided into train and validation for the Light Gradient Boost Model using the code snippet as discussed in Figure 13. The model was ranked and visualized using the code snippet as shown in Figure 14.

Figure 13: Code Snippet for Light Gradient Boost Model

Figure 14: Train/Validation Split Snippet

4.4 Model Training

The code snippet for the training of the Light Gradient Boost model is shown in Figure 15. The importance of each feature to the building is discussed in Figure 16

Figure 15: Training the LGB Model Snippet

Figure 16: Model of Feature

Figure 15: Importance of Feature in the Model

As seen in the graph above, the most important feature that was used by the model is the Engine Version column.

4.6 Performance Evaluation Metrics

The metrics that were used in evaluating the model are shown in Figure 16 and also Figure 17 indicates the confusion matrix metrics.

Figure 16: Evaluation Metrics for the Model

Figure 17: Confusion Matrix of LGBM

The True Positive (TP) of the train split = 0.98

The False Positive (FP) of the train split = 0.02

The False Negative (FN) of the train split = 0.02

The False Negative (TN) of the train split = 0.98

The accuracy of the model on the train split = 0.98.

The precision is calculated thus:

(1)

Therefore, the precision of the model on the train split = 0.98.

The recall is calculated thus:

(2)

Therefore, the recall of the model on the train split = 0.98.

To have a joint effect of the Precision and Recall, the F1 Score is calculated. The F1 score is the harmonic mean of the Precision and Recall.

The F1 Score is calculated thus:

(3)

Therefore, the F1 score = 0.98%.

CONCLUSION

The Light Gradient Boost Algorithm was utilized in experimenting and the outcome was that each machine was given a value standing for its likelihood of being attacked by malicious software. From the result, it can be deduced that the likelihood of a machine getting infected by malware increased as the value approached 1. Based on the result obtained, the light Gradient Boost Algorithm had an accuracy of 98%.

REFERENCES

W. Niu and M. Fan, “Control and Research of Computer Virus by Multimedia Technology,” Int. J. Inf. Syst. Supply Chain Manag., vol. 17, no. 1, pp. 1–17, 2023, doi: 10.4018/IJISSCM.333896.
A. K. S. Vijayanand. C. D, “Impact of Malware in Modern Society Impact of Malware in Modern Society Abstract :,” J. Sci. Res. Dev., vol. 2, no. 3, pp. 593–600, 2019.
Q. Pan, W. Tang, and S. Yao, “The application of LightGBM in microsoft malware detection,” J. Phys. Conf. Ser., vol. 1684, no. 1, pp. 1–7, 2020, doi: 10.1088/1742-6596/1684/1/012041.
S. Chandrakant Narwade and N. S. Ratnaparkhi, “An Overview Paper on Data Mining Techniques and Applications,” Int. J. Classif. Res. Tech. Adv. Multidiscip. Peer-Reviewed Online Open Access J. Impact Factor 5, vol. 455, no. 3, pp. 48–53, 2023, [Online]. Available: www.ijcrta.org
C. Okoro, J. W. Etukudo, and C. J. Obizuo, “Impact of Financial Technology o n Financial Institutions ’ Performance . Evidence From Nigerian Commercial Banks,” J. Account. Financ. Manag., vol. 10, no. 3, pp. 1–25, 2024, doi: 10.56201/jafm.v10.no3.2024.pg111.134.
S. Gheware, A. S. Kejkar, and S. M. Tondare, “Data Mining : Task , Tools , Techniques and Applications,” Int. J. Adv. Res. Comput. Commun. Eng., vol. 3, no. 10, pp. 1–5, 2014, doi: 10.17148/IJARCCE.2014.31003.
I. Sabah, L. L. Nahmatwlla, and W. A. Ahmed, “Role of AI and Machine Learning Based on Principle of Web Technology and Cloud Computinmg,” J. Biomech. Sci. Eng., pp. 117–141, 2023, doi: 10.17605/OSF.IO/4A9MH.
N. Z. Gorment, A. Selamat, L. K. Cheng, and O. Krejcar, “Machine Learning Algorithm for Malware Detection: Taxonomy, Current Challenges, and Future Directions,” IEEE Access, vol. 11, pp. 141045–141089, 2023, doi: 10.1109/ACCESS.2023.3256979.
Muhammad Shoaib Akhtar and Tao Feng, “Malware Analysis and Detection Using Machine Learning Algorithms,” Symmetry (Basel)., vol. 14, pp. 1–11, 2022.
S. Franjić, “Cybercrime is very dangerous form of criminal behavior and cybersecurity,” Emerg. Sci. J., vol. 4, no. Special Issue, pp. 18–26, 2020, doi: 10.28991/esj-2020-SP1-02.
S. Talukder and Z. Talukder, “A Survey on Malware Detection and Analysis Tools,” Int. J. Netw. Secur. Its Appl., vol. 12, no. 2, pp. 37–57, 2020, doi: 10.5121/ijnsa.2020.12203.
C. Eze, A. O & Chukwunonso, “Malware Analysis and Mitigation in Information Preservation,” J. Comput. Eng., vol. 20, no. 4, pp. 53–62, 2018, doi: 10.9790/0661-2004015362.
P. Maniriho, A. N. Mahmood, and M. J. M. Chowdhury, “A systematic literature review on Windows malware detection: Techniques, research issues, and future directions,” J. Syst. Softw., vol. 209, p. 111921, 2024, doi: 10.1016/j.jss.2023.111921.
C. Avci, B. Tekinerdogan, and C. Catal, “Analyzing the Performance of Long Short-Term Memory Architectures for Malware Detection Models,” Concurr. Comput. Pract. Exp., vol. 35, no. 6, p. 1, 2023, doi: 10.1002/cpe.7581.
Z. Chen, S. Xing, and X. Ren, “Efficient Windows malware identification and classification scheme for plant protection information systems,” Front. Plant Sci., vol. 14, no. February, pp. 1–12, 2023, doi: 10.3389/fpls.2023.1123696.
S. Aurangzeb, R. N. Bin Rais, M. Aleem, M. A. Islam, and M. A. Iqbal, “On the classification of Microsoft-Windows ransomware using hardware profile,” PeerJ Comput. Sci., vol. 7, pp. 1–24, 2021, doi: 10.7717/peerj-cs.361.