INTERNATIONAL JOURNAL OF RESEARCH AND SCIENTIFIC INNOVATION (IJRSI)
ISSN No. 2321-2705 | DOI: 10.51244/IJRSI |Volume XII Issue VIII August 2025
Page 535
www.rsisinternational.org
Comparative Analysis of Some Machine Learning Algorithms for the
Classification of Ransomware
Adeniyi, Adedayo Omoniyi¹, Olabiyisi, Stephen Olatunde², Adepoju, Temilola Morufat³ and Sanusi, Bashir Adewale⁴
¹,²,³ Department of Computer Science and Engineering, Ladoke Akintola University of Technology,
Ogbomoso, Oyo State, Nigeria
⁴ University of the West of England, Bristol, United Kingdom
DOI: https://doi.org/10.51244/IJRSI.2025.120800045
Received: 24 July 2025; Accepted: 30 July 2025; Published: 02 September 2025
ABSTRACT
Ransomware is a serious cybersecurity threat, encrypting data and demanding payment for its release. This
study compares six machine learning algorithms for ransomware classification: Random Forest (RF), Decision
Tree (DT), Neural Network (NN), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naive
Bayes (NB). A GitHub-sourced dataset was preprocessed using standard techniques, and feature
selection was done using correlation analysis, mutual information, and recursive feature elimination. Models
were trained and evaluated using Python’s scikit-learn library, assessed on accuracy, precision, recall, F1-
score, and ROC-AUC. RF achieved the best performance with 99.98% accuracy and 99.99% ROC-AUC,
followed closely by DT and NN. NB performed poorly across most metrics. Results indicate RF as the most
effective model for ransomware detection. These findings support the development of intelligent threat
detection systems for cybersecurity platforms, cloud infrastructure, and endpoint protection.
Keywords: Comparative Performance, Ransomware, Machine Learning (ML), Random Forest (RF), Support
Vector Machine (SVM), Decision Tree (DT), Feature Selection and Python scikit-learn.
INTRODUCTION
Ransomware attacks have become a significant and escalating threat to information security, targeting systems,
data centers, and applications across various sectors. These attacks encrypt critical data and demand ransom
payments for its release, often leading to severe operational disruptions, substantial financial losses, and long-
term reputational damage to affected organizations (Scaife et al., 2016). Traditional ransomware detection
methods, such as signature-based approaches, are increasingly inadequate due to the rapid evolution and
obfuscation tactics used by modern ransomware variants (Ucci et al., 2019). This growing sophistication
necessitates the development of more dynamic and intelligent detection mechanisms.
This research aims to address existing limitations in ransomware detection by implementing and comparing
six machine learning algorithms (DT, RF, SVM, KNN, NN, and NB) using a consistent dataset and
comprehensive evaluation metrics. A robust feature selection methodology was employed to improve model
performance and reduce dimensionality. The goal is to evaluate the classification capabilities of each algorithm
using metrics such as Accuracy, Precision, Recall, F1-score, and ROC-AUC, and to provide practical
implementation insights for real-world systems.
By incorporating this range of algorithms, the study ensures a comprehensive evaluation of different learning
strategies and their effectiveness in the classification of ransomware. The performance of each algorithm is
assessed using a unified evaluation framework that includes accuracy, precision, recall, F1-score, and ROC-
AUC metrics. This approach supports an informed comparison and contributes to identifying the most suitable
techniques for developing effective ransomware detection systems.
This study therefore investigates and compares the performance of selected machine learning techniques for
the classification of ransomware.
The specific objectives are to:
i. Obtain a labeled ransomware dataset (ransomware.csv) from a GitHub repository to support the
supervised learning tasks.
ii. Implement and train machine learning models (RF, DT, NN, SVM, KNN, and NB) to classify the
ransomware data using Python scikit-learn 1.6.1.
iii. Compare the models' performance using Accuracy, Precision, Recall, F1-score, and ROC-AUC.
Related Work
Al-Ruwili et al. (2023) explore the impact of machine learning techniques on Android ransomware detection,
emphasizing the need for robust classification methods in mobile environments. The authors utilize a
combination of static and dynamic features to train their models, including Support Vector Machines and
Decision Trees. Their findings reveal that machine learning models can effectively identify ransomware with
high accuracy, demonstrating the potential for real-time detection in mobile applications.
Egunjobi et al. (2019) conducted a comprehensive study on the classification of ransomware using machine
learning algorithms, focusing on the integration of both static and dynamic features to enhance detection
accuracy. The authors employed a variety of supervised learning algorithms, including Naïve Bayes, Support
Vector Machines, and Random Forest, to evaluate their performance in classifying ransomware samples. The
results indicate that SVM and Random Forest achieved the highest accuracy rates of 99.5%, while Naïve
Bayes demonstrated an accuracy of 96%. The study highlights the importance of feature selection and the use
of confusion matrices to systematically compare the effectiveness of different algorithms, providing valuable
insights for future research in ransomware detection.
Khammas et al. (2022) conduct a comparative analysis of different machine learning algorithms for
ransomware detection, focusing on the effectiveness of feature selection techniques. The study evaluates
algorithms such as K-Nearest Neighbors (KNN), Decision Trees, and Support Vector Machines, providing a
detailed examination of their performance metrics. The authors find that incorporating advanced feature
selection methods significantly improves the accuracy of ransomware detection models.
Masum et al. (2022) investigate the classification and detection of ransomware using various machine learning
algorithms, including Random Forest, Naïve Bayes, and Neural Networks. The authors emphasize the
significance of feature selection and preprocessing in improving model performance. Their results indicate that
Random Forest outperforms other algorithms, achieving the highest accuracy in detecting ransomware.
Ngirande et al. (2024) present a novel approach to ransomware detection using hybrid machine learning
techniques, combining the strengths of various algorithms to enhance classification accuracy. The authors
utilize a dataset comprising both benign and malicious samples to train their models, achieving impressive
results in detecting ransomware variants. Their findings suggest that hybrid models can outperform traditional
single-algorithm approaches, providing a promising direction for future research in ransomware detection.
METHODOLOGY
To evaluate the selected machine learning models for ransomware classification, the following steps were
carried out:
Data Collection
The dataset used in this research was collected from a publicly available GitHub repository containing labeled
samples of ransomware and benign files. The dataset, named Ransomware.csv, contains various features that
are instrumental in distinguishing between benign and ransomware files. The dataset includes relevant features
such as file metadata, static attributes, and behavioral indicators. This open-source data provides access to
recent ransomware patterns.
Dataset Description
The Ransomware.csv file includes a wide range of features extracted from different types of files. These
features are crucial for building machine learning models to classify files as either benign or ransomware. The
dataset is structured with the following key attributes:
i. File Metadata: Information about the file such as size, type, and creation date.
ii. Behavioral Features: Characteristics that describe the behavior of the file, such as the number of
read/write operations, network activity, and system modifications.
iii. Static Features: Attributes derived from the static analysis of the file, including byte sequences,
strings, and header information.
iv. Dynamic Features: Data obtained from the dynamic analysis of the file, such as API calls, registry
changes, and process activities.
v. Label: The target variable indicating whether the file is benign or ransomware.
Data Pre-processing
The raw dataset was cleaned to handle missing values, encode categorical attributes, and standardize numeric
features. These pre-processing steps are necessary to ensure data consistency and to make it suitable for
training machine learning models. Proper normalization also improves algorithm performance and
convergence. The preprocessing steps include:
i. Handling Missing Values: Any missing values in the dataset were identified and appropriately
handled, either by imputing with mean/median values or by removing the rows/columns with missing
data.
ii. Encoding Categorical Variables: Categorical variables were encoded into numerical values using
techniques such as one-hot encoding or label encoding.
iii. Feature Standardization: The features were standardized to have a mean of zero and a standard
deviation of one. This step is essential for algorithms that are sensitive to the scale of the data.
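The three pre-processing steps above can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' pipeline: the toy data frame and its column names (file_size, file_type) are hypothetical stand-ins for Ransomware.csv, and median imputation with one-hot encoding is only one of the options the text mentions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame standing in for Ransomware.csv (column names are assumptions).
df = pd.DataFrame({
    "file_size": [1200.0, np.nan, 560.0, 9800.0],  # one missing numeric value
    "file_type": ["exe", "dll", "dll", "exe"],      # categorical attribute
})

# Step i and iii for numeric columns: impute with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Step ii for categorical columns: one-hot encoding.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["file_size"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["file_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocess.fit_transform(df)
```

After fitting, the first column of X holds the standardized file size (mean zero, unit standard deviation) and the remaining columns hold the one-hot indicators.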
Feature Selection
A combination of methods, including correlation analysis, mutual information, and recursive feature
elimination, was used to identify the most relevant features. Redundant and less informative features were
removed to enhance accuracy and reduce overfitting. The selected feature set became the input for model
training. The techniques applied were:
i. Correlation Analysis: Analyzing the correlation between features and the target variable to identify
highly correlated features.
ii. Mutual Information: Using mutual information to measure the dependency between features and the
target variable.
iii. Recursive Feature Elimination (RFE): Iteratively removing the least important features based on
model performance.
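The three techniques listed above could be applied with scikit-learn roughly as shown below. The paper does not state how the three scores were combined, which estimator drove RFE, or how many features were kept, so the synthetic data, the decision-tree estimator, and the choice of four features are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ransomware feature matrix (assumption).
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=2, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
y = pd.Series(y_arr, name="label")

# i. Correlation analysis: absolute Pearson correlation of each feature with the label.
correlation = X.corrwith(y).abs()

# ii. Mutual information between each feature and the label.
mutual_info = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# iii. Recursive feature elimination: repeatedly drop the least important feature.
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=4).fit(X, y)
selected = list(X.columns[rfe.support_])
```

Here `correlation` and `mutual_info` give a per-feature ranking, while `selected` lists the features that survive elimination.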
Model Initialization
A diverse set of machine learning algorithms was selected to classify the samples in the dataset as benign or
ransomware. The models implemented include:
i. Decision Tree: The DT algorithm splits the dataset into branches based on feature thresholds, creating a
hierarchical structure of decisions that classifies each data point as either ransomware or benign. In this
study the model is initialized with the Gini index as the splitting criterion, defined in Equation (3.1);
entropy (information gain) is the common alternative. At each node of the tree, the split that minimizes
impurity is selected. DTs are known for their interpretability and fast execution time.
$$ Gini(D) = 1 - \sum_{i=1}^{c} p_i^2 \tag{3.1} $$
where $p_i$ is the proportion of instances belonging to class $i$ in set $D$, and $c$ is the number of classes
(Masum et al., 2022).
ii. Random Forest: RF is an ensemble method that constructs multiple decision trees, each trained on a
random subset of the data and features, and aggregates their predictions as expressed in Equation (3.2)
for improved accuracy and generalization. Model initialization involves setting the number of trees
(estimators), the maximum tree depth, and bootstrap sampling. RF reduces overfitting compared to a
single decision tree.
$$ \hat{y} = \frac{1}{N} \sum_{i=1}^{N} h_i(x) \tag{3.2} $$
where:
$\hat{y}$ is the predicted value,
$N$ is the number of trees,
$h_i(x)$ is the prediction of the $i^{th}$ tree for input $x$ (Breiman, 2001).
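The averaging over trees can be observed directly in scikit-learn, whose RandomForestClassifier averages the class probabilities predicted by its individual trees. The sketch below uses synthetic data and illustrative hyperparameters (25 trees, depth 5); these values are assumptions, not settings reported in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (assumption, stands in for the ransomware dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,  # number of trees
    max_depth=5,      # maximum tree depth
    bootstrap=True,   # each tree sees a bootstrap sample
    random_state=0,
).fit(X, y)

# Average the per-tree class probabilities by hand.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should match the manual average.
matches_forest = np.allclose(averaged, forest.predict_proba(X))
```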
iii. Support Vector Machine: SVM is a classifier that identifies the optimal hyperplane separating the two
classes by maximizing the margin between support vectors, following the formulation in Equation (3.3).
It can be adapted to non-linear data using kernel functions such as the radial basis function (RBF). The
model is initialized with the kernel type, the regularization constant (C), and gamma for kernel scaling;
these hyperparameters are adjusted to optimize separation between the two classes.
$$ \min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1 \ \text{ for all } i \tag{3.3} $$
where:
$w$ is the weight vector,
$b$ is the bias,
$y_i$ is the class label for instance $x_i$ (Ngirande, 2024; Kok et al., 2019).
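To make the margin idea concrete, the sketch below fits a linear SVM to six hand-picked, linearly separable points and inspects the learned weight vector, bias, and margins. The data points and the large value of C (used to approximate a hard margin) are assumptions chosen for illustration; the study itself refers to an RBF kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data with labels -1 and +1.
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin formulation.
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]        # weight vector
b = clf.intercept_[0]   # bias

# Functional margins y_i * (w . x_i + b); the constraint asks for values >= 1.
margins = y * clf.decision_function(X)
```

For this toy set the separating line is approximately 0.4·x1 + 0.4·x2 − 1.4 = 0, and the points closest to it have a functional margin of about 1, up to solver tolerance.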
iv. K-Nearest Neighbors: KNN is a distance-based algorithm that classifies a data point by majority vote
among its k nearest neighbors in the feature space. The distance metric guiding this process is given in
Equation (3.4). The model is initialized by selecting an appropriate value of k and a distance metric,
typically Euclidean. KNN is non-parametric and particularly sensitive to feature scaling and noise.
$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{3.4} $$
where $x$ and $y$ are two data points, and $n$ is the number of features (Bawazeer et al., 2021; Abualhaj,
2024). The class of the majority of the $k$ nearest neighbors is assigned to the query instance.
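A minimal from-scratch sketch of the Euclidean distance and the majority vote follows. The training points, labels, and helper names are hypothetical and chosen only to make the vote easy to follow by hand.

```python
from collections import Counter

import numpy as np

def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def knn_predict(X_train, y_train, query, k=3):
    """Assign the majority label among the k training points nearest to `query`."""
    distances = [euclidean(row, query) for row in X_train]
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy training set.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
y_train = ["benign", "benign", "benign", "ransomware", "ransomware"]

near_origin = knn_predict(X_train, y_train, [0.5, 0.5], k=3)
near_cluster = knn_predict(X_train, y_train, [5.5, 5.0], k=3)
```

The second query has two ransomware neighbors and one benign neighbor among its three nearest points, so the vote returns ransomware.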
v. Neural Network: NNs simulate the behavior of the human brain through layers of interconnected
neurons. In this study, a feedforward neural network is implemented with an input layer, one or more
hidden layers, and an output layer. Each neuron computes the output given in Equation (3.5), and the
weights are adjusted during training through backpropagation. The activation function used at each
layer influences how signals are passed forward through the network. The model is initialized with
parameters such as activation function (e.g., ReLU), learning rate, optimizer (e.g., Adam), and number
of epochs.
$$ y = f\left( \sum_{i=1}^{n} w_i x_i + b \right) \tag{3.5} $$
where:
$y$ is the output,
$w_i$ are the weights,
$x_i$ are the inputs,
$b$ is the bias,
$f$ is the activation function (e.g., sigmoid, ReLU)
Neural Networks are particularly powerful for complex tasks such as image and speech recognition, and they
have been successfully applied in malware detection to identify patterns in data that may indicate malicious
behavior (Fuyong et al., 2018; Asad et al., 2020).
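The output of a single neuron, a weighted sum of inputs plus a bias passed through an activation function, can be sketched as follows. The input, weight, and bias values are hypothetical and picked so the result is easy to verify by hand.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, activation):
    """Compute activation(sum_i w_i * x_i + b) for one neuron."""
    return float(activation(np.dot(w, x) + b))

x = np.array([1.0, 2.0])  # hypothetical inputs

# Weighted sum is 0.5*1 + (-1.0)*2 + 0.5 = -1.0, which ReLU clips to 0.
relu_out = neuron_output(x, np.array([0.5, -1.0]), 0.5, relu)

# Weighted sum is 2*1 + (-1)*2 + 0 = 0, and sigmoid(0) = 0.5.
sigmoid_out = neuron_output(x, np.array([2.0, -1.0]), 0.0, sigmoid)
```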
vi. Naive Bayes: NB is a probabilistic classifier based on Bayes' Theorem, defined in Equation (3.6), with
an assumption of feature independence. It is particularly efficient for high-dimensional data. The
model calculates posterior probabilities to assign a class label to each instance. The Gaussian Naïve
Bayes variant is used when the data is continuous and assumed to follow a normal distribution. The
model is initialized without extensive hyperparameter tuning, making it lightweight and fast.
$$ P(C \mid X) = \frac{P(X \mid C) \, P(C)}{P(X)} \tag{3.6} $$
where:
$P(C \mid X)$ is the posterior probability of class $C$ given the features $X$
$P(X \mid C)$ is the likelihood of the features given class $C$
$P(C)$ is the prior probability of class $C$
$P(X)$ is the total probability of the features (Bold et al., 2022; Masum et al., 2022).
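A short numeric illustration of Bayes' Theorem for a two-class problem is given below. The prior and likelihood values are invented for the example and are not taken from the dataset.

```python
def posterior(prior, likelihood):
    """Apply Bayes' Theorem: P(C|X) = P(X|C) * P(C) / P(X) for each class C.

    `prior` maps class -> P(C); `likelihood` maps class -> P(X|C).
    """
    evidence = sum(prior[c] * likelihood[c] for c in prior)  # P(X)
    return {c: prior[c] * likelihood[c] / evidence for c in prior}

# Hypothetical values: 30% of files are ransomware, and the observed feature
# pattern is six times more likely under ransomware than under benign.
prior = {"benign": 0.7, "ransomware": 0.3}
likelihood = {"benign": 0.1, "ransomware": 0.6}

post = posterior(prior, likelihood)
predicted = max(post, key=post.get)
```

With these numbers P(X) = 0.7·0.1 + 0.3·0.6 = 0.25, so the posterior is 0.28 for benign and 0.72 for ransomware, and the instance is labeled ransomware.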
Model Training and Classification
The models were trained using the preprocessed dataset and evaluated using various metrics to assess their
performance. The steps involved are:
i. Train-Test Split: The dataset was split into training and testing sets to evaluate the models'
performance on unseen data.
ii. Model Training: Each model was trained using the training set.
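The split-and-train procedure above could look like the following sketch in scikit-learn. The split ratio, the hyperparameter values, the scaling wrappers, and the synthetic data are all assumptions; the paper does not report these settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=42)

# i. Train-test split; the 80/20 stratified split is an assumed setting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The six classifiers, with illustrative (assumed) hyperparameters.
models = {
    "DT": DecisionTreeClassifier(criterion="gini", random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "NN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                      solver="adam", max_iter=500, random_state=42),
    ),
    "NB": GaussianNB(),
}

# ii. Train each model on the training set and score it on the held-out test set.
test_accuracy = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
```

Wrapping the scale-sensitive models (SVM, KNN, NN) in a pipeline keeps the scaler fitted on the training data only, which avoids leaking test-set statistics into training.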
Performance Evaluation
The performance of the classification models is evaluated using several metrics, all implemented using Scikit-
learn. These metrics include Accuracy, Precision, Recall, F1-score, and ROC-AUC. Each metric provides a
different perspective on the models' performance, which is crucial for a comprehensive evaluation in the
context of ransomware detection. The evaluation metrics employed are:
Accuracy: Accuracy measures the proportion of correctly classified instances (both benign and ransomware)
out of the total instances. It is calculated as:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.7} $$
where:
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
Precision: Precision indicates the proportion of true positive instances among the instances classified as
positive (ransomware). It is calculated as:
$$ \text{Precision} = \frac{TP}{TP + FP} \tag{3.8} $$
Recall: Recall (or Sensitivity) measures the proportion of true positive instances among all actual positive
instances. It is calculated as:
$$ \text{Recall} = \frac{TP}{TP + FN} \tag{3.9} $$
F1-score: The F1-score is the harmonic mean of Precision and Recall, providing a balanced measure of the
models' performance. It is calculated as:
$$ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.10} $$
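Equations (3.7) to (3.10) can be checked against scikit-learn's metric functions on a small example. The ten labels below are hypothetical, and the helper `metrics_from_counts` is an illustrative name rather than part of any library.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metrics_from_counts(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical labels: 1 = ransomware (positive), 0 = benign (negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Counting by hand: TP = 3, FN = 1, FP = 1, TN = 5.
manual = metrics_from_counts(tp=3, tn=5, fp=1, fn=1)

library = (
    accuracy_score(y_true, y_pred),
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)
```

Both routes give an accuracy of 0.8 and a precision, recall, and F1-score of 0.75 for this example.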
ROC-AUC: The Receiver Operating Characteristic - Area Under Curve (ROC-AUC) measures the ability of
the model to distinguish between classes. It is calculated as the area under the ROC curve, which plots the
True Positive Rate (Recall) against the False Positive Rate (FPR). The FPR is calculated as:
$$ FPR = \frac{FP}{FP + TN} \tag{3.11} $$
RESULTS AND DISCUSSION
Table 4.1 summarizes the performance of the six machine learning models implemented for ransomware
detection (Decision Tree, Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN),
Neural Network, and Naive Bayes), evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC.
Table 4.1 Model Performance Evaluation
Model             Accuracy   Precision   Recall     F1 Score   ROC AUC
Decision Tree     0.99971    0.99952     0.99952    0.99952    0.99966
Random Forest     0.99978    0.99928     1.00000    0.99964    1.00000
SVM               0.99428    0.99008     0.99103    0.99055    0.99976
KNN               0.99377    0.98738     0.99211    0.98974    0.99816
Neural Network    0.99920    0.99952     0.99785    0.99868    0.99996
Naive Bayes       0.47269    0.36453     0.99761    0.53395    0.65940
The analysis of model performance metrics as shown in Figure 4.1 reveals significant variations across
different machine learning algorithms. Random Forest emerged as the top performer with exceptional metrics,
achieving the highest accuracy of 99.98%, perfect recall (100%), and outstanding precision (99.93%). This
superior performance is further validated by its excellent F1-score of 99.96% and near-perfect ROC-AUC
value of 99.99%, indicating its robust ability to distinguish between classes.
Decision Tree showed remarkable performance, closely following Random Forest with an accuracy of 99.97%.
Its consistent precision and recall values of 99.95% demonstrate balanced prediction capabilities,
complemented by strong F1-score and ROC-AUC metrics of 99.95% and 99.97% respectively.
The Neural Network implementation also demonstrated robust performance with 99.92% accuracy, high
precision (99.95%), and strong recall (99.78%). Its F1-score of 99.87% and ROC-AUC of 99.99% confirm its
effectiveness in classification tasks.
Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) both performed well, with accuracies above
99%. SVM achieved slightly better metrics overall with 99.43% accuracy compared to KNN's 99.38%, though
both models demonstrated strong capabilities in classification tasks.
In contrast, Naive Bayes showed significantly lower performance with an accuracy of 47.27% and precision of
36.45%. While it maintained high recall (99.76%), its low precision resulted in a poor F1-score (53.39%) and
ROC-AUC (65.94%), indicating substantial limitations in classification accuracy. This underperformance
suggests that Naive Bayes' assumptions about feature independence may not hold for this particular dataset.
Figure 4.1: Model Performance Comparison
Confusion Matrix Analysis
A detailed analysis of the confusion matrices reveals distinct patterns in classification performance across the
different models. Figure 4.2 shows that the Random Forest model produced zero false negatives out of
27,610 total predictions, achieving perfect recall. The model therefore never failed to identify a positive
case, while maintaining a very low false positive rate with only 11 misclassifications.
Figure 4.2 Confusion Matrix for RF
The Decision Tree classifier demonstrated a notably balanced performance, recording exactly 4 false positives
and 4 false negatives, as illustrated in Figure 4.3. This symmetry in misclassification, along with the model’s
high overall accuracy, suggests that it effectively learned the underlying patterns in the data without bias
toward either class.
Figure 4.3 Confusion Matrix for DT
As shown in Figure 4.4, the Neural Network's confusion matrix revealed slightly asymmetric errors, with 9
false negatives and 5 false positives. Although these misclassifications remain impressively low, they indicate
a slight tendency toward underpredicting the positive class. Nonetheless, the model sustained an excellent
overall accuracy exceeding 99.9%.
Figure 4.4 Confusion Matrix for NN
The Support Vector Machine (SVM) exhibited higher, yet balanced, misclassification rates with 83 false
positives and 75 false negatives as depicted in Figure 4.5. This near-symmetrical distribution of errors suggests
that, despite a greater number of misclassifications compared to the top-performing models, the SVM
maintained a good balance between the two classes.
Figure 4.5 Confusion Matrix for SVM
The K-Nearest Neighbors (KNN) algorithm showed a notable tendency toward false positives, with 106
instances compared to 66 false negatives, as shown in Figure 4.6. This suggests a slight bias toward positive
predictions, though the overall error rates remain low in the context of the dataset size. The higher number of
misclassifications compared to other models (except Naive Bayes) aligns with its slightly lower accuracy
metrics.
Figure 4.6 Confusion Matrix for KNN
The Naive Bayes model's confusion matrix, as shown in Figure 4.7, reveals significant challenges in
classification performance. With 11,866 false positives, the model showed a strong tendency to overpredict
positive cases, leading to poor precision. While it correctly identified most positive cases (high recall) with
only 20 false negatives, this came at the cost of misclassifying a large number of negative cases as positive.
This imbalance in prediction errors explains the model's low accuracy of 47.27% and suggests that the Naive
Bayes assumptions about feature independence were not appropriate for this classification task. The high
number of false positives indicates that the model struggled to properly differentiate between classes, making it
the least reliable among all tested models for this particular application.
Figure 4.7 Confusion Matrix for NB
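Because accuracy equals one minus the fraction of misclassified instances, the misclassification counts quoted in this subsection can be related back to Table 4.1. The sketch below does this for the Decision Tree, SVM, and KNN counts, assuming the same 27,610-instance test set quoted for Random Forest applies to every model; the helper name is illustrative.

```python
TOTAL_PREDICTIONS = 27_610  # test-set size quoted in the text (assumed shared by all models)

# (false positives, false negatives) as quoted in the confusion matrix discussion.
error_counts = {
    "Decision Tree": (4, 4),
    "SVM": (83, 75),
    "KNN": (106, 66),
}

def accuracy_from_errors(fp, fn, total=TOTAL_PREDICTIONS):
    """Accuracy = 1 - (FP + FN) / total predictions."""
    return 1.0 - (fp + fn) / total

accuracies = {
    name: round(accuracy_from_errors(fp, fn), 5)
    for name, (fp, fn) in error_counts.items()
}
```

The resulting values (0.99971, 0.99428, and 0.99377) agree with the Accuracy column of Table 4.1 for these three models.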
Feature Importance Analysis
The analysis of feature importance across all six machine learning models, as illustrated in Figure 4.8, reveals
significant insights into the key determinants of ransomware detection. The results demonstrate varying
degrees of feature significance across different models, with some features consistently emerging as crucial
predictors.
Figure 4.8 Feature Importance Analysis
The Decision Tree model heavily relied on two dominant features: Size Of Stack Reserve (63.98%) and Name
(34.64%), which together accounted for approximately 98% of the model's decision-making process. This
strong concentration on just two features suggests that the Decision Tree model found these characteristics
particularly discriminative for ransomware classification.
The Random Forest model exhibited a more distributed feature importance pattern, utilizing a broader range of
features in its decision-making process. The top contributing features included Name (13.02%), Size Of Stack
Reserve (10.16%), and Minor Image Version (9.92%). Additionally, features such as Version Information Size
(7.02%), Resources Min Size (6.89%), and Characteristics (5.51%) showed moderate importance,
demonstrating the model's ability to leverage multiple file characteristics for classification.
The Neural Network model showed a distinct preference for specific features, with Name having the highest
importance (36.90%), followed by md5 (18.46%). This concentration on these particular features suggests that
the neural network identified strong patterns in these characteristics for distinguishing between benign and
malicious files. The Support Vector Machine (SVM) demonstrated more balanced feature importance
distributions, with Name (3.64%) and Characteristics (3.28%) being the most influential features. This more
evenly distributed importance suggests that SVM relies on a broader set of features for its classification
decisions, although with lower individual feature weights compared to other models.
The K-Nearest Neighbors (KNN) algorithm showed relatively low feature importance values across all
features, with Subsystem (0.98%) and Name (1.12%) being the most significant. This pattern suggests that
KNN's classification decisions are based on multiple features with similar levels of importance, rather than
being dominated by specific characteristics. Naive Bayes showed the most distinct feature importance pattern,
with Check Sum (4.86%) being its most influential feature, followed by Minor Linker Version (2.34%) and
Characteristics (1.86%). However, many features showed very low or negative importance values, indicating
that the model might not effectively utilize the full range of available features, which could explain its lower
overall performance.
A notable observation across the models in Table 4.2 is the consistent importance of the Name feature,
which ranks among the top contributors in every model except Naive Bayes. This suggests that file naming
patterns carry significant information for ransomware detection. Similarly, structural characteristics like Size
Of Stack Reserve and various version information features proved important across multiple models,
indicating their reliability as indicators of malicious software.
The analysis also reveals that certain features, such as Loader Flags, Number Of Rva And Sizes, and Section
Alignment, consistently showed minimal importance across all models, suggesting they might be less relevant
for ransomware detection. This insight could be valuable for feature selection in future model optimizations.
Table 4.2: Feature importance of the first 10 features across the six models
Feature                     DT        RF        SVM       KNN       NN        NB
Name                        0.3464    0.1302    0.0364    0.0112    0.3690    0.0006
Size Of Stack Reserve       0.6398    0.1016    0.0056    0.0004    0.0004    -0.0002
md5                         0.0001    0.0014    0.0056    0.0024    0.1846    -0.0004
Characteristics             0.0013    0.0551    0.0328    0.0014    0.0024    0.0186
Minor Image Version         0.0000    0.0992    0.0012    0.0024    0.0000    0.0006
Version Information Size    0.0096    0.0702    0.0062    0.0024    0.0002    0.0022
Resources Min Size          0.0000    0.0689    0.0000    0.0000    0.0000    0.0014
Resources Nb                0.0000    0.0525    0.0012    0.0040    0.0008    0.0010
Subsystem                   0.0000    0.0414    0.0018    0.0098    0.0000    0.0056
Check Sum                   0.0017    0.0071    0.0002    0.0004    0.0000    0.0486
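The paper does not state how importance scores were obtained for models without built-in importances (SVM, KNN, NN, NB). One model-agnostic option is permutation importance, which measures the drop in a score when a feature's values are shuffled and can yield small negative values like those in the table. The sketch below is an assumption about a possible method, shown on synthetic data with a KNN classifier.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the ransomware features (assumption).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Shuffle each feature several times and record the mean drop in test accuracy.
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)

importances = result.importances_mean        # one score per feature, may be negative
ranking = importances.argsort()[::-1]        # feature indices, most important first
```

Because the same procedure works for any fitted estimator, it gives scores that are comparable across different model families.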
Discussion of Results
The findings indicate that Random Forest outperformed all other machine learning algorithms tested,
achieving an accuracy of 99.98%, perfect recall, and an F1-score of 99.96%. This aligns with the results of
Egunjobi et al. (2019), who also found Random Forest to be highly effective in classifying ransomware,
with an accuracy of 99.5%. The consistency in performance across different studies highlights the
robustness of Random Forest as a preferred algorithm for ransomware detection. In contrast, Naïve Bayes
exhibited significantly lower performance in the current research, with an accuracy of only 47.27%. This
underperformance echoes the findings of Masum et al. (2022), where Naïve Bayes was noted for its
limitations in classifying ransomware due to its assumption of feature independence.
The Decision Tree model in the current study demonstrated remarkable performance, closely following
Random Forest with an accuracy of 99.97%. This is consistent with Khammas et al. (2022) and Sharma et
al. (2021), who emphasized the effectiveness of Decision Trees in ransomware detection. The balanced
performance metrics of the Decision Tree, including its low false positive and false negative rates,
suggest that it can effectively learn the underlying patterns in the data, a characteristic also noted in
previous research. The Neural Network's accuracy of 99.92% further supports the notion that deep learning
models can be effective in ransomware classification, as highlighted by Sharma et al. (2021), who noted the
advantages of deep learning over traditional machine learning methods.
The comparative analysis of model performance metrics reveals that while SVM and KNN also performed
well, their accuracies were slightly lower than those of Random Forest and Decision Tree. This observation is
consistent with the work of Aurangzeb et al. (2021), which indicated that SVM can achieve promising results
but may not always outperform ensemble methods like Random Forest. The confusion matrices provide a
deeper understanding of the models' classification capabilities, particularly the high number of false
positives associated with Naïve Bayes, which aligns with the challenges noted in previous studies regarding
the algorithm's reliability in distinguishing between classes (Masum et al., 2022).
Furthermore, the current research highlights the importance of feature selection, with Random Forest utilizing
a broader range of features than the Decision Tree, which relied heavily on only two features. This
observation resonates with the findings of Ngirande et al. (2024), who emphasized the significance of
comprehensive feature analysis in enhancing model performance. The consistent importance
of the "Name" feature across multiple models in the current study suggests that certain characteristics are
critical indicators for ransomware detection, reinforcing the conclusions drawn by other researchers regarding
the relevance of specific features in classification tasks (Sharma et al., 2021; Masum et al., 2022).
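One way to inspect how concentrated a tree-based model's reliance on features is uses scikit-learn's impurity-based `feature_importances_` attribute. The sketch below is illustrative only: it uses synthetic data with placeholder feature names rather than the study's dataset, and the helper names `top_features` and `n_features_covering` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the ransomware dataset).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
dt = DecisionTreeClassifier(random_state=0).fit(X, y)


def top_features(model, names, k=3):
    """Return the k (name, importance) pairs with the highest importance."""
    ranked = sorted(zip(names, model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def n_features_covering(model, share=0.9):
    """Smallest number of features whose importances sum to at least `share`."""
    cumulative = np.cumsum(np.sort(model.feature_importances_)[::-1])
    return int(np.searchsorted(cumulative, share) + 1)


for label, model in (("Random Forest", rf), ("Decision Tree", dt)):
    print(label, top_features(model, feature_names),
          "features for 90% importance:", n_features_covering(model))
```

A model that concentrates its importance on a few features needs fewer of them to reach the 90% threshold, which gives a simple way to compare how broadly two models draw on the feature set.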
In summary, the current research findings not only corroborate previous studies regarding the effectiveness of
various machine learning algorithms in ransomware detection but also highlight existing gaps in the literature,
such as limited algorithm diversity and insufficient comparative analyses. Addressing these gaps will be
essential for advancing the field and developing more robust detection systems capable of adapting to the
evolving landscape of ransomware threats.
CONCLUSION
This research evaluated and compared the performance of six machine learning models, namely Random
Forest (RF), Decision Tree (DT), Neural Network (NN), Support Vector Machine (SVM), K-Nearest
Neighbors (KNN), and Naïve Bayes (NB), for the classification of ransomware data obtained from a GitHub
repository. The models were implemented using Python's scikit-learn library, and their performance was
assessed using Accuracy, Precision, Recall, F1-score, and ROC-AUC. The results
show that Random Forest was the top performer, with 99.98%
accuracy, 100% recall, and 99.93% precision; it detected every positive case, with only 11
misclassifications out of 27,610 predictions. Decision Tree was remarkably consistent, with 99.97%
accuracy and balanced precision and recall values of 99.95%, and a perfectly symmetric error distribution
of exactly 4 false positives and 4 false negatives. Neural Network achieved 99.92% accuracy with high
precision (99.95%) and recall (99.78%), showing slight asymmetry in error distribution.
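As a quick arithmetic check, accuracy can be recovered from error counts as 1 - (FP + FN) / N. The short sketch below does this for the Decision Tree, assuming (this is not stated explicitly above) that it was evaluated on the same 27,610 test predictions reported for Random Forest.

```python
# Reported Decision Tree errors: 4 false positives and 4 false negatives.
false_positives = 4
false_negatives = 4
# Assumption: same test-set size as the one reported for Random Forest.
n_predictions = 27_610

accuracy = 1 - (false_positives + false_negatives) / n_predictions
print(f"{accuracy:.2%}")  # prints 99.97%
```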
SVM and KNN also performed strongly, with accuracies of 99.43% and 99.38%, respectively. Naïve Bayes
significantly underperformed, with 47.27% accuracy and 36.45%
precision, showing a strong bias toward false positives (11,866 cases).
This research has demonstrated the superior effectiveness of ensemble and tree-based methods in ransomware
detection. The exceptional performance of Random Forest and Decision Tree algorithms, combined with
strong results from Neural Networks and SVM, establishes a robust framework for automated ransomware
detection. The study's comprehensive comparison of multiple algorithms provides valuable insights into their
relative strengths and limitations in cybersecurity applications.
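A comparison of this kind can be sketched in a few lines of scikit-learn. The example below is a minimal illustration, not the study's actual pipeline: it uses a synthetic dataset from `make_classification` in place of the GitHub ransomware data, near-default hyperparameters, `MLPClassifier` as a stand-in for the Neural Network, `GaussianNB` as a stand-in for Naïve Bayes, and a standard-scaling step that is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (assumption: NOT the study's dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameters here are illustrative, not those used in the study.
models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
    "NN": MLPClassifier(max_iter=1000, random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "NB": GaussianNB(),
}


def evaluate(model):
    """Fit one model on the training split and score it on the test split."""
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    y_score = pipeline.predict_proba(X_test)[:, 1]  # positive-class probability
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }


results = {name: evaluate(model) for name, model in models.items()}
for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

On synthetic data the ranking of the models need not match the results reported here; the sketch is meant only to show the shape of the evaluation loop and the five metrics.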
Based on the findings of this study, the following recommendations are proposed for future research and
practical applications:
i. Explore advanced ensemble methods and deep learning architectures to potentially improve upon the
current performance benchmarks.
ii. Investigate feature engineering techniques to enhance model performance, particularly focusing on the
most influential features identified across multiple algorithms.
iii. Develop real-time detection systems leveraging the high-performing models, with particular emphasis
on Random Forest and Neural Network implementations.
iv. Expand the dataset to include emerging ransomware variants and conduct regular model retraining to
maintain effectiveness against evolving threats.
v. Foster collaboration between machine learning experts and cybersecurity professionals to translate
these findings into practical defense mechanisms.
REFERENCES
1. Abualhaj, M. M., Abu-Shareha, A. A., Shambour, Q. Y., Al-Khatib, S. N., and Hiari, M. O. (2024).
Tuning the k value in k-nearest neighbors for malware detection. IAES International Journal of
Artificial Intelligence (IJ-AI), 13(2), 2275–2282. https://doi.org/10.11591/ijai.v13.i2.pp2275-2282
2. Al-Ruwili, A. S. M., and Mostafa, A. M. (2023). Analysis of Ransomware Impact on Android Systems
using Machine Learning Techniques. International Journal of Advanced Computer Science and
Applications, 14(11), 775–785. https://doi.org/10.14569/IJACSA.2023.0141178
3. Asad, A. B., Mansur, R., Zawad, S., Evan, N., and Hossain, M. I. (2020). Analysis of malware
prediction based on infection rate using machine learning techniques. 2020 IEEE Region 10
Symposium (TENSYMP). https://doi.org/10.1109/TENSYMP50017.2020.9230624
4. Aurangzeb, S., Rais, R. N. B., Aleem, M., Islam, M. A., and Iqbal, M. A. (2021). On the classification
of Microsoft-Windows ransomware using hardware profile. PeerJ Computer Science, 7,
e361. https://doi.org/10.7717/peerj-cs.361
5. Bawazeer, O., Helmy, T., and Al-Hadhrami, S. (2021). Malware detection using machine learning
algorithms based on hardware performance counters: Analysis and simulation. Journal of Physics:
Conference Series, 1962(1), 012010. https://doi.org/10.1088/1742-6596/1962/1/012010
6. Bold, R., Al-Khateeb, H., and Ersotelos, N. (2022). Reducing false negatives in ransomware detection:
A critical evaluation of machine learning algorithms. Applied Sciences, 12(24), 12941.
https://doi.org/10.3390/app122412941
7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
https://doi.org/10.1023/A:1010933404324
8. Egunjobi, S., Parkinson, S., and Crampton, A. (2019). Classifying ransomware using machine learning
algorithms. In Intelligent Data Engineering and Automated Learning – IDEAL 2019 (pp. 45–52).
Springer. https://doi.org/10.1007/978-3-030-33617-2_5
9. Xing, F., Xie, Y., Su, H., Liu, F., and Yang, L. (2018). Deep Learning in Microscopy Image
Analysis: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 29(10),
4550–4568. https://doi.org/10.1109/TNNLS.2017.2766168
10. Khammas, B. M. (2022). Comparative analysis of various machine learning algorithms for ransomware
detection. TELKOMNIKA (Telecommunication Computing Electronics and Control), 20(1),
43–52. https://doi.org/10.12928/telkomnika.v20i1.18812
11. Kok, S., Abdullah, A., Jhanjhi, N. Z., and Supramaniam, M. (2019). Prevention of crypto-ransomware
using a pre-encryption detection algorithm. Computers, 8(4), 79.
https://doi.org/10.3390/computers8040079
12. Masum, M., Faruk, M. J. H., Shahriar, H., Qian, K., Lo, D., and Adnan, M. I. (2022). Ransomware
classification and detection with machine learning algorithms. 2022 IEEE 12th Annual Computing and
Communication Workshop and Conference (CCWC),
0316–0322. https://doi.org/10.1109/CCWC54503.2022.9720869
13. Ngirande, H., Muduva, M., Chiwariro, R., and Makate, A. (2024). Detection and analysis of Android
ransomware using the support vector machines. International Journal for Research in Applied Science
and Engineering Technology, 12(1), 241–252. https://doi.org/10.22214/ijraset.2024.57885
14. Scaife, N., Carter, H., Traynor, P., and Butler, K. R. B. (2016). Cryptolock (and Drop It): Stopping
Ransomware Attacks on User Data. 2016 IEEE 36th International Conference on Distributed
Computing Systems (ICDCS).
15. Sharma, S., Kumar, R., and Krishna, C. R. (2021). A survey on analysis and detection of Android
ransomware. Concurrency and Computation: Practice and Experience, 33(16),
e6272. https://doi.org/10.1002/cpe.6272
16. Ucci, D., Aniello, L., and Baldoni, R. (2019). Survey of machine learning techniques for malware
analysis. Computers & Security, 81, 123–147.