Machine Learning-Based Prediction of EGFR Bioactivity Using Molecular Fingerprints
Authors
PG Student, Department of Computer Science, GSS, GITAM Deemed to be University (India)
Assistant Professor, Department of Computer Science, GSS, GITAM Deemed to be University (India)
Article Information
DOI: 10.51244/IJRSI.2026.1304000079
Subject Category: Computer Science
Volume/Issue: 13/4 | Page No: 815-820
Publication Timeline
Submitted: 2026-04-04
Accepted: 2026-04-10
Published: 2026-05-01
Abstract
Drug discovery is a complex, time-consuming, and costly process. The epidermal growth factor receptor (EGFR) has become one of the main targets in oncology research, and discovering new medicines requires identifying compounds that are active against it. This work proposes using machine learning to predict EGFR bioactivity from molecular fingerprints derived from a compound's SMILES string. The dataset was drawn from the ChEMBL database and preprocessed into binary bioactivity classes. We trained several machine learning models: Random Forest, Support Vector Machine, Logistic Regression, Gradient Boosting, and XGBoost. Random Forest performed best among the tested models, with an accuracy of 87%. The model was implemented as a web application using the Streamlit framework.
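The pipeline summarized in the abstract (SMILES string, then molecular fingerprint, then classifier) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the fingerprint type or its parameters, so Morgan (ECFP-like) fingerprints with radius 2 and 2048 bits are assumed here, as are the Random Forest hyperparameters. The SMILES strings and labels are hypothetical placeholders chosen only so the code runs; they are not ChEMBL records or real EGFR activity data. The function name `smiles_to_fingerprint` is likewise introduced for illustration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.ensemble import RandomForestClassifier

# Assumed fingerprint settings; the abstract does not specify them.
N_BITS = 2048
_generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=N_BITS)


def smiles_to_fingerprint(smiles):
    """Return a 0/1 Morgan fingerprint vector, or None if the SMILES is invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when it cannot parse the string
        return None
    return _generator.GetFingerprintAsNumPy(mol)


# Hypothetical placeholder molecules and labels (1 = active, 0 = inactive).
# These are NOT real bioactivity data; they only make the sketch executable.
toy_smiles = [
    "CCO",
    "c1ccccc1",
    "CC(=O)O",
    "c1ccc2ncncc2c1",
    "Nc1ncnc2ccccc12",
    "CCN",
]
toy_labels = np.array([0, 0, 0, 1, 1, 0])

# Stack one fingerprint per molecule into a feature matrix.
X = np.array([smiles_to_fingerprint(s) for s in toy_smiles])

# Assumed hyperparameters; random_state fixed for reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, toy_labels)

# Predicted probability of the "active" class for each molecule.
active_probability = model.predict_proba(X)[:, 1]
```

In a realistic setting the model would be evaluated on held-out compounds rather than on its own training data, and the same `smiles_to_fingerprint` step would be applied to any SMILES string submitted through a web front end.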
Keywords
EGFR Bioactivity, Machine Learning
References
1. D. Mendez et al., “ChEMBL: Towards direct deposition of bioassay data,” Nucleic Acids Research, vol. 47, no. D1, pp. D930–D940, 2019.
2. G. Landrum, “RDKit: Open-source cheminformatics software,” 2023. [Online]. Available: https://www.rdkit.org
3. F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
4. L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
5. J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
6. T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proc. ACM SIGKDD, 2016, pp. 785–794.
7. C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
8. D. Rogers and M. Hahn, “Extended-connectivity fingerprints,” Journal of Chemical Information and Modeling, vol. 50, no. 5, pp. 742–754, 2010.
9. A. Lavecchia, “Machine-learning approaches in drug discovery: Methods and applications,” Drug Discovery Today, vol. 20, no. 3, pp. 318–331, 2015.