Machine Learning-Based Prediction of EGFR Bioactivity Using Molecular Fingerprints

Authors

Kandula Siri Chandana

PG Student, Department of Computer Science, GSS, GITAM Deemed to be University (India)

Vanitha Kakollu

Assistant Professor, Department of Computer Science, GSS, GITAM Deemed to be University (India)

Article Information

DOI: 10.51244/IJRSI.2026.1304000079

Subject Category: Computer Science

Volume/Issue: 13/4 | Page No: 815-820

Publication Timeline

Submitted: 2026-04-04

Accepted: 2026-04-10

Published: 2026-05-01

Abstract

Drug discovery is a complex, time-consuming, and costly process. The epidermal growth factor receptor (EGFR) has become one of the main targets in oncology research, and finding new medicines requires identifying compounds that are active against it. This work uses machine learning to predict EGFR bioactivity from molecular fingerprints derived from each compound's SMILES string. The dataset was drawn from the ChEMBL database and preprocessed into binary bioactivity classes. Several machine learning models were evaluated: Random Forest, Support Vector Machine, Logistic Regression, Gradient Boosting, and XGBoost. Random Forest performed best among the tested models, with an accuracy of 87%. The model was implemented as a web application using the Streamlit framework.
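
The binary labelling step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which activity measure or cutoff was used, so the IC50 measure, the 1000 nM threshold, the function names, and the example records below are all assumptions made for illustration only.

```python
import math

# Assumed cutoff: compounds with IC50 <= 1000 nM are labelled active.
# The abstract does not report the threshold actually used.
ACTIVE_IC50_NM = 1000.0


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


def label_bioactivity(ic50_nm: float, cutoff_nm: float = ACTIVE_IC50_NM) -> int:
    """Return 1 (active) if IC50 is at or below the cutoff, else 0 (inactive)."""
    return 1 if ic50_nm <= cutoff_nm else 0


# Hypothetical records (name, IC50 in nM); not taken from ChEMBL.
records = [("mol_a", 12.0), ("mol_b", 1000.0), ("mol_c", 25000.0)]
labels = {name: label_bioactivity(ic50) for name, ic50 in records}
```

Labels produced this way would serve as the target column for the classifiers, with the fingerprint bits as features. A different cutoff, or a three-way split that discards intermediate compounds, would change the class balance and therefore the reported accuracy.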

Keywords

EGFR Bioactivity, Machine Learning
