A Machine Learning Model for Predicting the Risk of Developing Diabetes - T2DM Using Real-World Data from Kilifi, Kenya
Authors
Institute of Computing and Informatics, Technical University of Mombasa (Kenya)
Institute of Computing and Informatics, Technical University of Mombasa (Kenya)
Institute of Computing and Informatics, Technical University of Mombasa (Kenya)
Article Information
DOI: 10.51244/IJRSI.2025.120800026
Subject Category: Machine Learning
Volume/Issue: 12/8 | Page No: 302-310
Publication Timeline
Submitted: 2025-07-22
Accepted: 2025-08-28
Published: 2025-08-29
Abstract
Type 2 Diabetes Mellitus (T2DM) is a growing public health concern in low-resource settings, where early detection remains limited by infrastructural and diagnostic constraints. This study presents a machine learning-based risk prediction model developed using real-world data from Kilifi County Referral Hospital in Kenya, with the aim of identifying individuals at risk of developing T2DM before clinical onset. The CRISP-DM framework guided the end-to-end process, from data collection to model deployment. The dataset comprised 2,500 anonymized electronic health records covering clinical, behavioral, demographic, and socioeconomic variables. Feature selection combined a statistical method (Chi-square test) with algorithm-based methods (Random Forest, Recursive Feature Elimination, and XGBoost importance), yielding two candidate feature sets of 14 and 7 features. Four supervised learning algorithms were trained and evaluated using 5-fold cross-validation: Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost. XGBoost achieved the best performance, with a test-set accuracy of 91.33%, an F1-score of 88.66%, and an AUC-ROC of 96.24%, outperforming the other models on all metrics. The study demonstrates that integrating multi-domain features with machine learning can enhance early risk stratification for T2DM in under-resourced environments. Because the final model categorizes individuals into low-, medium-, and high-risk groups, it offers a practical tool for targeted screening and preventive healthcare interventions in Kenyan public health systems.
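The evaluation metrics and the three-tier risk grouping described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The toy labels and probabilities are invented for illustration and are not drawn from the Kilifi dataset. The tier cut-offs (0.33 and 0.66), the 0.5 decision threshold, and all function names are assumptions, since the abstract does not state how the risk groups were defined.

```python
from typing import List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels, where 1 = positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_tier(probability: float, low_cut: float = 0.33, high_cut: float = 0.66) -> str:
    """Map a predicted probability to a risk group; the cut-offs are hypothetical."""
    if probability < low_cut:
        return "low"
    if probability < high_cut:
        return "medium"
    return "high"


# Toy data (invented): true outcomes and a model's predicted probabilities.
y_true: List[int] = [0, 0, 1, 1, 0, 1, 0, 1]
probs: List[float] = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.70, 0.60]

# Hard predictions at an assumed 0.5 decision threshold.
y_pred: List[int] = [1 if p >= 0.5 else 0 for p in probs]
tiers: List[str] = [risk_tier(p) for p in probs]

print("accuracy:", accuracy(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("AUC-ROC:", auc_roc(y_true, probs))
print("risk tiers:", tiers)
```

In practice the probabilities would come from a trained classifier such as XGBoost, and the tier cut-offs would be chosen with clinical input rather than fixed at the values assumed here.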
Keywords
Type 2 Diabetes Mellitus, Machine Learning, Risk Prediction, XGBoost, CRISP-DM, Kenya