A Machine Learning Model for Predicting the Risk of Developing Diabetes - T2DM Using Real-World Data from Kilifi, Kenya

Authors

Isaac Mumo Kailu

Institute of Computing and Informatics, Technical University of Mombasa (Kenya)

Dr. Mvurya Mgala

Institute of Computing and Informatics, Technical University of Mombasa (Kenya)

Dr. Fullgence Mwakondo

Institute of Computing and Informatics, Technical University of Mombasa (Kenya)

Article Information

DOI: 10.51244/IJRSI.2025.120800026

Subject Category: Machine Learning

Volume/Issue: 12/8 | Page No: 302-310

Publication Timeline

Submitted: 2025-07-22

Accepted: 2025-08-28

Published: 2025-08-29

Abstract

Type 2 Diabetes Mellitus (T2DM) is a growing public health concern in low-resource settings, where early detection remains limited by infrastructural and diagnostic constraints. This study presents a machine learning-based risk prediction model developed using real-world data from Kilifi County Referral Hospital in Kenya, with the aim of identifying individuals at risk of developing T2DM before clinical onset. The CRISP-DM framework guided the end-to-end process, from data collection to model deployment. The dataset comprised 2,500 anonymized electronic health records covering clinical, behavioral, demographic, and socioeconomic variables. Feature selection combined a statistical method (Chi-square test) with algorithm-based methods (Random Forest, Recursive Feature Elimination, and XGBoost importance), yielding two candidate feature sets of 14 and 7 features. Four supervised learning algorithms (Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost) were trained and evaluated using 5-fold cross-validation. XGBoost achieved the best performance, with a test set accuracy of 91.33%, an F1-score of 88.66%, and an AUC-ROC of 96.24%, outperforming the other models on all metrics. The study demonstrates that integrating multi-domain features with machine learning can enhance early risk stratification for T2DM in under-resourced environments. The final model’s ability to categorize individuals into low, medium, and high-risk groups offers a practical tool for targeted screening and preventive healthcare interventions in Kenyan public health systems.
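The three reported metrics and the low/medium/high grouping can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the labels and probabilities are made-up toy values, the 0.5 decision threshold is an assumption, and the tier cut-offs (0.33 and 0.66) are hypothetical, since the abstract does not state how the risk groups are defined.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is O(n^2), fine for a demo.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_tier(prob: float, low_cut: float = 0.33, high_cut: float = 0.66) -> str:
    """Map a predicted probability to a risk group.

    The cut-offs are hypothetical; the source does not specify them.
    """
    if prob < low_cut:
        return "low"
    if prob < high_cut:
        return "medium"
    return "high"


# Toy data (not from the study): 1 = developed T2DM, 0 = did not.
y_true: List[int] = [0, 0, 1, 1, 0, 1, 0, 1]
probs: List[float] = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.70, 0.60]

# Assumed decision threshold of 0.5 for the hard class label.
y_pred: List[int] = [1 if p >= 0.5 else 0 for p in probs]

print("accuracy:", accuracy(y_true, y_pred))   # 0.75
print("F1-score:", f1_score(y_true, y_pred))   # 0.75
print("AUC-ROC :", auc_roc(y_true, probs))     # 0.8125
print("tiers   :", [risk_tier(p) for p in probs])
```

Accuracy and F1 depend on the chosen decision threshold, whereas AUC-ROC is computed from the ranking of the probabilities alone, which is why a model can be compared on AUC-ROC before any screening cut-off is fixed.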

Keywords

Type 2 Diabetes Mellitus, Machine Learning, Risk Prediction, XGBoost, CRISP-DM, Kenya
