Diabetes Prediction and Analysis Using Machine Learning: A Comparative Study
Abstract - Diabetes is a chronic and widespread disease caused by insufficient insulin production by the pancreas or by the body's improper utilisation of the insulin it produces. It most commonly affects the middle-aged and the elderly; however, it can also be diagnosed in younger people owing to lifestyle or genetic factors. The resulting rise in blood glucose level leads to health problems. Diabetes is incurable, but early and precise detection allows precautions and lifestyle changes that significantly reduce the associated health problems. However, accurate diagnosis is hampered by the scarcity of labelled datasets that properly capture all the outliers. This paper uses the PIMA Indian Diabetes Dataset from the National Institute of Diabetes and Digestive and Kidney Diseases. Features such as age, glucose, BMI, blood pressure and insulin are used to predict the occurrence of diabetes in patients. Pre-processing and exploratory data analysis (EDA) of the imbalanced data are also conducted, and a form of median imputation is devised to handle the imbalanced dataset. Multiple machine learning techniques, namely Logistic Regression, KNN and LightGBM, are applied both individually and as an ensemble, and their performance metrics are compared to obtain the most robust predictions possible.
Keywords - Diabetes Prediction, PIMA Indian Diabetes Dataset, K-Nearest Neighbours (KNN), Logistic Regression (LR), Light Gradient Boosting Machine (LGBM), Ensemble Learning
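The two steps named in the abstract, median imputation and ensembling, can be illustrated with a minimal sketch. It is not the exact procedure used in this paper: it assumes that a zero in a clinical column marks a missing reading, that the ensemble averages the positive-class probabilities of the individual models (soft voting), and that the column names `Glucose` and `BMI` follow the commonly distributed CSV of the dataset. The function names are hypothetical.

```python
from statistics import median


def impute_zeros_with_median(rows, columns):
    """Return a copy of rows where zeros in `columns` are replaced by the
    median of the non-zero values of that column.

    Assumption: a zero in these columns stands for a missing measurement.
    """
    imputed = [dict(row) for row in rows]
    for col in columns:
        observed = [row[col] for row in rows if row[col] != 0]
        if not observed:
            continue  # nothing to estimate the median from
        fill = median(observed)
        for row in imputed:
            if row[col] == 0:
                row[col] = fill
    return imputed


def soft_vote(probabilities, threshold=0.5):
    """Average the positive-class probabilities of several models and
    threshold the mean to obtain a 0/1 prediction.

    Assumption: soft voting; the ensembling scheme could equally be
    hard voting or stacking.
    """
    mean_prob = sum(probabilities) / len(probabilities)
    return mean_prob, int(mean_prob >= threshold)


# Toy records, not taken from the dataset.
records = [
    {"Glucose": 100, "BMI": 30.0},
    {"Glucose": 0, "BMI": 0},
    {"Glucose": 120, "BMI": 34.0},
]
cleaned = impute_zeros_with_median(records, ["Glucose", "BMI"])

# Hypothetical probabilities from LR, KNN and LightGBM for one patient.
mean_prob, label = soft_vote([0.25, 0.75, 0.75])
```

In practice the per-model probabilities would come from fitted classifiers, and the medians should be computed on the training split only so that no information leaks from the test data.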