Computer Science Capstone - joseph-curtis/data-science-diabetes-classifier GitHub Wiki

Author: Joseph Curtis

Western Governors University C964

List of Abbreviations

Abbreviation Definition
API Application Programming Interface
AWS Amazon Web Services
BMI Body Mass Index
BRFSS Behavioral Risk Factor Surveillance System
CDC Centers for Disease Control and Prevention
CRISP-DM Cross Industry Standard Process for Data Mining
HIPAA Health Insurance Portability and Accountability Act
IDE Integrated Development Environment
PHI Protected Health Information
PII Personally Identifiable Information
URL Uniform Resource Locator


Application Programming Interface (API) This allows different software programs to talk to each other. As a set of tools for building software applications, it makes it easier to connect and integrate different systems.

Behavioral Risk Factor Surveillance System (BRFSS) Annual survey conducted by the Centers for Disease Control and Prevention. It collects information on health-related risk behaviors, chronic health conditions, and preventive service utilizations among adults in the United States.

Body Mass Index (BMI) A key indicator of obesity, which is a significant risk factor for diabetes. BMI is a numerical value derived from a person's weight and height, indicating if an individual is underweight, normal weight, overweight, or obese.

Cross Industry Standard Process for Data Mining (CRISP-DM) Methodology A framework for organizing and implementing data mining projects. It involves six phases: (1) understanding the business context and objectives, (2) understanding the data, (3) preparing the data for analysis, (4) building models to find patterns and insights, (5) evaluating these models against business objectives, and (6) deploying the model for practical use. CRISP-DM is a systematic method that helps teams navigate the complexity of extracting valuable insights from large data sets across multiple industries and functions.

Hypertension High Blood Pressure; a condition where the blood pressure in the arteries is persistently elevated. Hypertension often occurs alongside diabetes and can exacerbate its complications.

Integrated Development Environment (IDE) An advanced text editor specifically for coding. It includes tools to help programmers write programs, test them out and debug the software. Typically, it combines a code editor, compiler, and debugger into one package.

Logistic Regression A predictive analysis technique used to determine the outcome of a binary scenario, such as pass/fail, win/lose, healthy/sick, based on identifying patterns from previous data. It helps in estimating the likelihood of an event occurring by analyzing the relationship between multiple factors.

Personally Identifiable Information (PII) This is any data that can be used to identify a specific individual. Examples include names, addresses, phone numbers, social security numbers, and email addresses. PII is sensitive information that, if disclosed, could compromise an individual's privacy and security.

Protected Health Information (PHI) Any information in a medical record or other health-related information that can be used to identify an individual and that was created, used, or disclosed while providing a healthcare service such as diagnosis or treatment. PHI includes a wide range of identifiers, from names and social security numbers to biometric records.

Random Forest A method for making predictions or decisions based on combining insights from several simpler models. It builds multiple decision-making trees and merges them together to get a more accurate and stable prediction. This approach is useful for both classification (categorizing items) and regression (predicting a number) tasks and is known for its accuracy and ability to handle large datasets with numerous variables.

A. Letter of Transmittal

March 25, 2024

From: Joseph Curtis, Data Analyst Data Solutions Consulting Inc.

To: Senior Leadership Team Docs'R'Us Medical Clinic

Subject: Proposal for Development of a Diabetes Prediction Application

Dear Senior Leadership Team,

I am writing to propose the development of an innovative application of machine learning, which has been designed to vastly improve your clinic's predictive ability. This will aid in managing the risk of diabetes among your patients. The initiative for this project stems from a recognized need to enhance patient care services at Docs'R'Us Medical Clinic by leveraging the power of machine learning in health risk assessment. This letter outlines a plan for creating a completely new type of early warning tool that will enable clinicians to identify which patients are at considerable risk for diabetes.

Summary of the Problem

The global prevalence of diabetes has drawn growing attention, particularly to early identification so that effective management can begin as soon as possible. A major challenge for your clinic is identifying risk early and intervening before patients develop diabetes. For these patients the health impact is severe, and treatment expenses keep increasing. Most current prediction and management systems are strongly reactive and rely on individual health assessments, which can delay key interventions. Existing manual screening processes are also time-consuming and do not identify everyone at risk.

Proposed Solution

I propose a machine learning-based application that predicts which patients are at risk of developing diabetes. The tool would use a model trained on data from health questionnaires to flag at-risk individuals so that early intervention can be offered. This will help your medical staff proactively identify patients who may be at risk and act efficiently.

Benefits to the Organization

Implementing this application will enable your clinic to:

  • Enhance the effectiveness of your preventive care measures.

  • Allocate medical resources more efficiently.

  • Reduce the workload on medical staff, streamlining operational efficiency.

  • Improve patient outcomes through early detection and management of diabetes risk.

  • Strengthen your position as a leader in innovative healthcare solutions.

Implementation Summary

  • Costs: The project is estimated to cost approximately $82,000 for labor, and $1,600 per month of ongoing costs covering hardware, software, maintenance, and environment costs.

  • Timeline: The application's development and deployment are projected to be completed within 4-6 months.

  • Data: The datasets used are publicly available and have been anonymized (free from Personally Identifiable Information), which minimizes privacy concerns.

  • Ethical Concerns: We will adhere to all ethical guidelines in data handling and patient privacy.

As a recent Computer Science graduate, I have worked with data analysis, machine learning, and software development, the core skills needed to execute this project successfully. My academic and project experience provides a solid foundation in the technologies needed for this initiative.

I am confident that this predictive tool will significantly contribute to your clinic's mission of providing exceptional patient care. This application is an opportunity to provide substantial benefits to your organization by improving the efficiency and effectiveness of your diabetes risk screening processes. I look forward to discussing this proposal further and answering any questions.

Thank you for considering this initiative.


Joseph Curtis

Joseph Curtis, Data Analyst
Data Solutions Consulting Inc.

B. Project Proposal Plan

Project Summary

Problem Description

Diabetes is a disease with dramatic implications for both patient well-being and healthcare costs. Identifying those at risk of conditions likely to become chronic, such as diabetes, constitutes a daunting challenge for the modern healthcare industry.

Many complications could be averted with timely detection and intervention. However, manual screening and diagnosis are labor-intensive and still miss individuals who are at risk. There is a critical need for automated tools that can accurately predict diabetes risk based on readily available health indicators.

Client Needs

Docs'R'Us is a general clinic that serves patients from many demographics, so it needs screening that is both accurate and efficient. The clinic requires a solution that strengthens its preventive care strategy by detecting diabetes or pre-diabetes among its high-risk patients. This need arises from a wish to better allocate follow-up resources, ensuring timely intervention for those in most need. The clinic requires a tool that integrates seamlessly with existing workflows, offers clear and actionable insights, and supports staff in making informed decisions regarding patient care.

Project Deliverables

  1. Data Analysis and Prediction Jupyter Notebook A comprehensive Jupyter notebook that outlines the entire process of data analysis, model training, evaluation, and model selection. This document will serve as a report and guide on how the predictive model was developed.

  2. Diabetes Prediction Model The core deliverable. This predictive model will be developed using scikit-learn and will estimate an individual's risk of having diabetes from health predictors such as BMI, age, smoking, hypertension, and high cholesterol.

  3. User Guide A simple step-by-step guide for the clinic staff to deploy and use the prediction model. This guide includes the instructions on how to access the online interactive Jupyter notebook.

Client Benefit Justification

The proposed diabetes prediction tool has the potential to significantly advance the clinic's preventive care initiatives and its ability to anticipate diabetes in the patients who most need attention. For example, machine learning can analyze complex health data to flag a person who is at risk months or even years earlier than traditional methods can. This proactive approach may improve patients' quality of life through earlier interventions, and it also directs the clinic's resources to where they are most needed. If this tool is implemented, Docs'R'Us will be positioned as a leader in innovative patient care and can set a new benchmark in preventive health strategies within the medical community.

Data Summary

Data Sourcing

The data for this project will be sourced from two comprehensive datasets available on Kaggle, which have been derived from the Centers for Disease Control and Prevention's (CDC) Behavioral Risk Factor Surveillance System (BRFSS) for the years 2015 and 2021. These datasets can be found at the following URLs:

2015 Dataset: Kaggle - Diabetes Health Indicators Dataset (2015)

2021 Dataset: Kaggle - Diabetes Health Indicators Dataset (2021)

The raw data originates from the BRFSS, an ongoing, state-based telephone survey conducted annually by the CDC. The health risk behavior survey is a collection of information about health-related risk behaviors, some of which are connected to major chronic diseases, and how adults aged 18 years and older in the US make use of healthcare services. All the selected datasets will be cleaned and consolidated to provide a balanced representation of respondents with no diabetes and with either prediabetes or diabetes. This will aid in building up a good training dataset for predictive modeling.

Data Processing and Management

Throughout the application development life cycle the data will undergo meticulous processing and management to ensure integrity, relevance, and usability. Initially, both datasets will be merged into a single dataframe. From this, all the irrelevant features will be removed so that the remaining features relate to diabetes risk. Further preprocessing will reduce storage requirements by converting data types and will scale numerical values to improve logistic regression training.

Data management practices will adhere to best practices for data science projects, including version control for modeling scripts to enable collaboration and reproducibility of results. Regular audits of the data processing and handling procedures will be conducted to maintain quality and relevance throughout the project's lifespan.
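The scaling step described above can be sketched as follows. This is a minimal illustration that assumes z-score standardization (subtract the mean, divide by the standard deviation), which is one common choice for logistic regression; the proposal does not name a specific scaler, and the BMI values are made up for the example.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; map every value to zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical BMI values, for illustration only.
bmi = [22.0, 27.5, 31.0, 40.5, 24.0]
bmi_scaled = standardize(bmi)
```

After scaling, the column has a mean of zero and a standard deviation of one, so features measured on different scales contribute comparably during training.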

Justification of Data Selection

The datasets selected are particularly appropriate for this project because they cover many health indicators relevant to diabetes risk, such as BMI, smoking status, hypertension, and high cholesterol. The datasets are balanced, with the same number of diabetic and non-diabetic respondents, which makes them well suited for training a prediction model and helps avoid biased predictions. Further, anomalies will be handled carefully so that they do not skew the model's predictions, preserving the reliability and accuracy of the tool.

Ethical and Legal Considerations

The datasets are publicly available from the BRFSS survey and are licensed under Creative Commons Zero (CC0: Public Domain), so there are no direct ethical or legal concerns regarding their use. The data does not contain Personally Identifiable Information (PII) or Protected Health Information (PHI) and therefore raises no issue of privacy or confidentiality. The project will nonetheless follow ethical data handling practices that respect the anonymity and integrity of the data subjects, who are only indirectly involved. This approach ensures that the project abides by ethical and legal requirements and builds trust and transparency between the client and the end users of the predictive tool.



The project will apply the Cross-Industry Standard Process for Data Mining (CRISP-DM), which is an industry-standard methodology widely recognized for its robustness and flexibility in guiding data mining projects. This methodology is conducted in six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. Each phase will implement the machine learning solution in an iterative cycle of model development and refinement. This structured approach assures effective progress in a stepwise manner.

Implementation Plan

  1. Business Understanding:

    • Define the objectives and requirements from a business perspective.

    • Translate these objectives into a data analytics problem definition. This will focus on predicting diabetes risk among patients.

  2. Data Understanding:

    • Explore the integrated dataset to get an overview of the data, identify quality issues, and discover preliminary insights.

    • Assess the distribution and relationship of key features related to diabetes risk.

  3. Data Preparation:

    • Clean the features as needed (handling missing values and outliers) and further transform the features through scaling and encoding, if required, to make the data ready for modeling.

    • Split the data into training and testing sets in a way that allows training the model on one subset of the data and testing it with another unseen subset for validation.

  4. Modeling:

    • Select Logistic Regression and Random Forests as the initial machine learning algorithms for comparison due to their suitability for binary classification problems.

    • Train both models on the training dataset, adjusting parameters through grid search and cross-validation to find the optimal model configurations.

  5. Evaluation:

    • Evaluate the models on the testing dataset using accuracy, precision, recall, and F1 score.

    • Use confusion matrices to analyze the models' performance in correctly predicting diabetes risk and minimizing false negatives.

    • With the evaluation metrics, compare the performance of the Logistic Regression and Random Forest models to select the most effective algorithm.

  6. Deployment:

    • Finalize the model that demonstrates the best performance in accurately predicting at-risk patients.

    • Integrate the finalized model into a user-friendly interface or tool that will allow medical staff at Docs'R'Us to enter patient health indicators and get predictions on diabetes risk.

    • Develop a deployment plan that includes training medical staff on how to use the tool effectively and a maintenance plan to regularly update the model based on new data or changes in predictive variables.

This implementation plan focuses on the machine learning aspect of the project and lays out a clear, step-by-step approach to developing a prediction tool that meets the client's need for early identification of diabetes risk. By adhering to the CRISP-DM methodology, the project will progress systematically from understanding the business problem to deploying a working, efficient machine learning solution.
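The evaluation metrics named in step 5 can all be derived from the four cells of a binary confusion matrix. The sketch below is illustrative only: the function names and the sample labels are hypothetical, and label 1 is assumed to mean "diabetes or pre-diabetes".

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = at risk)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Recall is the metric most directly tied to false negatives: each at-risk patient the model misses lowers it, which is why the plan singles out false negatives when reading the confusion matrices.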


Milestone | Duration (days) | Projected start date | Anticipated end date
Business Understanding | 5 | Apr 1, 2024 | Apr 8, 2024
Data Understanding | 10 | Apr 8, 2024 | Apr 22, 2024
Data Preparation | 15 | Apr 22, 2024 | May 13, 2024
Model Selection and Initial Training | 9 | May 13, 2024 | May 27, 2024
Model Evaluation and Refinement | 15 | May 27, 2024 | Jun 17, 2024
Final Model Selection | 8 | Jun 17, 2024 | Jun 28, 2024
Deployment Preparation | 14 | Jul 1, 2024 | Jul 22, 2024
User Interface Development and Testing | 15 | Jul 15, 2024 | Aug 5, 2024
Staff Training and Model Deployment | 10 | Aug 5, 2024 | Aug 19, 2024

Evaluation Plan

Verification Methods

  1. During Data Preparation:

    • Consistency Checks: Apply consistency checks to ensure that the data transformations and cleaning procedures apply to the dataset without causing any errors or biases.

    • Data Type and Range Verification: Check that the converted data types are correct. Verify that all numeric values fall within the expected ranges post-transformation (such as age and BMI).

  2. Modeling Phase:

    • Code Review: Conduct code reviews at various stages of model development to ensure that the implementation aligns with project specifications and machine learning best practices.

    • Unit Testing: Write and run unit tests on functions that process data and scripts that train the model using those functions to verify each component's proper behavior in isolation.

    • Integration Testing: After unit testing, run integration tests to ensure that the data processing and modeling steps work together as a pipeline and produce the expected output without errors.

  3. Before Deployment:

    • System Testing: Test the complete machine learning solution, including the interface or tool provided to the medical staff, to ensure it meets all specifications and requirements.

    • Performance Testing: The model will be tested for performance analysis under different loads, focusing on the prediction speed and resource utilization. This will ensure that the tool can handle the expected volume of queries efficiently.
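The data type and range verification described under Data Preparation could take a form like the sketch below. The column names and the bounds in EXPECTED_RANGES are hypothetical placeholders, not the project's actual limits, and the sample rows are made up.

```python
# Hypothetical bounds, for illustration only; real limits would come from
# the dataset's documentation.
EXPECTED_RANGES = {
    "bmi": (10, 100),
    "age": (18, 120),
}

def find_out_of_range(rows, expected_ranges):
    """Return (row_index, column, value) for every value outside its range."""
    problems = []
    for index, row in enumerate(rows):
        for column, (low, high) in expected_ranges.items():
            value = row[column]
            if not (low <= value <= high):
                problems.append((index, column, value))
    return problems

# Made-up records: the second has an implausible BMI, the third an implausible age.
rows = [
    {"bmi": 27.0, "age": 45},
    {"bmi": 5.0, "age": 52},
    {"bmi": 33.5, "age": 130},
]
problems = find_out_of_range(rows, EXPECTED_RANGES)
```

A check like this fits naturally into a unit test that fails whenever the list of problems is not empty after a transformation step.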

Validation Method

A K-Fold Cross-Validation method will be used to validate the model. This method involves dividing the original dataset into a set number of equal-sized folds. The process follows these steps:

  1. Partitioning: The input data is randomly partitioned into 'k' equal-sized folds. A value of 'k', typically five or ten, will be chosen to balance variance and bias in model evaluation.

  2. Model Training and Validation Loop: For each fold:

    • Treat the current fold as the testing set, and the remaining k-1 folds as the training set.

    • Train the model on the k-1 training folds.

    • Validate the model on the current testing fold.

    • Record the model's performance metrics (accuracy, precision, recall, F1 score, etc.) for this iteration.

  3. Aggregation of Results: After cycling through all k-folds, aggregate the performance metrics from each iteration to compute the overall performance of the model. This aggregation will provide an overall view of how the model is expected to perform on unseen data, considering the variance introduced by different training and testing sets.

The K-Fold cross-validation process will ensure a thorough and unbiased assessment is made in relation to the predictive potential of the model, instilling confidence in its ability to generalize over new data. This is a key step to demonstrate the tool's effectiveness in accurately predicting the risk of diabetes among patients and ensuring the tool's reliability for clinical use at Docs'R'Us.
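The partitioning, training loop, and aggregation steps above can be sketched in plain Python. This is a minimal illustration rather than the project's implementation: the function names are hypothetical, and score_fold stands in for whatever routine trains the model on the training indices and returns a metric for the testing indices.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partitioning
    # Slice with a stride of k so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

def cross_validate(n_samples, k, score_fold):
    """Average the score returned by score_fold(train, test) over all folds."""
    scores = [score_fold(train, test)
              for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Ten samples split into five folds: each fold tests on 2 and trains on 8.
splits = list(k_fold_indices(n_samples=10, k=5))
```

Every sample appears in exactly one testing fold, so each record contributes to the aggregated score once as unseen data.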

Resources and Costs

Hardware Costs

Development Workstations: High-performance computers that will be used by the development team to model data and develop the application. Estimated cost: $2,000 per unit.

Server Infrastructure: Used in model training, especially with large datasets and complex models. Cost varies depending on whether cloud services or in-house servers are used. Estimated monthly cost for cloud-based services: $500.

Software Costs

Development Tools: Includes integrated development environments (IDEs) such as Jupyter Notebook, along with data analysis and visualization tools (Python, scikit-learn). Most of these tools are open-source and free.

Cloud Services: Services provided by a third party over the internet and billed according to usage of cloud platforms (such as AWS, Google Cloud, and Azure) during development or deployment. Estimated cost: $300 per month.

Labor Costs

Data Scientists: Carry out data preparation, model building, and evaluation.

  • Timeframe: 3 months
  • Estimated Monthly Cost: $10,000

Software Developers: Responsible for the development of the user interface and embedding the developed model into the application.

  • Timeframe: 2 months
  • Estimated Monthly Cost: $8,000

Project Manager: Oversees the project to ensure timely completion and quality.

  • Timeframe: 4 months
  • Estimated Monthly Cost: $9,000

Environment Costs

Deployment: Includes server cost to host the model and application, and the cost for deployment of the machine learning model. Estimated cost: $500 per month.

Hosting: For the web-based application or tool that the clinic staff will use. Estimated cost: $100 per month.

Maintenance: Covers the cost of keeping the model and application up to date, including updates, security patches, and scaling. Estimated cost: $200 per month.

Total Estimated Costs Summary

  • Hardware Costs: $2,000 per development workstation, $500 per month for server infrastructure.

  • Software Costs: Primarily cloud services at approximately $300 per month.

  • Labor Costs: Around $82,000 should be allocated for labor over the whole project duration. This assumes a project team composed of a data scientist, a software developer, and a project manager.

  • Environment Costs: Deployment and hosting estimated at $600 per month, with ongoing maintenance at $200 per month.

These estimates provide a framework for budgeting the project. Actual costs may differ, depending on the specific needs, the duration of development and maintenance, and the hardware and software solutions selected. Additionally, potential cost savings from using open-source software and the flexibility and scalability offered by cloud services in adjusting resources to the project's phase and demand should also be considered.
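The totals can be checked from the line items listed above. The short calculation below assumes one person per role, which is what the $82,000 labor figure implies; the dictionary keys are labels chosen for this example.

```python
# (months, monthly cost in USD) per role, taken from the Labor Costs section.
labor = {
    "data_scientist": (3, 10_000),
    "software_developer": (2, 8_000),
    "project_manager": (4, 9_000),
}
labor_total = sum(months * rate for months, rate in labor.values())

# Recurring monthly costs in USD, taken from the hardware, software,
# and environment sections.
monthly = {
    "server_infrastructure": 500,
    "cloud_services": 300,
    "deployment": 500,
    "hosting": 100,
    "maintenance": 200,
}
monthly_total = sum(monthly.values())

print(f"Labor: ${labor_total:,}  Ongoing: ${monthly_total:,} per month")
```

The labor total comes to $82,000 and the recurring costs to $1,600 per month. The one-time $2,000 per development workstation is not included in either figure.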

C. Application

The submitted application is a .zip archive containing all application files, including:

notebook-local.ipynb Jupyter notebook for running the application in Binder, or in a local environment.

colab-notebook.ipynb Jupyter notebook for running the application in a Google Colaboratory environment.

kaggle-notebook.ipynb Jupyter notebook for displaying the application on Kaggle.

A readme file explaining the objective of the project with links to run the application in various environments. This is the front-facing file when viewing the project on GitHub.

To launch the application, please go to the following link:

and scroll down to the "Usage" section of the Readme where there are links to open the application in Google Colab or Binder. There is also a link to download the project files in a .zip container. See the User Guide for more instruction.

D. Post-implementation Report

Solution Summary

The project developed for the Docs'R'Us medical clinic addresses one of the most urgent healthcare needs: the early identification of patients at risk for diabetes or pre-diabetes. Because diabetes rates are rising globally and the condition has serious health implications, it is paramount to flag at-risk patients as early as possible so that intervention and management can be arranged efficiently. The solution developed to meet this need is a predictive tool that uses machine learning algorithms to analyze patients' health indicators and estimate the probability of developing diabetes.

To solve the problem, the project team worked with data from the CDC's Behavioral Risk Factor Surveillance System for 2015 and 2021, which included an even split of diabetic and non-diabetic respondents. The dataset included responses on health indicators such as Body Mass Index, smoking, hypertension, and high cholesterol. The team cleaned, preprocessed, and optimized the dataset so that machine learning methods could be applied to it.

To summarize, the solution involved developing and comparing two machine learning models: logistic regression and random forest. Both models were trained on the preprocessed data to predict the probability of developing diabetes. Confusion matrices were used to assess the quality of the results, comparing prediction accuracy and the kinds of errors each model made, such as false negatives and false positives. The logistic regression model demonstrated better predictive outcomes and was selected for the clinic's application.

In conclusion, the application is an important advancement in patient care. It allows individuals with a high likelihood of developing diabetes to be identified easily, giving medical personnel more opportunity to investigate further and intervene in time. The application not only improves healthcare services but can also lead to fewer unfavorable outcomes for patients, since early action can keep the identified diabetes risks from progressing. This project therefore represents significant progress for the Docs'R'Us clinic in using innovative technology for patient well-being.

Data Summary

Our analysis was based on two major datasets from the years 2015 and 2021 collected by the Behavioral Risk Factor Surveillance System, carried out by the Centers for Disease Control and Prevention. The datasets were retrieved from Kaggle, a popular data sourcing platform for data science projects. BRFSS surveys are conducted through state-based telephone interviews every year, focusing on health-related risk behaviors, chronic health conditions, and the use of preventive services among U.S. adults.

The datasets selected offer a clean and balanced representation of the respondent groups, with an equal division between individuals without diabetes and those with pre-diabetes or diabetes. Comprising twenty-one feature variables, each dataset was carefully curated to ensure a balanced representation conducive to the development of a predictive model. Acknowledging the data's potential, the project team embarked on a sequence of data processing and management steps essential for transforming the raw data into actionable insights.

The two datasets were combined into one dataframe to create a comprehensive dataset reflecting a broader population across different periods. This merge was followed by a rigorous data cleaning process that removed unnecessary features that could bias the predictive model. Additionally, to improve both the application's performance and data storage efficiency, certain data types were converted to smaller formats, reducing memory use at runtime.

The project's development lifecycle, encompassing design, development, and maintenance phases, was managed with precision and professionalism. During the design phase, features providing significant insights into diabetes risk factors were meticulously selected. In the development phase, various data visualization techniques were employed to deeply understand the dataset's characteristics, ensuring the machine learning models chosen were optimally suited for the task. The maintenance phase involved establishing protocols for periodic model reviews and updates in response to new data or changes in the prevalence of diabetes risk factors over time.

Through strict data processing and management practices, the project maintained the integrity and reliability of the data, ensuring the developed application would be a valuable tool in the early detection and management of diabetes risk among patients.

Machine Learning

During the development of the predictive tool to identify patients at elevated risk of diabetes, two major machine learning techniques were central to the project's success: logistic regression for prediction and datatype scaling for memory efficiency.

Logistic Regression Model

What: Logistic regression, in statistical analysis, assesses a dataset with one or more independent variables to determine an outcome. The outcome is measured with a dichotomous variable (in this context, diabetes, or no diabetes), estimating the probability that a given input belongs to a certain category.

How: After the combined dataset was pre-processed to remove irrelevant features and address missing data, the model was trained on a portion of the dataset with known outcomes, allowing it to learn the significance of each feature in predicting diabetes risk. The training process involved fine-tuning the model's parameters to minimize prediction errors, using strategies such as cross-validation to support generalization.

Why: Logistic regression was chosen for its computational efficiency, interpretability, and robust performance in binary classification tasks. With the target variable taking binary values (diabetes or no diabetes), logistic regression provided a straightforward yet powerful approach to model the complex relationships between various independent health indicators and the probability of having diabetes. Its simplicity ensures that medical professionals at Docs'R'Us can easily understand and utilize the model's predictions effectively.
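The way logistic regression turns health indicators into a probability can be shown with a short sketch. The weights, bias, feature values, and 0.5 decision threshold below are hypothetical and chosen only for illustration; they are not the coefficients of the trained model.

```python
import math

def predict_probability(features, weights, bias):
    """Logistic regression: apply the sigmoid to a weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (at risk) when the probability reaches the threshold, else 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for two scaled features (for example, BMI and age).
weights = [0.8, 0.5]
bias = 0.0

p_average = predict_probability([0.0, 0.0], weights, bias)   # features at their mean
p_elevated = predict_probability([1.5, 1.0], weights, bias)  # above-average values
```

Each weight shows how strongly, and in which direction, its feature moves the predicted probability, which is the interpretability benefit described above.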

Scaling Data Types

What: The process of scaling down data types involved changing the storage format of variables in the dataset to formats that occupy less memory, without significantly losing precision. This step was crucial for optimizing memory usage, especially with large datasets.

How: During the data preprocessing phase, the team evaluated each feature and adjusted their data type as necessary. Integer features with a range that fit within a byte's capacity were downscaled from their default 64-bit integers to 8-bit integers. A similar approach was applied to floating-point numbers where applicable. This optimization, carried out before the model training phase, ensured that machine learning algorithms ran more efficiently in terms of memory usage and computational speed.

Why: The necessity to scale down data types stemmed from the need to manage computational resources more effectively. Large datasets, especially those combined from multiple sources as in this project, can significantly burden memory resources, leading to longer processing times and a higher risk of computational errors. By minimizing the memory footprint of the dataset, the project team facilitated smoother development and execution of machine learning algorithms. This optimization proved to be particularly beneficial due to the iterative nature of model training and validation, enabling quicker adjustments and enhancements to the model.


The validation of the machine learning methodologies employed in the project, specifically the logistic regression model and the optimization of data types to reduce memory footprint, was critical to ensuring both the effectiveness and efficiency of the predictive tool. Validation methods were carefully selected to assess the performance and impact of these methodologies.

Logistic Regression

Validation Method

The K-Fold cross-validation technique was central to the validation of the logistic regression model. This approach involves dividing the dataset into a fixed number of subsets or "folds," then using one subset for testing the model and the rest for training. This process is repeated until each subset has been used for testing once, with the average performance across all folds computed to determine the model's overall predictive accuracy.
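The procedure described above can be sketched with scikit-learn as follows. The data is synthetic, and the number of folds (five), the shuffling, and the random seeds are assumed values, so the printed accuracy will not match the figure reported for the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary-classification data standing in for the health indicators.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=0
)

# Five folds is an assumed value; each fold serves as the test set exactly once
# while the remaining folds are used for training.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=folds, scoring="accuracy"
)

print("Accuracy per fold:", scores.round(4))
print(f"Mean accuracy: {scores.mean():.2%}")
```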


The K-Fold cross-validation yielded a strong indication of the logistic regression model's ability to generalize to unseen data. The model achieved an average accuracy of 74.25% in predicting diabetes risk, supporting its use as a tool for identifying patients at risk of diabetes in real-world settings.

Scaling Data Types

Validation Method

The process of validating the data type optimization involved comparing memory usage before and after the implementation of data type scaling. This method provided a quantitative measure of the optimization's impact on memory efficiency.
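A before-and-after comparison of this kind might look like the sketch below, which uses `DataFrame.memory_usage` and converts bytes to kibibytes. The helper name `memory_kib`, the synthetic columns, and the row count are assumptions for illustration, so the figures it prints are not the project's measurements.

```python
import numpy as np
import pandas as pd

def memory_kib(df: pd.DataFrame) -> float:
    """Total memory used by the DataFrame in kibibytes (1 KiB = 1024 bytes)."""
    return df.memory_usage(deep=True).sum() / 1024

# Synthetic data in the default 64-bit types; column names are hypothetical.
rng = np.random.default_rng(0)
n = 10_000
original = pd.DataFrame({
    "HighBP": rng.integers(0, 2, n, dtype=np.int64),
    "AgeCategory": rng.integers(1, 14, n, dtype=np.int64),
    "BMI": rng.normal(28, 5, n),
})

# Store the same values in smaller types.
optimized = original.astype(
    {"HighBP": "int8", "AgeCategory": "int8", "BMI": "float32"}
)

before = memory_kib(original)
after = memory_kib(optimized)
reduction = 1 - after / before
print(f"Before: {before:,.0f} KiB, after: {after:,.0f} KiB")
print(f"Reduction: {reduction:.1%}")
```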


The comparison showed a significant reduction in memory usage: the optimized dataset consumed only 2,692 kibibytes compared to the initial 18,305 kibibytes, a reduction of roughly 85%. This decrease validated the effectiveness of the data type optimization and demonstrated the feasibility of processing large datasets on limited computational resources. It also sets a precedent for data management in similar projects and underscores the role of data optimization techniques in the performance and scalability of machine learning applications.


Table 1

Category 1 2 3 4 5 6 7 8 9 10 11 12 13
Age range 18-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79 80+

To better understand the subsequent visualizations, first we define the "Age Category" feature in the data using a table. Age is grouped into thirteen "buckets" spanning approximately five years each.
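For readers who want to convert an age into its category, the mapping in Table 1 can be expressed as a small function. This is only a sketch: the function name `age_category` and the decision to reject ages under 18 are assumptions, since the table starts at 18.

```python
def age_category(age: int) -> int:
    """Map an age in years to the Age Category (1-13) defined in Table 1."""
    if age < 18:
        raise ValueError("Table 1 only covers adults aged 18 and over")
    if age <= 24:
        return 1  # the first bucket spans seven years (18-24)
    # Buckets 2-12 span five years each, starting at age 25; 80+ is capped at 13.
    return min(13, (age - 25) // 5 + 2)

for age in (18, 47, 67, 85):
    print(age, "->", age_category(age))
```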

Figure A

The next visualization (Figure A) is a histogram showing the age distribution of the combined datasets. Note that category 10 (ages 65-69) has the highest count; that is, the age group most represented in the data is between 65 and 69 years old.

Figure B

Next, we show the distributions of General Health Ratings for the diabetic and non-diabetic populations (Figure B). General Health Ratings are measured on a scale from one to five (poor to excellent health). The two populations can be compared visually: more diabetics than non-diabetics had General Health Rating scores of 1, 2 or 3, while more individuals without diabetes had scores of 4 or 5. This visualization shows whether there is a noticeable difference in self-reported general health status based on diabetes condition.

Figure C

Lastly, a scatter plot (Figure C) shows the average General Health score for each Age Category. Diabetics are compared to those without diabetes (orange versus blue dots). We can see that diabetics have poorer general health. Among people without diabetes, there is also a trend: the older the individual, the poorer their general health (blue line). Diabetics do not show this trend.

User Guide

The application is saved as a Jupyter notebook that runs Python code to process the datasets, train a model, and make predictions from new data the user enters for a particular patient.

First, please go to the following link:

and scroll down to the "Usage" section of the Readme. From here there are two ways to run the application interactively:

Open in Google Colaboratory

Click on the button "Open in Colab" on the Readme page, or go to:

When the notebook opens, scroll down to the section titled "Data Exploration" and click on the cell. In the menu above, go to Runtime--->Run before, or use the keyboard shortcut Ctrl+F8. This will run the Python code in each cell sequentially, up to the "Patient Outcome Prediction" section. When a warning pops up ("Warning: This notebook was not authored by Google"), click "Run anyway". The Colab notebook only requests data stored on Kaggle's servers.

Wait for several seconds, then you will see the patient health form displayed (scroll back up to the beginning):

Enter patient data to predict risk of diabetes. Use the dropdown menu to select patient sex, slider bars to enter numerical data, and radio buttons for answers to the questions given. Click the button at the bottom "Predict Patient Risk" to obtain a risk prediction.

If you would like to see the validation steps and visualizations created, click on the "Data Exploration" cell, and in the menu above go to Runtime--->Run after, or use the keyboard shortcut Ctrl+F10. This will take a couple of minutes to complete, after which all cells with code will display outputs. Alternatively, go to Runtime--->Run all, or use the keyboard shortcut Ctrl+F9, to run all the code in the entire notebook.


Troubleshooting

If the input form does not load, and the output from the first cell shows:

Failed to load (likely expired) {download_url}...

then the Google Colab token has expired, and you will need to use Binder instead (see the next section).

Launch in Binder

Click on the button "Launch Binder" on the Readme page, or go to:

Wait a few minutes for the notebook to open. When the notebook opens, scroll down to the section titled "Data Exploration" and click on the cell. In the menu above, go to Run--->Run All Above Selected Cell. This will run the Python code in each cell sequentially, up to the "Patient Outcome Prediction" section.

Wait for several seconds, then you will see the patient health form displayed (scroll back up to the beginning):

Enter patient data to predict risk of diabetes. Use the dropdown menu to select patient sex, slider bars to enter numerical data, and radio buttons for answers to the questions given. Click the button at the bottom "Predict Patient Risk" to obtain a risk prediction.

If you would like to see the validation steps and visualizations created, in the menu above go to Run--->Run All Cells. This will take a couple of minutes to complete, after which all cells with code will display outputs.

Using Kaggle Project Link to Run Interactively

If you are having trouble with Binder and the Colab page has an expired token (see Troubleshooting above), you may open the notebook with a new Colab token as follows:

Click on the button "Open in Kaggle" on the Readme page, or go to:

This will display the project on Kaggle as a static page. To run interactively, open the menu marked with three dots in the top-right corner of the page, and click "Open in Colab." This will launch a new Colab environment with a new token. You must be signed in to a Google account to run the notebook.

From here, follow the directions laid out in the previous section (Open in Google Colaboratory) as normal.


A female patient is 47 years old, with a BMI of 25.0. She has high blood pressure and high cholesterol, and rates her general health a 3.

Select "Female" using the dropdown menu. Move the slider for "Age category" to 6 (ages 45-49) and click "Yes" for both "High Blood Pressure" and "High Cholesterol". Move the slider for "General Health scale" to 3.00. Keep the other options at their default states. Click the "Predict Patient Risk" button below the form. The following is the output:

Prediction: At risk of diabetes

This should be noted in the patient's health record and discussed by the patient's doctor.