Data Analysis: Heart Failure Prediction - clumsyspeedboat/Decision-Tree-Neo4j GitHub Wiki

About the Dataset

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality from heart failure.

Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease), need early detection and management, and this is where a machine learning model can be of great help.

Acknowledgement

Citation
Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020).

License
CC BY 4.0

We compared three datasets before choosing one on which to build our machine learning algorithm:

  • Heart Failure Prediction Dataset from Kaggle, which seems to meet many of our selection criteria. It compiles demographic and clinical information on 299 patients for heart failure prediction
  • Metaproteomic Study Dataset
  • Biological Processes Study

Dataset Comparison

Dataset Survey

Apart from its low number of rows, the Heart Failure Prediction Dataset seems a sensible choice for building a machine learning algorithm implemented on Neo4j.


R

Variables in the Dataset

Numeric Variables

  • Age: Numeric age of the patient. Plots: Hist_Age (histogram), Box_Age (box plot).
  • Creatinine Phosphokinase: Creatine phosphokinase (CPK), also known as creatine kinase (CK), is the enzyme that catalyzes the reaction of creatine and adenosine triphosphate (ATP) to phosphocreatine and adenosine diphosphate (ADP). The phosphocreatine created by this reaction supplies ATP to tissues and cells that require substantial amounts of it, such as the brain, skeletal muscles, and the heart. The normal CPK level is considered to be 20 to 200 IU/L. Many conditions can disturb CPK levels, including rhabdomyolysis, heart disease, kidney disease, and certain medications. Source: "Creatine Phosphokinase", NCBI. Plots: Hist_CreatinineP (histogram), Box_CreatinineP (box plot).
  • Ejection Fraction: A measurement of the percentage of blood leaving the heart each time it contracts. A reduced ejection fraction might indicate weakness of the heart muscle, heart attack, heart valve problems and/or uncontrolled high blood pressure. An ejection fraction of 55 per cent or higher is considered normal. Source: "Ejection Fraction: What does it measure?", Mayo Clinic. Plots: Hist_EjectF (histogram), Box_EjectF (box plot).
  • Platelets: Platelets are specialized disk-shaped cells in the bloodstream that are involved in the formation of blood clots, which play an important role in heart attacks, strokes, and peripheral vascular disease. The number of platelets is routinely tested as part of the complete blood count (CBC). Normal counts range from 150 000 to 450 000 per microlitre of blood. Source: "Platelets and Cardiovascular Disease", Circulation.
  • Serum Creatinine: Creatinine is a waste product that forms when creatine, which is found in muscle, breaks down. Creatinine levels in the blood indicate how well the kidneys are working. People who are more muscular tend to have higher creatinine levels, and results may also vary with age and gender. In general, normal creatinine levels range from 0.9 to 1.3 mg/dL in men and 0.6 to 1.1 mg/dL in women who are 18 to 60 years old. Normal levels are roughly the same for people over 60. Serum creatinine levels may be higher than normal due to dehydration, a blocked urinary tract, or reduced blood flow to the kidneys caused by shock, congestive heart failure, or complications of diabetes. Source: "Creatinine Blood Test", healthline. Plots: Hist_SerumC (histogram), Box_SerumC (box plot).
  • Serum Sodium: A normal blood sodium level is between 135 and 145 milliequivalents per litre (mEq/L). Hyponatremia, or low serum sodium, is typically defined as a serum sodium concentration of <135 mEq/L and is one of the most common biochemical disorders in heart failure patients, with a prevalence close to 25%. Source: "The prognosis of heart failure patients: Does sodium level play a significant role?", NCBI. Plots: Hist_SerumS (histogram), Box_SerumS (box plot).
  • Time: Follow-up period (days). Plots: Hist_Time (histogram), Box_Time (box plot).
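
The histograms and box plots for the numeric variables each boil down to a few summary numbers. The following is a minimal sketch of those numbers using only the Python standard library. It is not the code behind the wiki's plots; the function names `box_plot_stats` and `histogram_counts`, the 1.5 × IQR outlier rule, and the sample values are all assumptions made for illustration.

```python
import statistics


def box_plot_stats(values):
    """Return the quantities a conventional box plot draws for one column."""
    ordered = sorted(values)
    # n=4 yields the three quartile cut points (Q1, median, Q3).
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "iqr": iqr,
        "outliers": outliers,
    }


def histogram_counts(values, bin_width, start):
    """Count values per bin [start + k*width, start + (k+1)*width)."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)
        left_edge = start + k * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


# Made-up serum sodium readings in mEq/L, NOT taken from the dataset.
sample_sodium = [137, 134, 140, 129, 136, 138, 141, 113, 135, 139]

stats = box_plot_stats(sample_sodium)
bins = histogram_counts(sample_sodium, bin_width=5, start=110)
print(stats)
print(bins)
```

With real data, the same two functions could be applied to each numeric column in turn; a plotting library would then draw the bars and boxes from these values.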

Categorical (Nominal) Variables (with respect to Target Label - Death Event)

  • Anaemia: A decrease in the total amount of red blood cells (RBCs) or haemoglobin in the blood, or a lowered ability of the blood to carry oxygen, which can worsen cardiac function because it causes cardiac stress. Source: "The role of anaemia in the progression of congestive heart failure. Is there a place for erythropoietin and intravenous iron?", PubMed. Plot: Bar_Anaemia (bar plot).
  • Diabetes: A metabolic disease that causes high blood sugar. People who have Type 2 diabetes, characterized by elevated blood sugar levels, are two to four times more likely to develop heart failure than people without diabetes. Source: "Diabetes and heart failure are linked; treatment should be too", heart.org. Plot: Bar_Diabetes (bar plot).
  • High Blood Pressure. Plot: Bar_HighBP (bar plot).
  • Sex. Plot: Bar_Sex (bar plot).
  • Smoking. Plot: Bar_Smoking (bar plot).
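
The bar plots compare each categorical variable against the target label. A rough sketch of the underlying tally, the share of death events within each category, is shown below. The column names `anaemia` and `DEATH_EVENT`, the helper `death_rate_by_group`, and the six records are assumptions for illustration only and are not values from the dataset.

```python
from collections import Counter


def death_rate_by_group(records, feature, target="DEATH_EVENT"):
    """Return the fraction of rows with target == 1 for each feature value."""
    totals = Counter()
    deaths = Counter()
    for row in records:
        totals[row[feature]] += 1
        deaths[row[feature]] += row[target]  # target is coded 0 or 1
    return {group: deaths[group] / totals[group] for group in sorted(totals)}


# Made-up records, NOT taken from the dataset.
sample = [
    {"anaemia": 1, "DEATH_EVENT": 1},
    {"anaemia": 1, "DEATH_EVENT": 0},
    {"anaemia": 0, "DEATH_EVENT": 0},
    {"anaemia": 0, "DEATH_EVENT": 0},
    {"anaemia": 0, "DEATH_EVENT": 1},
    {"anaemia": 1, "DEATH_EVENT": 1},
]

rates = death_rate_by_group(sample, "anaemia")
print(rates)
```

The same call with a different `feature` argument would cover the other nominal variables, provided they are coded as 0/1 in the input.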

Correlation between Variables

Scatter Plots (Pairplot in Python)

Scatter Plots
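
Scatter plots show pairwise relationships visually; a Pearson correlation coefficient puts a single number on each pair. The sketch below computes a small correlation matrix with the standard library only. The helper names `pearson` and `correlation_matrix`, the column names, and the four toy rows are assumptions for illustration and do not come from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {
        a: {b: pearson(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up values, NOT taken from the dataset.
toy = {
    "age": [50, 60, 70, 80],
    "ejection_fraction": [60, 50, 40, 30],
    "serum_creatinine": [0.9, 1.1, 1.0, 1.4],
}

corr = correlation_matrix(toy)
for name, row in corr.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

In this toy data, age and ejection fraction fall on a perfectly decreasing line, so their coefficient is -1; real clinical variables would rarely be that clean.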


Unsupervised Learning Algorithms

After our preliminary analysis of each variable, we ran Principal Component Analysis (PCA) on the numeric variable columns only, to visualize strongly correlated variables and to see their weights (loadings) on the principal components. We plotted a biplot to visualize both the variable loadings and the individual data points.

PCA Biplot
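
To make the PCA step concrete, here is a minimal, standard-library-only sketch of the computation behind a biplot: standardize the numeric columns, form their correlation matrix, and extract the first two principal components. It swaps in power iteration with deflation for the eigen-decomposition, which is an assumption made to keep the example dependency-free; it is not the method used for the wiki's plot. The toy rows (age, ejection fraction, serum creatinine) are made up.

```python
import math


def standardize(rows):
    """Z-score each column so variables on different scales are comparable."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


def covariance_matrix(z):
    """Sample covariance of standardized data (equals the correlation matrix)."""
    n, p = len(z), len(z[0])
    return [
        [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def top_eigenpair(matrix, iterations=1000):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    p = len(matrix)
    vec = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    mv = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
    return sum(a * b for a, b in zip(vec, mv)), vec


def pca(rows, n_components=2):
    z = standardize(rows)
    cov = covariance_matrix(z)
    p = len(cov)
    total_var = sum(cov[i][i] for i in range(p))
    components, explained = [], []
    for _ in range(n_components):
        val, vec = top_eigenpair(cov)
        components.append(vec)
        explained.append(val / total_var)
        # Deflation: remove the component just found before the next pass.
        cov = [
            [cov[i][j] - val * vec[i] * vec[j] for j in range(p)]
            for i in range(p)
        ]
    # Scores: each observation projected onto the components.
    scores = [[sum(a * b for a, b in zip(r, c)) for c in components] for r in z]
    return components, explained, scores


# Made-up rows: [age, ejection fraction, serum creatinine], NOT from the dataset.
toy_rows = [
    [50, 60, 0.9],
    [60, 50, 1.1],
    [70, 45, 1.0],
    [80, 30, 1.4],
    [65, 38, 1.8],
    [55, 55, 0.8],
]

components, explained, scores = pca(toy_rows)
print("loadings:", components)
print("explained variance ratio:", explained)
```

In a biplot, `scores` would be drawn as the individual points and each variable's entries in `components` as an arrow. The sign of each component is arbitrary, so a plot may appear mirrored relative to one produced by a library.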


Python

Class Label

Summary of Variables

Data_Analysis_Heart Failure Prediction

Pairplot