Kaggle HCT risk prediction model - s-joshid/bioinformatics_projects GitHub Wiki

Background

A friend and I completed this project as part of a Kaggle competition. The competition aimed to improve equality in outcomes after hematopoietic stem cell transplantation (HCT) by analyzing how survival outcomes may differ among ethnic groups. Our goal was to create a model that predicts risk scores accurately across different ethnic groups.

The metric used to measure a model's ability to correctly predict patient survival is the concordance index (C-index), where a score of 1 indicates a perfect model, 0.5 is what we would expect from random predictions, and 0 is perfectly wrong. The competition used a stratified C-index: it separated patients by racial group, computed the C-index within each group, and then took the mean of the per-group C-indices minus their standard deviation. Kaggle provided synthetic data for training and testing our models.
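To make the metric concrete, here is a minimal sketch of a stratified C-index in plain Python. It is not the competition's official scoring code. The function names are hypothetical, and it assumes that a higher risk score means shorter expected survival, that pairs with tied survival times are skipped, and that the standard deviation is the population standard deviation.

```python
from statistics import mean, pstdev


def concordance_index(times, events, risks):
    """Harrell-style C-index: share of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter time than patient j. Assumption: higher risk means
    shorter survival; pairs with tied times are skipped.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in this group")
    return concordant / comparable


def stratified_c_index(times, events, risks, groups):
    """Mean of the per-group C-indices minus their standard deviation."""
    per_group = []
    for g in sorted(set(groups)):
        idx = [k for k, grp in enumerate(groups) if grp == g]
        per_group.append(
            concordance_index(
                [times[k] for k in idx],
                [events[k] for k in idx],
                [risks[k] for k in idx],
            )
        )
    # Assumption: population standard deviation (pstdev), not sample.
    return mean(per_group) - pstdev(per_group)


# Toy data (made up for illustration): two groups of four patients each.
times = [1, 2, 3, 4, 1, 2, 3, 4]
events = [1, 1, 1, 1, 1, 0, 1, 1]
risks = [4, 3, 2, 1, 0.9, 0.1, 0.2, 0.8]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

score = stratified_c_index(times, events, risks, groups)
print(score)
```

Subtracting the standard deviation penalizes a model that predicts well for some groups and poorly for others, so two models with the same average C-index can receive different scores.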

Approach

While many models, or an ensemble of models, could be used, we chose to build a Cox proportional hazards model from the scikit-survival library. We chose it because of the library's great documentation, because we wanted to learn a model we had not worked with before, and because we wanted to deepen our background in regression analysis. The model is also designed specifically for survival-time analysis. We performed basic data cleaning, dealt with missing values through imputation where possible and realistic, and used feature selection and K-fold cross-validation to pick the best model, achieving a C-index of 0.658.