Impute Missing Data Using XGBoost: mixgb - CoBrALab/documentation GitHub Wiki

I used the R library mixgb to impute missing tabular data extracted from the UK Biobank. mixgb performs multiple imputation using XGBoost.

This method is faster than Random Forest imputation, and its imputed values are more representative of the data set than filling every missing value in a column with that column's median or mode.
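To see why single-value imputation is less representative, here is a minimal base R sketch on simulated data (not UK Biobank data; the variable names are made up for illustration). It only shows the weakness of median imputation and does not use mixgb.

```r
# Toy illustration: filling every NA with the same value shrinks the spread
# of the column, so the imputed data no longer looks like the observed data.
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)   # simulated complete values
x_missing <- x
x_missing[sample(200, 60)] <- NA      # remove 30% of the values at random

# Median imputation: every missing entry receives the same number
x_median <- x_missing
x_median[is.na(x_median)] <- median(x_missing, na.rm = TRUE)

sd_observed <- sd(x_missing, na.rm = TRUE)  # spread of the observed values
sd_median   <- sd(x_median)                 # spread after median imputation

# The median-imputed column has a smaller standard deviation
print(c(observed = sd_observed, median_imputed = sd_median))
```

Model-based multiple imputation instead draws different plausible values for each missing entry, which is what preserves the variability.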

To Install:

Bash: 
module load cobralab
module load gcc
module load cmake

R:
.libPaths("~/R/x86_64-pc-linux-gnu-library/4.1")

install.packages("devtools")
install.packages("usethis")
install.packages("cli")

library(htmlwidgets)
library(cli)
library(usethis)
library(devtools)

install.packages("nloptr")
install.packages("jomo")
install.packages("pan")
install.packages("mitml")
install.packages("xgboost")
install.packages("mice")

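# install.packages() has no version argument, so pin the dplyr release with install_version()
devtools::install_version("dplyr", version = "1.0.10")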
library(dplyr)

install.packages("mixgb")

To Run:

  • Run from within a compute node

Libraries to load:


library(cli, lib.loc = "~/R/x86_64-pc-linux-gnu-library/4.1")
library(htmlwidgets, lib.loc = "~/R/x86_64-pc-linux-gnu-library/4.1")
library(usethis, lib.loc = "~/R/x86_64-pc-linux-gnu-library/4.1")
library(devtools, lib.loc = "~/R/x86_64-pc-linux-gnu-library/4.1")
library(dplyr, lib.loc = "~/R/x86_64-pc-linux-gnu-library/4.1")
library(mixgb, lib.loc = "~/R/x86_64-pc-linux-gnu-library/4.1")
library(readr)

Then load your data frame and pass it to mixgb:

  • mixgb has a default of m = 5, meaning it generates 5 imputed versions of the data.
  • For other parameters, refer to the resources at the bottom of the page.

imputed_data <- mixgb(data = your_df, m = 5)

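imputed_data holds m imputed versions of your_df. Per the mixgb documentation, these are returned as a list of m data sets (imputed_data[[1]] through imputed_data[[m]]). If the list is combined into a single data frame, each column of your_df appears m times, with a version number appended to the column name to tell the imputations apart.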
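The call above can be tried end to end on a small simulated data frame. This is a sketch under stated assumptions: toy_df and its columns are hypothetical stand-ins for your_df, and it assumes (per the mixgb documentation) that mixgb() returns a list of m imputed data sets when models are not saved.

```r
library(mixgb)

# Hypothetical toy data standing in for your_df: numeric and factor columns
set.seed(42)
toy_df <- data.frame(
  age    = rnorm(100, mean = 60, sd = 8),
  bmi    = rnorm(100, mean = 27, sd = 4),
  smoker = factor(sample(c("no", "yes"), 100, replace = TRUE))
)

# Introduce missing values in two of the columns
toy_df$bmi[sample(100, 20)] <- NA
toy_df$smoker[sample(100, 10)] <- NA

# Multiple imputation with XGBoost; m = 5 is the default
imputed_list <- mixgb(data = toy_df, m = 5)

# Inspect the result: number of imputed data sets, then the first one
print(length(imputed_list))
first_imputation <- imputed_list[[1]]
print(colSums(is.na(first_imputation)))  # remaining NAs per column
```

Each element of the list can then be analysed separately, or one of them picked if only a single completed data set is needed.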

Resources:

https://cran.r-project.org/web/packages/mixgb/vignettes/Using-mixgb.html

https://github.com/agnesdeng/mixgb

https://towardsdatascience.com/how-to-handle-missing-data-b557c9e82fa0
