RF train doc - mbosio85/ediva GitHub Wiki
How to train your own Random Forest model
Get the data
You can download the files from: https://public_docs.crg.es/sossowski/MicrobeGenomes/human/eDiVA/eDiVA_score/
- train_model_RF.R: the base R script that loads the training data, imputes the default values, and generates a new model. It currently includes all variants used for the eDiVA model.
- trainset.rds: the training set used to generate the model for eDiVA. It contains all annotation fields from eDiVA-Annotate. Any annotation field available in this file can be included in the random forest model.
- sample_test_set.rds: This is a sample test set to test your model.
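Before editing the training script, it can help to look at what the training set contains. A minimal sketch, assuming trainset.rds has been downloaded into the current working directory and holds a data frame; `count_missing` is a hypothetical helper, not part of eDiVA:

```r
# Hypothetical helper: number of NA entries per column, largest first.
count_missing <- function(df) {
  sort(colSums(is.na(df)), decreasing = TRUE)
}

# Assumption: trainset.rds sits in the working directory.
train_path <- "trainset.rds"
if (file.exists(train_path)) {
  trainSet <- readRDS(train_path)
  print(dim(trainSet))                  # rows x annotation fields
  print(head(count_missing(trainSet)))  # fields that most need a default value
} else {
  message("trainset.rds not found; download it first")
}
```

The fields with the most missing entries are the ones for which a sensible default matters most when training.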
Train your model
- Edit train_model_RF.R to include/exclude variables you prefer
- Execute it in R, provided you have installed the caret and randomForest packages
- Your model is ready
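A quick way to confirm that both packages are available before running the script is sketched below. `missing_packages` is a hypothetical helper name; the sketch only reports what is missing and does not install anything:

```r
# Hypothetical helper: which of the given packages cannot be loaded?
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

to_install <- missing_packages(c("caret", "randomForest"))
if (length(to_install) > 0) {
  message("Install first: install.packages(c(",
          paste0('"', to_install, '"', collapse = ", "), "))")
} else {
  message("caret and randomForest are available")
}
```

Once both are available, the edited script can be run from an R session with source("train_model_RF.R").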
Key points
- Choose the variables you want.
- Set default values for missing entries, e.g.:
trainSet$Cadd2[is.na(trainSet$Cadd2)] <- 0
trainSet$MaxAF[is.na(trainSet$MaxAF)] <- 0
trainSet$ABB_score[is.na(trainSet$ABB_score)] <- 1
- Train your model:
model <- train(rank ~
                 MaxAF +
                 Cadd2 +
                 Condel +
                 MutAss +
                 SegMentDup +
                 Eigen_Phred +
                 ABB_score +
                 PrimatesPhyloP +
                 PlacentalMammalPhastCons +
                 PrimatesPhastCons +
                 PlacentalMammalPhyloP,
               data = trainSet,   # Use the trainSet dataframe as the training data
               method = 'rf',     # Use the random forest algorithm
               ntree = 1000,
               trControl = trainControl(method = 'cv',  # Use cross-validation
                                        number = 5)     # Use 5 folds for cross-validation
)
- Save the model as a '.rds' file, such as mymodel.rds
- Save the test set as a CSV file, such as test.csv
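The default-value and export steps above can be sketched as follows. This is an illustration on a made-up toy data frame, not the eDiVA script itself: `impute_defaults` is a hypothetical helper, the toy values are invented, and the files are written to a temporary directory rather than to the Prioritize folder.

```r
# Hypothetical helper: replace NA in each named column with its default.
impute_defaults <- function(df, defaults) {
  for (col in intersect(names(defaults), names(df))) {
    df[[col]][is.na(df[[col]])] <- defaults[[col]]
  }
  df
}

# Defaults taken from the examples above.
defaults <- list(Cadd2 = 0, MaxAF = 0, ABB_score = 1)

# Toy stand-in for trainSet or a test set (made-up values).
toy <- data.frame(Cadd2     = c(12.3, NA),
                  MaxAF     = c(NA, 0.01),
                  ABB_score = c(NA, 0.2))
toy <- impute_defaults(toy, defaults)

# Export: the test set as CSV, and the trained model (if one exists in
# the session under the name `model`) as an .rds file.
out_dir <- tempdir()
write.csv(toy, file.path(out_dir, "test.csv"), row.names = FALSE)
if (exists("model")) saveRDS(model, file.path(out_dir, "mymodel.rds"))
```

Keeping the defaults in one named list makes it easier to apply exactly the same values to the data you later want to score.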
Test your model
A sample test set is available at the same link, so you can test your model right away using standard caret tools. If you want to substitute the eDiVA model with yours, replace the model file (.rds) in the Prioritize folder and edit predict_model_file.R accordingly, and you are ready to go.
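Testing with standard caret tools could look like the sketch below. It trains a small model on synthetic data so that it runs on its own; the data, the class labels "pathogenic" and "benign", and the reduced settings (two predictors, ntree = 50, a fixed mtry) are assumptions for illustration and do not reflect the eDiVA training set. With the real files, `toy_test` would be replaced by the contents of sample_test_set.rds and `model` by your own trained model.

```r
have_pkgs <- requireNamespace("caret", quietly = TRUE) &&
  requireNamespace("randomForest", quietly = TRUE)

if (have_pkgs) {
  library(caret)
  set.seed(42)

  # Synthetic stand-in data (made up; NOT the eDiVA training set).
  make_toy <- function(n) {
    d <- data.frame(MaxAF = runif(n), Cadd2 = rnorm(n, mean = 15, sd = 5))
    d$rank <- factor(ifelse(d$Cadd2 - 10 * d$MaxAF + rnorm(n) > 10,
                            "pathogenic", "benign"))
    d
  }
  toy_train <- make_toy(300)
  toy_test  <- make_toy(100)

  # Same call shape as in the training step, scaled down for speed.
  model <- train(rank ~ MaxAF + Cadd2, data = toy_train, method = "rf",
                 ntree = 50, tuneGrid = data.frame(mtry = 1),
                 trControl = trainControl(method = "cv", number = 5))

  # Predict on held-out data and compare with the known labels.
  pred <- predict(model, newdata = toy_test)
  print(confusionMatrix(data = pred, reference = toy_test$rank))
}
```

confusionMatrix() applies when rank is a factor. If rank is numeric in your data, caret fits a regression instead, and postResample(pred, obs) would be the analogous summary.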
To test your model from the code: edit predict_model_file.R in eDiVA so that it sets defaults for the variables you put in the model, by assigning to data_to_predict$XXX, e.g.
data_to_predict$Cadd2[is.na(data_to_predict$Cadd2)] <- 0
- Now you can launch:
Rscript predict_model_file.R mymodel.rds test.csv output.csv
- Your predicted score will be in the last column of output.csv (named rank or rank.1)
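Because the score column may be called either rank or rank.1, it is safer to pick it by position than by name. A minimal sketch, using a made-up stand-in for output.csv written to a temporary file (the column set and values are invented for illustration):

```r
# Toy stand-in for output.csv (made-up values), written to a temp file.
out_path <- file.path(tempdir(), "output.csv")
write.csv(data.frame(Cadd2 = c(25.1, 3.4),
                     MaxAF = c(0, 0.2),
                     rank  = c(0.91, 0.12)),
          out_path, row.names = FALSE)

scored    <- read.csv(out_path)
score_col <- names(scored)[ncol(scored)]  # last column, whatever its name
scores    <- scored[[score_col]]

print(score_col)
print(scores)
```

With the real output, `out_path` would simply be "output.csv".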