RF train doc - mbosio85/ediva GitHub Wiki

How to train your own Random Forest model

Get the data

You can download the files from: https://public_docs.crg.es/sossowski/MicrobeGenomes/human/eDiVA/eDiVA_score/

  • train_model_RF.R: the base R script that loads the training data, imputes default values, and trains a new model. As distributed, it includes all variables used for the eDiVA model.
  • trainset.rds: the training set used to generate the eDiVA model. It contains all annotation fields from eDiVA-Annotate, and any annotation field available in this file can be included in the random forest model.
  • sample_test_set.rds: a sample test set for testing your model.

Train your model

  • Edit train_model_RF.R to include or exclude variables as you prefer.
  • Run it in R. The caret and randomForest packages must be installed.
  • Your model is ready.

Key points

  • Choose the variables you want.
  • Set default values for missing entries, e.g.:

trainSet$Cadd2[is.na(trainSet$Cadd2)] <- 0
trainSet$MaxAF[is.na(trainSet$MaxAF)] <- 0
trainSet$ABB_score[is.na(trainSet$ABB_score)] <- 1

  • Train your model:

library(caret)  # provides train() and trainControl(); method = 'rf' also needs randomForest

model <- train(rank ~
                 MaxAF +
                 Cadd2 +
                 Condel +
                 MutAss +
                 SegMentDup +
                 Eigen_Phred +
                 ABB_score +
                 PrimatesPhyloP +
                 PlacentalMammalPhastCons +
                 PrimatesPhastCons +
                 PlacentalMammalPhyloP,
               data = trainSet,   # Use the trainSet data frame as the training data
               method = 'rf',     # Use the random forest algorithm
               ntree = 1000,      # Number of trees, passed through to randomForest
               trControl = trainControl(method = 'cv',  # Use cross-validation
                                        number = 5)     # Use 5 folds for cross-validation
)

  • Save the model as an '.rds' file such as mymodel.rds, e.g. with saveRDS(model, 'mymodel.rds').
  • Save the test set as a CSV file such as test.csv, e.g. with write.csv().
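
When the model uses many variables, writing one imputation line per column gets repetitive. The following is a minimal sketch of the same default-filling step driven by a named vector of defaults. The helper name impute_defaults and the toy data frame are hypothetical and only for illustration; the three default values mirror the example above.

```r
# Hypothetical defaults table: column name -> value used when the field is NA.
# These three entries mirror the wiki example; extend it for your own variables.
defaults <- c(Cadd2 = 0, MaxAF = 0, ABB_score = 1)

# Replace NA with the configured default in every listed column
# that is actually present in the data frame.
impute_defaults <- function(df, defaults) {
  for (col in intersect(names(defaults), names(df))) {
    df[[col]][is.na(df[[col]])] <- defaults[[col]]
  }
  df
}

# Toy stand-in for trainSet (made-up values, not eDiVA data).
toy <- data.frame(
  Cadd2     = c(12.3, NA, 25.1),
  MaxAF     = c(NA, 0.01, 0.2),
  ABB_score = c(0.4, NA, NA)
)

toy_imputed <- impute_defaults(toy, defaults)
print(toy_imputed)
```

The same defaults vector can be reused on the data you later predict on, which keeps training and prediction imputation consistent.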

Test your model

A sample test set is available at the same link, so you can test your model right away using standard caret tools. If you want to substitute the eDiVA model with yours, replace the model file (.rds) in the Prioritize folder and edit predict_model_file.R accordingly.
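
As an illustration of what testing with standard caret tools can look like, here is a minimal sketch on synthetic data. Everything in it is an assumption made for the example: the toy data, the class labels "pathogenic" and "neutral", the two predictors, and the temporary file path. It does not use the eDiVA files and assumes rank is a factor, which may differ from the real training set.

```r
library(caret)  # also requires the randomForest package to be installed

set.seed(42)

# Synthetic stand-in data: two numeric predictors and a two-class label.
n <- 200
toy <- data.frame(Cadd2 = runif(n, 0, 40), MaxAF = runif(n, 0, 0.5))
toy$rank <- factor(ifelse(toy$Cadd2 > 20 & toy$MaxAF < 0.25,
                          "pathogenic", "neutral"))

# Hold out 50 rows as a test set.
train_idx <- sample(n, 150)
train_df  <- toy[train_idx, ]
test_df   <- toy[-train_idx, ]

# Small model so the example runs quickly.
model <- train(rank ~ MaxAF + Cadd2,
               data = train_df,
               method = "rf",
               ntree = 100,
               trControl = trainControl(method = "cv", number = 5))

# Round-trip through an .rds file, as you would with mymodel.rds.
model_path <- file.path(tempdir(), "mymodel.rds")
saveRDS(model, model_path)
reloaded <- readRDS(model_path)

# Predict on the held-out rows and compare with the known labels.
pred <- predict(reloaded, newdata = test_df)
cm   <- confusionMatrix(pred, test_df$rank)
print(cm)
```

With the real files, the same pattern would apply after loading your model and sample_test_set.rds with readRDS() and imputing the same defaults used in training.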

To test your model from the eDiVA code, edit predict_model_file.R to set default values for the variables you put in the model, by assigning to data_to_predict$XXX:

data_to_predict$Cadd2[is.na(data_to_predict$Cadd2)] <- 0

  • Now you can launch: Rscript predict_model_file.R mymodel.rds test.csv output.csv (assuming the script takes the model, input, and output paths in that order)
  • Your predicted score will be in the last column of output.csv (named rank or rank.1)
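
To pick up the predicted score regardless of which name it received, you can read the last column by position. The sketch below uses a made-up stand-in for output.csv written to a temporary directory. It also shows one way the name rank.1 can come about: when a data frame already has a rank column, R makes a second column of the same name unique by appending .1. Whether that is exactly what happens inside eDiVA is an assumption.

```r
# Toy stand-in for output.csv (made-up values). The input already carries a
# 'rank' column, so the appended prediction column is renamed to 'rank.1'.
toy_out <- data.frame(Chr = c("1", "2"), rank = c(1, 0), rank = c(0.91, 0.12))
print(names(toy_out))

out_path <- file.path(tempdir(), "output.csv")
write.csv(toy_out, out_path, row.names = FALSE)

# Read the file back and take the last column, whatever it is called.
scores    <- read.csv(out_path)
score_col <- names(scores)[ncol(scores)]
predicted <- scores[[score_col]]

print(score_col)
print(predicted)
```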