RF train doc - mbosio85/ediva GitHub Wiki

How to train your own Random Forest model

Get the data

You can download the files from: https://public_docs.crg.es/sossowski/MicrobeGenomes/human/eDiVA/eDiVA_score/

train_model_RF.R: the base R script that loads the training data, imputes default values, and trains a new model. It currently includes all variables used for the eDiVA model.
trainset.rds: the training set used to generate the eDiVA model. It contains all annotation fields from eDiVA-Annotate; any annotation field available in this file can be included in the random forest model.
sample_test_set.rds: a sample test set for testing your model.
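
Before choosing variables, it can help to look at which annotation fields the training set offers and how many values are missing in each. The following is a minimal sketch in base R; it writes a tiny stand-in data frame with made-up values to a temporary file so that it runs on its own, and with the real data you would call readRDS("trainset.rds") instead.

```r
# Stand-in for trainset.rds: a tiny data frame with hypothetical values,
# written to a temporary file so the example is self-contained.
demo_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(rank  = c(1, 0, 1),
                   Cadd2 = c(25.1, NA, 12.3),
                   MaxAF = c(NA, 0.02, 0.001)),
        demo_path)

# With the downloaded file this would be: trainSet <- readRDS("trainset.rds")
trainSet <- readRDS(demo_path)

field_names <- names(trainSet)           # fields available as candidate predictors
na_counts   <- colSums(is.na(trainSet))  # missing values per field, to plan defaults

print(field_names)
print(na_counts)
```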

Train your model

Edit train_model_RF.R to include or exclude the variables you prefer.
Run it in R, making sure the caret and randomForest packages are installed.
Your model is ready.
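
A quick way to confirm that the two required packages are available before running the script is sketched below. It only reports what is missing and does not install anything.

```r
# Packages named in the steps above.
required <- c("caret", "randomForest")

# requireNamespace() returns TRUE when a package can be loaded, without attaching it.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "),
          ". Install them with install.packages().")
} else {
  message("caret and randomForest are both available.")
}
```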

Key points

Choose the variables you want.
Set default values for missing entries, e.g.:

trainSet$Cadd2[is.na(trainSet$Cadd2)] <- 0
trainSet$MaxAF[is.na(trainSet$MaxAF)] <- 0
trainSet$ABB_score[is.na(trainSet$ABB_score)] <- 1
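
When many variables need defaults, the same pattern can be driven from a single named vector. This is a sketch, not part of train_model_RF.R: the helper name impute_defaults and the demo data frame are hypothetical, and only the three default values are taken from the lines above.

```r
# Defaults for the three fields shown above; add entries for any other field you use.
defaults <- c(Cadd2 = 0, MaxAF = 0, ABB_score = 1)

# Hypothetical helper: replace NA with the configured default, field by field.
# Fields listed in `defaults` but absent from `df` are skipped.
impute_defaults <- function(df, defaults) {
  for (field in intersect(names(defaults), names(df))) {
    df[[field]][is.na(df[[field]])] <- defaults[[field]]
  }
  df
}

# Small made-up data frame to show the effect.
demo <- data.frame(Cadd2     = c(NA, 20),
                   MaxAF     = c(0.1, NA),
                   ABB_score = c(NA, 0.3))
demo <- impute_defaults(demo, defaults)
print(demo)
```

Keeping the defaults in one place also makes it easier to apply identical values to the training set and to the data you later predict on.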

Train your model:

library(caret)  # provides train() and trainControl(); method = 'rf' also needs randomForest

model <- train(rank ~
                 MaxAF +
                 Cadd2 +
                 Condel +
                 MutAss +
                 SegMentDup +
                 Eigen_Phred +
                 ABB_score +
                 PrimatesPhyloP +
                 PlacentalMammalPhastCons +
                 PrimatesPhastCons +
                 PlacentalMammalPhyloP,
               data = trainSet,   # Use the trainSet data frame as the training data
               method = 'rf',     # Use the random forest algorithm
               ntree = 1000,      # Number of trees, passed through to randomForest
               trControl = trainControl(method = 'cv',  # Use cross-validation
                                        number = 5)     # Use 5 folds for cross-validation
)
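
If you expect to include and exclude variables often, the formula can be built from a character vector instead of being edited by hand. The sketch below uses reformulate() from base R's stats package; the predictor names are the ones used in the call above, and the excluded field is only an illustrative choice.

```r
# Predictors used in the train() call above.
predictors <- c("MaxAF", "Cadd2", "Condel", "MutAss", "SegMentDup",
                "Eigen_Phred", "ABB_score", "PrimatesPhyloP",
                "PlacentalMammalPhastCons", "PrimatesPhastCons",
                "PlacentalMammalPhyloP")

# Hypothetical choice: drop one field from the model.
excluded <- c("Condel")

# Builds rank ~ MaxAF + Cadd2 + ... from the remaining names.
model_formula <- reformulate(setdiff(predictors, excluded), response = "rank")
print(model_formula)
```

The resulting model_formula can be passed as the first argument of train() in place of the hand-written rank ~ ... expression.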

Save the model as an '.rds' file, such as mymodel.rds.
Save the test set as a CSV file, such as test.csv.
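
Both files can be written with base R. In the sketch below a small lm() fit on made-up data stands in for the caret model so that the example runs without caret; saveRDS() works the same way on the object returned by train(). Writing to a temporary directory and using row.names = FALSE are assumptions made for the example, not requirements stated by eDiVA.

```r
# Stand-in model and data with hypothetical values; with a real model,
# `model` would be the object returned by train().
demo_data <- data.frame(rank  = c(0.1, 0.9, 0.4, 0.7),
                        Cadd2 = c(5, 30, 12, 22))
model <- lm(rank ~ Cadd2, data = demo_data)

out_dir    <- tempdir()                           # assumption: any writable folder works
model_path <- file.path(out_dir, "mymodel.rds")
test_path  <- file.path(out_dir, "test.csv")

saveRDS(model, model_path)                          # serialise the fitted model
write.csv(demo_data, test_path, row.names = FALSE)  # test set as plain CSV

# Round trip: the reloaded object behaves like the original model.
reloaded <- readRDS(model_path)
print(coef(reloaded))
```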

Test your model

A sample test set is available at the same link, so you can test your model right away using standard caret tools. To substitute the eDiVA model with your own, replace the model file (.rds) in the Prioritize folder and edit predict_model_file.R accordingly.
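
The general shape of such a test is: predict on rows the model has not seen, then compare the predictions with the known outcome. The sketch below illustrates this with a stand-in lm() model on simulated data so that it runs without caret. With a model trained by train(), predict(model, newdata = ...) is called the same way, and caret's postResample() (numeric outcomes) or confusionMatrix() (factor outcomes) can replace the hand-computed metric. Treating rank as numeric here is an assumption.

```r
set.seed(1)

# Simulated data with hypothetical values; the real test set is sample_test_set.rds.
demo <- data.frame(Cadd2 = runif(40, min = 0, max = 40))
demo$rank <- 0.02 * demo$Cadd2 + rnorm(40, sd = 0.05)

train_rows <- demo[1:30, ]
test_rows  <- demo[31:40, ]

# Stand-in for the model returned by train().
model <- lm(rank ~ Cadd2, data = train_rows)

# Predict on held-out rows and compare with the known outcome.
predicted <- predict(model, newdata = test_rows)
rmse <- sqrt(mean((predicted - test_rows$rank)^2))
print(rmse)
```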

To test your model from the code, edit predict_model_file.R in eDiVA so that it sets a default for every variable you put in the model, by assigning to data_to_predict$XXX:

data_to_predict$Cadd2[is.na(data_to_predict$Cadd2)] <- 0

Now you can launch: Rscript predict_model_file.R mymodel.rds test.csv output.csv
Your predicted score will be in the last column of output.csv (named rank or rank.1).
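
Because the score column may be called rank or rank.1, selecting it by position is more robust than selecting it by name. The sketch below writes a small made-up output.csv to a temporary file and reads the last column back; the column layout is an assumption for illustration only. The rank.1 name is what R produces when a second column called rank is added to a data frame that already has one.

```r
# Hypothetical stand-in for output.csv, with the input's own rank column
# followed by the predicted score.
out_path <- tempfile(fileext = ".csv")
writeLines(c("Cadd2,rank,rank.1",
             "25.1,1,0.93",
             "3.2,0,0.08"),
           out_path)

output <- read.csv(out_path)   # with the real file: read.csv("output.csv")

# The predicted score is the last column, whatever its name.
score_column <- names(output)[ncol(output)]
scores <- output[[score_column]]

print(score_column)
print(summary(scores))
```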