P2Rank model training and evaluation

What you are going to learn here:

  • how to evaluate an existing p2rank model
  • how to train a new p2rank model

Prerequisites

You should have p2rank installed and the relevant datasets downloaded (both are described in the setup guide). All commands must be run from the directory created in the setup guide.

Evaluation of Existing Model

An existing model can be evaluated on a dataset using the eval-predict command. You can use p2rank to make predictions for the coach420 dataset and evaluate them with the following command (shown for Windows; a Linux variant follows below):

.\p2rank\prank.bat eval-predict .\datasets\coach420.ds -threads 4
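On Linux, the p2rank distribution ships a prank shell script instead of prank.bat, so the equivalent command should look like this (assuming the same directory layout as in the setup guide):

./p2rank/prank eval-predict ./datasets/coach420.ds -threads 4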

p2rank will now make predictions for all proteins in the coach420 dataset and then evaluate how good or bad the predictions are. After the command finishes, you should see output similar to the following:

tolerances:     [0]    [1]    [2]    [4]    [10]   [99]
DCA(1.0)        20.4   22.7   24.1   26.0   26.2   26.2 
DCA(2.0)        53.6   58.1   59.9   61.6   61.6   62.0 
DCA(3.0)        65.4   69.7   71.6   73.4   73.4   73.8 
DCA(4.0)        71.6   75.5   77.1   78.9   79.1   79.5 
DCA(5.0)        76.1   79.8   81.4   83.2   83.4   83.8 
DCA(6.0)        78.5   82.4   83.8   85.3   85.5   85.9 
DCA(7.0)        80.6   84.7   85.9   87.7   88.3   88.8 
DCA(8.0)        82.4   86.7   87.5   89.2   89.6   90.0 
DCA(9.0)        83.8   88.5   89.2   90.6   91.0   91.4 
DCA(10.0)       85.3   89.6   90.8   92.2   92.6   93.0 
DCA(11.0)       86.1   90.4   91.4   92.6   92.8   93.2 
DCA(12.0)       87.3   91.4   92.4   93.3   93.5   93.9 
DCA(13.0)       88.3   92.4   93.3   94.3   94.5   94.7 
DCA(14.0)       89.4   93.2   94.1   95.1   95.3   95.5 
DCA(15.0)       91.2   94.5   95.3   96.3   96.3   96.3 
DCC(1.0)         5.1    5.1    5.1    5.1    5.1    5.1 
DCC(2.0)        16.8   17.2   17.2   17.2   17.2   17.2 
DCC(3.0)        32.5   33.3   33.9   34.2   34.2   34.2 
DCC(4.0)        45.4   47.0   48.3   49.5   49.7   49.9 
DCC(5.0)        54.4   57.5   59.1   60.7   61.1   61.4 
DCC(6.0)        62.2   66.1   68.3   70.1   70.3   70.6 
DCC(7.0)        68.7   73.4   75.5   77.5   77.7   78.1 
DCC(8.0)        73.0   78.1   80.2   82.2   82.2   82.6 
DCC(9.0)        76.3   80.8   83.0   85.1   85.5   86.1 
DCC(10.0)       79.3   83.8   85.5   87.7   88.1   88.5 
DSO(0.7)         5.7    5.9    6.1    6.1    6.1    6.1 
DSO(0.6)        12.9   13.1   13.3   13.5   13.5   13.5 
DSO(0.5)        27.2   28.0   28.2   29.4   29.4   29.5 
DSO(0.4)        43.6   46.4   46.8   47.9   47.9   48.1 
DSO(0.3)        58.5   62.8   63.8   65.2   65.2   65.6 
DSO(0.2)        70.6   74.6   76.1   78.1   78.1   78.5 
DSO(0.1)        79.8   83.6   85.1   86.7   86.7   87.1 
DSWO(1.0;0.2)   21.1   21.5   21.5   21.5   21.5   21.5 
DSWO(0.9;0.2)   36.2   36.8   36.8   36.8   36.8   36.8 
DSWO(0.8;0.2)   45.0   45.8   46.0   46.2   46.2   46.2 
DSWO(0.7;0.2)   52.3   53.4   53.6   53.8   53.8   53.8 
DSWO(0.6;0.2)   60.1   62.2   62.4   63.6   63.6   63.6 
DSWO(0.5;0.2)   65.2   68.1   68.5   69.9   69.9   70.1 
DSWO(0.4;0.2)   69.1   72.4   73.6   75.0   75.0   75.1 
DSWO(0.3;0.2)   71.2   74.8   76.1   77.7   77.7   78.1 
DSWO(0.2;0.2)   71.8   75.7   77.3   79.1   79.1   79.5 
DSWO(0.1;0.2)   71.8   75.7   77.3   79.1   79.3   79.6 

predicting pockets finished in 0 hours 3 minutes 47.36 seconds 
results saved to directory [D:\Projects\protein-ligand-binding-site-prediction\p2rank\test_output\eval_predict_coach420]

----------------------------------------------------------------------------------------------
 finished successfully in 0 hours 3 minutes 49.24 seconds
----------------------------------------------------------------------------------------------

The table contains scores evaluating how well the model performed on the given dataset. For the purpose of this guide, we will focus on DCA(4.0). This metric is used in the p2rank article, although misspelled there as DCC. Roughly speaking, DCA(4.0) is the percentage of real binding sites for which the center of a predicted site lies within a 4 Ångström (Å) tolerance of the ligand. However, p2rank may predict more sites than actually exist; the bracketed tolerance columns control how many of the top-ranked predictions are considered. For example, if a protein has 2 binding sites and p2rank predicts 5, DCA(4.0) [0] evaluates only the 2 best-ranked predictions (as many as there are real binding sites), while DCA(4.0) [2] evaluates the 4 best-ranked ones, i.e. the number of real binding sites plus 2.

From the output above, together with the numbers reported in the p2rank article, we get:

DCA(4.0)        n      n + 2
default model   71.6   77.1
article         72.0   78.3

You can also find the results in the p2rank/test_output/runs_pred.csv file.
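If you just want a quick look at the accumulated results, you can print the file directly from the command line (type on Windows; use cat on Linux):

type .\p2rank\test_output\runs_pred.csv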

Note: The results are not the same as in the p2rank article, but they are close enough.

Training a New Model

Although p2rank comes with several pre-trained models, it also allows the user to train new ones. In order to train a model, we need two datasets: a training dataset and a validation dataset. For this purpose, we can use the chen11 and joined datasets.

The p2rank model is a random forest classifier. Since a random forest is built from a number of decision trees, training it requires us to specify the number of trees and the maximum size (depth) of each tree. The number of trees and the maximum tree depth are examples of hyper-parameters. Hyper-parameters are not changed during the training and validation phases and must be provided upfront. p2rank offers the possibility to optimize hyper-parameters automatically, but that is out of the scope of this guide.
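Hyper-parameters are passed as command-line options. As a sketch, the following command would train a smaller forest of 100 trees with depth limited to 10 (this assumes the -rf_depth parameter, where 0 means unlimited depth; check prank help or the p2rank documentation for the options available in your version):

.\p2rank\prank.bat traineval -t .\datasets\chen11.ds -e .\datasets\joined.ds -threads 4 -rf_trees 100 -rf_depth 10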

We can train a new model using the following command:

.\p2rank\prank.bat traineval -t .\datasets\chen11.ds -e .\datasets\joined.ds -threads 4 -rf_trees 200 -delete_models 0 -loop 1 -seed 42

In the command above, we use chen11 as the training dataset and joined as the validation dataset, and we tell p2rank to use 4 threads. The random forest classifier consists of 200 trees with no depth limit (in the p2rank article, the final default model also has 200 trees, each grown with no depth limit using 6 features). Using a smaller number of trees speeds up the training process but may cause sub-optimal performance.

By default, p2rank deletes the trained models and keeps only the evaluation scores; in order to preserve the model, we use the -delete_models 0 option. The last two options specify that we run only one training cycle (-loop 1) and that the random seed should be 42 (-seed 42). Despite the fixed seed, the training process is not fully deterministic, so repeated runs may give slightly different results. This command may take a while to execute; on an Intel Core i7-3632QM it finishes in about 30 minutes.
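After the run finishes, the trained model should be stored under the run's output directory. You can verify it is there before moving on (the exact path below is the one used in the next step):

dir .\p2rank\test_output\traineval_chen11_joined\runs\seed.42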

The results reported at the end of the script execution are computed on the validation (joined) dataset. As a next step, we take our new model and evaluate it on the coach420 dataset. This can be done using the following command:

.\p2rank\prank.bat eval-predict .\datasets\coach420.ds -threads 4 -label our -model .\p2rank\test_output\traineval_chen11_joined\runs\seed.42\FastRandomForest.model

We add the -label our option, which prefixes the output directory name in .\p2rank\test_output, so we can easily distinguish this run from the first one. Once the command has finished, we can compare the results with the evaluation of the default model:

DCA(4.0)        n      n + 2
default model   71.6   77.1
our model       70.5   76.5

As we can see, the performance of our model is slightly inferior to that of the default model. This may be due to a poor choice of the seed value, the lack of any hyper-parameter optimization, or simply bad luck.
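One way to reduce the influence of a single unlucky seed is to average over several training runs. As a sketch, increasing the -loop value should run the given number of train/evaluate cycles, each with a different seed derived from -seed, and report averaged results (assuming -loop behaves as described above; expect the runtime to grow accordingly):

.\p2rank\prank.bat traineval -t .\datasets\chen11.ds -e .\datasets\joined.ds -threads 4 -rf_trees 200 -delete_models 0 -loop 5 -seed 42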