P2Rank conservation - cusbg/p2rank-framework GitHub Wiki

What you are going to learn here:

  • use a model with conservation
  • train and evaluate a new p2rank model with conservation

Disclaimer

This tutorial provides a general introduction to using evolutionary conservation as a feature in P2Rank. It was mostly written before the HMMER-based conservation calculation pipeline (which is currently utilized by PrankWeb) was developed. While most of the information provided here is still relevant, specific commands may differ when the HMMER-based pipeline is used (particularly commands related to training a model). In any case, users wishing to train and use conservation-aware models are advised to check out the updated tutorial for training models utilizing the HMMER-based pipeline which may present more up-to-date information.

Prerequisites

You should have p2rank 2.3 installed and have relevant datasets downloaded (described in the setup guide). All commands must be run from the directory created in the setup guide. You should also know how to train and evaluate a model without conservation.

Conservation files

p2rank can use conservation information from files produced by, for example, the sequence conservation pipeline.

An example of the structure of a conservation file follows (lines starting with # are treated as comments):

# /tmp/msa8880677536751254144.fasta -- js_divergence - window_size: 3 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number	score	column

0	-1000.00000	-KT-T-KTSTT----E-TTNKDT-D-K-NTT-EDTT-D-TSD--TS---------TTNS---TTTNNSKTTTT-TTK-DT---TTNTTT
1	0.32670	TILFL-ILILL----V-TLLLIVFILI-LII-LIIIIIITIIFVIV-MMMFFL-LLLIL-I-TLVLLIILITL-LFV-IL--ILLLVTI
2	0.43770	FFFLFFFFAFFF-V-L-FFVVVFLVVF-VFFFIVYFVVVFVVLLFV-FLYVVV-VFFVFYV-FFFYVAFFFVF-FFV-VFF-VFFVVFY
3	0.52723	VVVVVVVVMVVV-I-V-VVIVVVVVVVVIVVVVVVVVIIVVVVLVL-IIVVVV-VVVIVIV-VIVLIIVTVLV-VVV-VVV-VIVVLVV
4	0.58142	AAAAAAAAAAAAAA-A-AAAAAAAAAAAAAAAAAGAAAAAAAAAAA-GAAAAA-AAAAAAA-AAAAAAAAAAA-AAA-AAA-AAAAAAG

p2rank extracts only the position, the score, and the amino acid (AA) code from each line. The string of AA codes on the i-th line corresponds to the i-th column of the MSA from which the conservation is computed, and p2rank keeps the first letter of that string. In this example, p2rank uses the following values:

index score letter
1 0.32670 T
2 0.43770 F
3 0.52723 V
4 0.58142 A

As the values above show, the first row, with score -1000.00000, is not used. p2rank still loads it and replaces the negative score with zero, but then ignores the row because its AA code is a gap, represented by '-' in the file above.
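The extraction described above can be sketched in Python. This is only an illustration of the documented behaviour, not p2rank's own loader: the function names parse_conservation and load_conservation are hypothetical, and the sketch assumes whitespace-separated columns as in the example file.

```python
import gzip


def parse_conservation(lines):
    """Return (index, score, letter) tuples from conservation file lines.

    Assumed behaviour, taken from the description in this tutorial:
    comment and blank lines are skipped, negative scores become zero,
    and rows whose first letter is a gap ('-') are dropped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a data row
        index = int(fields[0])
        score = max(float(fields[1]), 0.0)  # negative values replaced with zero
        letter = fields[2][0]  # first letter of the MSA column
        if letter == "-":
            continue  # gap rows are ignored
        records.append((index, score, letter))
    return records


def load_conservation(path):
    """Read a plain or gzip-compressed conservation file (hypothetical helper)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return parse_conservation(handle)
```

Calling parse_conservation on the five data rows shown earlier yields four records, one for each of the rows with index 1 to 4.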

The conservation file carries no information about the chain, so a separate conservation file must be provided for each chain. The chain is encoded in the file name. For example, the file 1a0qH.pdb.seq.fasta.hom.gz corresponds to PDB record 1a0q and chain H.

Running the Default Model

p2rank comes with a pre-trained conservation-aware model. We can use the following command to evaluate this model on the coach420 dataset.

.\p2rank\prank.bat eval-predict .\datasets\coach420.ds -threads 4 -label default-conservation -c .\p2rank\config\conservation -conservation_dirs .\coach420\conservation\e5i1\scores

We use a custom label default-conservation (-label default-conservation) to recognize the result files. This is needed if multiple experiments are run so that the results of one experiment do not overwrite the results of the previous one. The label serves as the prefix of the results. The -c argument specifies the JSON configuration file for the conservation. Finally, we need to provide a path to the directory with the computed conservation scores (-conservation_dirs). The conservation path is relative to the dataset definition file.

You may also want to check ./p2rank/test_output/eval_predict_coach420_default-conservation/run.log for any conservation-related errors. The log file also includes information about loading the conservation:

[INFO] ConservationScore - Loading conservation scores from file [.\datasets\.\coach420\conservation\e5i1\scores\1afkAA.pdb.seq.fasta.hom.gz]
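Checking the log by hand can be tedious for larger datasets. The following is a minimal sketch of a filter for conservation-related problems. Only the [INFO] line format is shown in this tutorial; the [WARN] and [ERROR] tags, and the function name conservation_log_problems, are assumptions that may need adjusting to the actual log output.

```python
def conservation_log_problems(lines, levels=("WARN", "ERROR")):
    """Return log lines that mention conservation at one of the given levels.

    The level tags are assumed to appear in square brackets, by analogy
    with the [INFO] line shown above.
    """
    problems = []
    for line in lines:
        text = line.rstrip("\n")
        if "conservation" not in text.lower():
            continue
        if any(f"[{level}]" in text for level in levels):
            problems.append(text)
    return problems


# Usage (path taken from the tutorial; adjust to your setup):
# with open("p2rank/test_output/eval_predict_coach420_default-conservation/run.log") as log:
#     for problem in conservation_log_problems(log):
#         print(problem)
```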

We use results from the Editing Model Training and Evaluation tutorial to estimate the impact of conservation on the results. Keep in mind that training is not a completely deterministic process, so your results may vary slightly.

DCA (4.0)                     n      n + 2
p2rank default                71.6   77.1
our model                     70.5   76.5
p2rank default conservation   73.4   78.1

As the table shows, the performance is similar to that of the default model (73.4 vs. 71.6 for n and 78.1 vs. 77.1 for n + 2). However, visual inspection of the resulting pockets often reveals an improvement in their shape, as they tend to be more compact.

Training a New Model with Conservation

In this tutorial we give p2rank a single conservation directory (see Conservation File Directories for passing several). As training uses two datasets (chen11 for training and joined for evaluation), we need to merge their conservation files into that directory. This can be done in a few steps:

  • Create a new directory conservation
  • Copy content of chen11/conservation/e5i1/scores into the conservation directory
  • Copy content of joined/conservation/e5i1/scores into the conservation directory
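The steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name merge_conservation_dirs is hypothetical, and the paths in the usage comment are the ones from the list above, so they may need a prefix depending on where your datasets are stored.

```python
import shutil
from pathlib import Path


def merge_conservation_dirs(sources, target):
    """Copy all files from the source directories into one target directory.

    As with a plain copy, a file from a later source overwrites a file
    with the same name copied from an earlier one.
    Returns the number of copy operations performed.
    """
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)  # step 1: create the directory
    copied = 0
    for source in sources:  # steps 2 and 3: copy the score files
        for path in sorted(Path(source).iterdir()):
            if path.is_file():
                shutil.copy2(path, target / path.name)
                copied += 1
    return copied


# Usage (adjust paths to your setup):
# merge_conservation_dirs(
#     ["chen11/conservation/e5i1/scores", "joined/conservation/e5i1/scores"],
#     "conservation",
# )
```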

Now we are ready to train a new model. p2rank can be configured using a wide range of parameters and options. To make things easier we are going to re-use the default conservation configuration (.\p2rank\config\conservation).

A new model can be trained using the following command:

.\p2rank\prank.bat traineval -t .\datasets\chen11.ds -e .\datasets\joined.ds -threads 4 -rf_trees 200 -delete_models 0 -loop 1 -seed 42  -c .\p2rank\config\conservation -label conservation -conservation_dirs .\..\conservation

The conservation path is relative to the dataset definition file. On an i7-3632QM, training takes about 30 minutes to finish.

Next, we can evaluate our newly trained model on the coach420 dataset using the following command:

.\p2rank\prank.bat eval-predict .\datasets\coach420.ds -threads 4 -label conservation -model .\p2rank\test_output\traineval_chen11_joined_conservation\runs\seed.42\FastRandomForest.model -c .\p2rank\config\conservation -conservation_dirs .\coach420\conservation\e5i1\scores

Note that the command is almost the same as the one for running the default conservation model. The only difference is that we specify a custom model file.

For my run, I got the following results:

DCA (4.0)                     n      n + 2
p2rank default                71.6   77.1
our model                     70.5   76.5
p2rank default conservation   73.4   78.1
our conservation              72.8   76.7

Conservation File Directories

P2Rank loads conservation files from the directory given by the -conservation_dirs path/to/dir argument. You can specify multiple conservation directories using the -conservation_dirs "(path/to/dir1, path/to/dir2)" syntax. If no directory is set, p2rank looks for the conservation files in the directory where the structure files are located.
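When commands are assembled by a script, the value of -conservation_dirs can be built from a list of directories. The helper below is a hypothetical sketch that only reproduces the two forms quoted above; it does not add the shell quotes, which are needed only when the command is typed in a shell.

```python
def conservation_dirs_value(directories):
    """Format the value for -conservation_dirs from a list of directory paths.

    One directory is passed as-is; several directories are joined in the
    "(dir1, dir2)" form described above.
    """
    directories = list(directories)
    if not directories:
        raise ValueError("at least one conservation directory is required")
    if len(directories) == 1:
        return directories[0]
    return "(" + ", ".join(directories) + ")"
```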

Conservation File Names

For each chain in each protein, P2Rank looks for a conservation score file named {base_protein_file_name}(_){chain_code}.(***).hom(.gz) in all conservation directories. Examples of valid conservation score file names for the PDB file 1a0q.pdb and chain H:

  • 1a0q_H.hom
  • 1a0q_H.hom.gz
  • 1a0q_H.whatever.hom
  • 1a0q_H.whatever.hom.gz
  • 1a0qH.whatever.hom
  • 1a0qH.hom
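To check whether your files follow this naming scheme, the pattern can be approximated with a regular expression. This is a sketch based only on the description and examples above, not P2Rank's actual matching code; the function names conservation_name_pattern and find_conservation_file are hypothetical.

```python
import re
from pathlib import Path


def conservation_name_pattern(base_name, chain):
    """Build a regex for {base}(_){chain}.(***).hom(.gz) as described above."""
    return re.compile(
        rf"{re.escape(base_name)}_?{re.escape(chain)}(?:\..+)?\.hom(?:\.gz)?"
    )


def find_conservation_file(directories, base_name, chain):
    """Return the first matching file found in the given directories, or None."""
    pattern = conservation_name_pattern(base_name, chain)
    for directory in directories:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file() and pattern.fullmatch(path.name):
                return path
    return None
```

For example, conservation_name_pattern("1a0q", "H").fullmatch(name) succeeds for every file name in the list above and also for 1a0qH.pdb.seq.fasta.hom.gz, the name used earlier in this tutorial.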