Training a new model utilizing a HMMER-based conservation pipeline

This page presents an updated (as of September 29, 2021 / P2Rank version 2.3) step-by-step tutorial for training a P2Rank model utilizing the HMMER-based conservation pipeline. This pipeline became the default PrankWeb conservation calculation method in October 2021. The page also serves as documentation of how the conservation_hmm.model model used in PrankWeb with the new conservation calculation pipeline was trained.

Before going over the commands below, users are advised to read the model training and evaluation and conservation feature tutorials, as these provide a good introduction to the files and commands used in P2Rank model training. Nevertheless, some of the commands presented in those older tutorials may be out of date; users are therefore advised to use the commands presented on this page and refer to the older tutorials only for additional information if necessary.

Update (February 16, 2022): Users may also wish to study the script model-training-updated.sh with versions of the commands below updated for P2Rank version 2.4. The updated script also illustrates other topics, such as training of transformers and preparation of other types of P2Rank models.

Prerequisites

Create a new directory and download the files from the model-training/ directory to it; then, change to the directory. The conservation_hmm.groovy file is a simple P2Rank configuration we will use. The fix_filenames.py file is a helper script necessary to avoid some errors related to unorthodox filenames present in the datasets (see below). The model-training.sh file is a script containing all the commands executed in this tutorial. You will further need to have the HMMER software package installed; for details see the related documentation.
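Before starting, it is also worth making sure HMMER is actually available. Below is a minimal setup sketch; the directory name model-training-tutorial is only an example, and which HMMER binaries conservation_hmm.py invokes is described in its own documentation, so here we only check that the package is installed at all:

mkdir model-training-tutorial
cd model-training-tutorial/
# check that the HMMER binaries can be found on PATH
for tool in phmmer hmmsearch
do
	command -v "${tool}" > /dev/null || echo "${tool} not found; please install the HMMER package"
done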

Training a model

The commands below are intended to be executed in the displayed order (note the cd commands, etc.). It is assumed that the user is working on a UNIX-like operating system with standard system utilities installed.

We start with downloading and configuring P2Rank to utilize more memory (adjust the value based on your machine):

wget https://github.com/rdk/p2rank/releases/download/2.3/p2rank_2.3.tar.gz
tar xzf p2rank_2.3.tar.gz
rm p2rank_2.3.tar.gz
cd p2rank_2.3/
sed -i 's/2048m/8G/' prank
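To quickly confirm that the memory setting was actually changed, you can look for the new value in the launcher script (just an optional sanity check; the exact line in which the value appears may differ between P2Rank releases):

grep -n '8G' prank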

Next, we download the datasets:

git clone https://github.com/rdk/p2rank-datasets.git

We then set up the files needed for the conservation calculations:

mkdir conservation
cd conservation/
wget https://raw.githubusercontent.com/cusbg/prankweb/master/conservation/conservation_hmm/conservation_hmm.py
chmod +x conservation_hmm.py
mkdir databases fastas homs
cd databases/
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref50/uniref50.fasta.gz
gunzip uniref50.fasta.gz
wget https://ftp.uniprot.org/pub/databases/uniprot/relnotes.txt
cd ../

The conservation_hmm.py file is the "master" script performing the calculations. In this tutorial, we download two sequence databases (UniProtKB/Swiss-Prot and UniRef50) to perform the similarity searches against. We did this when training the model used in PrankWeb because we wanted to compare the performance of models trained against different databases; this is not normally necessary, and a single database suffices. Also note that the database files behind these URLs are updated over time; be sure to save the files used during training for further use.
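One simple way to keep track of the exact database versions used (a suggestion rather than part of the original pipeline) is to record the UniProt release notes and checksums of the downloaded FASTA files alongside them:

cd databases/
head relnotes.txt
# computing the checksums can take a while, especially for uniref50.fasta
md5sum uniprot_sprot.fasta uniref50.fasta > checksums.md5
cd ../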

We next use the new P2Rank functionality to generate FASTA files with the sequences of the chains in the datasets:

cd ../
for dataset in chen11 coach420 holo4k joined
do
	./prank analyze fasta-masked p2rank-datasets/${dataset}.ds -o conservation/fastas/${dataset} &> conservation/fastas/${dataset}.log
done

Using this functionality to extract the sequences is extremely important, as it ensures that we calculate the conservation scores for the sequences as they are "seen" by P2Rank. Also note that we are using the -masked version of the command, as HMMER fails when non-alphabetic symbols are encountered in the sequences.
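Before launching the (very long-running) conservation calculations, it may be worth verifying that FASTA files were generated for every dataset, for example by counting them (a simple sanity check, not part of the original pipeline; run it from the p2rank_2.3/ directory we are currently in):

for dataset in chen11 coach420 holo4k joined
do
	echo "${dataset}: $(ls conservation/fastas/${dataset}/*.fasta | wc -l) FASTA files"
done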

Next, we calculate the conservation scores for the FASTA files we have created:

cd conservation/
for dataset in chen11 coach420 holo4k joined
do
	for database in uniprot_sprot uniref50
	do
		mkdir -p homs/${dataset}/${database}
		for fasta_file in fastas/${dataset}/*.fasta
		do
			cp "${fasta_file}" no_spaces.fasta
			mkdir tmp_dir
			./conservation_hmm.py no_spaces.fasta databases/${database}.fasta tmp_dir/ homs/${dataset}/${database}/"$(basename "${fasta_file}")".hom --max_seqs 1000 &> homs/${dataset}/${database}/"$(basename "${fasta_file}")".log
			rm no_spaces.fasta
			rm -r tmp_dir/
		done
	done
done

This command can take a lot of time (potentially weeks), depending on the size of your datasets and databases. We use a trick with the no_spaces.fasta file here because conservation_hmm.py has an issue with processing files containing spaces in their names. Using --max_seqs 1000 means that only 1,000 sequences are selected for the construction of a weighted multiple sequence alignment file; this improves both the stability and the speed of the sequence weighting algorithm.
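Because the calculation runs for such a long time, it can be useful to monitor its progress by comparing the number of finished .hom files with the number of input FASTA files (a simple progress check, not part of the original pipeline; run it from the conservation/ directory):

for dataset in chen11 coach420 holo4k joined
do
	for database in uniprot_sprot uniref50
	do
		echo "${dataset}/${database}: $(ls homs/${dataset}/${database}/*.hom 2> /dev/null | wc -l) of $(ls fastas/${dataset}/*.fasta | wc -l) files done"
	done
done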

The HMMER programs will calculate conservation scores even for positions containing many "gap" characters (typical for sequence termini or internal deletions). The conservation scores at these positions will frequently be very high, despite the positions not actually being conserved. We may thus want to "mask" these positions somehow. This is what we do next:

wget https://raw.githubusercontent.com/cusbg/prankweb/master/conservation/conservation_hmm/examples/mask_ic_file.py
chmod +x mask_ic_file.py
cd homs/
for dataset in chen11 coach420 holo4k joined
do
	cd ${dataset}/
	for database in uniprot_sprot uniref50
	do
		cd ${database}/
		for max_freqgap in 30 50 70 90
		do
			mkdir -p masked_${max_freqgap}
			for hom_file in *.hom
			do
				../../../mask_ic_file.py "${hom_file}" "${hom_file}".freqgap masked_${max_freqgap}/"$(basename "${hom_file}" .hom)".masked_${max_freqgap}.hom 0.${max_freqgap} -1000.0
			done
		done
		cd ../
	done
	cd ../
done

The mask_ic_file.py script takes (in order) the file containing the original conservation scores, the file containing the per-position "gap" character frequencies, the name of the "masked" output file, a frequency threshold in the range [0.0, 1.0], and the "masking" value (-1000.0 here). The produced "masked" file will contain the original conservation scores at the positions where the respective "gap" character frequencies are at most the specified frequency threshold; other positions will contain the "masking" value. As we did not know the optimal frequency threshold when training our model, four values (0.3, 0.5, 0.7, 0.9) were tested.
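After the masking finishes, a quick count can confirm that every .hom file has a masked counterpart for each threshold (again only a sanity check, not part of the original pipeline; run it from the homs/ directory we are currently in):

for dataset in chen11 coach420 holo4k joined
do
	for database in uniprot_sprot uniref50
	do
		for max_freqgap in 30 50 70 90
		do
			echo "${dataset}/${database}/masked_${max_freqgap}: $(ls ${dataset}/${database}/masked_${max_freqgap}/*.hom | wc -l) of $(ls ${dataset}/${database}/*.hom | wc -l) files masked"
		done
	done
done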

P2Rank has a quirk: when the chain identifier of a particular chain is blank (a single space) in the PDB file, P2Rank looks for a conservation scores file with the chain identifier A. We thus have to "fix" the filenames:

for dataset in joined
do
	cd ${dataset}/
	for database in uniprot_sprot uniref50
	do
		cd ${database}/
		for max_freqgap in 30 50 70 90
		do
			cd masked_${max_freqgap}/
			cp ../../../../../../fix_filenames.py .
			./fix_filenames.py
			rm fix_filenames.py
			cd ../
		done
		cd ../
	done
	cd ../
done

Note that this should not be necessary when working with datasets containing PDB files downloaded directly from the PDB.
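If you are unsure whether your own dataset is affected, you can look for ATOM records with a blank chain identifier (column 22 of the PDB format). A minimal sketch, assuming the PDB files of the joined dataset are stored with a .pdb extension somewhere under the p2rank-datasets/joined/ directory (adjust the path and extension to your dataset; run from the homs/ directory we are currently in):

find ../../p2rank-datasets/joined/ -name '*.pdb' | while read -r pdb_file
do
	# column 22 of an ATOM record holds the chain identifier
	if grep -q '^ATOM.\{17\} ' "${pdb_file}"
	then
		echo "${pdb_file} contains ATOM records with a blank chain identifier"
	fi
done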

We can now finally get to training the models:

cd ../../
cp ../conservation_hmm.groovy config/
for database in uniprot_sprot uniref50
do
	for max_freqgap in 30 50 70 90
	do
		./prank traineval -t p2rank-datasets/chen11.ds -e p2rank-datasets/joined.ds -c conservation_hmm.groovy -conservation_dirs "(../conservation/homs/chen11/${database}/masked_${max_freqgap}, ../conservation/homs/joined/${database}/masked_${max_freqgap})" -delete_models false -loop 10 -o new_models/${database}/masked_${max_freqgap} -threads 4
#		./prank traineval -t p2rank-datasets/chen11-fpocket.ds -e p2rank-datasets/joined.ds -c conservation_hmm.groovy -conservation_dirs "(../conservation/homs/chen11/${database}/masked_${max_freqgap}, ../conservation/homs/joined/${database}/masked_${max_freqgap})" -delete_models false -loop 10 -o new_models/${database}/masked_${max_freqgap} -threads 4
	done
done

In our case, we trained 10 models for each combination of a sequence database and a frequency threshold. We use chen11.ds as the training dataset and joined.ds as the validation dataset. Please refer to the earlier tutorials (linked at the beginning of this page) for a detailed description of the prank traineval command and its options.
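With -loop 10, each traineval run stores its ten individual models under runs/seed.42 through runs/seed.51 in its output directory, which is why the evaluation loop below iterates over seeds 42 to 51. Before evaluating, you can verify that all the expected model files exist:

find new_models/ -name 'FastRandomForest.model' | wc -l	# 2 databases x 4 thresholds x 10 seeds = 80 models expected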

We can now evaluate the performance of the new models on the coach420.ds and holo4k.ds datasets:

for dataset in coach420 holo4k
do
	for database in uniprot_sprot uniref50
	do
		for max_freqgap in 30 50 70 90
		do
			for seed in $(seq 42 1 51)
			do
				./prank eval-predict p2rank-datasets/${dataset}.ds -c conservation_hmm.groovy -conservation_dirs ../conservation/homs/${dataset}/${database}/masked_${max_freqgap} -m new_models/${database}/masked_${max_freqgap}/runs/seed.${seed}/FastRandomForest.model -o new_models_evaluation/${dataset}/${database}/masked_${max_freqgap}/runs/seed.${seed} -threads 4
			done
		done
	done
done
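Once the evaluations have finished, the reported metrics can be collected from the individual output directories. As a rough illustration, the following sketch lists one output directory and then greps the DCA_4_0 values from the per-run logs; note that the run.log file name and the log format are assumptions here and may differ between P2Rank versions, so inspect the output directory first and adjust the paths and pattern accordingly:

ls new_models_evaluation/coach420/uniref50/masked_50/runs/seed.42/
for seed in $(seq 42 1 51)
do
	# run.log is an assumed file name; check what the evaluation actually produced first
	grep 'DCA_4_0' new_models_evaluation/coach420/uniref50/masked_50/runs/seed.${seed}/run.log
done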

One should now go over the reported metrics and select the "best" model. In our case, we decided that the coverage of sequence space offered by the UniProtKB/Swiss-Prot database was insufficient, so we focused mainly on the models trained against the UniRef50 database (these models also yielded better metrics anyway). Among these, the "masking" frequency threshold of 0.5 yielded the best results on average. We then inspected the performance of the 10 models trained with this combination of parameters and selected one to serve as the model to be used in PrankWeb:

cp new_models/uniref50/masked_50/runs/seed.45/FastRandomForest.model new_models/uniref50/masked_50/runs/seed.45/conservation_hmm.model

This model yielded the highest values of the DCA_4_0 and DCA_4_2 metrics on the validation dataset. Please note that one normally selects the "best" model based on the results obtained on the validation dataset and uses the testing datasets only to score the selected model in an unbiased way. In our case, we did not perform any model (hyperparameter) optimization that would have improved the performance of the models on the joined.ds dataset specifically, so we could have used the coach420.ds or holo4k.ds datasets to score the models directly. Importantly, in any case, the combination of the UniRef50 database and the "masking" frequency threshold of 0.5 yielded the best results on average for all three datasets used for validation or evaluation.

New model performance evaluation

The table below compares the performance of the default (no conservation) and the old conservation-aware models with the model we have just trained and selected (conservation_hmm):

| Metric | default | old conservation | conservation_hmm |
|---------|---------|------------------|-------------------|
| DCA_4_0 | 0.702 | 0.737 | 0.747 |
| DCA_4_2 | 0.756 | 0.784 | 0.787 |

These results were obtained on the holo4k.ds evaluation dataset; the conservation_hmm models also yield better values of other important metrics (such as AUPRC, LIGAND_COVERAGE, or MCC) on the validation dataset.
