Large scale predictions

This page provides tips for calculating P2Rank predictions for very large datasets, namely the entire Protein Data Bank (PDB) and the AlphaFold database (DB). Readers are expected to be familiar with the small-scale predictions covered in the previous tutorials; a UNIX-like operating system is assumed.

The files described below can also be found here; note that both the files and the instructions below may need updating for your particular use case or as the underlying tools evolve.

PDB

We start in an empty directory. The ../p2rank_2.4-beta.3.tar.gz file is the P2Rank release we are going to use for the predictions, and ../../model-training/p2rank_2.4-beta.3/config/ and ../../model-training/p2rank_2.4-beta.3/models/ are directories containing the appropriate P2Rank configurations and models, respectively (for example, ones prepared earlier for this purpose that are not yet present in the latest P2Rank release). We set up P2Rank:

cp ../p2rank_2.4-beta.3.tar.gz .
tar xzf p2rank_2.4-beta.3.tar.gz
rm p2rank_2.4-beta.3.tar.gz
cd p2rank_2.4-beta.3/
sed -i 's/2048m/16G/' prank
rm -r config/ models/
cp --recursive ../../model-training/p2rank_2.4-beta.3/config/ .
cp --recursive ../../model-training/p2rank_2.4-beta.3/models/ .
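The sed command above replaces the 2048m memory value in the prank launcher script with 16G so that the JVM has enough heap for runs of this size; you can confirm the substitution took effect with a simple grep:

grep -n 16G prank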

Next, we download the PDB archive (see here for additional information):

mkdir datasets
cd datasets/
rsync -rlpt -v -z --delete \
rsync.ebi.ac.uk::pub/databases/pdb/data/structures/divided/pdb/ \
./pdb > download-pdb.log
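Before proceeding, it is worth counting the mirrored structure files as a basic sanity check; for a fresh download, the two counts below should agree (the absolute number will change as the PDB grows):

find pdb/ -name '*.ent.gz' | wc -l
grep -c '\.ent\.gz' download-pdb.log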

We then create a dataset file (pdb.ds) for further use and utilize P2Rank to extract protein sequences in FASTA format:

# Build the dataset file from the structure files listed in the rsync log
for f in $(grep '\.ent\.gz' download-pdb.log)
do
        echo pdb/${f} >> pdb.ds
done
cd ../
./prank analyze fasta-masked datasets/pdb.ds -o datasets/fastas/ &> datasets/fastas.log
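The scripts below assume that P2Rank wrote one FASTA file per protein chain directly into datasets/fastas/; a quick count of the extracted sequences serves as a rough sanity check:

find datasets/fastas/ -name '*.fasta' | wc -l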

We use the following Python script (in an executable file named generate_chunks.py in our starting directory) to find unique sequences and split them into chunks of 100 for parallel processing:

#!/usr/bin/env python3

from glob import glob
from os import mkdir

CHUNK_SIZE = 100

input_fasta_files = glob('fastas/*.fasta')

# Map each unique sequence to the list of chain identifiers that share it
sequence_to_ids = dict()

for input_fasta_file in input_fasta_files:
    with open(input_fasta_file) as f:
        next(f)    # skip the FASTA header line
        sequence = next(f).strip()
        # The chain identifier is the FASTA file name without the 'fastas/' prefix and the '.fasta' suffix
        chain_id = input_fasta_file.split('/')[1][:-6]
        sequence_to_ids.setdefault(sequence, []).append(chain_id)

sequences = sorted(sequence_to_ids)
sequence_chunks = (sequences[i:i + CHUNK_SIZE] for i in range(0, len(sequences), CHUNK_SIZE))

mkdir('chunks/')
for chunk_number, chunk_sequences in enumerate(sequence_chunks):
    mkdir('chunks/chunk_{}/'.format(chunk_number))
    for sequence_number, sequence in enumerate(chunk_sequences):
        with open('chunks/chunk_{}/sequence_{}.fasta'.format(chunk_number, sequence_number), mode='w') as f:
            print('chunks/chunk_{}/sequence_{}.fasta'.format(chunk_number, sequence_number), ';'.join(sequence_to_ids[sequence]), sep='\t')
            f.write('>' + ';'.join(sequence_to_ids[sequence]) + '\n' + sequence + '\n')

We then run the script from within the datasets/ directory:

cd datasets/
../../generate_chunks.py > chunks.log
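Each line of chunks.log maps one chunk FASTA file to the identifiers of all chains sharing that sequence, tab-separated, with the identifiers joined by semicolons; schematically (the identifiers below are placeholders, the actual names come from the FASTA files produced by P2Rank):

chunks/chunk_0/sequence_0.fasta	<chain ID 1>;<chain ID 2>

This file is read again later by unpack_homs.py, so keep it in place.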

We now turn to computing the conservation scores for the individual chunk sequences. The specific commands depend entirely on the computational infrastructure you are using; I was using MetaCentrum, so the instructions below are specific to it. First, I created an executable file run_chunks.sh in the datasets/ directory with the following content:

#!/bin/bash

cd chunks/

for chunk_number in $(seq 0 1 2451)
do
	cd chunk_${chunk_number}/
	if [ ! -f conservation/sequence_99.fasta.conservation ]
	then
		cp ../../run_chunk.sh .
		qsub -q @elixir-pbs.elixir-czech.cz run_chunk.sh
	fi
	cd ../
done

cd ../
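The upper bound 2451 is specific to this particular PDB snapshot (2,452 chunks in total); for your own run, take the number of chunk directories actually created and subtract one:

ls -d chunks/chunk_* | wc -l

Also note that the completeness test above only looks for the conservation file of sequence_99, so the last chunk (which usually holds fewer than 100 sequences) is always resubmitted; this is harmless, as run_chunk.sh skips sequences whose conservation files already exist.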

This script copies the job script run_chunk.sh into each chunk directory and submits it as a job; run_chunk.sh has the following content:

#!/bin/bash
#PBS -l select=1:ncpus=3:mem=8gb:scratch_local=30gb
#PBS -l walltime=24:00:00

export HMMER_DIR="/storage/praha1/home/davidjakubec/Apps/hmmer-3.3.2/bin/"

COMPUTE_CONSERVATION_DIR="/storage/praha1/home/davidjakubec/Projects/P2Rank/PDBe-predictions/compute-conservation"

cp ${COMPUTE_CONSERVATION_DIR}/uniref50.fasta ${SCRATCHDIR}/uniref50.fasta

cd ${PBS_O_WORKDIR}

mkdir -p conservation

for sequence_number in $(seq 0 1 99)
do
	if [ ! -f conservation/sequence_${sequence_number}.fasta.conservation ]
	then
		mkdir -p ${SCRATCHDIR}/tmp
		${COMPUTE_CONSERVATION_DIR}/compute-conservation.py sequence_${sequence_number}.fasta ${SCRATCHDIR}/uniref50.fasta ${SCRATCHDIR}/tmp/ conservation/sequence_${sequence_number}.fasta.conservation
		rm -r ${SCRATCHDIR}/tmp/
	fi
done

rm ${SCRATCHDIR}/uniref50.fasta

On MetaCentrum, I was able to run up to 600 jobs simultaneously and calculate the conservation scores for a dataset of about 350,000 sequences in about 5 days. However, I highly recommend exercising extreme caution (i.e., paranoia) when utilizing large-scale computational infrastructures for mass parallel processing of your data. It is almost certain that some sequences will not be processed on the first attempt, due to, e.g., machines crashing, disks being unavailable, limits being reached, or other unforeseeable reasons. You will always have to identify the cases which failed to finish and amend them somehow. Always perform sanity checks (e.g., do all conservation score files have the same length as the corresponding sequence?) and completeness checks before retrieving and further processing your data!
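A minimal sketch of such a per-sequence check, run from the datasets/ directory and assuming each .conservation file holds one score per line (i.e., one line per residue):

# List chunk sequences whose conservation file is missing or has an unexpected length
for fasta in chunks/chunk_*/sequence_*.fasta
do
	cons=$(dirname ${fasta})/conservation/$(basename ${fasta}).conservation
	seq_len=$(sed -n 2p ${fasta} | tr -d '\n' | wc -c)
	if [ ! -f "${cons}" ] || [ "$(wc -l < ${cons})" -ne "${seq_len}" ]
	then
		echo "check ${cons}"
	fi
done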

At this point, all chunk sequences are expected to have their conservation scores calculated in the corresponding chunks/ subdirectories. We now need to unpack these so that P2Rank can match them with the appropriate structure files. We use the following Python script (in an executable file named unpack_homs.py in our starting directory) to do so:

#!/usr/bin/env python3

from glob import glob
from os import mkdir
from shutil import copy

conservation_files = glob('chunks/chunk_*/conservation/*.conservation')

fasta_file_to_pdb_chain_ids = dict()

with open('chunks.log') as f:
    for line in f:
        fasta_file, pdb_chain_ids = line.strip().split()
        fasta_file_to_pdb_chain_ids[fasta_file] = pdb_chain_ids.split(';')

mkdir('homs')

for conservation_file in conservation_files:
    # Strip the '/conservation' directory component and the '.conservation' suffix to recover the chunk FASTA file name used in chunks.log
    for pdb_chain_id in fasta_file_to_pdb_chain_ids[conservation_file.replace('/conservation', '').replace('.conservation', '')]:
        copy(conservation_file, 'homs/{}.hom'.format(pdb_chain_id))

We now call the script from the datasets/ directory:

../../unpack_homs.py
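If every sequence was processed successfully, the number of .hom files should equal the number of FASTA files extracted earlier (one per chain); from the datasets/ directory:

find homs/ -name '*.hom' | wc -l
find fastas/ -name '*.fasta' | wc -l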

We are now finally ready to run the predictions:

mkdir predictions
cd ../
for configuration in conservation_hmm default
do
        ./prank eval-predict datasets/pdb.ds -c ${configuration}.groovy -conservation_dirs homs/ -o datasets/predictions/${configuration}/ -threads 16 &> datasets/predictions/${configuration}.log
done

These prediction runs are relatively fast and should each finish within 24 hours.
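Before post-processing the results, it is worth checking that every dataset entry produced an output. Assuming P2Rank writes one *_predictions.csv file per input structure (as in the small-scale tutorials), the counts below should match the number of lines in pdb.ds for each configuration:

wc -l < datasets/pdb.ds
find datasets/predictions/default/ -name '*_predictions.csv' | wc -l
find datasets/predictions/conservation_hmm/ -name '*_predictions.csv' | wc -l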

AlphaFold DB

AlphaFold DB downloads are split into multiple sections with different file structures; these instructions are thus split into the corresponding sections as well. Overall, the instructions are very similar or analogous to those for processing the PDB archive, so only the major differences are discussed. Please note that as AlphaFold DB is relatively new, its structure is prone to change; in particular, the filenames or URLs used below may no longer work, and updated commands may be needed.

Model organism proteomes (previously proteome-wide predictions)

The complete script for performing P2Rank predictions for this dataset looks like this:

#!/bin/bash

# Set up P2Rank
cp ../p2rank_2.4-beta.3.tar.gz .
tar xzf p2rank_2.4-beta.3.tar.gz
rm p2rank_2.4-beta.3.tar.gz
cd p2rank_2.4-beta.3/
sed -i 's/2048m/16G/' prank
rm -r config/ models/
cp --recursive ../../model-training/p2rank_2.4-beta.3/config/ .
cp --recursive ../../model-training/p2rank_2.4-beta.3/models/ .

# Download the predicted structures
mkdir datasets
cd datasets/
for proteome_ID in $(cat ../../proteome_IDs)
do
	mkdir ${proteome_ID}
	cd ${proteome_ID}/
	wget -q http://ftp.ebi.ac.uk/pub/databases/alphafold/${proteome_ID}.tar
	tar xf ${proteome_ID}.tar
	rm ${proteome_ID}.tar
	cd ../
	for f in ${proteome_ID}/*.pdb.gz
	do
		echo ${f} >> alphafold.ds
	done
done

# Extract sequences in FASTA format
cd ../
./prank analyze fasta-masked datasets/alphafold.ds -o datasets/fastas/ &> datasets/fastas.log

# Find unique sequences and split them into chunks for parallel processing
cd datasets/
../../generate_chunks.py > chunks.log

# Compute conservation scores for individual chunk sequences
# ...

# Unpack the conservation score files
../../../PDBe-predictions/unpack_homs.py

# Run the predictions
mkdir predictions
cd ../
for configuration in alphafold_conservation_hmm alphafold
do
	./prank eval-predict datasets/alphafold.ds -c ${configuration}.groovy -conservation_dirs homs/ -o datasets/predictions/${configuration}/ -threads 16 &> datasets/predictions/${configuration}.log
done

The major difference from the PDB predictions is how we download the dataset. For this purpose, we need an additional regular file named proteome_IDs in our starting directory with the following content:

UP000000437_7955_DANRE
UP000000559_237561_CANAL
UP000000589_10090_MOUSE
UP000000625_83333_ECOLI
UP000000803_7227_DROME
UP000000805_243232_METJA
UP000001450_36329_PLAF7
UP000001584_83332_MYCTU
UP000001940_6239_CAEEL
UP000002195_44689_DICDI
UP000002296_353153_TRYCC
UP000002311_559292_YEAST
UP000002485_284812_SCHPO
UP000002494_10116_RAT
UP000005640_9606_HUMAN
UP000006548_3702_ARATH
UP000007305_4577_MAIZE
UP000008153_5671_LEIIN
UP000008816_93061_STAA8
UP000008827_3847_SOYBN
UP000059680_39947_ORYSJ
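After the download step, a quick per-proteome check that each archive was unpacked can be run from the p2rank_2.4-beta.3/ directory; it simply counts the structure files in each proteome directory:

for proteome_ID in $(cat ../proteome_IDs)
do
	echo ${proteome_ID} $(find datasets/${proteome_ID}/ -name '*.pdb.gz' | wc -l)
done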

Please note that the large-scale parallel conservation score calculations are not described here in detail; they were performed as described above for the PDB (after modifying the appropriate files as necessary).

Swiss-Prot

The Swiss-Prot predictions are again very similar to those above; the major difference is in how we download the dataset:

#!/bin/bash

# Set up P2Rank
cp ../p2rank_2.4-beta.3.tar.gz .
tar xzf p2rank_2.4-beta.3.tar.gz
rm p2rank_2.4-beta.3.tar.gz
cd p2rank_2.4-beta.3/
sed -i 's/2048m/16G/' prank
rm -r config/ models/
cp --recursive ../../model-training/p2rank_2.4-beta.3/config/ .
cp --recursive ../../model-training/p2rank_2.4-beta.3/models/ .

# Download the predicted structures
mkdir datasets
cd datasets/
mkdir swissprot
cd swissprot/
wget -q https://ftp.ebi.ac.uk/pub/databases/alphafold/latest/swissprot_pdb_v2.tar
tar xf swissprot_pdb_v2.tar
rm swissprot_pdb_v2.tar
cd ../
for f in swissprot/*.pdb.gz
do
	echo ${f} >> swissprot.ds
done

# Extract sequences in FASTA format
cd ../
./prank analyze fasta-masked datasets/swissprot.ds -o datasets/fastas/ &> datasets/fastas.log

# Find unique sequences and split them into chunks for parallel processing
cd datasets/
../../generate_chunks.py > chunks.log

# Compute conservation scores for individual chunk sequences
# ...

# Unpack the conservation score files
../../../PDBe-predictions/unpack_homs.py

# Run the predictions
mkdir predictions
cd ../
for configuration in alphafold_conservation_hmm alphafold
do
        ./prank eval-predict datasets/swissprot.ds -c ${configuration}.groovy -conservation_dirs homs/ -o datasets/predictions/${configuration}/ -threads 16 &> datasets/predictions/${configuration}.log
done